Data Quality Metrics
NDR2017, June 6th – 8th, Stavanger, Norway
Report of the NDR Data Quality Metrics Workgroup
2014-2017
Philip Lesslar
Helen Stephenson
Ugur Algan
Jill Lewis
- Development of the Data Quality Guidelines for implementation
- Documentation of key data quality issues
- Basic Data Quality Primers prepared
2012-2014
2014-2017
2017-2019/20
2020++
Breakout to list and
categorize data
quality problems
- Set up of Working Group
3 breakouts to cover
i) business rules
ii) tools and
dashboards
iii) data correction
process
- Define, design & carry
out implementation in
NOCs
- Data quality
transparency drive
- Define architectural and
quality requirements for
big data analytics
Ad hoc sessions
i) business rules
development,
ii) Implementation
planning
Inventorize &
Understand
Define Way
Forward
Implementation
Planning
Trusted Data
& Robust
Analytics
Data Quality Metrics – The Journey
October 2012
October 2014
June 2017
Cognitive
Computing
Trusted
Data
Asset
Fit-for-
purpose
quality
levels
Transparent
& ongoing
monitoring
Shorter decision-making lead time
Project-ready data
Enhanced
reputation
to external
parties
Investment
attractive
Quality
metrics of
data
submitted
Data
management
workstreams
The EP value chain
External investments
Data asset
integrity
Data Quality in the context of NDRs
Business rules
library
DQ Tool &
target DBs
NDR Dashboard
Traffic lights
Data correction
“Going Green”
Objectives:
• Understand what business
rules are
• Review proposed structure
• Review existing rules list
• Define additional rules (e.g. 20 more)
• Agree on SIG virtual structure
and roadmap to next NDR
Deliverables:
• 20 additional business rules
• Agreed SIG working structure
• Roadmap to next NDR
Facilitator
• Helen Stephenson
Objectives:
• Understand the working model of
a DQ tool
• Understand how business rules are implemented and used
• Discuss the presentation of results
in a DQ dashboard
• Develop and agree on a dashboard
for NDR purposes
Deliverables:
• DQ Dashboard mockup for NDR
Facilitators
• Jill Lewis / Philip Lesslar
Objectives:
• Discuss the data correction
workflow
• Review the requirements for
business rules in order to
facilitate data correction
• Develop and agree the workflow
needed to ensure data
corrections (include roles &
responsibilities)
Deliverables:
• Data correction workflow
• Requirements for business rules
Facilitator
• Ugur Algan
1 2 3
HOVSAN AMBURAN SHARQ ZALI 1
NDR2014 Breakouts
NDR11 Data Quality Workgroup Team Members
Name
1 Helen Stephenson
2 Andrew Ochan
3 Marco Cota
4 Fanny Herawati
5 Melissa Amstelveen
6 Sarah Spinoccia
7 Jess Kozman
8 Ugur Algan
9 Richard Wylde
10 Cyril Dzreke
11 Lim TeckHuat
12 Hairel Dean
13 Deano Maling
14 Choo Chuan Heng
15 Iman Al-Farsi
16 Jill Lewis
17 Henri Blondelle
18 Armando Gomez
19 Kapil Joneja
20 Giuseppe Vitobello
21 Gareth Wright
22 Chan Kok Wah
23 Ali Alyahyaee (Scribe)
24 Philip Lesslar (Facilitator)
24
NDR2014 Data Quality Workgroup Team Members
3 DATA CORRECTION
WORKFLOW
1 Mehman Yusufov
2 Vahid Jafarov
3 Aleksa Shchorlich
4 Ngwako Maguai
5 Joseph Justin Soosai
6 Samit Sencurta
7 Julian Pickering
8 Ugur Algan
2 TOOLS &
DASHBOARDS
1 Kapil Joneja
2 Jack Walten
3 Johanda du Toit
4 Gustavo Tinoco
5 Ferdinand Aniwa
6 David Atta-Peters
7 Angus Craig
8 Natalia Rakhmanina
9 Glab Khanuntin
10 Edem Mawuko
11 Alexander Kosolapov
12 Daniel Arthur
13 Eric Toogood
14 Tatiana Vassilieva
15 Henri Blondelle
16 Marianne Hansen
17 Jill Lewis
18 Mikhail Leypunsky
19 Aygun Mamedova
20 Rena Huseyn-zade
21 Irada Huseynova
22 Philip Lesslar
1 BUSINESS RULES
1 Helen Stephenson
2 Richard Salway
3 Abraham Oseng
4 Malcolm Flowers
5 Uffe Larsen
6 Calisto Nhatugues
7 Sylvester Nguessan
8 Gianluca Monachese
9 Jan Adolfssen
10 Lee Allison
40
• The WG will consolidate all the discussions and will upload it to the
collaboration site (Minutes of meetings)
• WG will work through Energistics to explore an improved collaboration space
• We all need to increase global participation and collaboration
• Business Rules:
  • Ongoing buildup of the library
  • Separate into logical categories (data types, activity type etc)
  • Develop documentation and “soft version”
• Compilation of a DQ Metrics Starter “Kit”
  • Include work done to date
  • Guidelines on how to get started
  • Example implementation & experiences
  • What are the essentials and pitfalls
  • Tools assessment and fitting to your needs etc
NDR2014 Closing – What we planned to do
Did not manage to do
What we did: 2014-2017
Part 1: Background and Case for Change
Context and justification to Management
Part 2: Business Rules Fundamentals
Data quality dimensions, key concepts around business rules, 18 data types, 241 rules
Part 3: Implementation
Metrics, dashboards, implementing rules as queries, understanding results, getting the program going
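As an illustration of "implementing rules as queries", here is a minimal sketch in Python using the standard-library sqlite3 module. The table name, column names, sample wells and both rules are hypothetical and are not taken from the Working Group's 241-rule library. Each rule is written as a query that returns the records failing it, and the pass percentage becomes the metric.

```python
import sqlite3

# Hypothetical well header table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE well_header ("
    "well_id TEXT, surface_lat REAL, surface_lon REAL, total_depth_m REAL)"
)
conn.executemany(
    "INSERT INTO well_header VALUES (?, ?, ?, ?)",
    [
        ("W-001", 4.52, 114.01, 2350.0),
        ("W-002", None, 114.20, 1980.0),  # missing latitude
        ("W-003", 4.61, 113.95, -50.0),   # negative total depth
        ("W-004", 4.70, 114.10, 3120.0),
    ],
)

# Assumed example rules: each query selects the records that FAIL the rule.
RULES = {
    "Well must have a surface location":
        "SELECT well_id FROM well_header "
        "WHERE surface_lat IS NULL OR surface_lon IS NULL",
    "Total depth must be greater than zero":
        "SELECT well_id FROM well_header "
        "WHERE total_depth_m IS NULL OR total_depth_m <= 0",
}


def run_rules(conn, rules):
    """Run every rule and return its failing records and pass percentage."""
    total = conn.execute("SELECT COUNT(*) FROM well_header").fetchone()[0]
    results = {}
    for name, sql in rules.items():
        failures = [row[0] for row in conn.execute(sql)]
        pass_pct = 100.0 * (total - len(failures)) / total if total else 0.0
        results[name] = {"failures": failures, "pass_pct": pass_pct}
    return results


results = run_rules(conn, RULES)
for name, r in results.items():
    print(f"{name}: {r['pass_pct']:.0f}% pass, failing: {r['failures']}")
```

Returning the failing records rather than only a count keeps the list of wells to correct alongside the metric, which is what a data correction workflow would need.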
NDR2017 Activities for Data Quality Metrics
• No data quality metrics breakouts planned in view of full
program
• Completed documents available for review
• The Working Group (WG) is available to discuss
implementation of data quality metrics
Why implement data quality metrics?
Quality, fit-for-purpose data
Streamlines the
business and its
workflows
Increases data
asset value and
investor
confidence
Builds essential
data condition for
effective use of
new technologies
Perspective
Business
NDR
Data Management
Enabler for improving
data efficiency by up
to 90%
Data Science &
Analytics
• Without metrics, we cannot
measure the quality of the
data we have
• Consequently, we cannot show how much quality, fit-for-purpose data there is…
Investment Trends
Across all businesses, there will be a
greater than 300% increase in
investment in artificial intelligence in
2017 compared with 2016.
Open
Primary Data / Secondary Data

Original Format Data
- Raw Seismic
- Raw Logs
Requires:
- Official data repository

Reference Data/Metadata
- Units of measure: Linear measures, Pressure
- Abbreviations: TD, DFE, KB etc
- Valid Lists
- Range indicators
- Comments
Requires:
- Standards
- Implementation across all impacted tools and databases

Master Data/Corporate “Single Source of Truth”
- Static (hard) data: Well header, Deviation, Checkshot, Temperature, Pressure
- Interpreted (soft) data: Geological markers, Seismic horizons
Requires:
- Clear processes, workflows and checkpoints
- Proper & official repository
- Management and security processes around repository and data access

Derived Data
- Processed data: Seismic deconvolution, Seismic filtering, Seismic processing, Edited logs, Spliced logs
- Interpreted data: Geological markers, Seismic horizons
Requires:
- Standard workflows
- Standard algorithms
- Standard processes
- Housekeeping procedures

Data Collections
- Composite data: Completion log, Mud log, Paleontological composites, TRAPIS
- Data hoards: Projects en masse, Personal stores, Team folders
- Data archive: Projects en masse
Requires:
- Standard display and formatting templates
- Procedures
Data Classification – Digital Data (>100 types in Upstream)
Exploration workflow
Facilities workflow
Production Geology workflow
Analytics workflow
Data Type Building Blocks along the Exploration & Production (EP) Value Chain
PRIMARY DATA SECONDARY DATA
Quality Data Envelope
Packaging Quality Data – The Building Blocks
Quality Data Envelope
Google Analytics
Trend Plot (IQM)
Heat Map
Bubble Plots
Yahoo Web Analytics
Data Science / Analytics – Typical Deliverables
Data Extraction
Data Viz & Analytics
Corporate
Databank
Stack
Right
Decision
Timely
Intervention
Requires quality
data to be
available
-Mapping
-Extraction
-Cleansing
-Standardising
-Quality control
Cleansed / QC'd data does not flow back to the corporate databanks
Note: The Data Mart starts to have better quality data than the official
corporate databank
Data Analytics Conceptual Architecture
Corporate
Databank
The analytics may not
indicate quality levels
Eg. Annulus Pressure
These errors will only be recognised if you are tracking the
quality levels in the source databank
Data Quality Error Persistence
Business Rule:
Well must have
annulus pressure
defined
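The error persistence described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the well names, pressure values and corrections are invented, and the data mart is modelled as a cleansed copy of the source records. The same business rule ("well must have annulus pressure defined") is measured on both sides.

```python
def pass_rate(records, field):
    """Percentage of records in which `field` is defined (not None)."""
    if not records:
        return 0.0
    defined = sum(1 for r in records if r.get(field) is not None)
    return 100.0 * defined / len(records)


# Hypothetical records in the official corporate databank.
source_databank = [
    {"well": "A-1", "annulus_pressure": 12.5},
    {"well": "A-2", "annulus_pressure": None},
    {"well": "A-3", "annulus_pressure": None},
    {"well": "A-4", "annulus_pressure": 9.8},
]

# Hypothetical corrections applied while cleansing data for the data mart.
corrections = {"A-2": 11.1, "A-3": 10.4}

# The data mart holds a cleansed COPY; nothing is written back to the source.
data_mart = [
    {**r, "annulus_pressure": corrections.get(r["well"], r["annulus_pressure"])}
    for r in source_databank
]

source_rate = pass_rate(source_databank, "annulus_pressure")
mart_rate = pass_rate(data_mart, "annulus_pressure")
print(f"Source databank: {source_rate:.0f}% of wells pass the rule")
print(f"Data mart:       {mart_rate:.0f}% of wells pass the rule")
```

Measured only on the data mart, the rule looks fully satisfied. The gap becomes visible only when the same metric is also tracked on the source databank.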
Official
Corporate
Databanks
Poorer Quality Better Quality
Data Quality – Progressive Lopsidedness + Hidden Risks
Right
Decision?
Timely
Intervention?
Quality throughout the life cycle
Official
Corporate
Databanks
Data Quality Metrics – Tackling Quality at the Source
Right
Decision
Timely
Intervention
Data Quality Metrics
Dashboard
Checkshot
Well Header
Deviation
Well Logs (8)
Well Integrity
Pipelines
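A dashboard of this kind, combined with the traffic lights mentioned earlier in the deck, could be sketched as follows. The data type names come from the slide; the pass percentages and the green/amber thresholds are assumptions chosen for illustration and would need to be agreed per data type as fit-for-purpose quality levels.

```python
def traffic_light(pass_pct, green_at=90.0, amber_at=70.0):
    """Map a rule pass percentage to a traffic-light status.

    The thresholds are assumed defaults, not values from the Working Group.
    """
    if pass_pct >= green_at:
        return "GREEN"
    if pass_pct >= amber_at:
        return "AMBER"
    return "RED"


# Hypothetical pass percentages per data type.
scores = {
    "Checkshot": 95.0,
    "Well Header": 82.5,
    "Deviation": 64.0,
    "Well Logs": 90.0,
    "Well Integrity": 71.0,
    "Pipelines": 38.0,
}

dashboard = {data_type: traffic_light(pct) for data_type, pct in scores.items()}

for data_type, status in dashboard.items():
    print(f"{data_type:<15} {scores[data_type]:5.1f}%  {status}")
```

Keeping the thresholds as parameters lets each data type carry its own fit-for-purpose level instead of one global cut-off.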
• Understand our DATA
INVENTORY
• Measure and KNOW how much FIT-FOR-PURPOSE data there is
BUSINESS
NDR
DATA MGT
• We solve business problems
and create new opportunities
• Address data types as
building blocks across all
100+ EP types
Towards data science and big data analytics,
by putting science into data management
Concluding Remarks
• Implement METRICS to
improve QUALITY
• While measuring and knowing
where we are at all times
Thank You
Background Slides
Concluding Remarks
• Data Quality Metrics is key to measuring the quality of data we
manage
• Quality, fit-for-purpose data will positively impact business workflows (by up to 90% in some cases)
• We have to know where we are, how far we have come and how much further we need to go – metrics provide that transparency
• Data science / Analytics is meaningless if we do not put science
into data management.
Data quality is all about monetizing data
Build up of
the NDR
business
rules
library
Extract to required data types
NDR
Operators
Service
companies
Better internal data quality (track quantified benefits)
Better quality of data submitted (track quantified benefits)
Build utilities into tools for these (track quantified benefits)
Data Quality Metrics Tool / Dashboard
Business Rules
Once standardized, business rules make compliance easier
Business Rules in DQM – How it makes sense
3 dimensions
Cluster Analysis – Separating Variables in n-Dimensions
Visualization
2 dimensions
4, 5, ……, n dimensions?
Through the use of dendrograms
Data Science – Frontier Opportunities in O&G
Graph Databases to investigate
data relationships
Modeling the Top 5 Securities using Neo4J
https://neo4j.com/graphgist/aad2c4f2-06c7-40a4-8b20-bca2f2a4ca92
Semantic Text Extraction example:
Final well report – Basic well data
http://www.npd.no/engelsk/cwi/pbl/wellbore_documents/1878_2_7_27_S_Completion_report.pdf
The main goal of data visualization is to communicate information clearly and effectively through graphical means
Typical topics in Data Visualization:
- Exploratory Data Analysis
- Information design
- Descriptive statistics
- Inferential statistics
- Statistical graphics
- Plot graphics
- Data analysis
- Infographics
Examples of application:
- Is there a correlation between carbohydrates and fat?
- Age distribution of shoppers
- Trends in production performance
- Porosities versus depth plots
- Sand distribution in a basin
Exploring Data Science - Visualization
Gantt Chart, Network Chart, Tree Map, Streamgraph, Bar Chart, Scatterplot
Statistics is the science of making decisions in the face of uncertainty. It is a branch of applied mathematics.
Typical topics in Statistics:
- Frequency distributions
- Measures of location (mean, median, mode)
- Measures of variation (standard deviation)
- Probability and probability distributions
- Expectations
- Statistical inference
- Analysis of variance
- Nonparametric methods
- Regression, Correlation
- Multivariate methods (Factor, Principal Components, Discriminant Functions, Cluster etc)
Examples of application:
- Drug testing
- Deciding on which well to drill
- Comparison of the efficiency of 2 production processes
- Election predictions
- Casinos
- Everyday decisions such as whether to bring an umbrella or not
- Which route to take to work
Exploring Data Science - Statistics
Machine learning is the study devoted to the development of machines that improve performance with experience.
Typical topics in Machine Learning:
- Classification algorithms
- Splitting dataset
- Decision trees
- Probabilistic classification, Bayes
- Regression analysis
- Forecasting & prediction
- Supervised and unsupervised learning
- Principal components analysis
- Matrix algebra
- Big data toolkits (Hadoop, MapReduce)
Examples of application:
- Making sense of diverse data
- Relating apparently unrelated data
- Quantifying concepts such as “maximize profits”, “minimize risk”, “find the best marketing strategy”
- Building autonomous robots
Exploring Data Science – Machine Learning
Data Science
Data Science - Profile
Data
Visualization.
Machine
Learning
Mathematics Statistics Computer
Science
Communication
Domain
Expertise
Source: Doing Data Science. Cathy O’Neil & Rachel Schutt
Inventorize &
Understand
Define Way
Forward
Implementation
Planning
Trusted Data
& Robust
Analytics
Cognitive
Computing
Data Science / AI
Data Quality Metrics
Cluster analysis is a multivariate technique which allows comparisons and classifications to be done on a set of samples (Q-mode), based on their species content, even when little is known about the structure of the data.
This example is based on foraminiferal presence/absence data.
Dendrogram of samples from 1 well using Ward’s clustering method and Squared Euclidean Distance coefficient
Source:
Computer-assisted interpretation of depositional palaeoenvironments
based on foraminifera. Philip Lesslar, Geol. Soc. Malaysia Bulletin 21,
December, 1987.
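To make the clustering step concrete, the following is a minimal pure-Python sketch of agglomerative clustering with Ward's method on squared Euclidean distances, applied to presence/absence vectors. It is not the program used in the cited study: the four samples and five species columns are invented, and the function names are hypothetical. Cluster distances are updated with the Lance-Williams formula for Ward's method.

```python
from itertools import combinations


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_linkage(samples):
    """Cluster samples with Ward's method.

    Returns the merges in order as (members_a, members_b, distance),
    which is the information a dendrogram is drawn from.
    """
    clusters = {i: [i] for i in range(len(samples))}
    dist = {
        frozenset((i, j)): squared_euclidean(samples[i], samples[j])
        for i, j in combinations(range(len(samples)), 2)
    }
    merges = []
    next_id = len(samples)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)  # closest pair of clusters
        i, j = sorted(pair)
        d_ij = dist.pop(pair)
        n_i, n_j = len(clusters[i]), len(clusters[j])
        for k in clusters:
            if k in (i, j):
                continue
            n_k = len(clusters[k])
            d_ki = dist.pop(frozenset((k, i)))
            d_kj = dist.pop(frozenset((k, j)))
            # Lance-Williams update for Ward's method.
            dist[frozenset((k, next_id))] = (
                (n_i + n_k) * d_ki + (n_j + n_k) * d_kj - n_k * d_ij
            ) / (n_i + n_j + n_k)
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d_ij))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Hypothetical presence (1) / absence (0) of five species in four samples.
samples = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
]

merges = ward_linkage(samples)
for members_a, members_b, distance in merges:
    print(f"merge {members_a} + {members_b} at distance {distance:.2f}")
```

In this toy data the two pairs of similar samples join first, and the two resulting groups are merged last at a much larger distance, which is the kind of separation a dendrogram makes visible.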
North West Borneo Environmental Scheme
Cluster Analysis Example – Environments of Deposition
Identification Program - Bayesian Inference (Likelihood Ratio)
Clustering
Build Probability Matrix of valid clusters
• ~2500 samples
• ~1500 species in region
• ~3 million identified
species in all samples
Input data
Results
Interpretation of Depositional Environments - Foraminifera