Data Quality Metrics - Energistics · • Data Quality Metrics is key to measuring the quality of...

Data Quality Metrics

NDR2017, June 6th – 8th, Stavanger, Norway

Report of the NDR Data Quality Metrics Workgroup

2014-2017

Philip Lesslar

Helen Stephenson

Ugur Algan

Jill Lewis

- Development of the Data Quality Guidelines for implementation
- Documentation of key data quality issues
- Basic Data Quality Primers prepared

2012-2014

2014-2017

2017-2019/20

2020++

Breakout to list and categorize data quality problems

- Set up of Working Group

3 breakouts to cover:
i) business rules
ii) tools and dashboards
iii) data correction process

- Define, design & carry out implementation in NOCs
- Data quality transparency drive
- Define architectural and quality requirements for big data analytics

Ad hoc sessions:
i) business rules development
ii) implementation planning

Inventorize & Understand

Define Way Forward

Implementation Planning

Trusted Data & Robust Analytics

Data Quality Metrics – The Journey

October 2012

October 2014

June 2017

Cognitive Computing

Trusted Data Asset

Fit-for-purpose quality levels

Transparent & ongoing monitoring

Shorter decision making lead time

Project ready data

Enhanced reputation to external parties

Investment attractive

Quality metrics of data submitted

Data management workstreams

The EP value chain

External investments

Data asset integrity

Data Quality in the context of NDRs

Business rules library

DQ Tool & target DBs

NDR Dashboard

Traffic lights

Data correction

“Going Green”

Objectives:
• Understand what business rules are
• Review proposed structure
• Review existing rules list
• Define additional rules (e.g. 20 more)
• Agree on SIG virtual structure and roadmap to next NDR

Deliverables:
• 20 additional business rules
• Agreed SIG working structure
• Roadmap to next NDR

Facilitator
• Helen Stephenson

Objectives:
• Understand the working model of a DQ tool
• Understand how business rules are implemented and used
• Discuss the presentation of results in a DQ dashboard
• Develop and agree on a dashboard for NDR purposes

Deliverables:
• DQ Dashboard mockup for NDR

Facilitators
• Jill Lewis / Philip Lesslar

Objectives:
• Discuss the data correction workflow
• Review the requirements for business rules in order to facilitate data correction
• Develop and agree on the workflow needed to ensure data corrections (including roles & responsibilities)

Deliverables:
• Data correction workflow
• Requirements for business rules

Facilitator
• Ugur Algan

1 2 3

HOVSAN AMBURAN SHARQ ZALI 1

NDR2014 Breakouts

NDR11 Data Quality Workgroup Team Members

Name

1 Helen Stephenson

2 Andrew Ochan

3 Marco Cota

4 Fanny Herawati

5 Melissa Amstelveen

6 Sarah Spinoccia

7 Jess Kozman

8 Ugur Algan

9 Richard Wylde

10 Cyril Dzreke

11 Lim TeckHuat

12 Hairel Dean

13 Deano Maling

14 Choo Chuan Heng

15 Iman Al-Farsi

16 Jill Lewis

17 Henri Blondelle

18 Armando Gomez

19 Kapil Joneja

20 Giuseppe Vitobello

21 Gareth Wright

22 Chan Kok Wah

23 Ali Alyahyaee (Scribe)

24 Philip Lesslar (Facilitator)

24

NDR2014 Data Quality Workgroup Team Members

3 DATA CORRECTION WORKFLOW

1 Mehman Yusufov

2 Vahid Jafarov

3 Aleksa Shchorlich

4 Ngwako Maguai

5 Joseph Justin Soosai

6 Samit Sencurta

7 Julian Pickering

8 Ugur Algan

2 TOOLS & DASHBOARDS

1 Kapil Jonjega

2 Jack Walten

3 Johanda du Toit

4 Gustavo Tinoco

5 Ferdinand Aniwa

6 David Atta-Peters

7 Angus Craig

8 Natalia Rakhmanina

9 Glab Khanuntin

10 Edem Mawuko

11 Alexander Kosolapov

12 Daniel Arthur

13 Eric Toogood

14 Tatiana Vassilieva

15 Henri Blondelle

16 Marianne Hansen

17 Jill Lewis

18 Mikhail Leypunsky

19 Aygun Mamedova

20 Rena Huseyn-zade

21 Irada Huseynova

22 Philip Lesslar

1 BUSINESS RULES

1 Helen Stephenson

2 Richard Salway

3 Abraham Oseng

4 Malcolm Flowers

5 Uffe Larsen

6 Calisto Nhatugues

7 Sylvester Nguessan

8 Gianluca Monachese

9 Jan Adolfssen

10 Lee Allison

40

• The WG will consolidate all the discussions and upload them to the collaboration site (minutes of meetings)
• The WG will work through Energistics to explore an improved collaboration space
• We all need to increase global participation and collaboration
• Business Rules:
  • Ongoing buildup of the library
  • Separate into logical categories (data types, activity type etc)
  • Develop documentation and “soft version”
• Compilation of a DQ Metrics Starter “Kit”:
  • Include work done to date
  • Guidelines on how to get started
  • Example implementation & experiences
  • What are the essentials and pitfalls
  • Tools assessment and fitting to your needs etc

NDR2014 Closing – What we planned to do

Did not manage to do

What we did : 2014-2017

Part 1: Background and Case for Change
Context and justification to Management

Part 2: Business Rules Fundamentals
Data quality dimensions, key concepts around business rules, 18 data types, 241 rules

Part 3: Implementation
Metrics, dashboards, implementing rules as queries, understanding results, getting the program going
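The idea of implementing rules as queries can be sketched as follows. This is a minimal illustration only: the `well_header` table, its columns, the sample rows and the two rules are all hypothetical and are not taken from the workgroup's rules library. Each rule is stored as a SQL predicate that is true for a passing record, and the metric is the percentage of records that pass.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE well_header (well_id TEXT, spud_date TEXT, total_depth_m REAL)"
)
conn.executemany(
    "INSERT INTO well_header VALUES (?, ?, ?)",
    [
        ("W-001", "2010-03-14", 2450.0),
        ("W-002", None, 3120.5),
        ("W-003", "2012-11-02", -15.0),
        ("W-004", "2015-06-30", 1980.0),
    ],
)

# Each rule is a name plus a SQL predicate that is true for a passing record.
# Both rules are assumed examples, not entries from the actual library.
RULES = {
    "Well must have spud date defined": "spud_date IS NOT NULL",
    "Total depth must be greater than zero": "total_depth_m > 0",
}


def pass_rate(connection, table, predicate):
    """Return the percentage of rows in `table` that satisfy `predicate`."""
    total = connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    passed = connection.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {predicate}"
    ).fetchone()[0]
    return 100.0 * passed / total if total else 0.0


results = {name: pass_rate(conn, "well_header", sql) for name, sql in RULES.items()}
for name, pct in results.items():
    print(f"{name}: {pct:.1f}% pass")
```

Keeping the rules as data rather than as hard-coded checks means the library can grow without changing the code that runs it.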

NDR2017 Activities for Data Quality Metrics

• No data quality metrics breakouts planned in view of full program
• Completed documents available for review
• The Working Group (WG) is available to discuss implementation of data quality metrics

Why implement data quality metrics?

Quality, fit-for-purpose data

Streamlines the business and its workflows

Increases data asset value and investor confidence

Builds essential data condition for effective use of new technologies

Perspective

Business

NDR

Data Management

Enabler for improving data efficiency by up to 90%

Data Science &

Analytics

• Without metrics, we cannot measure the quality of the data we have
• Consequently, we cannot show how much quality, fit-for-purpose data there is…

Investment Trends

Across all businesses, there will be a greater than 300% increase in investment in artificial intelligence in 2017 compared with 2016.


Open

Primary Data

Original Format Data
- Raw Seismic
- Raw Logs
Requires:
- Official data repository

Reference Data/Metadata
Units of measure
- Linear measures
- Pressure
Abbreviations
- TD, DFE, KB etc
Valid Lists
Range indicators
Comments
Requires:
- Standards
- Implementation across all impacted tools and databases

Master Data/Corporate “Single Source of Truth”
Static (hard) data
- Well header
- Deviation
- Checkshot
- Temperature
- Pressure
Interpreted (soft) data
- Geological markers
- Seismic horizons
Requires:
- Clear processes, workflows and checkpoints
- Proper & official repository
- Management and security processes around repository and data access

Secondary Data

Derived Data
Processed data
- Seismic deconvolution
- Seismic filtering
- Seismic processing
- Edited logs
- Spliced logs
Interpreted data
- Geological markers
- Seismic horizons
Requires:
- Standard workflows
- Standard algorithms
- Standard processes
- Housekeeping procedures

Data Collections
Composite data
- Completion log
- Mud log
- Paleontological composites
- TRAPIS
Data hoards
- Projects en masse
- Personal stores
- Team folders
Data archive
- Projects en masse
Requires:
- Standard display and formatting templates
- Procedures

Data Classification – Digital Data (>100 types in Upstream)


Exploration workflow

Facilities workflow

Production Geology workflow

Analytics workflow

Data Type Building Blocks along the Exploration & Production (EP) Value Chain

PRIMARY DATA SECONDARY DATA

Quality Data Envelope

Packaging Quality Data – The Building Blocks

Google Analytics

Trend Plot (IQM)

Heat Map

Bubble Plots

Yahoo Web Analytics

Data Science / Analytics – Typical Deliverables

Data Extraction

Data Viz & Analytics

Corporate Databank

Stack

Right Decision

Timely Intervention

Requires quality data to be available

- Mapping
- Extraction
- Cleansing
- Standardising
- Quality control

Cleansed / QC'd data does not flow back to the corporate databanks

Note: The Data Mart starts to have better quality data than the official corporate databank

Data Analytics Conceptual Architecture

Corporate Databank

The analytics may not indicate quality levels

E.g. Annulus Pressure

These errors will only be recognised if you are tracking the quality levels in the source databank

Data Quality Error Persistence

Business Rule: Well must have annulus pressure defined
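The persistence of such an error can be illustrated with a short sketch. The well IDs, pressure values and cleansing fixes below are hypothetical; the only element taken from the slide is the business rule itself. The cleansed data mart passes the rule, while the source databank still fails it because the fixes are never written back.

```python
# Hypothetical records: annulus pressure per well, None meaning "not defined".
source_databank = {
    "W-001": 12.5,
    "W-002": None,
    "W-003": 8.1,
    "W-004": None,
    "W-005": 10.4,
}

# Values filled in during extraction and cleansing for analytics (assumed figures).
cleansing_fixes = {"W-002": 11.0, "W-004": 9.3}


def has_annulus_pressure(value):
    """Business rule: well must have annulus pressure defined."""
    return value is not None


def failing_wells(records):
    """Return the sorted IDs of wells that break the rule."""
    return sorted(w for w, v in records.items() if not has_annulus_pressure(v))


# The data mart is a cleansed copy; the fixes are never written back to the source.
data_mart = {**source_databank, **cleansing_fixes}

mart_failures = failing_wells(data_mart)
source_failures = failing_wells(source_databank)

print("Data mart failures:      ", mart_failures)
print("Source databank failures:", source_failures)
```

Running the same rule against both stores is what exposes the gap: measuring only the data mart would suggest there is no problem.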

Official Corporate Databanks

Poorer Quality

Better Quality

Data Quality – Progressive Lopsidedness + Hidden Risks

Right Decision?

Timely Intervention?

Quality throughout the life cycle

Official Corporate Databanks

Data Quality Metrics – Tackling Quality at the Source

Right Decision

Timely Intervention

Data Quality Metrics Dashboard

Checkshot
Well Header
Deviation
Well Logs (8)
Well Integrity
Pipelines
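A traffic-light view of such a dashboard could be derived from rule pass rates roughly as sketched here. The thresholds (95% for green, 80% for amber) and the pass rates are assumed values for illustration; only the data type names follow the dashboard labels.

```python
def traffic_light(pass_rate, green_at=95.0, amber_at=80.0):
    """Map a rule pass rate (0-100) to a dashboard colour.

    The default thresholds are assumed values for illustration only.
    """
    if not 0.0 <= pass_rate <= 100.0:
        raise ValueError("pass_rate must be between 0 and 100")
    if pass_rate >= green_at:
        return "GREEN"
    if pass_rate >= amber_at:
        return "AMBER"
    return "RED"


# Hypothetical pass rates per data type.
pass_rates = {
    "Checkshot": 97.2,
    "Well Header": 88.0,
    "Deviation": 91.5,
    "Well Logs": 64.3,
    "Well Integrity": 99.0,
    "Pipelines": 79.9,
}

dashboard = {data_type: traffic_light(rate) for data_type, rate in pass_rates.items()}
for data_type, colour in dashboard.items():
    print(f"{data_type:<15} {pass_rates[data_type]:5.1f}%  {colour}")
```

In practice the thresholds would be set per data type to reflect what counts as fit-for-purpose for that data.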

• Understand our DATA INVENTORY
• Measure and KNOW how much FIT-FOR-PURPOSE data there is

BUSINESS

NDR

DATA MGT

• We solve business problems and create new opportunities
• Address data types as building blocks across all 100+ EP types

Towards data science and big data analytics, by putting science into data management

Concluding Remarks

• Implement METRICS to improve QUALITY
• While measuring and knowing where we are at all times

Thank You

Background Slides

Concluding Remarks

• Data Quality Metrics are key to measuring the quality of the data we manage
• Quality, fit-for-purpose data will positively impact business workflows (by up to 90% in some cases)
• We have to know where we are, how far we have come and how much further we need to go – metrics provide that transparency
• Data science / Analytics is meaningless if we do not put science into data management.

Data quality is all about monetizing data

Build up of the NDR business rules library

Extract to required data types

NDR

Operators

Service companies

Better internal data quality (track quantified benefits)

Better quality of data submitted (track quantified benefits)

Build utilities into tools for these (track quantified benefits)

Data Quality Metrics Tool / Dashboard

Business Rules

Once standardized, business rules make compliance easier

Business Rules in DQM – How it makes sense

3 dimensions

Cluster Analysis – Separating Variables in n-Dimensions

Visualization

2 dimensions

4, 5, ……, n dimensions?

Through the use of dendrograms

Data Science – Frontier Opportunities in O&G

Graph Databases to investigate data relationships

Modeling the Top 5 Securities using Neo4J: https://neo4j.com/graphgist/aad2c4f2-06c7-40a4-8b20-bca2f2a4ca92

Semantic Text Extraction example:

Final well report – Basic well data

http://www.npd.no/engelsk/cwi/pbl/wellbore_documents/1878_2_7_27_S_Completion_report.pdf

The main goal of data visualization is to communicate information clearly and effectively through graphical means

Typical topics in Data Visualization:
- Exploratory Data Analysis
- Information design
- Descriptive statistics
- Inferential statistics
- Statistical graphics
- Plot graphics
- Data analysis
- Infographics

Examples of application:
- Is there a correlation between carbohydrates and fat?
- Age distribution of shoppers
- Trends in production performance
- Porosities versus depth plots
- Sand distribution in a basin

Exploring Data Science - Visualization

Gantt Chart, Network Chart, Tree Map, Streamgraph, Bar Chart, Scatterplot

Statistics is the science of making decisions in the face of uncertainty. It is a branch of applied mathematics.

Typical topics in Statistics:
- Frequency distributions
- Measures of location (mean, median, mode)
- Measures of variation (standard deviation)
- Probability and probability distributions
- Expectations
- Statistical inference
- Analysis of variance
- Nonparametric methods
- Regression, Correlation
- Multivariate methods (Factor, Principal Components, Discriminant Functions, Cluster etc)

Examples of application:
- Drug testing
- Deciding on which well to drill
- Comparison of the efficiency of 2 production processes
- Election predictions
- Casinos
- Everyday decisions such as whether to bring an umbrella or not
- Which route to take to work

Exploring Data Science - Statistics

Machine learning is the study devoted to the development of machines that improve performance with experience.

Typical topics in Machine Learning:
- Classification algorithms
- Splitting dataset
- Decision trees
- Probabilistic classification, Bayes
- Regression analysis
- Forecasting & prediction
- Supervised and unsupervised learning
- Principal components analysis
- Matrix algebra
- Big data toolkits (Hadoop, MapReduce)

Examples of application:
- Making sense of diverse data
- Relating apparently unrelated data
- Quantifying concepts such as “maximize profits”, “minimize risk”, “find the best marketing strategy”
- Building autonomous robots

Exploring Data Science – Machine Learning

Data Science

Data Science - Profile

Data Visualization

Machine Learning

Mathematics

Statistics

Computer Science

Communication

Domain Expertise

Source: Doing Data Science. Cathy O’Neil & Rachel Schutt

Inventorize & Understand

Define Way Forward

Implementation Planning

Trusted Data & Robust Analytics

Cognitive Computing

Data Science / AI

Data Quality Metrics

Cluster analysis is a multivariate technique which allows comparisons and classifications to be done on a set of samples (Q-mode), based on their species content, even when little is known about the structure of the data.

This example is based on foraminiferal presence/absence data.

Dendrogram of samples from 1 well using Ward’s clustering method and Squared Euclidean Distance coefficient

Source: Computer-assisted interpretation of depositional palaeoenvironments based on foraminifera. Philip Lesslar, Geol. Soc. Malaysia Bulletin 21, December, 1987.

North West Borneo Environmental Scheme

Cluster Analysis Example – Environments of Deposition
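The clustering technique named on this slide, Ward's method on squared Euclidean distances, can be sketched in a few lines. This is a naive illustration and not the program used in the cited study: the five samples and their presence/absence values are invented, and a library routine would normally be used instead. At each step the two clusters whose merger adds the least within-cluster sum of squares are joined, and the recorded merge history is what a dendrogram would draw.

```python
from itertools import combinations

# Hypothetical presence/absence matrix: rows are samples, columns are species.
samples = {
    "S1": [1, 1, 0, 0, 0],
    "S2": [1, 1, 1, 0, 0],
    "S3": [0, 0, 0, 1, 1],
    "S4": [0, 0, 1, 1, 1],
    "S5": [1, 0, 0, 0, 0],
}


def centroid(members):
    """Mean vector of the samples in a cluster."""
    rows = [samples[m] for m in members]
    return [sum(col) / len(rows) for col in zip(*rows)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_linkage(names):
    """Agglomerate clusters greedily, returning the merge history."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters that is cheapest to merge.
        a, b = min(combinations(clusters, 2), key=lambda pair: ward_cost(*pair))
        merges.append((a, b, ward_cost(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = ward_linkage(sorted(samples))
for a, b, cost in merges:
    print(f"merge {a} + {b} at cost {cost:.3f}")
```

With this invented data the samples separate into two groups, one containing S1, S2 and S5 and the other S3 and S4, before the final and most expensive merge joins them.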

Identification Program - Bayesian Inference (Likelihood Ratio)

Clustering

Build Probability Matrix of valid clusters

• ~2500 samples
• ~1500 species in region
• ~3 million identified species in all samples

Input data

Results

Interpretation of Depositional Environments - Foraminifera
