Date posted: 16-Apr-2017
Data Quality: The True Big Data Challenge
Dr. Stefan Kühn, Lead Data Scientist
data2day 2016 - Karlsruhe
A short motivation
• Some "famous" quotes
  • "Data are becoming the new raw material of business."
  • "The data fabric is the next middleware."
  • "Data matures like wine, applications like fish."
  • "There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days."
  • "Information is the oil of the 21st century, and analytics is the combustion engine."
Data matures like wine?
More like grapes…
• Some "critical" quotes
  • "Big Data is not the new oil."
  • "Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom."
  • "It’s easy to lie with statistics. It’s hard to tell the truth without statistics."
  • "Anything that can be measured can be improved."
Data Quality Fundamentals
Twofold Approach to Data Quality
• Does Data represent the real-world objects / events / concepts it is supposed to?
• Does Data meet the expectations of the Data consumers and the requirements of the intended usage?
• Warning: Data is not facts!
Data does not exist independently of its creation.
Data as representation
[Figure: Semiotic Triangle with the corners Idea, Word, and World. Where is the Data?]
[Figure: Semiotic Triangle with the corners Metadata, Data, and World. Here is the Data.]
Data and Metadata
• Data implies a context -> Metadata
• Metadata provides explicit knowledge about Data
• Metadata enables a common understanding of Data inside an organization
• Metadata serves as documentation and dictionary, as context for Data Understanding
Metadata is absolutely necessary for the effective use of Data.
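As a minimal sketch of Metadata serving as "documentation and dictionary", a data dictionary can be as small as a lookup from field name to an explicit description. The class `FieldMetadata`, the field `order_total`, and the "checkout service" source are hypothetical names chosen for illustration; they do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMetadata:
    """Explicit knowledge about one data field (all names are illustrative)."""
    name: str
    description: str
    unit: str
    source: str  # which process or system creates the value


# A tiny data dictionary: field name -> metadata (hypothetical example entry).
DATA_DICTIONARY = {
    "order_total": FieldMetadata(
        name="order_total",
        description="Sum of all line items, including tax",
        unit="EUR",
        source="checkout service",
    ),
}


def describe(field_name: str) -> str:
    """Return a human-readable description, or flag missing metadata."""
    meta = DATA_DICTIONARY.get(field_name)
    if meta is None:
        return f"{field_name}: no metadata available"
    return f"{meta.name} [{meta.unit}]: {meta.description} (created by {meta.source})"
```

A field without an entry is reported as undocumented instead of being silently interpreted, which is one simple way to make missing context visible.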
Responsibility for Data
Common Misunderstanding
• Data and Data-related systems typically are managed and hosted by IT, therefore most people (from business and IT) tend to think that Data is part of IT and not of Business
• BUT: Data is not the by-product of Business Processes. Data is THE product of Business Processes
• Data Quality Improvement as Business Strategy
Shared Responsibility
Data Creation as Observation
• Data is created under specific Conditions and for specific Purposes
• Creation process involves
  • Observed Object
  • Observer
  • Instrument
• Example: Customer Self-Registration Form
  • Customer Information as Observed Object
  • Customer as Observer
  • Registration Form as Instrument
Instrument is not built / known by Observer.
Data as Product
• Analogy between manufacturing of products and creation / production of data
• Data as core product of a business process
• Transfer quality concepts from Software Development to "Data Development"
  • Testing
  • Staging
  • Versioning
  • Continuous Delivery / Improvement
• Product Management
• Standardization
Data Quality as Manufacturing Quality
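One way to read "Testing" and "Staging" for data is to treat a batch of records like a software build: it is only released if a set of data tests passes. The following is a rough sketch under that assumption; the function names `run_data_tests` and `promote`, the test names, and the record fields `id` and `email` are all hypothetical.

```python
def run_data_tests(records, tests):
    """Run each named test on the batch and return the names of failed tests."""
    return [name for name, test in tests.items() if not test(records)]


# Hypothetical tests for a staged batch of customer records.
TESTS = {
    "batch is not empty": lambda rs: len(rs) > 0,
    "customer ids are unique": lambda rs: len({r["id"] for r in rs}) == len(rs),
    "every record has an email": lambda rs: all(r.get("email") for r in rs),
}


def promote(records, tests=TESTS):
    """Release a staged batch only if all data tests pass."""
    failed = run_data_tests(records, tests)
    return {"released": not failed, "failed_tests": failed}
```

The list of failed tests doubles as a minimal quality report for the producer of the batch.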
Expectations and Requirements
• Implicit assumptions for usage of Data
• Creation of Data is a business process
• Expectations and requirements have to be explicitly known when defining the process
• Data Quality is Business Process Quality
• Constantly changing expectations and requirements make Data age like grapes…
Make all assumptions explicit.
Data Producers
• People or systems that create Data
• Producers have control over what they create (given the functionality of the instrument)
• Producers don’t have control over possible uses of data
• Most Data is produced for a dedicated purpose but used for several purposes
• Data Quality is fixed at the moment of creation
Data Quality starts with enabling producers to produce high-quality Data -> usable Data
Data Consumers
• People or systems that use Data within its lifecycle
• Multiple systems and people can consume data
• Often, Consumers are Producers at the same time
• Consumers do not control the production of Data but have implicit assumptions and expectations about it
Data Quality Processes are Consumers of Data of an unknown Quality and Producers of Data of a defined Quality
Data Quality Problems
Problematic Aspects of Data Management
• Data crosses Organizational Boundaries
• Technical (IT) and non-technical (Business) roles have to communicate
• Shared Responsibility instead of "Ownership"
• No common definitions
• Twelve Barriers to Effective Management of Data and Information Assets (Th. Redman)
Holistic Approach to Data Quality required
Big Data Quality Big Problems
Summary
Big Data
• "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (George Dyson)
What is Big Data?
• Different Data sources
  • External Data
  • No control over data production
  • Insufficient documentation (Metadata)
  • No quality definitions available
  • Incompatible schema
  • Example: Car callbacks
• Even more implicit assumptions
• Big Data implies less information per unit of data
  • Lots of data points are redundant
  • Example: Measure a constant quantity once per day or once per second
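The constant-quantity example can be made concrete with a rough sketch that uses compressed size as a crude proxy for information content. This proxy, the temperature value 21.5, and the function name `raw_and_compressed_size` are assumptions made for illustration, not part of the talk.

```python
import zlib

SECONDS_PER_DAY = 24 * 60 * 60


def raw_and_compressed_size(readings):
    """Return (raw bytes, zlib-compressed bytes) for a list of readings."""
    payload = ",".join(f"{value:.1f}" for value in readings).encode("ascii")
    return len(payload), len(zlib.compress(payload))


# Hypothetical sensor that always reports the same temperature.
once_per_day = [21.5]
once_per_second = [21.5] * SECONDS_PER_DAY

raw_day, packed_day = raw_and_compressed_size(once_per_day)
raw_sec, packed_sec = raw_and_compressed_size(once_per_second)

print(f"once per day:    {raw_day} raw bytes, {packed_day} compressed")
print(f"once per second: {raw_sec} raw bytes, {packed_sec} compressed")
```

The per-second series is 86,400 times longer, yet it compresses to a tiny fraction of its raw size, because almost every additional data point repeats what is already known.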
What is Big Data in the Media?
• "new oil"
• "gold"
• "revolution"
• "raw material"
• "the future"
• "bigger, better, faster, more"
• "more data beats better algorithms"
• …
Three major problems
• Redundancy
  • Big Data by Copy/Paste
• Resolution
  • Every problem has an inherent time scale of change
  • Every problem has an inherent level of uncertainty
  • Increasing the resolution beyond these levels only resolves noise
• Noise
  • Adding noisy features decreases the signal-to-noise ratio
  • Adding good but irrelevant features increases complexity and can look like noise
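The effect of noisy features on the signal-to-noise ratio can be illustrated with a small simulation. It assumes a deliberately naive score that simply sums one informative feature (a sine wave) and a number of pure-noise features with unit variance; this setup and the name `empirical_snr` are illustrative assumptions, not the speaker's experiment.

```python
import math
import random
import statistics


def empirical_snr(n_noisy_features, n_samples=2000, seed=0):
    """Signal-to-noise ratio of a naive score that sums one informative
    feature and `n_noisy_features` (at least 1) pure-noise features."""
    rng = random.Random(seed)
    # Informative feature: a sine wave with a period of 100 samples.
    signal = [math.sin(2 * math.pi * i / 100) for i in range(n_samples)]
    # Combined contribution of all noise features for each sample.
    noise = [
        sum(rng.gauss(0.0, 1.0) for _ in range(n_noisy_features))
        for _ in range(n_samples)
    ]
    return statistics.pvariance(signal) / statistics.pvariance(noise)


for k in (1, 10, 100):
    print(f"{k:3d} noisy features -> SNR about {empirical_snr(k):.3f}")
```

In this setup the variance of the signal stays fixed while the variance of the noise grows roughly in proportion to the number of noisy features, so the ratio drops accordingly.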
[Figure: Redundancy]
[Figures: Resolution (two slides)]
[Figures: Noise (two slides)]
Example from Kaggle
[Figure: 349 variables, basically rank 1]
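A data set is "basically rank 1" when a single direction explains almost all of its variation. The sketch below estimates that share with power iteration in plain Python; it runs on small synthetic matrices, not on the Kaggle data from the talk, and the function name `top_component_share` is a hypothetical choice.

```python
def top_component_share(matrix, iterations=100):
    """Share of the squared Frobenius norm captured by the largest singular value.

    A value close to 1.0 means the matrix is numerically close to rank 1.
    """
    total = sum(x * x for row in matrix for x in row)
    if total == 0.0:
        return 0.0
    n_cols = len(matrix[0])
    v = [1.0] * n_cols  # start vector; assumed not orthogonal to the top direction
    for _ in range(iterations):
        # One power-iteration step on A^T A: w = A^T (A v).
        av = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        w = [sum(row[j] * av[i] for i, row in enumerate(matrix)) for j in range(n_cols)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            # The start vector is orthogonal to every row; no estimate possible.
            return 0.0
        v = [x / norm for x in w]
    av = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    sigma1_squared = sum(x * x for x in av)
    return sigma1_squared / total


# Every row is a multiple of [1, 2, 3]: three columns, but only one direction.
redundant = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
# Two independent columns of equal weight.
independent = [[1.0, 0.0], [0.0, 1.0]]
```

For the redundant matrix the share is 1.0 up to rounding, for the independent one it is 0.5. On real data one would typically center and scale the columns first and use a numerical library instead of this hand-written loop.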
Moore’s Law
[Figure: Moore’s Law. By Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542]
What’s the point?
• Moore’s Law
  • The number of transistors per area doubles about every two years
• Real-world Problem sizes
  • Grow at approximately the same speed
• Algorithmic requirements
  • For answering the same questions in the same time, we need algorithms with linear complexity
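The argument can be checked with a few lines of arithmetic. The sketch assumes that compute speed doubles in step with the transistor count and that problem size doubles at the same rate, which is a simplification of the slide's claim; `runtime_factor` is a hypothetical helper name.

```python
def runtime_factor(exponent, doublings):
    """Relative runtime of an O(n**exponent) algorithm after `doublings`
    two-year periods, if problem size and compute speed both double per period."""
    problem_size = 2.0 ** doublings
    compute_speed = 2.0 ** doublings
    return problem_size ** exponent / compute_speed


# After ten years (five doublings):
for exponent in (1, 2, 3):
    print(f"O(n^{exponent}): runtime x {runtime_factor(exponent, 5):g}")
```

Under these assumptions only the linear algorithm keeps its runtime constant; a quadratic one takes 32 times as long after ten years, a cubic one 1024 times as long.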
Solutions?
Overall Goals
• Implement Data Quality Standards
• Detect Data Quality Problems
• Manage Data Quality Problems
• Root Cause Analysis of Data Quality Problems
• Measure Costs of "poor" Data Quality
• Measure Value of Data / "high" Data Quality
• Measure Effects of Data Quality Improvements
Typical Approaches
• Force Data Quality (via order)
  • Fillrate: Make certain fields a must
  • Range: Prescribe list of valid options
• Buy tool
• Hire expert
• Fire expert
• Collect more bad Data
• Relabel "bad" Data Pool as Data Lake
• …
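The two "forced" measures, fill rate and value range, are easy to compute, which may be part of their appeal. A minimal sketch, assuming records are plain dictionaries; the field name `country` and the list of valid options are hypothetical.

```python
def fill_rate(records, field):
    """Fraction of records where `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def range_violations(records, field, valid_options):
    """Records whose non-empty `field` value is not among the valid options."""
    return [
        r for r in records
        if r.get(field) not in (None, "") and r[field] not in valid_options
    ]


# Hypothetical registration records.
records = [{"country": "DE"}, {"country": ""}, {"country": "XX"}, {}]
print(fill_rate(records, "country"))                       # share of filled fields
print(range_violations(records, "country", {"DE", "FR"}))  # values outside the list
```

Note that both checks only measure conformance to the rule. A mandatory field can reach a fill rate of 1.0 while containing placeholder values, so these numbers say little about whether the Data represents reality.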
Summary of the problem
Big Data
• "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (George Dyson)
Useful Approaches
• Hire expert ;-)
• Shared Responsibility
• Common Understanding of and access to Metadata
  • This does not imply that the terminology has to change
  • Typically, the same term has a different meaning in different departments
  • Bounded contexts! (DDD)
• Invest in creating better Data instead of fixing old and broken Data
Treat Data as Product, not as Fact