UNIVERSITY OF MASSACHUSETTS, AMHERST • College of Information and Computer Sciences

Data X-Ray: A diagnostic tool for data errors

Xiaolan Wang, Xin Luna Dong, Alexandra Meliou

MANY APPLICATIONS RELY ON DATA

•  Knowledge graph (www.google.com)
•  Social network analytics
•  Shopping systems of retail companies

Data is not perfect! Erroneous data can be extremely costly!

KNOWLEDGE VAULT [Dong14]

[Diagram: Web sources (TXT, DOM, TBL, ANO) → Extraction System (Extractor, Extractor, …) → Fusion → prKB]

•  3.0 billion extracted triples; more than 70% are wrong.
•  Traditional method: identify errors and drop them (prKB → Perfect KB).
•  Errors are systematic: faulty information, bad extraction rules.
•  The extraction system continues to generate erroneous data.
•  Diagnose the root reason for the errors.

REAL-WORLD SYSTEMATIC ERRORS

Default Value Error
(besoccer.com, date_of_birth, 1986_02_18)
# Triples: 630; Error Rate: 100%
Context: The date of birth of athletes extracted from besoccer.com is set to the default value 1986_02_18.

Reconciliation Error
(Extractor S, obj: Baseball Coach)
# Triples: 674,000; Error Rate: 89.3%
Context: All coaches are reconciled to baseball coaches, e.g., [Bob Barton, profession, Baseball Coach].

Coreference Error
(Extractor T, pred: namesakes, obj: the county)
# Triples: 4,878; Error Rate: 99.8%
E.g., [Salmon P. Chase, namesakes, the county]
Context: "The county was named for Salmon P. Chase, former senator and governor of Ohio."

HOW TO DERIVE A DIAGNOSIS?

Knowledge triple                     Correct?
<Domenico Modugno, DoB, 01/09/1958>  False
<Bert Kaempfert, DoB, 09/01/1961>    False
<The Singing Nun, DoB, 07/12/1963>   False
<Paul Mauriat, DoB, 10/02/1963>      False
<Shocking Blue, DoB, 02/07/1968>     True
<U2, DoB, 05/16/1987>                True

Correctness is determined by leveraging existing data cleaning methods [Abiteboul99, Fan08, Kalashnikov06, Rahm00, Raman01].

Q: Can we treat the erroneous triples as a diagnosis?
A: No, for two reasons:

•  There are too many erroneous triples (more than 2B in KV).
•  They are due to a variety of errors.

WHAT IS A DIAGNOSIS?

Knowledge triple                     Correct?  Subject      Predicate  Object         Web source       Extractor
<Domenico Modugno, DoB, 01/09/1958>  False     People/D.M.  Bio/DoB    Date/01091958  euromusicxx.com  Extractor 1
<Bert Kaempfert, DoB, 09/01/1961>    False     People/B.K.  Bio/DoB    Date/09011961  euromusicxx.com  Extractor 1
<The Singing Nun, DoB, 07/12/1963>   False     People/TSN   Bio/DoB    Date/07121963  euromusicxx.com  Extractor 1
<Paul Mauriat, DoB, 10/02/1963>      False     People/P.M.  Bio/DoB    Date/10021963  euromusicxx.com  Extractor 1
<Shocking Blue, DoB, 02/07/1968>     True      People/S.B.  Bio/DoB    Date/02071968  wiki.com         Extractor 1
<U2, DoB, 05/16/1987>                True      People/U2    Bio/DoB    Date/05161987  wiki.com         Extractor 1

Group error data: Dates from the website euromusicxx.com extracted by Extractor 1 are wrong. (Bad extraction rule: a U.S. date format rule is used to extract date information from a European website.)
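The bad extraction rule described above can be illustrated with a minimal Python sketch. It assumes, as the slide states, that the website writes dates in day/month/year order while the extractor reads them as month/day/year; the variable names are illustrative only and not part of Data X-Ray.

```python
from datetime import datetime

raw = "01/09/1958"  # date string as it appears in the first example triple

# Hypothetical extraction rule that assumes U.S. order (month/day/year).
us_reading = datetime.strptime(raw, "%m/%d/%Y").date()

# The same string read in European order (day/month/year).
eu_reading = datetime.strptime(raw, "%d/%m/%Y").date()

print(us_reading)  # 1958-01-09
print(eu_reading)  # 1958-09-01
```

Both readings parse without error, which is why such a rule fails silently and produces a whole group of wrong triples rather than isolated mistakes.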


WHAT IS A DIAGNOSIS?

Input 1: Elements and their correctness (the "Knowledge triple" and "Correct?" columns of the example table).
Input 2: Features, i.e., combinations of metadata information (subject, predicate, object, web source, extractor).
Output (diagnosis): a set of features.
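As a rough illustration of the two inputs, the sketch below encodes the six example triples together with two of their metadata attributes and computes the error rate of every (web source, extractor) feature. The data layout and the function name `error_rates` are assumptions made for this example, not part of Data X-Ray.

```python
from collections import defaultdict

# (subject, web source, extractor, correct?) for the six example triples.
ELEMENTS = [
    ("Domenico Modugno", "euromusicxx.com", "Extractor 1", False),
    ("Bert Kaempfert", "euromusicxx.com", "Extractor 1", False),
    ("The Singing Nun", "euromusicxx.com", "Extractor 1", False),
    ("Paul Mauriat", "euromusicxx.com", "Extractor 1", False),
    ("Shocking Blue", "wiki.com", "Extractor 1", True),
    ("U2", "wiki.com", "Extractor 1", True),
]


def error_rates(elements):
    """Return the error rate of every (web source, extractor) feature."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [false count, total count]
    for _subject, source, extractor, correct in elements:
        stats = counts[(source, extractor)]
        stats[0] += 0 if correct else 1
        stats[1] += 1
    return {feature: false_n / total for feature, (false_n, total) in counts.items()}


rates = error_rates(ELEMENTS)
for feature, rate in rates.items():
    print(feature, rate)
```

A feature whose error rate is close to 1, such as (euromusicxx.com, Extractor 1) here, is a natural candidate for the diagnosis.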

Which diagnosis is the best?

DATAXRAY: COST MODEL

Cost model:

•  Conciseness: fewer features preferred
•  Specificity: higher error rate preferred
•  Consistency: fewer true elements preferred

Bayesian estimate of causal likelihood:

$$\Pr(F \mid E) = \prod_{f_i \in F} \alpha \, \epsilon_i^{|f_i.E_i^-|} \, (1 - \epsilon_i)^{|f_i.E_i^+|}$$

•  $\Pr(F \mid E)$: probability of $F$ being the cause of errors under the observation of the data items $E$
•  $\epsilon_i$: error rate of the feature $f_i$
•  $f_i.E_i^-$: false elements in the feature
•  $f_i.E_i^+$: true elements in the feature

Theorem 1: Deriving a diagnosis with minimum cost is NP-complete.
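Taking the negative logarithm of Pr(F|E) turns the product into a sum with one term per selected feature: log(1/α) + |E⁻| log(1/ε) + |E⁺| log(1/(1-ε)). These three parts line up roughly with conciseness, specificity, and consistency. The sketch below evaluates that sum in Python. The value of `alpha` and the clamping of the error rate away from 0 and 1 are assumptions made here so the logarithms stay finite; the slide does not specify either.

```python
import math


def feature_cost(false_n, true_n, alpha=0.1, clamp=1e-9):
    """Negative log of one factor of Pr(F|E) for a single feature."""
    total = false_n + true_n
    if total == 0:
        return math.log(1 / alpha)
    eps = false_n / total                  # error rate of the feature
    eps = min(max(eps, clamp), 1 - clamp)  # assumption: avoid log(0)
    return (math.log(1 / alpha)                  # one charge per feature
            + false_n * math.log(1 / eps)        # small when the error rate is high
            + true_n * math.log(1 / (1 - eps)))  # small when few elements are true


def diagnosis_cost(features, alpha=0.1):
    """Sum the cost over a diagnosis given as (false count, true count) pairs."""
    return sum(feature_cost(f, t, alpha) for f, t in features)


# euromusicxx.com dates: 4 false, 0 true. All six triples: 4 false, 2 true.
specific = diagnosis_cost([(4, 0)])
broad = diagnosis_cost([(4, 2)])
print(specific < broad)  # True
```

With these counts, the diagnosis that blames only the euromusicxx.com triples is cheaper than the one that blames all six, because it covers no true elements.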

DATAXRAY: ALGORITHM

Top-down iterative traversal of the feature hierarchy:

Level 0: (all, all, all)
Level 1: (all, all, extractor1), (all, wiki, all), (all, euromusicxx, all), (date, all, all)
Level 2: (all, wiki, extractor1), (all, euromusicxx, extractor1), (date, all, extractor1), (date, wiki, all), (date, euromusicxx, all)

Steps in each iteration: Split, Compare, Merge (illustrated on the parent (all, all, all) and its children (all, wiki, all) and (all, euromusicxx, all)).

Theorem 2: The DataXRay traversal has complexity linear in the number of features, with an O(# of features) approximation.
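The following Python sketch gives a heavily simplified flavor of a top-down traversal over such a hierarchy. It is not the authors' algorithm: it only splits a feature into children and compares costs, it omits the Merge step, and the cost constants `ALPHA` and `CLAMP`, the data layout, and the meaning of the first dimension (data type) are assumptions made for illustration.

```python
import math

ALPHA = 0.1   # hypothetical constant in the cost formula
CLAMP = 1e-9  # keeps log() finite when an error rate is exactly 0 or 1

# ((data type, web source, extractor), correct?) for the six example triples.
ELEMENTS = [
    (("date", "euromusicxx", "extractor1"), False),
    (("date", "euromusicxx", "extractor1"), False),
    (("date", "euromusicxx", "extractor1"), False),
    (("date", "euromusicxx", "extractor1"), False),
    (("date", "wiki", "extractor1"), True),
    (("date", "wiki", "extractor1"), True),
]


def covered(feature, elements):
    """Elements matched by a feature; "all" acts as a wildcard."""
    return [(attrs, ok) for attrs, ok in elements
            if all(f in ("all", a) for f, a in zip(feature, attrs))]


def cost(feature, elements):
    """Negative log-likelihood contribution of one non-empty feature."""
    hits = covered(feature, elements)
    false_n = sum(1 for _, ok in hits if not ok)
    true_n = len(hits) - false_n
    eps = min(max(false_n / len(hits), CLAMP), 1 - CLAMP)
    return (math.log(1 / ALPHA)
            + false_n * math.log(1 / eps)
            + true_n * math.log(1 / (1 - eps)))


def split(feature, elements):
    """For each wildcard dimension, list the children that contain errors."""
    hits = covered(feature, elements)
    for dim, value in enumerate(feature):
        if value != "all":
            continue
        faulty = sorted({attrs[dim] for attrs, ok in hits if not ok})
        yield [feature[:dim] + (v,) + feature[dim + 1:] for v in faulty]


def diagnose(feature, elements):
    """Keep a feature unless some split into children is strictly cheaper."""
    hits = covered(feature, elements)
    if all(ok for _, ok in hits):
        return []  # nothing to explain under this feature
    best_children, best_cost = None, cost(feature, elements)
    for children in split(feature, elements):
        children_cost = sum(cost(c, elements) for c in children)
        if children_cost < best_cost:
            best_children, best_cost = children, children_cost
    if best_children is None:
        return [feature]
    return [f for child in best_children for f in diagnose(child, elements)]


diagnosis = diagnose(("all", "all", "all"), ELEMENTS)
print(diagnosis)  # [('all', 'euromusicxx', 'all')]
```

On this toy input the sketch stops at (all, euromusicxx, all): the finer features cover exactly the same elements, so they are no cheaper and the tie is resolved in favor of the parent.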

EVALUATION (ReVerb ClueWeb Extraction dataset)

DataXRay vs. SetCover [Chvatal79]

Execution time: 0.43 sec (DataXRay) vs. 3 sec (SetCover)

[Figure: bar chart of recall, precision, and F-measure (axis 0.0 to 1.0) for DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection]

EVALUATION (ReVerb ClueWeb Extraction dataset)

DataXRay vs. RedBlue [Peleg07]

Execution time: 0.43 sec (DataXRay) vs. 4.2 sec (RedBlue)

[Figure: bar chart of recall, precision, and F-measure (axis 0.0 to 1.0) for DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection]

Finer-granularity features preferred

EVALUATION (ReVerb ClueWeb Extraction dataset)

DataXRay vs. FeatureSelection [Tibshirani96, Ng04]

Execution time: 0.43 sec (DataXRay) vs. 5.5 sec (FeatureSelection)

[Figure: bar chart of recall, precision, and F-measure (axis 0.0 to 1.0) for DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection]

•  Targets prediction
•  Redundant features
•  Low-error-rate features

EVALUATION SUMMARY

•  DataXRay is effective in several real-world scenarios: extraction errors, traffic incidents, …
•  DataXRay is better than alternative algorithms: classification, summarization, set cover methods
•  DataXRay is robust under different parameters and settings: different error rates, feature failure, …
•  DataXRay is parallelizable in MapReduce

Takeaways

•  Diagnosis is different from cleaning: reason about the root cause of data errors.
•  Defined a good diagnosis: a cost function based on Bayesian analysis (conciseness, specificity, consistency).
•  Designed a scalable algorithm: it leverages the feature hierarchy, and the top-down iterative algorithm is efficient and easy to parallelize.

References

[Abiteboul99] S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Siméon, and S. Zohar. Tools for data translation and integration. IEEE Data Engineering Bulletin, 22(1):3–8, 1999.
[Carr00] R. D. Carr, S. Doddi, G. Konjevod, and M. Marathe. On the red-blue set cover problem. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 345–353, 2000.
[Chvatal79] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.
[Dong14] X. Dong et al. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.
[Eckerson02] W. W. Eckerson. Data warehousing special report: Data quality and the bottom line. http://www.adtmag.com/article.asp?id=6321, 2002.
[Fader11] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
[Fan08] W. Fan, F. Geerts, and X. Jia. A revival of integrity constraints for data cleaning. Proc. VLDB Endow., 1(2):1522–1523, Aug. 2008.
[Golab08] L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. PVLDB, 1(1):376–390, Aug. 2008.
[Golab10] L. Golab, H. J. Karloff, F. Korn, and D. Srivastava. Data auditor: Exploring data quality and semantics using pattern tableaux. PVLDB, 3(2):1641–1644, 2010.
[Kalashnikov06] D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2):716–767, June 2006.
[Ng04] A. Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In ICML, 2004.
[Peleg07] D. Peleg. Approximation algorithms for the label-cover_max and red-blue set cover problems. Journal of Discrete Algorithms, (1):55–64, March 2007.
[Quinlan86] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, Mar. 1986.
[Rahm00] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3–13, 2000.
[Raman01] V. Raman and J. M. Hellerstein. Potter’s wheel: An interactive data cleaning system. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pages 381–390, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[Sakl12] M. Sakal and L. Raković. Errors in building and using electronic tables: Financial consequences and minimisation techniques. Strategic Management, 17(3):29–35, 2012.
[Samar08] V. Samar and S. Patni. Controlling the information flow in spreadsheets. CoRR, abs/0803.2527, 2008.
[Tibshirani96] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[ten Cate et al. 2015] B. ten Cate et al. High-level why-not explanations using ontologies. In Proceedings of the 34th ACM Symposium on Principles of Database Systems. ACM, 2015.

