UNIVERSITY OF MASSACHUSETTS, AMHERST • College of Information and Computer Sciences
Data X-Ray: A diagnostic tool for data errors
Xiaolan Wang, Xin Luna Dong, Alexandra Meliou
MANY APPLICATIONS RELY ON DATA
Knowledge graph (www.google.com)
Social network analytics
Shopping systems of retail companies
Data is not perfect! Erroneous data can be extremely costly!
KNOWLEDGE VAULT [Dong14]
[Figure: Web sources (TXT, DOM, TBL, ANO) feed the extractors of the extraction system; fusion of their output produces the prKB.]
3.0 billion extracted triples; more than 70% are wrong.
KNOWLEDGE VAULT [Dong14]
Traditional method: identify errors in the prKB and drop them to obtain a perfect KB.
KNOWLEDGE VAULT [Dong14]
Errors are systematic:
• Faulty information
• Bad extraction rules
KNOWLEDGE VAULT [Dong14]
Dropping the errors does not fix the extraction system: it continues to generate erroneous data.
KNOWLEDGE VAULT [Dong14]
Goal: diagnose the root cause of the errors.
REAL-WORLD SYSTEMATIC ERRORS
Default value error
Feature: (besoccer.com, date_of_birth, 1986_02_18)
# Triples: 630. Error rate: 100%.
Context: the date of birth of athletes extracted from besoccer.com is set to the default value 1986_02_18.
Reconciliation error
Feature: (Extractor S, obj: Baseball Coach)
# Triples: 674,000. Error rate: 89.3%.
Context: all coaches are reconciled to baseball coaches, e.g., [Bob Barton, profession, Baseball Coach].
Coreference error
Feature: (Extractor T, pred: namesakes, obj: the county)
# Triples: 4878. Error rate: 99.8%.
Example: [Salmon P. Chase, namesakes, the county]
Context: "The county was named for Salmon P. Chase, former senator and governor of Ohio."
HOW TO DERIVE A DIAGNOSIS?
Knowledge triple                      Correct?
<Domenico Modugno, DoB, 01/09/1958>   False
<Bert Kaempfert, DoB, 09/01/1961>     False
<The Singing Nun, DoB, 07/12/1963>    False
<Paul Mauriat, DoB, 10/02/1963>       False
<Shocking Blue, DoB, 02/07/1968>      True
<U2, DoB, 05/16/1987>                 True
The correctness labels are obtained by leveraging existing data cleaning methods [Abiteboul99, Fan08, Kalashnikov06, Rahm00, Raman01].
Q: Can we treat the erroneous triples as a diagnosis?
A: No, for two reasons:
• There are too many erroneous triples (more than 2B in KV).
• They are due to a variety of errors.
WHAT IS A DIAGNOSIS?
Knowledge triple                      Correct?  Subject       Predicate  Object         Web source       Extractor
<Domenico Modugno, DoB, 01/09/1958>   False     People/D.M.   Bio/DoB    Date/01091958  euromusicxx.com  Extractor 1
<Bert Kaempfert, DoB, 09/01/1961>     False     People/B.K.   Bio/DoB    Date/09011961  euromusicxx.com  Extractor 1
<The Singing Nun, DoB, 07/12/1963>    False     People/TSN    Bio/DoB    Date/07121963  euromusicxx.com  Extractor 1
<Paul Mauriat, DoB, 10/02/1963>       False     People/P.M.   Bio/DoB    Date/10021963  euromusicxx.com  Extractor 1
<Shocking Blue, DoB, 02/07/1968>      True      People/S.B.   Bio/DoB    Date/02071968  wiki.com         Extractor 1
<U2, DoB, 05/16/1987>                 True      People/U2     Bio/DoB    Date/05161987  wiki.com         Extractor 1
Grouping the erroneous data gives a diagnosis: dates from the website euromusicxx.com extracted by Extractor 1 are wrong. Root cause: a bad extraction rule that applies the U.S. date format to a European website.
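To make the date-format failure concrete, here is a minimal Python sketch. It assumes the date strings in the table are the raw day/month/year text on the site and that the faulty rule reads them as month/day/year; the names `US_RULE`, `EUROPEAN_RULE`, and `parse` are illustrative and do not come from the slides.

```python
from datetime import datetime

# Raw date strings for the four erroneous triples in the example table.
# Assumption: this is the text as it appears on the (European) website.
raw_dates = {
    "Domenico Modugno": "01/09/1958",
    "Bert Kaempfert": "09/01/1961",
    "The Singing Nun": "07/12/1963",
    "Paul Mauriat": "10/02/1963",
}

US_RULE = "%m/%d/%Y"        # hypothetical rule applied by the faulty extractor
EUROPEAN_RULE = "%d/%m/%Y"  # hypothetical format actually used by the source


def parse(raw, rule):
    """Parse a raw date string under the given format rule."""
    return datetime.strptime(raw, rule).date()


# Collect every entry where the two rules disagree on the resulting date.
mismatches = {
    name: (parse(raw, US_RULE), parse(raw, EUROPEAN_RULE))
    for name, raw in raw_dates.items()
    if parse(raw, US_RULE) != parse(raw, EUROPEAN_RULE)
}

for name, (extracted, intended) in mismatches.items():
    print(f"{name}: extracted {extracted}, intended {intended}")
```

Every one of the four strings parses without error under both rules, which is why a rule of this kind can fail silently and produce a whole group of wrong triples rather than isolated mistakes.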
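Input 1: elements and their correctness (the knowledge triples and the "Correct?" column).
Input 2: features, i.e., combinations of metadata (subject, predicate, object, web source, extractor).
Output (diagnosis): a set of features.
Which diagnosis is the best?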
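The inputs and output can be sketched in Python as follows. This is a simplified illustration, not the authors' data model: each element is reduced to three metadata dimensions (object type, web source, extractor), and a feature is any combination in which each dimension is either kept or generalized to "all". The names `elements`, `features_of`, and `error_rate` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Input 1: elements with their correctness, reduced to
# (object type, web source, extractor, is_correct). Values follow the table.
elements = (
    [("date", "euromusicxx.com", "Extractor 1", False)] * 4
    + [("date", "wiki.com", "Extractor 1", True)] * 2
)


def features_of(element):
    """Input 2: all features covering an element.

    Each of the three metadata dimensions is either kept or replaced by "all".
    """
    dims = element[:3]
    return {
        tuple(value if keep else "all" for value, keep in zip(dims, mask))
        for mask in product([True, False], repeat=3)
    }


# Count false and true elements under every feature.
counts = defaultdict(lambda: {"false": 0, "true": 0})
for element in elements:
    label = "true" if element[3] else "false"
    for feature in features_of(element):
        counts[feature][label] += 1

error_rate = {
    feature: c["false"] / (c["false"] + c["true"])
    for feature, c in counts.items()
}

# Candidate features, highest error rate first.
for feature, rate in sorted(error_rate.items(), key=lambda kv: -kv[1]):
    print(feature, round(rate, 2))
```

A diagnosis is then a subset of these features. Several features here have a 100% error rate, which is why a cost model is needed to choose among them.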
DATAXRAY: COST MODEL
Cost model:
• Conciseness: fewer features preferred
• Specificity: higher error rate preferred
• Consistency: fewer true elements preferred
Bayesian estimate of causal likelihood: the probability that the features F are the cause of the errors, given the observed data items E.
$$\Pr(F \mid E) = \prod_{f_i \in F} \alpha \, \epsilon_i^{|f_i.E_i^-|} \, (1 - \epsilon_i)^{|f_i.E_i^+|}$$
Here \epsilon_i is the error rate of feature f_i, f_i.E_i^- is the set of false elements in the feature, and f_i.E_i^+ is the set of true elements in the feature.
Theorem 1: Deriving a diagnosis with minimum cost is NP-complete.
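As a worked illustration of the likelihood above, the sketch below scores three candidate diagnoses for the six example triples by the negative log of the product. It makes several assumptions: the value of `ALPHA` is invented (the slides give none), each feature's error rate is estimated as its fraction of false elements, and cost is taken to be the negative log-likelihood. The function names are hypothetical.

```python
import math

ALPHA = 0.1  # hypothetical value; the slides do not specify alpha


def feature_log_likelihood(n_false, n_true, alpha=ALPHA):
    """log(alpha * eps^n_false * (1 - eps)^n_true), with eps = n_false / total."""
    eps = n_false / (n_false + n_true)
    total = math.log(alpha)
    if n_false:  # skip the term when the exponent is 0 (avoids log(0))
        total += n_false * math.log(eps)
    if n_true:
        total += n_true * math.log(1 - eps)
    return total


def diagnosis_cost(features):
    """Cost of a diagnosis given as (n_false, n_true) pairs; lower is better."""
    return -sum(feature_log_likelihood(nf, nt) for nf, nt in features)


# One feature covering exactly the four false elements (euromusicxx.com).
specific = diagnosis_cost([(4, 0)])
# One feature covering all six elements (four false, two true).
general = diagnosis_cost([(4, 2)])
# Four features, one per false element.
per_element = diagnosis_cost([(1, 0)] * 4)

print(round(specific, 3), round(general, 3), round(per_element, 3))
```

Under these assumptions the single euromusicxx.com feature is cheapest: the all-covering feature pays for its two true elements (consistency) and lower error rate (specificity), while the per-element diagnosis pays the alpha factor four times (conciseness).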
DATAXRAY: ALGORITHM
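Top-down iterative traversal of the feature hierarchy, with features written as (object, web source, extractor):
Level 0: (all, all, all)
Level 1: (date, all, all), (all, wiki, all), (all, euromusicxx, all), (all, all, extractor1)
Level 2: (date, wiki, all), (date, euromusicxx, all), (date, all, extractor1), (all, wiki, extractor1), (all, euromusicxx, extractor1)
Steps: split, compare, merge. For example, (all, all, all) is split into (all, wiki, all) and (all, euromusicxx, all); the parent and its children are then compared and the results merged.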
Theorem 2: The DataXRay traversal has complexity linear in the number of features, with an O(# of features) approximation bound.
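The split and compare steps can be illustrated with a small recursive sketch. This is a deliberately simplified stand-in, not the DataXRay algorithm itself: it does not reproduce the merge step or the linear-time guarantee, it uses a negative log-likelihood cost with an invented `ALPHA`, and all function names are hypothetical.

```python
import math

ALPHA = 0.1  # hypothetical value; not given in the slides

# (object type, web source, extractor, is_correct), following the example table.
ELEMENTS = (
    [("date", "euromusicxx.com", "Extractor 1", False)] * 4
    + [("date", "wiki.com", "Extractor 1", True)] * 2
)


def covers(feature, element):
    """A feature covers an element if every dimension is 'all' or matches."""
    return all(f == "all" or f == v for f, v in zip(feature, element))


def cost(n_false, n_true):
    """Negative log of alpha * eps^n_false * (1 - eps)^n_true."""
    eps = n_false / (n_false + n_true)
    value = -math.log(ALPHA)
    if n_false:
        value -= n_false * math.log(eps)
    if n_true:
        value -= n_true * math.log(1 - eps)
    return value


def diagnose(feature, elements):
    """Return (features, cost): keep the parent or the cheapest split of it."""
    covered = [e for e in elements if covers(feature, e)]
    n_false = sum(1 for e in covered if not e[3])
    if n_false == 0:
        return [], 0.0  # nothing to explain under this feature
    best = ([feature], cost(n_false, len(covered) - n_false))
    for dim in range(3):
        if feature[dim] != "all":
            continue
        values = sorted({e[dim] for e in covered})
        if len(values) < 2:
            continue  # splitting on this dimension would not refine anything
        parts, total = [], 0.0
        for value in values:  # split: one child per concrete value
            child = feature[:dim] + (value,) + feature[dim + 1:]
            sub_parts, sub_cost = diagnose(child, covered)
            parts += sub_parts
            total += sub_cost
        if total < best[1]:  # compare: children replace the parent only if cheaper
            best = (parts, total)
    return best


features, total_cost = diagnose(("all", "all", "all"), ELEMENTS)
print(features, round(total_cost, 3))
```

On the example data the sketch replaces the root with the single feature for euromusicxx.com. It stops at that level because the object type and extractor take only one value each, so the more specific feature (date, euromusicxx.com, Extractor 1) covers exactly the same elements.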
EVALUATION (ReVerb ClueWeb Extraction dataset)
DataXRay vs. SetCover [Chvatal79]
Execution time: 0.43 sec vs. 3 sec
[Figure: recall, precision, and F-measure (scale 0.0 to 1.0) for DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection]
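For reference, the three reported metrics can be computed as in the sketch below. It assumes a diagnosis is scored by comparing the returned features against a gold-standard set of features; the slides do not spell out this protocol, and the function name and the sample feature labels are hypothetical.

```python
def precision_recall_f(returned, gold):
    """Precision, recall, and F-measure of returned features against a gold set."""
    returned, gold = set(returned), set(gold)
    hits = len(returned & gold)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(gold) if gold else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure


# Hypothetical example: two returned features, one of which is in the gold set.
precision, recall, f_measure = precision_recall_f(
    returned={"feature A", "feature B"},
    gold={"feature A", "feature C", "feature D"},
)
print(precision, round(recall, 3), round(f_measure, 3))
```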
EVALUATION (ReVerb ClueWeb Extraction dataset)
DataXRay vs. RedBlue [Peleg07]
Execution time: 0.43 sec vs. 4.2 sec
[Figure: recall, precision, and F-measure (scale 0.0 to 1.0) for DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection]
RedBlue: finer-granularity features preferred.
EVALUATION (ReVerb ClueWeb Extraction dataset)
DataXRay vs. FeatureSelection [Tibshirani96, Ng04]
Execution time: 0.43 sec vs. 5.5 sec
[Figure: recall, precision, and F-measure (scale 0.0 to 1.0) for DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection]
FeatureSelection: targets prediction; returns redundant features and low-error-rate features.
EVALUATION SUMMARY
• DataXRay is effective in several real-world scenarios: extraction errors, traffic incidents, …
• DataXRay outperforms alternative algorithms: classification, summarization, and set cover methods.
• DataXRay is robust under different parameters and settings: different error rates, feature failure, …
• DataXRay is parallelizable in MapReduce.
Takeaways
• Diagnosis is different from cleaning: it reasons about the root cause of data errors.
• We defined a good diagnosis: a cost function based on Bayesian analysis that captures conciseness, specificity, and consistency.
• We designed a scalable algorithm that leverages the feature hierarchy: the top-down iterative algorithm is efficient and easy to parallelize.
References
[Abiteboul99] S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Siméon, and S. Zohar. Tools for data translation and integration. IEEE Data Engineering Bulletin, 22(1):3–8, 1999.
[Carr00] R. D. Carr, S. Doddi, G. Konjevod, and M. Marathe. On the red-blue set cover problem. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 345–353, 2000.
[Chvatal79] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.
[Dong14] X. Dong et al. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.
[Eckerson02] W. W. Eckerson. Data warehousing special report: Data quality and the bottom line. http://www.adtmag.com/article.asp?id=6321, 2002.
[Fader11] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
[Fan08] W. Fan, F. Geerts, and X. Jia. A revival of integrity constraints for data cleaning. Proc. VLDB Endow., 1(2):1522–1523, Aug. 2008.
[Golab08] L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. PVLDB, 1(1):376–390, Aug. 2008.
[Golab10] L. Golab, H. J. Karloff, F. Korn, and D. Srivastava. Data auditor: Exploring data quality and semantics using pattern tableaux. PVLDB, 3(2):1641–1644, 2010.
[Kalashnikov06] D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2):716–767, June 2006.
[Ng04] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, 2004.
[Peleg07] D. Peleg. Approximation algorithms for the label-cover_max and red-blue set cover problems. Journal of Discrete Algorithms, (1):55–64, March 2007.
[Quinlan86] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, Mar. 1986.
[Rahm00] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3–13, 2000.
[Raman01] V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 381–390, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[Sakl12] M. Sakal and L. Raković. Errors in building and using electronic tables: Financial consequences and minimisation techniques. Strategic Management, 17(3):29–35, 2012.
[Samar08] V. Samar and S. Patni. Controlling the information flow in spreadsheets. CoRR, abs/0803.2527, 2008.
[Tibshirani96] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[ten Cate et al. 2015] B. ten Cate et al. High-level why-not explanations using ontologies. In Proceedings of the 34th ACM Symposium on Principles of Database Systems. ACM, 2015.