Data Quality Assurance in Telecommunications Databases
C.-M. Chen, M. Cochinwala
{chungmin, munir}@research.telcordia.com
Applied Research
Telcordia Technologies
Morristown, NJ 07960
Outline
– Operation Support Systems
– Data Quality Issues
– Record Matching Problem & Techniques
– Example Applications
– References
Operation Support Systems
Telecom carriers and service providers use multiple Operation Support Systems (OSSs) for:
– network configuration
– network engineering
– service provisioning
– network performance monitoring
– customer care & billing, etc.
Each OSS may maintain its own database using a DBMS
OSS Databases
The databases may overlap on the network entities they describe
The OSSs may use different models/schemas to describe the same entity
Data Quality Issues
Corrupted data: data do not correctly reflect the properties/specifications of the modeled entities
Inaccurate data: discrepancy between the data and the real-world entities they model (e.g. outdated information)
Inconsistent data:
– records in different OSS databases that reference the same real-world entity do not agree on the entity's attribute values
– an entity is represented by more than one record in a single OSS database
Example: DSL Deployment
Corrupted data: many Central Offices (COs) do not have correct location information (longitude, latitude)
Inaccurate data: the database says a port or bandwidth is available at the CO when in fact it is not (or vice versa)
Inconsistent data: the Service Provisioning database says a line is available while the Network Configuration database says otherwise (which one to believe?)
Why Is Data Quality an Issue?
Corrupted data disrupt/delay business deployment
Inaccurate data impede decision making
Inconsistent data degrade operational efficiency
Bottom line:
– increased cost
– decreased revenue
– customer dissatisfaction
Cost of Data Quality Assurance
To date, the practice of data quality assurance is mostly manual and labor-intensive
“ … it costs a typical telephone company in the mid-1990s $2 per line per year to keep the network database reasonably accurate.”
Now imagine the scale and complexity of the networks brought about by IP and wireless technologies, and their effect on the cost of data quality
(Semi-) automatic tools are needed
Data Quality Assurance Techniques
Data correctness and accuracy:
– database validation and correction via autodiscovery tools
Data consistency (reconciliation):
– record matching (record linkage): identify records that correspond to the same real-world entity
Data Accuracy
Assess and correct the discrepancy between the data in the database and the modeled entities in the real world
Autodiscovery tools:
– automatically probe network elements for configuration parameters
– reconstruct the network connection topology at different layers (from the physical to the application layer)
How to efficiently validate the data stored in the databases against the auto-discovered information?
– sampling
– graph algorithms
Record Matching Techniques [Cochinwala01b]
– Problem Definition
– Record Matching Phases
– Quality Metrics
– Searching Techniques (to reduce the search space)
– Matching Decision Rules (to determine matched pairs)
The Record Matching Problem
Goal: To identify records in the same or different databases that correspond to the same real-world entity.
– two records that correspond to the same entity are called a matched pair
– in a relational table no two records can hold identical values on all attributes, so a matched pair consists of two different records that refer to the same entity
– the same entity may be represented differently in different tables/databases, with no agreement on key value(s)
Three Phases of Reconciliation
– Data Preparation: scrubbing and cleansing
– Searching: reducing the search space
– Matching: finding matched pairs
Data Preparation
– Parsing
– Data transformation
– Standardization
Searching Problem
Consider reconciling tables A and B
– search space: P = A × B (Cartesian product)
– M: matched pairs; U: unmatched pairs; P = M ∪ U
Problem: searching P requires |A| × |B| comparisons
Goal: reduce the search space to a smaller P′ ⊂ P such that M ⊆ P′
Searching Techniques
Heuristic: potentially matched records should fall into the same cluster
Relational join operators: nested-loop join, merge-sort join, hash join, band join [Hernadez96]
Blocking: Soundex Code, NYSIIS [Newcombe88]
Windowing (sorted neighborhood) [Hernadez98]
Priority queue [MongeElkan97]
Multi-pass merge-sort joins [Hernadez98]
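As an illustration of the windowing idea, here is a minimal sorted-neighborhood sketch in Python; the blocking key (first three characters) and the window size are illustrative assumptions, not the parameters of [Hernadez98]:

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Sort records by a blocking key, then compare each record only
    with the next (window - 1) records in sorted order."""
    ordered = sorted(records, key=key)
    pairs = []
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((ordered[i], ordered[j]))
    return pairs

# Toy example: blocking on the first three letters of the name.
customers = ["Chen, C.", "Chen, Chung-Min", "Cochinwala, M.", "Smith, T."]
candidates = sorted_neighborhood_pairs(customers, key=lambda s: s[:3], window=2)
# Only adjacent records in sorted order are compared: 3 pairs instead of 6.
```

The candidate set shrinks from |A| × |B| to roughly window × |A ∪ B| comparisons, at the risk of missing matched pairs that the sort key places far apart (hence the multi-pass variants).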
String Matching Techniques
Edit distance: counts of insert, delete, and replace operations [Manber89]
Smith-Waterman: dynamic programming to find the minimum edit distance [SmithWaterman81]
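A minimal dynamic-programming sketch of the edit distance (counts of insert, delete, and replace operations):

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions,
    and replacements needed to turn string s into string t."""
    prev = list(range(len(t) + 1))       # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]                       # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,         # delete cs
                            curr[j - 1] + 1,     # insert ct
                            prev[j - 1] + cost)) # replace (or keep)
        prev = curr
    return prev[-1]

print(edit_distance("Telcordia", "Telecordia"))  # one insertion -> 1
```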
Record Matching Techniques
Given a (reduced) space of pairs, how do we determine whether a pair (x, y) is a match, i.e., that records x and y refer to the same real-world entity?
Probabilistic approaches [Newcombe59]:
– Bayes test for minimum error
– Bayes test for minimum cost
Non-probabilistic approaches:
– supervised learning (to generate decision rules) [Cochinwala01a]
– equation theory model [Hernadez98]
– distance-based techniques [DeySarkarDe98]
Bayes Test for Minimum Error
Cartesian product P = A × B = M ∪ U
For each record pair (a1, a2, …, an, b1, b2, …, bn) ∈ P, define the comparison vector
x = (x1, x2, …, xn), where
xi = 1 if ai = bi, and xi = 0 if ai ≠ bi
A-priori probabilities and conditional density functions:
Class M: πM, p(x|M)
Class U: πU, p(x|U)
Unconditional density function:
p(x) = πM p(x|M) + πU p(x|U)
Bayes Test for Minimum Error (cont.)
Assume πM, πU, p(x|M), and p(x|U) are known
Bayes theorem gives the posterior probability:
p(M|x) = πM p(x|M) / (πM p(x|M) + πU p(x|U))
Likelihood ratio decision rule: x is decided to belong to M iff
p(M|x) ≥ p(U|x), iff πM p(x|M) ≥ πU p(x|U), iff
l(x) = p(x|M) / p(x|U) ≥ πU / πM
The test gives the minimum probability of error (misclassification)
Bayes Test for Minimum Cost
ci,j: cost of a class-j record being (mis)classified to class i
Conditional cost of classifying x to class M:
cM(x) = cMM p(M|x) + cMU p(U|x)
Similarly,
cU(x) = cUM p(M|x) + cUU p(U|x)
Likelihood ratio decision rule: x is decided to belong to M iff
cM(x) ≤ cU(x), iff
l(x) = p(x|M) / p(x|U) ≥ (cMU − cUU) πU / ((cUM − cMM) πM)
The test gives the minimum cost
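Both tests reduce to comparing the likelihood ratio l(x) against a fixed threshold. A minimal sketch, with made-up densities and priors for illustration; the default 0/1 costs give the minimum-error test:

```python
def likelihood_ratio_decision(p_x_given_M, p_x_given_U, pi_M, pi_U,
                              c_MM=0.0, c_MU=1.0, c_UM=1.0, c_UU=0.0):
    """Classify a comparison vector x as a match (M) iff
    l(x) = p(x|M)/p(x|U) >= (c_MU - c_UU)*pi_U / ((c_UM - c_MM)*pi_M).
    With the default 0/1 costs this is the minimum-error Bayes test."""
    l_x = p_x_given_M / p_x_given_U
    threshold = (c_MU - c_UU) * pi_U / ((c_UM - c_MM) * pi_M)
    return "M" if l_x >= threshold else "U"

# Matches are rare (pi_M = 0.01), so even a likelihood ratio of 50
# is below the minimum-error threshold of 99:
print(likelihood_ratio_decision(0.5, 0.01, pi_M=0.01, pi_U=0.99))  # -> "U"

# Making a missed match 10x costlier (c_UM = 10) lowers the threshold
# to 9.9, and the same pair is now declared a match:
print(likelihood_ratio_decision(0.5, 0.01, pi_M=0.01, pi_U=0.99,
                                c_UM=10.0))  # -> "M"
```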
Supervised Learning
– Take a small sample S from A × B
– For every record pair s ∈ S, label it as M (matched) or U (unmatched)
– Select a predictive model, along with associated decision rules, that best discriminates between classes M and U for the pairs in S
– Apply the selected model to classify all pairs in A × B
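As a toy illustration of this learning loop (not the CART model used later), the sketch below learns a single edit-distance threshold that minimizes misclassifications on a labeled sample:

```python
def learn_threshold(sample):
    """Learn a one-parameter decision rule 'match iff edit distance <= t'
    from a labeled sample of (edit_distance, label) pairs, choosing the
    threshold t that minimizes misclassifications on the sample."""
    best_t, best_errors = 0, len(sample)
    for t in sorted({d for d, _ in sample}):
        errors = sum(1 for d, label in sample
                     if (d <= t) != (label == "M"))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

# Labeled sample: small distances are matches, large ones are not.
sample = [(0, "M"), (1, "M"), (2, "M"), (4, "U"), (7, "U"), (9, "U")]
print(learn_threshold(sample))  # -> 2
```

A real model-selection step would compare several model families with cross-validation, as the Arroyo case study below does.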
Arroyo: a data reconciliation tool [Cochinwala01a]
Matches customer records across two databases:
– wireline database: 860,000 records
– wireless database: 1,300,000 records
Methodology:
– pre-processing (application dependent)
– parameter space definition (application dependent)
– matching rule generation & pruning (learning, model selection)
– matching
Preprocessing (Application Dependent)
elimination of stop-words– blanks, special characters, ...– “Chung-Min Chen” becomes “ChungMin Chen”
word re-ordering– “20 PO BOX” becomes “PO BOX 20”
word substitution– “St.” becomes “Street”
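A minimal Python sketch of these three preprocessing steps; the substitution table and regular expressions are illustrative, not the production rules:

```python
import re

# Illustrative substitution table; a real deployment would use a much
# larger, application-specific dictionary.
SUBSTITUTIONS = {"St.": "Street", "Ave.": "Avenue"}

def preprocess(text):
    """Apply the three preprocessing steps from the slides:
    word substitution, word re-ordering, and stop-word elimination."""
    # word substitution: "St." -> "Street"
    for short, full in SUBSTITUTIONS.items():
        text = text.replace(short, full)
    # word re-ordering: "20 PO BOX" -> "PO BOX 20"
    text = re.sub(r"^(\d+)\s+PO BOX", r"PO BOX \1", text)
    # elimination of stop characters (hyphens, periods, commas)
    text = re.sub(r"[-.,]", "", text)
    return text

print(preprocess("Chung-Min Chen"))  # -> "ChungMin Chen"
print(preprocess("20 PO BOX"))       # -> "PO BOX 20"
print(preprocess("Main St."))        # -> "Main Street"
```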
Parameter Space Definition (Application Dependent)
Six parameters used:
– edit distance between the Name fields
– edit distance between the Address fields
– length of Name field in Wireline
– length of Name field in Wireless
– length of Address field in Wireline
– length of Address field in Wireless
Matching Rule Generation
1. Learning Set Generation
– select records that contain the word “Falcon” in the Name field:
  – Wireless: 241 records
  – Wireline: 883 records
– choose the sample first from the database that is the image of the higher-degree “onto” mapping; for example, Wireline → Wireless has a higher “onto” degree than Wireless → Wireline
Matching Rule Generation
2. Learning Set Labeling
– identify prospective matches:
  – match if the edit distances on the Name and Address fields are both less than 3
  – (30% false matches, but no missed true matches)
– verdict assignment:
  – each pair that is not a prospective match is assigned “unmatch”
  – each prospective-match pair is manually examined and labeled as “match”, “unmatch”, or “ambiguous” (labor intensive)
Matching Rule Generation
3. Model Selection
Input: 241 × 883 = 212,803 pairs with 7 fields (parameters):
– label (verdict)
– edit distance between Name fields
– edit distance between Address fields
– length of Name field in Wireless
– length of Name field in Wireline
– length of Address field in Wireless
– length of Address field in Wireline
Three models were tried, with cross-validation (half the sample to build the model, half to estimate the error rate):

Model                          Error rate (avg. of 50 runs)
CART                           5.1%
Linear Discriminant Analysis   5.3%
Vector Quantization            9.4%
Matching Rule Generation
3. Model Selection (cont.)
LDA can produce rules that are inefficient to evaluate against a database, e.g.:
2 × (Address edit distance) + 1.3 × (Name length on Wireless) < 3 ⇒ match
– an index on Name is useless for such a rule
CART rules can be evaluated efficiently using existing database indices
Matching Rule Pruning
Determine which parameters (rules) to drop to reduce tree complexity while retaining tree quality:
– maximize the delta of a cost function
– dynamic programming with a threshold
Complexity of tree T:
C(T) = Σp∈T C(p) / Σq∈S C(q),
where S is the set of all parameters and C(p) is the complexity of computing the value of parameter p, e.g.
C(edit distance between fields A and B) = avg_length(A) × avg_length(B)
Note: 0 < C(T) ≤ 1
Matching Rule Pruning (cont.)
C(edit distance between fields A and B) = avg_length(A) × avg_length(B)

Parameter                     Avg. length   Complexity
Address length (wireless)     6.88          6.88
Address length (wireline)     6.72          6.72
Name length (wireless)        10.87         10.87
Name length (wireline)        10.70         10.70
Edit distance on Addresses                  6.88 × 6.72 = 46.23
Edit distance on Names                      10.87 × 10.70 = 116.31
Matching Rule Pruning (cont.)
Misclassification rate of T:
M(T) = (# misclassified pairs) / (# pairs in the test set)
Note: 0 ≤ M(T) ≤ 1
Cost of tree T:
J(T) = w1 × C(T) + w2 × M(T),
where w1 and w2 are weights (we used w1 = w2 = 1)
Let T* = T − {all rules involving parameter p}, and define
ΔJ(T) = J(T) − J(T*) = (C(T) − C(T*)) + (M(T) − M(T*)) = ΔC(T) + ΔM(T)
Goal: find a set of parameters to drop that maximizes ΔJ(T)
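With w1 = w2 = 1, the pruning gain for dropping a parameter is simply ΔJ = ΔC + ΔM; as a quick check against the single-parameter pruning table:

```python
def pruning_gain(delta_C, delta_M, w1=1.0, w2=1.0):
    """Delta-J for dropping a parameter: weighted sum of the complexity
    saved (delta_C > 0) and the misclassification added (delta_M < 0)."""
    return w1 * delta_C + w2 * delta_M

# Values for "edit distance on Name" from the single-parameter table:
print(round(pruning_gain(0.5882, -0.0140), 4))  # -> 0.5742
```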
Matching Rule Pruning (cont.)
Dynamic programming:
– find a parameter p1 that maximizes ΔJ(T)
– fix p1 and find a second parameter p2 such that {p1, p2} maximizes ΔJ(T)
– repeat until we reach a set of parameters {p1, p2, …, pk} such that ΔJ(T) falls below a threshold value

Single-parameter pruning:

Parameter                     ΔC(T)    ΔM(T)     ΔJ(T)
Edit dist. on Name            0.5882   −0.0140   0.5742
Edit dist. on Address         0.2338   −0.1840   0.0498
Len. of Name (Wireless)       0.0566   −0.0610   −0.0044
Len. of Name (Wireline)       0.0541   −0.0160   0.0381
Len. of Address (Wireless)    0.0347   −0.0081   0.0266
Len. of Address (Wireline)    0.0339   −0.0070   0.0269
Matching Rule Pruning (cont.)
Dual-parameter pruning:

2nd Parameter                 ΔC(T)    ΔM(T)     ΔJ(T)
Len. of Name (Wireline)       0.0541   −0.0088   0.0453
Len. of Address (Wireless)    0.0347   −0.0072   0.0275
Len. of Address (Wireline)    0.0339   −0.0091   0.0248
...
Matching Rule Pruning (cont.)
Reduced from 6 parameters (original tree) to 2 parameters (final tree)
– an index on the Name field can be used to evaluate the tree
[Figure: final decision tree – the root splits on Address edit distance at 1.5, with a further split on Wireless Name length at 11.5; leaves are labeled matched, unmatched, and ambiguous]
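The pruned tree can be evaluated with two comparisons per pair. In the sketch below the split thresholds (1.5 and 11.5) come from the slide, but the assignment of leaves to matched/unmatched/ambiguous is an assumption, since the tree figure does not fully survive in the text:

```python
def classify(address_edit_distance, wireless_name_length):
    """Evaluate the pruned two-parameter tree. Thresholds (1.5, 11.5)
    are from the slide; the leaf labels are an assumed assignment,
    not confirmed by the source figure."""
    if address_edit_distance < 1.5:
        return "matched"
    if wireless_name_length < 11.5:
        return "unmatched"
    return "ambiguous"

print(classify(0, 8))   # -> "matched"
print(classify(3, 8))   # -> "unmatched"
print(classify(3, 20))  # -> "ambiguous"
```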
CART Matching Result
Entries below are Prob(predicted as x | given y):

x \ y        Matched   Unmatched   Ambiguous
Matched      0.965     0.014       0.03
Unmatched    0.020     0.956       0.18
Ambiguous    0.015     0.030       0.79

– approximately 10% error rate
– reduced the computation cost of matching by 50%
– about 47% more correctly matched records than a leading commercial product
Major Telephone Operating Company
Problem:
– mergers and acquisitions of wireless companies left the RBOC unable to determine common customers among its wireline and wireless businesses
– the customer databases of each business unit use different schemas and contain many quality problems
– the RBOC’s experience with a commercial vendor’s data reconciliation tool was unsatisfactory
Solution:
– use small, manually verified data samples (~100 records) to determine appropriate matching rules
– use machine learning to prune the rules for efficient analysis of the large dataset
– resulted in 30% more correct matches than the commercial tool
Large Media Conglomerate
Problem:
– the company provides magazines to wholesalers, who in turn provide magazines to retailers for distribution
– reconciling the wholesaler and retailer databases would make it easier to track where gaps in reporting occur
– identify “bad” retailers
Solution:
– group by primary keys
– match by secondary keys
– e.g., 3000 C.V.S. pharmacies are grouped and compared by zip code and street address to identify “bad” pharmacies
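The group-by-primary-key / match-by-secondary-key approach can be sketched as follows; the field names and store records are illustrative:

```python
from collections import defaultdict

def group_and_match(records):
    """Group records by a primary key (chain name), then compare
    records within each group on secondary keys (zip, street)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["chain"]].append(rec)
    matches = []
    for recs in groups.values():
        for i in range(len(recs)):
            for j in range(i + 1, len(recs)):
                a, b = recs[i], recs[j]
                if a["zip"] == b["zip"] and a["street"] == b["street"]:
                    matches.append((a["id"], b["id"]))
    return matches

stores = [
    {"id": 1, "chain": "CVS", "zip": "07960", "street": "445 South St"},
    {"id": 2, "chain": "CVS", "zip": "07960", "street": "445 South St"},
    {"id": 3, "chain": "CVS", "zip": "07748", "street": "1 Main St"},
    {"id": 4, "chain": "Rite Aid", "zip": "07960", "street": "445 South St"},
]
print(group_and_match(stores))  # -> [(1, 2)]
```

Grouping first keeps the quadratic secondary-key comparison confined to each chain's records rather than the whole database.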
International Government
Problem:
– reconcile vital taxpayer data from several different sources
– known problems include record duplication, address mismatches, address obsolescence, and distributed responsibility for database accuracy and updates
– identify the causes of mistakes
Solution:
– improve the process flows and architecture to allow rapid modification of pre-processing and matching rules
– detect and classify likely causes of duplication
– the analysis and improvements reduced the number of records that required manual verification
ILEC-CLEC Billing Reconciliation
Problem:
– ILECs charge CLECs for use of network resources
– verification of actual usage vs. charges (e.g., a customer changing providers)
– identify actual usage and send verification to the ILEC
– resource identification differs between the ILEC and the CLEC
Solution:
– check the charges on the bill against actual usage
– common identification of resources (matching table)
– the solution has so far been implemented only for access-line charges
Future Directions
Reduction of manual work:
– type-based standardization, rules
– rule and parameter pruning?
Database and tool support:
– rule execution plans (e.g., compute string length before edit distance)
– sharing/maintenance of indices across steps in the process
– extending matching to structures (trees, graphs – circuits)
– matching across layers in networks (OSI layers)
– how often to discover/audit the network? – sampling techniques
References
[Cochinwala01a] M. Cochinwala, V. Kurien, G. Lalk and D. Shasha, “Efficient data reconciliation”, Information Sciences, 137 (1-4), 2001.
[Cochinwala01b] M. Cochinwala, S. Dalal, A. Elmagarmid and V. Verykios, “Record matching: past, present and future”, submitted for publication.
[DeySarkarDe98] D. Dey, S. Sarkar and P. De, “Entity matching in heterogeneous databases: a distance-based decision model”, 31st Hawaii Int. Conf. on System Sciences, 1998.
[Hernadez96] M. Hernadez, “A generalization of band joins and the merge/purge problem”, Ph.D. thesis, CS Dept., Columbia University, 1996.
[Hernadez98] M. Hernadez and S. Stolfo, “Real-world data is dirty: data cleansing and the merge/purge problem”, Data Mining and Knowledge Discovery, 2 (1), 1998.
[Manber89] U. Manber, “Introduction to Algorithms”, Addison-Wesley, 1989.
[MongeElkan97] A. Monge and C. Elkan, “An efficient domain-independent algorithm for detecting approximately duplicate database records”, SIGMOD DMKD Workshop, 1997.
[Newcombe59] H. Newcombe, J. Kennedy, S. Axford and A. James, “Automatic linkage of vital records”, Science 130 (3381), 1959.
[Newcombe88] H. Newcombe, “Handbook of Record Linkage”, Oxford University Press, 1988.
[SmithWaterman81] T. Smith and M. Waterman, “Identification of common molecular subsequences”, J. Mol. Biol. 147, 1981.