Date posted: 19-Dec-2015
Topes: Reusable Abstractions for Validating Data
Christopher Scaffidi
Brad Myers, Mary Shaw
Carnegie Mellon University
Even when lives are at stake, people still make typos.
Hurricane Katrina “Person Locator” Web site
Problem Topes Validation Conclusion
Data errors reduce the usefulness of data.
Wrong data category
Questionable input
Incorrect formatting
The website creators omitted input validation.
• Primary reason: rejecting obviously-wrong inputs would prevent collecting questionable data
– Eg: Would you accept a city with 1 letter?
This is the UI code for the web form where users entered data for this website. A RAD tool called CodeCharge Studio was used to create the UI.
This site was not alone in lacking input validation.
• Eg: Google Base web application
– 13 primary web forms
– Even numeric fields accept unreasonable inputs (such as a salary of “-45”)
• Eg: Spreadsheets
– 40% of cells are non-numeric, non-date textual data
– Commonly used to gather and organize textual data for reports
Validation of these short human-readable strings must support…
• Testing membership in a data category
– Categories based on standards (eg: email address)
– Categories lacking standards (eg: city name)
• Ambiguously defined categories
– Identify questionable values for double-checking
• Multiple formats
– Format consistency, post-validation
• Platform-independent implementation
– Reuse in webapps, spreadsheets, others
Limitations of existing approaches
• Types do not support questionable values
• Grammars do not, either, nor can they reformat
• Information extraction algorithms rely on grammatical cues that are absent during validation
• Cues, Forms/3, -calculus, Slate, pollution markers, etc, infer numerical constraints but not constraints on strings, nor are they platform-independent
New Approach: Topes
• A tope = a platform-independent abstraction describing how to recognize and transform strings in one category of data
• Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain
• Validating with topes improves
– Accuracy of validation
– Reusability of validation code
– Subsequent duplicate identification
A tope is a graph. Node = format, edge = transformation
Notional representation for a CMU room number tope…
Formal building name & room number: Elliot Dunlap Smith Hall 225
Colloquial building name & room number: Smith 225
Building abbreviation & room number: EDSH 225
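As a rough illustration of the graph view, the sketch below stores the three room-number formats as nodes and the transformations as edges, then chains edges to get from one format to another. The names `BUILDINGS`, `make_trf`, `EDGES`, and `transform` are hypothetical, and the lookup table holds only the one building shown on the slide; this is not the authors' implementation.

```python
from collections import deque

# Nodes of the graph: the three formats from the slide.
FORMAL, COLLOQUIAL, ABBREV = "formal", "colloquial", "abbreviation"

# Hypothetical lookup table; only the building named on the slide is included.
BUILDINGS = [("Elliot Dunlap Smith Hall", "Smith", "EDSH")]
COLUMN = {FORMAL: 0, COLLOQUIAL: 1, ABBREV: 2}


def make_trf(src, dst):
    """Build a transformation (edge) that rewrites a value from src to dst format."""
    table = {row[COLUMN[src]]: row[COLUMN[dst]] for row in BUILDINGS}

    def trf(value):
        # Assumes the room number is the last space-separated token.
        building, room = value.rsplit(" ", 1)
        return f"{table[building]} {room}"

    return trf


# Edges of the graph: not every pair of formats needs a direct edge.
EDGES = {
    (FORMAL, COLLOQUIAL): make_trf(FORMAL, COLLOQUIAL),
    (COLLOQUIAL, ABBREV): make_trf(COLLOQUIAL, ABBREV),
    (ABBREV, FORMAL): make_trf(ABBREV, FORMAL),
}


def transform(value, src, dst):
    """Breadth-first search for a chain of edges from src to dst, applying each edge."""
    queue = deque([(src, value)])
    seen = {src}
    while queue:
        fmt, current = queue.popleft()
        if fmt == dst:
            return current
        for (a, b), trf in EDGES.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, trf(current)))
    raise ValueError(f"no path from {src} to {dst}")


print(transform("Elliot Dunlap Smith Hall 225", FORMAL, ABBREV))  # EDSH 225
```

Here the formal-to-abbreviation request is served by chaining two edges, which is one reason a graph is a convenient representation when direct transformations exist for only some format pairs.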
A tope is a conceptual abstraction. A tope implementation is code.
• Each tope implementation has executable functions:
– 1 isa: string → [0,1] function per format, for recognizing instances of the format (a fuzzy set)
– 0 or more trf: string → string functions linking formats, for transforming values from one format to another
• Validation function: φ(str) = max_f isa_f(str), where f ranges over the tope’s formats
– Valid when φ(str) = 1
– Invalid when φ(str) = 0
– Questionable when 0 < φ(str) < 1
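The validation function described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `phi`, the whitelists, the regular expressions, and the score of 0.5 for a well-formed but unknown building are all assumptions made for the example.

```python
import re

# Hypothetical whitelists; only the building from the room-number example is listed.
KNOWN_ABBREVIATIONS = {"EDSH"}
KNOWN_COLLOQUIAL = {"Smith"}


def isa_abbreviation(s: str) -> float:
    """isa for the 'building abbreviation & room number' format."""
    m = re.fullmatch(r"([A-Z]{2,5}) (\d{1,4})", s)
    if not m:
        return 0.0
    # Right shape but unknown building: questionable rather than invalid.
    return 1.0 if m.group(1) in KNOWN_ABBREVIATIONS else 0.5


def isa_colloquial(s: str) -> float:
    """isa for the 'colloquial building name & room number' format."""
    m = re.fullmatch(r"([A-Z][a-z]+) (\d{1,4})", s)
    if not m:
        return 0.0
    return 1.0 if m.group(1) in KNOWN_COLLOQUIAL else 0.5


ISA_FUNCTIONS = [isa_abbreviation, isa_colloquial]


def phi(s: str) -> float:
    """Validation function: the maximum isa score over the tope's formats."""
    return max(isa(s) for isa in ISA_FUNCTIONS)


def classify(s: str) -> str:
    score = phi(s)
    if score == 1:
        return "valid"
    if score == 0:
        return "invalid"
    return "questionable"


for text in ["EDSH 225", "WEH 5000", "???"]:
    print(text, "->", classify(text))
```

Taking the maximum means a string only needs to match one format well to be accepted, while a string that matches no format at all is rejected outright.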
Common kinds of topes: enumerations and proper nouns
• Multi-format enumerations, e.g.: US states
– “New York”, “CA”, maybe “Guam”
• Open-set proper nouns, e.g.: company names
– Whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corp”, “GOOG”)
– Augmented with a pattern for promising inputs that are not yet on the whitelist
Two more common kinds of topes: numeric and hierarchical
• Numeric, e.g.: human masses
– Numeric and in a certain range
– Values slightly outside range might be questionable
– (Very rarely) labeled with an explicit unit
– Transformation usually by multiplication
• Hierarchical, e.g.: address lines
– Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
– Simple isas can be implemented with regexps.
– Transformations involve permutation of parts, changes to separators, arithmetic, and lookup tables.
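A numeric tope of the kind listed above might look like the following sketch for human masses. The range limits and the questionable score of 0.3 are invented for illustration; only the kilogram-to-pound factor is a known constant. The function names are hypothetical.

```python
KG_PER_LB = 0.45359237  # exact, by definition of the international pound


def isa_mass_kg(s: str) -> float:
    """isa for a human mass in kilograms, written without a unit label."""
    try:
        value = float(s)
    except ValueError:
        return 0.0
    if 2 <= value <= 200:      # assumed 'normal' range, for illustration only
        return 1.0
    if 0 < value <= 650:       # assumed 'slightly outside' band: questionable
        return 0.3
    return 0.0


def trf_kg_to_lb(s: str) -> str:
    """Transformation between formats: a single multiplication (here, a division)."""
    return f"{float(s) / KG_PER_LB:.1f}"


print(isa_mass_kg("70"), isa_mass_kg("250"), isa_mass_kg("-45"))
print(trf_kg_to_lb("70"))  # 154.3
```

A hierarchical tope would combine several such parts, each checked by its own isa function, and its transformations would reorder or re-separate those parts.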
Formal tool demonstration on Friday
Features:
• Format inference
• Format/part names
• Soft constraints
• Testing features
• Format reusability
Formal tool demonstration on Friday
Microsoft Excel: buttons and menus
Visual Studio: drag-and-drop code generation
Evaluating accuracy, reusability, and usefulness for data cleaning
• Implemented topes for spreadsheet data
– 32 topes based on 720 online spreadsheets
– Tested accuracy
• Reused topes on web application data
– 8 data categories in Google Base and 5 data categories in Hurricane Katrina site
– Tested accuracy
• Used transformations to reformat data
– 5 data categories in Hurricane Katrina site
– Measured increase in number of duplicates identified
Extracting spreadsheet test data
• Cluster spreadsheet columns based on data category
– EUSES spreadsheet corpus “database” section
– Hierarchical agglomerative clustering
– Manual inspection
– Result = 1713 columns in 246 clusters (1 cluster per data category)
• Created 1 tope for each of 32 most common categories
– Yielding 32 topes
– Covered 70% of clustered columns
We considered 5 validation strategies
• Strategy 1: Current spreadsheet practice (accept all inputs)
• Strategy 2: Current webapp practice (validate with regexp or fixed list, when available; accept all other inputs)
– 36 regexps + 35 fixed lists, in 7 categories
• Strategy 3A: Tope rejecting questionable (accept when φ(str) = 1)
• Strategy 3B: Tope accepting questionable (accept when φ(str) > 0)
• Strategy 4: Tope warn on questionable (simulate double-check by user when 0 < φ(str) < 1)
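The five strategies differ only in their accept rule, which can be sketched as below. The function names are hypothetical, `score` stands for the tope's validation score for the input, and `confirm` stands in for the simulated user double-check; none of this is taken from the authors' evaluation code.

```python
import re
from typing import Callable, Optional


def strategy_1(value: str) -> bool:
    """Current spreadsheet practice: accept all inputs."""
    return True


def strategy_2(value: str, pattern: Optional[str]) -> bool:
    """Current webapp practice: check a regexp when one is available, else accept."""
    if pattern is None:
        return True
    return re.fullmatch(pattern, value) is not None


def strategy_3a(score: float) -> bool:
    """Tope rejecting questionable: accept only a score of exactly 1."""
    return score == 1.0


def strategy_3b(score: float) -> bool:
    """Tope accepting questionable: accept any score above 0."""
    return score > 0.0


def strategy_4(value: str, score: float, confirm: Callable[[str], bool]) -> bool:
    """Tope warning on questionable: ask the user only when 0 < score < 1."""
    if 0.0 < score < 1.0:
        return confirm(value)
    return score == 1.0


# A questionable input (score 0.5) under each tope-based strategy.
print(strategy_3a(0.5), strategy_3b(0.5), strategy_4("WEH 5000", 0.5, lambda v: True))
```

Strategy 4 is the only one that costs user effort, and only on inputs whose score falls strictly between 0 and 1.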
Measurements
• Based on 100 random values per category
• Used F1 to measure accuracy
– Standard measure of accuracy for classifiers: F1 = (precision × recall) / avg(precision, recall)
• Considered topes with 1, 2, 3, 4, or 5 formats
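The F1 formula on this slide is the usual harmonic mean of precision and recall, written as a product divided by an average. A small sketch, with the zero case handled explicitly (the function name `f1` is just a label for the example):

```python
def f1(precision: float, recall: float) -> float:
    """F1 = (precision * recall) / avg(precision, recall); 0 when both are 0."""
    mean = (precision + recall) / 2
    if mean == 0:
        return 0.0
    return precision * recall / mean


print(f1(0.5, 0.5))  # 0.5
print(f1(1.0, 0.0))  # 0.0
```

Because the numerator is a product, F1 drops to 0 whenever either precision or recall is 0, however good the other one is.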
Recognizing multiple formats and questionable inputs raises accuracy
Condition 4: Hypothetical user has to help on ~ 3% of inputs
Condition 1: Recall = 0 (fails to identify any invalid inputs)
Topes based on spreadsheet data were accurate on web application data.
Hurricane Katrina | Google Base
Putting data in a consistent format improves duplicate identification.
• Randomly extracted 10000 values for each of 5 Hurricane Katrina data categories
• Implemented transformations for each 5-format tope, from the less commonly used formats to the most commonly used format
• Found approximately 8% more duplicates after transformation
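The effect of reformatting on duplicate detection can be illustrated with a toy example. The values and the lookup table below are made up for the sketch and are not drawn from the Hurricane Katrina data; `TO_COMMON`, `normalize`, and `count_duplicates` are hypothetical names.

```python
def count_duplicates(values):
    """Number of values that exactly repeat some earlier value."""
    return len(values) - len(set(values))


# Hypothetical transformation from a less common format to the most common one.
TO_COMMON = {"Louisiana": "LA", "Mississippi": "MS"}


def normalize(value: str) -> str:
    return TO_COMMON.get(value, value)


values = ["LA", "Louisiana", "MS", "LA", "Mississippi", "TX"]

before = count_duplicates(values)
after = count_duplicates([normalize(v) for v in values])
print(before, after)  # 1 3
```

Exact string comparison misses "Louisiana" versus "LA" until both are put in the same format, which is why the duplicate count rises after transformation.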
Topes improve data validation
• Validating with topes improves
– Accuracy of validation
– Reusability of validation code
– Subsequent duplicate identification
• Contributions:
– Support for ambiguous data categories
– Support for transforming values
– Platform-independent validation
Future Work: Sharing topes
• Repository search mechanisms based on
– Relevance to new applications
– Quality criteria
• Integrate with more programming platforms
– Microsoft Excel
– Microsoft Visual Studio.NET
– A simple XML processing API
– Univ. Nebraska’s Robofox
– IBM’s CoScripter
– Your tool or platform?
Thank You…
• To Jeff Magee, Betty Cheng, Barbara Ryder, Margaret Burnett, and others at ICSE 2007 for early feedback
• To NSF for funding
• To ICSE 2008 for this opportunity to present