
Tom johnson datavalidity-eng-nov21-arbol

Date post: 28-Jan-2015
Upload: j-t-tom-johnson
Description:
Lecture presented at Catedra Walter Lippmann, Universidad del Rosario, Bogota, Colombia, 23 Nov. 2012 See: http://issuu.com/consejo_de_redaccion/docs/ur_-_semana_-_seminario_walter_lippmann_2012_2
Transcript
Page 1: Tom johnson datavalidity-eng-nov21-arbol

Árbol de vida de los datos

(Data validation in the Digital Age)

Tom Johnson
Managing Director
Inst. for Analytic Journalism
Santa Fe, New Mexico USA
tom@jtjohnson.com
@jtjohnson

Page 2: Tom johnson datavalidity-eng-nov21-arbol

Data validation in the Digital Age

Presentation by Tom Johnson at

Cátedra Walter Lippmann de Periodismo y Opinión Pública
Claustro de la Universidad
Universidad del Rosario, Bogota, Colombia

Date/Time: 22 November 2012

This PowerPoint deck and Tipsheets posted at:

http://sdrv.ms/wNtiM7

Page 3: Tom johnson datavalidity-eng-nov21-arbol

Important point 1: You know more than I do

Each of you knows more about some aspect of ensuring data quality than I do.

Page 4: Tom johnson datavalidity-eng-nov21-arbol

DataSet--Story


The STORY!

Page 5: Tom johnson datavalidity-eng-nov21-arbol

DataSet--CollectionProcess

[Diagram: DataSet, CollectionProcess, The STORY!]

Page 6: Tom johnson datavalidity-eng-nov21-arbol

DataSet-ValidationProcess

[Diagram: DataSet, CollectionProcess, ValidationProcess, The STORY!]

Page 7: Tom johnson datavalidity-eng-nov21-arbol

Paying the price of bad data

Illinois and Missouri sex-offender DB

• St. Louis Post-Dispatch, 2 May 1999, A11: “ABOUT 700 SEX OFFENDERS DO NOT APPEAR TO LIVE AT THE ADDRESSES LISTED ON A ST. LOUIS REGISTRY; MANY SEX OFFENDERS NEVER MAKE THE LIST” By Reese Dunklin; Data Analysis By David Heath and Julie Luca

• The Dallas Morning News, Sun, 3 Oct 2004, Page 1A: “Criminal checks deficient; State's database of convictions is hurt by lack of reporting, putting public safety at risk, law officials say” By Diane Jennings and Darlean Spangenberger

Page 8: Tom johnson datavalidity-eng-nov21-arbol

How bad data can do you wrong

2011 - New Mexico Sec. of State’s “questionable voters” data set – “The Big Bundle”

• ~1.1m voters
• Previous Sec. of State didn’t clean rolls
• Matched name, address, DoB and SS#
• SSA data base; NM driver’s licenses
• 2 variables “mismatch” = Questionable?
• Asked State Police (not AG’s office) to investigate

Page 9: Tom johnson datavalidity-eng-nov21-arbol

Problems with Sec. of State methodology

• What is the error rate of original DB?
• Definition of “error”? (Gonzales or Gonzalez)
• Sample(s) by county and state total?
• Error rates of comparative DBs?
• Aggregation of error problem
• 2011 Help America Vote Verification Transaction Totals, Year-to-Date, by State https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
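Two of the questions above, the definition of "error" and the aggregation of error, can be made concrete with a small sketch. The error rates below are hypothetical, the independence of errors across databases is an assumption, and `name_similarity` and `chance_all_sources_correct` are illustrative names that do not come from the lecture.

```python
from difflib import SequenceMatcher
from math import prod


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case and padding."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def chance_all_sources_correct(error_rates):
    """Probability that a record is correct in every source.

    Assumes errors in the sources are independent of each other.
    """
    return prod(1.0 - rate for rate in error_rates)


# Hypothetical error rates for three databases that are matched together.
rates = [0.02, 0.03, 0.05]
p_clean = chance_all_sources_correct(rates)

print(f"Chance a record is clean in all sources: {p_clean:.4f}")
print(f"Chance of at least one error: {1 - p_clean:.4f}")
print(f"Gonzales vs Gonzalez similarity: {name_similarity('Gonzales', 'Gonzalez'):.3f}")
```

Even with small error rates in each source, the chance of at least one error grows with every database added to the match. An exact-match rule would also count "Gonzales" and "Gonzalez" as a mismatch, although the two spellings are nearly identical.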

Page 10: Tom johnson datavalidity-eng-nov21-arbol

DataSet--CollectionProcess

[Diagram: DataSet, CollectionProcess, The STORY!]

Page 11: Tom johnson datavalidity-eng-nov21-arbol

Data sets are living things; they have pedigree and genealogy

Important point 2

• Most [all?] data sets are living things.
• And they have a pedigree, a genealogy, an “árbol de vida” (tree of life).
• Data sets live in a dynamic environment.
• Understand the DB ecology

Page 12: Tom johnson datavalidity-eng-nov21-arbol

Data sets are living things; they have pedigree and genealogy

Important point 3

• NEVER work with your original data set; always work on a copy of the file(s)

• More combined data sets = greater chance of error

• Larger data sets = greater chance of error
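The "always work on a copy" rule can be sketched as a small routine: copy the original into a working folder, then confirm that the copy is byte-for-byte identical before any cleaning begins. The file name `voters.csv`, the `work` folder, and the helper names are hypothetical, and the use of a SHA-256 checksum is an assumption for this sketch, not something the lecture prescribes.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(original: Path, work_dir: Path) -> Path:
    """Copy the original into work_dir and verify the copy matches it."""
    work_dir.mkdir(parents=True, exist_ok=True)
    copy_path = work_dir / original.name
    shutil.copy2(original, copy_path)
    if sha256_of(original) != sha256_of(copy_path):
        raise IOError("Working copy does not match the original")
    return copy_path


# Demonstration with a throwaway file; the contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "voters.csv"
    original.write_text("name,dob\nGonzales,1970-01-01\n", encoding="utf-8")
    working = make_working_copy(original, Path(tmp) / "work")
    copy_matches = sha256_of(original) == sha256_of(working)
    print("Working copy verified:", copy_matches)
```

Recording the checksum of the original in the logbook also makes it possible to show later that the source file was never altered.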

Page 13: Tom johnson datavalidity-eng-nov21-arbol

Types of Data

[Diagram: DataSet]

Page 14: Tom johnson datavalidity-eng-nov21-arbol

Data Quality = function of…

• Objectives, reputation of data-base creator
• Validity and precision of the collection/creation process – and resulting data
• Statistical Data?
• Primary Data (collected, managed by agency or individual)
• Secondary (Agency or individual is using someone else’s “primary” data)

Page 15: Tom johnson datavalidity-eng-nov21-arbol

Pyramid of significance

• How to judge whether some data – and its potential stories -- are more trustworthy than others?
• Go back to librarians’ hierarchy of trusted sources when searching? (Has anyone tested the “quality” of data sets from those strata of sources? If not, a good research project.)

Page 16: Tom johnson datavalidity-eng-nov21-arbol

Learn from Librarians

• Evaluating Web Pages: Techniques to Apply & Questions to Ask http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html

• What can the URL tell you?
• Gov’t agency? Scholarly? Interest Group? Individual?

• Has a reputation for accuracy been created over time?

Page 17: Tom johnson datavalidity-eng-nov21-arbol

Learn from Librarians

• Does it all add up?
• Why was the page put on the web?
  • Inform, give facts, give data?
  • Explain, persuade?
  • Sell, entice?
  • Share?
  • Disclose?

• Is the information current? When was it last updated and by whom?

• If the data is available on other sites, who/what was the original creator and editor of the data?

Page 18: Tom johnson datavalidity-eng-nov21-arbol

Hierarchy of Trust

• Information on .gov, .edu, or .mil sites has probably been vetted before it was posted.

• Domains ending in .gov, .edu and .mil must be applied for, and their use is controlled.

• That doesn’t mean they are foolproof, though.

• ".org" means organization. Sites that end in .org are usually non-profit organizations.

• Can be very good sources or very poor sources; take care to research their possible agendas or political biases.

“.net” means network.

“.info” is the Internet’s first unrestricted top-level domain since .COM. There are no restrictions on who may register .INFO names. .INFO was created for general use around the world.

Source: http://www.morriscs.org/webpages/jwaffle/index.cfm?subpage=1317299
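As a rough illustration of "What can the URL tell you?", the sketch below pulls the top-level domain out of a URL and looks up a note paraphrased from this slide. The `TLD_NOTES` mapping and both function names are hypothetical. The lookup handles only the final label of the host name, so country-code addresses (a host ending in ".gov.co", for example) would need extra handling.

```python
from urllib.parse import urlparse

# Notes paraphrased from the slide; the mapping is only an illustration.
TLD_NOTES = {
    "gov": "controlled registration; probably vetted, but not foolproof",
    "edu": "controlled registration; probably vetted, but not foolproof",
    "mil": "controlled registration; probably vetted, but not foolproof",
    "org": "usually non-profit; research possible agendas or biases",
    "net": "network",
    "info": "unrestricted registration",
}


def top_level_domain(url: str) -> str:
    """Return the last label of the host name, or '' if there is none."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def trust_note(url: str) -> str:
    """Look up the slide's note for the URL's top-level domain."""
    return TLD_NOTES.get(top_level_domain(url), "no note; evaluate the source directly")


for url in (
    "http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html",
    "https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html",
):
    print(top_level_domain(url), "->", trust_note(url))
```

A lookup like this is only a first filter. As the later slides note, a probably reliable site does not guarantee reliable data.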

Page 19: Tom johnson datavalidity-eng-nov21-arbol

Hierarchy of Trust

• Credible websites should list contact information and resources.

• If only cell phones and PO boxes = suspicion

• If the author is named, find his/her web page to…
  • Verify educational credentials
  • Discover whether the writer has been published in a scholarly journal
  • Verify that the writer is employed by a research institution or university

Page 20: Tom johnson datavalidity-eng-nov21-arbol

Hierarchy of Trust

• Internet pages that have been published more recently are usually more credible.

• Find this information at the bottom of a website; in the “about us”; or “view page source”

Page 21: Tom johnson datavalidity-eng-nov21-arbol

Hierarchy of Trust

• Selling something?

• Asking you to sign up for something?

• May not be presenting you with neutral, unbiased information.

Page 22: Tom johnson datavalidity-eng-nov21-arbol

Hierarchy of Trust

Probably reliable sites,

but not necessarily reliable data

Page 23: Tom johnson datavalidity-eng-nov21-arbol

DataSet--CollectionProcess

[Diagram: DataSet, CollectionProcess, The STORY!]

Page 24: Tom johnson datavalidity-eng-nov21-arbol

Process of Data Evaluation

1. Pre-planning

• 2nd Monitor
• “Logbook” (bitácora) apps
• Checklist of intended steps

2. Lit. review/ interview peers

• Nothing is new; everything has a precedent

• How have others attacked this problem?

3. Do data fit theoretical models?

- Depends on subject: traffic flow vs. Crime or educational level vs. Income

- Sometimes good to use non-trad. models: Crime and disease

Page 25: Tom johnson datavalidity-eng-nov21-arbol

Process of Data Evaluation

4. Do a “critical biography” of the data

- Why was data collected? Who ordered its creation (law? Agency? Individual?)

- When first collected?

- News stories about the data?

5. Does biography raise critical warnings?

- Have laws related to data remained the same?

- Have definitions remained the same?

6. Have others run analysis of this data?

- Not only journalists, but other agencies/people

Page 26: Tom johnson datavalidity-eng-nov21-arbol

Process of Data Evaluation

7. Acquire latest data and related documentation

- Get data schema & code sheet

- Get instructions to data collectors and data entry clerks

Page 27: Tom johnson datavalidity-eng-nov21-arbol

Process of DB evaluation

Ask for copy of DATA ENTRY form

Data Sheet Codes & Explanation

Data base schema sheet

Computer Data-Entry Sheet

Page 28: Tom johnson datavalidity-eng-nov21-arbol

Process of Data Evaluation


8. Compare record layout to tables

This may tell you:

- What data you did not receive

- Possibly, what data is feeding into other variables or calculations

9. Do documents specify expected ranges & frequencies?

- Suggests variables to be found. If expected range is 1-7 and you find 8…
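Step 8, comparing the record layout to the tables received, can be sketched as a set comparison between the documented field names and the header row of the delivered file. The field names and the sample CSV below are hypothetical, as is the helper `compare_layout`.

```python
import csv
import io


def compare_layout(documented_fields, received_header):
    """Report fields documented but not delivered, and delivered but not documented."""
    documented = set(documented_fields)
    received = set(received_header)
    return {
        "missing_from_data": sorted(documented - received),
        "undocumented_in_data": sorted(received - documented),
    }


# Hypothetical record layout taken from the agency's documentation.
documented_fields = ["case_id", "date", "offense_code", "officer_id"]

# Hypothetical delivered file; in practice this would be the working copy.
sample_csv = io.StringIO("case_id,date,offense_code,risk_score\n1,2012-11-22,7,0.4\n")
header = next(csv.reader(sample_csv))

report = compare_layout(documented_fields, header)
print(report)
```

A field listed in the layout but absent from the file is data you did not receive. A field in the file but absent from the layout may be a calculated variable, and is worth asking the agency about.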

Page 29: Tom johnson datavalidity-eng-nov21-arbol

Process of Data Evaluation

10. Are data values missing or out of range?

- Use Excel (or R) formula to test “expected” ranges
- =MIN(A1:A100) or =MAX(A1:A100)
- Use Excel's conditional formatting feature
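The same test can be sketched outside Excel. The example below reuses the expected range of 1-7 from step 9 and flags both empty cells and values outside that range. The sample column and the function name `check_range` are made up for illustration.

```python
def check_range(values, low, high):
    """Return (rows with missing values, rows with out-of-range or non-numeric values).

    Row numbers start at 1, matching spreadsheet rows when there is no header.
    """
    missing, out_of_range = [], []
    for row_number, raw in enumerate(values, start=1):
        text = str(raw).strip() if raw is not None else ""
        if text == "":
            missing.append(row_number)
            continue
        try:
            number = float(text)
        except ValueError:
            # Text such as "n/a" cannot fall inside a numeric range.
            out_of_range.append((row_number, raw))
            continue
        if not (low <= number <= high):
            out_of_range.append((row_number, raw))
    return missing, out_of_range


# Hypothetical column whose documentation promises codes from 1 to 7.
column = ["3", "7", "", "8", "1", None, "n/a"]
missing_rows, bad_rows = check_range(column, 1, 7)

print("Missing in rows:", missing_rows)
print("Out of range:", bad_rows)
```

Keeping the row numbers makes it easy to go back to the agency with specific records rather than a general complaint.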

Page 30: Tom johnson datavalidity-eng-nov21-arbol

Process of DB evaluation

10. Review major checklist

- Revise your list of major checkpoints

Major questions

• Are there changes in definitions?
  • Changed by law?
  • By the administrators?
  • Formal or informal by data entry process?
• Are there changes in the collection methods, data entry, editing of data, quality checking, and the type and form of files?
• Were there changes in the users and the use of the data?
• Now it is time to clean the data

Page 31: Tom johnson datavalidity-eng-nov21-arbol

Is perfection necessary?

• How “clean” must the data be?
• Depends on the goals – and scale -- of the analysis
• How important is the actual age of an individual? Or…
• How precise should the lat/longitude data be?

• Precision: Are the numbers rounded or?
• Hope for fine-grained, not summaries or aggregates
• Can be especially important with temporal and geographic data, i.e. What is the range(s) of the time scales?

Page 32: Tom johnson datavalidity-eng-nov21-arbol

Data Quality checkpoints

• Constancy of definitions and coding categories?

• Completeness:
  • How many records have unfilled cells?
  • Are the tendencies of “nulls” consistent in all records, variable types?

Page 33: Tom johnson datavalidity-eng-nov21-arbol

COMMON VERIFICATION METHODS

• Counting: Do you have the number of records indicated/promised?
  • If >1,000 records, sample to test
  • To confirm your methodology

• Proportion of completed fields
  • If a record has X fields, what % of records are complete?
  • Are there trends of null (empty) fields?

• Draw on many Excel functions:
  • COUNTIFs or SUMIF
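The counting and completeness checks can be sketched in a few lines of Python as an alternative to the Excel functions. The sample records, the promised count of 4, and the helper `completeness_report` are hypothetical.

```python
import csv
import io


def completeness_report(rows, fieldnames):
    """Return (record count, % of fully complete records, empty-cell count per field)."""
    total = len(rows)
    empty_per_field = {name: 0 for name in fieldnames}
    complete_records = 0
    for row in rows:
        blanks = [name for name in fieldnames if not (row.get(name) or "").strip()]
        for name in blanks:
            empty_per_field[name] += 1
        if not blanks:
            complete_records += 1
    pct_complete = 100.0 * complete_records / total if total else 0.0
    return total, pct_complete, empty_per_field


# Hypothetical delivered file; the agency promised 4 records.
promised_count = 4
sample = io.StringIO(
    "name,dob,county\n"
    "Gonzales,1970-01-01,Santa Fe\n"
    "Gonzalez,,Taos\n"
    "Smith,1985-06-30,\n"
    "Jones,1990-02-14,Santa Fe\n"
)
reader = csv.DictReader(sample)
rows = list(reader)

total, pct_complete, empties = completeness_report(rows, reader.fieldnames)
print("Records received:", total, "| matches promised:", total == promised_count)
print(f"Complete records: {pct_complete:.1f}%")
print("Empty cells per field:", empties)
```

The per-field counts are where trends in nulls show up: a field that is empty far more often than the others may reflect a change in the form or in data-entry practice.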

Page 35: Tom johnson datavalidity-eng-nov21-arbol

ScatterPlots+BoxPlots


Box Plots

Page 36: Tom johnson datavalidity-eng-nov21-arbol

What is a scatterplot?

• Scatterplot is often 1st step in analysis

• Examine relationship between the variables; determine if there are any problems/issues with the data

• Scatterplot indicates anything unique or interesting about the data, such as:
  • How is the data dispersed?
  • Are there outliers? A scatterplot is useful for "eyeballing" the presence of outliers.
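What a scatterplot or box plot shows to the eye can also be checked numerically. The sketch below uses the common box-plot convention of flagging values more than 1.5 times the interquartile range beyond the quartiles. That threshold is a widely used rule of thumb rather than something stated in the lecture, and the ages are invented.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical ages; 212 is the kind of entry error a plot makes obvious.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 212]
print("Possible outliers:", iqr_outliers(ages))
```

A flagged value is a question to investigate, not automatically an error: check it against the code sheet and the data-entry instructions before correcting or dropping it.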

Page 37: Tom johnson datavalidity-eng-nov21-arbol

Convergence of Data Quality with Data Veracity

What is the difference?

• Data quality is the responsibility of whoever, or whatever agency, is collecting or creating the data set

This suggests questions journalists should ask about DQ

Do methodologies differ?

Page 38: Tom johnson datavalidity-eng-nov21-arbol

Resources

• Free
  • Power Pivot – Excel 2010 add-on for working with large data sets
  • R – free software environment for statistical computing and graphics
  • Shiny – Lets R users turn analyses into interactive web applications
  • Google Refine - tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases
  • Google Fusion Tables - an experimental data visualization web application to gather, visualize, and share larger data tables.
  • Tableau Public - Interact with the data, download it, or create visualizations of it
  • Junar - cloud-based platform for opening data

Page 39: Tom johnson datavalidity-eng-nov21-arbol

Resources

• Open Source
  • Flat File Checker - a simple, intuitive tool for validation of structured data in flat files (*.txt, *.csv, etc.).
  • Shiny – Lets R users turn analyses into interactive web applications

• Excel add-ons
• Commercial Companies & Products
  • Techspeed Data Cleansing
  • SAS® Data Quality Advanced

Page 40: Tom johnson datavalidity-eng-nov21-arbol

Resources

Professional disciplines and organizations

• International Association for Information and Data Quality
• DAMA International
• Forensic Accounting/ Performance Measurement
• National Association of Forensic Accountants (NAFA)
• Certified Fraud Examiner (CFE)
• International Forensic Accounting Association
• Forensic Accountants Society of North America
• International City/County Management Association
