
Dealing with Data Quality

Date post: 24-Feb-2016
Upload: glynn
Description:
Dealing with Data Quality. Google Workshop, July 24, 2009. Faults can reduce the quantity and quality of the collected information. When ignored, faults in a dataset can lead to ambiguous, or worse, incorrect conclusions.
Dealing with Data Quality Google Workshop July 24, 2009
Transcript
Page 1: Dealing with Data Quality

Dealing with Data Quality

Google Workshop, July 24, 2009

Page 2: Dealing with Data Quality
Page 3: Dealing with Data Quality

[Figure: sample images labeled “Blurry”, “Blurry”, “Low light”, “Missing”, and “?”]

Page 4: Dealing with Data Quality

Faults can reduce the quantity and quality of the collected information.

Page 5: Dealing with Data Quality

When ignored, faults in a dataset can lead to ambiguous, or worse, incorrect conclusions.

[Figure: shapes labeled “Circle” and “Square”]

Page 6: Dealing with Data Quality

Unfortunately, faults in networked sensing systems are common

[Figure: Network Faults, Data Faults, and Good Data for four deployments: GDI '04, Redwoods '05, Volcan '06, James Reserve '06]

*** Numbers are approximations based on publications and personal communications.

1 R. Szewczyk et al. An analysis of a large scale habitat monitoring application. In Proc. SenSys, 2004.
2 G. Tolle et al. A macroscope in the redwoods. In Proc. SenSys, 2005.
3 G. Werner-Allen et al. Fidelity and Yield in a Volcano Monitoring Sensor Network. In Proc. OSDI, 2006.
4 Cms database. http://cens.jamesreserve.edu/phpmyadmin

Page 7: Dealing with Data Quality

[Figure: soil sensor data panels for Ammonium, Calcium, Carbonate, Chloride, Nitrate, and pH]

Our experience is similar: almost 60% of the data was faulty in this soil deployment (Bangladesh, 2006)

Page 8: Dealing with Data Quality

Many methods to find faults

Examples include:

• Visual inspection
• Manual validation
• Analytical validation: statistical and scientific models

Statistical, e.g. outlier detection

Scientific, e.g. “temperature decreases with depth”

[Figure: plot of Temperature versus Depth]
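The two analytical checks named above can be illustrated with a short sketch. This is only one possible reading, not the method used in the talk: it assumes readings arrive as plain lists of numbers, uses a median-absolute-deviation rule as a stand-in for "outlier detection", and treats "temperature decreases with depth" as a simple ordering check. The function names, the 3.5 threshold, and the sample values are all hypothetical.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score (based on the median
    absolute deviation) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be scored
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

def violates_decreasing(temps_by_depth, tolerance=0.0):
    """Return indices where temperature rises relative to the previous
    (shallower) reading by more than the tolerance."""
    return [i for i in range(1, len(temps_by_depth))
            if temps_by_depth[i] > temps_by_depth[i - 1] + tolerance]

# Hypothetical readings, for illustration only.
air_temps = [20.1, 20.3, 19.9, 20.2, 55.0, 20.0]
soil_temps = [18.0, 16.5, 17.2, 14.0]  # ordered shallow to deep

statistical_flags = mad_outliers(air_temps)
scientific_flags = violates_decreasing(soil_temps)
```

Note that the ordering check only reports where the expected relationship breaks; it cannot say whether the flagged reading or its shallower neighbor is the faulty one.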

Page 9: Dealing with Data Quality

Several methods to fix faults

• Go into the field and fix the problem, or replace the faulty component.

• Remove the faulty data (“clean” the dataset) after the deployment is over.
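As a rough illustration of the second option, post-deployment cleaning might look like the sketch below. It assumes each reading is a `(timestamp, value)` tuple and that the caller supplies the fault test; the `clean` helper, the pH range check, and the sample data are hypothetical and not taken from the talk.

```python
def clean(readings, is_faulty):
    """Split readings into (kept, removed) using a caller-supplied fault test."""
    kept, removed = [], []
    for reading in readings:
        (removed if is_faulty(reading) else kept).append(reading)
    return kept, removed

def ph_is_faulty(reading):
    """Hypothetical fault test: a missing value or a pH outside 0 to 14."""
    _, value = reading
    return value is None or not (0.0 <= value <= 14.0)

# Hypothetical (timestamp, pH) readings, for illustration only.
ph_readings = [(0, 7.1), (1, 7.0), (2, -1.0), (3, 7.2), (4, None)]

kept, removed = clean(ph_readings, ph_is_faulty)
fraction_removed = len(removed) / len(ph_readings)
```

Returning the removed readings alongside the kept ones makes it possible to report how much of the dataset was discarded, rather than silently shrinking it.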

Page 10: Dealing with Data Quality

Faults persist for a number of reasons, including:

First, faults can be difficult to define and identify

Page 11: Dealing with Data Quality

Faults persist partly because they are difficult to define


Page 12: Dealing with Data Quality

Faults persist partly because they are difficult to define

A nitrate deployment in the riverbed of the Merced River

Page 13: Dealing with Data Quality

A nitrate deployment in the riverbed of the Merced River

Faults persist partly because they are difficult to define

Page 14: Dealing with Data Quality

A nitrate deployment in the riverbed of the Merced River

Nitrate data taken from nearby locations

Faults persist partly because they are difficult to define

Which one is correct? Are they both correct? Are they both faulty?
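A small sketch shows why the question is hard to answer from the data alone. Assuming two time-aligned series from nearby sensors, the comparison below can report where they disagree, but nothing in it can say which sensor, if either, is right. The function name, the tolerance, and the sample values are hypothetical.

```python
def disagreements(series_a, series_b, tolerance):
    """Return indices where two aligned series differ by more than tolerance."""
    if len(series_a) != len(series_b):
        raise ValueError("series must be aligned and of equal length")
    return [i for i, (a, b) in enumerate(zip(series_a, series_b))
            if abs(a - b) > tolerance]

# Hypothetical nitrate readings from two nearby locations.
sensor_a = [1.0, 1.2, 1.1, 3.0]
sensor_b = [1.1, 1.1, 1.2, 1.0]

conflicts = disagreements(sensor_a, sensor_b, tolerance=0.5)
```

Deciding between the two readings at a conflicting index needs outside information, such as a manual sample or a model of how nitrate should vary over that distance.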

Page 15: Dealing with Data Quality

Faults persist for a number of reasons, including:

First, faults can be difficult to define and identify

Second, faults are not always worth fixing

Page 16: Dealing with Data Quality

Not all faults need to be fixed [Schoellhammer '08]

Maintenance can be expensive

And, if the analysis can happen without the faulty data, then what’s the point?

[Figure: two plots of Temperature versus Depth]
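One way to make the "is it worth fixing?" question concrete is to check whether the analysis can still run on the good data alone. The sketch below is an assumption-laden illustration, not the approach from [Schoellhammer '08]: it takes the analysis to be a least-squares slope of temperature against depth, and it treats "at least three good points remain" as the hypothetical bar for proceeding.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def fit_without(depths, temps, faulty, min_points=3):
    """Fit the trend using only non-faulty readings.
    Returns None when too few good readings remain to fit."""
    keep = [i for i in range(len(depths)) if i not in faulty]
    if len(keep) < min_points:
        return None
    return slope([depths[i] for i in keep], [temps[i] for i in keep])

# Hypothetical profile: the reading at index 2 is known to be faulty.
depths = [0, 1, 2, 3, 4]
temps = [20, 18, 30, 14, 12]

trend = fit_without(depths, temps, faulty={2})
too_sparse = fit_without(depths, temps, faulty={1, 2, 3})
```

If the fit succeeds without the faulty sensor, a field visit may not be justified; if it returns `None`, the missing data is blocking the analysis and the fault is worth the maintenance cost.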

Page 17: Dealing with Data Quality

Faults persist for a number of reasons, including:

First, faults can be difficult to define and identify

Second, faults are not always worth fixing

Answering these questions is hard

Page 18: Dealing with Data Quality

Incomplete, ad-hoc, or last-minute solutions for addressing faults only exacerbate the problem.

Regardless of the solution for addressing faults (and there are many), it should be incorporated into the design and implementation of the system right from the beginning.
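As one hedged illustration of building fault handling in from the start, a system could attach a quality flag to every reading at ingestion time instead of cleaning the data afterwards. The `Quality` enum, the `Reading` record, the `ingest` function, and the range check are all hypothetical names and choices made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Quality(Enum):
    GOOD = "good"
    SUSPECT = "suspect"
    MISSING = "missing"

@dataclass(frozen=True)
class Reading:
    sensor_id: str
    timestamp: float
    value: Optional[float]
    quality: Quality

def ingest(sensor_id: str, timestamp: float, value: Optional[float],
           valid_range: Tuple[float, float]) -> Reading:
    """Store every reading, flagged rather than dropped, so later analysis
    can decide how to treat suspect or missing data."""
    low, high = valid_range
    if value is None:
        quality = Quality.MISSING
    elif low <= value <= high:
        quality = Quality.GOOD
    else:
        quality = Quality.SUSPECT
    return Reading(sensor_id, timestamp, value, quality)

# Hypothetical usage with an assumed valid range of 0 to 100.
good = ingest("n1", 0.0, 50.0, (0.0, 100.0))
suspect = ingest("n1", 1.0, 250.0, (0.0, 100.0))
missing = ingest("n1", 2.0, None, (0.0, 100.0))
```

Keeping the suspect value alongside its flag preserves the option to revisit the decision later, which a delete-on-arrival policy would not.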

Page 19: Dealing with Data Quality

Thank You

Nithya Ramanathan

Page 20: Dealing with Data Quality

Collecting usable sensor data from a networked system is never easy. Whether the data consists of images or nitrate levels from a chemistry sensor, faults can reduce the quantity and quality of the collected information. And when ignored, faults in a dataset can lead to ambiguous, or worse, incorrect conclusions. Unfortunately, faults in networked sensing systems are painfully common.

Faults persist partly because they are difficult to define, and even once identified, they are not always worth fixing. Incomplete, ad-hoc, or last-minute solutions for addressing faults only exacerbate the problem. Regardless of the solution for addressing faults (and there are many), it should be incorporated into the design and implementation of the system right from the beginning.

