Dealing with Data Quality

Post on 24-Feb-2016



Google Workshop, July 24, 2009

[Figure: example images with faults, labeled blurry, low light, and missing]

Faults can reduce the quantity and quality of the collected information.

When ignored, faults in a dataset can lead to ambiguous, or worse, incorrect conclusions.

[Figure: example images labeled “Circle” and “Square”]

Unfortunately, faults in networked sensing systems are common

[Figure: share of network faults, data faults, and good data in four deployments: GDI '04 (1), Redwoods '05 (2), Volcan '06 (3), James Reserve '06 (4)]

*** Numbers are approximations based on publications and personal communications.

1 R. Szewczyk et al. An analysis of a large scale habitat monitoring application. In Proc. SenSys, 2004.

2 G. Tolle et al. A macroscope in the redwoods. In Proc. SenSys, 2005.

3 G. Werner-Allen et al. Fidelity and Yield in a Volcano Monitoring Sensor Network. In Proc. OSDI, 2006.

4 CMS database. http://cens.jamesreserve.edu/phpmyadmin

[Figure: soil sensor data panels for ammonium, calcium, carbonate, chloride, nitrate, and pH]

Our experience is similar: almost 60% of the data was faulty in this soil deployment (Bangladesh, 2006).

Many methods to find faults

Examples include:

• Visual inspection

• Manual validation

• Analytical validation: statistical and scientific models

Statistical, e.g. outlier detection
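As one illustration of a statistical check, the sketch below flags outliers with a modified z-score based on the median absolute deviation (MAD). This is only one of many possible outlier detectors and is not taken from the talk. The function name `mad_outliers`, the threshold of 3.5, and the sample readings are assumptions made for the example.

```python
from statistics import median


def mad_outliers(readings, threshold=3.5):
    """Return indices of readings whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD. The 3.5 cutoff is
    a commonly used convention, chosen here as an assumption.
    """
    med = median(readings)
    abs_dev = [abs(x - med) for x in readings]
    mad = median(abs_dev)
    if mad == 0:
        # At least half the readings equal the median, so the score is undefined.
        return []
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical temperature readings (degrees C) with one obvious spike.
temps = [21.3, 21.5, 21.4, 21.6, 85.0, 21.2, 21.5]
flagged = mad_outliers(temps)
print(flagged)
```

A check like this only says that a reading is unusual relative to its neighbors. Whether the unusual reading is a fault or a real event still has to be decided some other way.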

Scientific, e.g. “temperature decreases with depth”

[Figure: temperature plotted against depth]
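A scientific model such as “temperature decreases with depth” can be turned into a simple consistency check. The sketch below is a minimal version under stated assumptions: the function name `depth_trend_violations`, the tolerance parameter, and the sample profile are all hypothetical and not from the talk.

```python
def depth_trend_violations(profile, tolerance=0.0):
    """Find adjacent depth pairs where temperature rises with depth.

    profile is a list of (depth, temperature) tuples. A pair is reported
    when the deeper reading is warmer than the shallower one by more than
    tolerance, which allows for sensor noise.
    """
    ordered = sorted(profile)  # sort by depth, shallowest first
    violations = []
    for (d_shallow, t_shallow), (d_deep, t_deep) in zip(ordered, ordered[1:]):
        if t_deep - t_shallow > tolerance:
            violations.append((d_shallow, d_deep))
    return violations


# Hypothetical profile: (depth in m, temperature in degrees C).
profile = [(0.1, 24.0), (0.2, 22.5), (0.4, 23.8), (0.8, 19.9)]
violations = depth_trend_violations(profile, tolerance=0.5)
print(violations)
```

The check reports the pair of depths that contradicts the model, not the faulty sensor: either reading in the pair could be the wrong one.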

Several methods to fix faults

• Go into the field and replace or fix the problem.

• Remove the faulty data (“clean” the dataset) after the deployment is over.
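Cleaning a dataset after the deployment can be sketched as a filter that also reports how much data survives. The helper name `clean`, the record layout, and the fault rule (missing or negative concentration values) are assumptions chosen for illustration.

```python
def clean(records, is_faulty):
    """Drop faulty records and report the fraction of data that remains."""
    kept = [r for r in records if not is_faulty(r)]
    yield_fraction = len(kept) / len(records) if records else 0.0
    return kept, yield_fraction


def missing_or_negative(record):
    """Hypothetical fault rule: a concentration is missing or below zero."""
    value = record["value"]
    return value is None or value < 0


# Hypothetical nitrate readings from a single sensor.
records = [
    {"sensor": "nitrate-1", "value": 4.2},
    {"sensor": "nitrate-1", "value": None},
    {"sensor": "nitrate-1", "value": -50.0},
    {"sensor": "nitrate-1", "value": 3.9},
]
kept, yield_fraction = clean(records, missing_or_negative)
print(len(kept), yield_fraction)
```

Reporting the yield alongside the cleaned data makes the cost of cleaning visible: here half of the hypothetical readings are discarded.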

Faults persist for a number of reasons, including:

First, faults can be difficult to define and identify

Faults persist partly because they are difficult to define

A nitrate deployment in the riverbed of the Merced River

Nitrate data taken from nearby locations

Which one is correct? Are they both correct? Are they both faulty?
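The difficulty can be made concrete with a small sketch that compares two time-aligned series from nearby locations. The function name `disagreement`, the `max_gap` threshold, and the sample values are hypothetical. They are not data from the Merced River deployment.

```python
def disagreement(series_a, series_b, max_gap):
    """Return indices where two time-aligned series differ by more than max_gap."""
    return [
        i
        for i, (a, b) in enumerate(zip(series_a, series_b))
        if abs(a - b) > max_gap
    ]


# Hypothetical nitrate readings from two nearby locations.
site_a = [1.0, 1.1, 1.2, 4.8, 5.1]
site_b = [1.1, 1.0, 1.3, 1.2, 1.1]
diverging = disagreement(site_a, site_b, max_gap=1.0)
print(diverging)
```

The comparison shows where the two sensors diverge, but it cannot say which one is right. Both could be correct if conditions really differ between the locations, and both could be faulty.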

Second, faults are not always worth fixing

Not all faults need to be fixed [Schoellhammer ‘08]

Maintenance can be expensive

And, if the analysis can happen without the faulty data, then what’s the point?

[Figures: temperature plotted against depth]
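One way to ask whether the analysis can proceed without the faulty data is to run it on the usable readings alone and see whether the conclusion still holds. The sketch below assumes the analysis is a least-squares trend of temperature against depth. The function name `slope` and the sample profile are hypothetical.

```python
def slope(points):
    """Ordinary least-squares slope of y against x for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need readings from at least two distinct depths")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical profile: (depth in m, temperature in degrees C).
# None marks a sensor whose data was discarded as faulty.
profile = [(0.1, 24.0), (0.2, 23.1), (0.4, None), (0.8, 20.2), (1.6, 17.5)]
usable = [(d, t) for d, t in profile if t is not None]
trend = slope(usable)
print(len(usable), trend < 0)
```

If the trend is still clearly negative with the remaining sensors, a trip to the field to repair the faulty one may not be worth the cost.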

Answering these questions is hard

Incomplete, ad hoc, or last-minute solutions for addressing faults only exacerbate the problem.

Regardless of the solution for addressing faults (and there are many), it should be incorporated into the design and implementation of the system right from the beginning.

Thank You

Nithya Ramanathan

Collecting usable sensor data from a networked system is never easy. Whether the data consists of images or nitrate levels from a chemistry sensor, faults can reduce the quantity and quality of the collected information. And when ignored, faults in a dataset can lead to ambiguous, or worse, incorrect conclusions. Unfortunately, faults in networked sensing systems are painfully common.

Faults persist partly because they are difficult to define, and even once identified, they are not always worth fixing. Incomplete, ad hoc, or last-minute solutions for addressing faults only exacerbate the problem. Regardless of the solution for addressing faults (and there are many), it should be incorporated into the design and implementation of the system right from the beginning.