Eiman Elnahrawy WSNA’03

Cleaning and Querying Noisy Sensors

Eiman Elnahrawy and Badri Nath
Rutgers University

WSNA September 2003

This work was supported in part by NSF grant ANI-0240383 and DARPA under contract number N-666001-00-1-8953

I can’t rely on this sensor data anymore. It has too many problems!
- Noise
- Bias
- Missing information
- Hmm, is this a malicious sensor?
- Something strange, or a sensor gone bad?

Outline

• Motivation
• General Framework
• Cleaning Noise
• Querying Noisy Sensors Statistically
• Preliminary Evaluations
• Challenges and Future Work
• Conclusion

Motivation

• “Measurements” are subject to many sources of error
• Systematic errors -> bias (calibration) [Bychkovskiy03]
• Random errors (noise): external and uncontrollable environmental factors, hardware, inaccuracies/imprecision
• Current technology: cheap, noisy sensors that vary in tolerance and precision/accuracy
• Industry focus is on even cheaper sensors -> noisier; noise varies with the cost of the sensor

So What?

• Uncertainty
• Interest is generally queries over a set of noisy sensors
– Predicate/range queries
– Aggregates: SUM, MIN
– Other
• Accumulation: seriously affects decision-making/triggers
• False +ve/-ve
• Misleading answers
• May cost you money

Problem Definition

• Research has focused on homogeneous sensors, in-network aggregation, query languages, and optimization

• The primitives now work fairly well, so why not move on to more complex data-quality problems?

• If the collected data/query result is erroneous/misleading, why would we need such nets?

• Given any query and some user-defined confidence metrics, how do we answer this query “efficiently” given noisy sensors?

• What is the effect of noise on queries?

Is this a new problem?

• Traditional databases
– Data entry, transactional activity
– Clean data: no noise
– Supervised off-line cleaning
• Sensors
– Stream
– Decision-making in real time
– Online cleaning and query processing
– Many resource constraints

General Framework

• Two steps
• Online cleaning
– Inputs: noisy data + error models + prior knowledge
– Output: uncertainty models (clean data)
• Queries evaluated on clean data (uncertainty models)

[Figure: Noisy Observations from Sensors, Error Models, and Prior Knowledge feed the Cleaning Module, which outputs Uncertainty Models (Posteriors); these and the User Query feed the Query Processing Module, which outputs the Query Answer]

Cleaning Module

• Observation: noisy reading from the sensor
• Prior knowledge: r.v., distribution of the true reading
– Facts, learning, using less noisy sensors as priors for noisier ones, experts, dynamic (parametric model)
• Error model: r.v., noise characteristic
– Any appropriate distribution, e.g., Gaussian
– Heterogeneity -> model for each type or even each individual sensor
• Uncertainty model (true unknown): r.v. with a distribution that we would like to estimate

[Figure: Cleaning Module. Inputs: Noisy Observations from Sensors, Error Models, Prior Knowledge. Output: Uncertainty Models (Posteriors)]

Cleaning

• Single-sensor fusion using Bayes’ rule: posterior = (likelihood × prior) / evidence
• Single-attribute sensors
• Example: Gaussian prior $(\mu_s, \sigma_s^2)$ and Gaussian error $(0, \delta^2)$ yield a Gaussian posterior (uncertainty model)
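
For the Gaussian example, the posterior has a standard closed form: with observation $o$, prior $(\mu_s, \sigma_s^2)$ and error variance $\delta^2$, the posterior mean is $(\sigma_s^2 o + \delta^2 \mu_s)/(\sigma_s^2 + \delta^2)$ and the posterior variance is $\sigma_s^2 \delta^2/(\sigma_s^2 + \delta^2)$. The sketch below is only an illustration of that formula; the function name `clean_reading` and the sample numbers are assumptions, not taken from the slides.

```python
def clean_reading(observation, prior_mean, prior_var, noise_var):
    """Posterior (mean, variance) of the true reading.

    Assumes a Gaussian prior N(prior_mean, prior_var) and zero-mean
    Gaussian noise with variance noise_var (hypothetical helper).
    """
    total = prior_var + noise_var
    # The observation and the prior mean are weighted by each other's variance.
    post_mean = (prior_var * observation + noise_var * prior_mean) / total
    post_var = prior_var * noise_var / total
    return post_mean, post_var


# Illustrative values only: prior centred at 1000, equal prior and noise variance.
mean, var = clean_reading(observation=1012.0, prior_mean=1000.0,
                          prior_var=100.0, noise_var=100.0)
```

With equal prior and noise variance the posterior mean sits halfway between the observation and the prior mean, and the variance is halved.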

Cleaning

• Multi-attribute sensors
• Example: Gaussian prior $(\mu_s, \Sigma_s)$ and Gaussian error $(0, \Sigma)$ yield a Gaussian posterior (uncertainty model)
• The terms $\Sigma_s[\Sigma_s + \Sigma]^{-1}$ and $\Sigma_T$ will be computed off-line
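
A minimal sketch of the multi-attribute case for two attributes, using the standard Gaussian result: posterior mean $\mu_s + K(o - \mu_s)$ with $K = \Sigma_s[\Sigma_s + \Sigma]^{-1}$, and posterior covariance $\Sigma_s - K\Sigma_s$. Treating the second off-line term as this posterior covariance is an assumption, as are all function names and the sample matrices.

```python
def mat_mul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_inv(m):
    """Inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def offline_terms(prior_cov, noise_cov):
    """Terms that do not depend on the observation: gain K and posterior covariance."""
    total = [[prior_cov[i][j] + noise_cov[i][j] for j in range(2)] for i in range(2)]
    gain = mat_mul(prior_cov, mat_inv(total))
    kp = mat_mul(gain, prior_cov)
    post_cov = [[prior_cov[i][j] - kp[i][j] for j in range(2)] for i in range(2)]
    return gain, post_cov


def clean_vector(observation, prior_mean, gain):
    """Posterior mean: prior mean plus gain times (observation - prior mean)."""
    diff = [observation[i] - prior_mean[i] for i in range(2)]
    return [prior_mean[i] + sum(gain[i][k] * diff[k] for k in range(2))
            for i in range(2)]


# Illustrative diagonal covariances for two unrelated attributes.
prior_cov = [[100.0, 0.0], [0.0, 4.0]]
noise_cov = [[100.0, 0.0], [0.0, 4.0]]
gain, post_cov = offline_terms(prior_cov, noise_cov)
post_mean = clean_vector([1012.0, 22.0], [1000.0, 20.0], gain)
```

Only `clean_vector` has to run per reading; `offline_terms` can be evaluated once per sensor type, which is the point of computing those terms off-line.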

Query Processing Module

• Classification of queries
– What is the reading(s) of sensor x? Single Source Queries (SSQ)
– Which sensors have at least c% chance of satisfying a given predicate? Set Non-Aggregate Queries (SNAQ)
– On those sensors which have at least c% chance of satisfying a given predicate, what is the value of a given aggregate?
• Summary Aggregate Queries (SUM, AVG, COUNT): SAQ
• Exemplary Aggregate Queries (MIN, MAX, etc.): EAQ

[Figure: Query Processing Module. Inputs: Uncertainty Models (Posteriors), User Query. Output: Query Answer]

Single Source Queries

• Approach 1: output expected value of the probability distribution

• Approach 2: output p% confidence interval using Chebyshev’s inequality $[\mu_s - \epsilon, \mu_s + \epsilon]$

– “p” is user-defined with a default value, e.g., 95%

• Multi-attribute: first compute the marginal pdf of each attribute then proceed as above
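
As a sketch of Approach 2: Chebyshev’s inequality gives $P(|X - \mu| \ge \epsilon) \le \sigma^2/\epsilon^2$, so choosing $\epsilon = \sigma/\sqrt{1 - p}$ guarantees coverage of at least $p$ for any distribution with finite variance. The function name and the sample posterior are assumptions for illustration.

```python
import math


def chebyshev_interval(mean, var, p=0.95):
    """Interval [mean - eps, mean + eps] holding at least probability p.

    Uses Chebyshev's inequality, so it makes no assumption about the
    shape of the distribution (hypothetical helper, p must be < 1).
    """
    eps = math.sqrt(var / (1.0 - p))
    return mean - eps, mean + eps


# Illustrative posterior: mean 1006, variance 50, default 95% confidence.
low, high = chebyshev_interval(1006.0, 50.0, p=0.95)
```

Because the bound is distribution-free, the interval is wider than a Gaussian-specific 95% interval would be for the same variance.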

Set Non-Aggregate Queries

• Output sensor id, confidence ($p_i$)
• Confidence = probability of satisfying the given predicate (range $R$), which must be >= the user-defined confidence

$p_i = \int_R p_{s_i}(t)\,dt$

• $\{s_i\} = S_R$, the eligible set
• If the readings are required, compute them using the SSQ algorithms
• Multi-attribute: compute $S_R$ over a region rather than a single interval
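
A minimal sketch of building the eligible set, assuming Gaussian posteriors so that the integral over the range reduces to a difference of two CDF values. The names `gaussian_cdf` and `eligible_set`, the sensor ids, and the numbers are hypothetical.

```python
import math


def gaussian_cdf(x, mean, var):
    """CDF of N(mean, var) at x, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def eligible_set(posteriors, low, high, confidence):
    """Return {sensor_id: p_i} for sensors with p_i >= confidence.

    posteriors maps sensor_id -> (mean, variance) of a Gaussian posterior;
    p_i is the posterior probability that the reading lies in [low, high].
    """
    result = {}
    for sensor_id, (mean, var) in posteriors.items():
        p = gaussian_cdf(high, mean, var) - gaussian_cdf(low, mean, var)
        if p >= confidence:
            result[sensor_id] = p
    return result


# Illustrative posteriors for three sensors and the range [990, 1020].
posteriors = {"s1": (1006.0, 50.0), "s2": (1030.0, 50.0), "s3": (2000.0, 50.0)}
s_r = eligible_set(posteriors, low=990.0, high=1020.0, confidence=0.9)
```

Here only `s1` clears the 90% threshold; `s2` has most of its mass above the range and `s3` is far outside it.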

Summary Aggregate Queries

• SUM: compute the sum of independent continuous r.v.s
• $Z = \mathrm{sum}(s_1, s_2, \ldots, s_m)$
• Perform convolution on two sensors, then repeatedly add one sensor at a time from the eligible set ($S_R$)
• Output expected value or p% confidence interval of the overall sum
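
The repeated convolution can be sketched numerically by discretizing each posterior on a common grid. This is an illustration under assumptions, not the authors' implementation: the grid step, the truncation width, and all names are hypothetical, and Gaussian posteriors are used only to generate test inputs.

```python
import math


def discretize_gaussian(mean, var, step=1.0, width=6.0):
    """Return (start, probs): probability mass on the grid start, start+step, ..."""
    sd = math.sqrt(var)
    n = int(2 * width * sd / step) + 1
    start = mean - width * sd
    weights = [math.exp(-((start + i * step - mean) ** 2) / (2.0 * var))
               for i in range(n)]
    total = sum(weights)
    return start, [w / total for w in weights]


def convolve(a, b):
    """Distribution of X + Y for independent X and Y on grids with the same step."""
    (start_a, probs_a), (start_b, probs_b) = a, b
    out = [0.0] * (len(probs_a) + len(probs_b) - 1)
    for i, x in enumerate(probs_a):
        for j, y in enumerate(probs_b):
            out[i + j] += x * y
    return start_a + start_b, out


def sum_distribution(sensors, step=1.0):
    """Convolve the first two sensors, then fold in the rest one at a time."""
    dists = [discretize_gaussian(m, v, step) for m, v in sensors]
    total = dists[0]
    for d in dists[1:]:
        total = convolve(total, d)
    return total


def expected_value(dist, step=1.0):
    start, probs = dist
    return sum((start + i * step) * p for i, p in enumerate(probs))


# Illustrative eligible set: (mean, variance) of three posteriors.
eligible = [(1006.0, 50.0), (998.0, 40.0), (1011.0, 60.0)]
z = sum_distribution(eligible)
expected_sum = expected_value(z)
```

For Gaussian posteriors the result can be checked against the closed form, since a sum of independent Gaussians is Gaussian with the means and variances added; the numerical route matters when the posteriors are not Gaussian.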

Summary Aggregate Queries

• COUNT: output $|S_R|$ over the given predicate

• AVG: output SUM/COUNT

• Multi-attribute: compute $S_R$, marginalize over the aggregated attribute, then proceed as above
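
A short sketch of COUNT and AVG over an eligible set, treating COUNT as the fixed size of the set and using the fact that means add and, for independent sensors, variances add. The helper name and the input values are assumptions.

```python
def count_and_avg(eligible):
    """eligible: list of (mean, variance) posteriors for the sensors in S_R.

    Returns (count, mean of AVG, variance of AVG). Assumes a non-empty set.
    """
    count = len(eligible)
    sum_mean = sum(m for m, _ in eligible)
    sum_var = sum(v for _, v in eligible)  # variances add for independent sensors
    # Dividing the sum by a constant scales the mean by 1/count
    # and the variance by 1/count**2.
    return count, sum_mean / count, sum_var / count ** 2


# Illustrative eligible set of three posteriors.
count, avg_mean, avg_var = count_and_avg([(1006.0, 50.0), (998.0, 40.0), (1011.0, 60.0)])
```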

Exemplary Aggregate Queries

• MIN: compute the min of independent continuous r.v.s
• $Z = \min(s_1, s_2, \ldots, s_m)$
• Output expected value or p% confidence interval
• Other order statistics (Max, Top-K, Min-K, and median) in a similar manner
• Multi-attribute: analogous
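
For MIN, independence gives the CDF directly: $P(Z \le z) = 1 - \prod_i P(s_i > z)$. The sketch below integrates that CDF numerically to get the expected minimum; Gaussian posteriors, the grid step, and the function names are assumptions made for illustration.

```python
import math


def gaussian_cdf(x, mean, var):
    """CDF of N(mean, var) at x, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def min_cdf(z, sensors):
    """P(min <= z) = 1 - product of P(s_i > z) for independent sensors."""
    survive = 1.0
    for mean, var in sensors:
        survive *= 1.0 - gaussian_cdf(z, mean, var)
    return 1.0 - survive


def expected_min(sensors, step=0.01, width=8.0):
    """Expected value of the minimum, by summing cell midpoints times cell mass."""
    low = min(m - width * math.sqrt(v) for m, v in sensors)
    high = max(m + width * math.sqrt(v) for m, v in sensors)
    n = int((high - low) / step)
    total = 0.0
    prev = min_cdf(low, sensors)
    for i in range(1, n + 1):
        z = low + i * step
        cur = min_cdf(z, sensors)
        total += (z - 0.5 * step) * (cur - prev)
        prev = cur
    return total


# Two independent standard normal "sensors" as a sanity check.
e_min = expected_min([(0.0, 1.0), (0.0, 1.0)])
```

The sanity check uses a known result: the expected minimum of two independent standard normals is $-1/\sqrt{\pi}$. MAX follows the same pattern with $P(Z \le z) = \prod_i P(s_i \le z)$.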

Tradeoffs “Sensors” Vs. “Database”

• Sensor level
– Storage cost
– Communication cost “sending priors”
– Processing cost “compute posteriors”
– Adv: point estimate, in-network aggregation with error bounds
• Database level
– 0 cost assuming free processing, storage
– Communication cost saved
– Exact query answer
– Disadv: no distributed query processing

Evaluations

• Synthetic data
• “Unknown” true readings
– 1000 sensors, random from 5 clusters
– Gaussian, $\mu$ = 1000, 2000, 3000, 4000, 5000, $\delta^2 = 100$
• Noisy data (raw data)
– Added random noise, Gaussian, $\mu = 0$, different noise levels
• Posteriors (Bayesian data)
– Prior: distribution of the cluster that generated the reading
• Predicates: 500 random range queries at each noise level, with the error averaged

• Single source queries
– Metric is MSE
– Reduces uncertainty, yields far fewer errors
– Error scaled down by a factor of $\delta_p^2 / (\delta_p^2 + \delta_n^2)$
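
The scaling factor can be checked with a small simulation in the spirit of the synthetic setup: true readings drawn from five Gaussian clusters, zero-mean Gaussian noise added, and the cluster distribution used as the prior. Reading the subscripts $p$ and $n$ as prior and noise is an assumption, as are the sample size, the seed, and the function name; this is not the authors' evaluation code.

```python
import random


def simulate_mse(prior_var, noise_var, n=20000, seed=0):
    """Return (MSE of raw readings, MSE of posterior means) on synthetic data."""
    rng = random.Random(seed)
    centers = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
    raw_err = 0.0
    bayes_err = 0.0
    for _ in range(n):
        center = rng.choice(centers)
        truth = rng.gauss(center, prior_var ** 0.5)        # "unknown" true reading
        obs = truth + rng.gauss(0.0, noise_var ** 0.5)     # noisy observation
        # Posterior mean for a Gaussian prior and Gaussian noise.
        post_mean = (prior_var * obs + noise_var * center) / (prior_var + noise_var)
        raw_err += (obs - truth) ** 2
        bayes_err += (post_mean - truth) ** 2
    return raw_err / n, bayes_err / n


# Equal prior and noise variance, so the predicted factor is 100 / (100 + 100).
raw_mse, bayes_mse = simulate_mse(prior_var=100.0, noise_var=100.0)
ratio = bayes_mse / raw_mse
```

The ratio of the two errors should land near the predicted factor, up to sampling noise.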

• Set non-aggregate queries: prior $\delta = 10$
– Metrics are Precision and Recall
– Recall: fraction of relevant objects that are retrieved
– Precision: fraction of retrieved objects that are relevant
– High Recall and Precision (low false –ve and +ve, resp.) are better
– Maintained high Recall, Precision at different confidence levels
– 95 % versus 70 % for noisy readings
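
The two metrics are simple set computations. In this sketch, "relevant" means the sensors whose true readings satisfy the predicate and "retrieved" means the sensors the query returned; the helper name, the sensor ids, and the convention of returning 1.0 for an empty set are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumed convention: an empty denominator counts as a perfect score.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Illustrative query result: one false positive (s4) and one false negative (s3).
precision, recall = precision_recall(retrieved={"s1", "s2", "s4"},
                                     relevant={"s1", "s2", "s3"})
```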

• Summary aggregate queries: prior $\delta = 10$
– Metric is absolute error
– More accurate priors yield smaller error
– SUM: noisy readings caused four times the error
– COUNT: 2 versus 14 for noisy data

Challenges and Future Work

• Prototype and more evaluations on real data
• Just scratched the surface!
– Other estimation techniques
– Other uncertainty problems: outliers, missing data, etc.
– Other queries
– Effect of noise on queries
• “Efficient” distributed query processing

Challenges and Future Work

• Given a query and specific quality requirements (confidence, number of false +/-), what should be done if the confidence can’t be satisfied?
– Sensors are not homogeneous
– Change sampling method at run time
– Turn on “specific” sensors at run time
– Routing
– Up-to-date metadata about sensors’ resources/characteristics
– Cost and query optimization

Conclusion

• Taking noise into consideration is important
• Single sensor fusion
• Statistical queries
• Works well
• Many open problems and future work directions

Thank You