Eiman Elnahrawy WSNA’03
Cleaning and Querying Noisy Sensors
Eiman Elnahrawy and Badri Nath, Rutgers University
WSNA September 2003
This work was supported in part by NSF grant ANI-0240383 and DARPA under contract number N-666001-00-1-8953
I can’t rely on this sensor data anymore. It has too many problems!
– Noise
– Bias
– Missing information
– Hmm, is this a malicious sensor?
– Something strange, or a sensor gone bad?
Outline
• Motivation
• General Framework
• Cleaning Noise
• Querying Noisy Sensors Statistically
• Preliminary Evaluations
• Challenges and Future Work
• Conclusion
Motivation
• “Measurements” are subject to many sources of error
• Systematic errors → bias (calibration) [Bychkovskiy03]
• Random errors (noise): external and uncontrollable environmental effects, hardware, inaccuracy/imprecision
• Current technology: cheap, noisy sensors that vary in tolerance, precision, and accuracy
• Industry focus is on even cheaper sensors → noisier readings; noise varies with the cost of the sensor
So What?
• Uncertainty
• Interest is generally in queries over a set of noisy sensors
– Predicate/range queries
– Aggregates (SUM, MIN)
– Other
• Accumulation: seriously affects decision-making/triggers
• False positives/negatives
• Misleading answers
• May cost you money
Problem Definition
• Research focused on homogeneous sensors, in-network aggregation, query languages, optimization
• The primitives now work fairly well, so why not move on to more complex data-quality problems?
• If the collected data or query results are erroneous or misleading, why would we need such networks?
• Given any query and some user-defined confidence metrics, how do we answer this query “efficiently” given noisy sensors?
• What is the effect of noise on queries?
Is this a new problem?
• Traditional databases
– Data entry, transactional activity
– Clean data: no noise
– Supervised off-line cleaning
• Sensors
– Streams
– Decision-making in real time
– Online cleaning and query processing
– Many resource constraints
General Framework
• Two steps
• Online cleaning
– Inputs: noisy data + error models + prior knowledge
– Output: uncertainty models (clean data)
• Queries evaluated on clean data (uncertainty models)
[Diagram: Noisy Observations from Sensors + Error Models + Prior Knowledge → Cleaning Module → Uncertainty Models (Posteriors) → Query Processing Module (with User Query) → Query Answer]
Cleaning Module
• Observation: noisy reading from the sensor
• Prior Knowledge: r.v., distribution of the true reading
– Facts, learning, using less noisy sensors as priors for noisier ones, experts, dynamic (parametric model)
• Error Model: r.v., noise characteristic
– Any appropriate distribution, e.g., Gaussian
– Heterogeneity → a model for each type or even each individual sensor
• Uncertainty Model (true unknown): r.v. with a distribution that we would like to estimate
[Diagram: Noisy Observations from Sensors + Error Models + Prior Knowledge → Cleaning Module → Uncertainty Models (Posteriors)]
Cleaning
• Single-sensor fusion using Bayes’ rule: posterior = (likelihood × prior) / evidence
• Single-attribute sensors
• Example: Gaussian prior N(μs, σ²s) and Gaussian error N(0, δ²) yield a Gaussian posterior (uncertainty model)
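The Gaussian example above can be written as a minimal sketch (the function name and numbers are illustrative, not from the paper):

```python
def fuse(mu_s, var_s, y, var_n):
    """Bayes' rule for a Gaussian prior N(mu_s, var_s) and a noisy
    observation y with Gaussian error N(0, var_n): the posterior
    (uncertainty model) is again Gaussian."""
    k = var_s / (var_s + var_n)          # weight given to the observation
    mu_post = mu_s + k * (y - mu_s)      # = (var_n*mu_s + var_s*y)/(var_s + var_n)
    var_post = var_s * var_n / (var_s + var_n)
    return mu_post, var_post

mu, var = fuse(mu_s=20.0, var_s=4.0, y=26.0, var_n=4.0)
# the posterior mean lies between the prior mean and the observation,
# and the posterior variance is smaller than both input variances
```

With equal prior and noise variances the posterior mean is simply the average of prior mean and observation.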
Cleaning
• Multi-attribute sensors
• Example: Gaussian prior N(μs, Σs) and Gaussian error N(0, Σ) yield a Gaussian posterior (uncertainty model)
• The terms Σs[Σs + Σ]⁻¹ and ΣT will be computed off-line
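A sketch of the multi-attribute update for two attributes, in pure Python. The gain K = Σs[Σs + Σ]⁻¹ is the term the slide says can be computed off-line; the posterior forms μs + K(y − μs) and (I − K)Σs are the standard Gaussian-fusion identities, stated here as an assumption about the paper's exact algebra:

```python
def mat_add(A, B): return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
def mat_mul(A, B): return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
def mat_inv2(A):
    # closed-form inverse of a 2x2 matrix
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def fuse_mv(mu_s, Sigma_s, y, Sigma_n):
    """Gaussian prior N(mu_s, Sigma_s), Gaussian error N(0, Sigma_n)."""
    K = mat_mul(Sigma_s, mat_inv2(mat_add(Sigma_s, Sigma_n)))  # off-line gain
    mu_post = [mu_s[i] + sum(K[i][j] * (y[j] - mu_s[j]) for j in range(2))
               for i in range(2)]
    IK = [[(1.0 if i == j else 0.0) - K[i][j] for j in range(2)] for i in range(2)]
    Sigma_post = mat_mul(IK, Sigma_s)   # posterior covariance (I - K) Sigma_s
    return mu_post, Sigma_post
```

With diagonal covariances each attribute reduces to the single-attribute update.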
Query Processing Module
• Classification of queries
– What is the reading(s) of sensor x? → Single Source Queries (SSQ)
– Which sensors have at least c% chance of satisfying a given predicate? → Set Non-Aggregate Queries (SNAQ)
– On those sensors that have at least c% chance of satisfying a given predicate, what is the value of a given aggregate?
• Summary Aggregate Queries (SUM, AVG, COUNT) → SAQ
• Exemplary Aggregate Queries (MIN, MAX, etc.) → EAQ
[Diagram: Uncertainty Models (Posteriors) + User Query → Query Processing Module → Query Answer]
Single Source Queries
• Approach 1: output expected value of the probability distribution
• Approach 2: output a p% confidence interval [μs − ε, μs + ε] using Chebyshev’s inequality
– “p” is user-defined with a default value, e.g., 95%
• Multi-attribute: first compute the marginal pdf of each attribute then proceed as above
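Both approaches can be sketched as follows. Chebyshev's inequality gives P(|X − μ| ≥ ε) ≤ σ²/ε² for any distribution, so ε = √(σ²/(1 − p)) guarantees at least p coverage (function name illustrative):

```python
import math

def ssq_answer(mu, var, p=0.95):
    """Approach 1: the expected value of the posterior.
    Approach 2: a p% confidence interval [mu - eps, mu + eps].
    Chebyshev: P(|X - mu| >= eps) <= var / eps**2, so choosing
    eps = sqrt(var / (1 - p)) bounds the miss probability by 1 - p
    for ANY distribution with this mean and variance."""
    eps = math.sqrt(var / (1.0 - p))
    return mu, (mu - eps, mu + eps)
```

The interval is distribution-free and therefore conservative; a Gaussian-specific interval would be tighter.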
Set Non-Aggregate Queries
• Output: sensor id and confidence pi
• Confidence = probability of satisfying the given predicate (range R), required to be ≥ the user-defined confidence:
pi = ∫R psi(t) dt
• Eligible set: SR = {si : pi ≥ c}
• If the readings are required, compute them using the SSQ algorithms
• Multi-attribute: compute SR over a region rather than a single interval
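Assuming Gaussian uncertainty models (as in the cleaning examples), pi has a closed form via the normal CDF; a sketch with illustrative names:

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def eligible_set(posteriors, a, b, c=0.95):
    """posteriors: {sensor_id: (mu, sigma)} Gaussian uncertainty models.
    p_i = P(a <= X_i <= b) = integral of the posterior pdf over R = [a, b];
    S_R keeps the sensors with p_i >= c."""
    s_r = {}
    for sid, (mu, sigma) in posteriors.items():
        p_i = normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
        if p_i >= c:
            s_r[sid] = p_i
    return s_r
```

For non-Gaussian posteriors the same structure holds with a numerical integral in place of the CDF difference.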
Summary Aggregate Queries
• SUM: compute the sum of independent continuous r.v.s
• Z = sum(s1, s2, …, sm)
• Perform convolution on two sensors, then repeatedly add one more sensor from the eligible set (SR)
• Output the expected value or a p% confidence interval of the overall sum
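The pairwise-convolution fold can be sketched on discretized distributions (the paper works with continuous posteriors; this discrete version, with illustrative names, shows the same structure):

```python
def convolve(pa, pb):
    """Distribution of the sum of two independent discrete r.v.s,
    each given as {value: probability}."""
    out = {}
    for va, qa in pa.items():
        for vb, qb in pb.items():
            out[va + vb] = out.get(va + vb, 0.0) + qa * qb
    return out

def sum_query(dists):
    """Convolve two sensors, then add one more repeatedly (the fold
    over the eligible set S_R described on the slide)."""
    z = dists[0]
    for d in dists[1:]:
        z = convolve(z, d)
    return z

def expect(p):
    return sum(v * q for v, q in p.items())
```

Since expectation is linear, E[Z] equals the sum of the individual expected values, which gives a quick sanity check on the fold.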
Summary Aggregate Queries
• COUNT: output |SR| over the given predicate
• AVG: output SUM/COUNT
• Multi-attribute: compute SR, marginalize over the aggregated attribute, then proceed as above
Exemplary Aggregate Queries
• MIN: compute the min of independent continuous r.v.s
• Z = min(s1, s2, …, sm)
• Output the expected value or a p% confidence interval
• Other order statistics (MAX, Top-K, Min-K, and median) are handled in a similar manner
• Multi-attribute: analogous
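For independent sensors the CDF of the minimum factors as F_Z(z) = 1 − ∏i (1 − F_i(z)); a numerical sketch for Gaussian posteriors (grid bounds and names are illustrative assumptions):

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def expected_min(params, lo=-8.0, hi=8.0, n=20000):
    """E[min(s1..sm)] for independent Gaussians (mu, sigma) pairs.
    F_Z(z) = 1 - prod_i (1 - F_i(z)); integrate z dF_Z on a grid."""
    def cdf_min(z):
        surv = 1.0
        for mu, sigma in params:
            surv *= 1.0 - normal_cdf(z, mu, sigma)   # P(all sensors > z)
        return 1.0 - surv
    h = (hi - lo) / n
    e, prev = 0.0, cdf_min(lo)
    for k in range(1, n + 1):
        z = lo + k * h
        cur = cdf_min(z)
        e += (z - h / 2.0) * (cur - prev)   # cell midpoint times its mass
        prev = cur
    return e

m = expected_min([(0.0, 1.0), (0.0, 1.0)])
# analytically, E[min of two standard normals] = -1/sqrt(pi) ≈ -0.564
```

MAX and the other order statistics follow by replacing the product of survival functions with the corresponding product of CDFs.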
Tradeoffs “Sensors” Vs. “Database”
• Sensor level
– Storage cost
– Communication cost (“sending priors”)
– Processing cost (“compute posteriors”)
– Advantage: point estimate, in-network aggregation with error bounds
• Database level
– Zero cost assuming free processing and storage
– Communication cost saved
– Exact query answer
– Disadvantage: no distributed query processing
Evaluations
• Synthetic data
• “Unknown” true readings
– 1000 sensors, drawn at random from 5 clusters
– Gaussian, μ = 1000, 2000, 3000, 4000, 5000, δ² = 100
• Noisy data (raw data)
– Added random Gaussian noise, μ = 0, at different noise levels
• Posteriors (Bayesian data)
– Prior: distribution of the cluster that generated the reading
• Predicates: 500 random range queries at each noise level; errors averaged
• Single source queries
– Metric is MSE
– Cleaning reduces uncertainty and yields far fewer errors
– Error scaled down by a factor of δ²p / (δ²p + δ²n)
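The shrinkage factor can be checked with a small simulation (a sketch under the slide's setup, assuming the posterior mean is used as the cleaned reading; variances chosen for illustration):

```python
import random

random.seed(0)
var_p, var_n = 100.0, 100.0     # prior variance delta_p^2 and noise variance delta_n^2
mu_p = 1000.0
n = 200_000
se_raw = se_post = 0.0
for _ in range(n):
    true = random.gauss(mu_p, var_p ** 0.5)        # true reading drawn from the prior
    y = true + random.gauss(0.0, var_n ** 0.5)     # noisy observation
    post = mu_p + var_p / (var_p + var_n) * (y - mu_p)  # posterior mean (cleaned)
    se_raw += (y - true) ** 2
    se_post += (post - true) ** 2
ratio = se_post / se_raw
# theory: MSE(cleaned)/MSE(raw) = var_p / (var_p + var_n), i.e. 0.5 here
```

With equal prior and noise variances the cleaned readings should show about half the raw MSE; a tighter prior shrinks the error further, matching the slide.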
• Set non-aggregate queries: prior δ = 10
– Metrics are Precision and Recall
– Recall: fraction of relevant objects that are retrieved
– Precision: fraction of retrieved objects that are relevant
– High Recall and Precision (low false negatives and positives, respectively) are better
– Maintained high Recall and Precision at different confidence levels: 95% versus 70% for noisy readings
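The two metrics defined above can be computed directly from the retrieved and relevant sets (an illustrative helper, not from the paper):

```python
def precision_recall(retrieved, relevant):
    """Recall: fraction of relevant objects that are retrieved.
    Precision: fraction of retrieved objects that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # true positives
    precision = tp / len(retrieved) if retrieved else 1.0
    recall = tp / len(relevant) if relevant else 1.0
    return precision, recall
```

Here "retrieved" would be the eligible set S_R returned by the SNAQ algorithm and "relevant" the sensors whose true readings satisfy the predicate.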
• Summary aggregate queries: prior δ = 10
– Metric is absolute error
– More accurate priors yield smaller error
– SUM: noisy readings caused four times the error
– COUNT: 2 versus 14 for noisy data
Challenges and Future Work
• Prototype and more evaluations on real data
• Just scratched the surface!
– Other estimation techniques
– Other uncertainty problems: outliers, missing data, etc.
– Other queries
– Effect of noise on queries
• “Efficient” distributed query processing
Challenges and Future Work
• Given a query and specific quality requirements (confidence, number of false positives/negatives), what to do if the confidence cannot be satisfied?
– Sensors are not homogeneous
– Change the sampling method at run time
– Turn on “specific” sensors at run time
– Routing
– Up-to-date metadata about sensors’ resources/characteristics
– Cost and query optimization
Conclusion
• Taking noise into consideration is important
• Single-sensor fusion
• Statistical queries
• Works well
• Many open problems and future-work directions
Thank You