Analysis of Uncertain Data: Smoothing of Histograms
Eugene Fink, Ankur Sarin, Jaime G. Carbonell
Density estimate problem
Convert a set of numeric data points to a smoothed approximation of the underlying probability density.
Example
Points: 11, 12, 17, 18, 19, 21, 22, 26, 27, 29
Techniques
• Manual estimates
• Histograms
• Curve fitting
Generalized histograms
0.2 chance: [11 .. 12]
0.5 chance: [17 .. 22]
0.3 chance: [26 .. 29]
General form
prob1: [min1 .. max1]
prob2: [min2 .. max2]
…
probn: [minn .. maxn]
• Intervals do not overlap
• Probabilities sum to 1.0
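The general form and its two constraints can be sketched as a small data structure. This is only an illustration: the names `Interval` and `validate`, the tolerance `tol`, and the choice to allow intervals that merely touch at an endpoint are assumptions, not part of the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interval:
    """One entry 'prob: [lo .. hi]' of a generalized histogram."""
    prob: float
    lo: float
    hi: float


def validate(hist: List[Interval], tol: float = 1e-9) -> None:
    """Raise ValueError unless hist satisfies the two stated constraints."""
    for iv in hist:
        if iv.prob < 0 or iv.lo > iv.hi:
            raise ValueError(f"malformed interval: {iv}")
    ordered = sorted(hist, key=lambda iv: iv.lo)
    # Constraint 1: intervals do not overlap (touching endpoints are
    # allowed here, which is an assumption of this sketch).
    for left, right in zip(ordered, ordered[1:]):
        if left.hi > right.lo:
            raise ValueError(f"intervals overlap: {left} and {right}")
    # Constraint 2: probabilities sum to 1.0, up to rounding error.
    total = sum(iv.prob for iv in hist)
    if abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total}, not 1.0")


# The example from the slide.
example = [Interval(0.2, 11, 12), Interval(0.5, 17, 22), Interval(0.3, 26, 29)]
validate(example)
```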
Special cases
•Standard histogram
•Set of points
•Weighted points
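One plausible reading of these special cases is that each reduces to the general form: a point becomes a zero-width interval, and a standard histogram is a list of contiguous bins. The sketch below follows that reading; the function names and the `(prob, lo, hi)` tuple layout are hypothetical.

```python
from collections import Counter


def from_points(points):
    """Set of points: each distinct value x becomes (prob, x, x)."""
    counts = Counter(points)
    n = len(points)
    return [(counts[x] / n, x, x) for x in sorted(counts)]


def from_weighted_points(weighted):
    """Weighted points: a list of (x, weight) pairs with positive weights."""
    merged = Counter()
    for x, w in weighted:
        merged[x] += w
    total = sum(merged.values())
    return [(merged[x] / total, x, x) for x in sorted(merged)]


def from_standard_histogram(edges, counts):
    """Standard histogram: len(counts) + 1 increasing bin edges."""
    total = sum(counts)
    return [(c / total, lo, hi) for c, lo, hi in zip(counts, edges, edges[1:])]


points = [11, 12, 17, 18, 19, 21, 22, 26, 27, 29]
point_hist = from_points(points)
bin_hist = from_standard_histogram([10, 20, 30], [3, 7])
```

Zero-width intervals have no finite density, so a density-based distance measure needs them merged into positive-width intervals first.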
Smoothing problem
Given a generalized histogram, construct its coarser approximation.
Input
• Initial distribution: A point set or a fine-grained histogram
• Distance function: A measure of similarity between distributions
• Target size: The number of intervals in an approximation
Standard distance measures
• Simple difference: ∫ |p(x) − q(x)| dx
• Kullback-Leibler: ∫ p(x) · log(p(x) / q(x)) dx
• Jensen-Shannon: (Kullback-Leibler(p, (p+q)/2) + Kullback-Leibler(q, (p+q)/2)) / 2
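For piecewise-uniform densities the three integrals reduce to finite sums over the common refinement of the interval boundaries. The following is a minimal sketch under stated assumptions: a histogram is a list of `(prob, lo, hi)` tuples with `lo < hi`, the logarithm is natural (the slides do not fix a base), and all function names are illustrative.

```python
import math


def density(hist, x):
    """Piecewise-uniform density at x; zero outside every interval."""
    for prob, lo, hi in hist:
        if lo <= x < hi:
            return prob / (hi - lo)
    return 0.0


def pieces(p, q):
    """Yield (width, p_density, q_density) on each piece of the
    common refinement of the two histograms' boundaries."""
    cuts = sorted({e for _, lo, hi in p + q for e in (lo, hi)})
    for a, b in zip(cuts, cuts[1:]):
        mid = (a + b) / 2
        yield b - a, density(p, mid), density(q, mid)


def simple_difference(p, q):
    return sum(w * abs(dp - dq) for w, dp, dq in pieces(p, q))


def kullback_leibler(p, q):
    total = 0.0
    for w, dp, dq in pieces(p, q):
        if dp == 0:
            continue  # convention: 0 * log(0 / q) = 0
        if dq == 0:
            return math.inf  # p has mass where q has none
        total += w * dp * math.log(dp / dq)
    return total


def jensen_shannon(p, q):
    total = 0.0
    for w, dp, dq in pieces(p, q):
        m = (dp + dq) / 2  # density of the mixture (p+q)/2
        if dp > 0:
            total += 0.5 * w * dp * math.log(dp / m)
        if dq > 0:
            total += 0.5 * w * dq * math.log(dq / m)
    return total


fine = [(0.5, 0, 1), (0.5, 1, 2)]
coarse = [(1.0, 0, 2)]
narrow = [(1.0, 0, 1)]
```

Here `fine` and `coarse` describe the same density, so all three measures give zero. `kullback_leibler(coarse, narrow)` is infinite because `coarse` has mass where `narrow` has none, whereas the Jensen-Shannon measure is finite for every pair.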
Smoothing algorithm
Repeat: Merge two adjacent intervals
Until the histogram has the right size
Interval merging
Before: prob1: [min1 .. max1], prob2: [min2 .. max2]
After: prob1 + prob2: [min1 .. max2]
• For each potential merge, calculate the distance
• Perform the smallest-distance merge
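The merge loop can be sketched as follows. Several details are assumptions rather than the authors' method: the distance is the simple difference between the histogram before and after a merge, intervals are `(prob, lo, hi)` tuples with `lo < hi`, and the best merge is found by a plain linear scan, which gives quadratic time rather than the running time reported for the authors' implementation.

```python
def merge_cost(a, b):
    """Simple difference between two adjacent intervals and their merge.

    Only the region [lo1 .. hi2] changes, so the integral is local.
    """
    (p1, lo1, hi1), (p2, lo2, hi2) = a, b
    merged_density = (p1 + p2) / (hi2 - lo1)
    d1 = p1 / (hi1 - lo1)
    d2 = p2 / (hi2 - lo2)
    gap = lo2 - hi1  # region with zero density before the merge
    return ((hi1 - lo1) * abs(d1 - merged_density)
            + gap * merged_density
            + (hi2 - lo2) * abs(d2 - merged_density))


def smooth(hist, target_size):
    """Merge adjacent intervals until at most target_size remain."""
    hist = sorted(hist, key=lambda iv: iv[1])
    while len(hist) > max(target_size, 1):
        # Evaluate every potential merge and take the cheapest one.
        i = min(range(len(hist) - 1),
                key=lambda k: merge_cost(hist[k], hist[k + 1]))
        (p1, lo1, _), (p2, _, hi2) = hist[i], hist[i + 1]
        hist[i:i + 2] = [(p1 + p2, lo1, hi2)]
    return hist


example = [(0.2, 11, 12), (0.5, 17, 22), (0.3, 26, 29)]
two_intervals = smooth(example, 2)
```

On the three-interval example, merging the second and third intervals costs less than merging the first and second, so `two_intervals` keeps [11 .. 12] and combines the rest into [17 .. 29].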
Smoothing examples: Normal distribution
[Figure panels: 5000 points, 200 intervals, 50 intervals, 10 intervals]
Smoothing examples: Geometric distribution
[Figure panels: 5000 points, 200 intervals, 50 intervals, 10 intervals]
Running time
• Theoretical: O(n · log n)
• Practical: O(n)
Running time
3.4 GHz Pentium, C++ code
(2.5 ± 0.5) · num-points microseconds
[Plot: time (microseconds) versus number of points; both axes are marked at 10^2, 10^4, 10^6]
Visual smoothing
We convert a piecewise-uniform distribution to a smooth curve by spline fitting.
The user usually prefers a smooth probability density.
Main results
• Density estimation
• Lossy compression of generalized histograms
Advantages
• Explicit specification of
  - Distance measure
  - Compression level
• Effective representation for automated reasoning