Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions
Kun Zhang, Wei Fan, Xiaojing Yuan, Ian Davidson, and Xiangshang Li
[Figure: precision-recall curves for models Ma and Mb (x-axis: Recall, y-axis: Precision, both from 0.0 to 1.0), with VE marked]
What this Paper Offers
Application: a more accurate (higher recall and precision) solution to predict “ozone days”.
An interesting and difficult data mining problem:
• High dimensionality with possibly irrelevant features: 72 continuous features, of which 10 are verified by scientists to be relevant
• Skewed class distribution: either 2% or 5% “ozone days”, depending on the “ozone day” criterion (1-hr peak or 8-hr peak)
• Streaming: data collected in the “past” is used to train a model that predicts the “future”
• “Feature sample selection bias”: it is hard to find many days in the training data that are very similar to a day in the future
• Stochastic true model: given the measurable information, the target event sometimes happens and sometimes does not
Key Solution Highlights
• Non-parametric models are easier to use when the “physical or generative mechanism” is unknown.
• Reliable conditional probability estimation under “skewed, high-dimensional, possibly irrelevant features”, …
• Estimate a decision threshold to predict under the unknown distribution of the future.
Seriousness of Ozone Problem
• Ground-level ozone formation is a complex chemical and physical process and is “stochastic” in nature.
• Ozone levels above a certain threshold are harmful to human health and daily life.
Drawbacks of current ozone forecasting systems
• Traditional simulation systems
  • Consume high computational power
  • Customized for a particular location, so solutions are not portable to different places
• Regression-based methods
  • E.g. regression trees, parametric regression equations, and ANN
  • Limited prediction performance
Ozone Level Prediction: Problems we are facing
Daily summary maps of two datasets from the Texas Commission on Environmental Quality (TCEQ)
1. Rather skewed and relatively sparse distribution
• 2500+ examples over 7 years (1998-2004)
• 72 continuous features with missing values
• Huge instance space: if the features were binary and uncorrelated, 2^72 is an astronomical number
• 2% and 5% true positive ozone days for 1-hour and 8-hour peak respectively
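The size of that instance space can be checked with a few lines of arithmetic. This is only a sketch under the simplification stated above (72 binary, uncorrelated features) and the approximate count of 2500 examples; the variable names are illustrative.

```python
# Size of the instance space under the simplifying assumption of
# 72 binary, uncorrelated features.
n_features = 72
n_examples = 2500  # approximate; the data description says "2500+"

instance_space = 2 ** n_features
coverage = n_examples / instance_space

print(f"distinct feature vectors: {instance_space:,}")
print(f"fraction observed by the training data: {coverage:.3e}")
```

Even under this simplification, the training data touches a vanishing fraction of the possible feature vectors, which is why finding past days similar to a future day is hard.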
Challenges as a Data Mining Problem
2. The true model for ozone days is stochastic in nature.
• Given all relevant features Xr, P(Y = “ozone day” | Xr) < 1
• Predictive mistakes are inevitable
3. A large number of irrelevant features
• Only about 10 out of 72 features are verified to be relevant
• No information on the relevancy of the other 62 features
• For a stochastic problem, given irrelevant features Xir, where X = (Xr, Xir), P(Y | X) = P(Y | Xr) only if the data is exhaustive.
• Irrelevant features may introduce overfitting and change the probability distribution represented in the data.
P(Y = “ozone day” | Xr, Xir) → 1, P(Y = “normal day” | Xr, Xir) → 0
4. “Feature sample selection bias”.
• Given 7 years of data and 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future
Given these, 2 closely related challenges:
1. How to train an accurate model
2. How to effectively use a model to predict the future with a different and yet unknown distribution
[Figure: training distribution vs. testing distribution, each showing regions 1, 2, and 3 with positive (+) and negative (-) examples]
Addressing Challenges
Skewed and stochastic distribution:
• Probability distribution estimation
  • Parametric methods
  • Non-parametric methods
• Decision threshold determination through optimization of some given criteria
  • Compromise between precision and recall
Parametric methods:
• Logistic Regression
• Naïve Bayes
• Kernel Methods
• Linear Regression
• RBF
• Gaussian mixture models
Non-parametric methods:
• Decision Trees
• RIPPER rule learner
• CBA: association rule
• clustering-based methods
• … …
[Figure: precision-recall curves for models Ma and Mb (x-axis: Recall, y-axis: Precision)]
Parametric methods: highly accurate if the data is indeed generated from the model you use! But what if you don’t know which one to choose, or you use the wrong one?
Non-parametric methods: use a family of “free-form” functions to “match the data” given some “preference criteria”. They work well when:
• the free-form function/criteria is appropriate
• the preference criteria are appropriate
Reliable probability estimation under irrelevant features
Recall that due to irrelevant features:
P(Y = “ozone day” | Xr, Xir) → 1, P(Y = “normal day” | Xr, Xir) → 0
• Construct multiple models
• Average their predictions
P(“ozone” | Xr): true probability
P(“ozone” | Xr, Xir, θ): probability estimated by model θ
MSE_SingleModel: difference between “true” and “estimated”
MSE_Average: difference between “true” and “average of many models”
Formally show that MSE_Average ≤ MSE_SingleModel
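The inequality can be checked numerically. The sketch below is an illustration, not the formal proof: it assumes a hypothetical true probability `p_true` and a set of simulated model estimates, then compares the mean squared error of the individual models with the squared error of their average. For any finite set of estimates the second quantity cannot exceed the first, because the gap between them equals the variance of the estimates across models.

```python
import random

def mse_single(p_true, estimates):
    """Mean over models of the squared error of each individual estimate."""
    return sum((p_true - p) ** 2 for p in estimates) / len(estimates)

def mse_average(p_true, estimates):
    """Squared error of the averaged estimate."""
    p_avg = sum(estimates) / len(estimates)
    return (p_true - p_avg) ** 2

random.seed(0)
p_true = 0.3  # hypothetical true P("ozone" | Xr)
# Hypothetical estimates from 30 models: the true value perturbed by
# Gaussian noise and clipped to [0, 1].
estimates = [min(1.0, max(0.0, p_true + random.gauss(0.0, 0.25)))
             for _ in range(30)]

single = mse_single(p_true, estimates)
averaged = mse_average(p_true, estimates)
print(f"MSE of a single model (mean over models): {single:.4f}")
print(f"MSE of the averaged prediction:           {averaged:.4f}")
```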
Prediction with feature sample selection bias
A CV-based procedure for decision threshold selection:
1. Run 10-fold cross-validation (10CV) with the algorithm on the training set.
2. Collect the estimated probability values from fold 1, fold 2, ….., fold 10.
3. Concatenate them into a “Probability-TrueLabel” file:

Date      P(y = “ozone day” | x, θ)   Label
7/1/98    0.1316                      Normal
7/2/98    0.6245                      Ozone
7/3/98    0.5944                      Ozone
………

4. The same file ordered by estimated probability:

Date      P(y = “ozone day” | x, θ)   Label
7/1/98    0.1316                      Normal
7/3/98    0.5944                      Ozone
7/2/98    0.6245                      Ozone
………

5. Draw the precision-recall plot and read off the decision threshold VE.

[Figure: precision-recall plot for models Ma and Mb (x-axis: Recall, y-axis: Precision)]
[Figure: training distribution vs. testing distribution, each showing regions 1, 2, and 3 with positive (+) and negative (-) examples]
Addressing Data Mining Challenges
Prediction with feature sample selection bias: future prediction based on the decision threshold selected
• Train model θ on the whole training set
• Classification on future days: if P(Y = “ozone day” | X, θ) ≥ VE, predict “ozone day”
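The cross-validation procedure and the prediction rule can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the concatenated out-of-fold estimates are available as (probability, label) pairs, and it assumes the selection criterion is "the highest threshold that reaches a target recall" (0.6 here, chosen for illustration). The first three rows come from the example table; the other three are hypothetical.

```python
def pr_points(scored):
    """Return (threshold, precision, recall) for each candidate threshold.

    `scored` holds (probability, is_ozone_day) pairs, e.g. the concatenated
    out-of-fold estimates from 10-fold cross-validation.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, is_ozone in ranked if is_ozone)
    if total_pos == 0:
        raise ValueError("no positive examples, recall is undefined")
    points, tp = [], 0
    for i, (prob, is_ozone) in enumerate(ranked, start=1):
        tp += 1 if is_ozone else 0
        # Emit one point per distinct probability (handles ties).
        last_of_tie = i == len(ranked) or ranked[i][0] != prob
        if last_of_tie:
            points.append((prob, tp / i, tp / total_pos))
    return points

def select_threshold(points, target_recall):
    """Highest threshold whose recall reaches the target (assumed criterion)."""
    for threshold, precision, recall in points:
        if recall >= target_recall:
            return threshold, precision, recall
    raise ValueError("target recall is not reachable")

def predict(prob, threshold):
    """Decision rule for a future day."""
    return "ozone day" if prob >= threshold else "normal day"

scored = [
    (0.1316, False), (0.6245, True), (0.5944, True),  # rows from the table
    (0.40, False), (0.35, True), (0.05, False),       # hypothetical rows
]
threshold, precision, recall = select_threshold(pr_points(scored), target_recall=0.6)
print(threshold, precision, recall)
print(predict(0.62, threshold), predict(0.13, threshold))
```

Because the threshold is chosen from held-out estimates rather than from training-set fits, it reflects how the model behaves on days it has not seen.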
Probabilistic Tree Models
• Single tree estimators
  • C4.5 (Quinlan’93): C4.5Up, C4.5P
  • C4.4 (Provost’03)
• Ensembles
  • RDT (Fan et al’03): member trees trained randomly; average probability
  • Bagging Probabilistic Tree (Breiman’96): bootstrap; compute probability; member tree: C4.5, C4.4

RDT: Random Decision Tree (Fan et al’03)
• “Encoding data” in trees.
• At each node, an un-used feature is chosen randomly
  • A discrete feature is un-used if it has never been chosen previously on a given decision path starting from the root to the current node.
  • A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen
• Stop when one of the following happens:
  • A node becomes too small (<= 3 examples)
  • Or the total height of the tree exceeds some limit
• Different from Random Forest:
  1. Original data vs. bootstrap
  2. Random pick vs. random subset + info gain
  3. Probability averaging vs. voting
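A toy version of the random-tree idea can make the description concrete. This is a simplified sketch, not the RDT implementation from Fan et al.: it handles continuous features only, builds the structure and fills leaf counts in a single pass, turns a degenerate split into a leaf, and uses a hypothetical depth limit (`MAX_DEPTH`) because the slides do not state one.

```python
import random

MIN_NODE_SIZE = 3  # stop when a node has <= 3 examples
MAX_DEPTH = 8      # hypothetical height limit, not given in the slides

def build_tree(rows, labels, rng, depth=0):
    """Grow one random tree; feature and threshold are picked at random,
    with no information gain. Leaves store class counts."""
    if len(rows) <= MIN_NODE_SIZE or depth >= MAX_DEPTH:
        return {"pos": sum(labels), "n": len(labels)}
    feature = rng.randrange(len(rows[0]))
    threshold = rng.choice(rows)[feature]  # random value seen at this node
    left = [i for i, r in enumerate(rows) if r[feature] < threshold]
    right = [i for i, r in enumerate(rows) if r[feature] >= threshold]
    if not left:  # degenerate split: keep the node as a leaf (simplification)
        return {"pos": sum(labels), "n": len(labels)}
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left],
                           [labels[i] for i in left], rng, depth + 1),
        "right": build_tree([rows[i] for i in right],
                            [labels[i] for i in right], rng, depth + 1),
    }

def tree_probability(tree, x):
    """Fraction of positive training examples in the leaf that x falls into."""
    while "feature" in tree:
        go_left = x[tree["feature"]] < tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["pos"] / tree["n"]

def rdt_probability(trees, x):
    """Average the member trees' probabilities (no voting)."""
    return sum(tree_probability(t, x) for t in trees) / len(trees)

# Toy data: feature 0 determines the label, feature 1 is irrelevant.
rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [1 if r[0] > 0.7 else 0 for r in rows]
# Every tree sees the original data; no bootstrap sampling.
trees = [build_tree(rows, labels, rng) for _ in range(30)]

p_high = rdt_probability(trees, [0.95, 0.5])
p_low = rdt_probability(trees, [0.05, 0.5])
print(f"P(positive | x0=0.95) ~ {p_high:.2f}")
print(f"P(positive | x0=0.05) ~ {p_low:.2f}")
```

The three contrasts with Random Forest show up directly: all trees use the original data, splits are picked at random, and the ensemble output is an averaged probability.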
Optimal Decision Boundary
from Tony Liu’s thesis (supervised by Kai Ming Ting)
Baseline Forecasting Parametric Model
O3 = Upwind + EmFactor × (Tmax - Tb) × SRd / (WSa × 0.1 + WSp × 0.5 + 1)
in which,
• O3 - Local ozone peak prediction
• Upwind - Upwind ozone background level
• EmFactor - Precursor emissions related factor
• Tmax - Maximum temperature in degrees F
• Tb - Base temperature where net ozone production begins (50 F)
• SRd - Solar radiation total for the day
• WSa - Wind speed near sunrise (using 09-12 UTC forecast mode)
• WSp - Wind speed mid-day (using 15-21 UTC forecast mode)
Model evaluation criteria
• Precision and Recall
  • At the same recall level, Ma is preferred over Mb if the precision of Ma is consistently higher than that of Mb
• Coverage under the PR curve, like AUC
[Figure: precision-recall curves for models Ma and Mb (x-axis: Recall, y-axis: Precision, both from 0.0 to 1.0)]
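One way to turn "coverage under the PR curve" into a number is sketched below. The slides do not say how the coverage is computed, so this is an assumption: a step-wise sum (precision at each retrieved positive times the recall step), optionally restricted to a recall window such as [0.4, 0.6]. The labels and the scores for the two models Ma and Mb are hypothetical.

```python
def coverage(scored, lo=0.0, hi=1.0):
    """Step-wise area under the PR curve for recall in (lo, hi].

    `scored` holds (score, is_positive) pairs. Each positive example
    contributes precision-at-that-rank times the recall step 1/total_pos.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    if total_pos == 0:
        raise ValueError("no positive examples, recall is undefined")
    area, tp = 0.0, 0
    for rank, (_, is_pos) in enumerate(ranked, start=1):
        if not is_pos:
            continue
        tp += 1
        recall = tp / total_pos
        if lo < recall <= hi:
            area += (tp / rank) / total_pos
    return area

# Hypothetical labels and scores for two models on the same eight days.
labels = [True, True, False, True, False, True, False, False]
ma_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
mb_scores = [0.9, 0.7, 0.8, 0.4, 0.6, 0.3, 0.5, 0.2]

ma = list(zip(ma_scores, labels))
mb = list(zip(mb_scores, labels))

print("full coverage:  ", coverage(ma), coverage(mb))
print("recall 0.4-0.6: ", coverage(ma, 0.4, 0.6), coverage(mb, 0.4, 0.6))
```

With these toy scores Ma ranks the positives earlier than Mb, so its coverage is larger both overall and inside the recall window.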
Some Coverage Results
8-hour: recall = [0.4, 0.6]
[Figure: bar chart of coverage under the PR curve for BC4.4, RDT, C4.4, and Para; y-axis from 0 to 0.09]
Some “Action” Results
Annual test:
1. Previous years’ data for training
2. Next year for testing
3. Repeated 6 times using 7 years of data
Findings:
1. C4.4 best among single trees
2. BC4.4 and RDT best among tree ensembles
3. BC4.4 and RDT more accurate than baseline Para
4. BC4.4 and RDT “less surprise” than single tree
Thresholds:
• 8-hour: thresholds selected at the recall = 0.6
• 1-hour: thresholds selected at the recall = 0.6
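The annual test protocol can be written as a small generator. The sketch assumes an expanding window, that is, all previous years are used for training each time; the slides say only "previous years' data", so this reading is an assumption, and the function name is illustrative.

```python
def annual_splits(years):
    """Yield (training_years, test_year): train on all earlier years,
    test on the next one."""
    ordered = sorted(years)
    for i in range(1, len(ordered)):
        yield ordered[:i], ordered[i]

splits = list(annual_splits(range(1998, 2005)))  # 1998-2004, 7 years
for train, test in splits:
    print(f"train on {train[0]}-{train[-1]}, test on {test}")
```

Seven years of data yield six train/test rounds, matching the "repeated 6 times" in the protocol.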
[Figure: two bar charts of recall and precision for BC4.4, RDT, C4.4, and Para; y-axes from 0 to 0.6 and from 0 to 0.7]
Summary
• Procedures to formulate as a data mining problem
• Analysis of the combination of technical challenges
• Process to search for the most suitable solutions
• Model averaging of probability estimators can effectively approximate the true probability under:
  • A lot of irrelevant features
  • Feature sample selection bias
• A CV-based guide for decision threshold determination for stochastic problems under sample selection bias
Choosing the Appropriate PET: come to our other talk, 10:30, RM 402
[Figure: flowchart for choosing a PET. Given a dataset, signal-noise separability is estimated through RDT or BPET and split on an AUC score of 0.9 into low and high signal-noise separability. Later branches choose between ensembles (RDT, BPET, CFT) and single trees (C4.5 or C4.4, CFT) according to the evaluation metric (AUC, MSE, error rate) and the feature types and value characteristics (categorical features with limited values vs. continuous features or categorical features with a large number of values).]
Thank you!
Questions?