Machine Learning in Performance Management
Irina Rish, IBM T.J. Watson Research Center
January 24, 2001
Outline
Introduction
Machine learning applications in Performance Management
Bayesian learning tools: extending ABLE
Advancing theory
Summary and future directions
Pattern discovery, classification, diagnosis and prediction

Learning problems: examples

System event mining
[Figure: events from many hosts plotted over time]

End-user transaction recognition
[Figure: a stream of Remote Procedure Calls (RPCs), e.g. R2, R1, R2, R2, R3, R5, R5, segmented into Transaction 1 (OPEN_DB? SEARCH?) and Transaction 2 (BUY? SELL?)]
Approach: Bayesian learning

Learn (probabilistic) dependency models: Bayesian networks.
[Figure: a Bayesian network over nodes S, C, B, X, D with CPDs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)]

- Diagnosis: P(cause | symptom) = ?
- Prediction: P(symptom | cause) = ?
- Pattern classification: max over classes of P(class | data) = ?

Numerous important applications: medicine, stock market, bio-informatics, eCommerce, military, ...
Outline
Introduction
Machine-learning applications in Performance Management
Transaction Recognition
In progress: Event Mining; Probe Placement; etc.
Bayesian learning tools: extending ABLE
Advancing theory
Summary and future directions
End-User Transaction Recognition: why is it important?

[Figure: a client workstation issues End-User Transactions (EUTs), which travel as Remote Procedure Calls (RPCs) over a session (connection) to a server (Web, DB, Lotus Notes); EUTs such as OpenDB, Search, SendMail map to sequences of RPCs]

- Realistic workload models (for testing performance)
- Resource management (anticipating requests)
- Quantifying end-user perception of performance (response times)

Examples: Lotus Notes, Web/eBusiness (on-line stores, travel agencies, trading): database transactions, buy/sell, search, email, etc.
Why is it hard? Why learn from data?

Example: EUTs and RPCs in Lotus Notes
[Figure: the EUTs MoveMsgToFolder and FindMailByKey map to the RPC sequence:]
1. OPEN_COLLECTION
2. UPDATE_COLLECTION
3. DB_REPLINFO_GET
4. GET_MOD_NOTES
5. READ_ENTRIES
6. OPEN_COLLECTION
7. FIND_BY_KEY
8. READ_ENTRIES

- Many RPC and EUT types (92 RPCs and 37 EUTs)
- Large (unlimited) data sets (10,000+ transaction instances); manual classification of a data subset took about a month
- Non-deterministic and unknown EUT-to-RPC mapping: "noise" sources include client/server states
- No client-side instrumentation, so EUT boundaries are unknown
Our approach: classification + segmentation

Problem 1: label segmented data (classification), similar to text classification.
[Figure: segmented RPC sequences, e.g. (1 2 3 4), (1 3), (1), each labeled with a transaction type Tx1, Tx2, Tx3]

Problem 2: both segment and label (EUT recognition), similar to speech understanding and image segmentation.
[Figure: an unsegmented RPC stream, e.g. 1 2 1 3 4 1 2 3, split into segments with each segment labeled Tx1, Tx2, or Tx3]
How to represent transactions? "Feature vectors"

For a transaction of type i over RPC types j = 1, ..., M (example RPC stream: R2, R1, R2, R2, R3, R5, R5):
- RPC occurrences: f = (1, 1, 1, 0, 1, 0, ...)
- RPC counts: f = (1, 3, 1, 0, 2, 0, ...)

Candidate feature distributions (reconstructed from the slide's equations):
- Bernoulli: P(R_ij = 1 | T_i) = p_ij
- Multinomial: P(n_i1, ..., n_iM | T_i) = n_i! / (prod_j n_ij!) * prod_j p_ij^n_ij
- Geometric: P(n_ij | T_i) = p_ij (1 - p_ij)^n_ij
- Shifted geometric: P(n_ij | T_i) = p_ij (1 - p_ij)^(n_ij - s_ij)

Best fit to the data (chi-squared test): shifted geometric.
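The two feature vectors above can be sketched in a few lines (the RPC names and the six-symbol vocabulary are illustrative, not the actual 92-RPC Lotus Notes alphabet):

```python
# Sketch: build the "RPC count" and "RPC occurrence" feature vectors for one
# transaction, given a fixed RPC vocabulary (hypothetical names).
def feature_vectors(rpc_sequence, vocabulary):
    counts = [rpc_sequence.count(r) for r in vocabulary]   # n_ij: how often RPC j occurs
    occurrences = [1 if c > 0 else 0 for c in counts]      # R_ij: whether RPC j occurs at all
    return counts, occurrences

vocab = ["R1", "R2", "R3", "R4", "R5", "R6"]
stream = ["R2", "R1", "R2", "R2", "R3", "R5", "R5"]  # the example RPC stream above
print(feature_vectors(stream, vocab))  # ([1, 3, 1, 0, 2, 0], [1, 1, 1, 0, 1, 0])
```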
Classification scheme

Training phase:
training data (RPCs labeled with EUTs) -> Feature Extraction -> Learning -> Classifier

Operation phase:
"test" data (unlabeled RPCs) -> Feature Extraction -> Classification -> EUTs
Our classifier: naïve Bayes (NB)

[Figure: a class node EUT with prior P(EUT) pointing to feature nodes f_1, f_2, ..., f_n with CPDs P(f_1|EUT), P(f_2|EUT), ..., P(f_n|EUT)]

Features: RPC occurrences or RPC counts.

1. Training: estimate the parameters P(f_i | EUT) and P(EUT) (e.g., ML-estimates).
2. Classification: given an (unlabeled) instance (f_1, ..., f_n), choose the most likely class:
   EUT* = argmax over EUTs of P(EUT | f_1, ..., f_n)   (Bayesian decision rule)

Simplifying ("naïve") assumption: feature independence given the class.
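A minimal sketch of this training/classification scheme, using multinomial count features with add-one smoothing (the smoothing and the toy labels TxA/TxB are assumptions for illustration, not the paper's exact estimator):

```python
import math

# Naive Bayes over RPC-count features: ML-style estimates with add-one
# smoothing, then the Bayesian decision rule argmax_c P(c) * prod_j P(f_j|c)^n_j.
class NaiveBayes:
    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.prior = {c: y.count(c) / len(y) for c in self.classes}
        n_feat = len(X[0])
        self.p = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            totals = [sum(r[j] for r in rows) + 1 for j in range(n_feat)]  # add-one smoothing
            z = sum(totals)
            self.p[c] = [t / z for t in totals]   # multinomial P(RPC j | class c)
        return self

    def predict(self, x):
        # log-domain score: log P(class) + sum_j n_j * log P(f_j | class)
        def score(c):
            return math.log(self.prior[c]) + sum(n * math.log(pj) for n, pj in zip(x, self.p[c]))
        return max(self.classes, key=score)

X = [[3, 0, 1], [2, 0, 2], [0, 3, 0], [0, 2, 1]]   # toy RPC-count vectors
y = ["TxA", "TxA", "TxB", "TxB"]                   # hypothetical EUT labels
nb = NaiveBayes().fit(X, y)
print(nb.predict([2, 0, 1]))  # TxA
```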
Classification results on Lotus CoC data

[Figure: accuracy vs. training-set size; NB with Bernoulli, multinomial, or geometric features reaches 87.2%, NB with shifted geometric reaches 79.3%, and the baseline stays at 10.1%]

- Significant improvement over the baseline classifier (75%)
- NB is simple, efficient, and comparable to state-of-the-art classifiers: SVM 85-87%, decision tree 90-92%
- The best-fit distribution (shifted geometric) is not necessarily the best classifier! (?)

Baseline classifier: always selects the most frequent transaction.
Transaction recognition: segmentation + classification

Objective: the most probable segmentation
V* = argmax_V P(V | R_1, ..., R_n) = argmax_V P(V, R_1, ..., R_n)

Segmentation task: find the boundaries V = (i_1, ..., i_m) that split the RPC stream R_1, ..., R_n into segments r_i1 ... r_(i2 - 1), r_i2 ..., and so on.

Dynamic programming (Viterbi search), with the (recursive) DP equation (reconstructed)
alpha_k = max over 0 <= j < k and transaction types T of alpha_j * P(T, R_j+1, ..., R_k),   alpha_0 = 1,
where each segment likelihood P(T, R_j+1, ..., R_k) comes from the naïve Bayes classifier.
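The DP recursion can be sketched as below; seg_score is a hypothetical stand-in for the naïve-Bayes segment likelihood P(T, R_j+1, ..., R_k), and the toy patterns are made up for illustration:

```python
# Viterbi-style segmentation: alpha_k = max over split points j < k and
# transaction types T of alpha_j * seg_score(T, rpcs[j:k]), with backtracking
# to recover the labeled segments. Plain products for clarity (log-domain
# would be safer for long streams).
def segment(rpcs, types, seg_score):
    n = len(rpcs)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    back = [None] * (n + 1)
    for k in range(1, n + 1):
        best = None
        for j in range(k):
            for t in types:
                s = alpha[j] * seg_score(t, rpcs[j:k])
                if best is None or s > best[0]:
                    best = (s, j, t)
        alpha[k], back[k] = best[0], (best[1], best[2])
    # backtrack from the end of the stream to recover segments and labels
    segments, k = [], n
    while k > 0:
        j, t = back[k]
        segments.append((t, rpcs[j:k]))
        k = j
    return list(reversed(segments))

patterns = {"A": [1, 2], "B": [3]}   # toy transaction signatures (hypothetical)
def toy_score(t, seg):
    return 0.9 if seg == patterns[t] else 1e-6

print(segment([1, 2, 3, 1, 2], ["A", "B"], toy_score))
# [('A', [1, 2]), ('B', [3]), ('A', [1, 2])]
```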
Transaction recognition results

Model            Classification   Segmentation
Multinomial      best             third best
Geometric        best             fourth best
Shifted geom.    worst            best
Bernoulli        best             second best

[Figure: recognition accuracy vs. training-set size; shifted geometric reaches 64%, ahead of multinomial, geometric, and Bernoulli]

- Good EUT recognition accuracy: 64% (a harder problem than classification!)
- Reversed order of results: the best classifier is not necessarily the best recognizer! (?) Further research!
EUT recognition: summary

A novel approach: learning EUTs from RPCs
- Patent, conference paper (AAAI-2000), prototype system
- Successful results on Lotus Notes data (Lotus CoC):
  - Classification: naïve Bayes (up to 87% accuracy)
  - EUT recognition: Viterbi + Bayes (up to 64% accuracy)

Work in progress:
- Better feature selection (RPC subsequences?)
- Selecting the "best classifier" for the segmentation task
- Learning more sophisticated classifiers (Bayesian networks)
- Information-theoretic approach to segmentation (MDL)
Outline
Introduction
Machine-learning applications in Performance Management
Transaction Recognition
In progress: Event Mining; Probing Strategy; etc.
Bayesian learning tools: extending ABLE
Advancing theory
Summary and future directions
Event Mining: analyzing system event sequences

What is it? Why is it important? Learning system behavior patterns for better performance management.

Example: USAA data
- 858 hosts, 136 event types
- 67,184 data points (13 days, by second)
- Event examples:
  - High-severity events: 'Cisco_Link_Down', 'chassisMinorAlarm_On', etc.
  - Low-severity events: 'tcpConnectClose', 'duplicate_ip', etc.

[Figure: events from many hosts plotted over time (sec)]

Why is it hard?
- Large, complex systems (networks) with many dependencies; prior models are not always available
- Many events and hosts; data sets are huge and constantly growing
1. Learning event dependency models

[Figure: events Event1, Event2, ..., EventM, ..., EventN with unknown dependencies among them]

Current approach: learn dynamic probabilistic graphical models (temporal, or dynamic, Bayes nets). Predict:
- time to failure
- event co-occurrence
- the existence of hidden nodes ("root causes")

Recognize sequences of high-level system states: an unsupervised version of the EUT recognition problem.

Important issue: incremental learning from data streams.
2. Clustering hosts by their history

["Problematic" hosts vs. "silent" hosts]

Group hosts with similar event sequences: what is an appropriate similarity ("distance") metric?

One example: the distance between "compressed" sequences, i.e., event distribution models. Each sequence S_i is summarized by a distribution P_i(e) over event types e, and

dist(S_1, S_2) = D(P_1 || P_2), where D(P_1 || P_2) = sum over e of P_1(e) log(P_1(e) / P_2(e))

is the relative entropy (Kullback-Leibler distance).
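A sketch of this host distance: compress each event sequence to an event-type distribution, then take the relative entropy. The add-one smoothing over the shared alphabet is an assumption (KL is undefined when P_2(e) = 0 but P_1(e) > 0):

```python
import math
from collections import Counter

# "Compress" an event sequence into a smoothed distribution over event types.
def event_distribution(seq, alphabet):
    c = Counter(seq)
    total = len(seq) + len(alphabet)            # add-one smoothing (assumed)
    return {e: (c[e] + 1) / total for e in alphabet}

# Relative entropy D(P1 || P2) = sum_e P1(e) log(P1(e) / P2(e)).
def kl(p1, p2):
    return sum(p1[e] * math.log(p1[e] / p2[e]) for e in p1)

def host_distance(seq1, seq2):
    alphabet = sorted(set(seq1) | set(seq2))
    return kl(event_distribution(seq1, alphabet), event_distribution(seq2, alphabet))
```

Identical histories get distance 0, and the distance grows as the event mixes diverge; note KL is asymmetric, so a symmetrized variant may be preferable for clustering.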
Probing strategy (EPP)

Objective: find the probe frequency F that minimizes
1. E(T_probe - T_start): failure-detection delay, or
2. E(total "failure" time - total "estimated" failure time): an accurate performance estimate,
subject to a constraint on the additional load induced by probes: L(F) < MaxLoad.

[Figure: response time vs. time; availability violations occur where the response time exceeds R_max, and probes must catch the violation intervals (s_1, e_1), (s_2, e_2)]
Outline
Introduction
Machine-learning applications in Performance Management
Bayesian learning tools: extending ABLE
Advancing theory
Summary and future directions
ABLE: Agent Building and Learning Environment
What is ABLE? What is my contribution?

- A Java toolbox for building reasoning and learning agents
- Provides: a visual environment, boolean and fuzzy rules, neural networks, genetic search
- My contributions: naïve Bayes classifier (batch and incremental), discretization
- Future releases: general Bayesian learning and inference tools

Available at alphaWorks: www.alphaWorks.ibm.com/tech
Project page: w3.rchland.ibm.com/projects/ABLE

How does it work?
Who is using the naïve Bayes tools? Impact on other IBM projects

Video character recognition (w/ C. Dorai):
- Naïve Bayes: 84% accuracy
- Better than SVM on some pairs of characters (average SVM = 87%)
- Current work: combining naïve Bayes with SVMs

Environmental data analysis (w/ Yuan-Chi Chang):
- Learning mortality rates using data on air pollutants
- Naïve Bayes is currently being evaluated

Performance management:
- Event mining: in progress
- EUT recognition: successful results
Outline
Introduction
Machine-learning in Performance Management
Bayesian learning tools: extending ABLE
Advancing theory
analysis of naïve Bayes classifier
inference in Bayesian Networks
Summary and future directions
Why does naïve Bayes do well? And when?

[Figure: a class node with feature nodes f_1, f_2, ..., f_n]

Class-conditional feature independence:
P^(f | class) = prod_j P(f_j | class)

An unrealistic assumption! But why/when does it work? When do the independence assumptions not hurt classification?

Intuition: wrong probability estimates do not necessarily imply wrong classification!
[Figure: the true P(class | f) and the NB estimate P^(class | f) can differ yet produce the same decision]

- Bayes-optimal: class_opt = argmax_i P(class_i | f)
- Naïve Bayes: class_NB = argmax_i P^(class_i | f)
Case 1: functional dependencies

Lemma 1: Naïve Bayes is optimal when features are functionally dependent given the class.

Proof sketch (reconstructed from the slide): Let C in {+, -}. Assume:
1. uniform priors: P(C = +) = P(C = -), and
2. functional dependence: x_i = f_i(x_1) for i = 1, ..., n and for both classes, with each f_i one-to-one.

Then P(x_1, ..., x_n | c) = P(x_1 | c) for each class c, so the Bayes-optimal rule decides by comparing P(x_1 | +) with P(x_1 | -). The naïve Bayes rule compares prod_i P(x_i | +) with prod_i P(x_i | -); since each f_i is one-to-one, P(x_i | c) = P(x_1 | c), so the products are P(x_1 | +)^n and P(x_1 | -)^n, and
P(x_1 | +)^n >= P(x_1 | -)^n   if and only if   P(x_1 | +) >= P(x_1 | -).
Hence both rules make the same decision.
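A numeric sanity check of Lemma 1 (a sketch under its stated assumptions: uniform priors and a one-to-one feature mapping; the class-conditional distributions are random, not from the paper):

```python
import random

# Lemma 1 check: with x2 a one-to-one function of x1 given the class, naive
# Bayes and the Bayes-optimal rule pick the same class for every instance.
random.seed(0)
vals = [0, 1, 2]
p = {c: [random.random() for _ in vals] for c in ("+", "-")}
for c in p:                                   # normalize P(x1 | class)
    z = sum(p[c])
    p[c] = [v / z for v in p[c]]

f2 = {0: 2, 1: 0, 2: 1}                       # one-to-one mapping x2 = f2(x1)

for x1 in vals:
    x2 = f2[x1]
    # Bayes-optimal (uniform priors): the true joint P(x1, x2 | c) is just
    # P(x1 | c), since x2 is determined by x1.
    opt = max("+-", key=lambda c: p[c][x1])
    # Naive Bayes: product of marginals P(x1|c) * P(x2|c); because f2 is
    # one-to-one, P(x2|c) = P(x1|c), so the product is P(x1|c)^2.
    nb = max("+-", key=lambda c: p[c][x1] * p[c][x1])
    assert opt == nb
print("naive Bayes matched Bayes-optimal on all instances")
```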
Case 2: "almost-functional" (low-entropy) distributions

[Figure: a low-entropy distribution P(f_i | class) concentrated at a value a_i, with small deviation delta]

Lemma 2: Naïve Bayes is a "good approximation" for "almost-functional" dependencies:
joint ≈ product of marginals (the independence assumption).

Formally (reconstructed): if P(f_i = a_i) >= 1 - delta, or P(f_i != a_i) >= 1 - delta, for i = 1, ..., n, then
| P(f_1 = a_1, ..., f_n = a_n) - prod_i P(f_i = a_i) | <= n * delta.

Related practical examples:
- RPC occurrences in EUTs are often almost-deterministic (and NB does well)
- Successful "local inference" in almost-deterministic Bayesian networks (turbo coding, "mini-buckets"; see Dechter & Rish 2000)
Experimental results support the theory

Random problem generator: uniform P(class); random P(f | class):
1. A randomly selected entry in P(f | class) is assigned the value 1 - delta
2. The remaining entries: uniform random sampling + normalization

Findings:
1. Less "noise" (smaller delta) => NB closer to optimal
2. Feature dependence does NOT correlate with NB error
Outline
Introduction
Machine-learning in Performance Management
  Transaction Recognition
  Event Mining
Bayesian learning tools: extending ABLE
Advancing theory
  analysis of naïve Bayes classifier
  inference in Bayesian Networks
Summary and future directions
From naïve Bayes to Bayesian networks

Naïve Bayes model: independent features given the class.
[Figure: a class node with CPDs P(f_1|class), P(f_2|class), ..., P(f_n|class) over feature nodes f_1, f_2, ..., f_n]

Bayesian network (BN) model: any joint probability distribution. Example:
[Figure: Smoking (S) is a parent of lung Cancer (C) and Bronchitis (B); C and S are parents of X-ray (X); C and B are parents of Dyspnoea (D)]

P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

CPD for Dyspnoea, P(D | C, B):

C  B  | D=0  D=1
0  0  | 0.1  0.9
0  1  | 0.7  0.3
1  0  | 0.8  0.2
1  1  | 0.9  0.1

Query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
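The query can be answered by brute-force enumeration over the factored joint. Only the P(D|C,B) table comes from the slide; P(S), P(C|S), and P(B|S) below are made-up illustrative numbers, and X is omitted because it is unobserved and sums out of the query:

```python
# Enumeration inference on the lung-cancer network:
# P(S,C,B,D) = P(S) P(C|S) P(B|S) P(D|C,B)  (X marginalizes to 1).
P_S1 = 0.3                      # P(S=1): assumed
P_C1_S = {0: 0.01, 1: 0.05}     # P(C=1 | S): assumed
P_B1_S = {0: 0.05, 1: 0.25}     # P(B=1 | S): assumed
P_D1_CB = {(0, 0): 0.9, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}  # from the slide

def joint(s, c, b, d):
    ps = P_S1 if s else 1 - P_S1
    pc = P_C1_S[s] if c else 1 - P_C1_S[s]
    pb = P_B1_S[s] if b else 1 - P_B1_S[s]
    pd = P_D1_CB[(c, b)] if d else 1 - P_D1_CB[(c, b)]
    return ps * pc * pb * pd

# P(C=1 | S=0, D=1): sum out B in numerator, C and B in denominator.
num = sum(joint(0, 1, b, 1) for b in (0, 1))
den = sum(joint(0, c, b, 1) for c in (0, 1) for b in (0, 1))
print(round(num / den, 4))
```

This enumeration is exponential in the number of unobserved variables, which is exactly why the approximate-inference work later in the talk matters.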
Example: Printer Troubleshooting (Microsoft Windows 95) [Heckerman, 95]

[Figure: a Bayesian network whose nodes include: Print Output OK, Correct Driver, Uncorrupted Driver, Correct Printer Path, Net Cable Connected, Net/Local Printing, Printer On and Online, Correct Local Port, Correct Printer Selected, Local Cable Connected, Application Output OK, Print Spooling On, Correct Driver Settings, Printer Memory Adequate, Network Up, Spooled Data OK, GDI Data Input OK, GDI Data Output OK, Print Data OK, PC to Printer Transport OK, Printer Data OK, Spool Process OK, Net Path OK, Local Path OK, Paper Loaded, Local Disk Space Adequate]
How to use Bayesian networks?

[Figure: a network of causes C_1, C_2 and their symptoms]

- Diagnosis: P(cause | symptom) = ?
- Prediction: P(symptom | cause) = ?
- Classification: max over classes of P(class | data) = ?
- MEU decision-making (given a utility function)

Applications: medicine, stock market, bio-informatics, eCommerce, performance management, etc.

These are NP-complete inference problems, hence approximate algorithms.
Local approximation scheme "mini-buckets" (paper submitted to JACM)

Idea: reduce the complexity of inference by ignoring some dependencies.

Successfully used for approximating the Most Probable Explanation (MPE), x* = argmax_x P(x); very efficient on real-life (medical, decoding) and synthetic problems, and yields both lower and upper bounds on the MPE, hence an accuracy guarantee.

[Figure: approximation accuracy vs. noise: less "noise" => higher accuracy, similarly to naïve Bayes!]

General theory needed: independence assumptions and "almost-deterministic" distributions.

Potential impact: efficient inference in complex performance-management models (e.g., event mining, system dependence models).
Summary

Performance management:
- End-user transaction recognition (Lotus CoC): novel method, patent, paper; applied to Lotus Notes
- In progress: event mining (USAA), probing strategies (EPP)

Machine-learning tools (alphaWorks):
- Extending ABLE with a Bayesian classifier
- Applying the classifier to other IBM projects: video character recognition, environmental data analysis

Theory and algorithms:
- Analysis of naïve Bayes accuracy (Research Report)
- Approximate Bayesian inference (submitted paper)
- Patent on meta-learning
Future directions

Research interest: automated learning and inference, spanning practical problems, generic tools, and theory.

Practical problems / performance management:
- Transaction recognition: better feature selection, segmentation
- Event mining: Bayes net models, clustering
- Web log analysis: segmentation / classification / clustering
- Modeling system dependencies: Bayes nets
- "Technology transfer": a generic approach to "event streams" (EUTs, system events, web page accesses)

Generic tools / ML library / ABLE:
- Bayesian learning: general Bayes nets, temporal BNs, incremental learning
- Bayesian inference: exact inference, approximations
- Other tools: SVMs, decision trees; combined tools, meta-learning tools

Theory / analysis of algorithms:
- Naïve Bayes accuracy: other distribution types
- Accuracy of local inference approximations
- Comparing model selection criteria (e.g., Bayes net learning)
- Relative analysis and combination of classifiers (Bayes / max-margin / DT)
- Incremental learning
Collaborations
- Transaction recognition: J. Hellerstein, T. Jayram (Watson)
- Event mining: J. Hellerstein, R. Vilalta, S. Ma, C. Perng (Watson)
- ABLE: J. Bigus, R. Vilalta (Watson)
- Video character recognition: C. Dorai (Watson)
- MDL approach to segmentation: B. Dom (Almaden)
- Approximate inference in Bayes nets: R. Dechter (UCI)
- Meta-learning: R. Vilalta (Watson)
- Environmental data analysis: Y. Chang (Watson)
Machine learning discussion group

- Weekly seminars: 11:30-2:30 (w/ lunch) in 1S-F40
- Active group members: Mark Brodie, Vittorio Castelli, Joe Hellerstein, Daniel Oblinger, Jayram Thathachar, Irina Rish (more people joined recently)
- Agenda: discussions of recent ML papers and book chapters ("Pattern Classification" by Duda, Hart, and Stork, 2000); brain-storming sessions on particular ML topics
- Recent discussions: accuracy of Bayesian classifiers (naïve Bayes)
- Web site: http://reswat4.research.ibm.com/projects/mlreadinggroup/mlreadinggroup.nsf/main/toppage