Automated Cellular Root Cause Analysis
Sayandeep Sen, Bell Labs India
Joint work with Sourjya Bhaumik & Rijin John
Cellular Base Station Monitoring
[Figure: cell sites send performance counters to the Monitoring Centre every 15 minutes]
Performance counters. Example: connected users, average signal strength, cell radius, etc.
Cellular Base Station Monitoring
KPI: Key Performance Indicator. Example: call drop rate, successful connection setup rate, throughput
[Figure: cell sites send KPIs to the Monitoring Centre every 15 minutes]
Root cause analysis
[Figure: cell sites send KPIs and performance counters to the Monitoring Centre]
Why did the KPI go below the threshold?
Answered manually
Root Cause Analysis – Issues
[Figure: time series of the KPI and of Parameter 1 … Parameter N]
Too many variables
• ~300 parameters
• 1 engineer per O(100) cell sites
Manual debugging is inefficient
Sporadic parameter dips
Multiple parameter interaction
Problem Statement
Carry out automated (fast) root cause analysis that accounts for sporadic dips and multiple parameter interactions while producing human-readable output.
Outline
• Motivation
• Problem statement
• Approach
• Insight, Mechanism, Customizations
• Results
• Ongoing work
• Other work
Key Intuition
KPI-parameter relationship is dependent on other parameter values
[Figure: Call Success against Conn. Req. and Handoff rate, with a threshold on Call Success]
Conn. Req. > X & H/o = y
Conn. Req. > X’ & H/o = y’
Determine the rules for various parameter combination values using regression trees
Regression trees
[Figure: Call Success scatter plot; a cluster with metric Δ is divided into sub-clusters with metrics Δ’ and Δ”]
Form clusters of points
To minimize the sum of the distance metric over sub-clusters
Distance metric: sum of Euclidean distances of the points in a sub-cluster
Provide a human-readable rule for each cluster
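The distance metric can be illustrated with a minimal Python sketch. The slides do not say what the distances are measured against, so this sketch assumes each point's distance is taken from the sub-cluster mean of the KPI (standard regression trees usually use squared deviations instead). The names `delta`, `split_cost` and `conn_req` are hypothetical.

```python
from statistics import mean

def delta(kpi_values):
    """Sum of distances of the KPI values from their cluster mean (assumed metric)."""
    if not kpi_values:
        return 0.0
    centre = mean(kpi_values)
    # In one dimension the Euclidean distance is the absolute difference.
    return sum(abs(v - centre) for v in kpi_values)

def split_cost(points, axis, pivot):
    """Return the summed metric of the two sub-clusters after splitting at `pivot`.

    Each point is (params, kpi), where params maps parameter name to value.
    """
    left = [kpi for params, kpi in points if params[axis] < pivot]
    right = [kpi for params, kpi in points if params[axis] >= pivot]
    return delta(left) + delta(right)

points = [
    ({"conn_req": 10}, 99.0),
    ({"conn_req": 20}, 99.0),
    ({"conn_req": 80}, 97.0),
    ({"conn_req": 90}, 97.0),
]
print(delta([kpi for _, kpi in points]))   # metric of the whole cluster
print(split_cost(points, "conn_req", 50))  # metric summed over both sub-clusters
```

In this toy data the split at 50 separates the good and bad KPI values completely, so the summed metric of the sub-clusters drops to zero.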
Regression trees
1) Pick an axis (e.g. Conn. Req.)
2) Calculate Δ
3) Pick a pivot X to divide the points into two clusters
4) Calculate Δ’ + Δ”
Repeat for all pivots
Repeat for all axes
5) Pick the pivot with minimum Δ’ + Δ”
Repeat for sub-clusters
[Figure: resulting tree. The first split is Conn. Req. < X versus Conn. Req. >= X; a sub-cluster is split on Handoff Rate < Y versus Handoff Rate >= Y]
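The five steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the metric is taken as the sum of absolute deviations from the cluster mean, a cluster is left unsplit when no pivot lowers the metric, and the data, the parameter names `conn_req` and `handoff`, and the dictionary layout of the tree are all hypothetical.

```python
from statistics import mean

def delta(kpis):
    """Assumed metric: sum of absolute deviations from the cluster mean."""
    centre = mean(kpis) if kpis else 0.0
    return sum(abs(v - centre) for v in kpis)

def best_split(points):
    """Try every axis and pivot; return (cost, axis, pivot) with minimum cost."""
    best = None
    for axis in points[0][0]:                          # 1) pick an axis
        values = sorted({p[axis] for p, _ in points})
        for pivot in values[1:]:                       # 3) pick a pivot
            left = [k for p, k in points if p[axis] < pivot]
            right = [k for p, k in points if p[axis] >= pivot]
            cost = delta(left) + delta(right)          # 4) summed metric
            if best is None or cost < best[0]:         # 5) keep the minimum
                best = (cost, axis, pivot)
    return best

def build_tree(points):
    kpis = [k for _, k in points]
    node = {"mean_kpi": mean(kpis), "size": len(points)}
    split = best_split(points)
    # 2) compare against the metric of the unsplit cluster
    if split is None or split[0] >= delta(kpis):
        return node                                    # leaf: no useful split
    _, axis, pivot = split
    node["axis"], node["pivot"] = axis, pivot
    node["below"] = build_tree([(p, k) for p, k in points if p[axis] < pivot])
    node["at_or_above"] = build_tree([(p, k) for p, k in points if p[axis] >= pivot])
    return node

# Toy data: the KPI is low only when both parameters are high.
points = [
    ({"conn_req": 10, "handoff": 5}, 99.0),
    ({"conn_req": 20, "handoff": 30}, 99.0),
    ({"conn_req": 30, "handoff": 5}, 99.0),
    ({"conn_req": 40, "handoff": 30}, 99.0),
    ({"conn_req": 60, "handoff": 5}, 99.0),
    ({"conn_req": 70, "handoff": 30}, 97.0),
    ({"conn_req": 80, "handoff": 5}, 99.0),
    ({"conn_req": 90, "handoff": 30}, 97.0),
]
tree = build_tree(points)
print(tree["axis"], tree["pivot"])
print(tree["at_or_above"]["axis"], tree["at_or_above"]["pivot"])
```

On this data the first split falls on `conn_req` and the second, inside the high `conn_req` branch, on `handoff`, mirroring the two-level tree in the figure.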
Select rules corresponding to low KPI values
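One way to select such rules is to walk the tree and keep the path of split conditions that leads to each leaf with a low mean KPI. The sketch below assumes a hypothetical nested-dictionary tree; the pivots 70 and 30 are made-up values standing in for X and Y, and 98.5 is the call success threshold used elsewhere in the talk.

```python
def low_kpi_rules(node, threshold, path=()):
    """Collect the conjunction of split conditions leading to each low-KPI leaf."""
    if "axis" not in node:                      # leaf node
        if node["mean_kpi"] < threshold:
            return [" AND ".join(path) or "always"]
        return []
    axis, pivot = node["axis"], node["pivot"]
    return (
        low_kpi_rules(node["below"], threshold, path + (f"{axis} < {pivot}",))
        + low_kpi_rules(node["at_or_above"], threshold, path + (f"{axis} >= {pivot}",))
    )

# Hypothetical tree with illustrative pivots X = 70 and Y = 30.
tree = {
    "axis": "Conn. Req.", "pivot": 70,
    "below": {"mean_kpi": 99.0},
    "at_or_above": {
        "axis": "Handoff Rate", "pivot": 30,
        "below": {"mean_kpi": 99.0},
        "at_or_above": {"mean_kpi": 97.0},
    },
}
print(low_kpi_rules(tree, threshold=98.5))
# ['Conn. Req. >= 70 AND Handoff Rate >= 30']
```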
Regression trees
Human-readable
Capture multi-variable interactions
Capture sporadic events, since the clustering is time-agnostic
Regression trees – Issues
• Distance metric oblivious of significance of KPI values
• Curse of dimensionality
Metric oblivious of KPI value significance
[Figure: Call Success against Conn. Req. and Handoff rate; points labelled 98.5%, 98.6% and 98.7%, with a region marked Bad]
Need big separation between good and bad values
Distinction between good and bad is small
Stratify KPI values
Stratification of data
Multiply KPI value with custom step function
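The stratification idea, multiplying the KPI value with a custom step function, can be sketched as below. The slides do not give the actual step function, so the two weights are assumptions chosen only to show the effect; the threshold of 98.5 and the treatment of values below it as bad follow the "Call success rate < 98.5%" example.

```python
def step_weight(kpi, threshold=98.5, bad_weight=0.5, good_weight=1.0):
    """Hypothetical step function: one weight below the threshold, another above."""
    return bad_weight if kpi < threshold else good_weight

def stratify(kpi_values, threshold=98.5):
    """Multiply each KPI value by the step function to separate good from bad."""
    return [k * step_weight(k, threshold) for k in kpi_values]

raw = [98.4, 98.5, 98.6, 98.7]
stratified = stratify(raw)
print(stratified)
# The raw gap between the bad value and its nearest good value is about 0.1;
# after stratification it is about 49, so a distance-based split sees it clearly.
```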
Regression trees – Issues
• Distance metric oblivious of significance of KPI values
  – Stratify KPI values
• Curse of dimensionality
  – Dimensionality reduction
Curse of Dimensionality
[Figure: Call Success against Traffic Load and Interference]
Traffic Load > X & Interference > Y
Handoff rate < X & Conn. Req. < Y
Cell Radius > X & Allotted Power < Y
~300 variables lead to 2^300 combinations; the regression tree can be misled
Dimensionality reduction
• Preprocessing – Remove correlated, barely changing parameters etc.
• Domain knowledge based filtering – Remove unrelated parameters, apply weights
• Heuristics – Spike, Correlation, 3 more …
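The preprocessing step (dropping barely changing and mutually correlated parameters) might look like the following sketch. The thresholds `min_std` and `max_corr`, the function names and the parameter names are assumptions; the slides do not state how the filtering was actually done.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / norm if norm else 0.0

def filter_parameters(series, min_std=1e-6, max_corr=0.95):
    """Keep parameters that vary and are not near-duplicates of a kept parameter.

    `series` maps a parameter name to its list of sampled values.
    """
    kept = []
    for name, values in series.items():
        if pstdev(values) <= min_std:
            continue                      # barely changing parameter
        if any(abs(pearson(values, series[k])) >= max_corr for k in kept):
            continue                      # correlated with a kept parameter
        kept.append(name)
    return kept

series = {
    "connected_users": [10, 20, 30, 40, 50],
    "active_sessions": [11, 21, 31, 41, 51],  # tracks connected_users exactly
    "cell_radius": [5, 5, 5, 5, 5],           # never changes
    "interference": [3, 1, 4, 1, 5],
}
print(filter_parameters(series))  # ['connected_users', 'interference']
```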
Spike heuristic
[Figure: Call Success and a parameter plotted over time]
Values spike around the same time
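A possible reading of the spike heuristic is to keep a parameter when its spikes coincide in time with spikes of the KPI. The sketch below is an assumption-heavy illustration: the spike definition (more than `k` standard deviations from the mean), the `window` of one sample and the function names are all hypothetical.

```python
from statistics import mean, pstdev

def spike_times(values, k=2.0):
    """Indices whose value lies more than k standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()
    return {t for t, v in enumerate(values) if abs(v - mu) > k * sigma}

def spike_overlap(kpi, param, window=1, k=2.0):
    """Fraction of KPI spikes that have a parameter spike within `window` samples."""
    kpi_spikes = spike_times(kpi, k)
    param_spikes = spike_times(param, k)
    if not kpi_spikes:
        return 0.0
    hits = sum(
        1 for t in kpi_spikes
        if any(abs(t - s) <= window for s in param_spikes)
    )
    return hits / len(kpi_spikes)

kpi = [99.0] * 10
kpi[5] = 90.0                 # KPI dip at sample 5
related = [10.0] * 10
related[4] = 100.0            # parameter spike one sample earlier
unrelated = [10.0] * 10
unrelated[0] = 100.0          # parameter spike far from the dip

print(spike_overlap(kpi, related))    # 1.0
print(spike_overlap(kpi, unrelated))  # 0.0
```

A parameter with a high overlap score would be retained as a candidate axis for the regression tree; one with no overlap could be filtered out.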
Correlation heuristic
[Figure: Call Success against Conn. Req., plotted separately for Call Success > 98.5% and Call Success <= 98.5%]
Correlation changes significantly
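The correlation heuristic can be sketched by computing the parameter-KPI correlation separately for the good regime (Call Success > 98.5%) and the bad regime (Call Success <= 98.5%) and comparing the two. Using the absolute difference of Pearson coefficients as the measure of change is an assumption, as are the function names and the sample data.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / norm if norm else 0.0

def correlation_shift(param, kpi, threshold=98.5):
    """Absolute change in parameter-KPI correlation between good and bad regimes."""
    good = [(p, k) for p, k in zip(param, kpi) if k > threshold]
    bad = [(p, k) for p, k in zip(param, kpi) if k <= threshold]
    r_good = pearson(*zip(*good)) if len(good) > 1 else 0.0
    r_bad = pearson(*zip(*bad)) if len(bad) > 1 else 0.0
    return abs(r_good - r_bad)

conn_req = [10, 20, 30, 40, 60, 70, 80, 90]
call_success = [99.0, 99.2, 98.9, 99.1, 98.4, 98.0, 97.6, 97.2]
shift = correlation_shift(conn_req, call_success)
print(round(shift, 2))
```

Here the KPI is essentially uncorrelated with the parameter while it is good and falls linearly with it while it is bad, so the shift is close to 1 and the parameter would be kept.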
[Figure: processing pipeline]
Rule generation: Data store, Apply filters, Stratify KPI data, Regression tree, Select rules, Rule store
Rule application: Rule store, Matching rules
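Rule application can be sketched as a lookup of stored rules whose conditions all hold for a new sample of performance counters. The rule encoding, the counter names and the `rule_store` layout below are hypothetical; the two rules are modelled on the example rule set for the call success rate KPI.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

# Hypothetical rule store: KPI name -> list of rules, each a list of
# (counter, comparison, value) conditions joined by AND.
rule_store = {
    "call_success_rate": [
        [("users_5_to_10_km_pct", ">", 63)],
        [("users_bad_rss_pct", ">", 21), ("uplink_load_mb", ">", 831)],
    ],
}

def matching_rules(kpi_name, sample, store=rule_store):
    """Return the stored rules for `kpi_name` whose conditions all hold for `sample`."""
    return [
        rule for rule in store.get(kpi_name, [])
        if all(OPS[op](sample[counter], value) for counter, op, value in rule)
    ]

sample = {"users_5_to_10_km_pct": 40, "users_bad_rss_pct": 25, "uplink_load_mb": 900}
for rule in matching_rules("call_success_rate", sample):
    print(" AND ".join(f"{c} {op} {v}" for c, op, v in rule))
# users_bad_rss_pct > 21 AND uplink_load_mb > 831
```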
Training & Verification Data
• Analyzed 28 days of data from 217 cell sites
• 2 countries, 2 OEMs
• 317 parameters @ 15 minute interval
• 80% data to train and 20% to validate
Find rules for all KPI dips
Cell sites with at least 4 KPIs with more than 100 bad instances selected
[Figure: two bar charts, Country #1 (18 cell sites) and Country #2 (60 cell sites); x-axis: KPI (1 to 4); y-axis: Instances; series: Found Rule, Bad KPI]
Rule Verification
• Picked rules for randomly selected 50 KPI dips
• Show rules to 15 RF engineers (Ongoing)
• 80% rules were actionable
• For all the KPI dips at least one actionable rule in the rule set
Example rule set
KPI dip: Call success rate < 98.5%
1) Total users in 5 to 10 KM from base station > 63%
   Users concentrated at cell edge
2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB
   21% users with bad RSSI and high traffic load
3) Download Traffic < 500 Kbytes AND Total active users < 200
   Does not point to a meaningful cause?
   • Coarse timescale leading to multiple other failures
   • Don’t have access to relevant parameters
   • Specific problem is a rare event in the current sector
Ongoing Work
Recommending solution for a problem
[Figure: Cell sites, Monitoring Centre, Parameter list]
Parameter list: remotely configurable parameters. Example: antenna tilt, min. signal strength to associate, allowable idle time, etc.
When a KPI dips:
• Generate rules
• Find sectors where the rules do not lead to KPI dip
• Return the parameter list for those sectors
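The "When a KPI dips" steps can be sketched as a search over sectors: find those where the generated rule's conditions hold but the KPI stays above the threshold, and return their configurable parameter lists. The sector records, counter names and configuration values below are made up for illustration, and the rule is modelled on one of the example rules.

```python
def rule_holds(counters):
    # Hypothetical rule modelled on "users in bad RSS region > 21%
    # AND total uplink load > 831 MB".
    return counters["users_bad_rss_pct"] > 21 and counters["uplink_load_mb"] > 831

def recommend_configs(sectors, rule, kpi_threshold=98.5):
    """Parameter lists of sectors where the rule holds but the KPI did not dip."""
    return {
        s["name"]: s["config"]
        for s in sectors
        if rule(s["counters"]) and s["kpi"] >= kpi_threshold
    }

sectors = [
    {"name": "sector_A", "kpi": 97.9,   # rule holds and the KPI dips
     "counters": {"users_bad_rss_pct": 25, "uplink_load_mb": 900},
     "config": {"antenna_tilt_deg": 2, "min_rss_dbm": -110}},
    {"name": "sector_B", "kpi": 99.1,   # rule holds but the KPI is fine
     "counters": {"users_bad_rss_pct": 24, "uplink_load_mb": 950},
     "config": {"antenna_tilt_deg": 6, "min_rss_dbm": -100}},
    {"name": "sector_C", "kpi": 99.3,   # rule does not hold
     "counters": {"users_bad_rss_pct": 5, "uplink_load_mb": 300},
     "config": {"antenna_tilt_deg": 4, "min_rss_dbm": -105}},
]
print(recommend_configs(sectors, rule_holds))
# {'sector_B': {'antenna_tilt_deg': 6, 'min_rss_dbm': -100}}
```

The configuration of `sector_B` is a candidate recommendation for `sector_A`, since both face the same conditions but only one suffers the dip.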
More customizations necessary …
Value aware networking
All bits of a video application are not created equal
[Figure: value of I, P and B frames in MPEG4/H.264 encoded video; deadlines of < 5 msec and < 105 msec]
Nearer the deadline, more valuable the packet
[Figure: protocol stack (Application, Transport, Network, MAC, PHY) with a value aware application layer and an API; bit streams labelled I, P, B]
Can protocol decisions be taken in a value aware manner?
• Order of sending data
• Times to retransmit
• MAC data rate
Yes. Almost no data overhead
Questions?
Backup
Future work
• Online regression tree formation
• Fast emulation systems for what-if analysis
Research overview
Systems & Protocols, Cross-Layer design, Measurement & Analysis
Scout
[Submitted]
[DySPAN 2012]
Range-Write [OSDI 2008]
Apex [Sigcomm 2010]
Medusa [NSDI 2010]
MOM [Submitted]
RDP-TS
DGP [MobiCom 2006]
MCB-Mesh [IMC 2008]
Fractel [INFOCOM 2008]
WiScape [IMC 2011]
[WWW 2008]
Topo-cons
WhiteCell
Root-cause
PhD Dissertation
MultIfaceT [HotMobile’10]
Multi-Interface systems
[Figure: Tx and Rx pairs]
Benefits
Higher bandwidth
• Home repeater
• Vehicular whitespace
Reliability
• Whitespace femto
Challenges
• API with higher layers
• Striping decision
• Channel selection
• Feedback gathering