
Automated Cellular Root Cause Analysis

Date post: 17-Feb-2016
Page 1: Automated Cellular Root Cause Analysis

Automated Cellular Root Cause Analysis

Sayandeep Sen Bell Labs India

Joint work with Sourjya Bhaumik & Rijin John

Page 2: Automated Cellular Root Cause Analysis

Cellular Base Station Monitoring

Monitoring Centre

Cell site

Cell sites

Every 15 minutes

Page 3: Automated Cellular Root Cause Analysis

Performance counters. Example: connected users, average signal strength, cell radius, etc.

Cell site

Cell sites

Performance counters

Cellular Base Station Monitoring

Monitoring Centre

Every 15 minutes

Page 4: Automated Cellular Root Cause Analysis

Cellular Base Station Monitoring

KPI: Key Performance Indicator. Example: call drop rate, successful connection setup rate, throughput.

Cell site

Cell sites

KPI

Every 15 minutes

Monitoring Centre

Page 5: Automated Cellular Root Cause Analysis

Root cause analysis

Monitoring Centre

Cell site

Cell sites

KPI

Performance counters

Why did the KPI go below the threshold?

Manually

Page 6: Automated Cellular Root Cause Analysis

Root Cause Analysis – Issues

[Figure: time series of the KPI and of Parameter 1 through Parameter N]

Too many variables
• ~300 parameters
• 1 engineer per O(100) cell sites

Manual debugging is inefficient

Page 7: Automated Cellular Root Cause Analysis

Root Cause Analysis – Issues

[Figure: time series of the KPI and of Parameter 1 through Parameter N]

Too many variables
• ~300 parameters
• 1 engineer per O(100) cell sites

Manual debugging is inefficient

Sporadic parameter dips

Page 8: Automated Cellular Root Cause Analysis

Root Cause Analysis – Issues

[Figure: time series of the KPI and of Parameter 1 through Parameter N]

Too many variables
• ~300 parameters
• 1 engineer per O(100) cell sites

Manual debugging is inefficient

Sporadic parameter dips

Multiple parameter interaction

Page 9: Automated Cellular Root Cause Analysis

Carry out automated (fast) root cause analysis which accounts for sporadic dips and multiple parameter interactions while ensuring human readable output.

Problem Statement

Page 10: Automated Cellular Root Cause Analysis

• Motivation

• Problem statement

• Approach

• Insight, Mechanism, Customizations

• Results

• Ongoing work

• Other work

Outline

Page 11: Automated Cellular Root Cause Analysis

KPI-parameter relationship is dependent on other parameter values

Key Intuition

Page 12: Automated Cellular Root Cause Analysis

[Figure: Call Success plotted against Conn. Req. and Handoff rate]

Key Intuition

Page 13: Automated Cellular Root Cause Analysis

[Figure: Call Success plotted against Conn. Req. and Handoff rate, with the Threshold marked; highlighted region: Conn. Req. > X & H/o = y]

Key Intuition

Page 14: Automated Cellular Root Cause Analysis

Key Intuition

KPI-parameter relationship is dependent on other parameter values

[Figure: Call Success plotted against Conn. Req. and Handoff rate; highlighted region: Conn. Req. > X’ & H/o = y’]

Determine the rules for various parameter combination values using regression trees

Page 15: Automated Cellular Root Cause Analysis

• Motivation

• Problem statement

• Approach

• Insight, Mechanism, Customizations

• Results

• Ongoing work

• Other work

Outline

Page 16: Automated Cellular Root Cause Analysis

Regression trees

Form clusters of points

To minimize the sum of the distance metric over sub-clusters

[Figure: Call Success scatter; a cluster with distance Δ split into sub-clusters with distances Δ’ and Δ”]

Page 17: Automated Cellular Root Cause Analysis

Regression trees

Form clusters of points

To minimize the sum of the distance metric over sub-clusters

Distance metric: sum of Euclidean distances of the points in a sub-cluster

[Figure: Call Success scatter; a cluster with distance Δ split into sub-clusters with distances Δ’ and Δ”]

Provide a human-readable rule for each cluster
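The distance metric on this slide can be sketched in a few lines. This is a minimal illustration, assuming Δ is measured on the KPI value alone, as the sum of distances of each point from its cluster mean; the slide does not spell out the exact definition, and classical regression trees usually use squared deviations instead. The function name `cluster_delta` and the sample values are hypothetical.

```python
from statistics import fmean


def cluster_delta(kpi_values):
    """Sum of 1-D Euclidean distances of each KPI value from the cluster mean.

    Assumption: the deck's distance metric is taken on the KPI axis only.
    """
    if not kpi_values:
        return 0.0
    centre = fmean(kpi_values)
    return sum(abs(v - centre) for v in kpi_values)


# Hypothetical call-success samples: three healthy, three degraded.
parent = [99.0, 99.2, 98.8, 95.0, 95.4, 94.6]
left, right = parent[:3], parent[3:]

delta_parent = cluster_delta(parent)                     # Δ
delta_split = cluster_delta(left) + cluster_delta(right)  # Δ’ + Δ”
print(delta_parent, delta_split)
```

A split is worthwhile when Δ’ + Δ” is much smaller than Δ, which is the case here because the split separates the healthy samples from the degraded ones.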

Page 18: Automated Cellular Root Cause Analysis

Regression trees

1) Pick an axis
2) Calculate Δ

[Figure: Call Success plotted against Conn. Req.]

Page 19: Automated Cellular Root Cause Analysis

Regression trees

1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters

[Figure: Call Success plotted against Conn. Req., with pivot X]

Page 20: Automated Cellular Root Cause Analysis

Regression trees

1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ’ + Δ”

[Figure: Call Success plotted against Conn. Req., split at pivot X into clusters with distances Δ’ and Δ”]

Page 21: Automated Cellular Root Cause Analysis

Regression trees

1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ’ + Δ”
Repeat for all pivots

[Figure: Call Success plotted against Conn. Req., with several candidate pivots X]

Page 22: Automated Cellular Root Cause Analysis

Regression trees

1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ’ + Δ”
Repeat for all pivots
Repeat for all axes

[Figure: Call Success plotted against Conn. Req.]

Page 23: Automated Cellular Root Cause Analysis

Regression trees

1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ’ + Δ”
Repeat for all pivots
Repeat for all axes
5) Pick the pivot with minimum Δ’ + Δ”

[Figure: Call Success plotted against Conn. Req., split at pivot X; tree nodes: Conn. Req. < X, Conn. Req. >= X]

Page 24: Automated Cellular Root Cause Analysis

Regression trees

1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ’ + Δ”
Repeat for all pivots
Repeat for all axes
5) Pick the pivot with minimum Δ’ + Δ”
Repeat for sub-clusters

[Figure: Call Success plotted against Conn. Req., split at pivot X; tree nodes: Conn. Req. < X, Conn. Req. >= X]
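The five steps above can be sketched as a small recursive procedure. This is an illustrative reading of the slides, not the authors' implementation: Δ is assumed to be the sum of distances of KPI values from the cluster mean, and the stopping parameter `min_gain`, the field names, and the sample rows are all hypothetical.

```python
from statistics import fmean


def delta(values):
    """Assumed distance metric: sum of |value - cluster mean|."""
    if not values:
        return 0.0
    centre = fmean(values)
    return sum(abs(v - centre) for v in values)


def best_split(rows, kpi, axes):
    """Return (axis, pivot, cost) with minimum Δ’ + Δ”, or None if no split exists."""
    best = None
    for axis in axes:                                  # repeat for all axes
        for pivot in sorted({r[axis] for r in rows}):  # repeat for all pivots
            low = [r[kpi] for r in rows if r[axis] < pivot]
            high = [r[kpi] for r in rows if r[axis] >= pivot]
            if not low or not high:
                continue
            cost = delta(low) + delta(high)            # step 4: Δ’ + Δ”
            if best is None or cost < best[2]:
                best = (axis, pivot, cost)             # step 5: keep the minimum
    return best


def grow(rows, kpi, axes, min_gain=1.0):
    """Recursively split; stop when the best split reduces Δ by less than min_gain."""
    values = [r[kpi] for r in rows]
    split = best_split(rows, kpi, axes)
    if split is None or delta(values) - split[2] < min_gain:
        return {"leaf": True, "mean": fmean(values), "n": len(rows)}
    axis, pivot, _ = split
    low = [r for r in rows if r[axis] < pivot]
    high = [r for r in rows if r[axis] >= pivot]
    return {"leaf": False, "axis": axis, "pivot": pivot,
            "low": grow(low, kpi, axes, min_gain),     # repeat for sub-clusters
            "high": grow(high, kpi, axes, min_gain)}


# Hypothetical samples: call success drops only when both counters are high.
samples = [(100, 0.10, 99.5), (120, 0.40, 99.4), (150, 0.20, 99.6), (400, 0.10, 99.3),
           (420, 0.15, 99.2), (410, 0.35, 96.0), (450, 0.40, 95.8), (430, 0.45, 96.2)]
rows = [{"conn_req": c, "handoff_rate": h, "call_success": k} for c, h, k in samples]
tree = grow(rows, "call_success", ["conn_req", "handoff_rate"])
print(tree)
```

On this toy data the root splits on `conn_req` and the high-load branch splits again on `handoff_rate`, mirroring the two-parameter interaction from the Key Intuition slides.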

Page 25: Automated Cellular Root Cause Analysis

Regression trees

1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ’ + Δ”
Repeat for all pivots
Repeat for all axes
5) Pick the pivot with minimum Δ’ + Δ”
Repeat for sub-clusters

[Figure: Call Success plotted against Conn. Req. and Handoff rate, with pivots X and Y; tree nodes: Conn. Req. < X, Conn. Req. >= X, Handoff Rate >= Y, Handoff Rate < Y]

Page 26: Automated Cellular Root Cause Analysis

Regression trees

1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ’ + Δ”
Repeat for all pivots
Repeat for all axes
5) Pick the pivot with minimum Δ’ + Δ”
Repeat for sub-clusters

[Figure: Call Success plotted against Conn. Req. and Handoff rate, with pivots X and Y; tree nodes: Conn. Req. < X, Conn. Req. >= X, Handoff Rate >= Y, Handoff Rate < Y]

Select rules corresponding to low KPI values
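Selecting the rules that correspond to low KPI values amounts to walking the tree and keeping the paths that end in a low-mean leaf. The sketch below assumes a tree stored as nested dictionaries and a 98.5% call-success threshold (the value used in the example rule set later in the deck); the tree contents, field names, and the function `low_kpi_rules` are hypothetical.

```python
def low_kpi_rules(node, threshold, path=()):
    """Return one human-readable rule per leaf whose mean KPI is below threshold."""
    if node["leaf"]:
        if node["mean"] < threshold and path:
            return [" AND ".join(path)]
        return []
    axis, pivot = node["axis"], node["pivot"]
    return (low_kpi_rules(node["low"], threshold, path + (f"{axis} < {pivot}",))
            + low_kpi_rules(node["high"], threshold, path + (f"{axis} >= {pivot}",)))


# Hand-built two-level tree standing in for the output of tree construction.
tree = {
    "leaf": False, "axis": "conn_req", "pivot": 410,
    "low": {"leaf": True, "mean": 99.45},
    "high": {
        "leaf": False, "axis": "handoff_rate", "pivot": 0.35,
        "low": {"leaf": True, "mean": 99.2},
        "high": {"leaf": True, "mean": 96.0},
    },
}

rules = low_kpi_rules(tree, threshold=98.5)
print(rules)  # ['conn_req >= 410 AND handoff_rate >= 0.35']
```

Each rule is a conjunction of the split conditions on the path to the leaf, which is what makes the output readable by an engineer.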

Page 27: Automated Cellular Root Cause Analysis

Regression trees

[Figure: Call Success plotted against Conn. Req. and Handoff rate, with pivots X and Y; tree nodes: Conn. Req. < X, Conn. Req. >= X, Handoff Rate >= Y, Handoff Rate < Y]

Human readable

Capture multiple variable interaction

Capture sporadic events due to time agnostic clustering

Page 28: Automated Cellular Root Cause Analysis

• Motivation

• Problem statement

• Approach

• Insight, Mechanism, Customizations

• Results

• Ongoing work

• Other work

Outline

Page 29: Automated Cellular Root Cause Analysis

• Distance metric oblivious of significance of KPI values
• Curse of dimensionality

Regression trees – Issues

Page 30: Automated Cellular Root Cause Analysis

Metric oblivious of KPI value significance

[Figure: Call Success plotted against Conn. Req. and Handoff rate]

Need big separation between good and bad values

Page 31: Automated Cellular Root Cause Analysis

Metric oblivious of KPI value significance

[Figure: Call Success plotted against Conn. Req. and Handoff rate; labels: 98.5%, Bad]

Page 32: Automated Cellular Root Cause Analysis

Metric oblivious of KPI value significance

[Figure: Call Success plotted against Conn. Req. and Handoff rate; labels: 98.5%, 98.6%, 98.7%, 98.5%, Bad]

Page 33: Automated Cellular Root Cause Analysis

Metric oblivious of KPI value significance

[Figure: Call Success plotted against Conn. Req. and Handoff rate; labels: 98.5%, 98.6%, 98.7%, 98.5%, Bad]

Distinction between good and bad is small

Stratify KPI values

Page 34: Automated Cellular Root Cause Analysis

Metric oblivious of KPI value significance

[Figure: Call Success plotted against Conn. Req. and Handoff rate; labels: 98.5%, 98.6%, 98.7%, 98.5%, Bad]

Distinction between good and bad is small

Multiply KPI value with custom step function
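A step-function stratification could look like the sketch below. The deck does not give the step function, so the weight `bad_weight`, the function name `stratify`, and the choice to treat values at or above 98.5% as good are assumptions made only for illustration.

```python
def stratify(kpi, threshold=98.5, bad_weight=0.5):
    """Multiply the KPI by a step function: 1.0 at or above threshold, bad_weight below.

    bad_weight is a hypothetical constant; any factor that pushes bad values
    far from good ones would serve the same purpose.
    """
    step = 1.0 if kpi >= threshold else bad_weight
    return kpi * step


raw = [98.7, 98.6, 98.5, 98.4]
stratified = [stratify(v) for v in raw]

print(raw[2] - raw[3])                # tiny gap between good and bad before stratification
print(stratified[2] - stratified[3])  # large gap afterwards
```

After stratification a bad sample sits far from the good ones, so a split that separates good from bad samples lowers the distance metric far more than a split among good samples.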

Page 35: Automated Cellular Root Cause Analysis

Stratification of data

[Figure: Call Success plotted against Conn. Req. and Handoff rate; labels: 98.5%, 98.6%, 98.7%, 98.5%, Bad]

Distinction between good and bad is small

Multiply KPI value with custom step function

Page 36: Automated Cellular Root Cause Analysis

Stratification of data

[Figure: Call Success plotted against Conn. Req. and Handoff rate; labels: 98.5%, 98.6%, 98.7%, Bad]

Distinction between good and bad is small

Page 37: Automated Cellular Root Cause Analysis

Stratification of data

[Figure: Call Success plotted against Conn. Req. and Handoff rate; labels: 98.5%, 98.6%, 98.7%, 98.5%, Bad]

Distinction between good and bad is small

Page 38: Automated Cellular Root Cause Analysis

• Distance metric oblivious of significance of KPI values
  – Stratify KPI values

• Curse of dimensionality
  – Dimensionality reduction

Regression trees – Issues

Page 39: Automated Cellular Root Cause Analysis

Curse of Dimensionality

[Figure: Call Success plotted against Interference and Traffic Load]

Traffic Load > X & Interference > Y

Handoff rate < X & Conn. Req. < Y

Cell Radius > X & Allotted Power < Y

Page 40: Automated Cellular Root Cause Analysis

Curse of Dimensionality

[Figure: Call Success plotted against Interference and Traffic Load]

Traffic Load > X & Interference > Y

Handoff rate < X & Conn. Req. < Y

Cell Radius > X & Allotted Power < Y

~300 variables lead to 2^300 combinations; the regression tree can be misled

Page 41: Automated Cellular Root Cause Analysis

• Preprocessing
  – Remove correlated, barely changing parameters, etc.

• Domain knowledge based filtering
  – Remove unrelated parameters, apply weights

• Heuristics
  – Spike, Correlation, 3 more …

Dimensionality reduction
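The preprocessing bullet (dropping barely changing and mutually correlated parameters) can be sketched as follows. The thresholds `min_std` and `max_corr`, the function names, and the sample columns are hypothetical; the deck does not state which cut-offs were used.

```python
from statistics import fmean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def prefilter(columns, min_std=1e-6, max_corr=0.95):
    """columns maps parameter name -> samples. Returns the names that are kept."""
    kept = []
    for name, series in columns.items():
        if pstdev(series) <= min_std:
            continue  # barely changing parameter
        if any(abs(pearson(series, columns[k])) >= max_corr for k in kept):
            continue  # nearly redundant with a parameter already kept
        kept.append(name)
    return kept


columns = {
    "conn_req": [100, 120, 150, 400, 420],
    "conn_req_scaled": [200, 240, 300, 800, 840],  # perfectly correlated with conn_req
    "cell_radius": [5, 5, 5, 5, 5],                # never changes
    "handoff_rate": [0.10, 0.40, 0.20, 0.10, 0.15],
}
kept = prefilter(columns)
print(kept)  # ['conn_req', 'handoff_rate']
```

Keeping the first parameter of each correlated group is an arbitrary choice in this sketch; domain knowledge could decide which member of a group is the most meaningful one to keep.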

Page 42: Automated Cellular Root Cause Analysis

Spike heuristic

[Figure: time series of Call Success and of a parameter]

Values spike around same time
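One way to read the spike heuristic is to keep a parameter when its spikes fall close in time to spikes of the KPI. The sketch below assumes a spike is a sample more than `k` standard deviations from the series mean and that "around the same time" means within `window` samples; both definitions, and all names and data, are assumptions rather than the deck's method.

```python
from statistics import fmean, pstdev


def spike_times(series, k=2.0):
    """Indices whose value deviates from the mean by more than k standard deviations."""
    mean, std = fmean(series), pstdev(series)
    if std == 0:
        return set()
    return {i for i, v in enumerate(series) if abs(v - mean) > k * std}


def spike_overlap(kpi, param, k=2.0, window=1):
    """Fraction of KPI spikes that have a parameter spike within `window` samples."""
    kpi_spikes = spike_times(kpi, k)
    param_spikes = spike_times(param, k)
    if not kpi_spikes:
        return 0.0
    hits = sum(any(abs(t - p) <= window for p in param_spikes) for t in kpi_spikes)
    return hits / len(kpi_spikes)


# Hypothetical 15-minute samples: the KPI dips at index 5, the load spikes at index 6.
call_success = [99.5] * 5 + [95.0] + [99.5] * 5
uplink_load = [10] * 6 + [60] + [10] * 4
steady = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]

print(spike_overlap(call_success, uplink_load))  # 1.0
print(spike_overlap(call_success, steady))       # 0.0
```

Parameters with a high overlap score would be kept as candidate axes for the regression tree, while parameters that never spike near a KPI dip would be filtered out.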

Page 43: Automated Cellular Root Cause Analysis

Correlation heuristic

[Figure: two scatter plots of Call Success against Conn. Req.; panels: Call Success > 98.5%, Call Success <= 98.5%]

Correlation changes significantly
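The correlation heuristic can be sketched by computing the parameter-KPI correlation separately for samples above and below the 98.5% threshold and comparing the two. The use of the Pearson coefficient, the function names, and the sample data are assumptions; the deck only says that the correlation changes significantly.

```python
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_shift(kpi, param, threshold=98.5):
    """Absolute change in correlation between the good (> threshold) and bad regimes."""
    good = [(p, k) for p, k in zip(param, kpi) if k > threshold]
    bad = [(p, k) for p, k in zip(param, kpi) if k <= threshold]
    if len(good) < 2 or len(bad) < 2:
        return 0.0
    r_good = pearson(*zip(*good))
    r_bad = pearson(*zip(*bad))
    return abs(r_good - r_bad)


call_success = [99.5, 99.4, 99.4, 99.5, 98.0, 97.0, 96.0, 95.0]
conn_req = [100, 110, 120, 130, 400, 420, 440, 460]  # matters only in the bad regime
other = [1, 2, 2, 1, 1, 2, 3, 4]                     # same relationship in both regimes

print(correlation_shift(call_success, conn_req))
print(correlation_shift(call_success, other))
```

A large shift suggests the parameter behaves differently when the KPI is bad, which makes it a plausible axis to keep; a shift near zero suggests it can be dropped.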

Page 44: Automated Cellular Root Cause Analysis

Rule generation

[Diagram boxes: Data store, Apply filters, Stratify KPI data, Regression tree, Select rules, Rule store]

Page 45: Automated Cellular Root Cause Analysis

Rule application

[Diagram boxes: Rule store, Matching rules]
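Rule application can be pictured as a lookup against the rule store when a KPI dips. In the sketch below a rule is a list of (parameter, operator, value) conditions joined by AND; the three rules are the ones from the example rule set later in the deck, while the store layout, parameter names, and counter values are hypothetical.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Hypothetical rule store: KPI name -> list of rules, each rule a list of conditions.
RULE_STORE = {
    "call_success_rate": [
        [("users_5_to_10_km_pct", ">", 63)],
        [("users_bad_rss_pct", ">", 21), ("uplink_load_mb", ">", 831)],
        [("download_traffic_kb", "<", 500), ("active_users", "<", 200)],
    ],
}


def matching_rules(kpi_name, counters, store=RULE_STORE):
    """Return the stored rules for kpi_name whose conditions all hold for counters."""
    matches = []
    for rule in store.get(kpi_name, []):
        if all(name in counters and OPS[op](counters[name], value)
               for name, op, value in rule):
            matches.append(rule)
    return matches


# Counters reported by one sector in the interval where the KPI dipped.
sample = {
    "users_5_to_10_km_pct": 40,
    "users_bad_rss_pct": 25,
    "uplink_load_mb": 900,
    "download_traffic_kb": 2000,
    "active_users": 350,
}
print(matching_rules("call_success_rate", sample))
```

Only the second rule fires for this sample, pointing the engineer at bad signal strength combined with high uplink load.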

Page 46: Automated Cellular Root Cause Analysis

• Motivation

• Problem statement

• Approach

• Insight, Mechanism, Customizations

• Results

• Ongoing work

• Other work

Outline

Page 47: Automated Cellular Root Cause Analysis

Training & Verification Data

• Analyzed 28 days of data from 217 cell sites
• 2 countries, 2 OEMs
• 317 parameters @ 15 minute interval
• 80% data to train and 20% to validate
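The figures above imply a data set size that is easy to work out. The sketch assumes every site reports in every 15-minute interval and that the 80/20 split is taken in time order per site; the deck states neither, so both are assumptions.

```python
SITES, DAYS, INTERVAL_MIN, PARAMS = 217, 28, 15, 317

samples_per_site = DAYS * 24 * 60 // INTERVAL_MIN  # 96 intervals per day
total_samples = SITES * samples_per_site
total_values = total_samples * PARAMS


def split_80_20(rows):
    """Assumed chronological split: first 80% to train, last 20% to validate."""
    cut = len(rows) * 8 // 10
    return rows[:cut], rows[cut:]


train, validate = split_80_20(list(range(samples_per_site)))
print(samples_per_site, total_samples, len(train), len(validate))
# 2688 583296 2150 538
```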

Page 48: Automated Cellular Root Cause Analysis

Find rules for all KPI dips

Cell sites with at least 4 KPIs with more than 100 bad instances selected

[Figure: two bar charts, Country #1 (18 cell sites) and Country #2 (60 cell sites); x-axis: KPI (1 to 4); y-axis: Instances; series: Found Rule, Bad KPI]

Page 49: Automated Cellular Root Cause Analysis

Rule Verification

• Picked rules for 50 randomly selected KPI dips
• Showed the rules to 15 RF engineers (ongoing)
• 80% of the rules were actionable
• For all the KPI dips, at least one actionable rule in the rule set

Page 50: Automated Cellular Root Cause Analysis

1) Total users in 5 to 10 KM from base station > 63%

2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB

KPI dip: Call success rate < 98.5%

3) Download Traffic < 500 Kbytes AND Total active users < 200

Example rule set

Page 51: Automated Cellular Root Cause Analysis

2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB

3) Download Traffic < 500 Kbytes AND Total active users < 200

1) Total users in 5 to 10 KM from base station > 63%

Users concentrated at cell edge

Example rule set

KPI dip: Call success rate < 98.5%

Page 52: Automated Cellular Root Cause Analysis

3) Download Traffic < 500 Kbytes AND Total active users < 200

2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB

1) Total users in 5 to 10 KM from base station > 63%

21% users with bad RSSI and high traffic load

Example rule set

KPI dip: Call success rate < 98.5%

Page 53: Automated Cellular Root Cause Analysis

1) Total users in 5 to 10 KM from base station > 63%

2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB

3) Download Traffic < 500 Kbytes AND Total active users < 200

Does not point to a meaningful cause?

Example rule set

KPI dip: Call success rate < 98.5%

Page 54: Automated Cellular Root Cause Analysis

Example rule set

1) Total users in 5 to 10 KM from base station > 63%

2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB

3) Download Traffic < 500 Kbytes AND Total active users < 200

Coarse timescale leading to multiple other failures

Don’t have access to relevant parameters

Specific problem is a rare event in the current sector

KPI dip: Call success rate < 98.5%

Page 55: Automated Cellular Root Cause Analysis

• Motivation

• Problem statement

• Approach

• Insight, Mechanism, Customizations

• Results

• Ongoing work

• Other work

Outline

Page 56: Automated Cellular Root Cause Analysis

Recommending solution for a problem

[Diagram: Cell sites, Monitoring Centre, Parameter list]

Parameter list: remotely configurable parameters. Example: antenna tilt, min. signal strength to associate, allowable idle time, etc.

Ongoing Work

Page 57: Automated Cellular Root Cause Analysis

Recommending solution for a problem

[Diagram: Cell sites, Monitoring Centre, Parameter list]

When a KPI dips:
• Generate rules
• Find sectors where the rules do not lead to a KPI dip
• Return the parameter list for those sectors

Ongoing Work
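The recommendation idea can be sketched as a search for sectors where a rule's conditions hold but the KPI stays healthy, returning their configurable parameters as candidates. This is one reading of the slide, not a described implementation; the sector data, parameter names, and the function `recommend` are hypothetical.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def recommend(rule, sectors, kpi="call_success_rate", threshold=98.5):
    """Return the parameter list of sectors where `rule` fires yet the KPI is healthy."""
    candidates = {}
    for name, sector in sectors.items():
        counters = sector["counters"]
        fires = all(OPS[op](counters[param], value) for param, op, value in rule)
        if fires and counters[kpi] >= threshold:
            candidates[name] = sector["config"]
    return candidates


rule = [("users_bad_rss_pct", ">", 21), ("uplink_load_mb", ">", 831)]
sectors = {
    "A": {"counters": {"users_bad_rss_pct": 25, "uplink_load_mb": 900, "call_success_rate": 96.0},
          "config": {"antenna_tilt_deg": 2, "min_rss_to_associate_dbm": -110}},
    "B": {"counters": {"users_bad_rss_pct": 24, "uplink_load_mb": 950, "call_success_rate": 99.1},
          "config": {"antenna_tilt_deg": 6, "min_rss_to_associate_dbm": -100}},
    "C": {"counters": {"users_bad_rss_pct": 5, "uplink_load_mb": 300, "call_success_rate": 99.4},
          "config": {"antenna_tilt_deg": 4, "min_rss_to_associate_dbm": -105}},
}
print(recommend(rule, sectors))
```

Sector B faces the same conditions as the failing sector A without a KPI dip, so its configuration is the one returned for comparison.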

Page 58: Automated Cellular Root Cause Analysis

Ongoing Work

Recommending solution for a problem

More customizations necessary …

Page 59: Automated Cellular Root Cause Analysis

• Motivation

• Problem statement

• Approach

• Insight, Mechanism, Customizations

• Results

• Ongoing work

• Other work

Outline

Page 60: Automated Cellular Root Cause Analysis

Value aware networking

All bits of a video application are not created equal

[Figure: Value of I, P, and B frames in MPEG4/H.264 encoded video; labels: < 5 msec, < 105 msec]

The nearer the deadline, the more valuable the packet

Page 61: Automated Cellular Root Cause Analysis

Value aware application layer

[Diagram: protocol stack (Application, Transport, Network, MAC, PHY) with an API; I, P, and B frames shown as bit strings]

Page 62: Automated Cellular Root Cause Analysis

Value aware networking

[Diagram: protocol stack (Application, Transport, Network, MAC, PHY) with an API; I, P, and B frames shown as bit strings]

Can protocol decisions be taken in a value aware manner?
• Order of sending data
• Times to retransmit
• MAC data rate

Yes. Almost no data overhead

Page 63: Automated Cellular Root Cause Analysis

Questions?

Page 64: Automated Cellular Root Cause Analysis

Backup

Page 65: Automated Cellular Root Cause Analysis

Future work

• Online regression tree formation
• Fast emulation systems for what-if analysis

Page 66: Automated Cellular Root Cause Analysis

Research overview

[Diagram: research projects grouped under Systems & Protocols, Cross-Layer design, and Measurement & Analysis; PhD Dissertation marked]

Scout
[Submitted]
[DySPAN 2012]
Range-Write [OSDI 2008]
Apex [Sigcomm 2010]
Medusa [NSDI 2010]
MOM [Submitted]
RDP-TS
DGP [MobiCom 2006]
MCB-Mesh [IMC 2008]
Fractel [INFOCOM 2008]
WiScape [IMC 2011]
[WWW 2008]
Topo-cons
WhiteCell
Root-cause
MultIfaceT [HotMobile’10]

Page 67: Automated Cellular Root Cause Analysis

Multi-Interface systems

[Diagram: Tx and Rx pairs]

Benefits

Higher bandwidth
• Home repeater
• Vehicular whitespace

Reliability
• Whitespace femto

Page 68: Automated Cellular Root Cause Analysis

Multi-Interface systems

[Diagram: Tx and Rx pairs]

Challenges
• API with higher layers
• Striping decision
• Channel selection
• Feedback gathering

