Part II – 8: Failure management
Machine Learning Methods for Communication Networks and Systems – 051911
Francesco Musumeci
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB)
Politecnico di Milano, Milano, Italy
F. Musumeci: ML Methods for Communication Nets & Systems
Part II – 8: Failure management
Two main failure types in optical networks
• Hard-failures:
  o Sudden events, e.g., fiber cuts, power outages, etc.
  o Unpredictable; they require «protection» (reactive procedures)
• Soft-failures:
  o Gradual transmission degradation due to equipment malfunctioning, filter shrinking/misalignment…
  o Trigger early network reconfiguration (proactive procedures)
[Figure: TX-to-RX lightpath]
Handling soft-failures
1. Early detection (When?)
  o «Predict» that the BER will exceed a threshold
  o Allows early activation of proactive procedures
2. Identification (Which element?)
  o e.g., filter misalignment or amplifier malfunctioning
  o Reduces the Time To Repair (TTR)
3. Localization of soft-failures (Where?)
  o e.g., which node/link along the path?
4. Magnitude estimation (How much?)
  o Triggers the proper reaction (e.g., device restart/reconfiguration, lightpath re-routing, in-field repair…)
Soft-failure early detection
• How can we predict soft-failures?
Continuously monitor the Bit Error Rate (BER) at the receiver… until some “anomalies” are detected.
Early detection helps prevent service disruption (e.g., through proactive network reconfiguration).
[Figure: BER vs. time at the RX of a TX-to-RX lightpath, with the intolerable-BER level and the detection, failure and reconfiguration instants marked]
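Early detection is framed above as predicting that the BER will go above a threshold. As a minimal, non-ML baseline (an assumption for illustration, not the method of the cited works), one can fit a straight line to log10(BER) over recent samples and extrapolate when it reaches the intolerable level. The function names, the threshold and the alarm horizon are hypothetical.

```python
import math


def predict_crossing_time(times, bers, ber_threshold):
    """Least-squares fit of log10(BER) = a + b*t; return the time at which the
    fitted trend reaches ber_threshold, or None if the BER is not increasing."""
    n = len(times)
    if n < 2:
        raise ValueError("at least two BER samples are needed")
    ys = [math.log10(b) for b in bers]
    t_mean = sum(times) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be equal")
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, ys)) / sxx
    intercept = y_mean - slope * t_mean
    if slope <= 0:
        return None  # flat or improving BER: no crossing predicted
    return (math.log10(ber_threshold) - intercept) / slope


def early_alarm(times, bers, ber_threshold, horizon):
    """Hypothetical alarm rule: fire if the predicted crossing falls within
    `horizon` time units of the latest sample (or has already happened)."""
    t_cross = predict_crossing_time(times, bers, ber_threshold)
    return t_cross is not None and t_cross - times[-1] <= horizon


times = [0, 1, 2, 3]                          # e.g., minutes
bers = [1e-6, 10**-5.5, 1e-5, 10**-4.5]       # BER grows by half a decade per minute
print(predict_crossing_time(times, bers, 1e-3))  # approximately 6
```

Working on log10(BER) turns an exponential degradation into a linear trend; a trained model, as in the cited works, would replace this rule when degradations are not monotonic.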
Soft-failure cause identification
• How can we identify the cause of the failure?
  – Failures can be caused by different sources:
    o Filter shrinking/misalignment
    o Excessive attenuation (e.g., due to amplifier malfunctioning)
    o Laser/photodetector malfunctioning
    o …
Different sources of failure can be distinguished via the different effects they have on the evolution of the BER (i.e., via different BER “features”).
[Figure: two panels of BER vs. time at the RX of a TX-to-RX lightpath, each with the intolerable-BER level marked]
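Cause identification amounts to classifying a vector of BER features into one failure source. The cited works use ANN, SVM and RF; as a much simpler stand-in (a swapped-in technique, not theirs), the sketch below uses a nearest-centroid classifier. The two labels mirror the degradation types emulated in the testbeds, while the feature values are invented toy numbers.

```python
import math


def fit_centroids(training):
    """training: dict mapping a failure-cause label to a list of feature vectors.
    Returns one centroid (per-feature mean) per label."""
    centroids = {}
    for label, vectors in training.items():
        dim = len(vectors[0])
        centroids[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return centroids


def identify_cause(centroids, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Toy, made-up feature vectors, e.g., [normalized mean, normalized std] of a BER window.
training = {
    "filtering": [[0.20, 0.05], [0.25, 0.04]],
    "attenuation": [[0.80, 0.01], [0.75, 0.02]],
}
centroids = fit_centroids(training)
print(identify_cause(centroids, [0.22, 0.05]))  # filtering
```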
Soft-failure localization
• How can we identify the location of the failure?
  – A single failure may affect multiple lightpaths
  – Combine the failure-cause information obtained on each lightpath with routing information
  – No need to monitor the entire network (monitors can be deployed only at the receivers)
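Combining per-lightpath failure-cause information with routing can be sketched as a set operation: a single failure must lie on every lightpath that reports it, and links also crossed by healthy lightpaths can be ruled out. This is a minimal heuristic under loud assumptions (one failed link, failures attributed to links rather than nodes, and a failed element degrading every lightpath that crosses it, which may not hold for per-channel filtering); the function and lightpath names are hypothetical.

```python
def localize_failure(routes, affected):
    """Return the set of candidate failed links.

    routes:   dict mapping lightpath id -> list of links traversed (node pairs)
    affected: set of lightpath ids whose receivers flagged the same failure cause
    """
    candidates = None
    for lp in affected:
        links = set(routes[lp])
        # A single failure must lie on every affected lightpath
        candidates = links if candidates is None else candidates & links
    if candidates is None:
        return set()
    for lp, links in routes.items():
        if lp not in affected:
            # Links also used by healthy lightpaths are ruled out
            candidates -= set(links)
    return candidates


routes = {
    "lp1": [("A", "B"), ("B", "C")],
    "lp2": [("B", "C"), ("C", "D")],
    "lp3": [("A", "B"), ("B", "E")],
}
print(localize_failure(routes, {"lp1", "lp2"}))  # {('B', 'C')}
```

Only receiver-side information is used, consistent with deploying monitors only at the receivers.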
Soft-failure magnitude estimation
• What is the failure magnitude (i.e., severity)?
  – Failures of different magnitude can affect the network differently
  – According to the severity, different actions can be triggered to solve the failure:
    o device restart/reconfiguration
    o lightpath re-routing
    o in-field repair…
[Figure: BER vs. time at the RX of a TX-to-RX lightpath, annotated with the actions “Reconfigure the device”, “Reset the device” and “Replace the device”]
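Once a magnitude has been estimated, it can be mapped to one of the actions listed above. The sketch assumes, as a hypothetical reading of the figure, that reconfiguration, reset and replacement correspond to increasing severity, and that severity is expressed in dB of extra attenuation; the band limits are illustrative values, not taken from the testbed.

```python
from bisect import bisect_left

# Hypothetical severity bands: upper bounds (dB of extra attenuation) of the
# first two bands; values are illustrative only.
BAND_UPPER_DB = [3.0, 6.0]
ACTIONS = ["reconfigure the device", "reset the device", "replace the device"]


def select_action(magnitude_db):
    """Map an estimated failure magnitude to a corrective action."""
    if magnitude_db <= 0:
        return "no action"
    # bisect_left gives 0 for (0, 3], 1 for (3, 6], 2 for values above 6
    return ACTIONS[bisect_left(BAND_UPPER_DB, magnitude_db)]


for m in (2, 4, 8):
    print(m, "dB ->", select_action(m))
```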
Failure management: Sources 1-2
1. F. Musumeci et al., “A Tutorial on Machine Learning for Failure Management in Optical Networks”, Journal of Lightwave Technology, vol. 37, n. 16, Aug. 2019
2. S. Shahkarami et al., “Machine-Learning-Based Soft-Failure Detection and Identification in Optical Networks,” in OFC Conference 2018, pp. M3A–5
• Objective of the paper(s): failure detection, cause identification and magnitude estimation in optical transmission systems
  – Input:
    o monitored BER
  – Output:
    o failure detection, cause identification and magnitude estimation
  – ML algorithms:
    o Artificial Neural Network (ANN)
    o Support Vector Machine (SVM)
    o Random Forest (RF)
Our study: Optical Network Failure Management (ONFM)
F. Musumeci et al., “A Tutorial on Machine Learning for Failure Management in Optical Networks”, Journal of Lightwave Technology, vol. 37, n. 16, Aug. 2019
Our study: window analysis
• BER window: two main optimization parameters
  – Window duration, W
  – BER sampling period, T_BER
• The ML algorithms are trained for different combinations of these two parameters
Features extracted:
- BER statistics:
  - mean
  - min/max
  - standard deviation
- Spectral components of the window after FFT
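The window analysis above can be sketched as two steps: cut the BER trace into windows of W / T_BER samples, then compute the listed features per window. The slides do not state which spectral components are kept, whether windows overlap, which standard-deviation estimator is used, or whether features are taken on BER or log10(BER); the sketch assumes non-overlapping windows, the population standard deviation and the magnitudes of the first few non-DC DFT coefficients. A direct DFT (same values an FFT returns) keeps it dependency-free; the function names are hypothetical.

```python
import cmath
import math
import statistics


def split_windows(trace, window_s, sampling_s):
    """Split a BER trace sampled every `sampling_s` seconds into consecutive,
    non-overlapping windows of duration `window_s` (W / T_BER samples each)."""
    n = int(window_s // sampling_s)
    if n < 1:
        raise ValueError("window must contain at least one sample")
    return [trace[i:i + n] for i in range(0, len(trace) - n + 1, n)]


def window_features(window, n_spectral=3):
    """[mean, min, max, std, |X_1|, ..., |X_n_spectral|] of one BER window.

    X_k is the k-th DFT coefficient, computed directly; the DC term X_0 is
    skipped because it equals n * mean."""
    n = len(window)
    spectrum = [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(window)))
        for k in range(1, n_spectral + 1)
    ]
    return [
        statistics.fmean(window),
        min(window),
        max(window),
        statistics.pstdev(window),  # population standard deviation
    ] + spectrum


windows = split_windows(list(range(10)), window_s=4, sampling_s=2)  # 2 samples per window
print(len(windows), window_features([0, 1, 0, 1]))
```

With a longer T_BER, the same W yields fewer samples per window, which is why W and T_BER are optimized jointly.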
Our study: failure detection
[Figure: ONFM workflow diagram; visible block label: “2. Failure Identification”]
Our study: failure identification
[Figure: ONFM workflow diagram; visible block labels: “1. Failure Detection”, “3a. Failure Magnitude Estimation (Atten.)”, “3b. Failure Magnitude Estimation (Filtering)”]
Our study: failure magnitude estimation
[Figure: ONFM workflow diagram; visible block label: “2. Failure Identification”]
Testbed setup (1)
• Testbed for real BER traces
  – Ericsson 80 km transmission system
    o 24 hours of BER monitoring
    o 2-second sampling interval
  – PM-QPSK modulation @ 100 Gb/s
  – 2 Erbium-Doped Fiber Amplifiers (EDFAs) followed by Variable Optical Attenuators (VOAs, not shown)
  – A Bandwidth-Variable Wavelength Selective Switch (BV-WSS) is used to emulate 2 types of BER degradation:
    o Filter misalignment (Filtering)
    o Additional attenuation in an intermediate span, due to EDFA gain reduction (Attenuation)
  – Different failure magnitudes:
    o Filtering: from 50 GHz down to 26 GHz in steps of 2 GHz
    o Attenuation: from 0 to 10 dB of additional attenuation in steps of 1 dB
F. Musumeci et al., “A Tutorial on Machine Learning for Failure Management in Optical Networks”, Journal of Lightwave Technology, vol. 37, n. 16, Aug. 2019
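The testbed figures translate directly into dataset sizes. The sketch below counts the BER samples in one 24-hour trace at T_BER = 2 s and the number of magnitude settings per degradation type; treating each setting as a separate class, and the 60 s window used in the usage line, are hypothetical choices for illustration.

```python
# Testbed (1) figures from the slide: 24 h of monitoring, 2 s sampling,
# filtering 50 -> 26 GHz in 2 GHz steps, attenuation 0 -> 10 dB in 1 dB steps.
TRACE_DURATION_S = 24 * 3600
T_BER_S = 2

samples_per_trace = TRACE_DURATION_S // T_BER_S  # 43200 BER samples per trace
filter_bandwidths_ghz = list(range(50, 25, -2))   # 50, 48, ..., 26
extra_attenuations_db = list(range(0, 11))        # 0, 1, ..., 10


def windows_per_trace(window_s, t_ber_s=T_BER_S, duration_s=TRACE_DURATION_S):
    """Number of non-overlapping windows of duration window_s in one trace."""
    samples_per_window = window_s // t_ber_s
    if samples_per_window < 1:
        raise ValueError("window shorter than the sampling period")
    return (duration_s // t_ber_s) // samples_per_window


print(samples_per_trace, len(filter_bandwidths_ghz), len(extra_attenuations_db))  # 43200 13 11
print(windows_per_trace(60))  # hypothetical W = 60 s
```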
Results
Take-away 1: Accuracy always increases with the window duration
Take-away 2: Detection (finding anomalies) is accurate even with short windows
Take-away 3: More complex tasks (e.g., failure-cause identification) require more BER information (longer windows) to reach sufficient accuracy
F. Musumeci et al., “A Tutorial on Machine Learning for Failure Management in Optical Networks”, Journal of Lightwave Technology, vol. 37, n. 16, Aug. 2019
Testbed setup (2)
• Testbed for real BER traces
  – Ericsson 380 km transmission system
    o 24 hours of BER monitoring
    o 3-second sampling interval
  – PM-QPSK modulation @ 100 Gb/s
  – 6 Erbium-Doped Fiber Amplifiers (EDFAs) followed by Variable Optical Attenuators (VOAs)
  – A Bandwidth-Variable Wavelength Selective Switch (BV-WSS) is used to emulate 2 types of BER degradation:
    o Filter misalignment
    o Additional attenuation in an intermediate span (e.g., due to EDFA gain reduction)
[Figure: testbed with TX, RX, two BV-WSSs (1 and 2), six EDFAs (E1 to E6) and five fiber spans of 60, 80, 80, 80 and 80 km]
S. Shahkarami et al., “Machine-Learning-Based Soft-Failure Detection and Identification in Optical Networks,” in OFC Conference 2018, pp. M3A–5
Numerical results: Detection accuracy vs. window features
• Binary SVM
Take-away 1: Performance is higher with a shorter sampling period, so fast monitoring equipment is required
Take-away 2: As the sampling period increases, longer windows are needed for high accuracy
S. Shahkarami et al., “Machine-Learning-Based Soft-Failure Detection and Identification in Optical Networks,” in OFC Conference 2018, pp. M3A–5
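Detection here is a binary SVM (failure vs. normal window). A minimal sketch, assuming a linear kernel and one feature per window (e.g., one of the window statistics listed earlier), trains the SVM with Pegasos-style subgradient descent on the L2-regularized hinge loss in plain Python. The training values are synthetic toy numbers, not testbed measurements, and this is not the implementation used in the cited work.

```python
def train_linear_svm(X, y, lam=0.01, epochs=500):
    """Pegasos-style subgradient descent on the L2-regularized hinge loss.

    X: list of feature vectors; y: labels in {-1, +1} (+1 = failure window).
    A constant 1.0 is appended to every vector, so the bias is learned
    (and regularized) as an ordinary weight.
    """
    if not X:
        raise ValueError("empty training set")
    Xa = [list(x) + [1.0] for x in X]
    w = [0.0] * len(Xa[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(Xa, y):  # fixed visiting order keeps the sketch deterministic
            t += 1
            eta = 1.0 / (lam * t)
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization (shrink) step
            if margin < 1.0:                          # hinge-loss subgradient step
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w


def predict(w, x):
    """Return +1 (failure) or -1 (normal) for one feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1


# Synthetic windows: low feature value = normal (-1), high = degrading (+1)
X = [[0.1], [0.9], [0.2], [0.8]]
y = [-1, 1, -1, 1]
w = train_linear_svm(X, y)
print([predict(w, x) for x in X])
```

In practice a library SVM (e.g., scikit-learn's `SVC`) would replace this loop; the sketch only makes the decision rule explicit.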
Numerical results: Identification accuracy vs. window features
• Neural Network
Take-away 3: Failure-cause identification requires a much shorter sampling period than failure detection
S. Shahkarami et al., “Machine-Learning-Based Soft-Failure Detection and Identification in Optical Networks,” in OFC Conference 2018, pp. M3A–5