Challenges in Making Tomography Practical
Yiyi Huang, Georgia Tech
Nick Feamster, Georgia Tech
Renata Teixeira, LIP6
Christophe Diot, Thomson
Problem
• Network operators need to detect and isolate faults quickly, before customers complain
• Plenty of existing alarms
– SNMP traps
– Active probes
– Anomaly detection systems
• Unfortunately, these alarms do not help operators locate and eliminate the faults that cause problems on end-to-end paths
Network Tomography to the Rescue
• Send end-to-end probes through the network
• Monitor paths for differences in reachability
• Infer location of reachability problem from these differences
[Figure: a monitor sending probes to targets; elements labeled x and y]
Some Problems
• Scalability vs. speed: Detection must be fast
• Ambiguity: Losses are one-way, but operators do not always have access to both ends of the path
• Lack of synchronization: Different monitors see different conditions
• Dynamics: Topology can change, loss can be transient
Doppler: Making Tomography Practical
• Fast, scalable detection
– Solution: Monitor selection algorithm to reduce the number of monitors and targets so that “cycle times” are fast
• Transient packet loss
– Solution: Triggered confirmation of failed paths
• One-way losses
– Solution: New algorithm based on IP spoofing
• Dynamic routing
– Solution: Periodic snapshots of the network topology
Controlled evaluation on VINI, plus limited wide-area experiments.
Fast, Scalable Detection
• Select monitors, targets to satisfy two conditions
– All interfaces are “covered” (or diagnosable)
– The number of monitors is small enough to ensure a short round time
• Two goals
– Coverage: When a failure occurs, system detects it (every interface is covered by at least one path)
– Diagnosability: When a failure occurs, system locates it (every interface is covered by a unique set of paths)
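The two goals above can be checked mechanically for a given set of probed paths. The following is a minimal sketch, not the Doppler implementation: the function names, the dictionary representation of paths, and the toy topology are all assumptions made for illustration.

```python
from collections import defaultdict

def path_signatures(paths):
    """Map each interface to the frozenset of path names that traverse it."""
    sig = defaultdict(set)
    for name, interfaces in paths.items():
        for iface in interfaces:
            sig[iface].add(name)
    return {iface: frozenset(names) for iface, names in sig.items()}

def is_covered(paths, interfaces):
    """Coverage: every interface lies on at least one probed path."""
    sig = path_signatures(paths)
    return all(iface in sig for iface in interfaces)

def is_diagnosable(paths, interfaces):
    """Diagnosability: every interface is covered by a unique set of paths."""
    sig = path_signatures(paths)
    if not all(iface in sig for iface in interfaces):
        return False
    signatures = [sig[iface] for iface in interfaces]
    return len(set(signatures)) == len(signatures)

# Hypothetical topology: three interfaces, two probed paths.
interfaces = ["a", "b", "c"]
two_paths = {"p1": ["a", "b"], "p2": ["b", "c"]}
one_path = {"p1": ["a", "b", "c"]}

print(is_covered(two_paths, interfaces), is_diagnosable(two_paths, interfaces))
print(is_covered(one_path, interfaces), is_diagnosable(one_path, interfaces))
```

With the single long path, a failure is detected but cannot be pinned to one interface, because all three interfaces share the same set of covering paths.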
Offline Path Selection: Diagnosability
• Step 1: Compute the set of paths that cover all interfaces (greedy set cover heuristic)
• Step 2: Compute hitting set for each interface
• Step 3: Build equivalence classes for interfaces with common hitting set
– For each interface in a set with more than one interface, find path that crosses only that interface
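The three steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: "hitting set" is read here as the set of selected paths that cross an interface, the refinement step uses a looser rule than the slide (it adds any unselected path that splits an ambiguous class, rather than one crossing exactly one interface), and all names and the toy data are hypothetical.

```python
from collections import defaultdict

def greedy_cover(candidates, interfaces):
    """Step 1: repeatedly pick the path covering the most uncovered interfaces."""
    uncovered = set(interfaces)
    chosen = []
    while uncovered:
        best = max(sorted(candidates),
                   key=lambda p: len(uncovered & set(candidates[p])))
        gained = uncovered & set(candidates[best])
        if not gained:
            break  # remaining interfaces are on no candidate path
        chosen.append(best)
        uncovered -= gained
    return chosen

def equivalence_classes(candidates, chosen, interfaces):
    """Steps 2-3: group interfaces by the set of chosen paths that cross them."""
    classes = defaultdict(list)
    for iface in interfaces:
        hitting = frozenset(p for p in chosen if iface in candidates[p])
        classes[hitting].append(iface)
    return list(classes.values())

def refine(candidates, chosen, interfaces):
    """Add paths until no class with several interfaces can be split further."""
    chosen = list(chosen)
    progress = True
    while progress:
        progress = False
        for group in equivalence_classes(candidates, chosen, interfaces):
            if len(group) < 2:
                continue
            for p in sorted(candidates):
                hit = set(group) & set(candidates[p])
                # A useful path crosses some, but not all, of the group.
                if p not in chosen and hit and hit != set(group):
                    chosen.append(p)
                    progress = True
                    break
            if progress:
                break
    return chosen

# Hypothetical candidate paths over interfaces a-d.
candidates = {"p1": ["a", "b", "c"], "p2": ["a"], "p3": ["b", "d"], "p4": ["c"]}
interfaces = ["a", "b", "c", "d"]

cover = greedy_cover(candidates, interfaces)
selected = refine(candidates, cover, interfaces)
print(cover, selected)
```

In this toy case the greedy cover alone leaves interfaces `a` and `c` indistinguishable; adding one more path separates them.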
Detection, Confirmation, Correlation
• Periodic (once per 5 minutes) topology snapshot from all monitors to all destinations keeps track of underlying topology before the failure
• Detection: Periodic probes (once per “cycle time”) detect failure
• Confirmation: When a probe is lost, the monitor sends three additional probes. If all three are lost, path is determined to have failed.
• Correlation: Paths that fail within 10 seconds of one another are grouped.
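The confirmation and correlation rules above can be sketched as follows. This is a sketch under assumptions: `send_probe` is a hypothetical callable that returns `True` when a reply arrives, and "within 10 seconds of one another" is interpreted as chaining, so a failure joins a group if it occurs within the window of the previous failure in that group.

```python
def confirm_failure(send_probe, path, retries=3):
    """Return True if the first probe and all follow-up probes are lost."""
    if send_probe(path):
        return False  # first probe answered: nothing to confirm
    # any() stops at the first successful retry, which cancels the alarm.
    return not any(send_probe(path) for _ in range(retries))

def correlate(failures, window=10.0):
    """Group (timestamp, path) failures that occur close together in time."""
    groups = []
    for ts, path in sorted(failures):
        if groups and ts - groups[-1][-1][0] <= window:
            groups[-1].append((ts, path))
        else:
            groups.append([(ts, path)])
    return groups

def scripted_probe(outcomes):
    """Test helper (hypothetical): replay a fixed sequence of probe results."""
    it = iter(outcomes)
    return lambda path: next(it)

persistent = confirm_failure(scripted_probe([False, False, False, False]), "m1->t1")
transient = confirm_failure(scripted_probe([False, False, True]), "m1->t1")

failures = [(0.0, "m1->t1"), (4.0, "m2->t1"), (12.0, "m1->t2"), (40.0, "m3->t4")]
groups = correlate(failures)
print(persistent, transient, [len(g) for g in groups])
```

A stricter reading would require every failure in a group to fall within 10 seconds of the first one; that changes only the comparison inside `correlate`.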
Disambiguating One-Way Losses: Spoofing
• Monitor sends request to spoofer to send probe
• Probe carries the IP address of the monitor as its source address
• If reply reaches the monitor, reverse path is working
[Figure: monitor M, target T, and a spoofer that sends a spoofed packet with source address of M]
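The reasoning behind the spoofed probe can be written as a small decision function. This is a sketch of the inference only, under the assumptions that the failure persists across both probes, that the target answers probes, and that the spoofer-to-target path may itself fail; the function and constant names are hypothetical, and no packets are sent.

```python
PATH_OK = "forward and reverse paths working"
FORWARD_FAILED = "reverse path working, so the forward path has failed"
AMBIGUOUS = "no reply to either probe: direction cannot be determined"

def diagnose(direct_reply, spoofed_reply):
    """Classify a path from the outcome of a direct and a spoofed probe."""
    if direct_reply:
        return PATH_OK
    if spoofed_reply:
        # The target's reply to the spoofed probe travelled T -> M,
        # so the reverse path works and the loss was on M -> T.
        return FORWARD_FAILED
    return AMBIGUOUS

def simulate(forward_ok, reverse_ok, spoofer_ok=True):
    """Model which replies monitor M would see for a given path state."""
    direct_reply = forward_ok and reverse_ok   # M -> T -> M
    spoofed_reply = spoofer_ok and reverse_ok  # spoofer -> T -> M
    return direct_reply, spoofed_reply

print(diagnose(*simulate(forward_ok=False, reverse_ok=True)))
print(diagnose(*simulate(forward_ok=True, reverse_ok=False)))
```

When neither reply arrives, the monitor cannot tell a reverse-path failure from a failure between the spoofer and the target, which is why the last case stays ambiguous in this sketch.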
Identification: NetDiagnoser
• Binary network tomography algorithm [Dhamdhere et al.]
• Input: hosts, destinations, topology before the failure
• Output: Set of possible locations for the fault
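To illustrate the input and output listed above, here is a generic binary tomography sketch. It is not NetDiagnoser itself: it uses the common simplification that any link on a working path is healthy, then greedily picks the links that explain the most failed paths. The function name, the path names, and the link labels are hypothetical.

```python
def locate_faults(paths, failed):
    """Return a small set of links that explains every failed path.

    paths:  dict mapping path name -> list of links (pre-failure topology)
    failed: set of path names observed to have failed
    """
    # Links seen on any working path are assumed healthy.
    working_links = set()
    for name, links in paths.items():
        if name not in failed:
            working_links.update(links)

    suspects = {name: set(paths[name]) - working_links for name in failed}
    # Failed paths with no suspect link are inconsistent and are skipped.
    unexplained = {name for name in failed if suspects[name]}

    hypothesis = []
    while unexplained:
        counts = {}
        for name in unexplained:
            for link in suspects[name]:
                counts[link] = counts.get(link, 0) + 1
        # Pick the link shared by the most unexplained failed paths.
        best = max(sorted(counts), key=counts.get)
        hypothesis.append(best)
        unexplained = {n for n in unexplained if best not in suspects[n]}
    return hypothesis

# Hypothetical topology: two monitors, two targets, five links.
paths = {
    "m1->t1": ["l1", "l2", "l3"],
    "m1->t2": ["l1", "l4"],
    "m2->t1": ["l5", "l3"],
    "m2->t2": ["l5", "l4"],
}
fault = locate_faults(paths, failed={"m1->t1", "m2->t1"})
print(fault)
```

Both failed paths share link `l3`, and every other link they use either appears on a working path or explains fewer failures, so the sketch reports `l3` as the likely location.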
Evaluation of Detection Algorithms
• Controlled experiments on the VINI testbed
– Emulated copy of Abilene network on wide-area paths
– Probing strategy emulates the paths that would be probed in monitor selection algorithm
– Compare reduced set of paths to “aggressive” measurement approach
• Varied failure location and duration
– Duration varied from 5 to 80 seconds
– Test repeated for each failed link
• Measure detection and false alarm rates
• Preliminary experiments using data from real-world networks
Detection: Scale and Speed
• Compute reduction in the number of paths required to achieve coverage and diagnosability
– Reduction from about 27,000 paths to 151 paths
• For real-world networks, compute corresponding reduction in cycle time
– Reduction from about 3.5 minutes to < 5 seconds
Single-Link Failures
• More selective probing identifies more of the shorter link failures (due to shorter cycle time)
• Also results in fewer false alarms
Single-Node Failures
• Similar results to single-link failures
– Selective measurements result in faster detection, fewer false alarms
Does Failure Confirmation Reduce the Total Number of Alarms?
• Confirmation reduces the number of reported failures by > 35%
• Correlation further reduces the number of alarms (by about a factor of 10)
How Quickly can Doppler Identify Failures?
• Answer: Roughly 20 seconds using the reduced set of paths
• Two main components
– Detection/Confirmation: Time from when failure was injected to the time Doppler could detect and confirm the failure
– Correlation: Time to group failures and construct reachability matrix
Detection and Confirmation Delay
Most failures are detected within 3-5 seconds
Correlation Delay
Reducing the number of paths to probe significantly reduces total correlation time
Summary
• Making tomography practical is challenging
– Asynchronous measurements
– Scale and speed
– Changing topologies
– Ambiguity about forward and reverse paths
• Doppler: Set of techniques to address many of these problems
• Current analysis is still performed offline
– Many additional challenges remain to coordinate online measurements