Impact of Configuration Errors on DNS Robustness
V. Pappas *, Z. Xu *, S. Lu *, D. Massey **, A. Terzis ***, L. Zhang *
* UCLA, ** Colorado State, *** Johns Hopkins
Motivation
• DNS: part of the Internet core infrastructure
– Applications: web, e-mail, e164, CDNs …
• DNS: considered a very reliable system
– Works almost always
• Question: is DNS a robust system?
– User-perceived robustness
– System robustness
Are they the same?
Motivation
Short Answer:
“Microsoft's websites were offline for up to 23 hours -- the most dramatic snafu to date on the Internet -- because of an equipment misconfiguration”
-- Wired News, Jan 2001
– Thousands or even millions of users affected
– All due to a single DNS configuration error
Related Work
• Traffic & implementation error studies:
– Danzig et al. [SIGCOMM92]: bugs
– CAIDA: traffic & bugs
• Performance studies:
– Jung et al. [IMW01]: caching
– Cohen et al. [SAINT01]: proactive caching
– Liston et al. [IMW02]: diversity
• Server availability:
– To appear [OSDI04, IMC04]
Our Work: Study DNS Robustness
• Classify DNS operational errors:
– Study known errors
– Identify new types of errors
• Measure their pervasiveness
• Quantify their impact on DNS
– availability
– performance
Outline
• DNS Overview
• Measurement Methodology
• DNS Configuration Errors
– Example Cases
– Measurement Results
• Discussion & Summary
Background
[Figure: DNS name tree: TLDs net, com, uk, ca, jp; foo under com; buz and bar under foo; bar1, bar2, bar3 under bar]
Zone:
– Occupies a contiguous subspace
– Served by the same nameservers
bar zone, name servers:
bar.foo.com. NS ns1.bar.foo.com.
bar.foo.com. NS ns3.bar.foo.com.
bar.foo.com. NS ns2.bar.foo.com.
bar zone, resource records:
bar.foo.com. MX mail.bar.foo.com.
www.bar.foo.com. A 10.10.10.10
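[Figure: a client asks a caching server for www.bar.foo.com; the caching server queries the root, com, foo, and bar zones in turn]
client to caching server: asking for www.bar.foo.com
root zone: referral: com NS RRs, com A RRs
com zone: referral: foo NS RRs, foo A RRs
foo zone: referral: bar NS RRs, bar A RRs
bar zone: answer: www.bar.foo.com A 10.10.10.10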
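The referral walk above can be sketched with a toy in-memory model. This is only an illustration: the `ZONES` table and the `resolve` function are hypothetical, and real resolvers also handle caching, timeouts, and multiple servers per zone.

```python
# Toy model of the walk in the figure. The ZONES table is an illustrative
# assumption, not real DNS data: each zone either delegates to a child zone
# (a referral) or holds the record being asked for (an answer).
ZONES = {
    ".":            {"delegations": ["com."], "records": {}},
    "com.":         {"delegations": ["foo.com."], "records": {}},
    "foo.com.":     {"delegations": ["bar.foo.com."], "records": {}},
    "bar.foo.com.": {"delegations": [],
                     "records": {"www.bar.foo.com.": "10.10.10.10"}},
}


def resolve(name, zones=ZONES):
    """Follow referrals from the root until a zone answers.

    Returns (address or None, list of zones visited).
    """
    zone = "."
    path = [zone]
    while True:
        data = zones[zone]
        if name in data["records"]:
            return data["records"][name], path  # final answer
        # Referral: pick a delegated child zone that is a suffix of the name.
        child = next(
            (d for d in data["delegations"]
             if name == d or name.endswith("." + d)),
            None,
        )
        if child is None:
            return None, path  # neither an answer nor a referral
        zone = child
        path.append(zone)


address, path = resolve("www.bar.foo.com.")
print(address, path)
```

The visited path mirrors the figure: root, com, foo, and finally the bar zone, which returns the A record.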
Infrastructure RRs
com zone (parent):
foo.com. NS ns1.foo.com.
foo.com. NS ns2.foo.com.
foo.com. NS ns3.foo.com.
ns1.foo.com. A 1.1.1.1
ns2.foo.com. A 2.2.2.2
ns3.foo.com. A 3.3.3.3
foo.com zone (child):
foo.com. NS ns1.foo.com.
foo.com. NS ns2.foo.com.
foo.com. NS ns3.foo.com.
ns1.foo.com. A 1.1.1.1
ns2.foo.com. A 2.2.2.2
ns3.foo.com. A 3.3.3.3
• NS Resource Record:
– Provides the names of a zone’s authoritative servers
– Stored both at the parent and at the child zone
• A Resource Record:
– Associated with an NS resource record
– Stored at the parent zone (glue A record)
What Affects DNS Availability
• Name Servers:
– Software failures
– Network failures
– Scheduled maintenance tasks
• Infrastructure Resource Records:
– Availability of these records
– Configuration errors
focus of our work
Classification of Measured Errors
Inconsistency: the configuration of infrastructure RRs does not correspond to the actual authoritative name-servers.
– Lame Delegation
– Delegation Inconsistency
Dependency: two or more name-servers share a common point of failure.
– Diminished Redundancy
– Cyclic Dependency
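As an illustration of the inconsistency class, the check below compares the NS set held at the parent with the one held at the child. It is a minimal sketch that assumes "delegation inconsistency" means the two sets differ; the function name and the example server names are hypothetical.

```python
def delegation_inconsistency(parent_ns, child_ns):
    """Compare the NS names stored at the parent with those at the child.

    Returns the names that appear on only one side; both lists are empty
    when the two zones agree. DNS names are case-insensitive, so names are
    lowercased before comparison.
    """
    parent = {name.lower() for name in parent_ns}
    child = {name.lower() for name in child_ns}
    return {
        "only_at_parent": sorted(parent - child),
        "only_at_child": sorted(child - parent),
    }


diff = delegation_inconsistency(
    ["ns1.foo.com.", "ns2.foo.com."],   # NS set held at the parent (com)
    ["NS1.foo.com.", "ns3.foo.com."],   # NS set held at the child (foo.com)
)
print(diff)
```

A server listed only at the parent is also a candidate for a lame delegation, since the child zone no longer names it as authoritative.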
What is Measured?
• Frequency of configuration errors:
– System parameters: TLDs, DNS level, zone size (i.e. the number of delegations)
• Impact on availability:
– Number of servers: lost due to these errors
– Zone’s availability: probability of resolving a name
• Impact on performance:
– Total time to resolve a query
    • Starting from the query issuing time
    • Finishing at the query final answer time
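The slides define a zone's availability as the probability of resolving a name but do not say how it is computed. One simple model, stated here as an assumption, treats a name as resolvable when at least one authoritative server is up and treats server failures as independent. The function name and the example values are hypothetical.

```python
import math


def zone_availability(server_availabilities):
    """Probability that at least one server is reachable.

    Assumes independent failures (an assumption of this sketch, not a claim
    from the slides): the zone is unavailable only if every server is down.
    """
    all_down = math.prod(1.0 - a for a in server_availabilities)
    return 1.0 - all_down


# Two independent servers, each up 99% of the time.
print(zone_availability([0.99, 0.99]))
# A lame server contributes nothing, so losing one of the two leaves only 0.99.
print(zone_availability([0.99]))
```

Under this model, servers that share a point of failure should not be counted as independent, which is why the dependency errors reduce effective redundancy.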
Measurement Methodology
• Error frequency and availability impact:
– 3 sets of active measurements
    • Random set of 50K zones
    • 20K zones that allow zone transfers
    • 500 popular zones
• Performance impact:
– 2 sets of passive measurements: 1-week DNS packet traces
Lame Delegation
[Figure: com delegates foo to the servers A.foo.com and B.foo.com]
foo.com. NS A.foo.com.
foo.com. NS B.foo.com.
A.foo.com. A 1.1.1.1
B.foo.com. A 2.2.2.2
1) Non-existing server -- 3 seconds perf. penalty
2) DNS error code -- 1 RTT perf. penalty
3) Useless referral -- 1 RTT perf. penalty
4) Non-authoritative answer (cached)
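The penalties listed above can be turned into a rough resolution-time estimate. The sketch below assumes a resolver that tries servers one at a time and moves on after each lame response. That ordering, the function name, and the labels are illustrative assumptions. Case 4 is left out because the slide gives no penalty for it.

```python
TIMEOUT_SECONDS = 3.0  # penalty for a non-existing server, per the slide


def resolution_time(servers, rtt):
    """Estimate the time to get an answer when trying servers in order.

    `servers` is a list of labels: "ok", "non_existing_server",
    "error_code", or "useless_referral". Returns None if no server answers.
    """
    total = 0.0
    for kind in servers:
        if kind == "ok":
            return total + rtt            # one round trip for the answer
        if kind == "non_existing_server":
            total += TIMEOUT_SECONDS      # wait for the timeout
        elif kind in ("error_code", "useless_referral"):
            total += rtt                  # one wasted round trip
        else:
            raise ValueError(f"unknown server kind: {kind}")
    return None                           # every listed server was lame


print(resolution_time(["ok"], rtt=0.05))
print(resolution_time(["non_existing_server", "ok"], rtt=0.05))
```

With a 50 ms round trip, one non-existing server in front of a working one raises the estimate from 0.05 to about 3.05 seconds.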
Lame Delegation Results
[Figures: lame delegation measurement plots; annotated values: 0.06 sec, 0.4 sec, 3 sec, 50%]
• Error Frequency:
– 15% of the zones
– 8% for the 500 most popular zones
– independent of the zone’s size, varies a lot per TLD
• Impact:
– 70% of the zones with errors lose half or more of the authoritative servers
– 8% of the queries experience increased response times (up to an order of magnitude) due to lame delegation
Diminished Server Redundancy
[Figure: com delegates foo to the servers A.foo.com and B.foo.com]
foo.com. NS A.foo.com.
foo.com. NS B.foo.com.
A.foo.com. A 1.1.1.1
B.foo.com. A 2.2.2.2
A) Network level: servers belong to the same subnet
B) Autonomous system level: servers belong to the same AS
C) Geographic location level: servers belong to the same city
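The network-level check can be sketched with Python's standard `ipaddress` module: group the server addresses by prefix and see whether they all fall into one group. The function names are hypothetical, and the /24 default matches the prefix length used in the results. AS-level and city-level checks would need external data and are not shown.

```python
import ipaddress
from collections import defaultdict


def group_by_prefix(addresses, prefix_len=24):
    """Group IPv4 addresses by the subnet of the given prefix length."""
    groups = defaultdict(list)
    for addr in addresses:
        # strict=False lets us pass a host address and get its network back.
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[str(net)].append(addr)
    return dict(groups)


def all_in_same_subnet(addresses, prefix_len=24):
    """True when every server address falls into a single subnet."""
    return len(group_by_prefix(addresses, prefix_len)) == 1


print(all_in_same_subnet(["1.1.1.1", "2.2.2.2"]))      # different /24s
print(all_in_same_subnet(["10.0.0.1", "10.0.0.200"]))  # same /24
```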
Diminished Server Redundancy Results
• Error Frequency:
– 45% of all zones have all servers in the same /24 subnet
– 75% of all zones have servers in the same AS
– large & popular zones: better AS and geo diversity
• Impact:
– less than 99.9% availability: all servers in the same /24 subnet
– more than 99.99% availability: 3 servers at different ASes or different cities
Cyclic Zone Dependency (1)
[Figure: com delegates foo to the servers A.foo.com and B.foo.com]
com zone:
foo.com. NS A.foo.com.
foo.com. NS B.foo.com.
A.foo.com. A 1.1.1.1
The A glue RR for B.foo.com is missing
foo zone:
B.foo.com. A 2.2.2.2
B.foo.com depends on A.foo.com
If A.foo.com is unavailable then B.foo.com is too
Cyclic Zone Dependency (2)
[Figure: com delegates foo to the servers A.foo.com and B.bar.com, and bar to the servers A.bar.com and B.foo.com]
foo.com. NS A.foo.com.
foo.com. NS B.bar.com.
A.foo.com. A 1.1.1.1
bar.com. NS A.bar.com.
bar.com. NS B.foo.com.
A.bar.com. A 2.2.2.2
The foo.com zone seems correctly configured
The combination of the foo.com and bar.com zones is wrongly configured
The B servers depend on the A servers
If A.foo and A.bar are unavailable, the B addresses are unresolvable
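Cyclic dependencies of this kind can be found by building a zone-to-zone dependency graph and looking for cycles. The sketch below is one possible approach, not the measurement tool behind these slides. It assumes a server without a glue A record must be resolved through the zone that contains its name; the data layout and function names are hypothetical.

```python
def home_zone(server, zones):
    """Return the most specific known zone that contains the server name."""
    matches = [z for z in zones if server == z or server.endswith("." + z)]
    return max(matches, key=len, default=None)


def zone_dependencies(zones):
    """Map each zone to the zones needed to resolve its glue-less servers."""
    deps = {}
    for zone, cfg in zones.items():
        deps[zone] = set()
        for server in cfg["ns"]:
            if server in cfg["glue"]:
                continue  # address is supplied by the parent, no dependency
            home = home_zone(server, zones)
            if home is not None:
                deps[zone].add(home)
    return deps


def zones_in_cycles(deps):
    """Return the zones that can reach themselves in the dependency graph."""
    def reachable(start):
        seen, stack = set(), list(deps[start])
        while stack:
            zone = stack.pop()
            if zone not in seen:
                seen.add(zone)
                stack.extend(deps.get(zone, ()))
        return seen
    return sorted(z for z in deps if z in reachable(z))


# The two-zone example from the slide: each B server lives in the other zone.
zones = {
    "foo.com.": {"ns": ["A.foo.com.", "B.bar.com."],
                 "glue": {"A.foo.com.": "1.1.1.1"}},
    "bar.com.": {"ns": ["A.bar.com.", "B.foo.com."],
                 "glue": {"A.bar.com.": "2.2.2.2"}},
}
print(zones_in_cycles(zone_dependencies(zones)))
```

The single-zone case from the previous slide shows up as a self-loop: a zone whose glue-less server sits inside the zone itself depends on itself.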
Cyclic Zone Dependency Results
• Error Frequency:
– 2% of the zones
– None of the 500 most popular zones
• Impact:
– 90% of the zones with cyclic dependency errors lose 25% (or even more) of their servers
– 2 or 4 zones are involved in most errors
Discussion: User-Perceived != System Robustness
• User-perceived robustness:
– Data replication: only one server is needed
– Data caching: temporarily masks infrastructure failures
– Popular zones: fewer configuration errors
• System robustness:
– Fewer available servers: due to inconsistency errors
– Fewer redundant servers: due to dependency errors
Discussion: Why so many errors?
• Superficially: due to operators:
– Unaware of these errors
– Lack of coordination
    • parent-child zone, secondary servers hosting
• Fundamentally: due to protocol design:
– Lack of mechanisms to handle these errors
    • proactively or reactively
– Design choices that embrace some of them:
    • Name-servers are recognized with names
    • Glue NS & A records necessary to set up the DNS tree
Summary
• DNS operational errors are widespread
• DNS operational errors affect availability:
– 50% of the servers lost
– less than 99.9% availability
• DNS operational errors affect performance:
– 1 or even 2 orders of magnitude
• DNS system robustness lower than user perception
– Due to protocol design, not just due to operator errors
Ongoing Work
• Reactive mechanisms:
– DNS Troubleshooting [NetTs 04]
• Proactive mechanisms:
– Enhancing DNS replication & caching
Thank You!!!