
Notes on the Domain Name System

http://cr.yp.to/djbdns/notes.html

D. J. Bernstein
Internet publication
djbdns


If you've seen my reference manuals on Internet mail, the Internet mail message header format, SMTP,
and FTP, then you might be expecting something similarly comprehensive for DNS. This isn't it. Sorry.


Trusted servers
When a DNS cache---a ``full-service resolver'' under RFC 1123---wants the address of www.w3.org, it
may contact the w3.org DNS servers, the org DNS servers, or the root DNS servers.

For example, as of January 2001, one of the w3.org DNS servers is w3csun1.cis.rl.ac.uk.
This server has the power to define the address of www.w3.org. It can flood the other servers to
prevent them from providing contradictory information.

When the cache wants the address of w3csun1.cis.rl.ac.uk, it may contact the rl.ac.uk DNS
servers, the ac.uk DNS servers, the uk DNS servers, or the root DNS servers. For example,
ns.eu.net, one of the ac.uk DNS servers, has the power to define the address of
w3csun1.cis.rl.ac.uk. Consequently it also has the power to define the address of www.w3.org.

Similarly, all names under eu.net, hence ac.uk and w3.org, are controlled by
sunic.sunet.se; all names under sunet.se, hence eu.net and ac.uk and w3.org, are controlled by
beer.pilsnet.sunet.se; and beer.pilsnet.sunet.se is running an ancient version of BIND,
known to allow anyone on the Internet to take over the machine.

Are the www.w3.org administrators aware that their DNS service relies on
beer.pilsnet.sunet.se and 200 other obscure computers around the world?

In contrast, if w3.org had used in-bailiwick names for its servers, such as a.ns.w3.org and
b.ns.w3.org and c.ns.w3.org and d.ns.w3.org, then it would not be relying on the servers for
ac.uk and eu.net and sunet.se.
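
The chain of dependencies can be made concrete with a small sketch. The NS table below is a hypothetical, heavily abbreviated extract built only from the names mentioned in this section (the real delegations involved many more servers), and the function names zones_of and trusted_servers are illustrative.

```python
# Hypothetical, abbreviated delegation data: zone -> names of its DNS servers.
NS = {
    "w3.org": ["w3csun1.cis.rl.ac.uk"],
    "ac.uk": ["ns.eu.net"],
    "eu.net": ["sunic.sunet.se"],
    "sunet.se": ["beer.pilsnet.sunet.se"],
}

def zones_of(name):
    """Return the zones in NS that contain name (the name itself or a parent)."""
    labels = name.split(".")
    suffixes = [".".join(labels[i:]) for i in range(len(labels))]
    return [s for s in suffixes if s in NS]

def trusted_servers(name):
    """Collect every server name that can influence the address of name."""
    seen, todo = set(), [name]
    while todo:
        current = todo.pop()
        for zone in zones_of(current):
            for server in NS[zone]:
                if server not in seen:
                    seen.add(server)
                    todo.append(server)   # the server's own name must be resolved too
    return seen

print(sorted(trusted_servers("www.w3.org")))
```

With this toy table, www.w3.org ends up depending on all four servers. Replacing the w3.org entry with an in-bailiwick name such as a.ns.w3.org shrinks the set to that one server.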

I pointed out this type of problem in January 2000. At that time, these same 200 computers had control
over practically all names on the Internet, including *.com. The .com server names were subsequently
fixed to avoid the problem. Most country-code TLDs have not been fixed.

Poison
 http://cr.yp.to/djbdns/notes.html (1 of 11)1/10/2004 6:45:25 AM

RFC 1034's resolution algorithm allows any server on the Internet to destroy, or take over,
yahoo.com. All the nasty.dom server has to do is delegate www.nasty.dom to the yahoo.com servers
while providing false addresses for those servers:

          www.nasty.dom NS ns1.yahoo.com
          www.nasty.dom NS ns2.dca.yahoo.com
          www.nasty.dom NS ns3.europe.yahoo.com
          www.nasty.dom NS ns5.dcx.yahoo.com
          ns1.yahoo.com A 1.2.3.4
          ns2.dca.yahoo.com A 1.2.3.4
          ns3.europe.yahoo.com A 1.2.3.4
          ns5.dcx.yahoo.com A 1.2.3.4

The nasty.dom server can now wait for (or encourage) the cache to ask about www.nasty.dom.
When the cache receives the answer, it will, according to RFC 1034, save the forged yahoo.com
addresses for future reference. Subsequent queries for yahoo.com will be misdirected.

Cache poisoning was widely known in 1990. But it was viewed as merely a reliability issue, a result of
sloppy administration. Someone who listed munnari.oz.au as a backup server with an out-of-date IP
address would accidentally poison caches and destroy legitimate connections to munnari.oz.au.

Vixie's first BIND release, version 4.9 in 1992, featured a notion of ``credibility'' that managed to
prevent the most severe cases of accidental poisoning. From a security point of view, Vixie's
``credibility'' is garbage; it doesn't even stop the yahoo.com attack described above.

It's obvious how to eliminate all poisoning. Caches must discard yahoo.com information except from
the yahoo.com servers, the com servers, and the root servers. This stops malicious poisoning, so of
course it stops accidental poisoning too. End of problem.
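
A minimal sketch of this rule, assuming the cache remembers which zone it was asking a server about when the response arrived. The tuple layout and the function names in_bailiwick and filter_response are illustrative, not taken from any real cache.

```python
def in_bailiwick(record_name, zone):
    """True if record_name equals zone or is a subdomain of it ("" is the root)."""
    record_name = record_name.lower().rstrip(".")
    zone = zone.lower().rstrip(".")
    return zone == "" or record_name == zone or record_name.endswith("." + zone)

def filter_response(records, zone):
    """Keep only the records that a server for zone is allowed to define."""
    return [r for r in records if in_bailiwick(r[0], zone)]

# Part of the nasty.dom response shown earlier, as (name, type, data) tuples.
response = [
    ("www.nasty.dom", "NS", "ns1.yahoo.com"),
    ("ns1.yahoo.com", "A", "1.2.3.4"),
]
print(filter_response(response, "nasty.dom"))
```

The forged address for ns1.yahoo.com is dropped because the response came from a nasty.dom server. The same record would be kept if it came from a yahoo.com server, a com server, or a root server.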

BIND finally adopted this poison-elimination rule in 1997, after cache poisoning became a popular
attack tool. Did Vixie scrap his obsolete ``credibility'' rules? No! As of January 2000, they were still in
BIND 8.2.2-P5, more incoherent than ever. For example, if records had ``additional section credibility,''
and if someone sent a query asking for those records, BIND would reduce the TTL of the records by 5%.
Some of the other rules appear in RFC 2181.

I pointed out on bugtraq in January 2000 that, when a domain changed all its DNS server names (e.g.,
to switch ISPs), an attacker could trivially exploit BIND's ``credibility'' rules to break access to that
domain. I also tried to point this out on namedroppers, but my message was censored by Randy
Bush.




dnscache doesn't discriminate against additional records. Valid records are accepted whether they're
additional records in one packet or answer records in the next; timing doesn't affect the semantics.

Limited parents
RFC 1034 assumes that parent servers will list all the NS records of child servers.

In practice, however, some parents limit the number of NS records that they will list; some parents have
painful update procedures; and, for many years, the largest .com registrar pointlessly refused NS
records listing host names with IP addresses that had already been registered under different host names.

So a child server often lists more NS records than its parent. It includes the NS records along with its
answers, so that caches will replace the NS records from the parent with the NS records from the child.
If the NS records (and associated addresses) expire after the answers do, the caches will use the
complete NS list to find the new answers, and will obtain a fresh NS list at that point. The load is spread
among all the servers, though not as evenly as it would be if the parent listed more servers.

Unfortunately, BIND 8.2 won't cache the fresh NS list. After the old list expires, BIND contacts the
parent servers and again obtains the incomplete NS list.

Beware that, because of the ``credibility'' rules described above, the NS records from the child servers
must include the NS records from the parent. Otherwise an attacker can break BIND's access to the child
servers.

Gluelessness
Suppose you're a DNS cache, and you want the address of www.espn.tv. You happen to know the
address of a .tv DNS server, so you ask it for the address of www.espn.tv. ``I don't know, but I
know that .espn.tv has two DNS servers, ns-1.disney.corp and ns-2.disney.corp,'' it
says. ``Try asking them.''

So you contact ns-1.disney.corp. But what's the address of ns-1.disney.corp? You have to
put the original question on hold while you search for the address of ns-1.disney.corp. You
happen to know an address of a .corp DNS server, so you ask it for the address of
ns-1.disney.corp. ``I don't know, but I know that .disney.corp has two DNS servers, zone.espn.tv and
night.espn.tv,'' it says. ``Try asking them.''

Bottom line: You can't reach espn.tv, and you can't reach disney.corp.

If zone.espn.tv had been a DNS server for .espn.tv, the .tv server would have provided glue
for zone.espn.tv, i.e., the IP address of zone.espn.tv. So you would have been able to contact
zone.espn.tv. RFC 1034 specifically requires glue for referrals to in-bailiwick DNS servers. (Some
people use the word ``glue'' only in this case.)

For referrals to out-of-bailiwick DNS servers, however, RFC 1034 says that glue is unnecessary. RFC
1537 says the same thing. RFC 1912 says the same thing. The comp.protocols.tcp-ip.domains
FAQ says that ``you do not need a glue record, and, in fact, adding one is a very bad
idea.'' (This is an obsolete reference to accidental poisoning; see above.) Some DNS server
implementations ignore out-of-bailiwick glue by default. So the glueless domains espn.tv and
disney.corp are following the rules---yet neither of them is reachable.
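
The loop can be reproduced with a toy model. This is only a sketch: the NS table mirrors the espn.tv and disney.corp example, the glue dictionary stands for addresses the cache already holds, the depth limit of three is an assumption chosen for illustration, and 192.0.2.1 is a documentation address rather than a real one.

```python
# Hypothetical cache state: zone -> names of its DNS servers.
NS = {
    "espn.tv": ["ns-1.disney.corp", "ns-2.disney.corp"],
    "disney.corp": ["zone.espn.tv", "night.espn.tv"],
}

def zone_for(name):
    """Return the longest zone in NS that contains name, or None."""
    labels = name.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in NS:
            return candidate
    return None

def reachable(name, glue, depth=0, max_depth=3):
    """Can a server for name's zone be contacted within max_depth glueless hops?"""
    zone = zone_for(name)
    if zone is None:
        return False
    if any(server in glue for server in NS[zone]):
        return True                      # an address is already known
    if depth >= max_depth:
        return False                     # out of patience
    # Put the question on hold and look for the servers' own addresses.
    return any(reachable(s, glue, depth + 1, max_depth) for s in NS[zone])

print(reachable("www.espn.tv", glue={}))                             # False
print(reachable("www.espn.tv", glue={"zone.espn.tv": "192.0.2.1"}))  # True
```

With no glue at all the search bounces between the two zones until the depth limit stops it. A single known address anywhere in the loop is enough to break it.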

There can be trouble even when there are no loops. Suppose a BIND cache is looking up
www.espn.tv in the following situation:

          espn.tv NS ns-1.disney.corp
          espn.tv NS ns-2.disney.corp

          disney.corp NS ns-1.disney.corp
          disney.corp NS ns-2.disney.corp

When BIND sees the glueless delegation to ns-1.disney.corp, it drops the www.espn.tv query
and begins a ``sysquery'' for ns-1.disney.corp, hoping to have the ns-1.disney.corp
address cached by the time the www.espn.tv query is retried. (The BIND developers refer to this bug
as ``no query restart.'') Clients generally don't retry more than four times, so an initial query for a domain
with four levels of gluelessness will fail; an initial query for a domain with three levels of gluelessness
will be very likely to fail, and very slow if it succeeds.

``As far as I know, the Internet has not yet lost any domains to gluelessness,'' I wrote in 2000. ``But
there are an increasing number of glueless domains, and I've spotted a glueless domain with glueless
DNS servers. How much gluelessness must a cache tolerate? Currently dnscache allows three levels
of gluelessness. This seems to be enough for now, but will it be enough in the future?''

I subsequently learned about www.monty.de, which had so many levels of gluelessness that BIND
caches were completely unable to reach it:

          monty.de NS ns.norplex.net
          monty.de NS ns2.norplex.net

          norplex.net NS vserver.neptun11.de
          norplex.net NS ns1.mars11.de

          neptun11.de NS ns.germany.net
          neptun11.de NS ns2.germany.net

          mars11.de NS ns1.neptun11.de
          mars11.de NS www.gilching.de

          gilching.de NS ecrc.de
          gilching.de NS name.muenchen.roses.de

dnscache was able to find the address of www.monty.de, but it needed fourteen queries to various
servers.

I recommend that all DNS servers be in-bailiwick servers with glue. External DNS servers should be
given internal names, with address records copied automatically (preferably by some secure mechanism)
from the external names to the internal names.

DNS should have been designed with addresses, not names, in NS records and MX records. The
``additional section'' of DNS responses should have been eliminated. RFC 1035 observes correctly that
NS indirection and MX indirection ``insure [sic] consistency'' of addresses; however, this indirection
should have been handled by the server, not the client.

I have a separate page discussing A6 and DNAME from this perspective.


Expiring glue
Occasionally the address records for some DNS servers all expire from a cache, even though the servers
weren't glueless in the first place:

          aol.com NS dns-01.ns.aol.com
          aol.com NS dns-02.ns.aol.com

Usually this means that the A records accompanied the NS records but with lower TTLs, and the cache
didn't contact the servers soon enough to refresh the A records as described above. (If the cache is BIND
8.2, then the A records won't be refreshed anyway, and an attacker can force the TTLs down even if they
originally matched.)

In this situation, the RFC 1034 resolution algorithm fails. According to RFC 1034, if the cache wants the
address of yb.mx.aol.com, it looks for the ``best servers'' among ``locally-available name server
RRs,'' obtaining the names dns-01.ns.aol.com and dns-02.ns.aol.com; it then starts
``parallel resolver processes looking for the addresses'' of dns-01.ns.aol.com and
dns-02.ns.aol.com; those resolver processes look for the ``best servers,'' and so on. The cache loops until it runs
out of patience and gives up.




Fortunately, real caches use a different algorithm. dnscache starts from the roots, ignoring cached NS
records, when it reaches gluelessness level 2. BIND reportedly starts all its glue requests from the roots.

Aliases
Say a cache is looking for information on www.espn.tv. If it encounters a CNAME record for
www.espn.tv pointing to www.espn.go.com, it is supposed to start over again, looking for the same
information on www.espn.go.com. www.espn.tv is an alias for www.espn.go.com.

RFC 1034 says that an alias ``should'' not point to another alias. In reality, however, if an administrator
decides to set up www.espn.go.com as an alias for espn.go.com, he probably won't remember to
change www.espn.tv---but users will kick and scream if www.espn.tv breaks. ``CNAME chains
should be followed,'' RFC 1034 says.

Aliases, like gluelessness, force DNS clients to chew up time and memory. How many layers of aliases
must a cache tolerate? Currently dnscache allows four levels of aliases. This seems to be enough for
now, but will it be enough in the future?
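
A sketch of bounded alias following, under the assumption that the CNAME records are already available in a dictionary. The limit of four matches the figure quoted above; the function name follow_aliases is illustrative.

```python
def follow_aliases(name, cname, max_aliases=4):
    """Follow CNAME records from name; return the final name, or None if the
    chain is longer than max_aliases (which also catches alias loops)."""
    for _ in range(max_aliases + 1):
        if name not in cname:
            return name
        name = cname[name]
    return None

# The two-level chain from the example: alias -> alias -> real name.
cname = {"www.espn.tv": "www.espn.go.com", "www.espn.go.com": "espn.go.com"}
print(follow_aliases("www.espn.tv", cname))   # espn.go.com
```

A real cache would have to issue a fresh query at every step of the chain, which is where the time and memory go.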

I recommend that all CNAME records be eliminated. DNS should have been designed without aliases.

Classless in-addr.arpa delegations
Suppose an ISP has assigned IP addresses 1.2.3.100, 1.2.3.101, and 1.2.3.102 to a customer, and the
customer wants to handle reverse lookups for those addresses. The ISP can simply delegate the three
names 100.3.2.1.in-addr.arpa, 101.3.2.1.in-addr.arpa, and
102.3.2.1.in-addr.arpa to the customer's DNS server.

In practice, however, the ISP might instead use CNAME records. It makes
100.3.2.1.in-addr.arpa an alias for 100.cust37.3.2.1.in-addr.arpa, and similarly for 101 and 102; and then
it delegates cust37.3.2.1.in-addr.arpa to the customer's DNS server. This is a valid
configuration, although RFC 2317 says that some old versions of BIND can't handle it.
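
To show what the ISP publishes under this scheme, here is a sketch that prints the records in the same name/type/data notation used elsewhere in these notes. The server name ns.customer.example is a placeholder, and the function name classless_delegation is illustrative.

```python
def classless_delegation(prefix, hosts, label, server):
    """Return the CNAME and NS lines for a CNAME-based reverse delegation."""
    # prefix "1.2.3" becomes the parent zone "3.2.1.in-addr.arpa"
    parent = ".".join(reversed(prefix.split("."))) + ".in-addr.arpa"
    child = f"{label}.{parent}"
    lines = [f"{h}.{parent} CNAME {h}.{child}" for h in hosts]
    lines.append(f"{child} NS {server}")
    return lines

for line in classless_delegation("1.2.3", [100, 101, 102], "cust37",
                                 "ns.customer.example"):
    print(line)
```

The simple approach would instead emit one NS line per address and no CNAME lines at all.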

Why would an ISP want to add this extra layer of complication? Answer: With the simple approach, if
the customer is running BIND, he'll have to put the 100 and 101 and 102 records in three separate
files. With the complicated approach, the customer can put the records into a single file.

I recommend that, in this situation, the CNAME records be eliminated, and the customer upgrade to a
better DNS server.

DNS server selection
Say a cache has a query to transmit to the .com servers. It has a list of addresses of the .com servers.
Which server does it contact first?

dnscache simply contacts a random server, to balance the load as effectively as possible. BIND keeps
track of the round-trip times for its queries to each server, with various bonuses and penalties, and then
sends all its queries to the ``best'' server.
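
The difference in load can be seen in a deliberately crude sketch. It assumes fixed, made-up round-trip times and leaves out the bonuses and penalties entirely, so it illustrates only the two selection rules and does not reproduce any published simulation. All names are hypothetical.

```python
import random
from collections import Counter

def pick_random(servers, rtt, rng):
    """Uniformly random choice; the round-trip times are ignored."""
    return rng.choice(servers)

def pick_best(servers, rtt, rng):
    """Always the server with the lowest recorded round-trip time."""
    return min(servers, key=lambda s: rtt[s])

def load(strategy, servers, rtt, queries=3000, seed=0):
    """Count how many of the queries each server receives under a strategy."""
    rng = random.Random(seed)
    return Counter(strategy(servers, rtt, rng) for _ in range(queries))

servers = ["a", "b", "c"]
rtt = {"a": 0.030, "b": 0.045, "c": 0.120}   # made-up RTTs in seconds
print(load(pick_random, servers, rtt))
print(load(pick_best, servers, rtt))
```

Under the second rule every query lands on one server until its measured time gets worse, whereas the random rule spreads the queries over all three.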

Simulations show that the increasingly frequent .com server overloads (as of March 2000) could be
caused by BIND's transmission strategy.

The five types of DNS responses
When a cache receives a normal DNS response, it learns exactly one of the following five pieces of
information:

     1. ``The query was not answered because the query name is an alias. I need to change the query
        name and try again.'' This applies if the answer section of the response contains a CNAME record
        for the query name and CNAME does not match the query type.
     2. ``The query name has no records answering the query, and is also guaranteed to have no records
        of any other type.'' This applies if the response code is NXDOMAIN and #1 doesn't apply. The
        amount of time that this information can be cached depends on the contents of the SOA record in
        the authority section of the response, if there is one.
     3. ``The query name has one or more records answering the query.'' This applies if the answer
        section of the response contains one or more records under the query name matching the query
        type, and #1 doesn't apply, and #2 doesn't apply.
     4. ``The query was not answered because the server does not have the answer. I need to contact
        other servers.'' This applies if the authority section of the response contains NS records, and the
        authority section of the response does not contain SOA records, and #1 doesn't apply, and #2
        doesn't apply, and #3 doesn't apply. The ``other servers'' are named in the NS records in the
        authority section.
     5. ``The query name has no records answering the query, but it may have records of another type.''
        This applies if #1 doesn't apply, and #2 doesn't apply, and #3 doesn't apply, and #4 doesn't apply.
        The amount of time that this information can be cached depends on the contents of the SOA
        record in the authority section, if there is one.
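
The five cases can be sketched as a single function. This assumes the response has already been parsed into a dictionary with an rcode string and answer and authority lists of (name, type, data) tuples, a layout invented here for illustration. It also ignores the special matching rules for * queries.

```python
def classify(qname, qtype, response):
    """Return which of the five cases (1 to 5) a parsed response falls into."""
    answer = response.get("answer", [])
    authority = response.get("authority", [])
    qname = qname.lower()

    # 1: alias for the query name, and the query did not ask for CNAME itself.
    if qtype != "CNAME" and any(
        n.lower() == qname and t == "CNAME" for n, t, _ in answer
    ):
        return 1
    # 2: NXDOMAIN.
    if response.get("rcode") == "NXDOMAIN":
        return 2
    # 3: at least one record under the query name matching the query type.
    if any(n.lower() == qname and t == qtype for n, t, _ in answer):
        return 3
    # 4: referral, i.e. NS records but no SOA record in the authority section.
    types = {t for _, t, _ in authority}
    if "NS" in types and "SOA" not in types:
        return 4
    # 5: no records of this type, but possibly records of another type.
    return 5

referral = {"rcode": "NOERROR",
            "authority": [("espn.tv", "NS", "ns-1.disney.corp")]}
print(classify("www.espn.tv", "A", referral))   # 4
```

The order of the tests matters: each case is defined partly by the earlier cases not applying.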

This procedure requires an incredible amount of bug-prone parsing for a very small amount of
information. The underlying problem is that DNS was designed to declare information in a
human-oriented format, rather than to support crucial operations in the simplest possible way.

Warning about NXDOMAIN: It is clear from RFC 1034 and RFC 1035 that an NXDOMAIN guarantees
the nonexistence of every subdomain of the query domain. For example, if a cache sees an NXDOMAIN
for ns.heaven.af.mil, it can conclude that a.ns.heaven.af.mil and b.ns.heaven.af.mil
don't exist. If a server has records for a.ns.heaven.af.mil and b.ns.heaven.af.mil,
but no records for ns.heaven.af.mil, it sends a zero-records (#5) response, not an NXDOMAIN.
However, RFC 2308 allows NXDOMAIN even when the domain exists, to indicate that there are no
records of any type under the query name. So it is essential for interoperability that caches not draw the
above conclusion.

Truncation
DNS packets 512 bytes or smaller can be transmitted through UDP. DNS packets 65535 bytes or smaller
can be transmitted through TCP.

DNS clients and DNS caches begin by transmitting a query (which always fits into 512 bytes) through
UDP. The response is sent back through UDP. If the response does not fit into a UDP packet, it is
truncated, and the TC bit is set at the beginning of the UDP packet. Clients and caches that support TCP
see the TC bit and retry their query through TCP.

RFC 1035 does not make clear exactly what ``truncated'' means. The obvious interpretation is to end the
packet at exactly 512 bytes. However, this causes interoperability problems: in particular, the Squid
cache dies if a packet is truncated between records. BIND ends the packet before the first record that
went past 512 bytes. dnscache ends the packet before all records.
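
The three interpretations can be compared in a sketch that works on already-encoded pieces of a response. It is a simplification: the record counts in the header are not adjusted, and the style names are invented here. "whole-records" corresponds to the behavior attributed to BIND above and "no-records" to the behavior attributed to dnscache.

```python
LIMIT = 512  # maximum size of a DNS packet sent through UDP

def set_tc(header):
    """Return a copy of the 12-byte header with the TC bit (0x02 in byte 2) set."""
    return header[:2] + bytes([header[2] | 0x02]) + header[3:]

def truncate(header, question, records, style):
    """Assemble a UDP response from pre-encoded parts, truncating if needed."""
    full = header + question + b"".join(records)
    if len(full) <= LIMIT:
        return full
    if style == "exact":            # cut at exactly 512 bytes, even mid-record
        return (set_tc(header) + question + b"".join(records))[:LIMIT]
    if style == "whole-records":    # stop before the first record past the limit
        out = set_tc(header) + question
        for rec in records:
            if len(out) + len(rec) > LIMIT:
                break
            out += rec
        return out
    if style == "no-records":       # drop every record
        return set_tc(header) + question
    raise ValueError(style)

# Dummy data: a 12-byte header, a 20-byte question, six 100-byte records.
header, question, records = bytes(12), bytes(20), [bytes(100)] * 6
for style in ("exact", "whole-records", "no-records"):
    print(style, len(truncate(header, question, records, style)))
```

With this dummy data the three styles produce packets of 512, 432, and 32 bytes, each with the TC bit set.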

Compression
DNS packets use an ad-hoc compression method in which portions of domain names can sometimes be
replaced with two-byte pointers to previous domain names. The precise rule is that a name can be
compressed if it is a response owner name, the name in NS data, the name in CNAME data, the name in
PTR data, the name in MX data, or one of the names in SOA data.

One problem with DNS compression is the amount of code required to parse it. Reliably locating all
these names takes quite a bit of work that would otherwise have been unnecessary for a DNS cache.
LZ77 compression would have been much easier to implement.
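
For a sense of what the parsing involves, here is a sketch of decoding a single name, following the label and pointer layout of RFC 1035. The function name read_name, the limit on pointer hops, and the test packet (which has no header, so offsets start at zero) are assumptions made for illustration. A cache also has to know where every compressible name sits inside each record type before it can call a function like this.

```python
def read_name(packet, offset):
    """Decode the domain name at offset; return (name, offset after the name)."""
    labels = []
    end = None          # where parsing resumes after the first pointer
    hops = 0
    while True:
        length = packet[offset]
        if length & 0xC0 == 0xC0:                 # two-byte pointer
            if end is None:
                end = offset + 2
            offset = ((length & 0x3F) << 8) | packet[offset + 1]
            hops += 1
            if hops > 127:                        # guard against pointer loops
                raise ValueError("too many compression pointers")
        elif length & 0xC0:
            raise ValueError("unsupported label type")
        elif length == 0:                         # root label ends the name
            offset += 1
            break
        else:
            labels.append(packet[offset + 1 : offset + 1 + length])
            offset += 1 + length
    name = b".".join(labels).decode("ascii")      # assumes ASCII labels
    return name, (end if end is not None else offset)

# "www.espn.tv" in full, then "zone" followed by a pointer to offset 4 ("espn.tv").
packet = b"\x03www\x04espn\x02tv\x00" + b"\x04zone\xc0\x04"
print(read_name(packet, 0))    # ('www.espn.tv', 13)
print(read_name(packet, 13))   # ('zone.espn.tv', 20)
```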

Another problem with DNS compression is the amount of code required to correctly generate it. (RFC
1035 allowed servers to not bother compressing their responses; however, caches have to implement
compression, so that address lists from some well-known sites don't burst the seams of a DNS UDP
packet.) Not only does the compressor need to figure out which names can be compressed, but it also
needs to keep track of compression targets earlier in the packet. RFC 1035 doesn't make clear exactly
what targets are allowed. (Most versions of BIND do not use pointers except to compressible names;
suffixes of the query name are excluded. dnscache uses pointers to suffixes of the query name.)




Another problem with DNS compression is that it's not particularly effective. LZ77 would have done a
noticeably better job on current data, and a much better job on new record types that might become
popular in the future. (BIND versions 4.9.* through 8.1.2 compress names in new record types, such as
RP and SRV, in blatant violation of RFC 1035. The names are not decompressed by caches that do not
know about the new types. This is an interoperability disaster.)

Case independence
Once upon a time, for reasons that no longer matter, hostnames were often typed in uppercase. One user
would type IBM.COM, and another user would type ibm.com, and both of them would expect to find
the same host.

Experienced programmers stored hostnames in lowercase, and converted uppercase to lowercase as part
of the user interface. Hostname comparisons were simple binary comparisons.
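
A sketch of that approach, assuming hostnames are handled as byte strings. Only the ASCII letters A to Z are folded, so any other byte passes through unchanged. The function names are illustrative.

```python
def normalize(name):
    """Lowercase ASCII letters A-Z only; every other byte is left untouched."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in name)

def same_host(a, b):
    """Compare two hostnames after normalizing at the user-interface boundary."""
    return normalize(a) == normalize(b)

print(same_host(b"IBM.COM", b"ibm.com"))   # True
```

Once names are normalized on input, every later comparison is a plain byte comparison.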

DNS, however, was not designed by experienced programmers. DNS clients send hostnames exactly as
typed by the user, without converting uppercase to lowercase. DNS servers send some hostnames as
typed by the system administrator, without converting uppercase to lowercase. All implementors are
forced to waste time worrying about case.

The DNS protocol allows arbitrary bytes in hostnames. This flexibility would have been convenient for
several applications, notably in-addr.arpa, if the designers hadn't screwed up their case handling.
As is, binary names in DNS are practically useless.

Record sets
The list of mail exchangers for a domain is an indivisible unit; if it is truncated, mail can bounce. Other
lists, such as the list of DNS servers or the list of addresses, are also indivisible units, although the
effects of truncation are much less severe.

Unfortunately, in DNS packets, the list of mail exchangers is divided into separate MX records. The MX
records can even be (and, in responses to * queries, often are) interleaved with other records. A cache
has to sort the list of records, preferably using a method that isn't painfully slow for large packets, and
partition the result into complete record sets.
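
A minimal sketch of the sort-and-partition step, assuming the records have already been decoded into (name, type, data) tuples. The names and the 192.0.2.1 address are placeholders.

```python
from itertools import groupby

def set_key(record):
    """Records belong to the same set when name (case-folded) and type match."""
    return (record[0].lower(), record[1])

def record_sets(records):
    """Group records into complete sets keyed by (name, type)."""
    ordered = sorted(records, key=set_key)        # O(n log n), stable
    return {k: [r[2] for r in group] for k, group in groupby(ordered, key=set_key)}

# MX records interleaved with an A record, as can happen in a * response.
records = [
    ("example.com", "MX", "10 a.mx.example.com"),
    ("example.com", "A", "192.0.2.1"),
    ("example.com", "MX", "20 b.mx.example.com"),
]
print(record_sets(records))
```

Sorting first keeps the cost at O(n log n) for large packets, instead of comparing every record against every other record.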

Classes
Each DNS record is in a ``class.'' DNS allows 65536 different classes. In theory, a name can have
several NS records in different classes, delegating the same domain to different servers in different
classes.




Queries ask for records in a particular class. RFC 1034 allows queries to ask for records in all classes,
but this makes no sense: if multiple classes were actually used then they would almost never be on the
same server. The client knows what class it's looking for, so it can specify a class; RFC 1123 section
6.1.2.2 requires this for the Internet class and recommends it in all cases.

RFC 1034 says that classes ``allow parallel use of different formats for data of type address.'' This
doesn't make sense. If DNS is used in a network with multiple address formats, then one DNS server
will want to provide addresses in more than one format; but that DNS server is only in charge of one
class. Address format extensibility should have been provided in the address data itself.

dnscache discards queries for non-Internet classes.

Miscellaneous implementation bugs
According to RFC 2308, some clients incorrectly treat an NXDOMAIN or no-records response as a
referral if there are NS records in the authority section, and some clients incorrectly discard
NXDOMAIN responses without the AA bit.

The Ultrix version of BIND sends queries with AD+CD set.

A client will receive a bogus response from a BIND cache if the client asks about X, the cache already
knows X CNAME Y, and the cache has to ask a server about Y. The cache will forward the server's Y
response, with Y as the query, to the client. This bug was fixed in BIND 8.2.3.

There is at least one server that incorrectly produces NXDOMAIN for all non-A queries, even for
domains that exist:

          % date
          Sat Nov 2 15:45:22 CST 2002
          % dnsq any www.css.vtext.com njbdcss.vtext.com
          255 www.css.vtext.com:
          35 bytes, 1+0+0+0 records, response, nxdomain
          query: 255 www.css.vtext.com
          % dnsq aaaa www.css.vtext.com njbdcss.vtext.com
          28 www.css.vtext.com:
          35 bytes, 1+0+0+0 records, response, nxdomain
          query: 28 www.css.vtext.com
          % dnsq a www.css.vtext.com njbdcss.vtext.com
          1 www.css.vtext.com:
          51 bytes, 1+1+0+0 records, response, authoritative, noerror
          query: 1 www.css.vtext.com
          answer: www.css.vtext.com 0 A 66.174.3.10
          %

If a client looks up AAAA before A, the NXDOMAIN will be cached, so the A query will fail.



