



Efficient and Dynamic Routing Topology Inference From End-to-End Measurements

Jian Ni, Member, IEEE, Haiyong Xie, Member, IEEE, Sekhar Tatikonda, Member, IEEE, and Yang Richard Yang, Member, IEEE

  An earlier version of this paper was presented at the 27th IEEE Conference on Computer Communications (INFOCOM), Phoenix, Arizona, April 2008.
  J. Ni is with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: jianni@illinois.edu).
  H. Xie is with Akamai Technologies, 3125 Clearview Way, San Mateo, CA 94402 USA (e-mail: hxie@akamai.com).
  S. Tatikonda is with the Department of Electrical Engineering, Yale University, New Haven, CT 06520 USA (e-mail: sekhar.tatikonda@yale.edu).
  Y. R. Yang is with the Department of Computer Science, Yale University, New Haven, CT 06520 USA (e-mail: yry@cs.yale.edu).

   Abstract—Inferring the routing topology and link performance from a node to a set of other nodes is an important component in network monitoring and application design. In this paper we propose a general framework for designing topology inference algorithms based on additive metrics. The framework can flexibly fuse information from multiple measurements to achieve better estimation accuracy. We develop computationally efficient (polynomial-time) topology inference algorithms based on the framework. We prove that the probability of correct topology inference of our algorithms converges to one exponentially fast in the number of probing packets. In particular, for applications where nodes may join or leave frequently, such as overlay network construction, application-layer multicast, and peer-to-peer file sharing/streaming, we propose a novel sequential topology inference algorithm which significantly reduces the probing overhead and can efficiently handle node dynamics. We demonstrate the effectiveness of the proposed inference algorithms via Internet experiments.

   Index Terms—Routing topology inference, network tomography, network measurement, network monitoring.

I. INTRODUCTION

   Developing a scalable tool to infer the routing topology and link performance from a node to a set of other nodes is an important challenge. In network monitoring, this tool can help a network operator obtain routing information and network internal characteristics (e.g., loss rate, delay, utilization) from its network to a set of other collaborating networks that are separated by non-participating autonomous networks. In application design, this tool can be particularly useful for peer-to-peer (P2P) style applications where a node communicates with a set of other nodes (called peers) for file sharing and multimedia streaming. For example, a node may want to know the routing topology to other nodes so that it can select peers with low or no route overlap to improve resilience against network failures [2]. As another example, a streaming node using multi-path may want to know both the routing topology and link loss rates so the selected paths have low loss correlation [3].
   So far there are two primary approaches to infer the routing topology and link performance in a communication network. Both have their limitations. One approach is to use tools based on measurements or feedback messages of the internal nodes (e.g., routers). Such an approach is limited as today's communication networks are evolving towards more decentralized and private administration. For example, a common approach to obtain the routing topology from a source node to a destination node in the Internet is to use traceroute. Traceroute relies on internal routers responding to traceroute requests and returning ICMP (Internet Control Message Protocol) messages. However, an increasing number of routers in the Internet today will block traceroute requests due to privacy and security concerns. These routers are known as anonymous routers [30] and their existence makes the routing topology inferred by traceroute-like tools inaccurate. Furthermore, traceroute-like tools cannot discover layer-2 switches and MPLS (Multiprotocol Label Switching) paths that are increasingly being deployed.
   The other approach, known as network tomography, utilizes end-to-end packet probing measurements (such as packet loss and delay measurements) conducted by the end hosts and does not require extra cooperation from the internal nodes (except the basic packet forwarding functionality). Under a network tomography approach, a source node will send probes to a set of destination nodes. The basic idea is to utilize the correlations among the observed losses and delays of the probes at the destination nodes to infer the network structure and internal characteristics. Due to its flexibility and reliability, network tomography has attracted many recent studies [8], [12]. Many previous network tomography studies are based on multicast probing because of its effectiveness and probing efficiency (e.g., [7], [14], [15], [16], [20], [22], [23]). Since IP multicast is not widely deployed in the current Internet, unicast network tomography approaches based on back-to-back unicast packet pairs or strings have also been investigated (e.g., [4], [11], [13], [17], [25], [28]).
   Two fundamental challenges of network tomography approaches are computational complexity and probing scalability (especially under unicast probing). These limit the number of destination nodes that a source node can infer. In addition, the focus of previous studies is on a relatively stable set of nodes, while in many applications and networks (e.g., overlay network construction, application-layer multicast, P2P file sharing and streaming, wireless ad-hoc and sensor networks) nodes may join or leave a session frequently [27]. To handle node dynamics efficiently we need fast and scalable inference procedures/algorithms which have low computational complexity, fast convergence rate, and small probing overhead.
   In this paper we study the problem of inferring the network
routing topology from a source node to a set of destination nodes¹, where the set can be dynamic. We summarize our contributions as follows.
   • We present a general framework for designing network routing topology inference algorithms based on additive metrics. We show how to construct additive metrics and estimate the (shared) path lengths using end-to-end multicast and unicast packet probing measurements as well as traceroute type measurements. The framework can flexibly fuse information available from multiple measurements to achieve better estimation accuracy and faster convergence rate.
   • Based on the framework we develop two computationally efficient (polynomial-time) topology inference algorithms. In particular, we propose a novel sequential topology inference algorithm which significantly reduces the probing overhead under unicast probing. In addition, it can efficiently handle dynamic node joining and leaving, and thus is particularly desirable for applications and networks where node dynamics are prevalent.
   • Under some assumptions we prove that the probability of correct topology inference of our algorithms converges to one exponentially fast in the sample size (number of probing packets). We also demonstrate the effectiveness of our algorithms via Internet experiments. For the most effective inference algorithm (a hybrid scheme which incorporates both network tomography measurements and traceroute measurements), the inferred topology is approximately 100% correct when no more than 20% of the internal routers do not respond to traceroute probing. It can still correctly identify approximately 50% of the internal nodes, solely from network tomography measurements, even when none of the internal routers respond to traceroute probing.
   We organize the paper as follows. In Section II we review some related work. In Section III we introduce the network model and the inference problems. In Section IV we describe how to construct additive metrics and estimate the (shared) path lengths from end-to-end measurements. In Sections V and VI we propose and analyze a neighbor-joining based topology inference algorithm and a sequential topology inference algorithm which can be applied to any additive metric. We design Internet routing tree topology inference schemes and evaluate their performance via Internet experiments in Section VII. The paper is concluded in Section VIII.

II. RELATED WORK

   Multicast routing tree topology inference was studied in [14]-[16], [23]. Ratnasamy et al. [23] proposed a grouping algorithm to infer the tree topology based on shared losses observed at the destination nodes. Duffield et al. [16] extended the grouping algorithm and also proposed a maximum-likelihood approach and a Bayesian approach to estimate the tree topology. They further generalized the grouping algorithm to any estimable and monotonic performance metrics [14]. Similar generalization was made in [4]. The RNJ algorithm proposed in this paper is also a grouping type algorithm which recovers the tree topology by recursively joining the neighbors on the tree. This agglomerative joining/grouping idea has been used in clustering for building cluster trees (e.g., [19]) and in evolutionary biology for building phylogenetic trees (e.g., [18], [24]).
   Unicast routing tree topology inference was studied in [9], [13], [25]. Coates et al. [13] introduced a sandwich probing technique to conduct delay measurements and proposed a Markov Chain Monte Carlo (MCMC) procedure to search for the most likely tree topologies. Castro et al. [9] and Shih et al. [25] formulated the inference problem as a hierarchical clustering problem and developed several hierarchical clustering algorithms to recover the tree topology.
   The limitations of existing topology inference algorithms are summarized in the beginning of Section VI. To the best of our knowledge, the sequential topology inference algorithm proposed in this paper is a first effort to address the issues of node dynamics and probing scalability for network routing topology inference.

III. NETWORK MODEL AND INFERENCE PROBLEMS

   Let G = (V, E) denote the topology of the network, which is a directed graph with node set V (end hosts, internal switches and routers, etc.) and link set E (communication links that join the nodes). For any nodes i and j in the network, if the underlying routing algorithm returns a sequence of links that connect j to i, we say j is reachable from i. We assume that during the measurement period, the underlying routing algorithm determines a unique path from a node to another node that is reachable from it. Hence the physical routing topology from a source node to a set of (reachable) destination nodes is a (directed) tree.
   From the physical routing topology, we can derive a logical routing tree which consists of the source node, the destination nodes, and the branching nodes (internal nodes with at least two outgoing links) of the physical routing tree [16], [23]. A logical link may comprise more than one consecutive physical link, and the degree of an internal node on the logical routing tree is at least three. An example is shown in Fig. 1. In this paper we consider topology inference of logical routing trees and we use the routing tree to express the logical routing tree for simplicity.
   Suppose s is a source node in the network, and D is a set of destination nodes that are reachable from s. Let T(s, D) = (V, E) denote the routing tree from s to nodes in D, with node set V and link set E. Let U = s ∪ D be the set of terminal nodes (e.g., end hosts), which are nodes of degree one.
   Every node k ∈ V has a parent f(k) ∈ V and a set of children c(k) = {j ∈ V : f(j) = k}, except that the source node (root of the tree) has no parent and the destination nodes (leaves of the tree) have no children. For notational simplification, we also use e_k to denote link (f(k), k). We use P(i, j) to denote the sequence of links that connect j to i on the routing tree.
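The tree notation above can be made concrete with a small sketch. It is not taken from the paper: the node names, the dictionary-based parent map, and the helpers `children_of`, `path_links`, and `logical_tree` are hypothetical, and the sketch assumes that every leaf of the physical tree is a destination node.

```python
from collections import defaultdict


def children_of(parent):
    """Build c(k) = {j : f(j) = k} from the parent map f (node -> parent)."""
    kids = defaultdict(list)
    for node, par in parent.items():
        kids[par].append(node)
    return dict(kids)


def path_links(parent, i, j):
    """Return P(i, j) as a list of links (f(k), k) leading from i down to j.

    Assumes i is an ancestor of j (or i == j); raises ValueError otherwise.
    """
    links = []
    node = j
    while node != i:
        if node not in parent:  # walked past the root without meeting i
            raise ValueError(f"{j!r} is not reachable from {i!r}")
        links.append((parent[node], node))
        node = parent[node]
    links.reverse()
    return links


def logical_tree(parent, source, destinations):
    """Collapse a physical routing tree into its logical routing tree.

    Keeps the source, the destinations, and the branching nodes (at least
    two children); each chain of single-child internal nodes is merged
    into one logical link.
    """
    kids = children_of(parent)
    keep = {source, *destinations}
    keep |= {k for k, c in kids.items() if len(c) >= 2}
    logical = {}
    for node in keep - {source}:
        anc = parent[node]
        while anc not in keep:  # skip non-branching routers
            anc = parent[anc]
        logical[node] = anc
    return logical


# Hypothetical physical tree: r0, r1, r2 are routers with a single child.
PHYSICAL = {
    "r0": "s", "n1": "r0", "n2": "n1", "r1": "n1", "n3": "r1",
    "d4": "n2", "r2": "n2", "d5": "r2", "d6": "n3", "d7": "n3",
}
LOGICAL = logical_tree(PHYSICAL, "s", ["d4", "d5", "d6", "d7"])
```

In this toy example the logical link (n2, d5) stands for the two physical links (n2, r2) and (r2, d5), so `path_links(LOGICAL, "s", "d5")` is shorter than `path_links(PHYSICAL, "s", "d5")` although both describe the same route.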
  ¹ We use destination nodes for simplicity, while the nodes can be relay nodes or peer nodes of the source node in real applications.

   Each link e ∈ E is associated with a parameter θ_e (e.g., success rate, delay distribution, utilization, etc.). The network
inference problems involve using measurements taken at the terminal nodes to infer:
(1) the topology of the (logical) routing tree;
(2) link parameters θ_e of the links on the routing tree.
   In this paper we focus on routing tree topology inference. Link parameter estimation with known routing tree topology was studied in [7], [11], [21], [22], [28].

[Figure 1: (a) The physical routing topology, showing the source, routers, and destinations. (b) The logical routing tree, with source node s, internal nodes 1, 2, 3, and destination nodes D = {4, 5, 6, 7}.]
Fig. 1. The physical routing topology and the associated logical routing tree with a single source node and multiple destination nodes.

A. Probing Model
   A source node can employ different probing techniques to send probes (packets) to a set of destination nodes. Under multicast probing, when an internal node on the routing tree receives a packet from its parent, it will send a copy of the packet to all its children on the tree. Hence the packets of the same probe received by different destination nodes have exactly the same network experience (loss, delay, etc.) in the shared links.
   Under unicast probing, the source node sends a string of back-to-back unicast packets to the destination nodes, one packet for each destination node respectively (to mimic the transmission of a multicast probe). We call it a 1 × k packet string probing if the string size (i.e., number of probed destination nodes) is k. Since the back-to-back packets are very close to each other, it is normally assumed that these packets have the same network experience in the shared links just like a multicast probe. We will relax this assumption in Section IV-B.
   For a probe sent by source node s to the destination nodes in D, we define a set of link state variables Z_e for all links e ∈ E on the routing tree T(s, D). Z_e takes value in a set Z. The distribution of Z_e is parameterized by θ_e, e.g., P(Z_e = z) = θ_e(z), ∀z ∈ Z.
   The transmission of a probe from s to nodes in D will induce a set of outcome variables on the routing tree. For each node k ∈ V, we use X_k to denote the (random) outcome of the probe at node k. X_k takes value in a set X. By causality the outcome of the probe at node k (i.e., X_k) is determined by the outcome of the probe at node k's parent f(k) (i.e., X_{f(k)}) and the state of link e_k = (f(k), k) (i.e., Z_{e_k}):

        X_k = g(X_{f(k)}, Z_{e_k}).    (1)

   In network tomography studies it is normally assumed that the link states are independent from link to link (spatial independence) and are stationary during the measurement period [7], [12]. (Note that these assumptions may not hold in real networks like the Internet. We develop a hybrid scheme in Section VII to improve the estimation accuracy of the pure network tomography scheme for Internet routing tree topology inference.) Under those assumptions, we can show that the outcome variables X_k induced by the transmission of a probe form a Markov random field (MRF) on the routing tree [20]. Specifically, for each node k ∈ V, the conditional distribution of X_k given other random variables (X_j : j ≠ k) on T(s, D) is the same as the conditional distribution of X_k given just its neighboring random variables (X_j : j ∈ f(k) ∪ c(k)) on T(s, D). For MRFs on trees, under mild conditions, the tree topology and the link parameters can be identified (uniquely determined) by the joint distributions of the outcome variables at pairs and triples of the terminal nodes on the tree [10], [20].
   In actual network inference problems, however, the joint distributions of the outcome variables at the terminal nodes are not given. We can estimate the joint distributions based on measurements taken at the terminal nodes. Specifically, the source node will send a sequence of n probes, and there are in total n outcomes X_V^{(t)} = (X_k^{(t)} : k ∈ V), t = 1, 2, ..., n, one for each probe. For the t-th probe, only the outcomes X_U^{(t)} = (X_k^{(t)} : k ∈ U = s ∪ D) at the terminal nodes can be measured and observed. We can estimate the joint distributions of the outcome variables at the terminal nodes using the empirical distributions, which will converge to the actual stationary distributions almost surely if the link state processes are stationary and ergodic during the measurement period.

B. Network Tomography Examples
   Example 1: Link Loss Inference [7]. The link state variable Z_e is a Bernoulli random variable which takes value 1 with probability α_e if the probe can go through link e, and takes value 0 with probability 1 − α_e ≜ ᾱ_e if the probe is lost on the link. α_e is called the success rate or packet delivery rate of link e, and ᾱ_e is called the loss rate of link e. The outcome variable L_k is also a Bernoulli random variable, which takes value 1 if the probe successfully reaches node k. For this example we have (L_s ≡ 1)

        L_k = L_{f(k)} \cdot Z_{e_k} = \prod_{e \in \mathcal{P}(s,k)} Z_e.    (2)

   Example 2: Link Utilization Inference [15]. The link state variable Z_e is a Bernoulli random variable which takes value 1 with probability γ_e if the probe does not experience any queueing delay on link e, and takes value 0 with probability 1 − γ_e ≜ γ̄_e otherwise. γ̄_e can be viewed as the utilization of link e. The outcome variable U_k is also a Bernoulli random
variable, which takes value 1 if the packet reaches node k with no queueing delay. For this example we also have (U_s ≡ 1)

        U_k = U_{f(k)} \cdot Z_{e_k} = \prod_{e \in \mathcal{P}(s,k)} Z_e.    (3)

   Example 3: Link Delay Inference [22]. The link state variable Z_e is a random variable denoting the random (queueing) delay of link e. θ_e can be a certain moment of Z_e, e.g., θ_e = var(Z_e); or the distribution of Z_e is parameterized by θ_e, e.g., θ_e(i) = P(Z_e = i), i ∈ Z. The outcome variable T_k is the cumulative (queueing) delay experienced by the probe from s to node k. For link delay inference we have (T_s ≡ 0)

        T_k = T_{f(k)} + Z_{e_k} = \sum_{e \in \mathcal{P}(s,k)} Z_e.    (4)

IV. CONSTRUCT ADDITIVE METRICS

   Let T(s, D) = (V, E) be a routing tree with source node s and destination nodes D. We say d is an additive metric on T(s, D) if

        (a)  0 < d(e) < \infty,  \forall e \in E;
        (b)  d(i, j) = \sum_{e \in \mathcal{P}(i,j)} d(e),  \forall i, j \in V.

   d(e) can be viewed as the length of link e and d(i, j) can be viewed as the distance between nodes i and j. Recall that U = s ∪ D is the set of terminal nodes on the tree. We use d(U^2) = {d(i, j) : i, j ∈ U} to denote the distances between the terminal nodes. It is known that the topology and link lengths of a tree are uniquely determined by the distances between the terminal nodes under an additive metric [6].
   Suppose the source node s is fixed. For any destination node i ∈ D, let ρ(i) = d(s, i) denote the path length from s to i (under additive metric d).
   For any pair of destination nodes i, j ∈ D, let ij denote their nearest common ancestor on T(s, D) (i.e., the ancestor of both nodes i and j that is closest to i and j on the routing tree). For example, in Fig. 1(b), the nearest common ancestor of destination nodes 4 and 5 is node 2, and the nearest common ancestor of destination nodes 4 and 6 is node 1. Let ρ(i, j) = d(s, ij) denote the shared path length from s to i and j (i.e., the distance between s and the nearest common ancestor of i and j).
   Let ρ(s, D) = {ρ(i) : i ∈ D} denote the path lengths from s to nodes in D, and let ρ(s, D^2) = {ρ(i, j) : i, j ∈ D} denote the shared path lengths from s to pairs of nodes in D. Note that there is a one-to-one mapping between d(U^2) and ρ(s, D) ∪ ρ(s, D^2): for i, j ∈ D, d(s, i) = ρ(i) and d(i, j) = ρ(i) + ρ(j) − 2ρ(i, j). We can recover the topology of the routing tree if we know either d(U^2) or ρ(s, D) ∪ ρ(s, D^2). The key step is to construct an additive metric for which we can derive/estimate d(U^2) or ρ(s, D) ∪ ρ(s, D^2) from end-to-end measurements.

A. Additive Metrics Based on Multicast Probing
   1) Loss-Based Additive Metric: For Example 1 in Section III-B, if 0 < α_e < 1, ∀e, then we can construct an additive metric d_l with link length

        d_l(e) = -\log \alpha_e,  \forall e \in E.    (5)

   Under the spatial independence assumption that the link states are independent from link to link, ρ_l(s, D) ∪ ρ_l(s, D^2) can be obtained by

        \rho_l(i) = -\log P(L_i = 1),  i \in D;
        \rho_l(i, j) = -\log \frac{P(L_i = 1) P(L_j = 1)}{P(L_i L_j = 1)},  i, j \in D.    (6)

   2) Utilization-Based Additive Metric: Similarly for Example 2 in Section III-B, if 0 < γ_e < 1, ∀e, then we can construct an additive metric d_u with link length

        d_u(e) = -\log \gamma_e,  \forall e \in E.    (7)

   ρ_u(s, D) ∪ ρ_u(s, D^2) can be obtained by

        \rho_u(i) = -\log P(U_i = 1),  i \in D;
        \rho_u(i, j) = -\log \frac{P(U_i = 1) P(U_j = 1)}{P(U_i U_j = 1)},  i, j \in D.    (8)

   3) Delay-Based Additive Metric: For Example 3 in Section III-B, if 0 < var(Z_e) < ∞, ∀e, then we can construct an additive metric d_v with link length

        d_v(e) = \mathrm{var}(Z_e),  \forall e \in E.    (9)

   ρ_v(s, D) ∪ ρ_v(s, D^2) can be obtained by

        \rho_v(i) = \mathrm{var}(T_i),  i \in D;
        \rho_v(i, j) = \mathrm{cov}(T_i, T_j),  i, j \in D.    (10)

   As in (6), (8), (10), if we know the pairwise joint distributions of the outcome variables at the terminal nodes, then we can construct an additive metric and derive ρ(s, D) ∪ ρ(s, D^2). In actual network inference problems we are not given such distributions. We can use measurements taken at the terminal nodes to estimate the distributions.
   Let s send a sequence of n probes to (a subset of) destination nodes in D. For any probed node i, let T_i^{(t)} be the measured (one-way) delay of the t-th probe from s to i, where T_i^{(t)} = ∞ means that i does not receive the t-th probe. We use T_i^{min} = min_t T_i^{(t)} to approximate the propagation delay from s to i.
   The loss outcomes can be derived from the delay measurements as follows:

        L_i^{(t)} = \begin{cases} 1, & T_i^{(t)} < \infty, \\ 0, & T_i^{(t)} = \infty. \end{cases}

   As in [15], the utilization outcomes can be derived from the delay measurements as follows:

        U_i^{(t)} = \begin{cases} 1, & T_i^{(t)} - T_i^{\min} \le \epsilon, \\ 0, & T_i^{(t)} - T_i^{\min} > \epsilon, \end{cases}

where ε is a small value, e.g., 0.1 ms, to account for possible measurement error.
   We can construct explicit estimators for the path lengths and shared path lengths in (6), (8) as follows:

        \hat{\rho}_l(i) = -\log \bar{L}_i,  \hat{\rho}_l(i, j) = -\log \frac{\bar{L}_i \bar{L}_j}{\bar{L}_{ij}};    (11)
        \hat{\rho}_u(i) = -\log \bar{U}_i,  \hat{\rho}_u(i, j) = -\log \frac{\bar{U}_i \bar{U}_j}{\bar{U}_{ij}};    (12)
where
$$\bar{L}_i = \frac{1}{n} \sum_{t=1}^{n} L_i^{(t)}, \quad \bar{L}_{ij} = \frac{1}{n} \sum_{t=1}^{n} L_i^{(t)} L_j^{(t)};$$
$$\bar{U}_i = \frac{1}{n} \sum_{t=1}^{n} U_i^{(t)}, \quad \bar{U}_{ij} = \frac{1}{n} \sum_{t=1}^{n} U_i^{(t)} U_j^{(t)}.$$
   Similarly, we can construct explicit estimators for the path lengths and shared path lengths in (10) using sample variances and sample covariances:
$$\hat{\rho}_v(i) = \widehat{\mathrm{var}}(T_i), \quad \hat{\rho}_v(i, j) = \widehat{\mathrm{cov}}(T_i, T_j); \tag{13}$$
where
$$\widehat{\mathrm{var}}(T_i) = \frac{1}{n-1} \sum_{t=1}^{n} \left( T_i^{(t)} - \bar{T}_i \right)^2,$$
$$\widehat{\mathrm{cov}}(T_i, T_j) = \frac{1}{n-1} \sum_{t=1}^{n} \left( T_i^{(t)} - \bar{T}_i \right) \left( T_j^{(t)} - \bar{T}_j \right),$$
$$\bar{T}_i = \frac{1}{n} \sum_{t=1}^{n} T_i^{(t)}.$$
In the above equations we assume $T_i^{(t)}, T_j^{(t)} < \infty$ (i.e., there is no packet loss). Lost packets are not counted in the computation.
   Notice that possible clock asynchronization between the source node and the destination nodes will not affect the estimators in (11), (12), (13).
   A convex combination of several additive metrics is still an additive metric. In order to fuse information from multiple measurements, we can construct a new additive metric using a convex combination of $d_l$, $d_u$, $d_v$: $d_t = a_l d_l + a_u d_u + a_v d_v$ with $a_l + a_u + a_v = 1$. The (estimated) path lengths and shared path lengths under the new additive metric can be easily computed: $\hat{\rho}_t = a_l \hat{\rho}_l + a_u \hat{\rho}_u + a_v \hat{\rho}_v$. In practice we can select the coefficients based on the current network state or to minimize the variance of the new estimator $\hat{\rho}_t$.

B. Additive Metrics Based on Unicast Packet Pair Probing
   The validity of (6), (8), (10) depends on the fact that the packets of the same multicast probe received by different destination nodes have the same network experience (loss, delay, etc.) in the shared links, which may not hold for a unicast packet pair/string probe. Can we still construct additive metrics from unicast probing? The answer is yes, if the packets are positively correlated (not necessarily perfectly correlated) in the shared links.
   Suppose the source node $s$ sends two back-to-back packets to destination nodes $i$ and $j$, for which the first packet (denoted by $a$) is sent to node $i$ and the second packet (denoted by $b$) is sent to node $j$. Let $Z_e^a$ and $Z_e^b$ be the link state variables experienced by packet $a$ and packet $b$ in link $e$, respectively.
   First consider link loss (or utilization) inference. Let $\alpha_e = P(Z_e^x = 1)$ for $x \in \{a, b\}$ be the marginal success rate of link $e$. Let $\beta_e = P(Z_e^b = 1 \mid Z_e^a = 1)$ be the conditional success rate of link $e$, i.e., $\beta_e$ is the conditional probability of the second packet $b$ successfully going through link $e$ given that the first packet $a$ successfully goes through link $e$.
   If $0 < \alpha_e < \beta_e \le 1$ for all links, then $0 < \alpha_e / \beta_e < 1$, and we can construct an additive metric $d_l$ with link length
$$d_l(e) = -\log \frac{\alpha_e}{\beta_e}, \quad \forall e \in E.$$
   In real networks, we would expect $\alpha_e < \beta_e$, because the fact that the first packet successfully goes through a link indicates that the link is in a good state, so the second packet, which closely follows the first packet, is likely to go through the link as well. This phenomenon was observed in real Internet measurements (e.g., [5], [29]).
   Let $L_i^a$ and $L_j^b$ be the loss outcome variables of packets $a$ and $b$ at nodes $i$ and $j$, respectively. Under the spatial independence assumption, we have
$$P(L_i^a = 1) = \prod_{e \in P(s,i)} \alpha_e, \quad P(L_j^b = 1) = \prod_{e \in P(s,j)} \alpha_e,$$
$$P(L_i^a L_j^b = 1) = \prod_{e \in P(s,ij)} \alpha_e \beta_e \prod_{e \in P(ij,i)} \alpha_e \prod_{e \in P(ij,j)} \alpha_e.$$
   Hence $\rho_l(s, D) \cup \rho_l(s, D^2)$ can be obtained by
$$\rho_l(i) = -\log \frac{P(L_i^a = 1) P(L_i^b = 1)}{P(L_i^a L_i^b = 1)}, \quad i \in D;$$
$$\rho_l(i, j) = -\log \frac{P(L_i^a = 1) P(L_j^b = 1)}{P(L_i^a L_j^b = 1)}, \quad i, j \in D. \tag{14}$$
   Now consider link delay inference. If $\mathrm{cov}(Z_e^a, Z_e^b) > 0$ for all links (which we would expect to hold in real networks because the two back-to-back packets are very close, hence their experienced delays in a shared link are positively correlated), then we can construct an additive metric $d_v$ with link length
$$d_v(e) = \mathrm{cov}(Z_e^a, Z_e^b), \quad \forall e \in E.$$
   Let $T_i^a$ and $T_j^b$ be the delay outcome variables of packets $a$ and $b$ at nodes $i$ and $j$, respectively. We have $T_i^a = \sum_{e \in P(s,i)} Z_e^a$ and $T_j^b = \sum_{e \in P(s,j)} Z_e^b$.
   Under the spatial independence assumption, we have
$$\mathrm{cov}(T_i^a, T_j^b) = \sum_{e \in P(s,ij)} \mathrm{cov}(Z_e^a, Z_e^b),$$
$$\mathrm{cov}(T_i^a, T_i^b) = \sum_{e \in P(s,i)} \mathrm{cov}(Z_e^a, Z_e^b).$$
   Hence $\rho_v(s, D) \cup \rho_v(s, D^2)$ can be obtained by
$$\rho_v(i) = \mathrm{cov}(T_i^a, T_i^b), \quad i \in D;$$
$$\rho_v(i, j) = \mathrm{cov}(T_i^a, T_j^b), \quad i, j \in D. \tag{15}$$
   As in (11), (12), (13), we can construct explicit estimators for the path lengths and shared path lengths in (14) and (15) using empirical distributions measured by the terminal nodes.

C. Additive Metric Based on Traceroute-like Probing
   Using traceroute-like probing, a source node can obtain the unique labels (IP addresses) of the internal nodes (routers) in the path from it to any destination node. We can construct an additive metric $d_h$ by defining the link length $d_h(e)$ to be the number of hops (physical links) contained in logical link $e$.
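   As a minimal sketch of the hop-count metric, assume that each traceroute result is available as the ordered list of node labels from the first hop after the source to the destination (a hypothetical input format; the function names and the example paths below are illustrative and not part of the original paper). Under this assumption the path length is the length of the list, and the shared path length of two destinations is the length of the common prefix of their lists.

```python
from typing import Hashable, Sequence


def hop_path_length(path: Sequence[Hashable]) -> int:
    """rho_h(i): number of hops on the path from the source to i.

    `path` holds the labels of the nodes after the source, ending with
    the destination itself, so every entry corresponds to one hop.
    """
    return len(path)


def hop_shared_path_length(path_i: Sequence[Hashable],
                           path_j: Sequence[Hashable]) -> int:
    """rho_h(i, j): number of hops in the common prefix of the two paths."""
    shared = 0
    for label_i, label_j in zip(path_i, path_j):
        if label_i != label_j:
            break
        shared += 1
    return shared


# Hypothetical paths: nodes 4 and 5 share the prefix 1 -> 2, while
# nodes 4 and 6 share only node 1. Node 3 is an assumed label.
paths = {
    4: [1, 2, 4],
    5: [1, 2, 5],
    6: [1, 3, 6],
}
print(hop_path_length(paths[4]))                    # 3
print(hop_shared_path_length(paths[4], paths[5]))   # 2
print(hop_shared_path_length(paths[4], paths[6]))   # 1
```

   The prefix comparison relies on every internal node answering with a unique label; it does not handle nodes that fail to respond.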
   The path length $\rho_h(i)$ is the number of hops contained in the path from $s$ to $i$. The shared path length $\rho_h(i, j)$ is the number of hops contained in the shared portion of the paths from $s$ to $i$ and $j$. The shared portion of two paths can be determined by comparing the labels of the internal nodes in the two paths.
   If some internal nodes do not respond to traceroute-like probing (e.g., anonymous routers, layer-2 switches, MPLS switches), then the derived path lengths and shared path lengths can be distorted. We use $\hat{\rho}_h(s, D)$ and $\hat{\rho}_h(s, D^2)$ to denote the estimated path lengths and shared path lengths with possible measurement errors.

   V. TREE TOPOLOGY INFERENCE BASED ON NEIGHBOR JOINING
   We first present a topology inference algorithm using (estimated) path lengths and shared path lengths as the input. The algorithm is a grouping type algorithm as in [16] and [23]. It can be viewed as a rooted version of the widely used neighbor-joining algorithm for constructing phylogenetic trees from distances [18], [24]. The algorithm begins with a leaf set including all the destination nodes. In each step it selects a group of nodes that are likely to be neighbors (i.e., siblings, nodes with the same parent on the tree), deletes them from the leaf set, creates a new node as their parent and adds that node to the leaf set. The whole process is iterated until there is only one node left in the leaf set, which will be the child of the root (source node). To avoid trivial cases, we assume $|D| \ge 2$.

Algorithm 1: Rooted Neighbor-Joining (RNJ) Algorithm

Input: Source $s$, Destinations $D$, $\hat{\rho}(s, D)$, $\hat{\rho}(s, D^2)$, $\Delta > 0$.
 1. $V = \{s\} \cup D$, $E = \emptyset$.
2.1 Find $i^*, j^* \in D$ with the largest $\hat{\rho}(i, j)$ (break the tie arbitrarily). Create a node $f$ as the parent of $i^*$ and $j^*$.
    $D = D \setminus \{i^*, j^*\}$,
    $V = V \cup \{f\}$,
    $E = E \cup \{(f, i^*), (f, j^*)\}$.
    (+) $\hat{d}(f, i^*) = \hat{\rho}(i^*) - \hat{\rho}(i^*, j^*)$,
    (+) $\hat{d}(f, j^*) = \hat{\rho}(j^*) - \hat{\rho}(i^*, j^*)$.
2.2 For every $k \in D$ such that $\hat{\rho}(i^*, j^*) - \hat{\rho}(i^*, k) \le \Delta/2$:
    $D = D \setminus \{k\}$,
    $E = E \cup \{(f, k)\}$.
    (+) $\hat{d}(f, k) = \hat{\rho}(k) - \hat{\rho}(i^*, j^*)$.
2.3 For each $k \in D$, compute:
$$\hat{\rho}(k, f) = \frac{1}{2} \left[ \hat{\rho}(k, i^*) + \hat{\rho}(k, j^*) \right]. \tag{16}$$
    $D = D \cup \{f\}$.
    (+) $\hat{\rho}(f) = \hat{\rho}(i^*, j^*)$.
 3. If $|D| = 1$, for the $k \in D$: $E = E \cup \{(s, k)\}$.
    Otherwise, repeat Step 2.
Output: Tree $\hat{T} = (V, E)$, and link length $\hat{d}(e)$ for all $e \in E$.

   Note that in Equation (16) of Step 2.3, we compute the shared path length between nodes $k$ and $f$, $\hat{\rho}(k, f)$, using measurements from only two children of $f$. If $f$ has more than two children, we could utilize measurements from all of them as follows:
$$\hat{\rho}(k, f) = \frac{1}{|c(f)|} \sum_{i \in c(f)} \hat{\rho}(k, i). \tag{17}$$
This modification improves the accuracy of the RNJ algorithm in our simulation (see footnote 2).
   Footnote 2: We thank an anonymous reviewer for this suggestion.
   The computational complexity of the RNJ algorithm is $O(N^2 \log N)$ for a routing tree with $N$ destination nodes. Note that the RNJ algorithm only requires (estimated) shared path lengths, $\hat{\rho}(s, D^2)$, to infer the tree topology (steps without (+)). If the (estimated) path lengths $\hat{\rho}(s, D)$ are also available, then the RNJ algorithm can infer the link lengths as well (steps with (+)). If there is a one-to-one mapping between the link performance parameters (e.g., success rate, utilization, delay variance) and the link lengths, as in (5), (7), (9), then we can use the link lengths returned by the RNJ algorithm to estimate the link performance parameters.

A. Analysis of RNJ Algorithm
   Let $T$ be the true topology of the routing tree. Let the $d(e)$'s be the true link lengths and $\rho(s, D^2)$ be the true shared path lengths under additive metric $d$ on $T$.
   Proposition 1: Let $\Delta \le \min_{e \in E} d(e)$ (the minimum link length on the routing tree) be the input parameter. A sufficient condition for the RNJ algorithm to return the correct tree topology is:
$$|\hat{\rho}(i, j) - \rho(i, j)| < \frac{\Delta}{4}, \quad \forall i, j \in D. \tag{18}$$
   Proof: We prove by induction on the cardinality of $D$.
(1) If $|D| = 2$, then clearly the RNJ algorithm will return the correct tree topology.
(2) Assume the RNJ algorithm returns the correct topology under condition (18) for $|D| \le N$. Now consider $|D| = N + 1$.
Claim 1. The nodes $i^*, j^*$ found in Step 2.1, which maximize $\hat{\rho}(i, j)$, are siblings.
If $i^*$ and $j^*$ are not siblings, then $\exists k \in D$ such that either $i^* k$ or $j^* k$ is descended from $i^* j^*$. Without loss of generality, suppose $i^* k$ is descended from $i^* j^*$. This implies $\rho(i^*, k) > \rho(i^*, j^*)$. Since link lengths are at least $\Delta$, we have $\rho(i^*, k) \ge \rho(i^*, j^*) + \Delta$, and then under condition (18),
$$\hat{\rho}(i^*, k) > \rho(i^*, k) - \frac{\Delta}{4} > \rho(i^*, j^*) + \frac{\Delta}{4} > \hat{\rho}(i^*, j^*),$$
a contradiction to the maximality of $\hat{\rho}(i^*, j^*)$.
Claim 2. Node $k$ will be selected in Step 2.2 if and only if it is a sibling of $i^*$ and $j^*$.
If $k$ is a sibling of $i^*$ and $j^*$, then $\rho(i^*, j^*) = \rho(i^*, k)$. This, together with condition (18), implies $\hat{\rho}(i^*, j^*) - \hat{\rho}(i^*, k) < \Delta/2$. Hence $k$ will be selected in Step 2.2.
If $k$ is not a sibling of $i^*$ and $j^*$, and since $i^*$ and $j^*$ are siblings, then $i^* j^*$ is descended from $i^* k$. Since link lengths
$\ge \Delta$, we have $\rho(i^*, j^*) \ge \rho(i^*, k) + \Delta$, and then under condition (18),
$$\hat{\rho}(i^*, j^*) > \rho(i^*, j^*) - \frac{\Delta}{4} \ge \rho(i^*, k) + \frac{3\Delta}{4} > \hat{\rho}(i^*, k) + \frac{\Delta}{2},$$
which implies $k$ will not be selected in Step 2.2.
Claim 3. Condition (18) is maintained after Step 2.3.
We have $|\hat{\rho}(k, i^*) - \rho(k, i^*)| < \Delta/4$ and $|\hat{\rho}(k, j^*) - \rho(k, j^*)| < \Delta/4$. Since $\hat{\rho}(k, f) = \frac{1}{2} (\hat{\rho}(k, i^*) + \hat{\rho}(k, j^*))$ and $\rho(k, f) = \frac{1}{2} (\rho(k, i^*) + \rho(k, j^*))$, by the triangle inequality we have $|\hat{\rho}(k, f) - \rho(k, f)| < \Delta/4$.
   From Claims 1, 2, 3, after one iteration of Step 2, the RNJ algorithm will correctly find a pair of siblings and all their other siblings (if any), and condition (18) is maintained for the new set of leaf nodes. Then $|D|$ is decreased by at least 1. By the induction assumption, the algorithm will return the correct topology of the rest of the tree. This completes our proof of the proposition.

[Figure: "Random General Trees, 12 Destination Nodes, Link Loss Rates [1%, 10%]". Curves: RNJ Algorithm, BLTP Algorithm. x axis: log(sample size), from 5 to 13. y axis: fraction of correctly inferred trees, from 0 to 1.]
Fig. 2. Comparison of RNJ and BLTP under large link loss rates.
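
   Algorithm 1 and the guarantee of Proposition 1 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: it infers the topology alone (the steps without (+)), uses Equation (16) in Step 2.3, represents shared path lengths as a dictionary keyed by `frozenset({i, j})`, and labels new parents with hypothetical tuples `("f", n)`. The function names, the example tree, and the noise pattern are ours and are not taken from the original paper.

```python
from itertools import combinations


def rnj(source, dests, rho2, delta):
    """Sketch of Algorithm 1, topology only (the steps without (+)).

    rho2 maps frozenset({i, j}) to the (estimated) shared path length
    for every pair of distinct destinations. Returns the inferred
    edges as a set of (parent, child) pairs.
    """
    rho2 = dict(rho2)        # local copy; entries for new parents are added
    leaves = set(dests)
    edges = set()
    n_new = 0

    def rho(a, b):
        return rho2[frozenset((a, b))]

    while len(leaves) > 1:
        # Step 2.1: the pair with the largest shared path length.
        # Sorting by str only makes the tie-breaking deterministic.
        i, j = max(combinations(sorted(leaves, key=str), 2),
                   key=lambda pair: rho(*pair))
        f = ("f", n_new)     # hypothetical label for the new parent
        n_new += 1
        leaves -= {i, j}
        group = [i, j]
        # Step 2.2: the remaining siblings of i and j.
        for k in sorted(leaves, key=str):
            if rho(i, j) - rho(i, k) <= delta / 2:
                group.append(k)
        leaves -= set(group)
        edges.update((f, child) for child in group)
        # Step 2.3, Equation (16): shared path length between k and f.
        for k in leaves:
            rho2[frozenset((k, f))] = (rho(k, i) + rho(k, j)) / 2
        leaves.add(f)
    # Step 3: the last remaining node is the child of the source.
    (last,) = leaves
    edges.add((source, last))
    return edges


def leaf_clusters(edges, dests):
    """Destination sets below each parent; these identify the topology."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    def below(node):
        if node in dests:
            return frozenset([node])
        return frozenset().union(*(below(c) for c in children[node]))

    return {below(parent) for parent in children}


# Hypothetical tree with unit link lengths: s -> 1, 1 -> {2, 3},
# 2 -> {4, 5, 8}, 3 -> {6, 7}. The destinations are 4, 5, 6, 7, 8.
dests = {4, 5, 6, 7, 8}
under_2 = {4, 5, 8}
true_rho2 = {
    frozenset(p): 2.0 if (p[0] in under_2) == (p[1] in under_2) else 1.0
    for p in combinations(sorted(dests), 2)
}
# Deterministic errors of size 0.2, below delta / 4 = 0.25 as in (18).
noisy_rho2 = {p: v + (0.2 if sum(p) % 2 == 0 else -0.2)
              for p, v in true_rho2.items()}

exact = leaf_clusters(rnj("s", dests, true_rho2, delta=1.0), dests)
noisy = leaf_clusters(rnj("s", dests, noisy_rho2, delta=1.0), dests)
print(exact == noisy)
```

   In this example both the exact and the perturbed inputs yield the same grouping of destinations, {4, 5, 8} and {6, 7} under a common parent, as condition (18) predicts for errors smaller than $\Delta/4$. Replacing the update in Step 2.3 by an average over all members of `group` would give the variant in Equation (17).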
   Therefore, if the estimated shared path lengths are close enough to the true values, the RNJ algorithm will return the correct tree topology. We can derive exponential error bounds for the shared path length estimators in (11), (12) using Chernoff bounds [21].
   Proposition 2: For any pair of nodes $i, j \in D$, a sample size of $n$ (number of probes to estimate $\hat{\rho}_l$ or $\hat{\rho}_u$), and any small $\epsilon > 0$:
$$P\{ |\hat{\rho}_l(i, j) - \rho_l(i, j)| \ge \epsilon \} \le e^{-c_{ij}(\epsilon) n} \tag{19}$$
$$P\{ |\hat{\rho}_u(i, j) - \rho_u(i, j)| \ge \epsilon \} \le e^{-b_{ij}(\epsilon) n} \tag{20}$$
where the $c_{ij}(\epsilon)$'s and $b_{ij}(\epsilon)$'s are some constants.
   Let $\hat{T}_n$ be the inferred tree topology returned by the RNJ algorithm with a sample size $n$. Let $P_n = P\{\hat{T}_n = T\}$ denote the probability of correct topology inference of the RNJ algorithm.
   Proposition 3: Let $\Delta \le \min_{e \in E} d(e)$ be the input parameter of the RNJ algorithm. If
$$P\left\{ |\hat{\rho}(i, j) - \rho(i, j)| \ge \frac{\Delta}{4} \right\} \le e^{-c_{ij}(\Delta) n}, \quad \forall i, j \in D,$$
where $n$ is the sample size and $c_{ij}(\Delta)$ is some constant, then for a routing tree with $N$ destination nodes:
$$P_n \ge 1 - N^2 e^{-c(\Delta) n}, \tag{21}$$
i.e., the probability of correct topology inference of the RNJ algorithm converges to one exponentially fast in the sample size.
   Proof: By Proposition 1 and the union bound we have
$$P_n \ge P\left\{ \bigcap_{i,j \in D} |\hat{\rho}(i, j) - \rho(i, j)| < \frac{\Delta}{4} \right\}$$
$$= 1 - P\left\{ \bigcup_{i,j \in D} |\hat{\rho}(i, j) - \rho(i, j)| \ge \frac{\Delta}{4} \right\}$$
$$\ge 1 - \sum_{i,j \in D} e^{-c_{ij}(\Delta) n} \ge 1 - N^2 e^{-c(\Delta) n}$$
where $c(\Delta) = \min_{i,j \in D} c_{ij}(\Delta)$.

[Figure: "Random General Trees, 12 Destination Nodes, Link Loss Rates [0.1%, 1%]". Curves: RNJ Algorithm, BLTP Algorithm. x axis: log(sample size), from 5 to 13. y axis: fraction of correctly inferred trees, from 0 to 1.]
Fig. 3. Comparison of RNJ and BLTP under small link loss rates.

B. Comparison with Previous Grouping Algorithms
   The grouping algorithms in [16], [23] aggregate the measurement data from the destination nodes up the tree, an approach that is designed specifically for multicast probing. In contrast, the RNJ algorithm only requires (estimated) shared path lengths between pairs of the destination nodes, which makes it applicable to both multicast probing and unicast packet pair probing.
   Under multicast probing, for general (nonbinary) routing trees, the RNJ algorithm has a much lower computational complexity, while it may also require a larger sample size to achieve the same level of accuracy compared to the maximum-likelihood based grouping algorithm in [16]. Nevertheless, we have shown that the probability of correct topology inference of the RNJ algorithm converges to one exponentially fast in the sample size.
   We compare the accuracy of the RNJ algorithm with the BLTP algorithm (the reference grouping algorithm in [16] which has best accuracy and complexity) via model simula-
tion. For each experiment, we first randomly generate the tree topology and select the link loss rates in a certain range. We compare the inferred tree topology returned by RNJ and BLTP with the true tree topology. Each experiment is repeated 200 times. For each inference algorithm, we compute the fraction of correctly inferred trees among all 200 trials (which can be viewed as the probability of correct topology inference of the algorithm).
   The results are shown in Figs. 2-3. The x axis is in log scale, i.e., it is $\log_2 n$ for a sample size of $n$ probes. Both the RNJ algorithm and the BLTP algorithm are consistent: the fraction of correctly inferred trees of both algorithms goes to 1 exponentially fast as we increase the sample size. When the link loss rates are within the range of [1%, 10%], the BLTP algorithm has noticeably better accuracy than the RNJ algorithm, while when the link loss rates are within the range of [0.1%, 1%], the difference is small (this is consistent with our analysis in [21], where we show that the simple estimator for shared path lengths (11) is as efficient as the MLE for networks with small loss rates). We conduct experiments for trees with different sizes and ranges of link loss rates and we observe the same pattern of results.

         VI. DYNAMIC TREE TOPOLOGY INFERENCE
   In practice, the RNJ algorithm (and other existing topology inference algorithms) may have some limitations. First, the focus of previous studies is on a relatively stable set of nodes. In real applications (e.g., P2P applications), the destination

experiments and we found that it only has decent accuracy for a small number of destination nodes (less than six). Therefore, poor probing scalability of unicast packet pair probing will limit the number of destination nodes that a source node can infer when multicast probing is not supported.
   We address these issues in this section. We design procedures to add a node to (add_node) and delete a node from (delete_node) a routing tree. These procedures can handle node joining and leaving efficiently, and are particularly useful for applications where node dynamics are prevalent. Based on the add_node procedure, we propose a novel sequential topology inference algorithm, which greatly reduces the probing overhead under unicast packet pair probing.

A. Procedure add_node
   add_node($T$, $k$, $j$, $\Delta$) is a recursive procedure that adds a new destination node $j$ to the routing tree $T = (V, E)$ via an existing node $k$ on the tree, with the initial condition that $j$ is a sibling or descendant of node $k$. $\Delta$ is the (estimated) minimum link length. Let $f(k)$ be the parent of $k$ on the (old) tree $T$.

Procedure: add_node($T$, $k$, $j$, $\Delta$)
   IF $k$ is a leaf node on the tree $T = (V, E)$:
      ($j$ will be a sibling of $k$ on the new tree.)
      1. Create a node $p$ as the parent of $k$ and $j$.
nodes that a source node communicates with will often change               V = V ∪ {p, j},
over time. Hence the routing tree topology will also change                E = E \ (f (k), k) ∪ {(f (k), p), (p, k), (p, j)}.
over time. When an existing destination node leaves, it is        ELSE Suppose k has l children c1 , ..., cl .
straightforward to derive the updated routing tree topology.            2. Select a destination node di descended from ci .
When a new destination node joins, running the RNJ algorithm                                   ˆ              ˆ
                                                                        3. Measure/estimate ρ(d1 , d2 ) and ρ(j, di ) for i = 1, ..., l.
over the new set of destination nodes to infer the updated                                             ˆ
                                                                        4. Find di∗ with the largest ρ(j, di ).
routing tree topology is not efficient when the nodes join and           Case (a): ρ(d1 , d2 ) − ρ(j, di∗ ) ≥ ∆ :
                                                                                    ˆ             ˆ           2
leave frequently.                                                       (j will be a sibling of k on the new tree.)
   The second limitation is the probing scalability problem             5. Create a node p as the parent of k and j.
under unicast probing. The RNJ algorithm requires estimated                V = V ∪ {p, j},
shared path lengths from the source node to all pairs of the               E = E \ (f (k), k) ∪ {(f (k), p), (p, k), (p, j)}.
destination nodes as the input. Suppose there are N destination         Case (b): |ˆ(d1 , d2 ) − ρ(j, di∗ )| < ∆ :
                                                                                    ρ              ˆ            2
nodes. If multicast probing is available, then the source node          (j will be a child of k on the new tree.)
can use a 1 × N multicast probing to obtain the required                6. V = V ∪ j, E = E ∪ (k, j).
measurements. The probing overhead is O(N ). On the other               Case (c): ρ(j, di∗ ) − ρ(d1 , d2 ) ≥ ∆ :
                                                                                   ˆ             ˆ            2
hand, if multicast probing is not supported and N is large, then        (j will be a sibling or descendant of ci∗ on the new tree.)
it is difficult to obtain ρ(s, D2 ) using a single 1 × N unicast
                         ˆ                                              7. Execute add_node(T , ci∗ , j, ∆).
packet string probing without violating the assumption that the
string of packets have the same or even positively correlated
network experiences in the shared links.                              By running add_node(T , s, j, ∆), we add a new destina-
   The source node could use back-to-back (unicast) packet tion node j to the routing tree T rooted at s.
pair probings. This requires O(N 2 ) 1 × 2 probings. The              In Step 3 of add_node(T , k, j, ∆), in order to estimate
probing overhead is O(N 2 ). If these probings are conducted the shared path lengths ρ(d1 , d2 ) and ρ(j, di ) for i = 1, ..., l,
                                                                                             ˆ               ˆ
in parallel, then this will quickly consume the outgoing s can use a 1 × (l + 1) (multicast) probing, by sending probes
bandwidth of the source node; while if these probings are to destination nodes j, d1 , ..., dl ; alternatively, s can use l + 1
conducted in sequence, then it will take a long time to (unicast) packet pair probings, by sending probes to node pairs
obtain the measurements, and it is likely that the network (d1 , d2 ), (j, d1 ), ..., (j, dl ).
states (routing topology, link performance metrics) will change       For an l-ary (balanced) tree with N destination nodes,
during the measurement period which will violate the station- the depth of the tree is O(logl N ). In the worst case, the
arity assumption. We tested the RNJ algorithm via Internet add_node procedure needs to be executed O(logl N ) times



in order to add a new destination node to the tree. Under unicast packet pair probing, if we apply the add_node procedure to infer the topology of the new tree, we need $O(l \log_l N)$ packet pair probings, and the computational complexity is $O(l \log_l N)$. If we instead apply the RNJ algorithm to infer the topology of the new tree, we need $O(N^2)$ packet pair probings, and the computational complexity is $O(N^2 \log N)$.

[Fig. 4. Three cases of adding a new node j to the tree via a node k on the tree: (a) j is a sibling of k; (b) j is a child of k; (c) j is a sibling or descendant of $c_{i^*}$.]

B. Analysis of Procedure add_node

If the estimated shared path lengths in Step 3 are close enough to the true values, then add_node(T, s, j, ∆) will correctly add a new destination node to the tree.

Proposition 4: Let ∆ be less than or equal to the minimum link length in the new routing tree (including existing destination nodes and the new destination node j). A sufficient condition for the recursive procedure add_node(T, s, j, ∆) to return the correct tree topology (after adding node j) is that for all the nodes k visited by the recursive procedure:

$$|\hat\rho(d_1, d_2) - \rho(d_1, d_2)| < \frac{\Delta}{4}, \qquad |\hat\rho(j, d_i) - \rho(j, d_i)| < \frac{\Delta}{4}, \quad i = 1, 2, ..., l. \tag{22}$$

Proof: We prove that if k is a sibling or ancestor of j in the new routing tree and condition (22) is satisfied, then procedure add_node(T, k, j, ∆) will either correctly add j to the tree, or find a child c of k that is a sibling or ancestor of j and execute add_node(T, c, j, ∆) recursively. Hence add_node(T, s, j, ∆) finally returns the correct new routing tree topology.

Now assume k is a sibling or ancestor of j and condition (22) is satisfied. If k is a leaf node (i.e., k is a destination node), then k and j must be siblings, so Step 1 of add_node(T, k, j, ∆) correctly adds j to the tree. Otherwise suppose k has l children $c_1, ..., c_l$, and $d_i$ is a destination node descended from $c_i$ selected in Step 2. In Step 3, $\hat\rho(d_1, d_2)$ and $\hat\rho(j, d_i)$ for i = 1, ..., l are measured and estimated. There are three cases to consider.

Case (a): j is a sibling of k in the new routing tree, as shown in Fig. 4(a). In this case, for the $d_{i^*}$ found in Step 4 we have $\rho(d_1, d_2) - \rho(j, d_{i^*}) \ge \Delta$. Under (22) this implies $\hat\rho(d_1, d_2) - \hat\rho(j, d_{i^*}) > \Delta/2$, so Step 5 will be executed, which correctly adds j to the tree.

Case (b): j is a child of k, as shown in Fig. 4(b). In this case $\rho(d_1, d_2) = \rho(j, d_{i^*})$. Under (22) this implies $|\hat\rho(d_1, d_2) - \hat\rho(j, d_{i^*})| < \Delta/2$, so Step 6 will be executed, which correctly adds j to the tree.

Case (c): j is a sibling or descendant of a child of k, as shown in Fig. 4(c). Suppose $c_{i^*}$ is the child and $d_{i^*}$ is the selected destination node descended from $c_{i^*}$ in Step 2. Then $\rho(j, d_{i^*}) - \rho(j, d_i) \ge \Delta$ for $i \ne i^*$ and $\rho(j, d_{i^*}) - \rho(d_1, d_2) \ge \Delta$. Under (22) this implies $\hat\rho(j, d_{i^*}) > \hat\rho(j, d_i)$, so $d_{i^*}$ will be selected in Step 4, and $\hat\rho(j, d_{i^*}) - \hat\rho(d_1, d_2) > \Delta/2$, hence add_node(T, $c_{i^*}$, j, ∆) will be executed in Step 7.

Proposition 5: Let ∆ be less than or equal to the minimum link length in the new routing tree. If for all the nodes k visited by the recursive procedure add_node(T, s, j, ∆), we have

$$P\{|\hat\rho(d_1, d_2) - \rho(d_1, d_2)| \ge \tfrac{\Delta}{4}\} \le e^{-c_{d_1 d_2}(\Delta) n},$$
$$P\{|\hat\rho(j, d_i) - \rho(j, d_i)| \ge \tfrac{\Delta}{4}\} \le e^{-c_{j d_i}(\Delta) n}, \quad i = 1, ..., l,$$

where n is the sample size and $c_{d_1 d_2}(\Delta)$, $c_{j d_i}(\Delta)$ are some constants, then the probability of correct topology inference of add_node(T, s, j, ∆) for an l-ary tree with N destination nodes satisfies:

$$P_n \ge 1 - (l + 1)(\log_l N) e^{-c(\Delta) n}. \tag{23}$$

Proof: The proof is similar to the proof of Proposition 3.

C. Procedure delete_node

Procedure delete_node(T, j) deletes a destination node j from routing tree T. It first removes node j and link (f(j), j) from the tree. If f(j) has only one child left after deleting j, it then further removes node f(j) and connects the child of f(j) to the parent of f(j), so that the new routing tree maintains the property that each internal node has at least two children.

Procedure: delete_node(T, j)
   1. V = V \ {j}, E = E \ {(f(j), j)}.
   2. If f(j) has only one child c left:
      V = V \ {f(j)},
      E = E \ {(f(f(j)), f(j)), (f(j), c)} ∪ {(f(f(j)), c)}.

D. Sequential Topology Inference Algorithm

For a source node s and a set of destination nodes D, we can apply the add_node procedure over the nodes in D in sequence to construct the routing tree topology incrementally, as described in Algorithm 2.

We compare the RNJ algorithm and the sequential topology inference algorithm in Table I. We assume all probings have the same sample size and time interval between two consecutive probes. Under multicast probing, the RNJ algorithm is more efficient (for building the whole


                                        TABLE I
     COMPARISON OF RNJ ALGORITHM AND SEQUENTIAL TOPOLOGY INFERENCE ALGORITHM
               (N Destination Nodes, l-ary Tree with Depth O(log_l N))

                          Multicast Probing                Unicast Packet Pair Probing      Computational
Task         Algorithm    Probing Traffic  Probing Time    Probing Traffic  Probing Time    Complexity
Add Node     RNJ          O(N)             O(1)            O(N^2)           O(N^2)          O(N^2 log N)
Add Node     Sequential   O(l log_l N)     O(log_l N)      O(l log_l N)     O(l log_l N)    O(l log_l N)
Build Tree   RNJ          O(N)             O(1)            O(N^2)           O(N^2)          O(N^2 log N)
Build Tree   Sequential   O(N l log_l N)   O(N log_l N)    O(N l log_l N)   O(N l log_l N)  O(N l log_l N)
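The "Sequential" rows of Table I rest on the add_node recursion and on Algorithm 2. The following Python sketch is one possible illustration of those procedures, not the authors' implementation. The names Tree, make_rho, clusters and the toy tree TRUTH are hypothetical. Exact shared path lengths computed from the known toy tree stand in for the estimates $\hat\rho$. Two further assumptions are not spelled out in the procedure: the root s is taken to have shared path length 0 (so no measurement of $\hat\rho(d_1, d_2)$ is needed there), and the first destination is attached directly to s.

```python
from itertools import count


class Tree:
    """Rooted tree kept as parent/children maps (topology only)."""

    def __init__(self, root):
        self.root, self.parent, self.children = root, {root: None}, {root: []}
        self._ids = count()

    def attach(self, parent, child):
        """Add link (parent, child), keeping any subtree below child."""
        self.parent[child] = parent
        self.children.setdefault(child, [])
        self.children[parent].append(child)

    def split_above(self, k, j):
        """Steps 1 and 5: a new node p becomes the parent of k and j."""
        f, p = self.parent[k], f"p{next(self._ids)}"
        self.children[f].remove(k)
        for parent, child in ((f, p), (p, k), (p, j)):
            self.attach(parent, child)

    def some_leaf(self, v):
        """Step 2: pick one destination node descended from v."""
        while self.children[v]:
            v = self.children[v][0]
        return v


def add_node(tree, k, j, delta, rho):
    """Add destination j beside or below node k, following cases (a)-(c)."""
    kids = tree.children[k]
    if not kids:
        if k == tree.root:      # assumption: empty tree, j hangs off s
            tree.attach(k, j)
        else:                   # Step 1: k is a leaf, so j is its sibling
            tree.split_above(k, j)
        return
    reps = [tree.some_leaf(c) for c in kids]                # Step 2
    # Step 3. Assumption: the root has shared path length 0.
    depth_k = 0.0 if k == tree.root else rho(reps[0], reps[1])
    shared = [rho(j, d) for d in reps]
    best = max(range(len(kids)), key=shared.__getitem__)    # Step 4
    gap = depth_k - shared[best]
    if gap >= delta / 2:        # Case (a): j is a sibling of k
        tree.split_above(k, j)
    elif gap > -delta / 2:      # Case (b): j is a child of k
        tree.attach(k, j)
    else:                       # Case (c): recurse into child c_{i*}
        add_node(tree, kids[best], j, delta, rho)


def delete_node(tree, j):
    """Remove destination j; splice out its parent if one child remains."""
    f = tree.parent.pop(j)
    del tree.children[j]
    tree.children[f].remove(j)
    if f != tree.root and len(tree.children[f]) == 1:
        c, g = tree.children.pop(f)[0], tree.parent.pop(f)
        tree.children[g].remove(f)
        tree.attach(g, c)


def sequential_inference(root, dests, delta, rho):
    """Algorithm 2: add the destinations one at a time, starting from {s}."""
    tree = Tree(root)
    for j in dests:
        add_node(tree, root, j, delta, rho)
    return tree


def make_rho(truth, root):
    """Exact shared path lengths on a known tree: child -> (parent, length)."""
    def path(v):                # nodes from just below the root down to v
        nodes = []
        while v != root:
            nodes.append(v)
            v = truth[v][0]
        return nodes[::-1]

    def rho(i, j):
        total = 0.0
        for a, b in zip(path(i), path(j)):
            if a != b:
                break
            total += truth[a][1]
        return total
    return rho


def clusters(tree):
    """Destination sets below each internal node (root excluded)."""
    def leaves(v):
        kids = tree.children[v]
        return frozenset().union(*map(leaves, kids)) if kids else frozenset([v])
    return {leaves(v) for v, kids in tree.children.items()
            if kids and v != tree.root}


# Hypothetical ground truth: s - a; a has children 1, b, 4; b has children 2, 3.
TRUTH = {"a": ("s", 1.0), 1: ("a", 1.0), "b": ("a", 1.0),
         2: ("b", 1.0), 3: ("b", 1.0), 4: ("a", 1.0)}
rho = make_rho(TRUTH, "s")
tree = sequential_inference("s", [1, 2, 3, 4], delta=1.0, rho=rho)
print(sorted(map(sorted, clusters(tree))))   # [[1, 2, 3, 4], [2, 3]]
delete_node(tree, 3)
print(sorted(map(sorted, clusters(tree))))   # [[1, 2, 4]]
```

With exact shared path lengths the recovered destination sets match the internal nodes a and b of the toy tree, whatever order the destinations are added in. With estimated lengths, correctness would depend on a condition such as (22) holding at every visited node.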



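As a rough illustration of the asymptotic entries in Table I, the following worked instance plugs in N = 1024 destination nodes and a binary tree (l = 2). These values are assumptions chosen only for illustration, and the constant factors hidden by the O(·) notation are ignored, so the figures indicate orders of magnitude rather than exact probe counts.

```latex
\begin{align*}
N &= 1024, \quad l = 2, \quad \log_l N = \log_2 1024 = 10, \\
\text{Add node, unicast, Sequential:} \quad & l \log_l N = 2 \cdot 10 = 20, \\
\text{Add node, unicast, RNJ:} \quad & N^2 = 1024^2 = 1{,}048{,}576, \\
\text{Build tree, unicast, Sequential:} \quad & N\, l \log_l N = 1024 \cdot 20 = 20{,}480, \\
\text{Build tree, multicast, Sequential:} \quad & N\, l \log_l N = 20{,}480 \quad \text{versus} \quad N = 1024 \text{ for RNJ}.
\end{align*}
```

Under these assumptions the sequential algorithm needs far less unicast probing traffic than RNJ, while RNJ keeps the advantage when multicast probing is available, which matches the comparison in Table I.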
tree); while under unicast packet pair probing, the sequential topology inference algorithm is more efficient, in terms of the probing traffic and probing time. In both cases the sequential topology inference algorithm is more computationally efficient than the RNJ algorithm.

Algorithm 2: Sequential Topology Inference Algorithm
Input: Source Node s, Destination Nodes D = {1, 2, ..., N}, ∆ > 0.
   1. $V_0 = \{s\}$, $E_0 = \emptyset$, $T_0 = (V_0, E_0)$.
   2. For j = 1 to N:
      $T_j$ = add_node($T_{j-1}$, s, j, ∆).
Output: Tree $\hat{T} = T_N$.

VII. INTERNET ROUTING TREE TOPOLOGY INFERENCE

A. Schemes for Internet Routing Tree Topology Inference

In this section we design schemes for Internet routing tree topology inference. We consider the following schemes:

1. Traceroute-based inference scheme (TR): we use traceroute measurements to construct additive metric $d_h$ and derive the shared path lengths $\hat\rho_h(s, D^2)$ as we described in Section IV.

2. Tomography-based inference scheme (Tomo): we use unicast packet pair/string measurements to construct additive metrics $d_l$, $d_u$, $d_v$ and estimate the shared path lengths $\hat\rho_l(s, D^2)$, $\hat\rho_u(s, D^2)$, $\hat\rho_v(s, D^2)$ as we described in Section IV. We construct a new additive metric using a convex combination of the additive metrics to fuse information from different measurements: $d_t = a_l d_l + a_u d_u + a_v d_v$ with $a_l + a_u + a_v = 1$.

We have shown that if the estimated shared path lengths are close enough to the true values (e.g., condition (18) or (22) is satisfied), then the RNJ algorithm and the sequential topology inference algorithm will return the correct routing tree topology.

For traceroute measurements, the estimated shared path lengths can be distorted due to the existence of anonymous routers, layer-2 switches, and MPLS switches. For network tomography measurements, the assumption of independent and stationary link states can be violated, so a larger sample size with a longer measurement period may not return a more accurate estimation of shared path lengths. Hence the condition for correct topology inference (18) or (22) may fail to hold for either type of measurement.

In order to utilize information collected from both traceroute measurements and network tomography measurements, we propose the following hybrid scheme for Internet routing topology inference.

3. Traceroute+Tomography inference scheme (TRTomo): we use both traceroute measurements and network tomography measurements to construct additive metrics $d_h$ and $d_t$, respectively, and we construct a new additive metric $d_{ht} = A \cdot d_h + d_t$ with a large constant A, which makes traceroute measurements dominate network tomography measurements. The reason for selecting a large A is that traceroute measurements have a certain “consistency” property. An anonymous router will affect all the paths passing through that router (i.e., the path lengths of those paths are all reduced by one). Hence if $\hat\rho_h(i, j) > \hat\rho_h(i, k)$, then we know for sure that ij is descended from ik on the routing tree. The reverse, however, may not be true. Even if ij is descended from ik, we may have $\hat\rho_h(i, j) = \hat\rho_h(i, k)$ due to anonymous routers, hence network tomography measurements are needed to further determine the topology.

For a large number of destination nodes, we propose to infer the routing tree topology using a two-step procedure: first use traceroute measurements ($\hat\rho_h$) (or other heuristics, e.g., round trip times, AS information) to build a skeleton of the tree; then apply tomography measurements ($\hat\rho_t$ or $\hat\rho_{ht}$) on subtrees (with a relatively small number of destination nodes) to determine the topology of the subtrees. We find this hybrid approach significantly reduces the probing scalability problem of the pure tomography-based approach. It also leads to better accuracy than the pure traceroute-based approach or the pure tomography-based approach via information fusion.

We refer to the above schemes as TR, Tomo and TRTomo for short hereafter. We evaluate their performance via Internet experiments.

B. Evaluation Methodology

We choose an idle host in our local network as the source node, and two sets of PlanetLab [1] nodes as the destination nodes. We have implemented a sender utility program (running at the source node) that can send unicast probing packet pairs/strings, and a receiver utility program (running at the destination nodes) to receive the probing packets and measure their one-way delays. We collect the measured one-way delays from the destination nodes using the sender utility program.

The first destination node set, referred to as US nodes, consists of 30 hosts in the US (most of them are located in



US universities). The second set, referred to as International nodes, consists of 30 international nodes (10 in North America, 10 in Europe, and 10 in East Asia). The reliability of the chosen nodes is important to the experiments, hence we choose nodes that have low CPU load and long running time.

Each probing from the source node to a subset of the destination nodes consists of a sequence of 1200 packet strings. Each probing packet is of size 80 bytes. The probing interval between two consecutive strings is 10 milliseconds (contributing to a probing rate of 64 kbps per destination node).

We evaluate the performance of the three topology inference schemes by artificially varying the anonymization ratio, which is the fraction of the underlying routers not responding to traceroute probing. For each anonymization ratio, we test the topology inference schemes for 20 rounds.

In each round, we first obtain the sequence of the underlying routers from the source node to every destination node using traceroute. The destination nodes we choose have the property that the paths from the source node to them contain no or very few anonymous routers, so we can obtain the ground-truth topology in order to test the topology inference schemes. We count the total number of unique routers we have seen for all destination nodes, and compute how many of the routers in total should be anonymized according to the anonymization ratio. We then iteratively choose a destination node randomly and anonymize the last m routers along its route³, where m is computed as the anonymization ratio times the route length. We also keep track of the number of unique routers we have anonymized in each iteration, and terminate the anonymization procedure once the total number of unique anonymized routers reaches the number we compute a priori.

We use the following two metrics to evaluate the performance of the topology inference schemes:
   •   Correctness Ratio: the fraction of the internal nodes in the ground-truth topology that are correctly inferred, averaged over all rounds. An internal node in the ground-truth topology is correctly inferred if and only if there is an internal node in the inferred topology with the same set of destination nodes descending from it. A higher correctness ratio means better accuracy of the inference scheme.
   •   Node Ratio: the ratio of the number of internal nodes in the inferred topology to the number of internal nodes in the ground-truth topology, averaged over all rounds. An accurate inference scheme has a node ratio close to one. If the node ratio is larger than one (or

C. Experimental Results

We run experiments using the US nodes and International nodes, and refer to them as US experiments and International experiments, respectively. We plot the correctness ratios (Figs. 5 and 6) and node ratios (Figs. 7 and 8) of different schemes with varying levels of underlying routers being anonymized.

1) Correctness Ratio: As shown in Figs. 5 and 6, both TR and TRTomo can correctly infer most of the internal nodes in the ground-truth topology when the anonymization ratio is small. As the anonymization ratio increases, the correctness ratio of TR decreases to nearly 0, because TR heavily relies on routers’ support for traceroute probing (note that it is not exactly 0 because the source node must be attached to an access router and we always include that router in the inferred routing tree topology); while the correctness ratio of TRTomo stabilizes around 0.5, because TRTomo can improve TR’s accuracy by utilizing both traceroute measurements and tomography measurements.

When the anonymization ratio is 1 (no routers respond to traceroute probing), TRTomo becomes the pure tomography-based scheme (Tomo), so we determine the correctness ratio of Tomo using the correctness ratio of TRTomo at anonymization ratio 1, which is around 0.5.

From our experience, we would like to comment on why the pure Tomo scheme alone can only infer about 50% of the internal nodes, rather than all the internal nodes, in the ground-truth topology. First, the routing topology and link states may be time-varying instead of stationary during the measurement period. Second, there are several limitations of the PlanetLab testbed. We observed that the network connections from the source node to the PlanetLab nodes are pretty good most of the time, hence the shared path lengths derived from loss and delay metrics (the signals) are quite small and can be easily distorted by measurement noise. In addition, most PlanetLab nodes are often running multiple applications and processes. This introduces non-negligible node delays to the delay measurements, which will affect the delay and utilization measurements. (Such a phenomenon has been observed and addressed in [26].)

2) Node Ratio: As shown in Figs. 7 and 8, the node ratio of TR is close to 1 when the anonymization ratio is small, but it decreases to 0 with an increasing anonymization ratio. In contrast, TRTomo has a node ratio close to 1 in all experiments regardless of anonymization ratio, although it may introduce a few more or fewer internal nodes in the inferred tree topology. The node ratio of Tomo is determined by the node ratio of TRTomo at anonymization ratio 1.

VIII. CONCLUSION
       less than one), then the inference algorithm returns more
                                                                                       In this paper, we developed fast and scalable algorithms for
       internal nodes (or less internal nodes) in the inferred
                                                                                    network routing tree topology inference using a framework
       topology.
                                                                                    based on additive metrics. In particular, we proposed a se-
                                                                                    quential topology inference algorithm to address the probing
   3 When choosing the PlanetLab nodes, we find that a lot of them are behind        scalability problem and handle dynamic node joining and
routers that do not respond to traceroute probing. Most of these routers are        leaving efficiently. We proved the correctness of our algorithms
edge routers or access routers of the network in which the destination nodes        and demonstrated their effectiveness via Internet experiments.
are located in. This suggests that traceroute probings are likely to be discarded
in enterprise networks to protect their internal hosts; hence, the routers in the   The proposed algorithms provide powerful tools for large-
last few hops to a destination node are more likely to be anonymous routers.        scale network inference in communication networks. In the

Fig. 5. US-experiment: correctness ratio of inferred topology. (Plot: Correctness Ratio vs. Anonymization Ratio; curves for TRTomo, Tomo, and TR.)
Fig. 6. International-experiment: correctness ratio of inferred topology. (Plot: Correctness Ratio vs. Anonymization Ratio; curves for TRTomo, Tomo, and TR.)
Fig. 7. US-experiment: node ratio of inferred topology. (Plot: Node Ratio vs. Anonymization Ratio; curves for TRTomo, Tomo, and TR.)
Fig. 8. International-experiment: node ratio of inferred topology. (Plot: Node Ratio vs. Anonymization Ratio; curves for TRTomo, Tomo, and TR.)
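
As an illustration of the two metrics plotted in Figs. 5 to 8, the following is a minimal sketch of how the Correctness Ratio and Node Ratio could be computed from their definitions. It is not the authors' implementation. It assumes that each topology is summarized by one entry per internal node, namely the set of destination nodes descending from that node, and that every round has at least one ground-truth internal node. The names `correctness_ratio`, `node_ratio`, and `average_over_rounds` are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Assumption: an internal node is represented by the set of destination
# nodes that descend from it; a topology is a list of such sets.
DestSet = FrozenSet[str]
Topology = Sequence[DestSet]


def correctness_ratio(truth: Topology, inferred: Topology) -> float:
    """Fraction of ground-truth internal nodes that are correctly inferred.

    A ground-truth internal node counts as correctly inferred if some
    inferred internal node has exactly the same descendant destination set.
    """
    inferred_sets = set(inferred)
    matched = sum(1 for dest_set in truth if dest_set in inferred_sets)
    return matched / len(truth)


def node_ratio(truth: Topology, inferred: Topology) -> float:
    """Number of inferred internal nodes over ground-truth internal nodes."""
    return len(inferred) / len(truth)


def average_over_rounds(
    metric: Callable[[Topology, Topology], float],
    rounds: List[Tuple[Topology, Topology]],
) -> float:
    """Average a per-round metric over (truth, inferred) pairs."""
    return sum(metric(truth, inferred) for truth, inferred in rounds) / len(rounds)


# Hypothetical example with destinations a, b, c, d.
truth_tree = [frozenset("ab"), frozenset("cd"), frozenset("abcd")]
inferred_tree = [frozenset("ab"), frozenset("abc"), frozenset("abcd")]

example_correctness = correctness_ratio(truth_tree, inferred_tree)
example_node_ratio = node_ratio(truth_tree, inferred_tree)
```

In this hypothetical example the inferred tree has the right number of internal nodes (node ratio of one) but only two of the three ground-truth internal nodes are matched, which shows why the two metrics are reported together.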


future we will study how to utilize the inferred information
and extend the framework for efficient and effective network
monitoring and application design.

                        ACKNOWLEDGMENTS

   The authors would like to thank Dr. Nick Duffield and
the anonymous reviewers for their helpful comments and
suggestions.

                        REFERENCES

[1] PlanetLab, http://www.planet-lab.org.
[2] D. G. Andersen, H. Balakrishnan, M. F. Kaashoek, R. Morris, “Resilient Overlay Networks,” Proc. SOSP, Oct. 2001.
[3] D. Antonova, A. Krishnamurthy, Z. Ma, R. Sundaram, “Managing a Portfolio of Overlay Paths,” Proc. NOSSDAV, Kinsale, Ireland, June 2004.
[4] A. Bestavros, J. Byers, K. Harfoush, “Inference and Labeling of Metric-Induced Network Topologies,” Proc. IEEE INFOCOM, June 2002.
[5] J.-C. Bolot, “End-to-End Packet Delay and Loss Behavior in the Internet,” Proc. SIGCOMM, Sept. 1993.
[6] P. Buneman, “The Recovery of Trees from Measures of Dissimilarity,” Mathematics in the Archaeological and Historical Sciences, Edinburgh University Press, pp. 387-395, 1971.
[7] R. Caceres, N. G. Duffield, J. Horowitz, D. Towsley, “Multicast-Based Inference of Network-Internal Loss Characteristics,” IEEE Transactions on Information Theory, vol. 45, no. 7, pp. 2462-2480, Nov. 1999.
[8] R. Castro, M. Coates, G. Liang, R. Nowak, B. Yu, “Network Tomography: Recent Developments,” Statistical Science, vol. 19, no. 3, pp. 499-517, 2004.
[9] R. Castro, M. Coates, R. Nowak, “Likelihood Based Hierarchical Clustering,” IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2308-2321, Aug. 2004.
[10] J. T. Chang, “Full Reconstruction of Markov Models on Evolutionary Trees: Identifiability and Consistency,” Mathematical Biosciences, vol. 137, pp. 51-73, 1996.
[11] M. Coates and R. Nowak, “Network Loss Inference using Unicast End-to-End Measurement,” Proc. ITC Conference on IP Traffic, Modelling and Management, Monterey, CA, Sept. 2000.
[12] M. Coates, A. O. Hero III, R. Nowak, B. Yu, “Internet Tomography,” IEEE Signal Processing Magazine, vol. 19, no. 3, pp. 47-65, May 2002.
[13] M. Coates, R. Castro, M. Gadhiok, R. King, Y. Tsang, R. Nowak, “Maximum Likelihood Network Topology Identification from Edge-Based Unicast Measurements,” Proc. ACM Sigmetrics, June 2002.
[14] N. G. Duffield, J. Horowitz, F. Lo Presti, D. Towsley, “Multicast Topology Inference from End-to-End Measurements,” Advances in Performance Analysis, vol. 3, pp. 207-226, 2000.
[15] N. G. Duffield, J. Horowitz, F. Lo Presti, “Adaptive Multicast Topology Inference,” Proc. IEEE INFOCOM, Anchorage, Alaska, Apr. 2001.
[16] N. G. Duffield, J. Horowitz, F. Lo Presti, D. Towsley, “Multicast Topology Inference From Measured End-to-End Loss,” IEEE Transactions on Information Theory, vol. 48, no. 1, pp. 26-45, Jan. 2002.
[17] N. G. Duffield, F. Lo Presti, V. Paxson, D. Towsley, “Network Loss Tomography Using Striped Unicast Probes,” IEEE/ACM Transactions on Networking, vol. 14, no. 4, pp. 697-710, Aug. 2006.
[18] O. Gascuel and M. Steel, “Neighbor-Joining Revealed,” Molecular Biology and Evolution, vol. 23, no. 11, pp. 1997-2000, 2006.
[19] J. Hartigan, Clustering Algorithms, John Wiley & Sons, 1975.
[20] J. Ni and S. Tatikonda, “A Markov Random Field Approach to Multicast-Based Network Inference Problems,” Proc. IEEE ISIT, Seattle, July 2006.
[21] J. Ni and S. Tatikonda, “Explicit Link Parameter Estimators Based on End-to-End Measurements,” Proc. Allerton Conference on Communication, Control, and Computing, Sept. 2007.
[22] F. Lo Presti, N. G. Duffield, J. Horowitz, D. Towsley, “Multicast-Based Inference of Network-Internal Delay Distributions,” IEEE/ACM Transactions on Networking, vol. 10, no. 6, pp. 761-775, Dec. 2002.
[23] S. Ratnasamy and S. McCanne, “Inference of Multicast Routing Trees and Bottleneck Bandwidths using End-to-end Measurements,” Proc. IEEE INFOCOM, Mar. 1999.
[24] N. Saitou and M. Nei, “The Neighbor-Joining Method: A New Method for Reconstruction of Phylogenetic Trees,” Molecular Biology and Evolution, vol. 4, no. 4, pp. 406-425, 1987.
[25] M. Shih, A. O. Hero III, “Hierarchical Inference of Unicast Network Topologies Based on End-to-End Measurements,” IEEE Transactions on Signal Processing, vol. 55, no. 5, pp. 1708-1718, May 2007.
[26] J. Sommers and P. Barford, “An Active Measurement System for Shared Environments,” Proc. ACM Internet Measurement Conference, Oct. 2007.
[27] D. Stutzbach and R. Rejaie, “Understanding Churn in Peer-to-Peer Networks,” Proc. ACM SIGCOMM Conference on Internet Measurement, 2006.
[28] Y. Tsang, M. Coates, R. Nowak, “Network Delay Tomography,” IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2125-2136, Aug. 2003.
[29] M. Yajnik, S. Moon, J. Kurose, D. Towsley, “Measurement and Modelling of the Temporal Dependence in Packet Loss,” Proc. IEEE INFOCOM, Mar. 1999.
[30] B. Yao, R. Viswanathan, F. Chang, D. Waddington, “Topology Inference in the Presence of Anonymous Routers,” Proc. IEEE INFOCOM, Apr. 2003.

				