Efficient and Dynamic Routing Topology Inference From End-to-End Measurements

Jian Ni, Member, IEEE, Haiyong Xie, Member, IEEE, Sekhar Tatikonda, Member, IEEE, and Yang Richard Yang, Member, IEEE

Abstract: Inferring the routing topology and link performance from a node to a set of other nodes is an important component in network monitoring and application design. In this paper we propose a general framework for designing topology inference algorithms based on additive metrics. The framework can flexibly fuse information from multiple measurements to achieve better estimation accuracy. We develop computationally efficient (polynomial-time) topology inference algorithms based on the framework. We prove that the probability of correct topology inference of our algorithms converges to one exponentially fast in the number of probing packets. In particular, for applications where nodes may join or leave frequently, such as overlay network construction, application-layer multicast, and peer-to-peer file sharing/streaming, we propose a novel sequential topology inference algorithm which significantly reduces the probing overhead and can efficiently handle node dynamics. We demonstrate the effectiveness of the proposed inference algorithms via Internet experiments.

Index Terms: Routing topology inference, network tomography, network measurement, network monitoring.

An earlier version of this paper was presented at the 27th IEEE Conference on Computer Communications (INFOCOM), Phoenix, Arizona, April 2008.
J. Ni is with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: jianni@illinois.edu).
H. Xie is with Akamai Technologies, 3125 Clearview Way, San Mateo, CA 94402 USA (e-mail: hxie@akamai.com).
S. Tatikonda is with the Department of Electrical Engineering, Yale University, New Haven, CT 06520 USA (e-mail: sekhar.tatikonda@yale.edu).
Y. R. Yang is with the Department of Computer Science, Yale University, New Haven, CT 06520 USA (e-mail: yry@cs.yale.edu).

I. INTRODUCTION

Developing a scalable tool to infer the routing topology and link performance from a node to a set of other nodes is an important challenge. In network monitoring, this tool can help a network operator obtain routing information and network internal characteristics (e.g., loss rate, delay, utilization) from its network to a set of other collaborating networks that are separated by non-participating autonomous networks. In application design, this tool can be particularly useful for peer-to-peer (P2P) style applications where a node communicates with a set of other nodes (called peers) for file sharing and multimedia streaming. For example, a node may want to know the routing topology to other nodes so that it can select peers with low or no route overlap to improve resilience against network failures [2]. As another example, a streaming node using multi-path may want to know both the routing topology and link loss rates so the selected paths have low loss correlation [3].

So far there are two primary approaches to infer the routing topology and link performance in a communication network. Both have their limitations. One approach is to use tools based on measurements or feedback messages of the internal nodes (e.g., routers). Such an approach is limited as today's communication networks are evolving towards more decentralized and private administration. For example, a common approach to obtain the routing topology from a source node to a destination node in the Internet is to use traceroute. Traceroute relies on internal routers responding to traceroute requests and returning ICMP (Internet Control Message Protocol) messages. However, an increasing number of routers in the Internet today will block traceroute requests due to privacy and security concerns. These routers are known as anonymous routers [30] and their existence makes the routing topology inferred by traceroute-like tools inaccurate. Furthermore, traceroute-like tools cannot discover layer-2 switches and MPLS (Multiprotocol Label Switching) paths that are increasingly being deployed.

The other approach, known as network tomography, utilizes end-to-end packet probing measurements (such as packet loss and delay measurements) conducted by the end hosts and does not require extra cooperation from the internal nodes (except the basic packet forwarding functionality). Under a network tomography approach, a source node will send probes to a set of destination nodes. The basic idea is to utilize the correlations among the observed losses and delays of the probes at the destination nodes to infer the network structure and internal characteristics. Due to its flexibility and reliability, network tomography has attracted many recent studies [8], [12]. Many previous network tomography studies are based on multicast probing because of its effectiveness and probing efficiency (e.g., [7], [14], [15], [16], [20], [22], [23]). Since IP multicast is not widely deployed in the current Internet, unicast network tomography approaches based on back-to-back unicast packet pairs or strings have also been investigated (e.g., [4], [11], [13], [17], [25], [28]).

Two fundamental challenges of network tomography approaches are computational complexity and probing scalability (especially under unicast probing). These limit the number of destination nodes that a source node can infer. In addition, the focus of previous studies is on a relatively stable set of nodes, while in many applications and networks (e.g., overlay network construction, application-layer multicast, P2P file sharing and streaming, wireless ad-hoc and sensor networks) nodes may join or leave a session frequently [27]. To handle node dynamics efficiently we need fast and scalable inference procedures/algorithms which have low computational complexity, fast convergence rate, and small probing overhead.

In this paper we study the problem of inferring the network routing topology from a source node to a set of destination nodes¹, where the set can be dynamic. We summarize our contributions as follows.

• We present a general framework for designing network routing topology inference algorithms based on additive metrics. We show how to construct additive metrics and estimate the (shared) path lengths using end-to-end multicast and unicast packet probing measurements as well as traceroute type measurements. The framework can flexibly fuse information available from multiple measurements to achieve better estimation accuracy and faster convergence rate.
• Based on the framework we develop two computationally efficient (polynomial-time) topology inference algorithms. In particular, we propose a novel sequential topology inference algorithm which significantly reduces the probing overhead under unicast probing. In addition, it can efficiently handle dynamic node joining and leaving, and thus is particularly desirable for applications and networks where node dynamics are prevalent.
• Under some assumptions we prove that the probability of correct topology inference of our algorithms converges to one exponentially fast in the sample size (number of probing packets). We also demonstrate the effectiveness of our algorithms via Internet experiments. For the most effective inference algorithm (a hybrid scheme which incorporates both network tomography measurements and traceroute measurements), the inferred topology is approximately 100% correct when no more than 20% of the internal routers do not respond to traceroute probing. It can still correctly identify approximately 50% of the internal nodes, solely from network tomography measurements, even when none of the internal routers respond to traceroute probing.

¹ We use destination nodes for simplicity, while the nodes can be relay nodes or peer nodes of the source node in real applications.

We organize the paper as follows. In Section II we review some related work. In Section III we introduce the network model and the inference problems. In Section IV we describe how to construct additive metrics and estimate the (shared) path lengths from end-to-end measurements. In Sections V and VI we propose and analyze a neighbor-joining based topology inference algorithm and a sequential topology inference algorithm which can be applied to any additive metric. We design Internet routing tree topology inference schemes and evaluate their performance via Internet experiments in Section VII. The paper is concluded in Section VIII.

II. RELATED WORK

Multicast routing tree topology inference was studied in [14]-[16], [23]. Ratnasamy et al. [23] proposed a grouping algorithm to infer the tree topology based on shared losses observed at the destination nodes. Duffield et al. [16] extended the grouping algorithm and also proposed a maximum-likelihood approach and a Bayesian approach to estimate the tree topology. They further generalized the grouping algorithm to any estimable and monotonic performance metrics [14]. A similar generalization was made in [4]. The RNJ algorithm proposed in this paper is also a grouping type algorithm which recovers the tree topology by recursively joining the neighbors on the tree. This agglomerative joining/grouping idea has been used in clustering for building cluster trees (e.g., [19]) and in evolutionary biology for building phylogenetic trees (e.g., [18], [24]).

Unicast routing tree topology inference was studied in [9], [13], [25]. Coates et al. [13] introduced a sandwich probing technique to conduct delay measurements and proposed a Markov Chain Monte Carlo (MCMC) procedure to search for the most likely tree topologies. Castro et al. [9] and Shih et al. [25] formulated the inference problem as a hierarchical clustering problem and developed several hierarchical clustering algorithms to recover the tree topology.

The limitations of existing topology inference algorithms are summarized in the beginning of Section VI. To the best of our knowledge, the sequential topology inference algorithm proposed in this paper is a first effort to address the issues of node dynamics and probing scalability for network routing topology inference.

III. NETWORK MODEL AND INFERENCE PROBLEMS

Let $G = (V, E)$ denote the topology of the network, which is a directed graph with node set $V$ (end hosts, internal switches and routers, etc.) and link set $E$ (communication links that join the nodes). For any nodes $i$ and $j$ in the network, if the underlying routing algorithm returns a sequence of links that connect $j$ to $i$, we say $j$ is reachable from $i$. We assume that during the measurement period, the underlying routing algorithm determines a unique path from a node to another node that is reachable from it. Hence the physical routing topology from a source node to a set of (reachable) destination nodes is a (directed) tree.

From the physical routing topology, we can derive a logical routing tree which consists of the source node, the destination nodes, and the branching nodes (internal nodes with at least two outgoing links) of the physical routing tree [16], [23]. A logical link may comprise more than one consecutive physical link, and the degree of an internal node on the logical routing tree is at least three. An example is shown in Fig. 1. In this paper we consider topology inference of logical routing trees and we use the routing tree to express the logical routing tree for simplicity.

[Fig. 1. The physical routing topology and the associated logical routing tree with a single source node and multiple destination nodes. (a) The physical routing topology. (b) The logical routing tree. Labels: source node s; nodes 1 to 7; destination nodes D = {4, 5, 6, 7}.]

Suppose $s$ is a source node in the network, and $D$ is a set of destination nodes that are reachable from $s$. Let $T(s, D) = (V, E)$ denote the routing tree from $s$ to nodes in $D$, with node set $V$ and link set $E$. Let $U = s \cup D$ be the set of terminal nodes (e.g., end hosts), which are nodes of degree one. Every node $k \in V$ has a parent $f(k) \in V$ and a set of children $c(k) = \{j \in V : f(j) = k\}$, except that the source node (root of the tree) has no parent and the destination nodes (leaves of the tree) have no children. For notational simplification, we also use $e_k$ to denote link $(f(k), k)$. We use $\mathcal{P}(i, j)$ to denote the sequence of links that connect $j$ to $i$ on the routing tree.

Each link $e \in E$ is associated with a parameter $\theta_e$ (e.g., success rate, delay distribution, utilization, etc.). The network inference problems involve using measurements taken at the terminal nodes to infer:
(1) the topology of the (logical) routing tree;
(2) link parameters $\theta_e$ of the links on the routing tree.
In this paper we focus on routing tree topology inference. Link parameter estimation with known routing tree topology was studied in [7], [11], [21], [22], [28].

A. Probing Model

A source node can employ different probing techniques to send probes (packets) to a set of destination nodes. Under multicast probing, when an internal node on the routing tree receives a packet from its parent, it will send a copy of the packet to all its children on the tree. Hence the packets of the same probe received by different destination nodes have exactly the same network experience (loss, delay, etc.) in the shared links. Under unicast probing, the source node sends a string of back-to-back unicast packets to the destination nodes, one packet for each destination node respectively (to mimic the transmission of a multicast probe). We call it a $1 \times k$ packet string probing if the string size (i.e., number of probed destination nodes) is $k$. Since the back-to-back packets are very close to each other, it is normally assumed that these packets have the same network experience in the shared links just like a multicast probe. We will relax this assumption in Section IV-B.

For a probe sent by source node $s$ to the destination nodes in $D$, we define a set of link state variables $Z_e$ for all links $e \in E$ on the routing tree $T(s, D)$. $Z_e$ takes value in a set $\mathcal{Z}$. The distribution of $Z_e$ is parameterized by $\theta_e$, e.g., $P(Z_e = z) = \theta_e(z)$, $\forall z \in \mathcal{Z}$.

The transmission of a probe from $s$ to nodes in $D$ will induce a set of outcome variables on the routing tree. For each node $k \in V$, we use $X_k$ to denote the (random) outcome of the probe at node $k$. $X_k$ takes value in a set $\mathcal{X}$. By causality the outcome of the probe at node $k$ (i.e., $X_k$) is determined by the outcome of the probe at node $k$'s parent $f(k)$ (i.e., $X_{f(k)}$) and the state of link $e_k = (f(k), k)$ (i.e., $Z_{e_k}$):

$$X_k = g(X_{f(k)}, Z_{e_k}). \tag{1}$$

In network tomography studies it is normally assumed that the link states are independent from link to link (spatial independence) and are stationary during the measurement period [7], [12]. (Note that these assumptions may not hold in real networks like the Internet. We develop a hybrid scheme in Section VII to improve the estimation accuracy of the pure network tomography scheme for Internet routing tree topology inference.) Under those assumptions, we can show that the outcome variables $X_k$ induced by the transmission of a probe form a Markov random field (MRF) on the routing tree [20]. Specifically, for each node $k \in V$, the conditional distribution of $X_k$ given the other random variables $(X_j : j \neq k)$ on $T(s, D)$ is the same as the conditional distribution of $X_k$ given just its neighboring random variables $(X_j : j \in f(k) \cup c(k))$ on $T(s, D)$. For MRFs on trees, under mild conditions, the tree topology and the link parameters can be identified (uniquely determined) by the joint distributions of the outcome variables at pairs and triples of the terminal nodes on the tree [10], [20].

In actual network inference problems, however, the joint distributions of the outcome variables at the terminal nodes are not given. We can estimate the joint distributions based on measurements taken at the terminal nodes. Specifically, the source node will send a sequence of $n$ probes, and there are in total $n$ outcomes $X_V^{(t)} = (X_k^{(t)} : k \in V)$, $t = 1, 2, \ldots, n$, one for each probe. For the $t$-th probe, only the outcomes $X_U^{(t)} = (X_k^{(t)} : k \in U = s \cup D)$ at the terminal nodes can be measured and observed. We can estimate the joint distributions of the outcome variables at the terminal nodes using the empirical distributions, which will converge to the actual stationary distributions almost surely if the link state processes are stationary and ergodic during the measurement period.

B. Network Tomography Examples

Example 1: Link Loss Inference [7]. The link state variable $Z_e$ is a Bernoulli random variable which takes value 1 with probability $\alpha_e$ if the probe can go through link $e$, and takes value 0 with probability $1 - \alpha_e \triangleq \bar{\alpha}_e$ if the probe is lost on the link. $\alpha_e$ is called the success rate or packet delivery rate of link $e$, and $\bar{\alpha}_e$ is called the loss rate of link $e$. The outcome variable $L_k$ is also a Bernoulli random variable, which takes value 1 if the probe successfully reaches node $k$. For this example we have ($L_s \equiv 1$)

$$L_k = L_{f(k)} \cdot Z_{e_k} = \prod_{e \in \mathcal{P}(s,k)} Z_e. \tag{2}$$

Example 2: Link Utilization Inference [15]. The link state variable $Z_e$ is a Bernoulli random variable which takes value 1 with probability $\gamma_e$ if the probe does not experience any queueing delay on link $e$, and takes value 0 with probability $1 - \gamma_e \triangleq \bar{\gamma}_e$ otherwise. $\bar{\gamma}_e$ can be viewed as the utilization of link $e$. The outcome variable $U_k$ is also a Bernoulli random variable, which takes value 1 if the packet reaches node $k$ with no queueing delay. For this example we also have ($U_s \equiv 1$)

$$U_k = U_{f(k)} \cdot Z_{e_k} = \prod_{e \in \mathcal{P}(s,k)} Z_e. \tag{3}$$

Example 3: Link Delay Inference [22]. The link state variable $Z_e$ is a random variable denoting the random (queueing) delay of link $e$. $\theta_e$ can be a certain moment of $Z_e$, e.g., $\theta_e = \mathrm{var}(Z_e)$; or the distribution of $Z_e$ is parameterized by $\theta_e$, e.g., $\theta_e(i) = P(Z_e = i)$, $i \in \mathcal{Z}$. The outcome variable $T_k$ is the cumulative (queueing) delay experienced by the probe from $s$ to node $k$. For link delay inference we have ($T_s \equiv 0$)

$$T_k = T_{f(k)} + Z_{e_k} = \sum_{e \in \mathcal{P}(s,k)} Z_e. \tag{4}$$

IV. CONSTRUCT ADDITIVE METRICS

Let $T(s, D) = (V, E)$ be a routing tree with source node $s$ and destination nodes $D$. We say $d$ is an additive metric on $T(s, D)$ if
(a) $0 < d(e) < \infty$, $\forall e \in E$;
(b) $d(i, j) = \sum_{e \in \mathcal{P}(i,j)} d(e)$, $\forall i, j \in V$.
$d(e)$ can be viewed as the length of link $e$ and $d(i, j)$ can be viewed as the distance between nodes $i$ and $j$. Recall that $U = s \cup D$ is the set of terminal nodes on the tree. We use $d(U^2) = \{d(i, j) : i, j \in U\}$ to denote the distances between the terminal nodes. It is known that the topology and link lengths of a tree are uniquely determined by the distances between the terminal nodes under an additive metric [6].

Suppose the source node $s$ is fixed. For any destination node $i \in D$, let $\rho(i) = d(s, i)$ denote the path length from $s$ to $i$ (under additive metric $d$).

For any pair of destination nodes $i, j \in D$, let $ij$ denote their nearest common ancestor on $T(s, D)$ (i.e., the ancestor of both nodes $i$ and $j$ that is closest to $i$ and $j$ on the routing tree). For example, in Fig. 1(b), the nearest common ancestor of destination nodes 4 and 5 is node 2, and the nearest common ancestor of destination nodes 4 and 6 is node 1. Let $\rho(i, j) = d(s, ij)$ denote the shared path length from $s$ to $i$ and $j$ (i.e., the distance between $s$ and the nearest common ancestor of $i$ and $j$).

Let $\rho(s, D) = \{\rho(i) : i \in D\}$ denote the path lengths from $s$ to nodes in $D$, and $\rho(s, D^2) = \{\rho(i, j) : i, j \in D\}$ denote the shared path lengths from $s$ to pairs of nodes in $D$. Note that there is a one-to-one mapping between $d(U^2)$ and $\rho(s, D) \cup \rho(s, D^2)$. We can recover the topology of the routing tree if we know either $d(U^2)$ or $\rho(s, D) \cup \rho(s, D^2)$. The key is to construct an additive metric for which we can derive/estimate $d(U^2)$ or $\rho(s, D) \cup \rho(s, D^2)$ from end-to-end measurements.

A. Additive Metrics Based on Multicast Probing

1) Loss-Based Additive Metric: For Example 1 in Section III-B, if $0 < \alpha_e < 1$, $\forall e$, then we can construct an additive metric $d_l$ with link length

$$d_l(e) = -\log \alpha_e, \quad \forall e \in E. \tag{5}$$

Under the spatial independence assumption that the link states are independent from link to link, $\rho_l(s, D) \cup \rho_l(s, D^2)$ can be obtained by

$$\rho_l(i) = -\log P(L_i = 1), \; i \in D; \qquad \rho_l(i, j) = -\log \frac{P(L_i = 1) P(L_j = 1)}{P(L_i L_j = 1)}, \; i, j \in D. \tag{6}$$

2) Utilization-Based Additive Metric: Similarly for Example 2 in Section III-B, if $0 < \gamma_e < 1$, $\forall e$, then we can construct an additive metric $d_u$ with link length

$$d_u(e) = -\log \gamma_e, \quad \forall e \in E. \tag{7}$$

$\rho_u(s, D) \cup \rho_u(s, D^2)$ can be obtained by

$$\rho_u(i) = -\log P(U_i = 1), \; i \in D; \qquad \rho_u(i, j) = -\log \frac{P(U_i = 1) P(U_j = 1)}{P(U_i U_j = 1)}, \; i, j \in D. \tag{8}$$

3) Delay-Based Additive Metric: For Example 3 in Section III-B, if $0 < \mathrm{var}(Z_e) < \infty$, $\forall e$, then we can construct an additive metric $d_v$ with link length

$$d_v(e) = \mathrm{var}(Z_e), \quad \forall e \in E. \tag{9}$$

$\rho_v(s, D) \cup \rho_v(s, D^2)$ can be obtained by

$$\rho_v(i) = \mathrm{var}(T_i), \; i \in D; \qquad \rho_v(i, j) = \mathrm{cov}(T_i, T_j), \; i, j \in D. \tag{10}$$

As in (6), (8), (10), if we know the pairwise joint distributions of the outcome variables at the terminal nodes, then we can construct an additive metric and derive $\rho(s, D) \cup \rho(s, D^2)$. In actual network inference problems we are not given such distributions. We can use measurements taken at the terminal nodes to estimate the distributions.

Let $s$ send a sequence of $n$ probes to (a subset of) destination nodes in $D$. For any probed node $i$, let $T_i^{(t)}$ be the measured (one-way) delay of the $t$-th probe from $s$ to $i$, where $T_i^{(t)} = \infty$ means that $i$ does not receive the $t$-th probe. We use $T_i^{\min} = \min_t T_i^{(t)}$ to approximate the propagation delay from $s$ to $i$.

The loss outcomes can be derived from the delay measurements as follows:

$$L_i^{(t)} = \begin{cases} 1, & T_i^{(t)} < \infty, \\ 0, & T_i^{(t)} = \infty. \end{cases}$$

As in [15], the utilization outcomes can be derived from the delay measurements as follows:

$$U_i^{(t)} = \begin{cases} 1, & T_i^{(t)} - T_i^{\min} \le \epsilon, \\ 0, & T_i^{(t)} - T_i^{\min} > \epsilon, \end{cases}$$

where $\epsilon$ is a small value, e.g., 0.1 ms, to account for possible measurement error.

We can construct explicit estimators for the path lengths and shared path lengths in (6), (8) as follows:

$$\hat{\rho}_l(i) = -\log \bar{L}_i, \qquad \hat{\rho}_l(i, j) = -\log \frac{\bar{L}_i \bar{L}_j}{\bar{L}_{ij}}; \tag{11}$$

$$\hat{\rho}_u(i) = -\log \bar{U}_i, \qquad \hat{\rho}_u(i, j) = -\log \frac{\bar{U}_i \bar{U}_j}{\bar{U}_{ij}}; \tag{12}$$

where

$$\bar{L}_i = \frac{1}{n} \sum_{t=1}^{n} L_i^{(t)}, \qquad \bar{L}_{ij} = \frac{1}{n} \sum_{t=1}^{n} L_i^{(t)} L_j^{(t)};$$

$$\bar{U}_i = \frac{1}{n} \sum_{t=1}^{n} U_i^{(t)}, \qquad \bar{U}_{ij} = \frac{1}{n} \sum_{t=1}^{n} U_i^{(t)} U_j^{(t)}.$$

Similarly, we can construct explicit estimators for the path lengths and shared path lengths in (10) using sample variances and sample covariances:

$$\hat{\rho}_v(i) = \widehat{\mathrm{var}}(T_i), \qquad \hat{\rho}_v(i, j) = \widehat{\mathrm{cov}}(T_i, T_j); \tag{13}$$

where

$$\widehat{\mathrm{var}}(T_i) = \frac{1}{n-1} \sum_{t=1}^{n} \left(T_i^{(t)} - \bar{T}_i\right)^2, \qquad \widehat{\mathrm{cov}}(T_i, T_j) = \frac{1}{n-1} \sum_{t=1}^{n} \left(T_i^{(t)} - \bar{T}_i\right)\left(T_j^{(t)} - \bar{T}_j\right), \qquad \bar{T}_i = \frac{1}{n} \sum_{t=1}^{n} T_i^{(t)}.$$

In the above equations we assume $T_i^{(t)}, T_j^{(t)} < \infty$ (i.e., there is no packet loss). Lost packets are not counted in the computation.

Notice that possible clock asynchronization between the source node and the destination nodes will not affect the estimators in (11), (12), (13).

A convex combination of several additive metrics is still an additive metric. In order to fuse information from multiple measurements, we can construct a new additive metric using a convex combination of $d_l$, $d_u$, $d_v$: $d_t = a_l d_l + a_u d_u + a_v d_v$ with $a_l + a_u + a_v = 1$. The (estimated) path lengths and shared path lengths under the new additive metric can be easily computed: $\hat{\rho}_t = a_l \hat{\rho}_l + a_u \hat{\rho}_u + a_v \hat{\rho}_v$. In practice we can select the coefficients based on the current network state or to minimize the variance of the new estimator $\hat{\rho}_t$.

B. Additive Metrics Based on Unicast Packet Pair Probing

The validity of (6), (8), (10) depends on the fact that the packets of the same multicast probe received by different destination nodes have the same network experience (loss, delay, etc.) in the shared links, which may not hold for a unicast packet pair/string probe. Can we still construct additive metrics from unicast probing? The answer is yes, if the packets are positively correlated (not necessarily perfectly correlated) in the shared links.

Suppose the source node $s$ sends two back-to-back packets to destination nodes $i$ and $j$, for which the first packet (denoted by $a$) is sent to node $i$ and the second packet (denoted by $b$) is sent to node $j$. Let $Z_e^a$ and $Z_e^b$ be the link state variables experienced by packet $a$ and packet $b$ in link $e$, respectively.

First consider link loss (or utilization) inference. Let $\alpha_e = P(Z_e^x = 1)$ for $x \in \{a, b\}$ be the marginal success rate of link $e$. Let $\beta_e = P(Z_e^b = 1 \mid Z_e^a = 1)$ be the conditional success rate of link $e$, i.e., $\beta_e$ is the conditional probability that the second packet $b$ successfully goes through link $e$ given that the first packet $a$ successfully goes through link $e$.

If $0 < \alpha_e < \beta_e \le 1$ for all links, then $0 < \alpha_e / \beta_e < 1$, and we can construct an additive metric $d_l$ with link length

$$d_l(e) = -\log \frac{\alpha_e}{\beta_e}, \quad \forall e \in E.$$

In real networks, we would expect $\alpha_e < \beta_e$, because the fact that the first packet successfully goes through a link indicates that the link is in a good state and the second packet, which closely follows the first packet, can also go through the link. This phenomenon was observed in real Internet measurements (e.g., [5], [29]).

Let $L_i^a$ and $L_j^b$ be the loss outcome variables of packets $a$ and $b$ at nodes $i$ and $j$, respectively. Under the spatial independence assumption, we have

$$P(L_i^a = 1) = \prod_{e \in \mathcal{P}(s,i)} \alpha_e, \qquad P(L_j^b = 1) = \prod_{e \in \mathcal{P}(s,j)} \alpha_e,$$

$$P(L_i^a L_j^b = 1) = \prod_{e \in \mathcal{P}(s,ij)} \alpha_e \beta_e \prod_{e \in \mathcal{P}(ij,i)} \alpha_e \prod_{e \in \mathcal{P}(ij,j)} \alpha_e.$$

Hence $\rho_l(s, D) \cup \rho_l(s, D^2)$ can be obtained by

$$\rho_l(i) = -\log \frac{P(L_i^a = 1) P(L_i^b = 1)}{P(L_i^a L_i^b = 1)}, \; i \in D; \qquad \rho_l(i, j) = -\log \frac{P(L_i^a = 1) P(L_j^b = 1)}{P(L_i^a L_j^b = 1)}, \; i, j \in D. \tag{14}$$

Now consider link delay inference. If $\mathrm{cov}(Z_e^a, Z_e^b) > 0$ for all links (which we would expect to hold in real networks because the two back-to-back packets are very close, hence their experienced delays in a shared link are positively correlated), then we can construct an additive metric $d_v$ with link length

$$d_v(e) = \mathrm{cov}(Z_e^a, Z_e^b), \quad \forall e \in E.$$

Let $T_i^a$ and $T_j^b$ be the delay outcome variables of packets $a$ and $b$ at nodes $i$ and $j$, respectively. We have $T_i^a = \sum_{e \in \mathcal{P}(s,i)} Z_e^a$ and $T_j^b = \sum_{e \in \mathcal{P}(s,j)} Z_e^b$.

Under the spatial independence assumption, we have

$$\mathrm{cov}(T_i^a, T_j^b) = \sum_{e \in \mathcal{P}(s,ij)} \mathrm{cov}(Z_e^a, Z_e^b), \qquad \mathrm{cov}(T_i^a, T_i^b) = \sum_{e \in \mathcal{P}(s,i)} \mathrm{cov}(Z_e^a, Z_e^b).$$

Hence $\rho_v(s, D) \cup \rho_v(s, D^2)$ can be obtained by

$$\rho_v(i) = \mathrm{cov}(T_i^a, T_i^b), \; i \in D; \qquad \rho_v(i, j) = \mathrm{cov}(T_i^a, T_j^b), \; i, j \in D. \tag{15}$$

As in (11), (12), (13), we can construct explicit estimators for the path lengths and shared path lengths in (14) and (15) using empirical distributions measured by the terminal nodes.

C. Additive Metric Based on Traceroute-like Probing

Using traceroute-like probing, a source node can obtain the unique labels (IP addresses) of the internal nodes (routers) in the path from it to any destination node. We can construct an additive metric $d_h$ by defining the link length $d_h(e)$ to be the number of hops (physical links) contained in logical link $e$.
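As an illustration of the hop-count metric $d_h$, the following is a minimal sketch, not the authors' implementation. It assumes that each traceroute result is already available as an ordered list of node labels after the source, ending with the destination, and that every router on the path responded. The function name `hop_metric` and the sample labels are hypothetical. The hop count of a path is taken as the number of links on it, and the hops shared by two paths are read off the common prefix of their label lists.

```python
from itertools import combinations


def hop_metric(paths):
    """Hop-count path lengths and shared path lengths.

    paths: dict mapping each destination to the ordered list of node
    labels after the source on its path, ending with the destination
    (an assumed input format; all routers are assumed to respond).
    Returns (rho, rho_shared): rho[i] is the number of hops from the
    source to i; rho_shared[(i, j)] is the number of hops shared by
    the paths to i and j.
    """
    # One hop per label after the source.
    rho = {dest: len(path) for dest, path in paths.items()}

    rho_shared = {}
    for i, j in combinations(sorted(paths), 2):
        shared = 0
        # Count the common prefix of the two label sequences.
        for a, b in zip(paths[i], paths[j]):
            if a != b:
                break
            shared += 1
        rho_shared[(i, j)] = shared
    return rho, rho_shared


# Hypothetical labels, for illustration only.
paths = {
    "d4": ["r1", "r2", "r3", "d4"],
    "d5": ["r1", "r2", "r3", "r4", "d5"],
    "d6": ["r1", "r5", "d6"],
}
rho, rho_shared = hop_metric(paths)
```

In this example the paths to `d4` and `d5` share three hops, while each of them shares only one hop with the path to `d6`. If some routers did not respond, their labels would be missing and the counts would be distorted, so the outputs should then be treated as estimates.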
6 The path length ρh (i) is the number of hops contained in measurements from only two children of f . If f has more the path from s to i. The shared path length ρh (i, j) is the than two children, we could utilize measurements from all of number of hops contained in the shared portion of the paths them as follows: from s to i and j. The shared portion of two paths can be 1 determined by comparing the labels of the internal nodes in ˆ ρ(k, f ) = ˆ ρ(k, i). (17) |c(f )| the two paths. i∈c(f ) If some internal nodes do not respond to traceroute-like This modiﬁcation improves the accuracy of the RNJ algorithm probing (e.g., anonymous routers, layer-2 switches, MPLS in our simulation2 . switches), then the derived path lengths and shared path The computational complexity of the RNJ algorithm is lengths can be distorted. We use ρh (s, D) and ρh (s, D2 ) to ˆ ˆ O(N 2 log N ) for a routing tree with N destination nodes. Note denote the estimated path lengths and shared path lengths with that the RNJ algorithm only requires (estimated) shared path possible measurement errors. lengths, ρ(s, D2 ), to infer the tree topology (steps without (+)). ˆ ˆ If the (estimated) path lengths ρ(s, D) are also available, then V. T REE T OPOLOGY I NFERENCE BASED ON N EIGHBOR the RNJ algorithm can infer the link lengths as well (steps J OINING with (+)). If there is a one-to-one mapping between the link We ﬁrst present a topology inference algorithm using performance parameters (e.g., success rate, utilization, delay (estimated) path lengths and shared path lengths as the input. variance) and the link lengths, as in (5), (7), (9), then we can The algorithm is a grouping type algorithm as in [16] and use the link lengths returned by the RNJ algorithm to estimate [23]. It can be viewed as a rooted version of the widely used the link performance parameters. neighbor-joining algorithm for constructing phylogenetic trees from distances [18], [24]. 
The algorithm begins with a leaf set including all the destination nodes. In each step it selects A. Analysis of RNJ Algorithm a group of nodes that are likely to be neighbors (i.e., siblings, Let T be the true topology of the routing tree. Let d(e)’s nodes with the same parent on the tree), deletes them from the be the true link lengths and ρ(s, D2 ) be the true shared path leaf set, creates a new node as their parent and adds that node lengths under additive metric d on T . to the leaf set. The whole process is iterated until there is only Proposition 1: Let ∆ ≤ mine∈E d(e) (the minimum link one node left in the leaf set, which will be the child of the length on the routing tree) be the input parameter. A sufﬁcient root (source node). To avoid trivial cases, we assume |D| ≥ 2. condition for the RNJ algorithm to return the correct tree topology is: Algorithm 1: Rooted Neighbor-Joining (RNJ) Algorithm ∆ ρ |ˆ(i, j) − ρ(i, j)| < , ∀i, j ∈ D. (18) 4 2 ˆ ˆ Input: Source s, Destinations D, ρ(s, D), ρ(s, D ), ∆ > 0. Proof: We prove by induction on the cardinality of D. 1. V = {s} ∪ D, E = ∅. (1) If |D| = 2, then clearly the RNJ algorithm will return the 2.1 Find i∗ , j ∗ ∈ D with the largest ρ(i, j) (break the tie ˆ correct tree topology. arbitrarily). Create a node f as the parent of i∗ and j ∗ . (2) Assume the RNJ algorithm returns correct topology under D = D \ {i∗ , j ∗ }, condition (18) for |D| ≤ N . Now consider |D| = N + 1. V = V ∪ {f }, Claim 1. i∗ , j ∗ found in Step 2.1 which maximize ρ(i, j) are ˆ E = E ∪ {(f, i∗ ), (f, j ∗ )}. ˆ siblings. (+) d(f, i∗ ) = ρ(i∗ ) − ρ(i∗ , j ∗ ), ˆ ˆ ˆ If i∗ and j ∗ are not siblings, then ∃k ∈ D such that (+) d(f, j ∗ ) = ρ(j ∗ ) − ρ(i∗ , j ∗ ). ˆ ˆ either i∗ k or j ∗ k is descended from i∗ j ∗ . Without loss of 2.2 For every k ∈ D such that ρ(i∗ , j ∗ ) − ρ(i∗ , k) ≤ ∆ : ˆ ˆ 2 generality, suppose i∗ k is descended from i∗ j ∗ . This implies D = D \ k, ρ(i∗ , k) > ρ(i∗ , j ∗ ). Since link lengths ≥ ∆, we have E = E ∪ (f, k). 
ˆ ρ(i∗ , k) ≥ ρ(i∗ , j ∗ ) + ∆, then under condition (18), (+) d(f, k) = ρ(k) − ρ(i∗ , j ∗ ). ˆ ˆ 2.3 For each k ∈ D, compute: ∆ ∆ ρ(i∗ , k) > ρ(i∗ , k) − ˆ > ρ(i∗ , j ∗ ) + > ρ(i∗ , j ∗ ), ˆ 1 4 4 ρ(k, f ) = [ˆ(k, i∗ ) + ρ(k, j ∗ )]. ˆ ρ ˆ (16) 2 a contradiction to the maximality of ρ(i∗ , j ∗ ). ˆ D = D ∪ f. Claim 2. k will be selected in Step 2.2 if and only if it is a (+) ρ(f ) = ρ(i∗ , j ∗ ). ˆ ˆ sibling of i∗ and j ∗ . 3. If |D| = 1, for the k ∈ D: E = E ∪ (s, k). If k is a sibling of i∗ and j ∗ , then ρ(i∗ , j ∗ ) = ρ(i∗ , k). This, Otherwise, repeat Step 2. together with condition (18), implies ρ(i∗ , j ∗ ) − ρ(i∗ , k) < ∆ . ˆ ˆ 2 ˆ ˆ Output: Tree T = (V, E), and link length d(e) for all e ∈ E. Hence k will be selected in Step 2.2. If k is not a sibling of i∗ and j ∗ , and since i∗ and j ∗ are siblings, then i∗ j ∗ is descended from i∗ k. Since link lengths Note that in Equation (16) of Step 2.3, we compute the ˆ shared path length between nodes k and f , ρ(k, f ), using 2 We thank an anonymous reviewer for this suggestion. 7 ≥ ∆, we have ρ(i∗ , j ∗ ) ≥ ρ(i∗ , k) + ∆, then under condition Random General Trees, 12 Destination Nodes, Link Loss Rates [1%, 10%] (18), RNJ Algorithm BLTP Algorithm ∆ 3∆ ∆ ρ(i∗ , j ∗ ) > ρ(i∗ , j ∗ ) − ˆ ≥ ρ(i∗ , k) + > ρ(i∗ , k) + , ˆ 1 4 4 2 fraction of correctly inferred trees which implies k will not be selected in Step 2.2. 0.8 Claim 3. Condition (18) is maintained after Step 2.3. We have |ˆ(k, i∗ ) − ρ(k, i∗ )| < ∆ and |ˆ(k, j ∗ ) − ρ(k, j ∗ )| < ρ 4 ρ ∆ 0.6 4 . Since ρ(k, f ) = 1 (ˆ(k, i∗ ) + ρ(k, j ∗ )), ρ(k, f ) = ˆ 2 ρ ˆ 1 ∗ ∗ 2 (ρ(k, i ) + ρ(k, j )), by triangular inequality we have |ˆ(k, f ) − ρ(k, f )| < ∆ . ρ 4 0.4 From claims 1, 2, 3, after one iteration of Step 2, the RNJ algorithm will correctly ﬁnd out a pair of siblings and all their 0.2 other siblings (if any), and condition (18) is maintained for the new set of leaf nodes. Then |D| is decreased at least by 1. 
By the induction assumption, the algorithm will return the correct topology of the rest of the tree. This completes the proof of the proposition.

Therefore, if the estimated shared path lengths are close enough to the true values, the RNJ algorithm will return the correct tree topology. We can derive exponential error bounds for the shared path length estimators in (11), (12) using Chernoff bounds [21].

Proposition 2: For any pair of nodes $i, j \in D$, a sample size of $n$ (number of probes to estimate $\hat\rho_l$ or $\hat\rho_u$), and any small $\epsilon > 0$:
\[ P\{|\hat\rho_l(i, j) - \rho_l(i, j)| \ge \epsilon\} \le e^{-c_{ij}(\epsilon) n} \tag{19} \]
\[ P\{|\hat\rho_u(i, j) - \rho_u(i, j)| \ge \epsilon\} \le e^{-b_{ij}(\epsilon) n} \tag{20} \]
where the $c_{ij}(\epsilon)$'s and $b_{ij}(\epsilon)$'s are some constants.

Let $\hat T_n$ be the inferred tree topology returned by the RNJ algorithm with a sample size $n$. Let $P_n = P\{\hat T_n = T\}$ denote the probability of correct topology inference of the RNJ algorithm.

Proposition 3: Let $\Delta \le \min_{e \in E} d(e)$ be the input parameter of the RNJ algorithm. If
\[ P\left\{|\hat\rho(i, j) - \rho(i, j)| \ge \frac{\Delta}{4}\right\} \le e^{-c_{ij}(\Delta) n}, \quad \forall i, j \in D, \]
where $n$ is the sample size and $c_{ij}(\Delta)$ is some constant, then for a routing tree with $N$ destination nodes:
\[ P_n \ge 1 - N^2 e^{-c(\Delta) n}, \tag{21} \]
i.e., the probability of correct topology inference of the RNJ algorithm converges to one exponentially fast in the sample size.

Proof: By Proposition 1 and the union bound we have
\[ P_n \ge P\left\{\bigcap_{i,j \in D} |\hat\rho(i, j) - \rho(i, j)| < \frac{\Delta}{4}\right\} = 1 - P\left\{\bigcup_{i,j \in D} |\hat\rho(i, j) - \rho(i, j)| \ge \frac{\Delta}{4}\right\} \ge 1 - \sum_{i,j \in D} e^{-c_{ij}(\Delta) n} \ge 1 - N^2 e^{-c(\Delta) n}, \]
where $c(\Delta) = \min_{i,j \in D} c_{ij}(\Delta)$.

B. Comparison with Previous Grouping Algorithms

The grouping algorithms in [16], [23] aggregate the measurement data from the destination nodes up the tree, an approach designed specifically for multicast probing. In contrast, the RNJ algorithm only requires (estimated) shared path lengths between pairs of the destination nodes, so it is applicable to both multicast probing and unicast packet pair probing.

Under multicast probing, for general (nonbinary) routing trees, the RNJ algorithm has a much lower computational complexity, but it may require a larger sample size to achieve the same level of accuracy as the maximum-likelihood based grouping algorithm in [16]. Nevertheless, we have shown that the probability of correct topology inference of the RNJ algorithm converges to one exponentially fast in the sample size.

We compare the accuracy of the RNJ algorithm with the BLTP algorithm (the reference grouping algorithm in [16] which has the best accuracy and complexity) via model simulation. For each experiment, we first randomly generate the tree topology and select the link loss rates in a certain range. We compare the inferred tree topology returned by RNJ and BLTP with the true tree topology. Each experiment is repeated 200 times. For each inference algorithm, we compute the fraction of correctly inferred trees among all 200 trials (which can be viewed as the probability of correct topology inference of the algorithm).

The results are shown in Figs. 2-3. The x axis is in log scale, i.e., it is $\log_2 n$ for a sample size of $n$ probes. Both the RNJ algorithm and the BLTP algorithm are consistent: the fraction of correctly inferred trees of both algorithms goes to 1 exponentially fast as we increase the sample size. When the link loss rates are within the range of [1%, 10%], the BLTP algorithm has noticeably better accuracy than the RNJ algorithm; when the link loss rates are within the range of [0.1%, 1%], the difference is small (this is consistent with our analysis in [21], where we show that the simple estimator for shared path lengths (11) is as efficient as the MLE for networks with small loss rates). We conducted experiments for trees with different sizes and ranges of link loss rates and observed the same pattern in the results.

[Fig. 2. Comparison of RNJ and BLTP under large link loss rates. Panel: Random General Trees, 12 Destination Nodes, Link Loss Rates [1%, 10%]; curves: RNJ Algorithm, BLTP Algorithm; x-axis: log(sample size); y-axis: fraction of correctly inferred trees.]

[Fig. 3. Comparison of RNJ and BLTP under small link loss rates. Panel: Random General Trees, 12 Destination Nodes, Link Loss Rates [0.1%, 1%]; curves: RNJ Algorithm, BLTP Algorithm; x-axis: log(sample size); y-axis: fraction of correctly inferred trees.]

VI. DYNAMIC TREE TOPOLOGY INFERENCE

In practice, the RNJ algorithm (and other existing topology inference algorithms) may have some limitations. First, the focus of previous studies is on a relatively stable set of nodes. In real applications (e.g., P2P applications), the destination nodes that a source node communicates with will often change over time. Hence the routing tree topology will also change over time. When an existing destination node leaves, it is straightforward to derive the updated routing tree topology. When a new destination node joins, however, rerunning the RNJ algorithm over the new set of destination nodes to infer the updated routing tree topology is not efficient if nodes join and leave frequently.

The second limitation is the probing scalability problem under unicast probing. The RNJ algorithm requires estimated shared path lengths from the source node to all pairs of the destination nodes as the input. Suppose there are $N$ destination nodes. If multicast probing is available, then the source node can use a $1 \times N$ multicast probing to obtain the required measurements. The probing overhead is $O(N)$. On the other hand, if multicast probing is not supported and $N$ is large, then it is difficult to obtain $\hat\rho(s, D^2)$ using a single $1 \times N$ unicast packet string probing without violating the assumption that the string of packets have the same or even positively correlated network experiences in the shared links. The source node could use back-to-back (unicast) packet pair probings. This requires $O(N^2)$ $1 \times 2$ probings. The probing overhead is $O(N^2)$. If these probings are conducted in parallel, they will quickly consume the outgoing bandwidth of the source node; if they are conducted in sequence, it will take a long time to obtain the measurements, and it is likely that the network states (routing topology, link performance metrics) will change during the measurement period, which violates the stationarity assumption. We tested the RNJ algorithm via Internet experiments and found that it only has decent accuracy for a small number of destination nodes (less than six). Therefore, the poor probing scalability of unicast packet pair probing limits the number of destination nodes that a source node can infer when multicast probing is not supported.

We address these issues in this section. We design procedures to add a node to (add_node) and delete a node from (delete_node) a routing tree. These procedures can handle node joining and leaving efficiently, and are particularly useful for applications where node dynamics are prevalent. Based on the add_node procedure, we propose a novel sequential topology inference algorithm, which greatly reduces the probing overhead under unicast packet pair probing.

A. Procedure add_node

add_node($T$, $k$, $j$, $\Delta$) is a recursive procedure that adds a new destination node $j$ to the routing tree $T = (V, E)$ via an existing node $k$ on the tree, with the initial condition that $j$ is a sibling or descendant of node $k$. $\Delta$ is the (estimated) minimum link length. Let $f(k)$ be the parent of $k$ on the (old) tree $T$.

Procedure: add_node($T$, $k$, $j$, $\Delta$)
IF $k$ is a leaf node on the tree $T = (V, E)$:
    ($j$ will be a sibling of $k$ on the new tree.)
    1. Create a node $p$ as the parent of $k$ and $j$.
       $V = V \cup \{p, j\}$,
       $E = E \setminus (f(k), k) \cup \{(f(k), p), (p, k), (p, j)\}$.
ELSE Suppose $k$ has $l$ children $c_1, ..., c_l$.
    2. Select a destination node $d_i$ descended from $c_i$.
    3. Measure/estimate $\hat\rho(d_1, d_2)$ and $\hat\rho(j, d_i)$ for $i = 1, ..., l$.
    4. Find $d_{i^*}$ with the largest $\hat\rho(j, d_i)$.
    Case (a): $\hat\rho(d_1, d_2) - \hat\rho(j, d_{i^*}) \ge \frac{\Delta}{2}$:
       ($j$ will be a sibling of $k$ on the new tree.)
       5. Create a node $p$ as the parent of $k$ and $j$.
          $V = V \cup \{p, j\}$,
          $E = E \setminus (f(k), k) \cup \{(f(k), p), (p, k), (p, j)\}$.
    Case (b): $|\hat\rho(d_1, d_2) - \hat\rho(j, d_{i^*})| < \frac{\Delta}{2}$:
       ($j$ will be a child of $k$ on the new tree.)
       6. $V = V \cup j$, $E = E \cup (k, j)$.
    Case (c): $\hat\rho(j, d_{i^*}) - \hat\rho(d_1, d_2) \ge \frac{\Delta}{2}$:
       ($j$ will be a sibling or descendant of $c_{i^*}$ on the new tree.)
       7. Execute add_node($T$, $c_{i^*}$, $j$, $\Delta$).

By running add_node($T$, $s$, $j$, $\Delta$), we add a new destination node $j$ to the routing tree $T$ rooted at $s$.

In Step 3 of add_node($T$, $k$, $j$, $\Delta$), in order to estimate the shared path lengths $\hat\rho(d_1, d_2)$ and $\hat\rho(j, d_i)$ for $i = 1, ..., l$, $s$ can use a $1 \times (l + 1)$ (multicast) probing, by sending probes to destination nodes $j, d_1, ..., d_l$; alternatively, $s$ can use $l + 1$ (unicast) packet pair probings, by sending probes to node pairs $(d_1, d_2), (j, d_1), ..., (j, d_l)$.

For an $l$-ary (balanced) tree with $N$ destination nodes, the depth of the tree is $O(\log_l N)$. In the worst case, the add_node procedure needs to be executed $O(\log_l N)$ times in order to add a new destination node to the tree. Under unicast packet pair probing, if we apply the add_node procedure to infer the topology of the new tree, we need $O(l \log_l N)$ packet pair probings, and the computational complexity is $O(l \log_l N)$. If we instead apply the RNJ algorithm to infer the topology of the new tree, we need $O(N^2)$ packet pair probings, and the computational complexity is $O(N^2 \log N)$.

[Fig. 4. Three cases of adding a new node j to the tree via a node k on the tree: (a) j is a sibling of k; (b) j is a child of k; (c) j is a sibling or descendant of c_{i*}.]

B. Analysis of Procedure add_node

If the estimated shared path lengths in Step 3 are close enough to the true values, then add_node($T$, $s$, $j$, $\Delta$) will correctly add a new destination node to the tree.

Proposition 4: Let $\Delta$ be less than or equal to the minimum link length in the new routing tree (including existing destination nodes and the new destination node $j$). A sufficient condition for the recursive procedure add_node($T$, $s$, $j$, $\Delta$) to return the correct tree topology (after adding node $j$) is that for all the nodes $k$ visited by the recursive procedure:
\[ |\hat\rho(d_1, d_2) - \rho(d_1, d_2)| < \frac{\Delta}{4}, \qquad |\hat\rho(j, d_i) - \rho(j, d_i)| < \frac{\Delta}{4}, \quad i = 1, 2, ..., l. \tag{22} \]

Proof: We prove that if $k$ is a sibling or ancestor of $j$ in the new routing tree and condition (22) is satisfied, then procedure add_node($T$, $k$, $j$, $\Delta$) will either correctly add $j$ to the tree, or find a child $c$ of $k$ that is a sibling or ancestor of $j$ and execute add_node($T$, $c$, $j$, $\Delta$) recursively. Hence add_node($T$, $s$, $j$, $\Delta$) finally returns the correct new routing tree topology.

Now assume $k$ is a sibling or ancestor of $j$ and condition (22) is satisfied. If $k$ is a leaf node (i.e., $k$ is a destination node), then $k$ and $j$ must be siblings, so Step 1 of add_node($T$, $k$, $j$, $\Delta$) correctly adds $j$ to the tree. Otherwise suppose $k$ has $l$ children $c_1, ..., c_l$, and $d_i$ is a destination node descended from $c_i$ selected in Step 2. In Step 3, $\hat\rho(d_1, d_2)$ and $\hat\rho(j, d_i)$ for $i = 1, ..., l$ are measured and estimated. There are three cases to consider.

Case (a): $j$ is a sibling of $k$ in the new routing tree, as shown in Fig. 4(a). In this case, for the $d_{i^*}$ found in Step 4 we have $\rho(d_1, d_2) - \rho(j, d_{i^*}) \ge \Delta$. Under (22) this implies $\hat\rho(d_1, d_2) - \hat\rho(j, d_{i^*}) > \frac{\Delta}{2}$, so Step 5 will be executed, which correctly adds $j$ to the tree.

Case (b): $j$ is a child of $k$, as shown in Fig. 4(b). In this case $\rho(d_1, d_2) = \rho(j, d_{i^*})$. Under (22) this implies $|\hat\rho(d_1, d_2) - \hat\rho(j, d_{i^*})| < \frac{\Delta}{2}$, so Step 6 will be executed, which correctly adds $j$ to the tree.

Case (c): $j$ is a sibling or descendant of a child of $k$, as shown in Fig. 4(c). Suppose $c_{i^*}$ is the child and $d_{i^*}$ is the selected destination node descended from $c_{i^*}$ in Step 2. Then $\rho(j, d_{i^*}) - \rho(j, d_i) \ge \Delta$ for $i \ne i^*$ and $\rho(j, d_{i^*}) - \rho(d_1, d_2) \ge \Delta$. Under (22) this implies $\hat\rho(j, d_{i^*}) > \hat\rho(j, d_i)$, so $d_{i^*}$ will be selected in Step 4, and $\hat\rho(j, d_{i^*}) - \hat\rho(d_1, d_2) > \frac{\Delta}{2}$, hence add_node($T$, $c_{i^*}$, $j$, $\Delta$) will be executed in Step 7.

Proposition 5: Let $\Delta$ be less than or equal to the minimum link length in the new routing tree. If for all the nodes $k$ visited by the recursive procedure add_node($T$, $s$, $j$, $\Delta$), we have
\[ P\left\{|\hat\rho(d_1, d_2) - \rho(d_1, d_2)| \ge \frac{\Delta}{4}\right\} \le e^{-c_{d_1 d_2}(\Delta) n}, \]
\[ P\left\{|\hat\rho(j, d_i) - \rho(j, d_i)| \ge \frac{\Delta}{4}\right\} \le e^{-c_{j d_i}(\Delta) n}, \quad i = 1, ..., l, \]
where $n$ is the sample size and $c_{d_1 d_2}(\Delta)$, $c_{j d_i}(\Delta)$'s are some constants, then the probability of correct topology inference of add_node($T$, $s$, $j$, $\Delta$) for an $l$-ary tree with $N$ destination nodes satisfies:
\[ P_n \ge 1 - (l + 1)(\log_l N) e^{-c(\Delta) n}. \tag{23} \]

Proof: The proof is similar to the proof of Proposition 3.

C. Procedure delete_node

Procedure delete_node($T$, $j$) deletes a destination node $j$ from routing tree $T$. It first removes node $j$ and link $(f(j), j)$ from the tree. If $f(j)$ has only one child left after deleting $j$, it then also removes node $f(j)$ and connects the child of $f(j)$ to the parent of $f(j)$, so that the new routing tree maintains the property that each internal node has at least two children.

Procedure: delete_node($T$, $j$)
1. $V = V \setminus j$, $E = E \setminus (f(j), j)$.
2. If $f(j)$ has only one child $c$ left:
   $V = V \setminus f(j)$,
   $E = E \setminus \{(f(f(j)), f(j)), (f(j), c)\} \cup (f(f(j)), c)$.

D. Sequential Topology Inference Algorithm

For a source node $s$ and a set of destination nodes $D$, we can apply the add_node procedure over the nodes in $D$ in sequence to construct the routing tree topology incrementally, as described in Algorithm 2.

Algorithm 2: Sequential Topology Inference Algorithm
Input: Source Node $s$, Destination Nodes $D = \{1, 2, ..., N\}$, $\Delta > 0$.
1. $V_0 = \{s\}$, $E_0 = \emptyset$, $T_0 = (V_0, E_0)$.
2. For $j = 1$ to $N$:
   $T_j$ = add_node($T_{j-1}$, $s$, $j$, $\Delta$).
Output: Tree $\hat T = T_N$.

We compare the RNJ algorithm and the sequential topology inference algorithm in Table I. We assume all probings have the same sample size and time interval between two consecutive probes. Under multicast probing, the RNJ algorithm is more efficient (for building the whole tree); under unicast packet pair probing, the sequential topology inference algorithm is more efficient in terms of the probing traffic and probing time. In both cases the sequential topology inference algorithm is more computationally efficient than the RNJ algorithm.

TABLE I
COMPARISON OF RNJ ALGORITHM AND SEQUENTIAL TOPOLOGY INFERENCE ALGORITHM
(N Destination Nodes, l-ary Tree with Depth O(log_l N))

                          Multicast Probing                  Unicast Packet Pair Probing        Computational
Task        Algorithm     Probing Traffic   Probing Time     Probing Traffic   Probing Time     Complexity
Add Node    RNJ           O(N)              O(1)             O(N^2)            O(N^2)           O(N^2 log N)
Add Node    Sequential    O(l log_l N)      O(log_l N)       O(l log_l N)      O(l log_l N)     O(l log_l N)
Build Tree  RNJ           O(N)              O(1)             O(N^2)            O(N^2)           O(N^2 log N)
Build Tree  Sequential    O(N l log_l N)    O(N log_l N)     O(N l log_l N)    O(N l log_l N)   O(N l log_l N)

VII. INTERNET ROUTING TREE TOPOLOGY INFERENCE

A. Schemes for Internet Routing Tree Topology Inference

In this section we design schemes for Internet routing tree topology inference. We consider the following schemes:

1. Traceroute-based inference scheme (TR): we use traceroute measurements to construct additive metric $d_h$ and derive the shared path lengths $\hat\rho_h(s, D^2)$ as we described in Section IV.

2. Tomography-based inference scheme (Tomo): we use unicast packet pair/string measurements to construct additive metrics $d_l$, $d_u$, $d_v$ and estimate the shared path lengths $\hat\rho_l(s, D^2)$, $\hat\rho_u(s, D^2)$, $\hat\rho_v(s, D^2)$ as we described in Section IV. We construct a new additive metric using a convex combination of the additive metrics to fuse information from different measurements: $d_t = a_l d_l + a_u d_u + a_v d_v$ with $a_l + a_u + a_v = 1$.

We have shown that if the estimated shared path lengths are close enough to the true values (e.g., condition (18) or (22) is satisfied), then the RNJ algorithm and the sequential topology inference algorithm will return the correct routing tree topology.

For traceroute measurements, the estimated shared path lengths can be distorted due to the existence of anonymous routers, layer-2 switches, and MPLS switches. For network tomography measurements, the assumption of independent and stationary link states can be violated, so a larger sample size with a longer measurement period may not return a more accurate estimate of the shared path lengths. Hence the condition for correct topology inference, (18) or (22), may not hold for either type of measurement.

In order to utilize information collected from both traceroute measurements and network tomography measurements, we propose the following hybrid scheme for Internet routing topology inference.

3. Traceroute+Tomography inference scheme (TRTomo): we use both traceroute measurements and network tomography measurements to construct additive metrics $d_h$ and $d_t$, respectively, and we construct a new additive metric $d_{ht} = A \cdot d_h + d_t$ with a large constant $A$ which makes traceroute measurements dominate network tomography measurements. The reason for selecting a large $A$ is that traceroute measurements have a certain "consistent" property. An anonymous router will affect all the paths passing that router (i.e., the path lengths of those paths are all reduced by one). Hence if $\hat\rho_h(i, j) > \hat\rho_h(i, k)$, then we know for sure that $ij$ is descended from $ik$ on the routing tree. The reverse, however, may not be true. Even if $ij$ is descended from $ik$, we may have $\hat\rho_h(i, j) = \hat\rho_h(i, k)$ due to anonymous routers, hence network tomography measurements are needed to further determine the topology.

For a large number of destination nodes, we propose to infer the routing tree topology using a two-step procedure: first use traceroute measurements ($\hat\rho_h$) (or other heuristics, e.g., round trip times, AS information) to build a skeleton of the tree; then apply tomography measurements ($\hat\rho_t$ or $\hat\rho_{ht}$) on subtrees (each with a relatively small number of destination nodes) to determine the topology of the subtrees. We find that this hybrid approach significantly reduces the probing scalability problem of the pure tomography-based approach. It also leads to better accuracy than the pure traceroute-based approach or the pure tomography-based approach via information fusion.

We refer to the above schemes as TR, Tomo and TRTomo for short hereafter. We evaluate their performance via Internet experiments.

B. Evaluation Methodology

We choose an idle host in our local network as the source node, and two sets of PlanetLab [1] nodes as the destination nodes. We have implemented a sender utility program (running at the source node) that can send unicast probing packet pairs/strings, and a receiver utility program (running at the destination nodes) to receive the probing packets and measure their one-way delays. We collect the measured one-way delays from the destination nodes using the sender utility program.

The first destination node set, referred to as US nodes, consists of 30 hosts in the US (most of them are located in US universities). The second set, referred to as International nodes, consists of 30 international nodes (10 in North America, 10 in Europe, and 10 in East Asia). The reliability of the chosen nodes is important to the experiments, hence we choose nodes that have low CPU load and long running time.

Each probing from the source node to a subset of the destination nodes consists of a sequence of 1200 packet strings. Each probing packet is of size 80 bytes. The probing interval between two consecutive strings is 10 milliseconds (contributing to a probing rate of 64 kbps per destination node).

We evaluate the performance of the three topology inference schemes by artificially varying the anonymization ratio, which is the fraction of the underlying routers not responding to traceroute probing. For each anonymization ratio, we test the topology inference schemes for 20 rounds.

In each round, we first obtain the sequence of the underlying routers from the source node to every destination node using traceroute. The destination nodes we choose have the property that the paths from the source node to them contain no or very few anonymous routers, so we can obtain the ground-truth topology in order to test the topology inference schemes. We count the total number of unique routers we have seen for all destination nodes, and compute how many of the routers in total should be anonymized according to the anonymization ratio. We then iteratively choose a destination node randomly and anonymize the last $m$ routers along its route³, where $m$ is computed as the anonymization ratio times the route length. We also keep track of the number of unique routers we have anonymized in each iteration, and terminate the anonymization procedure once the total number of unique anonymized routers reaches the number we compute a priori.

We use the following two metrics to evaluate the performance of the topology inference schemes:
• Correctness Ratio: the fraction of the internal nodes in the ground-truth topology that are correctly inferred, averaged over all rounds. An internal node in the ground-truth topology is correctly inferred if and only if there is an internal node in the inferred topology with the same set of destination nodes descending from it. A higher correctness ratio means better accuracy of the inference scheme.
• Node Ratio: the ratio of the number of internal nodes in the inferred topology to the number of internal nodes in the ground-truth topology, averaged over all rounds. An accurate inference scheme has a node ratio close to one. If the node ratio is larger than one (or less than one), then the inference algorithm returns more internal nodes (or fewer internal nodes) in the inferred topology.

³ When choosing the PlanetLab nodes, we find that a lot of them are behind routers that do not respond to traceroute probing. Most of these routers are edge routers or access routers of the network in which the destination nodes are located. This suggests that traceroute probings are likely to be discarded in enterprise networks to protect their internal hosts; hence, the routers in the last few hops to a destination node are more likely to be anonymous routers.

C. Experimental Results

We run experiments using the US nodes and International nodes, and refer to them as US experiments and International experiments, respectively. We plot the correctness ratios (Fig. 5 and 6) and node ratios (Fig. 7 and 8) of different schemes with varying levels of underlying routers being anonymized.

1) Correctness Ratio: As shown in Fig. 5 and 6, both TR and TRTomo can correctly infer most of the internal nodes in the ground-truth topology when the anonymization ratio is small. As the anonymization ratio increases, the correctness ratio of TR decreases toward 0, because TR heavily relies on routers' support for traceroute probing (note that it is not exactly 0 because the source node must be attached to an access router and we always include that router in the inferred routing tree topology); the correctness ratio of TRTomo, in contrast, stabilizes around 0.5, because TRTomo can improve TR's accuracy by utilizing both traceroute measurements and tomography measurements.

When the anonymization ratio is 1 (no router responds to traceroute probing), TRTomo becomes the pure tomography-based scheme (Tomo), so we determine the correctness ratio of Tomo using the correctness ratio of TRTomo at anonymization ratio 1, which is around 0.5.

From our experience, we would like to comment on why the pure Tomo scheme alone can only infer about 50% of the internal nodes, rather than all the internal nodes, in the ground-truth topology. First, the routing topology and link states may be time-varying instead of stationary during the measurement period. Second, there are several limitations of the PlanetLab testbed. We observed that the network connections from the source node to the PlanetLab nodes are good most of the time, hence the shared path lengths derived from loss and delay metrics (the signals) are quite small and can be easily distorted by measurement noise. In addition, most PlanetLab nodes are often running multiple applications and processes. This introduces non-negligible node delays to the delay measurements, which will affect the delay and utilization measurements. (Such a phenomenon has been observed and addressed in [26].)

2) Node Ratio: As shown in Fig. 7 and 8, the node ratio of TR is close to 1 when the anonymization ratio is small, but it decreases toward 0 with an increasing anonymization ratio. In contrast, TRTomo has a node ratio close to 1 in all experiments regardless of the anonymization ratio, although it may introduce a few more or fewer internal nodes in the inferred tree topology. The node ratio of Tomo is determined by the node ratio of TRTomo at anonymization ratio 1.

VIII. CONCLUSION

In this paper, we developed fast and scalable algorithms for network routing tree topology inference using a framework based on additive metrics. In particular, we proposed a sequential topology inference algorithm to address the probing scalability problem and handle dynamic node joining and leaving efficiently. We proved the correctness of our algorithms and demonstrated their effectiveness via Internet experiments. The proposed algorithms provide powerful tools for large-scale network inference in communication networks.
In the future we will study how to utilize the inferred information and extend the framework for efficient and effective network monitoring and application design.

ACKNOWLEDGMENTS

The authors would like to thank Dr. Nick Duffield and the anonymous reviewers for their helpful comments and suggestions.

REFERENCES

[1] PlanetLab, http://www.planet-lab.org.
[2] D. G. Andersen, H. Balakrishnan, M. F. Kaashoek, R. Morris, “Resilient Overlay Networks,” Proc. SOSP, Oct. 2001.
[3] D. Antonova, A. Krishnamurthy, Z. Ma, R. Sundaram, “Managing a Portfolio of Overlay Paths,” Proc. NOSSDAV, Kinsale, Ireland, June 2004.
[4] A. Bestavros, J. Byers, K. Harfoush, “Inference and Labeling of Metric-Induced Network Topologies,” Proc. IEEE INFOCOM, June 2002.
[5] J.-C. Bolot, “End-to-End Packet Delay and Loss Behavior in the Internet,” Proc. SIGCOMM, Sept. 1993.
[6] P. Buneman, “The Recovery of Trees from Measures of Dissimilarity,” Mathematics in the Archaeological and Historical Sciences, Edinburgh University Press, pp. 387-395, 1971.
[7] R. Caceres, N. G. Duffield, J. Horowitz, D. Towsley, “Multicast-Based Inference of Network-Internal Loss Characteristics,” IEEE Transactions on Information Theory, vol. 45, no. 7, pp. 2462-2480, Nov. 1999.
[8] R. Castro, M. Coates, G. Liang, R. Nowak, B. Yu, “Network Tomography: Recent Developments,” Statistical Science, vol. 19, no. 3, pp. 499-517, 2004.
[9] R. Castro, M. Coates, R. Nowak, “Likelihood Based Hierarchical Clustering,” IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2308-2321, Aug. 2004.
[10] J. T. Chang, “Full Reconstruction of Markov Models on Evolutionary Trees: Identifiability and Consistency,” Mathematical Biosciences, vol. 137, pp. 51-73, 1996.
[11] M. Coates and R. Nowak, “Network Loss Inference using Unicast End-to-End Measurement,” Proc. ITC Conference on IP Traffic, Modelling and Management, Monterey, CA, Sept. 2000.
[12] M. Coates, A. O. Hero III, R. Nowak, B. Yu, “Internet Tomography,” IEEE Signal Processing Magazine, vol. 19, no. 3, pp. 47-65, May 2002.
[13] M. Coates, R. Castro, M. Gadhiok, R. King, Y. Tsang, R. Nowak, “Maximum Likelihood Network Topology Identification from Edge-Based Unicast Measurements,” Proc. ACM Sigmetrics, June 2002.
[14] N. G. Duffield, J. Horowitz, F. Lo Presti, D. Towsley, “Multicast Topology Inference from End-to-End Measurements,” Advances in Performance Analysis, vol. 3, pp. 207-226, 2000.
[15] N. G. Duffield, J. Horowitz, F. Lo Presti, “Adaptive Multicast Topology Inference,” Proc. IEEE INFOCOM, Anchorage, Alaska, Apr. 2001.
[16] N. G. Duffield, J. Horowitz, F. Lo Presti, D. Towsley, “Multicast Topology Inference From Measured End-to-End Loss,” IEEE Transactions on Information Theory, vol. 48, no. 1, pp. 26-45, Jan. 2002.
[17] N. G. Duffield, F. Lo Presti, V. Paxson, D. Towsley, “Network Loss Tomography Using Striped Unicast Probes,” IEEE/ACM Transactions on Networking, vol. 14, no. 4, pp. 697-710, Aug. 2006.
[18] O. Gascuel and M. Steel, “Neighbor-Joining Revealed,” Molecular Biology and Evolution, vol. 23, no. 11, pp. 1997-2000, 2006.
[19] J. Hartigan, Clustering Algorithms, John Wiley & Sons, 1975.
[20] J. Ni and S. Tatikonda, “A Markov Random Field Approach to Multicast-Based Network Inference Problems,” Proc. IEEE ISIT, Seattle, July 2006.
[21] J. Ni and S. Tatikonda, “Explicit Link Parameter Estimators Based on End-to-End Measurements,” Proc. Allerton Conference on Communication, Control, and Computing, Sept. 2007.
[22] F. Lo Presti, N. G. Duffield, J. Horowitz, D. Towsley, “Multicast-Based Inference of Network-Internal Delay Distributions,” IEEE/ACM Transactions on Networking, vol. 10, no. 6, pp. 761-775, Dec. 2002.
[23] S. Ratnasamy and S. McCanne, “Inference of Multicast Routing Trees and Bottleneck Bandwidths using End-to-end Measurements,” Proc. IEEE INFOCOM, Mar. 1999.
[24] N. Saitou and M. Nei, “The Neighbor-Joining Method: A New Method for Reconstruction of Phylogenetic Trees,” Molecular Biology and Evolution, vol. 4, no. 4, pp. 406-425, 1987.
[25] M. Shih, A. O. Hero III, “Hierarchical Inference of Unicast Network Topologies Based on End-to-End Measurements,” IEEE Transactions on Signal Processing, vol. 55, no. 5, pp. 1708-1718, May 2007.
[26] J. Sommers and P. Barford, “An Active Measurement System for Shared Environments,” Proc. ACM Internet Measurement Conference, Oct. 2007.
[27] D. Stutzbach and R. Rejaie, “Understanding Churn in Peer-to-Peer Networks,” Proc. ACM SIGCOMM Conference on Internet Measurement, 2006.
[28] Y. Tsang, M. Coates, R. Nowak, “Network Delay Tomography,” IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2125-36, Aug. 2003.
[29] M. Yajnik, S. Moon, J. Kurose, D. Towsley, “Measurement and Modelling of the Temporal Dependence in Packet Loss,” Proc. IEEE INFOCOM, Mar. 1999.
[30] B. Yao, R. Viswanathan, F. Chang, D. Waddington, “Topology Inference in the Presence of Anonymous Routers,” Proc. IEEE INFOCOM, Apr. 2003.

[Fig. 5. US-experiment: correctness ratio of inferred topology. Curves: TRTomo, Tomo, TR; x-axis: Anonymization Ratio; y-axis: Correctness Ratio.]
[Fig. 6. International-experiment: correctness ratio of inferred topology. Curves: TRTomo, Tomo, TR; x-axis: Anonymization Ratio; y-axis: Correctness Ratio.]
[Fig. 7. US-experiment: node ratio of inferred topology. Curves: TRTomo, Tomo, TR; x-axis: Anonymization Ratio; y-axis: Node Ratio.]
[Fig. 8. International-experiment: node ratio of inferred topology. Curves: TRTomo, Tomo, TR; x-axis: Anonymization Ratio; y-axis: Node Ratio.]
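As an illustration of the two evaluation metrics defined in Section VII-B (the correctness ratio and the node ratio, i.e., the quantities reported in Figs. 5-8), the following Python sketch computes both for toy trees. It is a minimal sketch under stated assumptions, not the evaluation code used in the experiments: the function names, the `children` dictionary layout, and the example trees are hypothetical; internal nodes are taken to be all non-leaf nodes other than the source; and per-round ratios are averaged with a plain mean.

```python
def internal_clusters(children, source):
    """Map each internal node to the frozenset of destinations below it.

    `children` maps a node to the list of its children; leaves are absent
    from the dict. The source is not counted as internal (assumption).
    """
    found = {}

    def below(node):
        kids = children.get(node, [])
        if not kids:
            return frozenset([node])
        leaves = frozenset().union(*(below(c) for c in kids))
        if node != source:
            found[node] = leaves
        return leaves

    below(source)
    return found


def round_ratios(truth, inferred, source):
    """Correctness ratio and node ratio for a single round."""
    t = internal_clusters(truth, source)
    g = internal_clusters(inferred, source)
    inferred_sets = set(g.values())
    # A ground-truth internal node is correct if some inferred internal
    # node has exactly the same set of descendant destinations.
    correct = sum(1 for cluster in t.values() if cluster in inferred_sets)
    return correct / len(t), len(g) / len(t)


def average_ratios(rounds, source):
    """Mean of the per-round ratios (assumed averaging rule)."""
    pairs = [round_ratios(t, g, source) for t, g in rounds]
    n = len(pairs)
    return sum(p[0] for p in pairs) / n, sum(p[1] for p in pairs) / n


# Toy ground truth: s -> r; r -> x, c, d; x -> a, b.
truth = {"s": ["r"], "r": ["x", "c", "d"], "x": ["a", "b"]}
# Inferred tree that misses x (too few internal nodes).
star = {"s": ["u"], "u": ["a", "b", "c", "d"]}
# Inferred tree with wrong groupings (too many internal nodes).
wrong = {"s": ["u"], "u": ["v", "w"], "v": ["a", "c"], "w": ["b", "d"]}

ratios_star = round_ratios(truth, star, "s")
ratios_wrong = round_ratios(truth, wrong, "s")
mean_ratios = average_ratios([(truth, star), (truth, wrong)], "s")
print(ratios_star, ratios_wrong, mean_ratios)
```

In both toy rounds only the top internal node is matched, so the correctness ratio is 0.5 in each; the node ratio is 0.5 for the star-shaped tree and 1.5 for the tree with wrong groupings, which shows why a node ratio near one does not by itself imply correct inference.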