Sampling networks by the union of m shortest path trees

doi:10.1016/j.comnet.2009.10.023

Computer Networks

Volume 54, Issue 6, 29 April 2010, Pages 1042-1053

https://doi.org/10.1016/j.comnet.2009.10.023

Abstract

Many network topology measurements capture or sample only a partial view of the actual network structure, which we call the underlying network. Sampling bias is a critical problem in the field of complex networks, ranging from biological and social networks to artificial networks like the Internet. This bias depends on both the sampling method of the measurements and the features of the underlying networks. In the RIPE NCC and PlanetLab measurement architectures, the Internet is mapped as $G_{\cup_{m} spt}$, the union of shortest paths between each pair of a set $M$ of m testboxes, or equivalently, m shortest path trees. In this paper, we investigate this sampling method on a wide class of real-world complex networks as well as on the weighted Erdős–Rényi random graphs. This general framework examines the effect of the set of testboxes on $G_{\cup_{m} spt}$. We establish the correlation between the subgraph $G_{M}$ of the underlying network, i.e. the set $M$ and the direct links between nodes of set $M$, and the sampled network $G_{\cup_{m} spt}$. Furthermore, we illustrate that in order to obtain an increasingly accurate view of a given network, a higher than linear detection/measuring effort (the relative size $m / N$ of set $M$) is needed, where N is the size of the underlying network. Finally, when the relative size $m / N$ of set $M$ is small, we characterize the kind of networks possessing small sampling bias, which provides insights on how to place the testboxes for good network topology measurement.

Introduction

Topologies of complex networks, ranging from biological networks such as gene regulatory networks [1] and metabolic networks [2], through artificial networks like the Internet and the WWW, to social networks such as paper citation and collaboration networks [3], have been accumulated through active investigation in recent years. However, many networks surveyed to date are, in fact, subnets of the actual network, which we call the “underlying network”. For example, only a subset of the molecular entities in a cell have been sampled in protein interaction, gene regulation and metabolic networks. The topology of the Internet is inferred by aggregating paths, which reveals only a part of the whole Internet. Thus, these identified networks are sampled networks of the underlying networks, obtained by different mapping or sampling methods.

In this work, we study the bias phenomenon of a sampling method that originated from the Internet. The topology of the Internet has typically been measured by the union of sampled traceroutes [4], which are approximately shortest paths. Two main sampling methods exist: (a) The topology is built from the union of traceroutes from a small set of sources to a larger set of destinations, as in the CAIDA skitter project [5]. The sampled map can be modeled as the union of the spanning trees rooted at the sources. (b) The traceroute measurements are carried out between each pair of a set $M$ of m testboxes or testbeds. The sampled network, denoted as $G_{\cup_{m} spt}$, is the union of m shortest path trees (SPTs), where each SPT is the union of shortest paths from a root in $M$ to the other $m - 1$ testboxes in $M$. Equivalently, $G_{\cup_{m} spt}$ is the union of shortest paths between each node pair in the set $M$ of m testboxes. The RIPE NCC [6] and the PlanetLab [7] measurement architectures are examples of this type. The methodology in (a) has been argued and even proved to introduce such intrinsic biases that statistical properties of the sampled topology may sharply differ from those of the underlying graph (see e.g. [8], [9], [10]). While most related works on Internet exploration have been devoted to the sampling method (a), we investigate the other sampling method (b). Although method (b) limits the number of destinations to the number m of measurement boxes, it can reduce the spurious effects of (a), where nodes and links closer to the sources are more likely to be sampled than those surrounding the destinations.
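Sampling method (b) can be illustrated with a minimal Python sketch for an unweighted graph. It assumes that hop count defines the shortest path and that breadth-first search picks one shortest path per (root, testbox) pair; the function names `bfs_parents` and `union_of_spts` and the toy graph are hypothetical and not taken from the paper.

```python
from collections import deque


def bfs_parents(adj, root):
    """Return the parent map of one BFS (hop-count shortest path) tree."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def union_of_spts(adj, testboxes):
    """Links on the shortest paths from each testbox to every other testbox."""
    links = set()
    for root in testboxes:
        parent = bfs_parents(adj, root)
        for target in testboxes:
            if target == root or target not in parent:
                continue  # skip the root itself and unreachable testboxes
            node = target
            while parent[node] is not None:
                links.add(frozenset((node, parent[node])))
                node = parent[node]
    return links


# Toy underlying network (assumed): a path 0-1-2-3-4 with node 5 hanging off 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
sampled = union_of_spts(adj, [0, 4])
print(sorted(tuple(sorted(link)) for link in sampled))
# [(0, 1), (1, 2), (2, 3), (3, 4)]
```

With testboxes 0 and 4, the link between nodes 2 and 5 never lies on a measured path, so it is missing from the sampled network even though it exists in the underlying one.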

With statistical and graph theory methodologies, we investigate this sampling method (m shortest path trees) on a wide class of networks: the weighted Erdős–Rényi random graphs, which represent dense and homogeneous networks, and the unweighted real-world complex networks, which are generally sparse and inhomogeneous graphs. Various underlying networks are investigated, because network sampling is a generic problem residing in various disciplines and the actual underlying network topology is mostly uncertain. Here, we focus on the sampling bias (the incompleteness of the network mapping) introduced purely by the sampling method. Technical limitations in the topology measurements may also introduce significant sampling bias. For example, the network measured by traceroute represents the interconnections of IP addresses. The bias in mapping the router level Internet topology depends highly on the alias resolution technique, which maps IP addresses to the corresponding routers [11]. Such specific technical concerns, which vary in the measuring of different complex networks, are not explored in this paper.
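As a rough illustration of the weighted case, the following Python sketch generates an Erdős–Rényi random graph with independent link weights and computes one shortest path tree with Dijkstra's algorithm. The uniform weight distribution on [0, 1), the parameter values and the function names `weighted_er_graph` and `shortest_path_tree` are assumptions made for this example, not details confirmed by the text above.

```python
import heapq
import random


def weighted_er_graph(n, p, rng):
    """Erdős–Rényi graph on n nodes; each link exists with probability p.

    Link weights are i.i.d. uniform on [0, 1) (an assumed weight law).
    """
    adj = {u: {} for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                w = rng.random()
                adj[u][v] = w
                adj[v][u] = w
    return adj


def shortest_path_tree(adj, root):
    """Dijkstra's algorithm: parent map and distances of the SPT at root."""
    dist = {root: 0.0}
    parent = {root: None}
    heap = [(0.0, root)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        for v, w in adj[u].items():
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent, dist


rng = random.Random(1)                 # fixed seed for repeatability
graph = weighted_er_graph(20, 1.0, rng)  # p = 1 gives the complete graph
parent, dist = shortest_path_tree(graph, 0)
tree_links = {frozenset((v, u)) for v, u in parent.items() if u is not None}
print(len(tree_links))  # a spanning tree on 20 nodes has 19 links
```

Taking the union of such trees over all roots in the testbox set, restricted to the paths that end at other testboxes, would give the sampled network in the weighted setting.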

The sampled network $G_{\cup_{m} spt}$ depends on the set $M$ of m boxes as well as the underlying network. In this work, we focus on the effect of the testboxes, in particular, (1) the subgraph $G_{M}$ of the underlying network, consisting of the set $M$ and the direct links between nodes of set $M$, and (2) the relative size $m / N$ of set $M$, where N is the size of the underlying network. With a given set of testboxes, the sampling bias varies for different networks. The kind of networks with small sampling bias will also be briefly mentioned in this paper.
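The two quantities named above can be made concrete with a small Python sketch. It assumes the underlying network is given as an adjacency dictionary; the helper names `induced_subgraph_links` and `relative_size` and the six-node ring are hypothetical.

```python
def induced_subgraph_links(adj, testboxes):
    """Links of G_M: direct links of the underlying network inside set M."""
    members = set(testboxes)
    return {
        frozenset((u, v))
        for u in members
        for v in adj[u]
        if v in members and v != u
    }


def relative_size(adj, testboxes):
    """Relative size m / N of the testbox set."""
    return len(set(testboxes)) / len(adj)


# Toy underlying network (assumed): a ring of N = 6 nodes.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
boxes = [0, 1, 3]
gm_links = induced_subgraph_links(ring, boxes)
print(sorted(tuple(sorted(link)) for link in gm_links))  # [(0, 1)]
print(relative_size(ring, boxes))  # 0.5
```

In an unweighted graph every link of $G_{M}$ is itself a one-hop shortest path between two testboxes, so it necessarily appears in the sampled network; with link weights this need not hold.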

The main contributions of this study can be summarized as follows:

1. Introduction of a general framework for network sampling on both weighted and unweighted complex networks.
2. Establishment of the correlation between the interconnections of set $M$, i.e. the subgraph $G_{M}$, and the sampled network $G_{\cup_{m} spt}$.
3. Illustration of the detection/measuring effort (the relative size $m / N$ of set $M$) to obtain an increasingly accurate view of a given network.
4. Characterization of networks bearing small sampling bias when $m / N$ is small and the corresponding proposal of testbox placement for good network topology measurements.

Section snippets

Modeling the sampling process of large networks

Assuming that traceroutes used in RIPE NCC and the PlanetLab are shortest paths, the sampled topology is then the union $G_{\cup_{m} spt}$ of shortest paths between each pair of a small group of $m \ll N$ nodes, where N is the number of nodes in the underlying graph. When $m = N$, the graph $G_{\cup_{m} spt}$ becomes $G_{\cup spt}$, the union of all shortest paths between any node pair. $G_{\cup spt}$ is thus the maximal measurable or observable part of a network by traceroute measurements [12]. It is also regarded as the “transport

Effect of $G_{M}$ on the sampled overlay $G_{\cup_{m} spt}$

Recall that a network is mapped as $G_{\cup_{m} spt}$, the union of shortest paths between each pair of a set $M$ of m testboxes. The overlay network $G_{\cup spt}$ is the union of the shortest paths between all node pairs. We examine first the effect of $G_{M}$ on the sampled overlay $G_{\cup_{m} spt}$ when the underlying network or substrate is a weighted Erdős–Rényi random graph. As shown in Fig. 2, the subgraph $G_{M}$ of an underlying network $G (N, L)$ is the set $M$ and the direct links between nodes of set $M$. The maximal observable part

Effect of the relative size $m / N$ of the testboxes on the sampling bias

In this section, we first explain why $E [L_{mspt}] / E [L_{o}]$ quantifies the sampling bias well. Then, we investigate the effect of the relative size $m / N$ of the testboxes on the sampling bias. Given the ratio $m / N$ , the sampling bias differs for various networks depending on their topologies. We will briefly discuss which type of network tends to possess small sampling bias.
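A ratio of this kind can be estimated numerically, as in the following Python sketch. It assumes that $L_{mspt}$ is the number of links in $G_{\cup_{m} spt}$ and $L_{o}$ the number of links in the overlay $G_{\cup spt}$ (the snippet above does not define either symbol), that testboxes are chosen uniformly at random, and that the underlying graph is a fixed unweighted ring; all names are hypothetical.

```python
import random
from collections import deque


def sampled_links(adj, testboxes):
    """Links on one BFS shortest path between every ordered testbox pair."""
    links = set()
    for root in testboxes:
        parent = {root: None}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        for target in testboxes:
            node = target
            # Walk back to the root; unreachable targets are skipped.
            while parent.get(node) is not None:
                links.add(frozenset((node, parent[node])))
                node = parent[node]
    return links


def link_ratio(adj, m, trials, rng):
    """Average sampled link count for m random testboxes over overlay links."""
    nodes = sorted(adj)
    overlay = len(sampled_links(adj, nodes))  # the m = N case
    total = 0
    for _ in range(trials):
        total += len(sampled_links(adj, rng.sample(nodes, m)))
    return total / trials / overlay


ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
rng = random.Random(0)
for m in (2, 5, 10):
    print(m / len(ring), link_ratio(ring, m, 200, rng))
```

A ratio close to 1 would mean that the m testboxes already reveal almost everything that is observable with m = N, while a small ratio indicates a large sampling bias under this reading of the symbols.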

Conclusions

In this paper, we study a network sampling method that originated from the Internet, namely $G_{\cup_{m} spt}$, the union of m shortest path trees, or equivalently, the union of shortest paths between each pair of a set $M$ of m testboxes. The analysis covers a wide class of networks, ranging from real-world unweighted complex networks to weighted Erdős–Rényi random graphs.

The interconnections of set $M$ , i.e. the subgraph $G_{M}$ , are correlated with the sampled network $G_{\cup_{m} spt}$ as follows: When the underlying network is

Acknowledgement

This research was supported by the Netherlands Organization for Scientific Research (NWO) under Project No. 643.000.503.

Huijuan Wang received her Master’s and Ph.D. degrees in Electrical Engineering at the Delft University of Technology, the Netherlands, in 2005 and 2009, respectively. She is currently an assistant professor in the Network Architectures and Services (NAS) Group at Delft University of Technology. Her work mainly focuses on performance analysis of large complex networks, robust network design and bio-inspired networking.

References (34)

A. Beygelzimer et al., Improving network robustness, Physica A (2005)
N.M. Luscombe, Genomic analysis of regulatory network dynamics reveals large topological changes, Nature (2004)
H. Jeong, The large-scale organization of metabolic networks, Nature (2000)
A.-L. Barabasi, Linked: The New Science of Networks (2002)
W. Richard Stevens, TCP/IP illustrated
...
Ripe test traffic measurements....
...
A. Lakhina, J. Byers, M. Crovella, P. Xie, Sampling biases in IP topology measurements, in: Proc. of IEEE INFOCOM, San...
D. Achlioptas, A. Clauset, D. Kempe, C. Moore, On the bias of traceroute sampling: or, power-law degree distributions...
A. Clauset et al., Accuracy and scaling phenomena in internet mapping, Phys. Rev. Lett. (2005)
R. Sherwood, A. Bender, N. Spring, DisCarte: a disjunctive internet cartographer, in: ACM SIGCOMM’08, Washington, USA,...
P. Van Mieghem et al., Properties of the observable part of a network, IEEE ACM T. Network. (2009)
H. Wang et al., Betweenness centrality in weighted networks, Phys. Rev. E (2008)
A. Ganesh, L. Massoulie, D. Towsley, The effect of network topology on the spread of epidemics, in: Proc. IEEE Infocom,...
L. Zhao et al., Attack vulnerability of scale-free networks due to cascading breakdown, Phys. Rev. E (2004)
B. Bollobás, Random Graphs (2001)

Piet Van Mieghem is professor at the Delft University of Technology with a chair in telecommunication networks and chairman of the section Network Architectures and Services (NAS). His main research interests lie in new Internet-like architectures for future, broadband and QoS-aware networks and in the modeling and performance analysis of network behavior and complex infrastructures. He received a Master’s and Ph.D. in Electrical Engineering from the K.U.Leuven (Belgium) in 1987 and 1991, respectively. Before joining Delft, he worked at the Interuniversity Micro Electronic Center (IMEC) from 1987 to 1991. During 1993–1998, he was a member of the Alcatel Corporate Research Center in Antwerp where he was engaged in performance analysis of ATM systems and in network architectural concepts of both ATM networks (PNNI) and the Internet. Currently, he serves on the editorial board of the IEEE/ACM Transactions on Networking. He was a visiting scientist at MIT (department of Electrical Engineering, 1992–1993) and a visiting professor at UCLA (department of Electrical Engineering, 2005) and at Cornell University (Center of Applied Mathematics, 2009).
