Sampling networks by the union of m shortest path trees
Introduction
Topologies of complex networks have been collected through active investigation in recent years. They range from biological networks such as gene regulatory networks [1] and metabolic networks [2], through artificial networks like the Internet and the WWW, to social networks such as paper citation and collaboration networks [3]. However, many networks surveyed to date are, in fact, subnets of the actual network, which we call the “underlying network”. For example, only a subset of the molecular entities in a cell has been sampled in protein interaction, gene regulation and metabolic networks. The topology of the Internet is inferred by aggregating paths, which reveals only a part of the whole Internet. Thus, these identified networks are samples of the underlying networks, obtained by different mapping or sampling methods.
In this work, we study the bias of a sampling method that originated from the Internet. The topology of the Internet has typically been measured by the union of sampled traceroutes [4], which are approximately shortest paths. Two main sampling methods exist: (a) The topology is built from the union of traceroutes from a small set of sources to a larger set of destinations, as in the CAIDA skitter project [5]. The sampled map can be modeled as the union of the spanning trees rooted at the sources. (b) The traceroute measurements are carried out between each pair of a set of m testboxes or testbeds. The sampled network is the union of m shortest path trees (SPTs), where each SPT is the union of shortest paths from the root to the other testboxes. Equivalently, it is the union of shortest paths between each node pair in the set of m testboxes. The RIPE NCC [6] and the PlanetLab [7] measurement architectures are examples of this type. The methodology in (a) has been argued, and even proved, to introduce such intrinsic biases that statistical properties of the sampled topology may differ sharply from those of the underlying graph (see e.g. [8], [9], [10]). While most related work on Internet exploration has been devoted to sampling method (a), we investigate sampling method (b). Although the number of destinations is limited to the number m of measurement boxes, method (b) can reduce the spurious effects of (a), where nodes and links closer to the sources are more likely to be sampled than those surrounding the destinations.
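Sampling method (b) can be illustrated with a minimal sketch. It assumes an unweighted underlying graph stored as an adjacency dictionary, and it assumes that one breadth-first-search path per ordered testbox pair stands in for a traceroute. The function names are illustrative and do not come from the measurement projects cited above.

```python
from collections import deque


def bfs_parents(adj, root):
    """Return one BFS parent per reachable node (a hop-count shortest path tree)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def sample_union_of_shortest_paths(adj, testboxes):
    """Union of one shortest path per ordered testbox pair, as undirected links."""
    sampled = set()
    for s in testboxes:
        parent = bfs_parents(adj, s)
        for t in testboxes:
            if t == s or t not in parent:
                continue  # skip the root itself and unreachable testboxes
            node = t
            while parent[node] is not None:  # walk back from t to the root s
                sampled.add(frozenset((node, parent[node])))
                node = parent[node]
    return sampled


if __name__ == "__main__":
    # Hypothetical toy graph: a path 0-1-2-3 with an extra leaf 4 attached to node 1.
    toy = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
    print(sorted(sorted(link) for link in sample_union_of_shortest_paths(toy, [0, 3])))
```

In the toy graph, the link between nodes 1 and 4 lies on no shortest path between the two testboxes, so it is absent from the sample. This is the kind of incompleteness that the sampling bias refers to.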
Using statistical and graph-theoretic methods, we investigate this sampling method (m shortest path trees) on a wide class of networks: the weighted Erdös–Rényi random graphs, which represent dense and homogeneous networks, and unweighted real-world complex networks, which are generally sparse and inhomogeneous graphs. Various underlying networks are investigated because network sampling is a generic problem that arises in many disciplines and the actual underlying network topology is mostly uncertain. Here, we focus on the sampling bias (the incompleteness of the network mapping) introduced purely by the sampling method. Technical limitations in the topology measurements may also introduce significant sampling bias. For example, the network measured by traceroute represents the interconnections of IP addresses. The bias in mapping the router-level Internet topology depends strongly on the alias resolution technique, which maps IP addresses to the corresponding routers [11]. Such specific technical concerns, which vary between the measurements of different complex networks, are not explored in this paper.
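For the weighted case, the following sketch generates an Erdös–Rényi random graph and takes the union of shortest path trees restricted to the testboxes. The link weight distribution is not stated in this section, so independent uniform weights on (0, 1) are an assumption made purely for illustration, as are all function names.

```python
import heapq
import random


def weighted_er_graph(n, p, rng):
    """Erdös–Rényi graph on n nodes; each link exists with probability p.

    Assumption: link weights are i.i.d. uniform on (0, 1).
    """
    adj = {u: {} for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                w = rng.random()
                adj[u][v] = w
                adj[v][u] = w
    return adj


def shortest_path_tree(adj, root):
    """Dijkstra's algorithm; returns the parent of every reachable node."""
    dist = {root: 0.0}
    parent = {root: None}
    heap = [(0.0, root)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        for v, w in adj[u].items():
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent


def union_of_spts(adj, testboxes):
    """Links on the shortest paths between all testbox pairs."""
    sampled = set()
    for s in testboxes:
        parent = shortest_path_tree(adj, s)
        for t in testboxes:
            node = t
            while node in parent and parent[node] is not None:
                sampled.add(frozenset((node, parent[node])))
                node = parent[node]
    return sampled


if __name__ == "__main__":
    graph = weighted_er_graph(50, 0.2, random.Random(7))
    print(len(union_of_spts(graph, [0, 1, 2, 3, 4])))
```

With link weights, a direct link between two testboxes is sampled only if it is also the shortest path between them, which is one reason the weighted and unweighted cases are treated separately.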
The sampled network depends on the set M of m testboxes as well as on the underlying network. In this work, we focus on the effect of the testboxes, in particular, (1) the subgraph of the underlying network consisting of the set M and the direct links between nodes of M, and (2) the relative size m/N of set M, where N is the size of the underlying network. With a given set of testboxes, the sampling bias varies for different networks. The kind of networks with small sampling bias will also be briefly discussed in this paper.
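The two quantities studied here can be computed directly from an adjacency structure. The sketch below assumes the underlying graph is a dictionary mapping each node to its neighbours; the helper names are hypothetical.

```python
def induced_subgraph_links(adj, testboxes):
    """Direct links of the underlying graph whose two endpoints are both testboxes."""
    members = set(testboxes)
    return {frozenset((u, v)) for u in members for v in adj[u] if v in members}


def relative_size(adj, testboxes):
    """Ratio of the number of testboxes to the number of nodes in the underlying graph."""
    return len(set(testboxes)) / len(adj)


if __name__ == "__main__":
    # Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
    toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    boxes = [0, 1, 3]
    print(induced_subgraph_links(toy, boxes), relative_size(toy, boxes))
```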
The main contributions of this study can be summarized as follows:
1. Introduction of a general framework for network sampling on both weighted and unweighted complex networks.
2. Establishment of the correlation between the interconnections of set M, i.e. the subgraph formed by the testboxes and the direct links between them, and the sampled network.
3. Illustration of the detection/measuring effort (the relative size of set M) needed to obtain an increasingly accurate view of a given network.
4. Characterization of networks bearing small sampling bias when the relative size m/N is small, and the corresponding proposal of testbox placement for good network topology measurements.
Section snippets
Modeling the sampling process of large networks
Assuming that the traceroutes used in RIPE NCC and PlanetLab are shortest paths, the sampled topology is the union of shortest paths between each pair of a small group of m nodes, while the number of nodes N in the underlying graph is much larger. When m = N, the sampled graph becomes the union of all shortest paths between any node pair. This union is thus the maximal measurable or observable part of a network by traceroute measurements [12]. It is also regarded as the “transport
Effect of the testbox subgraph on the sampled overlay
Recall that a network is mapped as the union of shortest paths between each pair of a set M of m testboxes. The overlay network is the union of the shortest paths between all node pairs. We first examine the effect of the testbox subgraph on the sampled overlay when the underlying network or substrate is a weighted Erdös–Rényi random graph. As shown in Fig. 2, the testbox subgraph of an underlying network consists of the set M and the direct links between nodes of M. The maximal observable part
Effect of the relative size of the testboxes on the sampling bias
In this section, we first explain why our chosen measure quantifies the sampling bias well. Then, we investigate the effect of the relative size of the testboxes on the sampling bias. Given the ratio m/N, the sampling bias differs for various networks depending on their topologies. We will briefly discuss which type of network tends to possess small sampling bias.
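The dependence on the relative size of the testbox set can be explored numerically. The sketch below uses the fraction of underlying links that appear in the sample as one possible proxy for completeness. This proxy, the unweighted breadth-first-search paths, and the function names are assumptions made for illustration and are not necessarily the measure used in this section.

```python
from collections import deque


def sampled_links(adj, testboxes):
    """Links on one BFS shortest path per ordered pair of testboxes."""
    links = set()
    for s in testboxes:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        for t in testboxes:
            node = t
            while parent.get(node) is not None:  # stops at the root or if t is unreachable
                links.add(frozenset((node, parent[node])))
                node = parent[node]
    return links


def coverage_curve(adj, order):
    """Pairs (m/N, fraction of links sampled) for the first m nodes of `order`."""
    total = sum(len(nbrs) for nbrs in adj.values()) // 2
    curve = []
    for m in range(2, len(order) + 1):
        found = len(sampled_links(adj, order[:m]))
        curve.append((m / len(adj), found / total))
    return curve


if __name__ == "__main__":
    # Hypothetical toy graph: a ring of 8 nodes, testboxes added in node order.
    ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
    for ratio, fraction in coverage_curve(ring, list(range(8))):
        print(f"{ratio:.3f} {fraction:.3f}")
```

Because the testbox sets are nested, the sampled fraction never decreases as m grows, and in an unweighted graph it reaches 1 when every node is a testbox, since each link is itself a shortest path between its endpoints.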
Conclusions
In this paper, we study a network sampling method that originated from the Internet, namely the union of m shortest path trees, or equivalently, the union of shortest paths between each pair of a set M of m testboxes. The analysis covers a wide class of networks, ranging from real-world unweighted complex networks to weighted Erdös–Rényi random graphs.
The interconnections of set M, i.e. the subgraph formed by the testboxes and the direct links between them, are correlated with the sampled network as follows: When the underlying network is
Acknowledgement
This research was supported by the Netherlands Organization for Scientific Research (NWO) under Project No. 643.000.503.
Huijuan Wang received her Master’s and Ph.D. degrees in Electrical Engineering from the Delft University of Technology, the Netherlands, in 2005 and 2009, respectively. She is currently an assistant professor in the Network Architecture and Services (NAS) Group at Delft University of Technology. Her work mainly focuses on performance analysis of large complex networks, robust network design and bio-inspired networking.
References (34)
- et al., Improving network robustness, Physica A (2005)
- Genomic analysis of regulatory network dynamics reveals large topological changes, Nature (2004)
- The large-scale organization of metabolic networks, Nature (2000)
- Linked: The New Science of Networks (2002)
- TCP/IP Illustrated
- ...
- Ripe test traffic measurements....
- ...
- A. Lakhina, J. Byers, M. Crovella, P. Xie, Sampling biases in IP topology measurements, in: Proc. of IEEE INFOCOM, San...
- D. Achlioptas, A. Clauset, D. Kempe, C. Moore, On the bias of traceroute sampling: or, power-law degree distributions...
- Accuracy and scaling phenomena in internet mapping, Phys. Rev. Lett.
- Properties of the observable part of a network, IEEE ACM T. Network.
- Betweenness centrality in weighted networks, Phys. Rev. E
- Attack vulnerability of scale-free networks due to cascading breakdown, Phys. Rev. E
- Random Graphs
Piet Van Mieghem is a professor at the Delft University of Technology with a chair in telecommunication networks and chairman of the section Network Architectures and Services (NAS). His main research interests lie in new Internet-like architectures for future, broadband and QoS-aware networks and in the modeling and performance analysis of network behavior and complex infrastructures. He received a Master’s and Ph.D. in Electrical Engineering from the K.U.Leuven (Belgium) in 1987 and 1991, respectively. Before joining Delft, he worked at the Interuniversity Micro Electronic Center (IMEC) from 1987 to 1991. During 1993–1998, he was a member of the Alcatel Corporate Research Center in Antwerp, where he was engaged in performance analysis of ATM systems and in network architectural concepts of both ATM networks (PNNI) and the Internet. Currently, he serves on the editorial board of the IEEE/ACM Transactions on Networking. He was a visiting scientist at MIT (department of Electrical Engineering, 1992–1993) and a visiting professor at UCLA (department of Electrical Engineering, 2005) and at Cornell University (Center of Applied Mathematics, 2009).