Computer Networks

Volume 56, Issue 3, 23 February 2012, Pages 1092-1102

Measuring the validity of peer-to-peer data for information retrieval applications

https://doi.org/10.1016/j.comnet.2011.10.026

Abstract

Peer-to-peer (p2p) networks are being increasingly adopted as an invaluable resource for various information retrieval (IR) applications, including similarity estimation, content recommendation and trend prediction. However, these networks are usually extremely large and noisy, which raises doubts regarding the ability to actually extract sufficiently accurate information.

This paper quantifies the measurement effort required to obtain and optimize information from p2p networks for the purpose of IR applications. We identify and measure inherent difficulties in collecting p2p data, namely, partial crawling, user-generated noise, sparseness, and popularity and localization of content and search queries. These aspects are quantified using music files shared in the Gnutella p2p network. We show that the power-law nature of the network makes it possible to capture an accurate view of the popular content with relatively little effort. However, some applications, like trend prediction, mandate collection of data from the “long tail”, hence a much more exhaustive crawl is needed. Furthermore, we show that content and search queries are highly localized, indicating that conclusions spanning multiple locations require a geographically widespread crawl. Finally, we present techniques for overcoming noise originating from user-generated content and for filtering non-informative data, while minimizing information loss.

Introduction

Peer-to-peer (p2p) networks provide fertile ground [1] for an abundance of information, including files shared by users, search queries, and spatial and temporal changes that take place in the network. This data is often adopted as an invaluable resource for various information retrieval (IR) tasks, including user and content similarity [2], [3], [4], recommendation [5], [6], ranking [7], and trend prediction [8], [9], [10], [11].

Traditionally, these tasks use datasets extracted from server-based services, such as NetFlix, Last.FM, Yahoo! Music and other similar web 2.0 services. Web based services have the potential to provide a complete view of their data, either by commercial agreements or by crawling using a centralized interface. P2P networks have a great potential as a practically unbounded source of data for IR tasks. This wealth of information regarding user preferences is particularly useful in recommendation techniques based on collaborative filtering, which were shown to out-perform content based approaches, given that the dataset used is sufficiently comprehensive [12].

Another advantage of p2p datasets is the availability of information, which removes the need for agreements with website operators and the restrictions they impose on the amount of data collected and/or its usage. Due to their decentralized nature and open protocols, p2p networks are a source for independent large-scale data collection.

Despite all their advantages, p2p networks are quite complex, making the collection of a comprehensive dataset far from trivial, and in some cases practically infeasible. First, p2p networks have high user churn, as users constantly connect to and disconnect from the network and are unavailable for varying periods. Second, users in p2p networks often do not expose their shared data in order to maintain privacy and security, which makes it impossible to collect information about their shared folders. Finally, users often delete content after using it, leaving no trace of its usage.

A different complexity involves the usage of meta-data, which was shown to be useful for finding similarity between performing artists [5]. The content in file sharing networks is mostly ripped by individual users for consumption by other users. User-based interactions are a desirable property in IR datasets; however, when it comes to meta-data, they are the main source of ambiguity and noise. Be it a movie, a song, or any other file type, there are typically several similar duplicates available on the network. The files may be digitally identical, thus having the same hash signature, yet bear different file names and meta-data tags. Duplications in meta-data tags are typically the result of spelling mistakes, missing data, and different variations on the correct values. A common hash signature can facilitate the grouping of similar files; nonetheless, it does not solve the problem of copies that are not digitally identical. For example, in the Gnutella [13] network, which facilitates string-based search queries that are matched against meta-data, only 7–10% of the queries are successful in returning useful content [14].
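The hash-based grouping described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the record layout, the `normalize` helper, and the majority-vote rule are all assumptions made for the example. It only handles digitally identical copies; files that differ at the byte level still end up in separate groups.

```python
from collections import Counter, defaultdict


def normalize(tag: str) -> str:
    """Lowercase the tag and collapse punctuation and whitespace (hypothetical rule)."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in tag.lower())
    return " ".join(cleaned.split())


def canonical_titles(records):
    """Map each file hash to the most frequent normalized title seen for it.

    `records` is an iterable of (file_hash, title) pairs; empty titles are skipped.
    """
    votes = defaultdict(Counter)
    for file_hash, title in records:
        norm = normalize(title)
        if norm:
            votes[file_hash][norm] += 1
    return {h: counts.most_common(1)[0][0] for h, counts in votes.items()}


# Toy records: the same hash appears with slightly different titles.
records = [
    ("a1f3", "Hey Jude"),
    ("a1f3", "hey  jude"),
    ("a1f3", "Hey Jud"),
    ("9bc2", "Yesterday"),
    ("9bc2", ""),
]
titles = canonical_titles(records)
```

Here the misspelled "Hey Jud" is outvoted by the two variants that normalize to the same string, and the empty tag contributes nothing.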

Given the above considerations, it is clear why datasets based on p2p networks are gaining popularity in a variety of IR tasks. However, the extent of the data collection effort required in a large-scale network for the resulting dataset to be sufficiently accurate and representative is still unknown. The objective of this work is to bridge this gap by analyzing the efficiency and extent of crawling required in a p2p network for obtaining accurate information for various IR tasks.

This paper quantifies the measurement effort required to obtain and optimize information from p2p networks for the purpose of IR applications. In order to understand how well the crawl captures the underlying network, we measure the utility of an exhaustive crawl relative to a partial crawl from an IR point of view. When discussing shared files, a partial crawl means that not all peers are reached, and thus not all user-to-file relations are recorded. For example, a previous study [15] reported that in Gnutella, over 30% of crawled peers are non-responsive, presumably due to firewall protections. In the context of search queries, it is practically impossible to collect all queries in a fully distributed large-scale p2p network [16]. A query-based dataset is therefore destined to capture only a partial view of the interactions.

Similar to previous studies [15], [17], we find that some of the graphs modeling p2p network data exhibit a power-law [18] distribution. This distribution indicates that collecting the majority of popular files and extracting accurate information for the mainstream is relatively easy. By collecting the high-degree nodes, which are easily reached, one may extract an abundance of information representing the core of user-to-content relations. On the other hand, reaching more exotic niches or following small popularity trends of digital content mandates a more thorough crawl with significantly higher collection effort, as the collection process must visit the long “tail” of the distribution. Furthermore, we observe geographic locality of both files and queries, indicating that geographically aware applications like trend prediction mandate sampling in different geographic locations [10]. In addition, distant countries that are culturally similar are also observed to exhibit similar user behavior. These findings have a direct impact on the scalability of measurement efforts that seek to use such data in an accurate manner.
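The contrast between popular content and the long tail can be illustrated with a toy model. The sketch below is not based on the paper's data: the network is synthetic (user u shares file f whenever f divides u, which gives file f roughly N/f sharers, a Zipf-like popularity), and the crawl is an arbitrary deterministic 25% sample of users. All function names are hypothetical.

```python
from collections import Counter


def build_toy_network(num_users, num_files):
    """Synthetic network: user u shares file f when f divides u (Zipf-like popularity)."""
    return {
        user: {f for f in range(1, num_files + 1) if user % f == 0}
        for user in range(1, num_users + 1)
    }


def coverage(shared, crawled_users, target_files):
    """Fraction of `target_files` seen at least once among the crawled users."""
    seen = set()
    for user in crawled_users:
        seen |= shared[user]
    return len(seen & target_files) / len(target_files)


shared = build_toy_network(200, 200)

# Popularity of a file = number of users sharing it.
popularity = Counter(f for files in shared.values() for f in files)
head = {f for f, _ in popularity.most_common(20)}  # the 20 most popular files
all_files = set(popularity)

# Partial crawl: every fourth user, i.e. 25% of the network.
crawled = range(4, 201, 4)
head_coverage = coverage(shared, crawled, head)
overall_coverage = coverage(shared, crawled, all_files)
```

In this toy setup the partial crawl sees every one of the 20 most popular files but only half of all files, which mirrors the qualitative point made above: the head of the distribution is cheap to capture, while the tail requires a far more exhaustive crawl.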

At the time of data collection, Gnutella was the most popular file sharing network [19], consisting mostly of the LimeWire client, with roughly 80–85% market share [20]. However, on October 26, 2010, a US federal court ordered LimeWire to prevent “the searching, downloading, uploading, file trading and/or file distribution functionality, and/or all functionality” [21]. As a result, new LimeWire versions have been disabled, severely hurting the functionality of the network. In this research we seek to capture and interpret the behavioral patterns of users, namely the way content is shared and search queries are issued. Since content distribution properties appear similar across a range of p2p networks [22], [23], we believe that our conclusions can be generalized for creating accurate and optimized p2p IR applications.

Related work

This paper relates to two research areas: measurement of p2p network characteristics and the emerging fields of large-scale data mining and IR. The characteristics of various p2p networks have been extensively studied, focusing on topology, content popularity, bandwidth and queries. Most studies [13], [17], [23], [24], [25] discovered a power-law distribution in the content shared by peers in various p2p networks. Similar studies of the distribution of search queries and file replication [26]

Measurement infrastructure

This section details the architecture of the measurement system developed to crawl the Gnutella [13] network and collect queries in a distributed manner. Although the exact details are adapted to comply with the Gnutella architecture and protocols, similar techniques can be applied to other p2p networks such as BitTorrent [23].

Dataset

The shared files dataset was collected using a 24-hour crawl of over 1.2 million users sharing 373 million files on November 25th, 2007. Most of the query analysis presented in this paper is based on a dataset collected during the first week of February 2007. It comprises 4.5 million queries from over 3 million users.

We identified file types according to their file suffix, and found that music-related content (mp3, wma, flac, m4p, m4a) accounts for over 75% of the files, i.e., over 281
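Classifying files by suffix, as described above, might look like the following sketch. The suffix set is taken from the text; the function name and the sample file names are hypothetical, and the authors' actual tooling is not described in this excerpt.

```python
from collections import Counter
from pathlib import PurePath

# Music-related suffixes listed in the text.
MUSIC_SUFFIXES = {".mp3", ".wma", ".flac", ".m4p", ".m4a"}


def music_share(file_names):
    """Return the fraction of file names whose suffix marks them as music."""
    suffixes = Counter(PurePath(name).suffix.lower() for name in file_names)
    total = sum(suffixes.values())
    music = sum(count for suffix, count in suffixes.items() if suffix in MUSIC_SUFFIXES)
    return music / total if total else 0.0


# Toy input; suffix matching is case-insensitive.
sample = ["song.MP3", "track.flac", "movie.avi", "notes.txt"]
share = music_share(sample)
```

Matching on the suffix alone is cheap but trusts the file name, so mislabeled files are counted under whatever extension they carry.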

Content distribution

We begin our analysis of the dataset by examining how content is shared by p2p users, and the way this affects the ability to create an accurate snapshot. Fig. 3a depicts the distribution of songs per user, exhibiting a power-law [18] distribution, with a very strong cut-off around the middle of the plot. This distribution closely resembles the one previously reported by Zhao et al. [29]. The figure shows that the vast majority of users share fewer than 300 songs, whereas only several

Collaborative filtering

Recommendation systems [27] often require an estimation of the distance between different items [32] or between users [4]. File similarity is typically achieved using content-based similarity [33], i.e., comparing some aspects of the content of files. This method is often computationally expensive and requires obtaining the actual files. Obtaining accurate user similarity is also quite challenging, as it requires a significant amount of information about user preferences.
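As a point of reference for user-to-user distance, one simple measure that needs only the lists of shared files, and not the files themselves, is the Jaccard index of two users' shared sets. This is an assumed stand-in chosen for illustration; the excerpt does not state which similarity measure the authors use, and the user names and song identifiers below are made up.

```python
def jaccard(files_a, files_b):
    """Jaccard similarity of two sets of shared files (0.0 when both are empty)."""
    union = files_a | files_b
    return len(files_a & files_b) / len(union) if union else 0.0


# Hypothetical shared folders, keyed by user.
shared = {
    "alice": {"s1", "s2", "s3"},
    "bob": {"s2", "s3", "s4"},
    "carol": {"s5"},
}
alice_bob = jaccard(shared["alice"], shared["bob"])
alice_carol = jaccard(shared["alice"], shared["carol"])
```

With sparse data, most user pairs share nothing and score zero, which is one reason such measures depend on how much of each user's shared folder the crawl manages to observe.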

However, both can

Query collection

Collection of queries is often a much more complicated task than crawling the shared folders [16]. Hence, similar to the previous analysis, we seek to quantify the utility of collecting queries from an increasing number of users.

We first study the distribution properties of the collected queries. Fig. 8a shows the distribution of query appearances, depicting a distinct power-law distribution, similar to previous work [40]. This indicates that only a few popular search terms repeat at a high

Discussion and conclusion

In the face of the increasing usage of p2p networks in IR tasks, this paper provides a comprehensive analysis of different aspects that must be considered. Four main challenges are described: (a) the inability to crawl all users and collect information regarding all files, (b) the complexities in intercepting search queries, (c) the inherent noise of user-generated content, and (d) the extreme sparseness of the dataset.

Content in p2p networks typically follows a power-law distribution, hence

Noam Koenigstein holds a Master's degree in Electrical Engineering from Tel Aviv University, and a Bachelor of Science from the Technion – Israel Institute of Technology. He is currently pursuing his Ph.D. studies at the School of Electrical Engineering, Tel Aviv University. His research interests include Recommender Systems, Machine Learning, Data Mining, Information Retrieval and Peer-to-Peer Networks.

References (40)

  • S. Bhattacharjee et al., Using P2P sharing activity to improve business decision making: proof of concept for estimating product life-cycle, Electronic Commerce Research and Applications (2005)
  • S. Bhattacharjee et al., Whatever happened to payola? an empirical analysis of online music sharing, Decision Support Systems (2006)
  • Y. Shavitt et al., Mining musical content from large-scale peer-to-peer networks
  • D.P.W. Ellis, B. Whitman, The quest for ground truth in musical artist similarity, in: The International Society for...
  • A. Berenzweig et al., A large-scale evaluation of acoustic and subjective music similarity measures, Computer Music Journal (2003)
  • Y. Shavitt, E. Weinsberg, U. Weinsberg, Estimating peer similarity using distance of shared files, in: International...
  • Y. Shavitt, U. Weinsberg, Song clustering using peer-to-peer co-occurrences, in: International Workshop on Peer-To-Peer...
  • I. Mierswa et al., Collaborative use of features in a distributed system for the organization of music collections
  • N. Koenigstein, Y. Shavitt, Song ranking based on piracy in peer-to-peer networks, in: The International Society for...
  • N. Koenigstein, Y. Shavitt, T. Tankel, Spotting out emerging artists using geo-aware analysis of p2p query strings, in:...
  • N. Koenigstein, Y. Shavitt, Predicting billboard success using data-mining in p2p networks, in: International Workshop...
  • L. Barrington, R. Oda, G. Lanckriet, Smarter than genius? Human evaluation of music recommender systems, in: The...
  • M. Ripeanu, Peer-to-peer architecture case study: Gnutella network,...
  • M.A. Zaharia, A. Chandel, S. Saroiu, S. Keshav, Finding content in file-sharing networks when you can’t even spell, in:...
  • D. Stutzbach et al., Characterizing the two-tier Gnutella topology
  • A. Klemm et al., Characterizing the query behavior in peer-to-peer file sharing systems
  • A.S. Gish, Y. Shavitt, T. Tankel, Geographical statistics and characteristics of p2p query strings, in: International...
  • A.-L. Barabási et al., Emergence of scaling in random networks, Science (1999)
  • P. Resnikoff, Digital media desktop report: fourth quarter of 2007, Digital Music Research...
  • A. Rasti, D. Stutzbach, R. Rejaie, On the long-term evolution of the two-tier Gnutella overlay, in: Global Internet,...
    Yuval Shavitt is a faculty member in the School of Electrical Engineering at Tel-Aviv University, Israel. His research interests include Internet measurements, mapping, and characterization, and data mining peer-to-peer networks. Shavitt has a D.Sc. in electrical engineering from the Technion, Haifa, Israel.

    Ela Weinsberg is an MS student in the department of industrial engineering at Tel-Aviv University, Israel. Her research interests include data mining of peer-to-peer networks. Ela holds a BS in computer science and mathematics from Bar-Ilan University.

    Udi Weinsberg is a Ph.D. candidate in the school of electrical engineering at Tel-Aviv University, Israel. His research interests include Internet measurement, complex networks analysis, and large-scale data mining. Udi holds an MS in electrical engineering from Tel-Aviv University, Israel.