Network-aware summarisation for resource discovery in P2P-content networks

https://doi.org/10.1016/j.future.2011.03.004

Abstract

Many application scenarios exhibit read-dominated information provision: updates are few, while users issue frequent queries to discover information. Content discovery in such systems can benefit from summarisation techniques, which simplify the discovery process and minimise the size of the data exchanged. This is particularly applicable within Grid computing environments, where nodes (each representing a group of computational and storage resources) may need to exchange metadata about their resources with each other so that jobs can reach them. We introduce network-aware summarisation algorithms, based on Cobweb clustering, for resource discovery in P2P-content networks. We identify how summarisation can improve the discovery process while also improving the accuracy of the discovered resource(s). Metrics based on precision–recall are used to compare the accuracy for specific types of queries generated over the summarised content.

Highlights

► Content summarisation for resource discovery in large-scale Computational Grids.
► A peer-to-peer-based communication protocol provides scalability to very large systems.
► Network awareness reduces the job processing time through the discovery of close nodes.
► A content summarisation technique reduces lookup time.
► A precision–recall metric evaluates the quality of the discovery.

Introduction

Large-scale information systems have gained in importance over recent years, often dealing with many concurrent users and managing an increasing amount of information. Different techniques address the scalability challenges that arise within such systems. First, Peer-to-Peer (P2P) systems provide scalability in terms of the number of participants accessing such systems. Second, summarisation techniques help reduce the amount of information exchanged between system nodes. Thus, a combination of these two techniques is a promising solution for Grid-based Information Systems [1], [2] and Distributed Market Information Systems [3].

Many scientific applications involve the coordinated execution of multiple tasks, with data dependences existing between them, rather than the execution of a single task. Such tasks may be either independent components or complete applications that require a combination of jobs. A task graph is generated in which vertices represent tasks, and arcs represent data dependences between tasks. An emphasis on the use of scientific workflow engines in computational science over recent years (such as Taverna [4] and Kepler [5] in the BioSciences) has led to combining the output of tasks that are executed across different (often distributed) platforms. For instance, an application running on an infrastructure such as the TeraGrid or EGEE/EGI may be distributed across multiple nodes of a Grid infrastructure and require data to be transferred between nodes across a network. Data transfer rates between tasks are therefore an important criterion for meeting an overall application makespan when deciding where to place tasks across a distributed infrastructure. Yet data transfer rates are typically ignored in resource discovery systems, which often look at the properties of a single task rather than the workflow in its entirety. In Grid system registries, for instance, resource properties are relatively stable, in that resources will provide the same operating system and hardware configuration over a long time frame. However, the network connectivity between such resources may vary considerably, depending on the jobs being executed and the associated data transfer between them. Although new resources may also appear in such systems, the number of queries issued to discover suitable resources for a job is likely to be much greater than the number of resources added to or removed from the system [6]. Thus, the efficiency of data retrieval is more important than the setup costs.
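To make the role of data transfer rates concrete, the following is a minimal sketch and not the paper's placement algorithm. It models a task graph whose arcs carry data volumes and compares two placements by their summed transfer time. All task names, node names, data volumes and bandwidths are hypothetical values chosen for illustration, and the sum of transfer times is only a rough proxy for makespan because it ignores parallel transfers and computation time.

```python
from typing import Dict, Tuple

# Hypothetical task graph: arcs (producer, consumer) -> data volume in megabytes.
arcs: Dict[Tuple[str, str], float] = {
    ("fetch", "align"): 800.0,
    ("fetch", "filter"): 200.0,
    ("align", "merge"): 400.0,
    ("filter", "merge"): 100.0,
}

# Hypothetical symmetric bandwidth between node pairs, in megabytes per second.
bandwidth: Dict[frozenset, float] = {
    frozenset({"n1", "n2"}): 100.0,
    frozenset({"n1", "n3"}): 10.0,
    frozenset({"n2", "n3"}): 50.0,
}


def transfer_time(placement: Dict[str, str]) -> float:
    """Sum the transfer time over all arcs; co-located tasks cost nothing."""
    total = 0.0
    for (src, dst), megabytes in arcs.items():
        a, b = placement[src], placement[dst]
        if a != b:
            total += megabytes / bandwidth[frozenset({a, b})]
    return total


# Two candidate placements that differ only in which remote node is used.
near = {"fetch": "n1", "align": "n2", "filter": "n1", "merge": "n2"}
far = {"fetch": "n1", "align": "n3", "filter": "n1", "merge": "n3"}

print(transfer_time(near), transfer_time(far))
```

Both placements use nodes with identical task assignments apart from the remote node, so a discovery system that considers only per-node properties could not distinguish them; the difference comes entirely from the link bandwidth.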

This paper proposes a technique based on data summarisation that can take account of data transfer rates between jobs. The main contribution of this paper is the application of Cobweb clustering to generate summaries of resource properties, thereby supporting scalability of P2P-based content networks. The scalability is achieved by providing efficient network-aware data discovery and by reducing both the time needed and the number of messages exchanged between P2P nodes. This paper analyses the behaviour of the proposed mechanism using simulations with up to half a million resources, each having several attributes that are based on PlanetLab evaluations.

Summarisation techniques have been used in several research areas, for instance, database management [7], video coding [8], [9], [10], [11], or visualisation of web pages [12]. These applications demonstrate the usefulness of reducing the amount of information (but not the quality of such information) when performing several tasks.

The remainder of this paper is organised as follows. Section 2 presents the background of content networks, and Section 3 explains the Cobweb-based data summarisation technique. Section 4 then introduces the system architecture, which utilises summaries to discover resources and includes a summary-based resource discovery technique extended with network awareness. Section 5 presents the evaluation of the efficiency and scalability of the network-aware summarisation environment. Section 6 compares the proposed network-aware summarisation for large-scale information systems with existing related work. Finally, Section 7 covers the conclusions and future work.

Section snippets

Content networks

Nodes in a content network may perform the routing of messages and also store content. In order to allow nodes to store content efficiently, two steps are necessary: content aggregation and content placement. Content aggregation is the process of grouping content based on common features. According to [13], content aggregation involves mapping and aggregation grouping. The first, mapping, maps content to a value in some value space. The second, aggregation grouping, groups content based on

Summarisation technique

The summarisation technique is based on a clustering algorithm called Cobweb [19], which is an incremental system for hierarchical conceptual clustering. The system carries out a hill-climbing search through a space of hierarchical classification schemes using operators that enable bidirectional travel through the space. Cobweb uses a heuristic measure called category utility to guide the search. Gluck and Corter [20] originally developed this metric as a means of predicting the basic level in
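As an illustration of the heuristic that guides Cobweb, the sketch below computes category utility for a partition of instances with nominal attributes, using the commonly cited form: the expected gain in the sum of squared attribute-value probabilities within each cluster over the whole data set, weighted by cluster size and averaged over the number of clusters. This is an assumption about the general formula, not a reproduction of the paper's own formulation, which may also handle numeric resource attributes; the function names and sample resources are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Instance = Dict[str, str]  # nominal attribute name -> value


def sum_sq_probs(instances: List[Instance]) -> float:
    """Sum over attributes and values of P(attribute = value)^2 in `instances`."""
    n = len(instances)
    counts = Counter((attr, val) for inst in instances for attr, val in inst.items())
    return sum((c / n) ** 2 for c in counts.values())


def category_utility(clusters: List[List[Instance]]) -> float:
    """Category utility of a partition, averaged over the number of clusters."""
    everything = [inst for cluster in clusters for inst in cluster]
    baseline = sum_sq_probs(everything)
    total = len(everything)
    gain = sum(
        (len(cluster) / total) * (sum_sq_probs(cluster) - baseline)
        for cluster in clusters
    )
    return gain / len(clusters)


# Hypothetical resources described by two nominal attributes.
linux = {"os": "linux", "arch": "x86"}
windows = {"os": "windows", "arch": "arm"}

good = [[linux, linux], [windows, windows]]   # clusters match the attribute values
mixed = [[linux, windows], [linux, windows]]  # clusters carry no information

print(category_utility(good), category_utility(mixed))
```

A partition whose clusters make attribute values more predictable scores higher, which is why a hill-climbing search over classification schemes can use this measure to choose between candidate placements of a new instance.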

System architecture

This paper analyses read-dominated data [26], which means that queries for an attribute are more frequent than updates of that attribute. For example, a resource within a Grid system may have the same configuration for several days or months. However, a large number of jobs could be submitted to it within minutes. Therefore, the main objective is to reduce the lookup costs associated with the discovery of a suitable resource to execute a batch of jobs. After the initialisation process, we

Evaluation

This section evaluates the previously presented resource discovery algorithms with regard to: the summary sizes and the number of messages and hops needed to obtain the results (Section 5.1); the precision of the obtained results compared with the discovery costs (Section 5.2); and the network costs of the overall workflow compared with baseline experiments (Section 5.3).
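For readers unfamiliar with the metrics, precision and recall over a set of discovered resources can be sketched as follows. This uses the standard set-based definitions; the resource identifiers are hypothetical, and returning 0.0 for an empty set is an assumed convention rather than something stated in the paper.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a query result against the true matches."""
    hits = len(retrieved & relevant)
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query over summarised content: four resources returned,
# three resources actually satisfy the query, two of them were found.
retrieved = {"r1", "r2", "r3", "r4"}
relevant = {"r2", "r3", "r5"}
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Summaries trade exactness for size, so a query answered from a summary may return resources that do not match (lower precision) or miss resources that do (lower recall); reporting both exposes that trade-off.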

Our simulator is built on top of Pastry [16], a well-known structured P2P overlay. We

Related work

Several systems have been developed for resource discovery in distributed systems over the years, some of which have been reviewed in [2]. In Grid systems, one of the most popular is Globus Monitoring and Discovery System (MDS) [1]. MDS allows users to discover what resources are considered part of a Virtual Organisation (VO) and to monitor those resources. However, most of the resource discovery systems are limited in their scalability. The architecture of the presented system improves

Conclusions

This paper presents a network-aware summarisation technique for efficient information retrieval within large-scale P2P-content networks in terms of message size, number of messages, maximum retrieval time and network dependences. Using a completely decentralised Grid system as an example scenario, we simulate an environment with up to half a million randomly distributed resources. The Cobweb-based summary tree allows us to significantly reduce the number of disseminated messages.

References (36)

  • A.D. Doulamis et al., A fuzzy video content representation for video summarization and content-based retrieval, Signal Processing (2000)
  • C. Mastroianni et al., A super-peer model for resource discovery services in large-scale grids, Future Generation Computer Systems (2005)
  • N.D. Doulamis et al., Exploiting semantic proximities for content search over P2P networks, Computer Communications (2009)
  • K. Czajkowski, C. Kesselman, S. Fitzgerald, I.T. Foster, Grid information services for distributed resource sharing,...
  • P. Trunfio, D. Talia, P. Fragopoulou, C. Papadakis, M. Mordacchini, M. Pennanen, K. Popov, V. Vlassov, S. Haridi,...
  • R. Brunner, F. Freitag, L. Navarro, Towards the development of a decentralized market information system: requirements...
  • T. Oinn et al., Taverna: lessons in creating a workflow environment for the life sciences: research articles, Concurrency and Computation: Practice and Experience (2006)
  • A. Ngu et al., Flexible scientific workflow modeling using frames, templates, and dynamic embedding, Scientific and Statistical Database Management (2008)
  • D. Kyriazis et al., Service selection and workflow mapping for grids: an approach exploiting quality-of-service information, Concurrency and Computation: Practice and Experience (2009)
  • R. Saint-Paul, G. Raschia, N. Mouaddib, Database summarization: the SaintEtiQ system, in: Proc. of the 23rd Intl....
  • G. Evangelopoulos, A. Zlatintsi, G. Skoumas, K. Rapantzikos, A. Potamianos, P. Maragos, Y. Avrithis, Video event...
  • S. Liu, M.X. Zhou, S. Pan, W. Qian, W. Cai, X. Lian, Interactive, topic-based visual text summarization and analysis,...
  • D. Simakov, Y. Caspi, E. Shechtman, M. Irani, Summarizing visual data using bidirectional similarity, in: Proc. of the...
  • B. Jiao, L. Yang, J. Xu, F. Wu, Visual summarization of web pages, in: Proc. of the 33rd Intl. Conference on Research...
  • H.T. Kung et al.
  • R. Goldman, N. Shivakumar, S. Venkatasubramanian, H. Garcia-Molina, Proximity search in databases, in: Proc.of the...
  • W.P. Jones et al., Pictures of relevance: a geometric analysis of similarity measures, Journal of the American Society for Information Science (1987)
  • A. Rowstron, P. Druschel, Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer...

Cited by (15)

    • Resource discovery for distributed computing systems: A comprehensive survey

      2018, Journal of Parallel and Distributed Computing
    • HARD: Hybrid Adaptive Resource Discovery for Jungle Computing

      2017, Journal of Network and Computer Applications
      Citation Excerpt :

      A recent work (Caminero et al., 2013) of this type extends RI and proposes a technique to perform resource discovery in grids based on P2P with capability to perform multi-attribute queries and range queries for numerical attributes. It uses an information summarization technique presented in Brunner et al. (2012) and creates different types of summaries and accordingly presents a metric (called goodness function) needed by RIs to guide the query process. It still suffers from RI drawbacks as well as lack of support for complex querying.

    • P2P-based resource discovery in dynamic grids allowing multi-attribute and range queries

      2013, Parallel Computing
      Citation Excerpt :

      Recall that a multi-attribute query is a query asking for resources with more than one pair 〈attribute, value〉, for example {OS = Linux & memory = 4 GB}, and range queries are queries asking for resources whose features are in a range of values (e.g. {50 GB < disk-space < 100 GB}). This technique uses information summarization, and extends proposals from literature using summarization, such as [10,9], because this work (1) provides an efficient way to disseminate and query summarized information over the system based on peer-to-peer (P2P), namely Routing Indices (RIs) [11], (2) adapts the summarization technique to the RIs by means of creating different types of summaries (called n-level summaries), and (3) presents a metric (called goodness function) needed by RIs to guide the query process. Even more, this paper presents a performance evaluation based on the EU DataGRID Testbed [12] that shows the better scalability and good performance of the proposed technique compared to proposals from literature.

    • A task routing approach to large-scale scheduling

      2013, Future Generation Computer Systems

    René Brunner has been a Ph.D. student at the Technical University of Catalonia, Spain, since 2007.

    Agustín C. Caminero has been a Post-Doc researcher at the UNED, Madrid, Spain, since 2010.

    Omer F. Rana is a Reader in the School of Computer Science at Cardiff University, Wales, UK.

    Felix Freitag is a full-time adjunct lecturer at the Computer Architecture Department of the Technical University of Catalonia, Spain.

    Leandro Navarro is an associate professor at the Computer Architecture Department of the Technical University of Catalonia, Spain.
