Network-aware summarisation for resource discovery in P2P-content networks
Highlights
► Content summarisation for resource discovery in large-scale Computational Grids.
► A peer-to-peer-based communication protocol provides very high scalability.
► Network awareness reduces the job processing time through the discovery of close nodes.
► A content summarisation technique reduces lookup time.
► A precision–recall metric evaluates the quality of the discovery.
Introduction
Large-scale information systems have gained in importance over recent years, often dealing with many concurrent users and managing an increasing amount of information. Different techniques address the scalability challenges that arise within such systems. First, using Peer-to-Peer (P2P) systems provides scalability in terms of the number of participants accessing such systems. Second, summarisation techniques help reduce the amount of information exchanged between system nodes. Thus, a combination of these two techniques is a promising solution for Grid-based Information Systems [1], [2] and Distributed Market Information Systems [3].
Many scientific applications involve not just a single task but the coordinated execution of multiple tasks, with data dependences existing between them. Such tasks may either be independent components or complete applications that require a combination of jobs. A task graph is generated in which vertices represent tasks, and arcs represent data dependences between tasks. An emphasis on the use of scientific workflow engines in computational science over recent years (such as Taverna [4] and Kepler [5] in the BioSciences) has led to combining the output of tasks that are executed across different (often distributed) platforms. For instance, an application deployed on an infrastructure such as the TeraGrid or EGEE/EGI may be distributed across multiple nodes and require data to be transferred between nodes across a network. Data transfer rates between tasks must therefore be an important criterion for meeting an overall application makespan when deciding where to place tasks across a distributed infrastructure. Data transfer rates are often ignored in resource discovery systems, which tend to look at the properties of a single task rather than the workflow in its entirety. In Grid system registries, for instance, resource properties are relatively stable, in that they will provide the same operating system and hardware configuration over a long time frame. However, the network connectivity between such resources may vary considerably, depending on the jobs being executed and the associated data transfer between them. Although new resources may also appear in such systems, queries to discover suitable resources to execute a job are likely to far outnumber the additions and removals of resources [6]. Thus, the efficiency of data retrieval is more important than the setup costs.
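To make the role of data transfer rates concrete, the following is a minimal sketch of a task graph whose makespan depends on where tasks are placed. It is an illustration only, not the model used in the paper: the task names, run times, data volumes, node names, and the `makespan` helper are all hypothetical, and it assumes that tasks never queue behind one another on a node and that co-located tasks exchange data at no cost.

```python
from functools import lru_cache

# Hypothetical workflow: vertices are tasks, arcs carry the data volume (MB)
# passed from one task to the next. All numbers are illustrative.
run_time = {"A": 10.0, "B": 20.0, "C": 15.0, "D": 5.0}   # seconds
arcs = {("A", "B"): 100.0, ("A", "C"): 400.0,
        ("B", "D"): 50.0, ("C", "D"): 50.0}              # MB


def makespan(placement, bandwidth):
    """Critical-path length when each task runs on placement[task].

    bandwidth[(n1, n2)] is the assumed transfer rate in MB/s between two
    distinct nodes (key sorted alphabetically); tasks on the same node
    exchange data at no cost.
    """
    # Collect the predecessors of every task together with the data volume.
    preds = {task: [] for task in run_time}
    for (src, dst), megabytes in arcs.items():
        preds[dst].append((src, megabytes))

    @lru_cache(maxsize=None)
    def finish(task):
        # A task starts once the data of all predecessors has arrived.
        start = 0.0
        for src, megabytes in preds[task]:
            n1, n2 = placement[src], placement[task]
            if n1 == n2:
                transfer = 0.0
            else:
                transfer = megabytes / bandwidth[tuple(sorted((n1, n2)))]
            start = max(start, finish(src) + transfer)
        return start + run_time[task]

    return max(finish(task) for task in run_time)


bandwidth = {("n1", "n2"): 10.0}  # MB/s, assumed
close = {"A": "n1", "B": "n1", "C": "n1", "D": "n1"}
spread = {"A": "n1", "B": "n1", "C": "n2", "D": "n1"}
print(makespan(close, bandwidth), makespan(spread, bandwidth))
```

With these assumed numbers, keeping all tasks on one node gives a makespan of 35 s, while moving task C behind a 10 MB/s link raises it to 75 s, even though no task's own properties changed.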
This paper proposes a technique based on data summarisation that can take account of data transfer rates between jobs. The main contribution of this paper is the application of Cobweb clustering to generate summaries of resource properties, thereby supporting scalability of P2P-based content networks. Scalability is achieved by providing efficient network-aware data discovery and reducing the time and number of messages that are exchanged between P2P nodes. This paper analyses the behaviour of the proposed mechanism using simulations with up to half a million resources, each having several attributes that are based on PlanetLab evaluations.
Summarisation techniques have been used in several research areas, for instance, database management [7], video coding [8], [9], [10], [11], or visualisation of web pages [12]. These applications demonstrate the usefulness of reducing the amount of information (but not the quality of such information) when performing several tasks.
The remainder of this paper is organised as follows. Section 2 presents the background of content networks, and Section 3 explains the Cobweb-based data summarisation technique. Afterwards, the system architecture, which utilises summaries to discover resources, is introduced in Section 4. The system architecture includes a summary-based resource discovery technique with an extension of network awareness. The evaluation of the efficiency and scalability of the network-aware summarisation environment is presented in Section 5. Section 6 compares the proposed network-aware summarisation for large-scale information systems with the existing related work. Finally, the conclusions and the future work are covered in Section 7.
Content networks
Nodes in a content network may perform the routing of messages and also store content. In order to allow nodes to store content efficiently, two steps are necessary: content aggregation and content placement. Content aggregation is the process of grouping content based on common features. According to [13], content aggregation involves mapping and aggregation grouping. The first, mapping, maps content to a value in some value space. The second, aggregation grouping, groups content based on
Summarisation technique
The summarisation technique is based on a clustering algorithm called Cobweb [19], which is an incremental system for hierarchical conceptual clustering. The system carries out a hill-climbing search through a space of hierarchical classification schemes using operators that enable bidirectional travel through the space. Cobweb uses a heuristic measure called category utility to guide the search. Gluck and Corter [20] originally developed this metric as a means of predicting the basic level in
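As an illustration of the category utility measure, the sketch below computes the standard formulation for nominal attributes: the expected number of attribute values that can be guessed correctly given the clusters, minus the same quantity without them, averaged over the number of clusters. This is an assumption-laden sketch rather than the paper's implementation: the variant used by the authors (for example, for numeric attributes) is not visible in this excerpt, and the function names and sample resources are hypothetical.

```python
from collections import Counter


def expected_correct_guesses(instances):
    """Sum over attributes and values of P(attribute = value) squared."""
    n = len(instances)
    total = 0.0
    for attr in instances[0]:
        counts = Counter(inst[attr] for inst in instances)
        total += sum((c / n) ** 2 for c in counts.values())
    return total


def category_utility(clusters):
    """Category utility of a partition given as a list of non-empty clusters.

    Each cluster is a list of instances; each instance is a dict that maps
    the same attribute names to nominal values.
    """
    everything = [inst for cluster in clusters for inst in cluster]
    n = len(everything)
    baseline = expected_correct_guesses(everything)
    gain = sum(
        (len(cluster) / n) * (expected_correct_guesses(cluster) - baseline)
        for cluster in clusters
    )
    return gain / len(clusters)


# Hypothetical resource descriptions.
linux = {"os": "linux", "arch": "x86"}
windows = {"os": "windows", "arch": "arm"}

homogeneous = [[linux, linux], [windows, windows]]
mixed = [[linux, windows], [linux, windows]]
print(category_utility(homogeneous), category_utility(mixed))
```

For these four sample resources, grouping identical resources together yields a category utility of 0.5, whereas the mixed partition yields 0.0, which is why a hill-climbing search guided by this measure prefers the first grouping.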
System architecture
This paper analyses read-dominated data [26], which means that queries for an attribute are more frequent than updates of that attribute. For example, a resource within a Grid system may have the same configuration for several days or months. However, a large number of jobs could be submitted to it within minutes. Therefore, the main objective is to reduce the lookup costs associated with the discovery of a suitable resource to execute a batch of jobs. After the initialisation process, we
Evaluation
This section evaluates the previously presented resource discovery algorithms with regard to the summary sizes and the number of messages and hops needed to obtain the results (presented in Section 5.1); the precision of the obtained results compared with the discovery costs (explained in Section 5.2); and the network costs of the overall workflow compared with baseline experiments (depicted in Section 5.3).
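For reference, precision and recall of a discovery result can be computed as in the sketch below. It uses the usual set-based definitions; the function name, the resource identifiers, and the convention of returning 0.0 for an empty set are assumptions made for illustration and are not taken from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers returned by the discovery mechanism.
    relevant:  identifiers that actually satisfy the query.
    Empty inputs yield 0.0 by convention (an assumption of this sketch).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: four resources returned, three truly matching.
precision, recall = precision_recall({"r1", "r2", "r3", "r4"},
                                     {"r1", "r2", "r5"})
print(precision, recall)
```

In this example two of the four returned resources are relevant, so precision is 0.5, and two of the three relevant resources were found, so recall is 2/3. Summaries that discard detail tend to trade precision against the number of messages needed, which is the comparison Section 5.2 refers to.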
Our simulator is built on top of Pastry [16], a well-known structured P2P overlay. We
Related work
Several systems have been developed for resource discovery in distributed systems over the years, some of which have been reviewed in [2]. In Grid systems, one of the most popular is the Globus Monitoring and Discovery System (MDS) [1]. MDS allows users to discover which resources are considered part of a Virtual Organisation (VO) and to monitor those resources. However, most resource discovery systems are limited in their scalability. The architecture of the presented system improves
Conclusions
This paper presents a network-aware summarisation technique for efficient information retrieval within large-scale P2P-content networks in terms of message size, number of messages, maximum retrieval time and network dependences. Using a completely decentralised Grid system as an example scenario, we simulate an environment with up to half a million randomly distributed resources. The Cobweb-based summary tree allows us to significantly reduce the number of disseminated messages.
René Brunner has been a Ph.D. student at the Technical University of Catalonia, Spain, since 2007.
References (36)
- et al., A fuzzy video content representation for video summarization and content-based retrieval, Signal Processing (2000)
- et al., A super-peer model for resource discovery services in large-scale grids, Future Generation Computer Systems (2005)
- et al., Exploiting semantic proximities for content search over P2P networks, Computer Communications (2009)
- K. Czajkowski, C. Kesselman, S. Fitzgerald, I.T. Foster, Grid information services for distributed resource sharing,...
- P. Trunfio, D. Talia, P. Fragopoulou, C. Papadakis, M. Mordacchini, M. Pennanen, K. Popov, V. Vlassov, S. Haridi,...
- R. Brunner, F. Freitag, L. Navarro, Towards the development of a decentralized market information system: requirements...
- et al., Taverna: lessons in creating a workflow environment for the life sciences: research articles, Concurrency and Computation: Practice and Experience (2006)
- et al., Flexible scientific workflow modeling using frames, templates, and dynamic embedding, Scientific and Statistical Database Management (2008)
- et al., Service selection and workflow mapping for grids: an approach exploiting quality-of-service information, Concurrency and Computation: Practice and Experience (2009)
- R. Saint-Paul, G. Raschia, N. Mouaddib, Database summarization: the SaintEtiQ system, in: Proc. of the 23rd Intl....
- Pictures of relevance: a geometric analysis of similarity measures, Journal of the American Society for Information Science
Agustín C. Caminero has been a postdoctoral researcher at the UNED, Madrid, Spain, since 2010.
Omer F. Rana is a Reader in the School of Computer Science at Cardiff University, Wales, UK.
Felix Freitag is a full-time adjunct lecturer at the Computer Architecture Department of the Technical University of Catalonia, Spain.
Leandro Navarro is an associate professor at the Computer Architecture Department of the Technical University of Catalonia, Spain.