Network-aware summarisation for resource discovery in P2P-content networks

doi:10.1016/j.future.2011.03.004

Future Generation Computer Systems

Volume 28, Issue 3, March 2012, Pages 563-572

https://doi.org/10.1016/j.future.2011.03.004

Abstract

Many application scenarios exhibit read-dominated information provision: updates are few, while users execute frequent queries for information discovery. Discovery of content in such systems could benefit from the use of summary techniques, which facilitate the discovery process and minimise the size of the data exchanged. This is particularly applicable within Grid computing environments, where nodes (each representing a group of computational and storage resources), which enable jobs to reach them, may need to exchange metadata with each other about their resources. We introduce network-aware summarisation algorithms for resource discovery in P2P-content networks, which are based on Cobweb clustering. We identify how summarisation can improve the discovery process while also improving the accuracy of the discovered resource(s). Metrics based on precision–recall are used to compare the accuracy for specific types of queries generated over the summarised content.

Highlights

► Content summarisation for resource discovery in large-scale Computational Grids.
► Using a peer-to-peer-based communication protocol to provide very high scalability.
► Network awareness reduces the job processing time through the discovery of close nodes.
► Reduced lookup time by applying a content summarisation technique.
► Precision–recall metric to evaluate the quality of the discovery.

Introduction

Large-scale information systems have gained in importance over recent years, often dealing with many concurrent users and managing an increasing amount of information. Different techniques address the scalability challenges that arise within such systems. First, using Peer-to-Peer (P2P) systems provides scalability in terms of the number of participants accessing such systems. Second, summarisation techniques help reduce the amount of information exchanged between system nodes. Thus, a combination of these two techniques is a promising solution for Grid-based Information Systems [1], [2] and Distributed Market Information Systems [3].

Many scientific applications involve not just a single task but the coordinated execution of multiple tasks, with data dependences existing between them. Such tasks may be either independent components or complete applications that require a combination of jobs. A task graph is generated in which vertices represent tasks and arcs represent data dependences between tasks. An emphasis on the use of scientific workflow engines in computational science over recent years (such as Taverna [4] and Kepler [5] in the BioSciences) has led to combining the output of tasks that are executed across different (often distributed) platforms. For instance, an application running on an infrastructure such as the TeraGrid or EGEE/EGI may be distributed across multiple nodes of the Grid infrastructure and require data to be transferred between nodes across a network. Taking account of data transfer rates between tasks must therefore be an important criterion for meeting an overall application makespan when considering where to place tasks across a distributed infrastructure. Data transfer rates are ignored in resource discovery systems, which often look at the properties of a single task rather than the workflow in its entirety. In Grid system registries, for instance, resource properties are relatively stable, in that they will provide the same operating system and hardware configuration over a long time frame. However, the network connectivity between such resources may vary considerably, depending on the jobs being executed and the associated data transfer between them. Although new resources may also appear in such systems, queries to discover suitable resources to execute a job are likely to be far more numerous than the resources being added to or removed from the system [6]. Thus, the efficiency of data retrieval is more important than the cost of setup.
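To make the role of data transfer rates concrete, the following minimal Python sketch computes the makespan of a small task graph under two different placements. It is an illustration only and not the model used in this paper: the function name `makespan`, the node names, the run times, the data volumes and the bandwidth figures are all hypothetical, and the sketch assumes that tasks placed on the same node can run concurrently and exchange data at no cost.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def makespan(run_time, deps, placement, bandwidth):
    """Finish time of the last task in a task graph.

    run_time:  task -> execution time in seconds
    deps:      task -> {predecessor task: data volume in MB}
    placement: task -> node hosting the task
    bandwidth: (source node, destination node) -> MB/s
    Assumption: no contention on nodes or links.
    """
    # TopologicalSorter expects a mapping from each task to its predecessors.
    graph = {task: set(deps.get(task, {})) for task in run_time}
    finish = {}
    for task in TopologicalSorter(graph).static_order():
        ready = 0.0
        for pred, data_mb in deps.get(task, {}).items():
            src, dst = placement[pred], placement[task]
            # Data already on the same node needs no network transfer.
            transfer = 0.0 if src == dst else data_mb / bandwidth[(src, dst)]
            ready = max(ready, finish[pred] + transfer)
        finish[task] = ready + run_time[task]
    return max(finish.values())


# Hypothetical workflow: task C consumes the outputs of tasks A and B.
run_time = {"A": 10, "B": 20, "C": 5}
deps = {"C": {"A": 100, "B": 200}}
bandwidth = {("n1", "n2"): 10.0, ("n1", "n3"): 1.0}

near = makespan(run_time, deps, {"A": "n1", "B": "n1", "C": "n2"}, bandwidth)
far = makespan(run_time, deps, {"A": "n1", "B": "n1", "C": "n3"}, bandwidth)
```

With identical task properties, placing C on the well-connected node gives a much shorter makespan than placing it on the poorly connected one, which is the effect a discovery mechanism that looks only at single-task properties cannot capture.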

This paper proposes a technique based on data summarisation that can take account of data transfer rates between jobs. The main contribution of this paper is the application of Cobweb clustering to generate summaries of resource properties, thereby supporting scalability of P2P-based content networks. The scalability is achieved by providing efficient network-aware data discovery and by reducing the time and the number of messages exchanged between P2P nodes. This paper analyses the behaviour of the proposed mechanism using simulations with up to half a million resources, each having several attributes that are based on PlanetLab evaluations.

Summarisation techniques have been used in several research areas, for instance, database management [7], video coding [8], [9], [10], [11], or visualisation of web pages [12]. These applications demonstrate the usefulness of reducing the amount of information (but not the quality of such information) when performing several tasks.

The remainder of this paper is organised as follows. Section 2 presents the background of content networks, and Section 3 explains the Cobweb-based data summarisation technique. Afterwards, the system architecture, which utilises summaries to discover resources, is introduced in Section 4. The system architecture includes a summary-based resource discovery technique with an extension of network awareness. The evaluation for a network-aware summarisation environment that is efficient and scalable is presented in Section 5. Section 6 compares the proposed network-aware summarisation for large-scale information systems with the existing related work. Finally, the conclusions and the future work are covered in Section 7.

Section snippets

Content networks

Nodes in a content network may perform the routing of messages and also store content. In order to allow nodes to store content efficiently, two steps are necessary: content aggregation and content placement. Content aggregation is the process of grouping content based on common features. According to [13], content aggregation involves mapping and aggregation grouping. The first, mapping, maps content to a value in some value space. The second, aggregation grouping, groups content based on

Summarisation technique

The summarisation technique is based on a clustering algorithm called Cobweb [19], which is an incremental system for hierarchical conceptual clustering. The system carries out a hill-climbing search through a space of hierarchical classification schemes using operators that enable bidirectional travel through the space. Cobweb uses a heuristic measure called category utility to guide the search. Gluck and Corter [20] originally developed this metric as a means of predicting the basic level in
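As a rough illustration of the category utility measure that guides Cobweb's search, the Python sketch below evaluates a partition of resources described by nominal attributes, using the standard definition CU = (1/n) Σ_k P(C_k) [Σ_i Σ_j P(A_i = V_ij | C_k)² − Σ_i Σ_j P(A_i = V_ij)²]. It is a sketch under stated assumptions and not the implementation used in this paper: the function names, the attribute names (`os`, `arch`) and the example resources are hypothetical, and numeric attributes, which would need a different probability estimate, are not handled.

```python
from collections import Counter


def expected_correct_guesses(instances):
    """Sum over attributes i and values j of P(A_i = V_ij)^2 within `instances`."""
    if not instances:
        return 0.0
    n = len(instances)
    total = 0.0
    for attr in instances[0]:
        counts = Counter(inst[attr] for inst in instances)
        total += sum((c / n) ** 2 for c in counts.values())
    return total


def category_utility(clusters):
    """Category utility of a partition given as a list of clusters.

    Each cluster is a list of dicts mapping attribute name -> nominal value.
    """
    everything = [inst for cluster in clusters for inst in cluster]
    baseline = expected_correct_guesses(everything)
    gain = 0.0
    for cluster in clusters:
        weight = len(cluster) / len(everything)  # P(C_k)
        gain += weight * (expected_correct_guesses(cluster) - baseline)
    return gain / len(clusters)


# Hypothetical resources with two nominal attributes.
linux = {"os": "linux", "arch": "x86"}
windows = {"os": "windows", "arch": "arm"}

homogeneous = [[linux, linux], [windows, windows]]
mixed = [[linux, windows], [linux, windows]]

cu_homogeneous = category_utility(homogeneous)
cu_mixed = category_utility(mixed)
```

A partition that groups identical resources together scores higher than one that mixes them, so a hill-climbing search that maximises this measure favours clusters whose members can be described by a compact summary.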

System architecture

This paper analyses read-dominated data [26], which means that queries for an attribute are more frequent than updates to it. For example, a resource within a Grid system may have the same configuration for several days or months. However, a large number of jobs could be submitted to it within minutes. Therefore, the main objective is to reduce the lookup costs associated with the discovery of a suitable resource to execute a batch of jobs. After the initialisation process, we

Evaluation

This section evaluates the previously presented resource discovery algorithms with regard to: the summary sizes and the number of messages and hops needed to obtain the results (presented in Section 5.1); the precision of the obtained results compared with discovery costs (explained in Section 5.2); and the network costs of the overall workflow compared with baseline experiments (depicted in Section 5.3).
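For readers unfamiliar with the precision–recall metrics referred to above, the following short Python sketch shows how they can be computed for a single query. It uses the standard definitions only; the function name `precision_recall` and the resource identifiers are hypothetical, and the exact query types and ground truth used in the evaluation are not reproduced here.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one query result.

    retrieved: resources returned by the discovery process
    relevant:  resources that actually satisfy the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four resources match, three are returned, two correctly.
relevant = {"r1", "r2", "r3", "r4"}
retrieved = {"r1", "r2", "r5"}
precision, recall = precision_recall(retrieved, relevant)
```

Precision penalises resources returned from a summary that do not actually match the query, while recall penalises matching resources that the summary hides, so the two together indicate how much accuracy is traded for smaller messages.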

Our simulator is built on top of Pastry [16], a well-known structured P2P overlay. We

Related work

Several systems have been developed for resource discovery in distributed systems over the years, some of which have been reviewed in [2]. In Grid systems, one of the most popular is Globus Monitoring and Discovery System (MDS) [1]. MDS allows users to discover what resources are considered part of a Virtual Organisation (VO) and to monitor those resources. However, most of the resource discovery systems are limited in their scalability. The architecture of the presented system improves

Conclusions

This paper presents a network-aware summarisation technique for efficient information retrieval within large-scale P2P-content networks in terms of message size, number of messages, maximum retrieval time and network dependences. Using a completely decentralised Grid system as an example scenario, we simulate an environment with up to half a million randomly distributed resources. The Cobweb-based summary tree allows us to significantly reduce the number of disseminated messages.

References (36)

A.D. Doulamis et al., A fuzzy video content representation for video summarization and content-based retrieval, Signal Processing (2000)
C. Mastroianni et al., A super-peer model for resource discovery services in large-scale grids, Future Generation Computer Systems (2005)
N.D. Doulamis et al., Exploiting semantic proximities for content search over P2P networks, Computer Communications (2009)
K. Czajkowski, C. Kesselman, S. Fitzgerald, I.T. Foster, Grid information services for distributed resource sharing,...
P. Trunfio, D. Talia, P. Fragopoulou, C. Papadakis, M. Mordacchini, M. Pennanen, K. Popov, V. Vlassov, S. Haridi,...
R. Brunner, F. Freitag, L. Navarro, Towards the development of a decentralized market information system: requirements...
T. Oinn et al., Taverna: lessons in creating a workflow environment for the life sciences: research articles, Concurrency and Computation: Practice and Experience (2006)
A. Ngu et al., Flexible scientific workflow modeling using frames, templates, and dynamic embedding, Scientific and Statistical Database Management (2008)
D. Kyriazis et al., Service selection and workflow mapping for grids: an approach exploiting quality-of-service information, Concurrency and Computation: Practice and Experience (2009)
R. Saint-Paul, G. Raschia, N. Mouaddib, Database summarization: the SaintEtiQ system, in: Proc. of the 23rd Intl....
G. Evangelopoulos, A. Zlatintsi, G. Skoumas, K. Rapantzikos, A. Potamianos, P. Maragos, Y. Avrithis, Video event...
S. Liu, M.X. Zhou, S. Pan, W. Qian, W. Cai, X. Lian, Interactive, topic-based visual text summarization and analysis,...
D. Simakov, Y. Caspi, E. Shechtman, M. Irani, Summarizing visual data using bidirectional similarity, in: Proc. of the...
B. Jiao, L. Yang, J. Xu, F. Wu, Visual summarization of web pages, in: Proc. of the 33rd Intl. Conference on Research...
H.T. Kung et al.
R. Goldman, N. Shivakumar, S. Venkatasubramanian, H. Garcia-Molina, Proximity search in databases, in: Proc. of the...
W.P. Jones et al., Pictures of relevance: a geometric analysis of similarity measures, Journal of the American Society for Information Science (1987)
A. Rowstron, P. Druschel, Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer...

René Brunner has been a Ph.D. student at the Technical University of Catalonia, Spain, since 2007.

Agustín C. Caminero has been a postdoctoral researcher at the UNED, Madrid, Spain, since 2010.

Omer F. Rana is a Reader in the School of Computer Science at Cardiff University, Wales, UK.

Felix Freitag is a full-time adjunct lecturer at the Computer Architecture Department of the Technical University of Catalonia, Spain.

Leandro Navarro is an associate professor at the Computer Architecture Department of the Technical University of Catalonia, Spain.
