A systematic approach to the integration of overlapping partitions in service-oriented data grids

doi:10.1016/j.future.2010.12.011

Future Generation Computer Systems

Volume 27, Issue 6, June 2011, Pages 667-680

https://doi.org/10.1016/j.future.2010.12.011

Abstract

This paper aims to provide a service-oriented data integration solution over data Grids for cases where distributed data sources are partitioned with overlapping sections of various proportions. This is an interesting variation that combines replicated and partitioned data within the same data management framework. The data management infrastructure therefore has to deal with specific challenges regarding the identification, access and aggregation of partitioned data with varying proportions of overlapping sections. To provide a solution, we have extended Open Grid Services Architecture-Data Access and Integration Distributed Query Processing (OGSA-DAI DQP), a well-known data access and integration middleware with distributed query processing facilities, by incorporating the new ‘UnionPartitions’ operator into its algebra in order to cope with various unusual forms of horizontally partitioned databases. Our solution extends OGSA-DAI DQP in two respects: (1) a new operator type is added to the algebra to handle the union of partitions with different characteristics, and (2) the OGSA-DAI DQP Federation Description is extended with additional metadata to facilitate the successful execution of the newly introduced operator.

Research highlights

► A new operator is added to the algebra of OGSA-DQP to handle the union of partitions.
► The DQP Federation Description is extended to facilitate execution of the new operator.
► The UnionPartitions operator provides a convenient way of handling overlapping partitions.

Introduction

As the proliferation of information resources over the Internet has gained momentum in recent years, information integration has become both more crucial and more challenging for a growing range of applications that aim to provide an integrated view over distributed and heterogeneous resources. The challenge has many dimensions: coping with various kinds of heterogeneity, including data model, platform and access interface heterogeneities; coping with various forms of data distribution and maintenance policies; scalability; performance; security and trust; reliability and resilience; access control; auditing and accounting; and legal issues. Each of these dimensions deserves a separate thread of research, and each has indeed drawn enough attention to produce a considerable body of literature [1], [2], [3], [4], [5], [6], [7], [8], [9]. The challenge most relevant to the work presented in this paper is coping with various forms of data distribution and maintenance policies. In a typical data integration scenario over the Internet, the enterprise of hosting data sources and providing data spans multiple administrative domains. The relationships between those data sources, and the methodologies for accessing and integrating data over them, are not always straightforward. For instance, the distributed data sources may be replicas or partitions of the same database, rather than logically distinct but related data segments with different schemas. The case where distributed data sources have replicated or partitioned data sections is of particular importance for the work presented in this paper.

To be more specific, the problem we have tackled is to handle cases where a database is initially distributed to multiple independent administrative domains with the same data content (i.e., as replicas), but where parts of those copies have evolved into overlapping partitions over time through independent data insertions carried out within each administrative domain. In this scenario, the replicas are not generated to support timelier query execution, but rather as a consequence of administrative policies. This may not be a particularly common case for distributed database applications; however, it is a practical requirement for our target application area, where a set of pre-defined simulation scenarios is distributed to multiple institutions and users add new scenarios to their local set, causing it to grow [10]. This effectively results in partitioned data with overlapping sections of various proportions (due to the initial replication process). This is an interesting variation that combines replicated and partitioned data within the same data management framework. The data management infrastructure therefore has to deal with specific challenges regarding the identification, access and aggregation of partitioned data with varying proportions of overlapping sections.
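
The scenario described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the site names, the scenario identifiers and the `overlap_ratio` helper are hypothetical, and each local database is modelled as a plain set of scenario keys rather than as a real table.

```python
# Hypothetical model: every site starts from the same replica of scenario IDs
# and then grows through independent local insertions.
initial_replica = {"s1", "s2", "s3", "s4"}

sites = {
    "site_a": set(initial_replica),
    "site_b": set(initial_replica),
}

# Independent insertions carried out within each administrative domain.
sites["site_a"] |= {"a1", "a2"}
sites["site_b"] |= {"b1"}


def overlap_ratio(left: set, right: set) -> float:
    """Fraction of the combined rows that both partitions hold."""
    combined = left | right
    return len(left & right) / len(combined) if combined else 0.0


# The shared section is exactly the initially replicated content.
shared = sites["site_a"] & sites["site_b"]
ratio = overlap_ratio(sites["site_a"], sites["site_b"])
print(sorted(shared), round(ratio, 3))
```

As each site keeps inserting locally, the shared section stays fixed while the combined set grows, which is one way the proportion of overlap comes to vary between pairs of sites.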

Although data partitioning is a well-explored domain and is supported by mainstream DBMS vendors as an advanced feature, issues regarding the distribution of the partitions into separate administrative domains over a wide area network (such as the Internet or a Data Grid) in a service-oriented setting have received much less attention. We elaborate on the details of the problem domain further in Section 3.

The way we tackle this particular challenge is to devise a well-defined data integration mechanism that addresses requirements specific to the problem at hand and embed that mechanism into an appropriate data integration middleware as a first class construct. To establish the principles that delineate our approach we argue that:

1. The data integration mechanism should be explicitly expressed in the specification of the overall behavior of the middleware at a certain level of abstraction.
2. The run-time behavior details of the mechanism should be fully transparent to the end user, and largely transparent to the high-level application developer.
3. The mechanism should be able to handle relatively unusual forms of data partitions such as those described above (e.g., not only disjoint partitions).

Our choice of data integration middleware (i.e., OGSA-DAI DQP) is largely influenced by the principles mentioned above and is, to a certain extent, a direct consequence of our previous work in this field [11], [12], [13]. The work presented in this paper extends OGSA-DAI DQP, a service-based middleware capable of executing data integration queries over highly distributed data resources in a data grid, so that the data resources may include horizontally partitioned databases while the access and integration procedure required to handle the partitioned data remains transparent to the query constructor (i.e., to the user who poses data integration queries using OGSA-DAI DQP). We present a detailed description of the extensions and report our findings on the performance of queries that use them. Since the extensions are non-disruptive to the architecture and the fundamental run-time execution mechanisms of OGSA-DAI DQP, and since the overall run-time characteristics of OGSA-DAI DQP are already well documented [14], [15], our performance experiments focus only on the impact of our extensions.

The rest of the paper is organized as follows. Section 2 presents related work from the literature, together with an overview of the middleware extended by our work. Section 3 gives a more detailed account of the problem we aim to solve. Section 4 describes our approach to the solution and some details of the implementation. Section 5 contains the performance evaluation results for the new mechanisms. Finally, Section 6 discusses conclusions and plans for further improvements.

Section snippets

Multi-node horizontal partitioning in distributed environments

Database partitioning is a relatively advanced topic in the data management area, used mostly to support applications that require high performance or high availability for large volumes of data. Mainstream commercial DBMS vendors provide solutions for various kinds of partitioning techniques such as range partitioning, list partitioning, hash partitioning or a combination of those. The selection of the appropriate technique would depend on the characteristics of the data or the primary purpose of the
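
The three partitioning techniques named in this snippet can be sketched as simple routing functions. This is a generic illustration, not any vendor's implementation: the function names, the boundary convention (exclusive upper bounds) and the choice of CRC32 as the hash are all assumptions made for the example.

```python
import zlib
from bisect import bisect_right


def range_partition(key: int, bounds: list[int]) -> int:
    """Route by range: `bounds` are sorted, exclusive upper bounds.

    Keys at or above the last bound fall into one extra, final partition.
    """
    return bisect_right(bounds, key)


def list_partition(value: str, groups: dict[str, int], default: int = -1) -> int:
    """Route by explicit value list; unknown values go to `default`."""
    return groups.get(value, default)


def hash_partition(key: str, n_partitions: int) -> int:
    """Route by a deterministic hash of the key (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


# Hypothetical usage: three range partitions, two list partitions, four hash buckets.
print(range_partition(50, [100, 200]))    # key below the first bound
print(range_partition(100, [100, 200]))   # key equal to the first bound
print(list_partition("TR", {"TR": 0, "UK": 1}))
print(hash_partition("scenario-42", 4))
```

Range and list partitioning keep related keys together, which helps when queries filter on the partitioning column, whereas hash partitioning spreads keys evenly without regard to their values.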

Problem definition

We now define the problem we aim to solve in more detail. In our application domain, a database containing pre-defined simulation scenarios is distributed as replicas to multiple, geographically dispersed administrative domains. Due to administrative policies, normal users can add new scenarios, or modify or delete existing ones, only in their local databases, so common replica management policies are not applicable. However, some more privileged users can issue

Our approach to the solution

Before delving into the details, we would like to illustrate the added value our solution offers using a simple example. We assume that a table $P$ is 3-way partitioned with partition identifiers $P_{1}$, $P_{2}$ and $P_{3}$, where $P_{1}$ and $P_{2}$ are overlapping partitions and $P_{3}$ is disjoint from the others. We list three alternative methods of integrating these partitions, ranging from the most ad hoc to the fully transparent one representing our solution:

1. Assuming the client (the query constructor) is fully aware of the
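
The set semantics of the example (overlapping $P_{1}$ and $P_{2}$, disjoint $P_{3}$) can be sketched as follows. This is not the UnionPartitions operator itself, whose details are not given in this snippet; the function name `union_partitions_sketch`, the row layout and the sample rows are hypothetical. The sketch assumes that rows from overlapping partitions need duplicate elimination, while rows from a disjoint partition can be appended without any check.

```python
from typing import Iterable, List, Tuple

Row = Tuple[int, str]  # hypothetical row layout: (key, description)


def union_partitions_sketch(
    overlapping: Iterable[List[Row]],
    disjoint: Iterable[List[Row]],
) -> List[Row]:
    """Combine partitions, removing duplicates only where overlap is possible."""
    seen = set()
    result: List[Row] = []
    # Overlapping partitions: keep the first copy of each row.
    for partition in overlapping:
        for row in partition:
            if row not in seen:
                seen.add(row)
                result.append(row)
    # Disjoint partitions: by definition they share no rows, so append as-is.
    for partition in disjoint:
        result.extend(partition)
    return result


p1 = [(1, "base"), (2, "base"), (10, "added at site 1")]
p2 = [(1, "base"), (2, "base"), (20, "added at site 2")]
p3 = [(30, "only in P3")]

p = union_partitions_sketch(overlapping=[p1, p2], disjoint=[p3])
print(p)
```

Skipping the duplicate check for disjoint partitions avoids unnecessary work, which is why it helps to know in advance which partitions overlap.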

Experimental results

The performance of the ‘UnionPartitions’ operator is evaluated along two lines: first, to determine whether the new operator incurs a significant overhead; and second, to illustrate the added value of the extensions in querying overlapping and/or disjoint partitions. To these ends, three experiments are carried out using the queries given in Fig. 5, Fig. 7 and Fig. 9. For the experiments, different datasets are generated arbitrarily using ‘DBMonster’ which is a tool that is

Conclusion

In this paper we presented an extension to OGSA-DAI DQP, a well-known data access and integration middleware with distributed query processing facilities, which incorporates the ‘UnionPartitions’ operator into its algebra in order to cope with various unusual forms of horizontally partitioned databases. To summarize, our solution extends OGSA-DAI DQP in two respects:

1. A new operator type is added to the algebra to handle the union of partitions with different characteristics.
2. OGSA-DAI DQP

Acknowledgement

This work is partially supported by the State Planning Organization under the Office of the Prime Minister of the Turkish Government, grant number 2008K010995. We are grateful for that support.

H. Kevser Sunercan: received her B.Sc. (2006) and M.Sc. (2010) in Computer Engineering from Middle East Technical University (METU), Turkey. She worked at MILSOFT A.S. as a part-time and later full-time Software Engineer. Currently, she is working as a full-time researcher at the Software Infrastructures Department of TUBITAK BILGEM UEKAE/ILTAREN, Ankara, Turkey.

References (60)

S. Lynden et al., The design and implementation of OGSA-DQP: a service-based distributed query processor, Future Generation Computer Systems (2009)
D.J. Reid, Minimizing the response time of executing a join between fragmented relations in a distributed database system, Mathematical and Computer Modelling (1997)
C. Goble et al., State of the nation in data integration for bioinformatics, Journal of Biomedical Informatics (2008)
C. Comito et al., A service-oriented system for distributed data querying and integration on grids, Future Generation Computer Systems (2009)
A. Sánchez et al., MAPFS-DAI, an extension of OGSA-DAI based on parallel file system, Future Generation Computer Systems (2007)
A. Gounaris et al., Self-monitoring query execution for adaptive query processing, Data & Knowledge Engineering (2004)
I. Foster, J. Vöckler, M. Wilde, Y. Zhao, The virtual data grid: a new model and architecture for data-intensive...
A. Chervenak et al., Giggle: a framework for constructing scalable replica location services
S. Fiore et al., Advanced grid database management with the greic data access service
Globus Toolkit Software. http://www.globus.org (accessed...
M.J. Litzkow, M. Livny, M.W. Mutka, Condor—a hunter of idle workstations, in: Proceedings of the 8th International...
GLite. http://glite.web.cern.ch/glite/ (accessed...
B. Koblitz et al., The AMGA metadata service, Journal of Grid Computing (2008)
G. Singh et al., A metadata catalog service for data intensive applications
W.H. Bell et al., Project spitfire-towards grid web service databases
Data repositories management component for simulation models (translated from Turkish), Software Design Document,...
M.N. Alpdemir et al., An experience report on designing and building OGSA-DQP: a service based distributed query processor for the grid
M.N. Alpdemir et al., Using OGSA-DQP to support scientific applications for the grid
M.N. Alpdemir et al., OGSA-DQP: a service for distributed querying on the grid
M.N. Alpdemir, A. Gounaris, A. Mukherjee, D. Fitzgerald, N.W. Paton, P. Watson, R. Sakellariou, A.A.A. Fernandes, J....
R. Schumacher, Improving database performance with partitioning....
A. Segev, Optimization of join operations in horizontally partitioned database systems, ACM Transactions on Database Systems (1986)
B. Gavish et al., Set query optimization in distributed database systems, ACM Transactions on Database Systems (1986)
Y. Kambayashi, M. Yoshikawa, Query processing utilizing dependencies and horizontal decomposition, in: Proceedings of...
D. Shasha et al., Optimizing equijoin queries in distributed databases where relations are hash partitioned, ACM Transactions on Database Systems (1991)
J. Chidambaram, C. Prabhu, P.A. Narasimho Rao, R. Wanker, C.S. Aneesh, A. Agarwal, A methodology for high availability...
R. Blankinship, A.R. Hevner, S.B. Yao, An iterative method for distributed database design, in: Proceedings of the 17th...
H. Ma et al., A heuristic approach to cost-efficient derived horizontal fragmentation of complex value databases
D.-G. Shin et al., Fragmenting relations horizontally using a knowledge-based approach, IEEE Transactions on Software Engineering (1991)
S. Pramanik et al., Description and identification of distributed fragments of recursive relations, IEEE Transactions on Knowledge and Data Engineering (1996)


M. Nedim Alpdemir: received his B.Sc. (1990) in Computer Engineering from Middle East Technical University (METU), Turkey; M.Sc. (1996) in Advanced Computer Science and Ph.D. (2000) in Component-Based Simulation Environments from the Department of Computer Science, University of Manchester, UK. He worked as a Research Associate, and later as a Research Fellow, in the Information Management Group (IMG) at the Department of Computer Science of the University of Manchester, UK, until 2005, taking part in the development of OGSA-DQP, a service-based distributed query processor. Currently he is the head of the Software Infrastructures Department at TUBITAK BILGEM UEKAE/ILTAREN, Ankara, Turkey.

Nihan Kesim Cicekli: is an Associate Professor in the Department of Computer Engineering at the Middle East Technical University, Ankara, Turkey. She received her B.Sc. degree in Computer Engineering at the Middle East Technical University in 1986. She received the M.Sc. degree in Computer Engineering at Bilkent University in Ankara in 1988; and the Ph.D. degree in Computer Science at Imperial College, London, UK in 1993. She was a visiting associate professor at the University of Central Florida, Orlando, USA, from 2001 till 2003. Her current research interests include multimedia databases, semantic web, web services, workflow management systems, recommender systems and temporal reasoning. She is a member of IEEE. She served on the program committee of several international conferences including VLDB and ICDE. For more information about her, see http://www.ceng.metu.edu.tr/~nihan.
