A systematic approach to the integration of overlapping partitions in service-oriented data grids
Research highlights
► A new operator is added to the algebra of OGSA-DQP to handle union of partitions.
► DQP Federation Description is extended to facilitate execution of new operator.
► UnionPartitions operator provides a convenient way of handling overlapping partitions.
Introduction
As information resources have proliferated over the Internet in recent years, information integration has become both more crucial and more challenging for a growing range of applications that aim to provide an integrated view over distributed and heterogeneous resources. The challenge has many dimensions: coping with various kinds of heterogeneity (in data model, platform and access interface); coping with various forms of data distribution and maintenance policies; scalability; performance; security and trust; reliability and resilience; access control; auditing and accounting; and legal issues. Each of these dimensions deserves a separate thread of research, and each has indeed drawn enough attention to produce a considerable body of literature [1], [2], [3], [4], [5], [6], [7], [8], [9]. The challenge most relevant to the work presented in this paper is coping with various forms of data distribution and maintenance policies. In a typical data integration scenario over the Internet, the hosting and provision of data sources span multiple administrative domains. The relationships between those data sources, and the methodologies for accessing and integrating data across them, are not always straightforward. For instance, the distributed data sources may be replicas or partitions of the same database, rather than logically distinct but related data segments with different schemas. The case where distributed data sources hold replicated or partitioned data sections is of particular importance for the work presented in this paper.
More specifically, the problem we tackle is the case where a database is initially distributed to multiple independent administrative domains with the same data content (i.e., as replicas), but where parts of those copies evolve into overlapping partitions over time through independent data insertions carried out within each administrative domain. In this scenario, the replicas are not created to support faster query execution, but rather as a consequence of administrative policies. This may not be a particularly common case for distributed database applications; however, it is a practical requirement for our target application area, where a set of pre-defined simulation scenarios is distributed to multiple institutions and users add new scenarios to their local set, causing the local set to grow [10]. This effectively results in partitioned data with overlapping sections of various proportions (due to the initial replication process). It is an interesting variation that combines replicated and partitioned data within the same data management framework. Thus, the data management infrastructure has to deal with specific challenges regarding the identification, access and aggregation of partitioned data with varying proportions of overlapping sections.
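The scenario above can be illustrated with a minimal sketch. It is not the operator described in this paper: the row layout, the `id` key and the function name `union_partitions` are hypothetical, and the sketch assumes that rows sharing a key are identical copies left over from the initial replication.

```python
from typing import Dict, Iterable, List

# Hypothetical row type: a simulation scenario identified by an "id" field.
Row = Dict[str, object]


def union_partitions(partitions: Iterable[List[Row]], key: str = "id") -> List[Row]:
    """Union horizontally partitioned rows, keeping the first row seen per key.

    Assumption: rows with the same key are identical replicas, so dropping
    later copies loses no information.
    """
    seen = set()
    merged: List[Row] = []
    for partition in partitions:
        for row in partition:
            k = row[key]
            if k not in seen:
                seen.add(k)
                merged.append(row)
    return merged


# Initial content replicated to every administrative domain.
initial = [{"id": 1, "name": "baseline"}, {"id": 2, "name": "night"}]

# Each domain then inserts its own scenarios locally, so the copies
# become overlapping partitions of one logical table.
site_a = initial + [{"id": 101, "name": "site-a-custom"}]
site_b = initial + [{"id": 201, "name": "site-b-custom"}]

integrated = union_partitions([site_a, site_b])
print([row["id"] for row in integrated])  # [1, 2, 101, 201]
```

A plain concatenation of the two sites would return the replicated rows twice. Eliminating duplicates by key yields each scenario once, whatever proportion of the partitions overlaps.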
Although data partitioning is a well-explored domain and is supported by mainstream DBMS vendors as an advanced feature, issues regarding the distribution of partitions across separate administrative domains over a wide area network (such as the Internet or a Data Grid) in a service-oriented setting have received much less attention. We elaborate on the details of the problem domain in Section 3.
We tackle this challenge by devising a well-defined data integration mechanism that addresses the requirements specific to the problem at hand, and by embedding that mechanism into an appropriate data integration middleware as a first-class construct. To establish the principles that delineate our approach, we argue that:
1. The data integration mechanism should be explicitly expressed in the specification of the overall behavior of the middleware at a certain level of abstraction.
2. The run-time behavior details of the mechanism should be fully transparent to the end user, and largely transparent to the high-level application developer.
3. The mechanism should be able to handle relatively unusual forms of data partitions such as those described above (e.g., not only disjoint partitions).
The rest of the paper is organized as follows. Section 2 presents related work from the literature, together with an overview of the middleware extended by our work. Section 3 gives a more detailed account of the problem we aim to solve. Section 4 describes our approach to the solution and some details of the implementation. Section 5 contains the performance evaluation results for the new mechanisms. Finally, Section 6 discusses conclusions and plans for further improvements.
Multi-node horizontal partitioning in distributed environments
Database partitioning is a relatively advanced topic in the data management area, used mostly to support applications that require high performance or high availability for large volumes of data. Mainstream commercial DBMS vendors provide solutions for various kinds of partitioning techniques such as range partitioning, list partitioning, hash partitioning or a combination of those. The selection of the appropriate technique would depend on the characteristics of the data or the primary purpose of the
Problem definition
We now define the problem we aim to solve in more detail. In the application domain, we operate a database containing pre-defined simulation scenarios, which is distributed as replicas to multiple geographically dispersed administrative domains. Due to administrative policies, normal users can add new scenarios, or modify or delete existing ones, only in their local databases, so common replica management policies are not applicable. However, some more privileged users can issue
Our approach to the solution
Before delving into the details, we illustrate the added value our solution offers using a simple example. We assume that a table is 3-way partitioned, where two of the partitions overlap and the third is disjoint from the others. We list three alternative methods of integrating these partitions, ranging from the most ad hoc to the fully transparent one representing our solution:
1. Assuming the client (the query constructor) is fully aware of the
Experimental results
The performance of the ‘UnionPartitions’ operator is evaluated along two lines: first, to determine whether the new operator incurs a significant overhead; and second, to illustrate the added value of the extensions in querying overlapping and/or disjoint partitions. With these aims, three experiments are carried out using the queries given in Fig. 5, Fig. 7 and Fig. 9. For the experiments, different datasets are generated arbitrarily using ‘DBMonster’ which is a tool that is
Conclusion
In this paper we presented an extension to OGSA-DAI DQP, a well-known data access and integration middleware with distributed query processing facilities, which incorporates the ‘UnionPartitions’ operator into its algebra in order to cope with various unusual forms of horizontally partitioned databases. To summarize, our solution extends OGSA-DAI DQP in two respects:
1. A new operator type is added to the algebra to handle the union of partitions with different characteristics.
2. OGSA-DAI DQP
Acknowledgement
This work is partially supported by the State Planning Organization under the Office of the Prime Minister of the Turkish Government, with grant number 2008K010995. We are grateful for that support.
References
- et al., The design and implementation of OGSA-DQP: a service-based distributed query processor, Future Generation Computer Systems (2009)
- Minimizing the response time of executing a join between fragmented relations in a distributed database system, Mathematical and Computer Modelling (1997)
- et al., State of the nation in data integration for bioinformatics, Journal of Biomedical Informatics (2008)
- et al., A service-oriented system for distributed data querying and integration on grids, Future Generation Computer Systems (2009)
- et al., MAPFS-DAI, an extension of OGSA-DAI based on parallel file system, Future Generation Computer Systems (2007)
- et al., Self-monitoring query execution for adaptive query processing, Data & Knowledge Engineering (2004)
- I. Foster, J. Vöckler, M. Wilde, Y. Zhao, The virtual data grid: a new model and architecture for data-intensive...
- et al., Giggle: a framework for constructing scalable replica location services
- et al., Advanced grid database management with the GRelC data access service
- Globus Toolkit Software. http://www.globus.org (accessed...
- The AMGA metadata service, Journal of Grid Computing
- A metadata catalog service for data intensive applications
- Project Spitfire - towards grid web service databases
- An experience report on designing and building OGSA-DQP: a service based distributed query processor for the grid
- Using OGSA-DQP to support scientific applications for the grid
- OGSA-DQP: a service for distributed querying on the grid
- Optimization of join operations in horizontally partitioned database systems, ACM Transactions on Database Systems
- Set query optimization in distributed database systems, ACM Transactions on Database Systems
- Optimizing equijoin queries in distributed databases where relations are hash partitioned, ACM Transactions on Database Systems
- A heuristic approach to cost-efficient derived horizontal fragmentation of complex value databases
- Fragmenting relations horizontally using a knowledge-based approach, IEEE Transactions on Software Engineering
- Description and identification of distributed fragments of recursive relations, IEEE Transactions on Knowledge and Data Engineering
H. Kevser Sunercan: received her B.Sc. (2006) and M.Sc. (2010) in Computer Engineering from Middle East Technical University (METU), Turkey. She worked at MILSOFT A.S. as part-time and later full-time Software Engineer. Currently, she is working as a full-time researcher at Software Infrastructures Department of TUBITAK BILGEM UEKAE/ILTAREN, Ankara, Turkey.
M. Nedim Alpdemir: received his B.Sc. (1990) in Computer Engineering from Middle East Technical University (METU), Turkey; M.Sc. (1996) in Advanced Computer Science and Ph.D. (2000) in Component-Based Simulation Environments from the Department of Computer Science, University of Manchester, UK. He worked as a Research Associate, and later as a Research Fellow, in the Information Management Group (IMG) at the Department of Computer Science of the University of Manchester, UK, until 2005, taking part in the development of OGSA-DQP, a service-based distributed query processor. Currently he is the head of the Software Infrastructures Department at TUBITAK BILGEM UEKAE/ILTAREN, Ankara, Turkey.
Nihan Kesim Cicekli: is an Associate Professor in the Department of Computer Engineering at the Middle East Technical University, Ankara, Turkey. She received her B.Sc. degree in Computer Engineering at the Middle East Technical University in 1986. She received the M.Sc. degree in Computer Engineering at Bilkent University in Ankara in 1988; and the Ph.D. degree in Computer Science at Imperial College, London, UK in 1993. She was a visiting associate professor at the University of Central Florida, Orlando, USA, from 2001 till 2003. Her current research interests include multimedia databases, semantic web, web services, workflow management systems, recommender systems and temporal reasoning. She is a member of IEEE. She served on the program committee of several international conferences including VLDB and ICDE. For more information about her, see http://www.ceng.metu.edu.tr/~nihan.