Wavelet transformation-based management of integrated summary data for distributed query processing

https://doi.org/10.1016/S0169-023X(01)00044-1

Abstract

As Internet technology evolves, there is a growing need for Internet queries involving multiple information sources. Efficient processing of such queries requires integrated summary data that compactly represents the data distribution of the entire database scattered over many information sources. We propose a new method based on the wavelet transform that creates and maintains the integrated summary data by merging multiple instances of summary data, each of which is maintained in an information source. Wavelet-based summary data is easily converted to satisfy the conditions for merging. Moreover, the merging process is very simple owing to the shifting and linearity properties of the wavelet transform. We formally derive the upper bounds of the absolute, square-root, and maximum errors in the integrated wavelet-based summary data. We also show that the integrated summary data can be used to optimize Internet queries effectively.

Introduction

As Internet technology evolves, searching databases through the Internet is becoming a common way of finding useful information [3]. This paper deals with effective processing of database queries in such environments.

Fig. 1 shows a general query processing model in the Internet environment [13]. The Internet contains a large number of information sources, each of which can process queries against its own database. Users issue Internet queries to predetermined information sources and receive the results through Web clients. In most earlier Internet applications, a query involved only a single information source. In recent applications, however, queries involving multiple information sources are becoming increasingly popular [8]. A typical example is finding goods that satisfy user-specified conditions across many Internet shopping malls.

To coordinate multiple information sources for processing such queries, we use a module called a mediator [13]. The mediator (1) receives an Internet query from a Web client, (2) selects the information sources to which the query will be issued, (3) translates the query into local queries to be run at specific information sources, (4) sends the local query to each information source, (5) merges the query results received, and (6) returns the integrated query results to the Web client [14]. Since Internet queries deal with an enormous volume of data, such queries must be processed effectively.
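The six mediator steps can be sketched in a few lines. This is only an illustration under stated assumptions: the names `make_source` and `mediate`, the price-limit query, and the in-memory "information sources" are all hypothetical and do not come from the paper. In particular, step (2) is reduced to a caller-supplied list, whereas the paper's point is that integrated summary data can inform that choice.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: an information source is a function that runs a
# local query (here, a maximum price) against its own rows.
Source = Callable[[float], List[dict]]

def make_source(name: str, rows: List[dict]) -> Source:
    """Build a toy information source over an in-memory list of rows."""
    def run_local_query(max_price: float) -> List[dict]:
        # Tag each matching row with the source it came from.
        return [dict(row, source=name) for row in rows if row["price"] <= max_price]
    return run_local_query

def mediate(max_price: float, sources: Dict[str, Source], selected: List[str]) -> List[dict]:
    """Steps (1)-(6): receive a query, fan it out, merge, and return results."""
    merged: List[dict] = []
    for name in selected:                       # step (2): sources chosen by the caller
        local_rows = sources[name](max_price)   # steps (3)-(4): translate and send
        merged.extend(local_rows)               # step (5): merge the local results
    return sorted(merged, key=lambda row: row["price"])  # step (6): integrated result

sources = {
    "mall_a": make_source("mall_a", [{"item": "lamp", "price": 30}, {"item": "desk", "price": 120}]),
    "mall_b": make_source("mall_b", [{"item": "lamp", "price": 25}]),
}
cheap_items = mediate(50, sources, ["mall_a", "mall_b"])
```

In this toy setup every selected source is contacted; a mediator holding summary data could instead skip sources that are unlikely to contribute results.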

Conventional DBMSs employ summary data for effective processing of queries. The summary data compactly represents the data distribution for one or more attributes in the database [1]. A data distribution is a set of pairs, each consisting of a value and its frequency. Summary data is widely used in selectivity estimation for query optimization [17] and in physical database design [17], [23]. Recently, applications of summary data have expanded to cover approximate query processing [21], on-line analytical processing (OLAP) [22], and top-k query processing [6].
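As a small illustration of these definitions, the sketch below builds a data distribution as (value, frequency) pairs and computes the exact selectivity of a range predicate from it. The function names and the sample attribute values are assumptions made for the example; summary data would approximate this distribution rather than store it in full.

```python
from collections import Counter

def data_distribution(values):
    """Return the data distribution as sorted (value, frequency) pairs."""
    return sorted(Counter(values).items())

def range_selectivity(distribution, low, high):
    """Fraction of records whose attribute value lies in [low, high]."""
    total = sum(freq for _, freq in distribution)
    matching = sum(freq for value, freq in distribution if low <= value <= high)
    return matching / total if total else 0.0

# Hypothetical attribute values (e.g. prices) stored in one database.
prices = [10, 10, 20, 30, 30, 30, 40]
dist = data_distribution(prices)
selectivity = range_selectivity(dist, 20, 30)  # 4 of the 7 records match
```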

Recently, there has been considerable research effort on summary data. The main focus has been on accurate approximation of the actual data distribution and on query processing using summary data in a single information source [7], [16], [19]. To the best of the authors' knowledge, optimizing Internet queries using summary data for multiple information sources has not been addressed in the literature. In this paper, we discuss issues in the management of integrated summary data and in processing Internet queries using such data. We define the integrated summary data as the summary data for the virtual database consisting of all the databases in the target information sources. In particular, for multiple data cubes in OLAP, we need summary data for the integrated data cube, which is a kind of integrated summary data. We assume that the mediator creates, stores, and manages the integrated summary data and processes Internet queries using it [14].

A simple brute-force method to create the integrated summary data would be to merge the actual data distributions from the individual information sources and then compress the merged distribution. However, this method would incur a high cost of transmitting, storing, and merging large volumes of data distributions. An alternative method is to merge the summary data from the individual information sources directly. This method would significantly reduce the cost of transmitting, storing, and merging, since summary data is much smaller than the data distributions themselves. Further, this method is viable since the summary data for each individual information source is already available. A possible disadvantage is that the integrated summary data may contain more error than summary data obtained by merging the data distributions themselves, because the original summary data for the individual information sources already contains error.

In this paper, we present a new method based on wavelet transformation for maintaining the integrated summary data using component summary data from individual information sources. We identify the conditions for merging and show that our method based on the wavelet transform can easily be made to satisfy these conditions. We show that wavelet-based summary data can be easily merged to create the integrated summary data owing to two properties of wavelet transformation: linearity and shifting. We also formally prove the bounds of the errors introduced in merging the wavelet-based summary data. In particular, we prove that the error in the integrated summary data is always smaller than the sum of the errors in the component summary data to be merged. In general, the updating period of each summary data differs depending on the characteristics of its information source. This fact motivates us to propose an incremental update of the integrated summary data. We show that the integrated wavelet-based summary data is well suited to incremental maintenance. As potential applications of the integrated summary data, we identify Internet query optimization, Internet top-N query processing, and OLAP. We discuss how our method can be applied to these applications to great benefit. Finally, we perform extensive experiments on Internet query optimization and Internet top-N query processing to verify the effectiveness of our approach.
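The linearity property behind the merging step can be illustrated with a minimal sketch. It assumes an unnormalized Haar transform (pairwise averages and differences), frequency vectors of equal power-of-two length over the same domain, and a keep-the-largest-coefficients compression rule; none of these choices is confirmed by this excerpt, and the shifting property used to align differing domains is not covered. All function names are hypothetical.

```python
def haar_transform(freqs):
    """Unnormalized Haar decomposition of a power-of-two-length list.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    coeffs = []
    current = list(freqs)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        details = [(current[i] - current[i + 1]) / 2 for i in pairs]
        coeffs = details + coeffs  # finer details go toward the end
        current = averages
    return current + coeffs

def inverse_haar(coeffs):
    """Reconstruct the frequency vector from Haar coefficients."""
    current = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        pos += len(current)
        rebuilt = []
        for average, detail in zip(current, details):
            rebuilt.extend([average + detail, average - detail])
        current = rebuilt
    return current

def merge(coeffs_a, coeffs_b):
    """Linearity: the transform of a sum is the sum of the transforms."""
    return [a + b for a, b in zip(coeffs_a, coeffs_b)]

def keep_largest(coeffs, k):
    """Assumed compression rule: zero all but the k largest-magnitude coefficients."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(ranked[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]

def abs_error(freqs, approx):
    """Sum of absolute differences between true and approximated frequencies."""
    return sum(abs(f - a) for f, a in zip(freqs, approx))

# Hypothetical frequency vectors held by two information sources.
freqs_a = [2, 2, 0, 2]
freqs_b = [1, 3, 5, 1]
freqs_all = [a + b for a, b in zip(freqs_a, freqs_b)]

# Merging full coefficient sets reproduces the transform of the merged data.
exact_merge = merge(haar_transform(freqs_a), haar_transform(freqs_b))

# Merging compressed summaries: the error cannot exceed the sum of the parts.
summary_a = keep_largest(haar_transform(freqs_a), 2)
summary_b = keep_largest(haar_transform(freqs_b), 2)
integrated = merge(summary_a, summary_b)
error_a = abs_error(freqs_a, inverse_haar(summary_a))
error_b = abs_error(freqs_b, inverse_haar(summary_b))
error_all = abs_error(freqs_all, inverse_haar(integrated))
```

Because the reconstruction is linear as well, the error vector of the integrated summary is the sum of the component error vectors, so the triangle inequality gives `error_all <= error_a + error_b` in this sketch. The paper's formal bounds are stated for its own error measures and are not reproduced here.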

The rest of the paper is organized as follows. Section 2 defines the terminology. Section 3 briefly reviews the wavelet-based summary data. Section 4 proposes a method to create and manage the integrated wavelet-based summary data; it also discusses the errors that can be introduced in the merging process and formally proves their bounds. Section 5 discusses applications of the proposed method. Section 6 presents and analyzes the results of the performance evaluation. Finally, Section 7 summarizes and concludes the paper.

Terminology

Summary data compactly represents the distribution of attribute values in a database using a small amount of information. In this paper, we focus on integer- and real-valued attributes to be used for summary data. The domain $D_X$ is the set of all possible values of the attribute $X$. The value set $V_X$ is the set of values for the attribute $X$ actually stored in a database (i.e., $V_X \subseteq D_X$). Given $V_X = \{v_i : 1 \leqslant i \leqslant |D_X|\}$, the frequency $f_i$ is the number of records that have $v_i$ for the attribute $X$ in the database.

Wavelet-based summary data

In this section, we briefly review the concept of the wavelet transform and discuss the properties of the summary data created using that transform.

Integrated wavelet-based summary data

This section discusses management of the integrated summary data based on the wavelet-based summary data. We first define the process of merging summary data and propose a merging algorithm, and then analyze the errors in the integrated summary data thus obtained. We also discuss an incremental update algorithm that reflects changes in individual summary data in the integrated summary data without rebuilding it.
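The paper's incremental update algorithm is not shown in this excerpt. The sketch below only illustrates why linearity makes such an update cheap: if the integrated coefficients are the sum of the component coefficients, replacing one component amounts to subtracting its old coefficients and adding its new ones. The function name and the coefficient values are hypothetical, and equal-length, aligned coefficient lists are assumed.

```python
def incremental_update(integrated, old_component, new_component):
    """Swap one component's contribution in the integrated coefficients.

    Assumes all three lists are aligned coefficient vectors of equal length.
    """
    return [
        total - old + new
        for total, old, new in zip(integrated, old_component, new_component)
    ]

# Hypothetical coefficient vectors: source B refreshes its summary data.
coeffs_a = [1.5, 0.5, 0.0, -1.0]
old_b = [2.5, -0.5, -1.0, 2.0]
new_b = [3.0, 0.0, -1.0, 2.0]
integrated = [a + b for a, b in zip(coeffs_a, old_b)]

updated = incremental_update(integrated, old_b, new_b)
```

Only the changed source needs to send its coefficients; the other components are left untouched, which suits sources with different update periods.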

Applications

In this section, we address Internet query optimization, Internet top-N query processing, and OLAP as applications that can benefit from using integrated summary data.

Performance evaluation

To verify the effectiveness of the integrated summary data, we have conducted two kinds of experiments: (1) estimating selectivities and (2) processing Internet top-N queries. In this section, we evaluate the accuracy of selectivity estimation using the integrated wavelet-based summary data and its effectiveness in processing Internet top-N queries. We assume that the wavelet-based summary data for each data distribution is maintained in each information source. We also assume that

Conclusions

In this paper, we have discussed techniques for managing the integrated summary data and its application to Internet query optimization. Merging data distributions to create the integrated summary data suffers from the high cost of transmitting, storing, and merging large volumes of data distributions. To overcome this drawback, we have proposed a new method based on wavelet transformation that creates the integrated summary data by merging multiple instances of summary data. We have also

Acknowledgements

This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).

Moon Jeung Joe received the B.S. degree in Electrical Engineering from Hanyang University in 1989, and the M.S. degree in Electronic & Electrical Engineering from Pohang University of Science and Technology (POSTECH) in 1991. He is currently a Ph.D. candidate in Computer Science at KAIST. Since 1991, he has been a research engineer at LG Electronics. His research interests include Web databases, distributed databases, and ORDBMS.

References (25)

  • D. Barbara, The New Jersey data reduction report, IEEE Data Engineering Bulletin (1997)
  • P. Bernstein, Query processing in a system for distributed databases (SDD-1), ACM Transactions on Database Systems (1981)
  • P. Bernstein, The Asilomar report on database research, SIGMOD Record (1998)
  • M. Carey et al., On saying enough already! in SQL
  • S. Chaudhuri et al., An overview of data warehousing and OLAP technology, SIGMOD Record (1997)
  • S. Chaudhuri et al., Evaluating top-k selection queries
  • P. Gibbons et al., New sampling-based summary statistics for improving approximate query answers
  • L. Gravano et al., The effectiveness of GlOSS for the text database discovery problem
  • H. Jagadish, Optimal histograms with quality guarantees
  • Y. Matias et al., Wavelet-based histograms for selectivity estimation
  • Y. Matias et al., Dynamic maintenance of wavelet-based histograms
  • M. Ozsu et al., Principles of Distributed Database Systems (1999)

Kyu-Young Whang graduated (Summa Cum Laude) from Seoul National University in 1973, and received the M.S. degrees from Korea Advanced Institute of Science and Technology (KAIST) in 1975, and Stanford University in 1982. He earned the Ph.D. degree from Stanford University in 1984. From 1983 to 1991, he was a Research Staff Member at the IBM T.J. Watson Research Center, Yorktown Heights, NY. He is now a full professor at the Department of Computer Science and the Director of the Advanced Information Technology Research Center (AITrc) of KAIST, an ERC supported by KOSEF. His research interests encompass data mining/data warehouses, database systems/storage systems, object-oriented databases, multimedia/hypermedia databases, geographic information systems (GIS), and digital libraries.

He served as an IEEE Distinguished Visitor from 1989 to 1990, received the Best Paper Award from the 6th IEEE International Conference on Data Engineering, served the 5th IEEE International Conference on Data Engineering as a Program Co-Chair, and has served on the program committees of numerous international conferences including ACM SIGMOD and VLDB. He served the VLDB Conference as the Program Chair for Asia, Pacific, and Australia in 2000 and served the IFCIS CoopIS Conference as the Program Chair for Asia and Pacific Rim in 1998. He twice received the External Honor Recognition from IBM. He was an associate editor of the IEEE Data Engineering Bulletin from 1990 to 1993, and an editor of Distributed and Parallel Databases from 1991 to 1995. He is on the editorial boards of the VLDB Journal and International Journal of Geographic Information Systems. He is a Trustee of the VLDB Endowment and a Steering Committee Member of the DASFAA Conference. He was a vice president and served on the board of directors of the Korea Information Science Society. He is a senior member of the IEEE and a member of the ACM.

Sang-Wook Kim received the B.S. degree in Computer Engineering from Seoul National University in Korea in 1989, and earned the M.S. and Ph.D. degrees in Computer Science from Korea Advanced Institute of Science and Technology (KAIST) in 1991 and 1994, respectively. From 1994 to 1995, he worked with the Information and Electronics Research Center in Korea, as a Post-Doc. Since 1995, he has served as an Associate Professor of the Division of Computer, Information, and Communications Engineering at Kangwon National University in Korea. From 1999 to 2000, he worked with the IBM T.J. Watson Research Center in Yorktown Heights, New York, as a visiting scientist. He also visited the Computer Science Department of Stanford University as a summer intern in 1991. His research interests include data mining/data warehousing, multimedia information retrieval, transaction management, geographic information systems, and main memory databases. He is a member of the ACM and the IEEE.