Elsevier

Information Sciences

Volume 179, Issue 16, 20 July 2009, Pages 2833-2850

AMID: Approximation of MultI-measured Data using SVD

https://doi.org/10.1016/j.ins.2009.04.008

Abstract

Approximate query answering has recently emerged as an effective method for generating a viable answer. Among various techniques for approximate query answering, wavelets have received a lot of attention. However, wavelet techniques minimizing the root squared error (i.e., the L2 norm error) have several problems, such as the poor quality of reconstructed data when the original data is biased. In this paper, we present AMID (Approximation of MultI-measured Data using SVD) for multi-measured data. In AMID, we adapt the singular value decomposition (SVD) to compress multi-measured data. Using mathematical analyses, we show that SVD guarantees the root squared error and also derive an error bound of SVD for an individual data value. In addition, in order to improve the accuracy of approximated data, we combine SVD and wavelets in AMID.

Since SVD is applied to a fixed matrix, we use various properties of matrices to adapt SVD to the incremental update environment. We devise two variants of AMID for the incremental update environment: incremental AMID and local AMID. To the best of our knowledge, our work is the first to extend SVD to incremental update environments.

Introduction

In general, traditional database management systems (DBMSs) generate exact query results with respect to user requests. However, due to the explosive growth of networking in recent years, a large volume of data is transmitted into a system through the internet continuously. In this situation, to generate exact query results, a system may have to scan an enormous amount of data and waste valuable resources (e.g., time, disk space, and computing power).

In particular, time-critical applications such as decision support systems (DSSs) require a fast response in order to provide viable information to users. Due to the exploratory nature of many DSS applications, an exact result may not be required, and a user may prefer a fast answer. Thus, approximate query answering has recently emerged as an effective method for generating viable answers to complex queries against a large volume of data. In recent years, in order to facilitate approximate query answering, much research has been conducted on techniques such as random sampling [10], [9], [16], [28], histograms [17], [18], [24], and wavelets [2], [6], [7], [8], [13], [19], [23].

Random sampling and histograms have a long and rich history in the query optimization area. In order to estimate the result size of queries accurately, statistics on the data distribution are required. The most accurate statistics of the data is the data itself. However, estimating the result size from the data itself is impractical. Thus, in random sampling, a small data set is selected based on a probability model in order to represent the statistics. In a histogram, the data distribution is represented by buckets in which summary data (e.g., the frequency of data values) is maintained.

Matias et al. [23] proposed a histogram method using wavelets. After their work, wavelet-based techniques for query optimization and approximate query answering have received significant attention. Wavelets are a mathematical tool for the hierarchical decomposition of functions. By storing only a few wavelet coefficients, the data can be kept in a small amount of disk space with little loss of accuracy, and an approximate query result can be obtained efficiently.
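To make the idea of keeping only a few wavelet coefficients concrete, the following is a minimal sketch of an unnormalized Haar decomposition (pairwise averages and half-differences) with a simple keep-the-b-largest thresholding step. The function names and the sample values are illustrative assumptions and are not taken from the paper; L2-oriented schemes normally rank coefficients after normalization, which this sketch omits for brevity.

```python
import numpy as np

def haar_decompose(data):
    """Unnormalized Haar transform of a sequence whose length is a power of two."""
    values = np.asarray(data, dtype=float)
    details = []
    while len(values) > 1:
        pairs = values.reshape(-1, 2)
        averages = pairs.mean(axis=1)
        diffs = (pairs[:, 0] - pairs[:, 1]) / 2.0
        details.insert(0, diffs)  # coarser levels are stored first
        values = averages
    return np.concatenate([values] + details)

def haar_reconstruct(coeffs):
    """Invert haar_decompose: expand the overall average level by level."""
    coeffs = np.asarray(coeffs, dtype=float)
    values = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(values)]
        pos += len(values)
        out = np.empty(2 * len(values))
        out[0::2] = values + diffs
        out[1::2] = values - diffs
        values = out
    return values

def keep_largest(coeffs, b):
    """Zero all but the b largest-magnitude coefficients (raw magnitude, no normalization)."""
    coeffs = np.asarray(coeffs, dtype=float)
    kept = np.zeros_like(coeffs)
    idx = np.argsort(-np.abs(coeffs))[:b]
    kept[idx] = coeffs[idx]
    return kept

data = [2, 2, 0, 2, 3, 5, 4, 4]          # hypothetical measure values
coeffs = haar_decompose(data)
approx = haar_reconstruct(keep_largest(coeffs, 4))
```

Here `approx` is rebuilt from four of the eight coefficients, so it trades some accuracy for half the storage.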

In DSS applications, data generally consists of multiple measures. For example, a stock market database includes information on the corporation number, the trade amount, the upper price, the lower price and so on. These applications gather multi-measured data transmitted continuously and analyze them on-line. Also, for historical analyses, these applications may generate a snapshot periodically (e.g., daily or weekly).

The computation model of our work is shown in Fig. 1. For the current analyses, data is kept in the memory for a specified period of time. Collected data (snapshot) will be stored approximately in a data warehouse. Approximated data in a data warehouse is used for historical analyses.

In order to support approximate query answering in multiple measure environments using wavelets, Deligiannakis and Roussopoulos [3] presented extended wavelets. As mentioned in [3], wavelets cannot easily adapt to multi-measured data. In order to reduce the disk space and minimize the root squared error, the extended wavelet records multiple wavelet coefficients for different measures. In order to improve the time and space complexity for generating the extended wavelet, Guha et al. [14] suggested XWAVE which is based on a dynamic programming formulation minimizing the L2 norm error efficiently.

However, recent works have shown that the wavelet techniques based on minimizing the L2 norm error can suffer from important problems such as severe bias and wide variance in the quality of reconstructed data, and the lack of an error bound for an individual approximate answer [6], [8]. In fact, the L2 norm error is greater than or equal to the maximum error (i.e., the L∞ norm error) for an individual approximate answer, but it does not provide a tight bound on the maximum error.
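A small numeric illustration of why the L2 norm error bounds the maximum error without being tight: the two hypothetical approximations below have the same L2 error, yet their maximum per-value errors differ by a factor of ten. The arrays and helper names are assumptions made for this sketch only.

```python
import numpy as np

def l2_error(original, approx):
    """Root of the summed squared differences (L2 norm error)."""
    diff = np.asarray(original, dtype=float) - np.asarray(approx, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

def max_error(original, approx):
    """Largest absolute difference for any single value (L-infinity error)."""
    diff = np.asarray(original, dtype=float) - np.asarray(approx, dtype=float)
    return float(np.max(np.abs(diff)))

original = np.zeros(100)

# Case 1: the error is spread evenly, every value is off by 0.1.
spread = original + 0.1

# Case 2: the error is concentrated, a single value is off by 1.0.
spiked = original.copy()
spiked[0] = 1.0

for name, approx in [("spread", spread), ("spiked", spiked)]:
    print(name, l2_error(original, approx), max_error(original, approx))
```

Both cases have an L2 error of about 1.0, so an L2 guarantee alone cannot distinguish an approximation whose worst value is off by 0.1 from one whose worst value is off by 1.0.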

In this paper, we propose a data approximation method for solving problems of the wavelet techniques mentioned above. We take a different approach, called AMID (Approximation of MultI-measured Data using SVD), for multi-measured environments. AMID utilizes singular value decomposition (SVD) [11], [22] which has been employed for diverse image applications such as compression and feature extraction. Also, for historical analyses in DSS systems, we propose incremental update methods based on SVD.

Multi-measured data is treated as a two-dimensional matrix. SVD of this matrix provides a medium to extract dominant vectors effectively. Using the extracted dominant vectors, the original matrix can be represented approximately.

SVD is a numerical tool which effectively decomposes a matrix into two orthogonal matrices and its singular values. Thus, a matrix A is decomposed into A = UΣV^T, where A is an m×n matrix that we want to summarize, Σ is an n×n diagonal matrix, U is an m×n column-orthogonal matrix, and V is an n×n orthogonal matrix.
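The decomposition and the resulting low-rank approximation can be sketched with NumPy as follows. This is only an illustration under assumed inputs (a random 200×4 matrix standing in for multi-measured data, and a hypothetical rank k = 2); it does not reproduce AMID itself or the paper's own per-value error bound. It does show two standard facts: the Frobenius (L2) error of the rank-k truncation equals the root of the sum of the squared discarded singular values, and no single entry can be off by more than the largest discarded singular value.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k factors of A plus the discarded singular values."""
    # full_matrices=False gives U (m x n), s (n,), Vt (n x n) for m >= n.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :], s[k:]

def reconstruct(U_k, s_k, Vt_k):
    """Rebuild the approximate matrix U_k diag(s_k) Vt_k."""
    return (U_k * s_k) @ Vt_k

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 4))            # hypothetical data: 200 tuples, 4 measures
U_k, s_k, Vt_k, discarded = truncated_svd(A, k=2)
A_hat = reconstruct(U_k, s_k, Vt_k)

frobenius_error = np.linalg.norm(A - A_hat)          # L2 norm error
predicted_error = np.sqrt(np.sum(discarded ** 2))    # from discarded singular values
worst_value_error = np.max(np.abs(A - A_hat))        # maximum error of one value
```

Storing `U_k`, `s_k`, and `Vt_k` takes k(m + n + 1) numbers instead of mn, which is where the compression comes from when k is small relative to n.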

The contributions of this paper are as follows:

  • Guarantee maximum absolute error for an individual data value: The wavelet techniques based on minimizing the L2 norm error do not provide a tight error bound for an individual data value. SVD guarantees the L2 norm error. In addition, in this paper, based on mathematical analysis of SVD, we derive the error bound for each data value.

  • Combine SVD and wavelets for multi-measured environments: Although SVD presents an effective mechanism for approximation of data, we apply wavelets after SVD in order to improve the accuracy of approximation. Our experiments show that combining SVD and wavelets achieves a lower error ratio than using SVD alone.

  • Adapt to the incremental update environments: To the best of our knowledge, SVD considers only the preexisting matrix. In incremental update environments, the current snapshot, which is generated for historical analysis, should be consolidated with the previously archived data. A naive approach is that the previous compressed data is reconstructed and the whole data including the reconstructed data and the current snapshot is compressed using SVD. This approach wastes the computing power and memory. Thus, we devise efficient SVD algorithms to reflect the current snapshot into compressed data without the whole reconstruction.
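The cost argument in the last contribution can be illustrated with a generic block-update identity for SVD. The sketch below is not the paper's incremental AMID or local AMID, whose details are not shown in this excerpt; it is a standard row-append update, and the function name `append_rows` and all inputs are assumptions. Given archived rank-k factors U_k, Σ_k, V_k^T and a new snapshot B, it takes the SVD of the small (k + p)×n matrix formed by stacking Σ_k V_k^T on B, instead of reconstructing the full archived matrix.

```python
import numpy as np

def append_rows(U_k, s_k, Vt_k, B):
    """Fold new rows B (p x n) into rank-k factors without rebuilding the archive.

    Uses [A_k; B] = blockdiag(U_k, I_p) @ [diag(s_k) Vt_k; B], so only the
    small stacked matrix needs a fresh SVD.
    """
    k = len(s_k)
    # (k + p) x n matrix: compressed archive stacked on the new snapshot.
    M = np.vstack([s_k[:, None] * Vt_k, B])
    Um, s_new, Vt_new = np.linalg.svd(M, full_matrices=False)
    # Rotate the archived left vectors; the new rows get their own block.
    U_new = np.vstack([U_k @ Um[:k, :], Um[k:, :]])
    return U_new[:, :k], s_new[:k], Vt_new[:k, :]

rng = np.random.default_rng(1)
archive = rng.normal(size=(40, 5))       # hypothetical archived data
snapshot = rng.normal(size=(6, 5))       # hypothetical current snapshot

U, s, Vt = np.linalg.svd(archive, full_matrices=False)
k = 3
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

U2, s2, Vt2 = append_rows(U_k, s_k, Vt_k, snapshot)
```

The result matches what the naive approach would produce (reconstruct the archive, stack the snapshot, recompute SVD, truncate to rank k), but the SVD is taken on a (k + p)×n matrix rather than an (m + p)×n one.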

In addition, to demonstrate the effectiveness of AMID, we implemented various versions of AMID. We conducted an extensive experimental study with real-life and synthetic data sets. Our experiments show that AMID achieves an improvement of accuracy compared to other approaches.

Remarks on the originality: In this paper, to improve the accuracy of approximation for multi-measured environments, we propose a novel method to combine SVD and wavelets and provide the error bound for an individual data value. Also, we adapt SVD to the incremental update environments. To the best of our knowledge, applying SVD to incremental update environments was not considered previously.

The remainder of the paper is organized as follows. In Section 2, we present previous work. We describe the basics of SVD and wavelets in Section 3. In Section 4, we present the details of AMID and the error bound of SVD. Section 5 presents an extension of AMID for the incremental update environments. Section 6 contains the results of our experiments. Finally, in Section 7, we summarize our work.

Previous work

In order to support efficient approximate query processing, various techniques which represent a huge amount of data with small disk space have been proposed. The representative techniques among them are sampling [1], [10], [9], [16], [28], histogram [17], [18], [24], and wavelet [2], [6], [7], [8], [13], [19], [23].

The basic idea of sampling is that a small number of samples can represent the data well. In [28], the reservoir sampling algorithm was presented which can be used to create

Preliminaries

In this section, we introduce the basics of Haar wavelet and singular value decomposition (SVD).

AMID

In this section, we present the mechanism of AMID which compresses multi-measured data effectively.

Enhancement of AMID for incremental updates

As illustrated in Fig. 1, DSS generates snapshots when the multi-measured data arrives continuously. In order to support historical analysis, the current snapshot should be combined with the archived data.

A naive approach to AMID in incremental update environments is to reconstruct the archived data, combine it with the current snapshot, and then apply AMID to the consolidated data. It takes a long time to reconstruct the archived data and compute SVD of the consolidated

Experiments

In this section, we demonstrate the effectiveness of AMID. We performed experiments on both real-life data sets and synthetic data sets to evaluate the accuracy and efficiency of AMID. Also, for comparing AMID with other approaches, we implemented diverse approaches. Table 1 summarizes the symbols and the names of the techniques to explain the experimental results.

We first show the accuracy and performance of AMID, SVD, SVDD, SVDW, and EWA using the static data. For extended wavelets, we use

Conclusion

In this paper, we propose AMID (Approximation of MultI-measured Data using SVD) to approximate the data with multiple measures.

Previous techniques for multi-measured data are based on wavelets. In contrast, AMID adapts the singular value decomposition (SVD) as a dominant compressor. Using the mathematical analysis, we derive the error bound of an individual data value that is not suggested in the wavelet techniques minimizing the L2 norm error.

In AMID, multi-measured data is decomposed into a

Acknowledgments

This work was partially supported by Defense Acquisition Program Administration and Agency for Defense Development under the contract.

References (30)

  • D. Fuchs et al., Compressed histograms with arbitrary bucket layouts for selectivity estimation, Information Sciences (2007)
  • R.R. Yager et al., Summarizing data using a similarity based mountain method, Information Sciences (2008)
  • P.G. Brown, P.J. Haas, Techniques for warehousing of sample data, in: Proc. of IEEE ICDE, 2006, p....
  • G. Cormode, M. Garofalakis, D. Sacharidis, Fast approximate wavelet tracking on streams, in: Proc. of EDBT Conf., 2006,...
  • A. Deligiannakis, N. Roussopoulos, Extended wavelets for multiple measures, in: Proc. of ACM SIGMOD Conf., 2003, pp....
  • C. Eckart et al., The approximation of one matrix by another of lower rank, Psychometrika (1936)
  • M. Garofalakis, P.B. Gibbons, Wavelet synopses with error guarantees, in: Proc. of ACM SIGMOD Conf., 2002, pp....
  • M. Garofalakis et al., Probabilistic wavelet synopses, ACM Transactions on Database Systems (TODS) (2004)
  • M. Garofalakis, A. Kumar, Deterministic wavelet thresholding for maximum-error metrics, in: Proc. of PODS, 2004, pp....
  • P.B. Gibbons, Y. Matias, New sampling-based summary statistics for improving approximate query answers, in: Proc. of...
  • P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, in: Proc. of...
  • G.H. Golub et al., Matrix Computations (1996)
  • S. Guha, B. Harb, Wavelet synopsis for data streams: minimizing non-Euclidean error, in: Proc. of ACM SIGKDD Conf.,...
  • S. Guha, B. Harb, Approximation algorithms for wavelet transform coding of data streams, in: Proc. of SODA Conf., 2006,...
  • S. Guha, C. Kim, K. Shim, XWAVE: optimal and approximate extended wavelets for streaming data, in: Proc. of VLDB Conf.,...