AMID: Approximation of MultI-measured Data using SVD
Introduction
In general, traditional database management systems (DBMSs) generate exact query results for user requests. However, with the explosive growth of networking in recent years, large volumes of data flow into a system continuously over the internet. In this situation, to generate exact query results, a system may have to scan an enormous amount of data and waste valuable resources (e.g., time, disk space, and computing power).
In particular, time-critical applications such as decision support systems (DSSs) require a fast response in order to provide viable information to users. Because many DSS applications are exploratory in nature, an exact result may not be required, and a user may prefer a fast answer instead. Thus, approximate query answering has recently emerged as an effective method for generating viable answers to complex queries against a large volume of data. To facilitate approximate query answering, much research has been conducted in recent years on techniques such as random sampling [10], [9], [16], [28], histograms [17], [18], [24], and wavelets [2], [6], [7], [8], [13], [19], [23].
Random sampling and histograms have a long and rich history in the query optimization area. To estimate the result size of a query accurately, statistics on the data distribution are required. The most accurate statistics of the data are the data itself; however, estimating result sizes from the full data is impractical. Thus, in random sampling, a small data set is selected based on a probability model to represent the statistics. In a histogram, the data distribution is represented by buckets, each of which maintains summary data (e.g., the frequency of data values).
Matias et al. [23] proposed a histogram method using wavelets. Since their work, wavelet-based techniques for query optimization and approximate query answering have received significant attention. Wavelets are a mathematical tool for the hierarchical decomposition of functions. By storing only a few wavelet coefficients, the data can be kept in a small amount of disk space with little loss of accuracy, and an approximate query result can be obtained efficiently.
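The idea of keeping only a few wavelet coefficients can be sketched as follows. This is a minimal illustration, assuming an unnormalized Haar transform (pairwise averages and half-differences) on a vector whose length is a power of two; the function names `haar_decompose`, `haar_reconstruct`, and `keep_largest` are hypothetical and do not come from any of the cited works. Selecting coefficients by raw magnitude is a simplification: an $L^2$-optimal selection would first normalize the coefficients by resolution level.

```python
import numpy as np

def haar_decompose(values):
    """Unnormalized Haar transform: repeated pairwise averages and half-differences."""
    coeffs = np.asarray(values, dtype=float).copy()
    n = len(coeffs)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    length = n
    while length > 1:
        half = length // 2
        pairs = coeffs[:length].reshape(half, 2)
        averages = pairs.mean(axis=1)                   # coarser approximation
        details = (pairs[:, 0] - pairs[:, 1]) / 2.0     # detail coefficients
        coeffs[:half] = averages
        coeffs[half:length] = details
        length = half
    return coeffs

def haar_reconstruct(coeffs):
    """Invert haar_decompose by expanding averages and details level by level."""
    data = np.asarray(coeffs, dtype=float).copy()
    n = len(data)
    length = 1
    while length < n:
        averages = data[:length].copy()
        details = data[length:2 * length].copy()
        data[0:2 * length:2] = averages + details
        data[1:2 * length:2] = averages - details
        length *= 2
    return data

def keep_largest(coeffs, budget):
    """Zero out all but the `budget` coefficients of largest absolute value."""
    kept = np.zeros_like(coeffs)
    top = np.argsort(-np.abs(coeffs), kind="stable")[:budget]
    kept[top] = coeffs[top]
    return kept

signal = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_decompose(signal)
approx = haar_reconstruct(keep_largest(coeffs, budget=4))
print(coeffs)
print(approx)
```

With all eight coefficients retained the signal is reconstructed exactly; with a smaller budget the reconstruction is approximate, which is the space and accuracy trade-off described above.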
In DSS applications, data generally consists of multiple measures. For example, a stock market database includes information on the corporation number, the trade amount, the upper price, the lower price, and so on. These applications gather multi-measured data that is transmitted continuously and analyze it online. In addition, for historical analyses, these applications may generate a snapshot periodically (e.g., daily or weekly).
The computation model of our work is shown in Fig. 1. For current analyses, data is kept in memory for a specified period of time. The collected data (a snapshot) is then stored in approximate form in a data warehouse, where it is used for historical analyses.
To support approximate query answering with wavelets in multiple-measure environments, Deligiannakis and Roussopoulos [3] presented extended wavelets. As mentioned in [3], wavelets cannot easily adapt to multi-measured data. To reduce the disk space and minimize the root squared error, the extended wavelet records multiple wavelet coefficients for different measures. To improve the time and space complexity of generating the extended wavelet, Guha et al. [14] suggested XWAVE, which is based on a dynamic programming formulation that efficiently minimizes the $L^2$-norm error.
However, recent work has shown that wavelet techniques based on minimizing the $L^2$-norm error can suffer from important problems, such as severe bias and wide variance in the quality of the reconstructed data, and the lack of an error bound for an individual approximate answer [6], [8]. Indeed, the $L^2$-norm error is greater than or equal to the maximum error for an individual approximate answer (i.e., $\|e\|_2 \ge \|e\|_\infty$ for the vector $e$ of individual errors), but it does not provide a tight bound on the maximum error.
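The relation between the two error measures can be written out explicitly. In the following, $e = (e_1, \dots, e_n)$ denotes the vector of errors of $n$ individual approximate answers; the symbols are introduced here for illustration only.

```latex
\|e\|_\infty
  = \max_{1 \le i \le n} |e_i|
  \;\le\; \Bigl( \sum_{i=1}^{n} e_i^2 \Bigr)^{1/2}
  = \|e\|_2
  \;\le\; \sqrt{n}\, \|e\|_\infty
```

The left inequality is why a bound on the $L^2$-norm error also bounds the maximum error. The right inequality shows that the two can differ by a factor of up to $\sqrt{n}$, so for large $n$ the $L^2$ bound says little about any single value.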
In this paper, we propose a data approximation method that addresses the problems of the wavelet techniques mentioned above. We take a different approach, called AMID (Approximation of MultI-measured Data using SVD), for multi-measured environments. AMID utilizes singular value decomposition (SVD) [11], [22], which has been employed in diverse image applications such as compression and feature extraction. In addition, for historical analyses in DSSs, we propose incremental update methods based on SVD.
Multi-measured data is treated as a two-dimensional matrix. SVD of this matrix provides a medium to extract dominant vectors effectively. Using the extracted dominant vectors, the original matrix can be represented approximately.
SVD is a numerical tool that decomposes a matrix into two orthogonal matrices and its singular values. A matrix $A$ is decomposed as $A = U \Sigma V^{T}$, where $A$ is the matrix that we want to summarize, $\Sigma$ is a diagonal matrix of singular values, $U$ is a column-orthogonal matrix, and $V$ is an orthogonal matrix.
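As a rough illustration of how dominant vectors yield an approximation, the sketch below computes a thin SVD with NumPy and keeps the $k$ largest singular values. The data, the choice $k = 2$, and the helper name `rank_k_approximation` are assumptions made for this example. It also checks a generic property of truncated SVD: every entry of the error matrix is bounded in absolute value by the first discarded singular value. This is a standard bound and not necessarily the one derived in this paper.

```python
import numpy as np

def rank_k_approximation(matrix, k):
    """Return the best rank-k approximation and all singular values."""
    # Thin SVD: u is n x m with orthonormal columns, vt is m x m (for n >= m).
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    approx = (u[:, :k] * s[:k]) @ vt[:k, :]
    return approx, s

# Synthetic multi-measured data: 200 records, 5 measures, close to rank 2.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
A = base + 0.01 * rng.normal(size=(200, 5))

k = 2
A_hat, singular_values = rank_k_approximation(A, k)
error = A - A_hat

spectral_error = np.linalg.norm(error, 2)   # equals the (k+1)-th singular value
max_abs_error = np.max(np.abs(error))       # largest error of any single value

print("first discarded singular value:", singular_values[k])
print("spectral norm of the error:    ", spectral_error)
print("maximum absolute entry error:  ", max_abs_error)
```

Storing the truncated factors takes $k(n + m + 1)$ numbers instead of $nm$, which is where the compression comes from when $k$ is small.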
The contributions of this paper are as follows:
- Guarantee a maximum absolute error for an individual data value: Wavelet techniques based on minimizing the $L^2$-norm error do not provide a tight error bound for an individual data value. SVD guarantees the $L^2$-norm error. In addition, based on a mathematical analysis of SVD, we derive an error bound for each data value.
- Combine SVD and wavelets for multi-measured environments: Although SVD is an effective mechanism for approximating data, we apply wavelets after SVD in order to improve the accuracy of the approximation. In our experiments, combining SVD and wavelets achieves a lower error ratio than SVD alone.
- Adapt to incremental update environments: To the best of our knowledge, SVD has been applied only to a preexisting matrix. In incremental update environments, the current snapshot, which is generated for historical analysis, should be consolidated with the previously archived data. A naive approach is to reconstruct the previously compressed data and then compress the whole data, consisting of the reconstructed data and the current snapshot, using SVD. This approach wastes computing power and memory. Thus, we devise efficient SVD algorithms that reflect the current snapshot in the compressed data without a full reconstruction.
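To make the last item concrete, the sketch below shows one generic way to fold new rows into an existing truncated SVD without reconstructing the archived matrix. It is not the algorithm proposed in this paper, whose details are not given in this excerpt. It relies on the identity that stacking the archived approximation $U_k \Sigma_k V_k^{T}$ on top of a snapshot $B$ equals $\mathrm{diag}(U_k, I)$ times the small matrix obtained by stacking $\Sigma_k V_k^{T}$ on $B$, so only that small matrix needs a fresh SVD. The function name `append_rows` and all data are hypothetical.

```python
import numpy as np

def append_rows(u_k, s_k, vt_k, new_rows, rank):
    """Update a truncated SVD (u_k, s_k, vt_k) with new rows, keeping `rank` terms."""
    k = len(s_k)
    # Small (k + p) x m matrix; the n x m archived data is never rebuilt.
    small = np.vstack([s_k[:, None] * vt_k, new_rows])
    u_small, s_new, vt_new = np.linalg.svd(small, full_matrices=False)
    # Left factor is diag(u_k, I) @ u_small, written out block by block.
    u_new = np.vstack([u_k @ u_small[:k, :], u_small[k:, :]])
    return u_new[:, :rank], s_new[:rank], vt_new[:rank, :]

rng = np.random.default_rng(1)
archived = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))
archived += 0.01 * rng.normal(size=archived.shape)
snapshot = rng.normal(size=(3, 5))           # current snapshot: 3 new records

# Compress the archived data once with a rank-2 truncated SVD.
u, s, vt = np.linalg.svd(archived, full_matrices=False)
u_k, s_k, vt_k = u[:, :2], s[:2], vt[:2, :]

# Fold in the snapshot using only the compressed factors.
u_new, s_new, vt_new = append_rows(u_k, s_k, vt_k, snapshot, rank=2)
updated = (u_new * s_new) @ vt_new
print(updated.shape)
```

The decomposition cost in the update depends on $k + p$ and $m$ rather than on the number of archived rows $n$; the only step that touches all $n$ rows is the product $U_k U'$, which is why this avoids the reconstruct-and-recompress cost of the naive approach.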
In addition, to demonstrate the effectiveness of AMID, we implemented several versions of AMID and conducted an extensive experimental study with real-life and synthetic data sets. Our experiments show that AMID improves accuracy compared to other approaches.
Remarks on originality: In this paper, to improve the accuracy of approximation in multi-measured environments, we propose a novel method that combines SVD and wavelets and provides an error bound for an individual data value. We also adapt SVD to incremental update environments. To the best of our knowledge, applying SVD to incremental update environments has not been considered previously.
The remainder of the paper is organized as follows. In Section 2, we present previous work. We describe the basics of SVD and wavelets in Section 3. In Section 4, we present the details of AMID and the error bound of SVD. Section 5 presents an extension of AMID for the incremental update environments. Section 6 contains the results of our experiments. Finally, in Section 7, we summarize our work.
Previous work
To support efficient approximate query processing, various techniques that represent a huge amount of data in a small amount of disk space have been proposed. Representative techniques include sampling [1], [10], [9], [16], [28], histograms [17], [18], [24], and wavelets [2], [6], [7], [8], [13], [19], [23].
The basic idea of sampling is that a small sample of the data represents the data well. In [28], the reservoir sampling algorithm was presented which can be used to create
Preliminaries
In this section, we introduce the basics of Haar wavelet and singular value decomposition (SVD).
AMID
In this section, we present the mechanism of AMID which compresses multi-measured data effectively.
Enhancement of AMID for incremental updates
As illustrated in Fig. 1, a DSS generates snapshots as multi-measured data arrives continuously. To support historical analysis, the current snapshot should be combined with the archived data.
A naive approach for AMID in incremental update environments is to reconstruct the archived data, combine it with the current snapshot, and then apply AMID to the consolidated data. It takes a long time to reconstruct the archived data and compute SVD of the consolidated
Experiments
In this section, we demonstrate the effectiveness of AMID. We performed experiments on both real-life and synthetic data sets to evaluate the accuracy and efficiency of AMID. To compare AMID with other approaches, we also implemented several alternative techniques. Table 1 summarizes the symbols and names of the techniques used to explain the experimental results.
We first show the accuracy and performance of AMID, SVD, SVDD, SVDW, and EWA using the static data. For extended wavelets, we use
Conclusion
In this paper, we propose AMID (Approximation of MultI-measured Data using SVD) to approximate the data with multiple measures.
Previous techniques for multi-measured data are based on wavelets. In contrast, AMID adopts singular value decomposition (SVD) as its main compressor. Using mathematical analysis, we derive an error bound for an individual data value, which the wavelet techniques that minimize the $L^2$-norm error do not provide.
In AMID, multi-measured data is decomposed into a
Acknowledgments
This work was partially supported by the Defense Acquisition Program Administration and the Agency for Defense Development under the contract.
References (30)
- et al., Compressed histograms with arbitrary bucket layouts for selectivity estimation, Information Sciences (2007)
- et al., Summarizing data using a similarity based mountain method, Information Sciences (2008)
- P.G. Brown, P.J. Haas, Techniques for warehousing of sample data, in: Proc. of IEEE ICDE, 2006, p....
- G. Cormode, M. Garofalakis, D. Sacharidis, Fast approximate wavelet tracking on streams, in: Proc. of EDBT Conf., 2006,...
- A. Deligiannakis, N. Roussopoulos, Extended wavelets for multiple measures, in: Proc. of ACM SIGMOD Conf., 2003, pp....
- et al., The approximation of one matrix by another of lower rank, Psychometrika (1936)
- M. Garofalakis, P.B. Gibbons, Wavelet synopses with error guarantees, in: Proc. of ACM SIGMOD Conf., 2002, pp....
- et al., Probabilistic wavelet synopses, ACM Transactions on Database Systems (TODS) (2004)
- M. Garofalakis, A. Kumar, Deterministic wavelet thresholding for maximum-error metrics, in: Proc. of PODS, 2004, pp....
- P.B. Gibbons, Y. Matias, New sampling-based summary statistics for improving approximate query answers, in: Proc. of...
- Matrix Computations