ABSTRACT
Data streams are fundamental to many data processing applications that involve large amounts of data generated continuously as a sequence of events. Frequently, such events are not stored, so the data are analyzed and queried as they arrive and then discarded right away. In many applications, each event is represented by a predetermined number of numerical attributes. Thus, without loss of generality, we can consider events as elements of a dimensional domain. A sequence of events in a data stream can be characterized by its intrinsic dimension, which in dimensional datasets is usually lower than the embedding dimensionality. Because the intrinsic dimension can be used to improve the performance of algorithms that handle dimensional data (especially query optimization), measuring it is also relevant to improving data stream processing and analysis. Moreover, it can be useful for forecasting data behavior. Hence, we present an algorithm that measures the intrinsic dimension of a data stream on the fly, following its continuously changing behavior. We also present experimental studies, using both real and synthetic data streams, showing that the results on well-understood datasets closely follow what is expected from the known behavior of the data.
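To make the notion of intrinsic dimension concrete, the following is a minimal sketch and not the algorithm presented in the paper. It assumes the intrinsic dimension is approximated by a box-counting estimate of the correlation fractal dimension: for several grid cell sizes r, sum the squared cell occupancies S(r), then take the slope of log S(r) against log r. The function names, the window size, the cell sizes and the synthetic stream are all illustrative assumptions, and the stream is handled in simple non-overlapping windows rather than truly incrementally.

```python
import math
import random
from collections import Counter


def correlation_dimension(points, radii):
    """Estimate the correlation fractal dimension of `points`.

    For each cell side r, points are binned into a regular grid and
    S(r) = sum of squared cell counts is computed. The estimate is the
    least-squares slope of log S(r) versus log r.
    """
    xs, ys = [], []
    for r in radii:
        cells = Counter(tuple(math.floor(c / r) for c in p) for p in points)
        xs.append(math.log(r))
        ys.append(math.log(sum(n * n for n in cells.values())))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def dimension_per_window(stream, window_size, radii):
    """Yield one estimate per consecutive, non-overlapping window of events."""
    window = []
    for event in stream:
        window.append(event)
        if len(window) == window_size:
            yield correlation_dimension(window, radii)
            window = []  # events are discarded once the window is measured


# Hypothetical synthetic stream: 2-attribute events that first lie on a
# line (intrinsic dimension near 1) and later fill a square (near 2).
rng = random.Random(0)
line_events = [(t, t) for t in (rng.random() for _ in range(20000))]
square_events = [(rng.random(), rng.random()) for _ in range(20000)]
stream = line_events + square_events

radii = [2.0 ** -k for k in range(2, 6)]  # cell sides 1/4 .. 1/32
estimates = list(dimension_per_window(stream, 10000, radii))
print([round(d, 2) for d in estimates])
```

Both halves of this stream have embedding dimensionality 2, yet the estimates for the first two windows should come out close to 1 and those for the last two close to 2. That is the kind of behavior change a stream-oriented measurement is meant to follow.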
Index Terms
- Evaluating the intrinsic dimension of evolving data streams