
Mining data streams: a review

Published: 01 June 2005

Abstract

Recent advances in hardware and software have enabled the capture of different measurements of data in a wide range of fields. These measurements are generated continuously and at very high, fluctuating data rates. Examples include sensor networks, web logs, and computer network traffic. The storage, querying, and mining of such data sets are highly computationally challenging tasks. Mining data streams is concerned with extracting knowledge structures, represented as models and patterns, from non-stopping streams of information. Research in data stream mining has attracted considerable attention because of the importance of its applications and the increasing generation of streaming information. Applications of data stream analysis range from critical scientific and astronomical applications to important business and financial ones. Algorithms, systems, and frameworks that address streaming challenges have been developed over the past three years. In this review paper, we present the state of the art in this growing and vital field.



Published in: ACM SIGMOD Record, Volume 34, Issue 2, June 2005, 91 pages
ISSN: 0163-5808
DOI: 10.1145/1083784
Copyright © 2005 Authors
Publisher: Association for Computing Machinery, New York, NY, United States
