Skip to main content

Advertisement

Log in

Energy-based function to evaluate data stream clustering

  • Regular Article
  • Published:
Advances in Data Analysis and Classification Aims and scope Submit manuscript

Abstract

Severe constraints imposed by the nature of endless sequences of data collected from unstable phenomena have pushed the understanding and the development of automated analysis strategies, such as data clustering techniques. However, current clustering validation approaches are inadequate to data streams due to they do not properly evaluate representation of behavior changes. This paper proposes a novel function to continuously evaluate data stream clustering inspired in Lyapunov energy functions used by techniques such as the Hopfield artificial neural network and the Bidirectional Associative Memory (Bam). The proposed function considers three terms: i) the intra-cluster distance, which allows to evaluate cluster compactness; ii) the inter-cluster distance, which reflects cluster separability; and iii) entropy estimation of the clustering model, which permits the evaluation of the level of uncertainty in data streams. A first set of experiments illustrate the proposed function applied to scenarios of continuous evaluation of data stream clustering. Further experiments were conducted to compare this new function to well-established clustering indices and results confirm our proposal reflects the same information obtained with external clustering indices.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18

Similar content being viewed by others

Notes

  1. In our experiments, we set \(\delta = 0.001\).

  2. In our experiments, we set \(\eta = 10^{-12}\).

  3. x \(=\) rnorm(mean \(=\) 0, sd \(=\) 0.1, n \(=\) 10000) of the R tool (R Development Core Team 2011).

  4. This is performed by using the R command \(x = sin(seq(1,50,\mathrm{length}=10{,}000)) * seq(1{,}1{,}000,\mathrm{length}=10{,}000)\).

References

  • Ackley D, Hinton G, Sejnowski T (1985) A learning algorithm for Boltzmann machines. Cogn Sci 9(1): 147–169

    Article  Google Scholar 

  • Aggarwal C, Han J, Wang J, Yu P (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, Berlin, Germany, pp 81–92

  • Albertini M, Mello R (2007) A self-organizing neural network for detecting novelties. In: Proceedings of the 2007 ACM symposium on Applied computing, New York, USA, pp 462–466. http://doi.acm.org/10.1145/1244002.1244110

  • Albertini M, Mello R (2008) Nature-inspired informatics for intelligent applications and knowledge discovery: implications in business, science and engineering. IGI Global, chap A self-organizing neural network to approach novelty detection, pp 49–71

  • Allcock B, Foster I, Nefedova V, Chervenak A, Deelman E, Kesselman C, Lee J, Sim A, Shoshani A, Williams B (2001) High-performance remote access to climate simulation data: a challenge problem for data grid technologies. In: Proceedings of the ACM/IEEE (2001) Conference supercomputing. Denver, USA, pp 1–20

  • Alligood K, Sauer T, Yorke J (1996) Chaos: an introduction to dynamical systems. Springer, Berlin

    MATH  Google Scholar 

  • Amigó E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retriev 12:461–486

    Article  Google Scholar 

  • Biehl M, Ghosh A, Hammer B (2006) Learning vector quantization: The dynamics of winner-takes-all algorithms. Neurocomputing 69(7–9):660–670. http://www.sciencedirect.com/science/article/pii/S0925231205003218

    Google Scholar 

  • Bifet A, Holmes G, Kirkby R, Pfahringer B (2010a) Moa: Massive online analysis. J Mach Learn Res 11:1601–1604

    Google Scholar 

  • Bifet A, Holmes G, Pfahringer B, Kranen P, Kremer H, Jansen T, Seidl T (2010b) Moa: Massive online analysis, a framework for stream classification and clustering. In: Journal of Machine Learning Research (JMLR) workshop and conference proceedings: workshop on applications of pattern analysis. J Mach Learn Res 11:44–50

    Google Scholar 

  • Bifet A, Holmes G, Pfahringer B, Read J, Kranen P, Kremer H, Jansen T, Seidl T (2011) Moa: a real-time analytics open source framework. In: Machine learning and knowledge discovery in databases, lecture notes in computer science, vol 6913. Springer, Berlin, pp 617–620

  • Box GEP, Jenkins GM (1976) Time series analysis, forecasting, and control. Holden-Day, San Francisco

    MATH  Google Scholar 

  • Cao F, Ester M, Qian W, Zhou A (2006) In density-based clustering over an evolving data stream with noise. SDM

  • Carpenter GA, Grossberg S, Rosen D (1991) Art 2-A: an adaptive resonance algorithm for rapid category learning and recognition. IJCNN-91-Seattle international joint conference on neural networks, 1991 ii, vol. 2, pp 151–156. http://dx.doi.org/10.1109/IJCNN.1991.155329

  • Freeman JA, Skapura DM (1991) Neural networks: algorithms, applications, and programming techniques. Addison Wesley, Redwood City

    MATH  Google Scholar 

  • Fritzke B (1995) A growing neural gas network learns topologies. In: Advances in neural information processing systems, vol 7. MIT Press, Cambridge, pp 625–632. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4273

  • Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of the (1998) ACM SIGMOD international conference on Management of data. ACM, New York, pp 73–84

  • Guha S, Meyerson A, Mishra N, Motwani R, O’Callaghan L (2003) Clustering data streams: Theory and practice. IEEE Trans Knowl Data Eng 15(3):515–528

    Article  Google Scholar 

  • Hartigan J, Wong M (1979) Algorithm as 136: a k-means clustering algorithm. J R Stat Soc Ser C (Applied Statistics) 28(1):100–108

    MATH  Google Scholar 

  • Haykin S (1999) Neural networks: a comprehensive foundation. Prentice Hall, Englewood Cliffs

    MATH  Google Scholar 

  • Hinneburg A, Keim D (1999) Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th international conference on very large data bases, San Francisco, USA, pp 506–517

  • Holt CC (2004) Forecasting seasonals and trends by exponentially weighted moving averages. Int J Forecast 20(1):5–10

    Article  Google Scholar 

  • Itti L, Baldi P (2005) A principled approach to detecting surprising events in video. In: Proceedings of the (2005) IEEE Computer Society conference on computer vision and pattern recognition. Washington, USA, pp 631–637

    Google Scholar 

  • Jain A (2010) Data clustering: 50 years beyond K-means. Pattern Recogn Lett 31(8):651–666

    Article  Google Scholar 

  • Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

    Article  Google Scholar 

  • Kifer D, Ben-David S, Gehrke J (2004) Detecting change in data streams. In Proceedings of the thirtieth international conference on very large data bases-volume. VLDB Endowment, pp 180–191

  • Kohonen T (1997) Self-organizing maps. Springer, Secaucus

    Book  MATH  Google Scholar 

  • Kosko B (1988) Bidirectional associative memories. IEEE Trans Syst Man Cybern 18(1):49–60

    Article  MathSciNet  Google Scholar 

  • Kranen P, Assent I, Baldauf C, Seidl T (2011) The clustree: Indexing micro-clusters for anytime stream mining. In: Knowl Inf Syst J 29(2):249–272

  • Lindstrom P, Delany S, Mac Namee B (2010) Handling concept drift in a text data stream constrained by high labelling cost. In: Proceedings of the Florida Artificial Intelligence Research Society conference, Daytona Beach, USA, pp 1–8

  • Ma J, Perkins S (2003) Online novelty detection on temporal sequences. In: KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM Press, New York, pp 613–618. http://doi.acm.org/10.1145/956750.956828

  • Marsland S (2002) On-line novelty detection through self-organisation, with application to inspection robotics. PhD thesis, University of Manchester

  • Masud M, Chen Q, Gao J, Khan L, Han J, Thuraisingham B (2010) Classification and novel class detection of data streams in a dynamic feature space. Mach Learn Knowl Discov Datab 6322:337–352

    Article  Google Scholar 

  • Meila M (2003) Comparing clusterings by the variation of information, pp 173–187. http://www.springerlink.com/content/rdda4r5bm9ajlbw3

  • O’Callaghan L, Meyerson A, Motwani R, Mishra N, Guha S (2002a) Streaming-data algorithms for high-quality clustering. In: ICDE, pp 685–694

  • O’Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002b) Streaming-data algorithms for high-quality clustering. In: Proceedings of the 18th IEEE international conference on data engineering, San Jose, California, pp 685–694

  • Pavlidis N, Tasoulis D, Adams N, Hand D (2011) [lambda]-Perceptron: an adaptive classifier for data streams. Pattern Recogn 44:78–96

    Article  MATH  Google Scholar 

  • Pincus SM (1991) Approximate entropy as a measure of system complexity. Proc Natl Acad Sci USA 88:2297–2301

    Article  MathSciNet  MATH  Google Scholar 

  • R Development Core Team (2011) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

  • Roberts SW (1966) A comparison of some control chart procedures. Technometrics 8(3):411–430. http://www.jstor.org/stable/1266688

    Google Scholar 

  • Rousseeuw PJ (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65. DOI:10.1016/0377-0427(87)90125-7. http://www.sciencedirect.com/science/article/pii/037704278790-1257

  • Seneta E (2006) Non-negative matrices and Markov chains (Springer Series in Statistics), 2nd edn. Springer, Berlin. http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0387297650

  • Shannon C (1948) A mathematical theory of communication. Bell Syst Tech J 27(379–423):623–656

    Article  MathSciNet  Google Scholar 

  • Sheikholeslami G, Chatterjee S, Zhang A (1998) Wavecluster: a multi-resolution clustering approach for very large spatial databases. In: Proceedings of the international conference on very large data bases, San Francisco, USA, pp 428–439

  • Spinosa E, de Leon F et al (2008) Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In: Proceedings of the (2008) ACM symposium on applied computing. ACM, New York, pp 976–980

  • Stanley JC (1976) Computer simulation of a model of habituation. Nature 261:146–148. doi:10.1038/261146a0. http://www.nature.com/nature/journal/v261/n5556/abs/261146a0.html

    Google Scholar 

  • Swenson G Jr, Kellermann K (1975) An intercontinental array—a next-generation radio telescope. Science 188(4195):1263

    Article  Google Scholar 

  • Tyson T, Pike R, Stein M, Szalay A (2002) Managing and mining the Lsst data sets. Tech. rep, The Lsst Collaboration

    Google Scholar 

  • Wang W, Yang J, Muntz R (1997) Sting: A statistical information grid approach to spatial data mining. In: Proceedings of the 23rd International Conference on Very Large Data Bases, San Francisco, USA, pp 186–195

  • Winters PR (1960) Forecasting sales by exponentially weighted moving averages. Manage Sci 6(3):324–342

    Article  MathSciNet  MATH  Google Scholar 

  • Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678

    Article  Google Scholar 

  • Young S, Arel I, Karnowski TP, Rose D (2010) A fast and stable incremental clustering algorithm. In: Third international conference on information technology: new generations. IEEE Computer Society, Los Alamitos, pp 204–209. http://doi.ieeecomputersociety.org/10.1109/ITNG.2010.148

  • Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, vol 2. ACM, New York, pp 103–114

Download references

Acknowledgments

This paper is based upon work supported by FAPESP—São Paulo Research Foundation, Brazil, under grant no. 2011/19459-8, “CAPES—Coordenadoria de Aperfeiçoamento de Pessoal de Nível Superior” research funding agency under grant no. PDEE-4443-08-0, CNPq—National Council for Scientific and Technological Development research funding agency under grant no. 304338/2008-7. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of FAPESP, CAPES and CNPq.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Marcelo Keese Albertini.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Albertini, M.K., de Mello, R.F. Energy-based function to evaluate data stream clustering. Adv Data Anal Classif 7, 435–464 (2013). https://doi.org/10.1007/s11634-013-0145-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11634-013-0145-3

Keywords

Mathematics Subject Classification

Navigation