Abstract
We study the influence of different clustering algorithms on cluster evolution monitoring in data streams. Capturing and interpreting cluster change delivers indicators of the evolution of the underlying population. For text stream monitoring, the clusters can be summarized into topics, so that cluster monitoring provides insights on the data and decline of thematic subjects over time. However, such insights should always be taken with a grain of salt: the quality of the clusters has a decisive impact on the observed changes. In the simplest case, cluster change across the stream may be due to the low quality of the original cluster rather than to a drift in the population belonging to this cluster. We present our framework Theme Finder for topic evolution monitoring in streams and compare how the cluster quality of two very different clustering algorithms influences the observed evolution. After an evaluation of different clustering algorithms with external and internal quality measures, we use the center-based bisecting k-means algorithm and the density-based DBSCAN algorithm. Our results show that the influence is relatively high, and that the results of one clustering algorithm allow conclusions to be drawn about the evaluation of the other. Our experiments were done on a subarchive of the ACM library.
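The idea of observing cluster change between two periods can be illustrated with a minimal sketch. It is not the Theme Finder method: it assumes that clusters are plain sets of document ids, that the periods overlap so that documents can appear in both, and that a cluster "survives" when its Jaccard overlap with some later cluster reaches a threshold. The names `jaccard`, `match_clusters`, and the threshold of 0.5 are hypothetical choices made for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections of document ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_clusters(old, new, threshold=0.5):
    """For each old cluster, find the most similar new cluster.

    old, new: dicts mapping a cluster label to a set of document ids.
    Returns a dict: old label -> (best new label or None, best overlap).
    An old cluster whose best overlap is below `threshold` is reported
    as having no successor (label None). The threshold is an assumption.
    """
    result = {}
    for old_label, old_docs in old.items():
        best_label, best_score = None, 0.0
        for new_label, new_docs in new.items():
            score = jaccard(old_docs, new_docs)
            if score > best_score:
                best_label, best_score = new_label, score
        if best_score < threshold:
            best_label = None
        result[old_label] = (best_label, best_score)
    return result


# Toy data (invented): clusterings of two consecutive, overlapping periods.
period_1 = {"A": {1, 2, 3, 4}, "B": {5, 6, 7}}
period_2 = {"X": {1, 2, 3, 4, 8}, "Y": {5, 9, 10}}

matches = match_clusters(period_1, period_2)
for label, (successor, overlap) in matches.items():
    print(label, successor, round(overlap, 2))
```

In this toy example cluster A finds a successor while cluster B does not. Whether B's disappearance reflects a real drift or only a poorly formed original cluster cannot be decided from the overlap alone, which is the caveat the abstract raises.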
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Schult, R. (2007). Comparing Clustering Algorithms and Their Influence on the Evolution of Labeled Clusters. In: Wagner, R., Revell, N., Pernul, G. (eds) Database and Expert Systems Applications. DEXA 2007. Lecture Notes in Computer Science, vol 4653. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74469-6_63
DOI: https://doi.org/10.1007/978-3-540-74469-6_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74467-2
Online ISBN: 978-3-540-74469-6
eBook Packages: Computer Science (R0)