Abstract
Seed-based semi-supervised clustering algorithms often use a seed set, consisting of a small amount of labeled data, to initialize cluster centroids and thereby improve clustering performance over the whole data set. Research indicates that both the size and the quality of the seed set strongly limit the performance of semi-supervised clustering. This paper proposes a novel semi-supervised clustering algorithm named DE-Tri-training semi-supervised K-means. In the new algorithm, before the cluster centroids are initialized, the training process of a semi-supervised classification approach named Tri-training is used to label unlabeled data and add them to the initial seeds, enlarging the seed set. Meanwhile, to improve the quality of the enlarged seed set, a nearest-neighbor-rule-based data editing technique named Depuration is introduced into the Tri-training process to remove or correct noisy and mislabeled data among the enlarged seeds. Experiments show that the new algorithm can effectively improve the initialization of cluster centroids and enhance clustering performance.
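The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The Tri-training step is omitted, and the enlarged seed set is assumed to be given already, with one deliberately mislabeled point. The editing rule below follows the usual description of Depuration (relabel a sample to the class held by at least k' of its k nearest neighbors, otherwise discard it). The function names, the parameters `k` and `k_prime`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def depuration(points, labels, k=3, k_prime=2):
    """Nearest-neighbor editing (assumed form of Depuration).

    Each sample takes the majority label of its k nearest neighbors if
    that label has at least k_prime votes; otherwise it is dropped.
    """
    kept_points, kept_labels = [], []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        nearest = sorted(others, key=lambda j: dist(p, points[j]))[:k]
        label, votes = Counter(labels[j] for j in nearest).most_common(1)[0]
        if votes >= k_prime:
            kept_points.append(p)
            kept_labels.append(label)
    return kept_points, kept_labels


def seed_centroids(points, labels):
    """One initial centroid per label: the mean of that label's seeds."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {y: mean(ps) for y, ps in groups.items()}


def seeded_kmeans(data, centroids, max_iter=50):
    """Plain K-means iterations started from the seed centroids."""
    assignment = []
    for _ in range(max_iter):
        assignment = [min(centroids, key=lambda y: dist(p, centroids[y]))
                      for p in data]
        updated = {}
        for y in centroids:
            members = [p for p, a in zip(data, assignment) if a == y]
            updated[y] = mean(members) if members else centroids[y]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, assignment


# Toy enlarged seed set: the point (1, 1) carries the wrong label "b".
seed_points = [(0, 0), (0, 1), (1, 0), (1, 1),
               (10, 10), (10, 11), (11, 10), (11, 11)]
seed_labels = ["a", "a", "a", "b", "b", "b", "b", "b"]
unlabeled = [(0.5, 0.2), (10.2, 10.8)]

clean_points, clean_labels = depuration(seed_points, seed_labels)
init = seed_centroids(clean_points, clean_labels)
final_centroids, assignment = seeded_kmeans(clean_points + unlabeled, init)
print(init)
print(assignment)
```

In this toy case the editing step relabels the mislabeled seed, so the initial centroid for "b" is no longer pulled toward the other cluster. Without editing, that centroid would be computed from five points instead of four.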
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Deng, C., Guo, M.Z. (2006). Tri-training and Data Editing Based Semi-supervised Clustering Algorithm. In: Gelbukh, A., Reyes-Garcia, C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006. Lecture Notes in Computer Science, vol 4293. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11925231_61
DOI: https://doi.org/10.1007/11925231_61
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49026-5
Online ISBN: 978-3-540-49058-6
eBook Packages: Computer Science (R0)