Tri-training and Data Editing Based Semi-supervised Clustering Algorithm

  • Conference paper
MICAI 2006: Advances in Artificial Intelligence (MICAI 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4293)

Abstract

Seed-based semi-supervised clustering algorithms often use a seed set, consisting of a small amount of labeled data, to initialize cluster centroids and thereby improve clustering performance over the whole data set. Research indicates that both the size and the quality of the seed set strongly limit the performance of semi-supervised clustering. This paper proposes a novel semi-supervised clustering algorithm named DE-Tri-training semi-supervised K-means. In the new algorithm, before the cluster centroids are initialized, the training process of a semi-supervised classification approach named Tri-training is used to label unlabeled data and add them to the initial seeds, enlarging the seed set. Meanwhile, to improve the quality of the enlarged seed set, a nearest-neighbor-rule-based data editing technique named Depuration is introduced into the Tri-training process to remove or correct noisy and mislabeled data among the enlarged seeds. Experiments show that the new algorithm can effectively improve the initialization of cluster centroids and enhance clustering performance.
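
The following is a minimal illustrative sketch of two of the ideas named in the abstract, not the authors' implementation. It assumes Depuration in the form it is commonly described: a sample is relabeled when at least k' of its k nearest neighbors agree on one class, and discarded otherwise. It also assumes that seeded centroid initialization means taking the per-class mean of the seeds. The Tri-training labeling step is omitted, and the function names `depuration` and `seed_centroids`, the parameters `k` and `k_prime`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def depuration(points, labels, k=3, k_prime=2):
    """Nearest-neighbor editing sketch (assumed form of Depuration).

    A point is kept, with the majority label of its k nearest neighbors,
    when that label has at least k_prime votes; otherwise it is discarded.
    Votes always use the original labels, not the edited ones.
    """
    kept_points, kept_labels = [], []
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        votes = Counter(labels[j] for j in others[:k])
        label, count = votes.most_common(1)[0]
        if count >= k_prime:
            kept_points.append(p)
            kept_labels.append(label)
    return kept_points, kept_labels


def seed_centroids(points, labels):
    """Seed-based initialization: one centroid per class, the mean of its seeds."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {
        y: tuple(sum(coord) / len(ps) for coord in zip(*ps))
        for y, ps in groups.items()
    }


# Toy seed set: two well-separated groups; the last seed lies inside
# group "a" but carries the (wrong) label "b".
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (0.5, 0.5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

raw_centroids = seed_centroids(points, labels)
edited_points, edited_labels = depuration(points, labels, k=3, k_prime=2)
edited_centroids = seed_centroids(edited_points, edited_labels)
print(raw_centroids["b"], edited_centroids["b"])
```

In this toy case the mislabeled seed pulls the unedited "b" centroid toward the "a" group, whereas after editing the seed is relabeled and the "b" centroid is computed from the three genuine "b" seeds only. This is the kind of effect on centroid initialization that the abstract attributes to improving seed quality.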

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Deng, C., Guo, M.Z. (2006). Tri-training and Data Editing Based Semi-supervised Clustering Algorithm. In: Gelbukh, A., Reyes-Garcia, C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006. Lecture Notes in Computer Science, vol 4293. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11925231_61

  • DOI: https://doi.org/10.1007/11925231_61

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49026-5

  • Online ISBN: 978-3-540-49058-6

  • eBook Packages: Computer Science, Computer Science (R0)
