Comparing Document Classification Schemes Using K-Means Clustering

  • Conference paper
Knowledge-Based Intelligent Information and Engineering Systems (KES 2008)

Abstract

In this work, we jointly apply several text mining methods to a corpus of legal documents in order to compare the separation quality of two inherently different document classification schemes. The classification schemes are compared with the clusters produced by the K-means algorithm. We expect that, in future work, this comparison method can be coupled with semi-supervised and active learning techniques. The paper also presents the idea of combining K-means and Principal Component Analysis for cluster visualization. This combination allows the calculations to be performed in a reasonable amount of CPU time.
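The general idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the synthetic data stands in for document vectors, `scheme_a` and `scheme_b` are hypothetical label sets, cluster purity is used as one possible agreement measure, and the deterministic farthest-first seeding replaces whatever initialization the authors used. The sketch clusters the vectors with plain Lloyd iterations, scores each labeling scheme against the clusters, and projects the data to two dimensions with PCA for plotting.

```python
import numpy as np

def farthest_first_init(X, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly take the point farthest from all seeds."""
    idx = [0]
    d2 = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(d2.argmax())
        idx.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[idx].copy()

def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations with squared Euclidean distance."""
    centroids = farthest_first_init(X, k)
    for _ in range(n_iter):
        # distance from every point to every centroid, then nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

def purity(cluster_labels, class_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    total = 0
    for c in np.unique(cluster_labels):
        _, counts = np.unique(class_labels[cluster_labels == c],
                              return_counts=True)
        total += counts.max()
    return total / len(cluster_labels)

def pca_project(X, n_components=2):
    """Project centered data onto its leading principal directions via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

# Synthetic stand-in for document vectors: three well-separated groups
# of 30 points each in 20 dimensions.
rng = np.random.default_rng(0)
centers = 10.0 * np.eye(3, 20)
X = np.vstack([centers[j] + 0.5 * rng.standard_normal((30, 20))
               for j in range(3)])

scheme_a = np.repeat(np.arange(3), 30)  # hypothetical scheme following the groups
scheme_b = np.arange(90) % 3            # hypothetical scheme unrelated to the groups

clusters, _ = kmeans(X, k=3)
purity_a = purity(clusters, scheme_a)
purity_b = purity(clusters, scheme_b)
coords = pca_project(X, 2)              # 2-D coordinates for a scatter plot

print("purity vs scheme A:", purity_a)
print("purity vs scheme B:", purity_b)
```

A scheme whose classes follow the cluster structure should score a higher purity than one that cuts across it, and the 2-D coordinates in `coords` can be colored by cluster or by either scheme to inspect the separation visually. For real text data, the dense distance computation above would need to be replaced by a sparse representation.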

Editor information

Ignac Lovrek, Robert J. Howlett, Lakhmi C. Jain

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Šilić, A., Moens, M.-F., Žmak, L., Bašić, B.D. (2008). Comparing Document Classification Schemes Using K-Means Clustering. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2008. Lecture Notes in Computer Science, vol 5177. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85563-7_78

  • DOI: https://doi.org/10.1007/978-3-540-85563-7_78

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85562-0

  • Online ISBN: 978-3-540-85563-7

  • eBook Packages: Computer Science, Computer Science (R0)
