Comparing Document Classification Schemes Using K-Means Clustering

Šilić, Artur; Moens, Marie-Francine; Žmak, Lovro; Bašić, Bojana Dalbelo

doi:10.1007/978-3-540-85563-7_78

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5177)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems

Abstract

In this work, we jointly apply several text mining methods to a corpus of legal documents to compare the separation quality of two inherently different document classification schemes. Each classification scheme is compared with the clusters produced by the K-means algorithm. We believe that our comparison method will, in the future, be coupled with semi-supervised and active learning techniques. This paper also presents the idea of combining K-means and Principal Component Analysis for cluster visualization, which allows the calculations to be performed in a reasonable amount of CPU time.
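The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say how agreement between a classification scheme and the K-means clusters is measured, so cluster purity is used here purely as an assumed stand-in. The toy vectors, the labels `scheme_a` and `scheme_b`, and the helper names `kmeans` and `purity` are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def kmeans(points, k, iters=20):
    """Plain Lloyd-style K-means on lists of floats.

    Assumption for reproducibility: centroids are seeded with evenly
    spaced input points instead of random ones.
    """
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    total = 0
    for c in set(clusters):
        members = [lab for lab, a in zip(labels, clusters) if a == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)


# Toy 2-D "document vectors": two well-separated groups (hypothetical data).
docs = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.1, 5.0], [5.2, 5.1]]
scheme_a = ["law", "law", "law", "tax", "tax", "tax"]  # follows the geometry
scheme_b = ["x", "y", "x", "y", "x", "y"]              # cuts across it

clusters = kmeans(docs, k=2)
purity_a = purity(clusters, scheme_a)
purity_b = purity(clusters, scheme_b)
print(purity_a, purity_b)
```

In this toy setting, the scheme whose classes follow the structure that K-means finds in the vectors scores higher, which is one plausible reading of "separation quality". Real document vectors would be high-dimensional and sparse, and other agreement measures could be substituted for purity.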

Author information

Authors and Affiliations

Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000, Zagreb, Croatia
Artur Šilić, Lovro Žmak & Bojana Dalbelo Bašić
Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001, Heverlee, Belgium
Marie-Francine Moens

Editor information

Ignac Lovrek, Robert J. Howlett, Lakhmi C. Jain

About this paper

Cite this paper

Šilić, A., Moens, M.-F., Žmak, L., Bašić, B.D. (2008). Comparing Document Classification Schemes Using K-Means Clustering. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2008. Lecture Notes in Computer Science, vol 5177. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85563-7_78

DOI: https://doi.org/10.1007/978-3-540-85563-7_78
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85562-0
Online ISBN: 978-3-540-85563-7
eBook Packages: Computer Science, Computer Science (R0)
