Abstract
In this work, we jointly apply several text mining methods to a corpus of legal documents in order to compare the separation quality of two inherently different document classification schemes. Each classification scheme is compared with the clusters produced by the K-means algorithm. We believe that, in the future, our comparison method can be coupled with semi-supervised and active learning techniques. This paper also presents the idea of combining K-means and Principal Component Analysis for cluster visualization, which allows the calculations to be performed in a reasonable amount of CPU time.
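As a rough illustration of the comparison described in the abstract, the following sketch clusters a small synthetic dataset with K-means, scores two label assignments against the resulting clusters, and projects the data to two dimensions with PCA. Everything here is an assumption made for illustration: the abstract does not name the agreement measure (purity is used here), does not say how K-means and PCA are combined (here PCA is fitted on the cluster centroids only, which keeps the decomposition cheap), and the data, the function names, and the deterministic farthest-point seeding are all hypothetical.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic seeding (an assumption): start at X[0], then repeatedly
    take the point farthest from all centroids chosen so far."""
    chosen = [0]
    d2 = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = int(d2.argmax())
        chosen.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[chosen].copy()

def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    centroids = farthest_point_init(X, k)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

def pca_project(X, centroids, n_components=2):
    """Fit PCA on the k centroids only (an assumption), then project all points."""
    mean = centroids.mean(axis=0)
    _, _, vt = np.linalg.svd(centroids - mean, full_matrices=False)
    return (X - mean) @ vt[:n_components].T

def purity(cluster_labels, class_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    total = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(class_labels)

# Synthetic stand-in for document vectors: three well-separated groups in 5-D.
rng = np.random.default_rng(0)
centers = 10.0 * np.eye(5)[:3]
X = np.vstack([c + 0.5 * rng.standard_normal((30, 5)) for c in centers])

# Two hypothetical classification schemes over the same 90 "documents".
scheme_a = np.repeat(np.arange(3), 30)   # follows the group structure
scheme_b = np.arange(90) % 3             # unrelated to the group structure

labels, centroids = kmeans(X, k=3)
purity_a = purity(labels, scheme_a)
purity_b = purity(labels, scheme_b)
projected = pca_project(X, centroids)    # 2-D coordinates for plotting

print(f"purity of scheme A: {purity_a:.3f}")
print(f"purity of scheme B: {purity_b:.3f}")
```

A scheme whose classes line up with the K-means clusters scores close to 1, while a scheme unrelated to the cluster structure scores close to the share of its largest class. The `projected` array can be passed to any 2-D scatter plot, colored by either the cluster labels or the labels of one scheme.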
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Šilić, A., Moens, M.-F., Žmak, L., Bašić, B.D. (2008). Comparing Document Classification Schemes Using K-Means Clustering. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2008. Lecture Notes in Computer Science, vol 5177. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85563-7_78
DOI: https://doi.org/10.1007/978-3-540-85563-7_78
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85562-0
Online ISBN: 978-3-540-85563-7
eBook Packages: Computer Science (R0)