Abstract
Clustering short texts is a difficult task in itself, and restricting them to a narrow domain poses an additional challenge for current clustering methods. We address this problem with a new measure of distance between documents based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate the distance between two probability distributions, we have adapted it to obtain a distance value between two documents. We have carried out experiments on two different narrow-domain corpora, and our findings indicate that this measure can be used for the addressed problem, obtaining results comparable to those obtained with the Jaccard similarity measure.
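As an illustration of the idea, the following minimal sketch computes a symmetric Kullback-Leibler distance between two tokenized documents, together with the Jaccard similarity used as the baseline. It is not the implementation evaluated in the paper: the function names, the additive smoothing with parameter `alpha`, and the use of the joint vocabulary of the two documents are assumptions made here so that the logarithm stays defined when a term is missing from one document.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, alpha=0.01):
    """Additively smoothed term distribution over a shared vocabulary.

    The smoothing scheme is an assumption of this sketch; alpha must be > 0
    so that every term in the vocabulary receives a non-zero probability.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {term: (counts[term] + alpha) / total for term in vocabulary}


def symmetric_kl(tokens_a, tokens_b, alpha=0.01):
    """Symmetric KL distance: sum over terms of (P - Q) * log(P / Q).

    This equals KL(P || Q) + KL(Q || P) for the two smoothed distributions.
    """
    vocabulary = set(tokens_a) | set(tokens_b)
    p = term_distribution(tokens_a, vocabulary, alpha)
    q = term_distribution(tokens_b, vocabulary, alpha)
    return sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in vocabulary)


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between the term sets of two documents."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc1 = "gene expression in cancer cells".split()
    doc2 = "protein expression in tumour cells".split()
    print(symmetric_kl(doc1, doc2))
    print(jaccard_similarity(doc1, doc2))
```

Note that the first function returns a distance (0 for identical term distributions, larger for more dissimilar documents), whereas Jaccard is a similarity in [0, 1], so a clustering algorithm would need one of the two converted before they can be used interchangeably.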
This work has been partially supported by the MCyT TIN2006-15265-C06-04 project, as well as by the BUAP-701 PROMEP/103.5/05/1536 grant.
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pinto, D., Benedí, JM., Rosso, P. (2007). Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_54
DOI: https://doi.org/10.1007/978-3-540-70939-8_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70938-1
Online ISBN: 978-3-540-70939-8
eBook Packages: Computer Science, Computer Science (R0)