Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance

Pinto, David; Benedí, José-Miguel; Rosso, Paolo

doi:10.1007/978-3-540-70939-8_54

David Pinto¹,², José-Miguel Benedí¹ & Paolo Rosso¹

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4394)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Clustering short texts is a difficult task in itself, and the narrow-domain characteristic poses an additional challenge for current clustering methods. We address this problem with a new measure of distance between documents based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate the distance between two probability distributions, we have adapted it to obtain a distance value between two documents. We carried out experiments on two different narrow-domain corpora, and our findings indicate that this measure can be used for the addressed problem, with results comparable to those obtained with the Jaccard similarity measure.
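
As a rough illustration of the idea described in the abstract, the following sketch compares two short documents with a symmetric Kullback-Leibler distance and with Jaccard similarity. It is not the authors' implementation: the function names, the `epsilon` back-off for unseen terms, and the toy documents are assumptions made for this example, and the abstract does not specify how the measure was adapted to documents.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, epsilon=1e-3):
    """Smoothed term distribution of one document over a shared vocabulary.

    Assumption for this sketch: terms absent from the document get a small
    probability `epsilon`, and observed terms share the remaining mass in
    proportion to their frequency, so the values sum to 1.
    """
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary)
    missing = sum(1 for t in vocabulary if counts[t] == 0)
    scale = 1.0 - epsilon * missing
    if total == 0 or scale <= 0.0:
        raise ValueError("empty document or epsilon too large for this vocabulary")
    return {
        t: (scale * counts[t] / total) if counts[t] else epsilon
        for t in vocabulary
    }


def symmetric_kl(p, q):
    """Symmetric KL distance: D(P||Q) + D(Q||P) = sum (p - q) * log(p / q).

    Both arguments must be defined over the same terms with strictly
    positive probabilities, which the smoothing above guarantees.
    """
    return sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in p)


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between the term sets of two documents."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy documents (hypothetical, not taken from the paper's corpora).
doc_a = "clustering short texts narrow domain".split()
doc_b = "clustering abstracts narrow domain corpora".split()
vocabulary = sorted(set(doc_a) | set(doc_b))

p_a = term_distribution(doc_a, vocabulary)
p_b = term_distribution(doc_b, vocabulary)

print("symmetric KL distance:", symmetric_kl(p_a, p_b))
print("Jaccard similarity:   ", jaccard_similarity(doc_a, doc_b))
```

Note that the two measures point in opposite directions: the symmetric KL value is a distance (0 for identical distributions, larger for more dissimilar documents), whereas Jaccard is a similarity in the range 0 to 1. A clustering algorithm that expects one kind of score would need the other converted accordingly.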

This work has been partially supported by the MCyT TIN2006-15265-C06-04 project, as well as by the BUAP-701 PROMEP/103.5/05/1536 grant.

Author information

Authors and Affiliations

Department of Information Systems and Computation, UPV, Valencia 46022, Camino de Vera s/n, Spain
David Pinto, José-Miguel Benedí & Paolo Rosso
Faculty of Computer Science, BUAP, Puebla 72570, Ciudad Universitaria, Mexico
David Pinto

Editor information

Alexander Gelbukh

About this paper

Cite this paper

Pinto, D., Benedí, JM., Rosso, P. (2007). Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_54

DOI: https://doi.org/10.1007/978-3-540-70939-8_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70938-1
Online ISBN: 978-3-540-70939-8
eBook Packages: Computer Science, Computer Science (R0)
