Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4394)

Abstract

Clustering short texts is a difficult task in itself, and the narrow-domain characteristic poses an additional challenge for current clustering methods. We address this problem with a new measure of distance between documents, based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate the distance between two probability distributions, we have adapted it to obtain a distance value between two documents. We carried out experiments on two different narrow-domain corpora, and our findings indicate that this measure can be used for the addressed problem, obtaining results comparable to those obtained with the Jaccard similarity measure.
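As a rough illustration of the two measures named in the abstract, the following minimal Python sketch computes a symmetric Kullback-Leibler distance between two tokenized documents, alongside the Jaccard similarity of their term sets. It is not the adaptation proposed in the paper: the function names, the additive smoothing constant `epsilon`, and the choice of the union of both documents' terms as the shared vocabulary are all assumptions made here for the example.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, epsilon=1e-3):
    """Smoothed relative term frequencies over a shared vocabulary.

    epsilon is an assumed additive-smoothing constant; it keeps every
    probability strictly positive so the logarithms below are defined.
    """
    counts = Counter(tokens)
    total = len(tokens) + epsilon * len(vocabulary)
    return {term: (counts[term] + epsilon) / total for term in vocabulary}


def symmetric_kl(tokens_a, tokens_b, epsilon=1e-3):
    """Symmetric KL distance: D(P||Q) + D(Q||P) = sum (p - q) * log(p / q)."""
    # Assumption: the vocabulary is the union of the two documents' terms.
    vocabulary = set(tokens_a) | set(tokens_b)
    p = term_distribution(tokens_a, vocabulary, epsilon)
    q = term_distribution(tokens_b, vocabulary, epsilon)
    return sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in vocabulary)


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between the term sets of two documents."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here for two empty documents
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Two made-up short texts, used only to exercise the functions.
    doc_a = "gene expression in cancer cells".split()
    doc_b = "protein expression in tumor cells".split()
    print("symmetric KL distance:", symmetric_kl(doc_a, doc_b))
    print("Jaccard similarity:", jaccard_similarity(doc_a, doc_b))
```

Note that the first value is a distance (0 for identical term distributions, larger for more dissimilar ones), while the second is a similarity in the range 0 to 1, so the two are not directly interchangeable inside a clustering algorithm without a conversion step.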

This work has been partially supported by the MCyT TIN2006-15265-C06-04 project, as well as by the BUAP-701 PROMEP/103.5/05/1536 grant.

Editor information

Alexander Gelbukh

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pinto, D., Benedí, JM., Rosso, P. (2007). Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_54

  • DOI: https://doi.org/10.1007/978-3-540-70939-8_54

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-70938-1

  • Online ISBN: 978-3-540-70939-8

  • eBook Packages: Computer Science, Computer Science (R0)
