Skip to main content

Classification of Short Texts by Deploying Topical Annotations

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7224))

Abstract

We propose a novel approach to the classification of short texts based on two factors: the use of Wikipedia-based annotators that have been recently introduced to detect the main topics present in an input text, represented via Wikipedia pages, and the design of a novel classification algorithm that measures the similarity between the input text and each output category by deploying only their annotated topics and the Wikipedia link-structure. Our approach waives the common practice of expanding the feature-space with new dimensions derived either from explicit or from latent semantic analysis. As a consequence it is simple and maintains a compact intelligible representation of the output categories. Our experiments show that it is efficient in construction and query time, accurate as state-of-the-art classifiers (see e.g. Phan et al. WWW ’08), and robust with respect to concept drifts and input sources.

This work has been supported in part by MIUR PRIN MadWeb, MIUR FIRB Linguistica ’06, and a Google Faculty Award 2010.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Banerjee, S., Ramanathan, K., Gupta, A.: Clustering Short Texts using Wikipedia. In: ACM SIGIR, pp. 787–788 (2007)

    Google Scholar 

  2. Bollegala, D., Matsuo, Y., Ishizuka, M.: Measuring semantic similarity between words using Web Search engines. In: WWW, pp. 757–766 (2007)

    Google Scholar 

  3. Cilibrasi, R., Vitanyi, P.: The Google similarity distances. IEEE Trans. on Knowl. and Data Eng. 19(3), 370–383 (2007)

    Article  Google Scholar 

  4. Ferragina, P., Scaiella, U.: TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In: ACM CIKM, pp. 1625–1628 (2010)

    Google Scholar 

  5. Gabrilovich, E., Markovitch, S.: Feature generation for text categorization using world knowledge. In: Int. Joint Conference on A.I, pp. 1048–1053 (2005)

    Google Scholar 

  6. Gabrilovich, E., Markovitch, S.: Wikipedia-based Semantic Interpretation for Natural Language Processing. J. Artif. Intell. Res. 34, 443–498 (2009)

    MATH  Google Scholar 

  7. Genc, Y., Sakamoto, Y., Nickerson, J.V.: Discovering Context: Classifying Tweets through a Semantic Transform Based on Wikipedia. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) FAC 2011. LNCS, vol. 6780, pp. 484–492. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  8. Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust Disambiguation of Named Entities in Text. In: EMNLP, pp. 782–792 (2011)

    Google Scholar 

  9. Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation of Wikipedia entities in web text. In: ACM KDD, pp. 457–466 (2009)

    Google Scholar 

  10. Medelyan, O., Milne, D., Legg, C., Witten, I.H.: Mining meaning from Wikipedia. Int. J. Hum.-Comput. Stud. 67(9), 716–754 (2009)

    Article  Google Scholar 

  11. Milne, D., Witten, I.H.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: AAAI Workshop on Wikipedia and Artificial Intelligence (2008)

    Google Scholar 

  12. Phan, X.H., Nguyen, L.M., Houriguchi, S.: Learning to Classify Short and Sparse Text & Web with Hiddent Topics from Large-scale Data Collections. In: WWW, pp. 91–100 (2008)

    Google Scholar 

  13. Sahami, M., Heilman, T.D.: A web-based kernel function for measuring the similarity of short text snippets. In: WWW, pp. 377–386 (2006)

    Google Scholar 

  14. Schlimmer, J.C., Graner, R.H.: Beyond Incremental Processing: Tracking Concept Drift. In: AAAI, pp. 502–507 (1986)

    Google Scholar 

  15. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)

    Article  MathSciNet  Google Scholar 

  16. Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., Demirbas, M.: Short text classification in twitter to improve information filtering. In: ACM SIGIR, pp. 841–842 (2010)

    Google Scholar 

  17. Strube, M., Ponzetto, S.P.: WikiRelate! Computing Semantic Relatedness Using Wikipedia. In: AAAI, pp. 1419–1424 (2006)

    Google Scholar 

  18. Sun, X., Haofen, W., Yong, Y.: Towards effective short text deep classification. In: ACM SIGIR, pp. 1143–1144 (2011)

    Google Scholar 

  19. Zelikovitz, S., Hirsh, H.: Improving short-text classification using unlabeled data for classification problems. In: ICML, pp. 1191–1198 (2000)

    Google Scholar 

  20. Zelikovitz, S., Marquez, F.: Transductive Learning for Short-Text Classification problems using Latent Semantic Indexing. IJPRAI 19(2), 146–163 (2005)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Vitale, D., Ferragina, P., Scaiella, U. (2012). Classification of Short Texts by Deploying Topical Annotations. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_32

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-28997-2_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28996-5

  • Online ISBN: 978-3-642-28997-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics