Classification of Short Texts by Deploying Topical Annotations

Vitale, Daniele; Ferragina, Paolo; Scaiella, Ugo

doi:10.1007/978-3-642-28997-2_32

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7224)

Abstract

We propose a novel approach to the classification of short texts based on two factors: the use of recently introduced Wikipedia-based annotators, which detect the main topics present in an input text and represent them via Wikipedia pages, and the design of a novel classification algorithm that measures the similarity between the input text and each output category by deploying only their annotated topics and the Wikipedia link-structure. Our approach waives the common practice of expanding the feature-space with new dimensions derived from either explicit or latent semantic analysis. As a consequence, it is simple and maintains a compact, intelligible representation of the output categories. Our experiments show that it is efficient in construction and query time, as accurate as state-of-the-art classifiers (see, e.g., Phan et al., WWW ’08), and robust with respect to concept drifts and input sources.
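
The idea of comparing an input text with each category using only annotated topics and Wikipedia links can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes that topics have already been produced by an annotator such as TAGME, that each category is represented by a small set of Wikipedia topics, and that relatedness between two pages is the link-based measure of Milne and Witten listed in the references. The names `relatedness`, `classify`, `inlinks`, and `category_topics`, the scoring rule (average of best matches), and all the toy data are hypothetical.

```python
import math


def relatedness(in_a, in_b, n_pages):
    """Milne-Witten style relatedness of two pages from their in-link sets.

    in_a, in_b: sets of page ids that link to each page.
    n_pages: total number of pages (must exceed the smaller in-link set).
    Returns a value clipped to [0, 1].
    """
    common = len(in_a & in_b)
    if common == 0:
        return 0.0  # no shared in-links: treat the pages as unrelated
    big = max(len(in_a), len(in_b))
    small = min(len(in_a), len(in_b))
    distance = (math.log(big) - math.log(common)) / (
        math.log(n_pages) - math.log(small)
    )
    return max(0.0, 1.0 - distance)


def classify(text_topics, category_topics, inlinks, n_pages):
    """Pick the category whose topics are most related to the text's topics.

    Assumed scoring rule: for each topic of the text take its best match
    among the category's topics, then average over the text's topics.
    """
    def score(cat_topics):
        if not text_topics or not cat_topics:
            return 0.0
        total = 0.0
        for t in text_topics:
            total += max(
                relatedness(inlinks.get(t, set()), inlinks.get(c, set()), n_pages)
                for c in cat_topics
            )
        return total / len(text_topics)

    scores = {cat: score(topics) for cat, topics in category_topics.items()}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): integers stand for pages that link to each topic.
n_pages = 1000
inlinks = {
    "Football": {1, 2, 3, 4},
    "Goalkeeper": {2, 3, 4, 5},
    "Stock market": {10, 11, 12},
    "Interest rate": {11, 12, 13},
}
category_topics = {"Sports": ["Football"], "Business": ["Stock market"]}

label, scores = classify(["Goalkeeper"], category_topics, inlinks, n_pages)
print(label)  # Sports
```

In this sketch each category stays a short, readable list of Wikipedia topics, which mirrors the compact representation the abstract emphasizes; no extra feature dimensions are built.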

This work has been supported in part by MIUR PRIN MadWeb, MIUR FIRB Linguistica ’06, and a Google Faculty Award 2010.

References

Banerjee, S., Ramanathan, K., Gupta, A.: Clustering Short Texts using Wikipedia. In: ACM SIGIR, pp. 787–788 (2007)
Bollegala, D., Matsuo, Y., Ishizuka, M.: Measuring semantic similarity between words using Web Search engines. In: WWW, pp. 757–766 (2007)
Cilibrasi, R., Vitanyi, P.: The Google similarity distance. IEEE Trans. on Knowl. and Data Eng. 19(3), 370–383 (2007)
Ferragina, P., Scaiella, U.: TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In: ACM CIKM, pp. 1625–1628 (2010)
Gabrilovich, E., Markovitch, S.: Feature generation for text categorization using world knowledge. In: Int. Joint Conference on A.I., pp. 1048–1053 (2005)
Gabrilovich, E., Markovitch, S.: Wikipedia-based Semantic Interpretation for Natural Language Processing. J. Artif. Intell. Res. 34, 443–498 (2009)
Genc, Y., Sakamoto, Y., Nickerson, J.V.: Discovering Context: Classifying Tweets through a Semantic Transform Based on Wikipedia. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) FAC 2011. LNCS, vol. 6780, pp. 484–492. Springer, Heidelberg (2011)
Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust Disambiguation of Named Entities in Text. In: EMNLP, pp. 782–792 (2011)
Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation of Wikipedia entities in web text. In: ACM KDD, pp. 457–466 (2009)
Medelyan, O., Milne, D., Legg, C., Witten, I.H.: Mining meaning from Wikipedia. Int. J. Hum.-Comput. Stud. 67(9), 716–754 (2009)
Milne, D., Witten, I.H.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: AAAI Workshop on Wikipedia and Artificial Intelligence (2008)
Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. In: WWW, pp. 91–100 (2008)
Sahami, M., Heilman, T.D.: A web-based kernel function for measuring the similarity of short text snippets. In: WWW, pp. 377–386 (2006)
Schlimmer, J.C., Granger, R.H.: Beyond Incremental Processing: Tracking Concept Drift. In: AAAI, pp. 502–507 (1986)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)
Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., Demirbas, M.: Short text classification in twitter to improve information filtering. In: ACM SIGIR, pp. 841–842 (2010)
Strube, M., Ponzetto, S.P.: WikiRelate! Computing Semantic Relatedness Using Wikipedia. In: AAAI, pp. 1419–1424 (2006)
Sun, X., Haofen, W., Yong, Y.: Towards effective short text deep classification. In: ACM SIGIR, pp. 1143–1144 (2011)
Zelikovitz, S., Hirsh, H.: Improving short-text classification using unlabeled data for classification problems. In: ICML, pp. 1191–1198 (2000)
Zelikovitz, S., Marquez, F.: Transductive Learning for Short-Text Classification problems using Latent Semantic Indexing. IJPRAI 19(2), 146–163 (2005)

Author information

Authors and Affiliations

Dipartimento di Informatica, University of Pisa, Italy
Daniele Vitale, Paolo Ferragina & Ugo Scaiella

Editor information

Editors and Affiliations

Yahoo! Research, Diagonal 177, 08018, Barcelona, Spain
Ricardo Baeza-Yates & B. Barla Cambazoglu
Centrum Wiskunde & Informatica, Science Park 123, Amsterdam, The Netherlands
Arjen P. de Vries
Websays, Nàpols 294 7-4, 08025, Barcelona, Spain
Hugo Zaragoza
Yahoo! Research, Diagonal 177, 08018, Barcelona, Spain
Vanessa Murdock
Yahoo! Labs, Tower 3, Matam Park, 31905, Haifa, Israel
Ronny Lempel
ISTI-CNR, via G. Moruzzi, 1, 56124, Pisa, Italy
Fabrizio Silvestri

About this paper

Cite this paper

Vitale, D., Ferragina, P., Scaiella, U. (2012). Classification of Short Texts by Deploying Topical Annotations. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_32

DOI: https://doi.org/10.1007/978-3-642-28997-2_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28996-5
Online ISBN: 978-3-642-28997-2
eBook Packages: Computer Science, Computer Science (R0)
