A document is known by the company it keeps: neighborhood consensus for short text categorization

  • Original Paper
  • Language Resources and Evaluation

Abstract

Over the last decades the Web has become the largest repository of digital information. To organize this information, several text categorization methods have been developed, achieving accurate results in most cases and across very different domains. With the recent use of the Internet as a communication medium, short texts such as news, tweets, blogs, and product reviews are becoming more common every day. This setting poses two main challenges. On the one hand, these documents are short, so word frequencies are not informative enough, which makes text categorization even more difficult than usual. On the other hand, topics change constantly and quickly, so adequate amounts of training data are often lacking. To deal with these two problems, we consider a text classification method based on the idea that similar documents may belong to the same category. Specifically, we propose a neighborhood consensus classification method that classifies documents by considering their own information as well as information about the categories assigned to other similar documents from the same target collection. The short texts used in our evaluation are news titles with an average of 8 words. Experimental results are encouraging: they indicate that leveraging information from similar documents improved classification accuracy and that the proposed method is especially useful when labeled training resources are limited.
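
The neighborhood consensus idea can be illustrated with a minimal sketch. It is not the authors' exact formulation, which this page does not give. The sketch assumes bag-of-words vectors, cosine similarity, one summed prototype per category, and a hypothetical linear mix (`weight`) between a document's own evidence and a similarity-weighted average of the evidence from its `k` nearest neighbors in the target collection. All function and parameter names are illustrative.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector for a short text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroids(labeled):
    """Sum the vectors of each category's training texts into one prototype."""
    protos = {}
    for text, label in labeled:
        protos.setdefault(label, Counter()).update(vectorize(text))
    return protos

def classify_with_neighbors(target_texts, labeled, k=2, weight=0.5):
    """Label each target text from its own evidence plus that of its k nearest
    neighbors in the same target collection (assumed scoring rule)."""
    protos = centroids(labeled)
    vecs = [vectorize(t) for t in target_texts]
    # Own evidence: similarity of each target document to each category prototype.
    own = [{c: cosine(v, p) for c, p in protos.items()} for v in vecs]
    labels = []
    for i, v in enumerate(vecs):
        # Rank the other target documents by similarity to document i.
        sims = sorted(((cosine(v, u), j) for j, u in enumerate(vecs) if j != i), reverse=True)
        neighbors = [(s, j) for s, j in sims[:k] if s > 0]
        total = sum(s for s, _ in neighbors)
        scores = {}
        for c in protos:
            # Neighbor evidence: similarity-weighted average of the neighbors' own scores.
            neigh = sum(s * own[j][c] for s, j in neighbors) / total if total else 0.0
            scores[c] = weight * own[i][c] + (1 - weight) * neigh
        labels.append(max(scores, key=scores.get))
    return labels

# Toy data (invented for illustration): two labeled titles, four unlabeled ones.
labeled = [("team wins championship final", "sports"),
           ("stocks fall as markets slide", "business")]
targets = ["markets slide again",
           "investors worry as markets tumble",
           "investors flee",
           "championship final tonight"]
print(classify_with_neighbors(targets, labeled))
```

In this toy data, "investors flee" shares no word with any labeled title, so its own scores are all zero. Its label comes entirely from its neighbor "investors worry as markets tumble", which is the situation the abstract describes for very short texts with little training data.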

Notes

  1. The figure shows the results of the best configuration of SVMs in this setting: a polynomial kernel of degree 1.

  2. Note that we do not use this dataset at the beginning because it does not separate the titles from the body of the news.


Author information

Corresponding author

Correspondence to Gabriela Ramírez-de-la-Rosa.

About this article

Cite this article

Ramírez-de-la-Rosa, G., Montes-y-Gómez, M., Solorio, T. et al. A document is known by the company it keeps: neighborhood consensus for short text categorization. Lang Resources & Evaluation 47, 127–149 (2013). https://doi.org/10.1007/s10579-012-9192-1

  • DOI: https://doi.org/10.1007/s10579-012-9192-1
