Statistical Approach for Term Weighting in Very Short Documents for Text Categorization

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2012)

Abstract

In this paper, we propose a novel approach for term weighting in very short documents, used together with a Support Vector Machine classifier. We focus on market research and social media documents. In both of these data sources, the average length of a document is below twenty words. Because the documents are short, each word usually occurs only once within a document. This is known as a hapax legomenon, and in our previous work as the Term Frequency=1 challenge. For this reason, traditional term weighting approaches become less effective with short documents. The proposed approach does not use term frequency within a document but substitutes it with other word statistics. In the experimental evaluation against several other term weighting approaches, the proposed method produced promising results, outperforming the competing approaches.
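The Term Frequency=1 challenge described above can be illustrated with a minimal sketch. The three example documents, the whitespace tokenization, and the plain logarithmic IDF are assumptions made for illustration only; they are not the data or the weighting method of the paper. The sketch only shows why TF-IDF loses its within-document signal on very short texts: when every term occurs once, the TF-IDF weight of a term equals its IDF.

```python
import math
from collections import Counter

# Hypothetical short documents (assumed for illustration, not from the paper).
docs = [
    "great battery life and fast charging",
    "battery died after one week",
    "fast delivery and great price",
]

# Naive whitespace tokenization (an assumption; real pipelines differ).
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))


def idf(term):
    """Plain inverse document frequency, log(N / df)."""
    return math.log(n_docs / df[term])


def tf_idf(doc):
    """Classic TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    return {term: tf[term] * idf(term) for term in tf}


# Share of (document, term) pairs whose within-document frequency is 1.
tf_values = [count for doc in tokenized for count in Counter(doc).values()]
share_tf1 = sum(1 for count in tf_values if count == 1) / len(tf_values)

weights = [tf_idf(doc) for doc in tokenized]

print(f"share of terms with TF = 1: {share_tf1:.2f}")
for term, weight in weights[0].items():
    # With TF = 1 the TF-IDF weight is identical to the IDF value.
    print(f"{term:10s} tf-idf={weight:.3f} idf={idf(term):.3f}")
```

In this toy corpus every term has a within-document frequency of 1, so the TF factor contributes nothing and the weights are driven entirely by corpus-level statistics. This is the motivation for replacing TF with other word statistics; which statistics the paper uses is not specified in the abstract.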

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Timonen, M., Kasari, M. (2013). Statistical Approach for Term Weighting in Very Short Documents for Text Categorization. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2012. Communications in Computer and Information Science, vol 415. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54105-6_1

  • DOI: https://doi.org/10.1007/978-3-642-54105-6_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54104-9

  • Online ISBN: 978-3-642-54105-6

  • eBook Packages: Computer Science, Computer Science (R0)
