Statistical Approach for Term Weighting in Very Short Documents for Text Categorization

Timonen, Mika; Kasari, Melissa

doi:10.1007/978-3-642-54105-6_1

Part of the book series: Communications in Computer and Information Science (CCIS, volume 415)

Included in the following conference series:

International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management

Abstract

In this paper, we propose a novel approach to term weighting in very short documents for use with a Support Vector Machine classifier. We focus on market research and social media documents. In both of these data sources, the average document length is below twenty words. Because the documents are short, each word usually occurs only once within a document. Such a word is known as a hapax legomenon, and in our previous work we called this situation the Term Frequency=1 challenge. For this reason, traditional term weighting approaches become less effective with short documents. The proposed approach does not use term frequency within a document but substitutes it with other word statistics. In an experimental evaluation against several other term weighting approaches, the proposed method produced promising results, outperforming the competing approaches.
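The Term Frequency=1 challenge described in the abstract can be illustrated with a minimal sketch. The toy corpus, the plain logarithmic IDF, and all function names below are assumptions chosen for illustration; they are not the data or the weighting scheme of the paper. The sketch only shows why term frequency carries no information when every word occurs once per document: the TF-IDF weight of each term collapses to its IDF.

```python
import math
from collections import Counter

# Hypothetical toy corpus of very short documents (not data from the paper).
docs = [
    "great battery life",
    "battery died quickly",
    "screen is great",
    "poor screen quality",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Plain logarithmic inverse document frequency (an assumed variant)."""
    return math.log(n_docs / df[term])


def tf_idf(tokens):
    """Raw term frequency times IDF for every distinct term in one document."""
    tf = Counter(tokens)
    return {term: tf[term] * idf(term) for term in tf}


# Share of (document, term) pairs in which the term occurs exactly once.
counts = [c for tokens in tokenized for c in Counter(tokens).values()]
share_tf1 = sum(1 for c in counts if c == 1) / len(counts)

weights = tf_idf(tokenized[0])
print(f"share of terms with TF=1: {share_tf1:.2f}")
for term, weight in weights.items():
    print(f"{term}: tf-idf={weight:.3f}, idf={idf(term):.3f}")
```

In this toy corpus every term has a within-document frequency of one, so the term frequency factor contributes nothing and any discrimination between terms has to come from corpus-level or category-level statistics instead.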

References

Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24, 513–523 (1988)
Timonen, M., Silvonen, P., Kasari, M.: Classification of short documents to categorize consumer opinions. In: Online Proceedings of 7th International Conference on Advanced Data Mining and Applications (ADMA 2011), China (2011), http://aminer.org/PDF/adma2011/session3D/adma11_conf_32.pdf (accessed October 10, 2012)
Timonen, M.: Categorization of very short documents. In: International Conference on Knowledge Discovery and Information Retrieval (KDIR 2012), Spain, pp. 5–16 (2012)
Forman, G.: An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3, 1289–1305 (2003)
Rennie, J.D.M., Jaakkola, T.: Using term informativeness for named entity detection. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005), Brazil, pp. 353–360 (2005)
Clark, K., Gale, W.: Inverse Document Frequency (IDF): A measure of deviation from Poisson. In: Third Workshop on Very Large Corpora, pp. 121–130. Massachusetts Institute of Technology, Cambridge (1995)
Mladenic, D., Grobelnik, M.: Feature selection for unbalanced class distribution and Naive Bayes. In: Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Slovenia, pp. 258–267 (1999)
Forman, G.: BNS feature scaling: an improved representation over TF-IDF for SVM text classification. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), USA, pp. 263–270 (2008)
Yang, Y., Pedersen, J.: Feature selection in statistical learning of text categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), USA, pp. 412–420 (1997)
Yang, Y.: An evaluation of statistical approaches to text categorization. Information Retrieval 1, 69–90 (1999)
Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999), USA, pp. 42–49 (1999)
Krishnakumar, A.: Text categorization building a kNN classifier for the Reuters-21578 collection (2006), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.9946 (accessed October 10, 2012)
Joachims, T.: Text categorization with Support Vector Machines: Learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
Rennie, J.D., Shih, L., Teevan, J., Karger, D.R.: Tackling the poor assumptions of Naive Bayes text classifiers. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), USA, pp. 616–623 (2003)
Kibriya, A.M., Frank, E., Pfahringer, B., Holmes, G.: Multinomial Naive Bayes for text categorization revisited. In: Webb, G.I., Yu, X. (eds.) AI 2004. LNCS (LNAI), vol. 3339, pp. 488–499. Springer, Heidelberg (2004)
Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion mining. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2010), Malta (2010)
Irani, D., Webb, S., Pu, C., Li, K.: Study of trend-stuffing on Twitter through text classification. In: Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS 2010), USA (2010), http://ceas.cc/2010/papers/Paper%2013.pdf (accessed October 10, 2012)
Benevenuto, F., Mango, G., Rodrigues, T., Almeida, V.: Detecting spammers on Twitter. In: Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS 2010), USA (2010), http://ceas.cc/2010/papers/Paper%2021.pdf (accessed October 10, 2012)
Joachims, T.: Making Large-Scale SVM Learning Practical. In: Advances in Kernel Methods - Support Vector Learning, pp. 41–56. MIT Press (1999)

Author information

Authors and Affiliations

VTT Technical Research Centre of Finland, P.O. 1000, FI-02044, Finland
Mika Timonen
Department of Computer Science, University of Helsinki, P.O. 68, FI-00014, Finland
Melissa Kasari

Editor information

Editors and Affiliations

IST - Technical University of Lisbon, Av.Rovisco Pais, 1, 1049-001, Lisbon, Portugal
Ana Fred
Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
Jan L. G. Dietz
Informatics Research Centre, Henley Business School, University of Reading, RG6 6UD, UK
Kecheng Liu
INSTICC and IPS, Estefanilha, Setúbal, Portugal
Joaquim Filipe

Cite this paper

Timonen, M., Kasari, M. (2013). Statistical Approach for Term Weighting in Very Short Documents for Text Categorization. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2012. Communications in Computer and Information Science, vol 415. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54105-6_1

DOI: https://doi.org/10.1007/978-3-642-54105-6_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54104-9
Online ISBN: 978-3-642-54105-6
eBook Packages: Computer Science, Computer Science (R0)