Abstract
We propose a novel approach to term weighting in very short documents, intended for use with a Support Vector Machine classifier. We focus on market research and social media documents; in both data sources, the average document length is below twenty words. Because the documents are so short, each word usually occurs only once within a document. This phenomenon is known as hapax legomenon, and in our previous work we called it the Term Frequency=1 challenge. As a result, traditional term weighting approaches become less effective with short documents. The proposed approach does not use term frequency within a document but substitutes it with other word statistics. In an experimental evaluation against several other term weighting approaches, the proposed method produced promising results, outperforming the competing approaches.
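The Term Frequency=1 effect described above can be illustrated with a small sketch. It is not the method proposed in the chapter: it only shows, on a hypothetical toy corpus, that when almost every term occurs once per document, a plain TF-IDF weight collapses to the IDF component alone. The corpus and the helper names `idf` and `tf_idf` are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus of very short documents (not data from the chapter).
docs = [
    "battery life is great",
    "screen is too dim",
    "great screen and great price",
    "delivery was slow",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Plain inverse document frequency, log(N / df)."""
    return math.log(n_docs / df[term])


def tf_idf(tokens):
    """Classic TF-IDF using raw within-document counts as TF."""
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}


# Share of (document, term) pairs whose within-document frequency is exactly 1.
counts = [c for tokens in tokenized for c in Counter(tokens).values()]
share_tf1 = sum(1 for c in counts if c == 1) / len(counts)

print(f"share of terms with TF=1: {share_tf1:.2f}")
print(tf_idf(tokenized[0]))
```

In the first document every term has TF=1, so its TF-IDF vector is exactly its IDF vector; the within-document count contributes no information, which is the motivation for replacing it with other word statistics.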
References
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24, 513–523 (1988)
Timonen, M., Silvonen, P., Kasari, M.: Classification of short documents to categorize consumer opinions. In: Online Proceedings of 7th International Conference on Advanced Data Mining and Applications (ADMA 2011), China (2011), http://aminer.org/PDF/adma2011/session3D/adma11_conf_32.pdf (accessed October 10, 2012)
Timonen, M.: Categorization of very short documents. In: International Conference on Knowledge Discovery and Information Retrieval (KDIR 2012), Spain, pp. 5–16 (2012)
Forman, G.: An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3, 1289–1305 (2003)
Rennie, J.D.M., Jaakkola, T.: Using term informativeness for named entity detection. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005), Brazil, pp. 353–360 (2005)
Church, K., Gale, W.: Inverse Document Frequency (IDF): A measure of deviations from Poisson. In: Third Workshop on Very Large Corpora, pp. 121–130. Massachusetts Institute of Technology, Cambridge (1995)
Mladenic, D., Grobelnik, M.: Feature selection for unbalanced class distribution and Naive Bayes. In: Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Slovenia, pp. 258–267 (1999)
Forman, G.: BNS feature scaling: an improved representation over TF-IDF for SVM text classification. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), USA, pp. 263–270 (2008)
Yang, Y., Pedersen, J.: Feature selection in statistical learning of text categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), USA, pp. 412–420 (1997)
Yang, Y.: An evaluation of statistical approaches to text categorization. Information Retrieval 1, 69–90 (1999)
Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999), USA, pp. 42–49 (1999)
Krishnakumar, A.: Text categorization building a kNN classifier for the Reuters-21578 collection (2006), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.9946 (accessed October 10, 2012)
Joachims, T.: Text categorization with Support Vector Machines: Learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
Rennie, J.D., Shih, L., Teevan, J., Karger, D.R.: Tackling the poor assumptions of Naive Bayes text classifiers. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), USA, pp. 616–623 (2003)
Kibriya, A.M., Frank, E., Pfahringer, B., Holmes, G.: Multinomial Naive Bayes for text categorization revisited. In: Webb, G.I., Yu, X. (eds.) AI 2004. LNCS (LNAI), vol. 3339, pp. 488–499. Springer, Heidelberg (2004)
Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion mining. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2010), Malta (2010)
Irani, D., Webb, S., Pu, C., Li, K.: Study of trend-stuffing on Twitter through text classification. In: Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS 2010), USA (2010), http://ceas.cc/2010/papers/Paper%2013.pdf (accessed October 10, 2012)
Benevenuto, F., Magno, G., Rodrigues, T., Almeida, V.: Detecting spammers on Twitter. In: Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS 2010), USA (2010), http://ceas.cc/2010/papers/Paper%2021.pdf (accessed October 10, 2012)
Joachims, T.: Making large-scale SVM learning practical. In: Advances in Kernel Methods - Support Vector Learning, pp. 41–56. MIT Press (1999)
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Timonen, M., Kasari, M. (2013). Statistical Approach for Term Weighting in Very Short Documents for Text Categorization. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2012. Communications in Computer and Information Science, vol 415. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54105-6_1
DOI: https://doi.org/10.1007/978-3-642-54105-6_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54104-9
Online ISBN: 978-3-642-54105-6
eBook Packages: Computer Science (R0)