Abstract
Text classification and information retrieval usually begin with preprocessing steps such as stemming and stopword removal. Most of these techniques work well only on well-formed text such as textbooks and news articles; they do not carry over to social network services (SNS) or other text on the Internet. In this investigation, we propose a method for extracting stopwords in the context of SNS. We first discuss what stopwords mean in this setting and how they differ from conventional ones, and then propose two statistical filters, TFIG and TFCHI, to identify them. We examine categorical estimation to extract characteristic values, focusing on the Kullback-Leibler divergence (KLD) over temporal sequences of SNS data. In addition, we apply several preprocessing steps to handle unknown words and to improve morphological analysis.
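As a rough illustration of the KLD mentioned above, the following sketch compares the word distributions of two time windows. It is not the authors' procedure: the two token lists, the function names, and the additive smoothing are assumptions made only for this example, and the abstract does not say how the windows are formed or how the divergence is used afterwards.

```python
import math
from collections import Counter


def word_distribution(tokens, vocabulary, smoothing=1.0):
    """Additively smoothed relative frequencies over a fixed vocabulary.

    Smoothing is an assumption of this sketch; it keeps every probability
    positive so that the logarithm below is always defined.
    """
    counts = Counter(tokens)
    total = len(tokens) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence(p, q):
    """D(P || Q) = sum over w of P(w) * log(P(w) / Q(w)), in nats."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


# Hypothetical, already segmented posts from two consecutive time windows.
window_1 = ["the", "movie", "was", "the", "best"]
window_2 = ["the", "snow", "was", "the", "worst"]

vocabulary = sorted(set(window_1) | set(window_2))
p = word_distribution(window_1, vocabulary)
q = word_distribution(window_2, vocabulary)

print(kl_divergence(p, q))  # positive: the two windows differ
print(kl_divergence(p, p))  # zero: a distribution compared with itself
```

KLD is asymmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ; which direction suits a temporal sequence is a design choice this sketch leaves open.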
Notes
- 1.
One exception is the predicate: it should appear as the last verb in each sentence.
- 2.
In Japanese, morphological analysis covers both word segmentation and part-of-speech tagging. For example, "sumomo/mo/momo/mo/momo/no/uchi" ("plums and peaches are both kinds of peach") is a well-known tongue twister in which "mo" is repeated many times. It contains two nouns, "sumomo" (plum) and "momo" (peach). There is no delimiter between words (no space, comma, or slash), so the whole phrase is written as the single string "sumomomomomomomomonouchi".
- 3.
- 4.
We use 1/IG rather than IG because smaller IG values are preferable. The same holds for 1/CHI.
- 5.
For instance, when we collect Twitter documents using the keyword "Home Alone", we assign the class "Home Alone" to that collection.
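Notes 4 and 5 can be made concrete with a small sketch of information gain (IG) and its reciprocal. This is an illustration under stated assumptions, not the authors' TFIG filter, whose exact definition is not given here. IG is computed in its usual form for term presence versus class labels. The toy documents, the second class "Other", the function names, and the convention of mapping IG = 0 to infinity are all hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, term):
    """IG of a term: H(C) - P(t) H(C | t) - P(not t) H(C | not t).

    docs is a list of (set_of_terms, class_label) pairs.
    """
    labels = sorted({label for _, label in docs})

    def class_counts(subset):
        return [sum(1 for _, l in subset if l == label) for label in labels]

    with_term = [d for d in docs if term in d[0]]
    without_term = [d for d in docs if term not in d[0]]
    gain = entropy(class_counts(docs))
    for subset in (with_term, without_term):
        if subset:  # skip an empty side to avoid dividing by zero
            gain -= len(subset) / len(docs) * entropy(class_counts(subset))
    return gain


def reciprocal(value):
    """1/value, with 0 mapped to infinity (a convention of this sketch)."""
    return math.inf if value == 0 else 1.0 / value


# Hypothetical collections: each class is named after its search keyword.
docs = [
    ({"kevin", "the", "trap"}, "Home Alone"),
    ({"kevin", "the", "house"}, "Home Alone"),
    ({"the", "rain", "today"}, "Other"),
    ({"the", "train", "late"}, "Other"),
]

for term in ("the", "kevin"):
    ig = information_gain(docs, term)
    print(term, ig, reciprocal(ig))
```

In this toy data "the" occurs in every document, so it tells nothing about the class and its IG is zero, while "kevin" separates the two classes perfectly. Taking the reciprocal reverses the ordering, so the uninformative word gets the larger value.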
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Nezu, Y., Miura, T. (2019). Statistical Processing of Stopwords on SNS. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2019. Lecture Notes in Computer Science, vol 11706. Springer, Cham. https://doi.org/10.1007/978-3-030-27615-7_9
DOI: https://doi.org/10.1007/978-3-030-27615-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27614-0
Online ISBN: 978-3-030-27615-7
eBook Packages: Computer Science, Computer Science (R0)