Statistical Processing of Stopwords on SNS

Nezu, Yuta; Miura, Takao

doi:10.1007/978-3-030-27615-7_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11706)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Text classification and information retrieval usually begin with preprocessing steps such as stemming and stopword removal. Almost all of these techniques work well only on well-formed text such as textbooks and news articles; they do not carry over to social network services (SNS) or to other text found on the internet. In this investigation, we propose a way to extract stopwords in the context of social network services. We first discuss what stopwords mean on SNS and how they differ from conventional ones, and we then propose two statistical filters, TFIG and TFCHI, to identify them. We examine categorical estimation to extract characteristic values, focusing on the Kullback-Leibler divergence (KLD) over temporal sequences of SNS data. Moreover, we apply several preprocessing steps to handle unknown words and to improve morphological analysis.
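As a rough illustration of the KLD mentioned in the abstract, the sketch below compares the word distributions of two time windows of posts. It is not the authors' procedure: the window contents, the function names `word_distribution` and `kl_divergence`, and the additive smoothing constant `alpha` are all assumptions made for this example.

```python
import math
from collections import Counter


def word_distribution(tokens, vocabulary, alpha=1.0):
    """Smoothed relative frequency of each vocabulary word in `tokens`.

    Additive smoothing (alpha) is an assumption of this sketch; it keeps
    every probability positive so that the divergence stays finite.
    """
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """D(P || Q) = sum_w P(w) * log(P(w) / Q(w)), in nats."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


# Hypothetical, already-segmented posts from two consecutive time windows.
window_early = ["rt", "rt", "movie", "kevin", "movie"]
window_late = ["rt", "rt", "snow", "train", "snow"]

vocabulary = set(window_early) | set(window_late)
p = word_distribution(window_early, vocabulary)
q = word_distribution(window_late, vocabulary)

print("D(early || late) =", kl_divergence(p, q))
print("D(late || early) =", kl_divergence(q, p))
```

KLD is not symmetric, so the direction of comparison between windows has to be fixed in advance. In this toy data the word "rt" has the same frequency in both windows and contributes nothing to the divergence, which is the kind of temporal stability one might expect of a stopword candidate.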

Notes

1.
One exception is the predicate. In fact, the predicate should appear as the last verb in each sentence.
2.
In Japanese, morphological analysis means both word segmentation and part-of-speech tagging. For example, “sumomo/mo/momo/mo/momo/no/uchi” means “both plums and peaches are kinds of peach”, a typical tongue twister in which “mo” is said many times. There are two nouns, “sumomo” (plum) and “momo” (peach). There is no delimiter between words (no space, no comma, and no slash), so everything runs together into one string: “sumomomomomomomomonouchi”.
3.
See https://twitter.com/?lang=ja.
4.
We use 1/IG instead of IG because we prefer a measure for which smaller values are better. The same is true for 1/CHI.
5.
For instance, when we collect Twitter documents by giving the keyword “Home Alone”, we assign the class “Home Alone” to the collection.
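Notes 4 and 5 can be made concrete with a small sketch. It uses the textbook definition of information gain (IG) of a term with respect to document classes, with classes assigned by collection keyword as in note 5, and then takes the reciprocal as in note 4. The documents, the class "Other", and the function names are hypothetical, and the sketch does not reproduce the paper's TFIG filter, whose exact definition is not given here.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, term):
    """IG(term) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)].

    `docs` is a list of (class_label, set_of_tokens) pairs.
    """
    classes = sorted({label for label, _ in docs})
    all_labels = [label for label, _ in docs]
    with_term = [label for label, toks in docs if term in toks]
    without_term = [label for label, toks in docs if term not in toks]

    def class_counts(labels):
        return [labels.count(c) for c in classes]

    n = len(docs)
    h_conditional = (
        len(with_term) / n * entropy(class_counts(with_term))
        + len(without_term) / n * entropy(class_counts(without_term))
    )
    return entropy(class_counts(all_labels)) - h_conditional


def inverse_information_gain(docs, term, eps=1e-12):
    """1/IG; returns infinity when the term carries no class information."""
    ig = information_gain(docs, term)
    return math.inf if ig <= eps else 1.0 / ig


# Hypothetical collections: the class label is the keyword used to collect them.
docs = [
    ("Home Alone", {"rt", "kevin", "movie"}),
    ("Home Alone", {"rt", "kevin"}),
    ("Other", {"rt", "train"}),
    ("Other", {"rt", "snow"}),
]

for term in ("rt", "kevin"):
    print(term, information_gain(docs, term), inverse_information_gain(docs, term))
```

In this toy data "rt" occurs in every document, so it says nothing about the class and its 1/IG is unbounded, while "kevin" separates the two classes perfectly.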

Author information

Authors and Affiliations

Department of Advanced Sciences, HOSEI University, Kajinocho 3-7-2, Koganei, Tokyo, Japan
Yuta Nezu & Takao Miura

Corresponding author

Correspondence to Yuta Nezu.

Editor information

Editors and Affiliations

Clausthal University of Technology, Clausthal-Zellerfeld, Germany
Sven Hartmann
Johannes Kepler University of Linz, Linz, Austria
Josef Küng
The University of Texas at Arlington, Arlington, TX, USA
Sharma Chakravarthy
Johannes Kepler University of Linz, Linz, Austria
Gabriele Anderst-Kotsis
Software Competence Center Hagenberg, Hagenberg im Mühlkreis, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Austria
Ismail Khalil

About this paper

Cite this paper

Nezu, Y., Miura, T. (2019). Statistical Processing of Stopwords on SNS. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2019. Lecture Notes in Computer Science, vol 11706. Springer, Cham. https://doi.org/10.1007/978-3-030-27615-7_9

DOI: https://doi.org/10.1007/978-3-030-27615-7_9
Published: 03 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27614-0
Online ISBN: 978-3-030-27615-7
eBook Packages: Computer Science, Computer Science (R0)
