Statistical Processing of Stopwords on SNS

  • Conference paper
Database and Expert Systems Applications (DEXA 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11706)

Abstract

For text classification or information retrieval, texts are usually preprocessed with techniques such as stemming and stopword removal. Almost all of these techniques are useful only for well-formed text such as textbooks and news articles; this does not hold for social network services (SNS) or other texts on the Internet. In this investigation, we propose a way to extract stopwords in the context of social network services. To do so, we first discuss what stopwords mean and how they differ from conventional ones, and we propose two statistical filters, TFIG and TFCHI, to identify them. We examine categorical estimation to extract characteristic values, focusing on the Kullback-Leibler divergence (KLD) over temporal sequences of SNS data. Moreover, we apply several preprocessing steps to manage unknown words and to improve morphological analysis.
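
As a rough illustration of the KLD mentioned in the abstract, the sketch below compares the word distributions of two time windows using the standard definition D(P||Q) = sum over w of P(w) log(P(w)/Q(w)). The toy token lists, the function names, and the add-one smoothing are assumptions made for this example only; the abstract does not state how the authors apply KLD to temporal sequences.

```python
import math
from collections import Counter


def word_distribution(tokens, vocabulary, alpha=1.0):
    """Smoothed relative frequency of each vocabulary word.

    alpha is an additive smoothing constant (an assumption of this
    sketch) that keeps every probability strictly positive.
    """
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """Standard Kullback-Leibler divergence D(p || q), natural log."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


# Hypothetical token lists for two consecutive time windows.
window_a = ["snow", "movie", "rt", "rt", "snow"]
window_b = ["rt", "rt", "movie", "exam", "exam"]

vocab = sorted(set(window_a) | set(window_b))
p = word_distribution(window_a, vocab)
q = word_distribution(window_b, vocab)

d_pq = kl_divergence(p, q)  # how far window_a's distribution is from window_b's
d_pp = kl_divergence(p, p)  # a distribution has zero divergence from itself
```

A word such as "rt" that is equally frequent in both windows contributes nothing to the sum, whereas words tied to one window ("snow", "exam") account for all of the divergence. KLD is not symmetric in general, so the order of the two windows matters.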

Notes

  1. One exception is the predicate. In fact, the predicate should appear as the last verb in each sentence.

  2. Morphological analysis in Japanese means both word segmentation and part-of-speech processing. For example, "sumomo/mo/momo/mo/momo/no/uchi" means "both plums and peaches are kinds of peach", a typical tongue twister in which “mo” is said many times. There are two nouns, “sumomo” (plum) and “momo” (peach). There is no delimiter between words (no space, no comma, and no slash), so everything goes into one string: “sumomomomomomomomonouchi”.

  3. See https://twitter.com/?lang=ja.

  4. We use 1/IG instead of IG because we want smaller values to be better. The same is true for 1/CHI.

  5. For instance, when collecting Twitter documents with the keyword "Home Alone", we assign the class “Home Alone” to the collection.
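
The segmentation problem described in note 2 can be made concrete with a small sketch. It enumerates every way to split the unspaced tongue twister into words from a tiny lexicon. The lexicon, the function name, and the exhaustive search are assumptions made for illustration; they are not the morphological analyzer used in the paper.

```python
def segmentations(text, lexicon):
    """Return every way to split text into a sequence of lexicon words."""
    results = []

    def walk(start, acc):
        # A complete segmentation has consumed the whole string.
        if start == len(text):
            results.append(list(acc))
            return
        for word in lexicon:
            if text.startswith(word, start):
                acc.append(word)
                walk(start + len(word), acc)
                acc.pop()

    walk(0, [])
    return results


# Hypothetical lexicon containing only the words of the example.
LEXICON = ["sumomo", "momo", "mo", "no", "uchi"]

# Intended reading from note 2, joined without delimiters.
intended = ["sumomo", "mo", "momo", "mo", "momo", "no", "uchi"]
text = "".join(intended)

candidates = segmentations(text, LEXICON)
print(len(candidates))          # number of lexicon-consistent splits
print(intended in candidates)   # the intended reading is one of them
```

Even with a five-word lexicon, the run of “mo” syllables admits many consistent splits, because each stretch can be read as the particle “mo” or the noun “momo”. A real analyzer has to pick one of them using part-of-speech and context information, which is why segmentation and part-of-speech processing go together.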

Author information

Corresponding author

Correspondence to Yuta Nezu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Nezu, Y., Miura, T. (2019). Statistical Processing of Stopwords on SNS. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2019. Lecture Notes in Computer Science, vol 11706. Springer, Cham. https://doi.org/10.1007/978-3-030-27615-7_9

  • DOI: https://doi.org/10.1007/978-3-030-27615-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27614-0

  • Online ISBN: 978-3-030-27615-7

  • eBook Packages: Computer Science, Computer Science (R0)
