Stopword Detection for Streaming Content

  • Conference paper
Advances in Information Retrieval (ECIR 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10772)

Abstract

The removal of stopwords is an important preprocessing step in many natural language processing tasks and can improve both task performance and execution time. Many existing methods either rely on a predefined list of stopwords or compute word significance using metrics such as tf-idf. Our objective in this paper is to identify stopwords, in an unsupervised way, in streaming textual corpora such as Twitter, which are temporal in nature. We propose to model the dynamics of each word within the streaming corpus in order to identify the words that are less likely to be informative or discriminative. Our approach applies the discrete wavelet transform (DWT) to word signals to extract two features, namely scale and energy. We show that the proposed approach is effective in identifying stopwords and improves the quality of topics in the task of topic detection.
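
To make the idea of wavelet features of a word signal concrete, the following is a minimal sketch only, not the authors' implementation. It assumes a word signal is a list of per-interval occurrence counts, uses a hand-written Haar wavelet, and defines "energy" as the sum of squared detail coefficients and "scale" as the decomposition level holding the most detail energy. The abstract does not specify the wavelet family or the exact feature definitions, so these choices, along with the names `haar_dwt` and `wavelet_features`, are hypothetical.

```python
import math


def haar_dwt(signal):
    """Full Haar DWT of a signal whose length is a power of two.

    Returns (approximation, details); details[0] is the finest level.
    The Haar wavelet is an assumption, chosen only for simplicity.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail coefficients capture change between neighbouring intervals.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        # Approximation coefficients carry the smoothed signal to the next level.
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details


def wavelet_features(signal):
    """Hypothetical (scale, energy) features for one word signal."""
    _, details = haar_dwt(signal)
    level_energy = [sum(c * c for c in level) for level in details]
    energy = sum(level_energy)
    if energy == 0:
        return 0, 0.0  # perfectly flat signal: no dominant level
    # 1-based index of the level with the most detail energy (1 = finest).
    scale = 1 + max(range(len(level_energy)), key=level_energy.__getitem__)
    return scale, energy


# Hypothetical counts over eight time intervals.
steady_word = [5, 5, 5, 5, 5, 5, 5, 5]   # used at a constant rate
bursty_word = [0, 0, 0, 9, 8, 0, 0, 0]   # concentrated around one event

print(wavelet_features(steady_word))
print(wavelet_features(bursty_word))
```

Under these assumptions, a word used at a constant rate yields zero detail energy, while a bursty word yields high energy concentrated at fine levels. How such features would be thresholded or combined to decide that a word is a stopword is not stated in the abstract and is left out of the sketch.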

Notes

  1. mallet.cs.umass.edu/topics.php.

  2. mallet.cs.umass.edu/diagnostics.php.

Author information

Corresponding author

Correspondence to Hossein Fani.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Fani, H., Bashari, M., Zarrinkalam, F., Bagheri, E., Al-Obeidat, F. (2018). Stopword Detection for Streaming Content. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol 10772. Springer, Cham. https://doi.org/10.1007/978-3-319-76941-7_70

  • DOI: https://doi.org/10.1007/978-3-319-76941-7_70

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-76940-0

  • Online ISBN: 978-3-319-76941-7

  • eBook Packages: Computer Science, Computer Science (R0)
