Abstract
Stopword removal is an important preprocessing step in many natural language processing tasks, and it can improve both task performance and execution time. Many existing methods either rely on a predefined list of stopwords or compute word significance from metrics such as tf-idf. Our objective in this paper is to identify stopwords, in an unsupervised way, for streaming textual corpora such as Twitter, which are temporal in nature. We propose to model the dynamics of each word within the streaming corpus in order to identify the words that are less likely to be informative or discriminative. Our approach applies the discrete wavelet transform (DWT) to word signals to extract two features, namely scale and energy. We show that the proposed approach is effective in identifying stopwords and improves the quality of topics in the task of topic detection.
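As a rough illustration of the idea of extracting wavelet features from a word signal, the sketch below applies a hand-written Haar DWT to a word's frequency counts over equal time intervals. The abstract does not define how scale and energy are computed, so everything here is an assumption: the choice of the Haar wavelet, the function names `haar_dwt` and `scale_and_energy`, taking energy as the summed squared detail coefficients, and taking scale as the decomposition level that carries the most detail energy. The example counts are made up for illustration.

```python
import math


def haar_dwt(signal):
    """Full Haar decomposition of a signal whose length is a power of two.

    Returns (details, approx), where details[0] holds the finest-scale
    detail coefficients and approx is the final approximation coefficient.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, approx[0]


def scale_and_energy(signal):
    """Hypothetical features: (dominant level, total detail energy).

    Level 1 is the finest scale; 0 means the signal has no variation at all.
    """
    details, _ = haar_dwt(signal)
    energies = [sum(c * c for c in level) for level in details]
    total = sum(energies)
    if total == 0:
        return 0, total
    dominant = max(range(len(energies)), key=energies.__getitem__) + 1
    return dominant, total


# Hypothetical per-interval counts for two words in a stream.
steady_word = [5, 5, 5, 5, 5, 5, 5, 5]   # used uniformly over time
bursty_word = [0, 0, 0, 9, 8, 0, 0, 0]   # concentrated around one event

print(scale_and_energy(steady_word))
print(scale_and_energy(bursty_word))
```

Under these assumptions, a word that is used uniformly over time yields no detail energy, whereas a word tied to a short-lived event concentrates its energy at fine scales. How such features are thresholded or combined to flag stopwords is left open here, since the abstract does not specify it.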
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Fani, H., Bashari, M., Zarrinkalam, F., Bagheri, E., Al-Obeidat, F. (2018). Stopword Detection for Streaming Content. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol 10772. Springer, Cham. https://doi.org/10.1007/978-3-319-76941-7_70
DOI: https://doi.org/10.1007/978-3-319-76941-7_70
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76940-0
Online ISBN: 978-3-319-76941-7
eBook Packages: Computer Science, Computer Science (R0)