A Novel Word Embedding Based Stemming Approach for Microblog Retrieval During Disasters

Basu, Moumita; Roy, Anurag; Ghosh, Kripabandhu; Bandyopadhyay, Somprakash; Ghosh, Saptarshi

doi:10.1007/978-3-319-56608-5_53

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10193)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

IR methods are increasingly being applied over microblogs to extract real-time information, such as during disaster events. On such sites, most of the user-generated content is written informally – the same word is often spelled differently by different users, and words are shortened arbitrarily due to the length limitations on microblogs. Stemming is a common step for improving retrieval performance by unifying different morphological variants of a word. In this study, we show that rule-based stemming meant for formal text often cannot capture the arbitrary variations of words in microblogs. We propose a context-specific stemming algorithm, based on word embeddings, which can capture many more variations of words than conventional stemmers can detect. Experiments on a large set of English microblogs posted during a recent disaster event show that the proposed stemming gives considerably better retrieval performance than Porter stemming.
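
As a rough illustration of the idea in the abstract, the sketch below groups words that share a prefix and whose embedding vectors are close, then maps each group to a single representative. It is not the authors' algorithm: the prefix and similarity thresholds, the choice of the shortest member as the stem, the function name `group_variants`, and the hand-made toy vectors are all assumptions made for this example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def common_prefix_len(a, b):
    """Number of leading characters shared by strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def group_variants(vectors, min_prefix=3, min_sim=0.9):
    """Map each word to a stem shared by its likely variants.

    Two words are linked when they share at least `min_prefix` leading
    characters AND their vectors have cosine similarity >= `min_sim`
    (both thresholds are assumptions). Linked words are merged with
    union-find; the shortest word of each group is used as the stem.
    """
    parent = {w: w for w in vectors}

    def find(w):
        # Follow parent links to the group root, compressing the path.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(sorted(vectors), 2):
        if (common_prefix_len(a, b) >= min_prefix
                and cosine(vectors[a], vectors[b]) >= min_sim):
            parent[find(a)] = find(b)

    groups = {}
    for w in vectors:
        groups.setdefault(find(w), []).append(w)

    stem_of = {}
    for members in groups.values():
        stem = min(members, key=lambda m: (len(m), m))
        for m in members:
            stem_of[m] = stem
    return stem_of


# Hypothetical 3-dimensional vectors, invented for this example only.
TOY_VECTORS = {
    "earthquake": [0.90, 0.10, 0.00],
    "earthquak": [0.88, 0.12, 0.01],
    "earthqk": [0.85, 0.15, 0.00],
    "earthworm": [0.00, 0.20, 0.90],
    "nepal": [0.10, 0.90, 0.10],
}

stems = group_variants(TOY_VECTORS)
print(stems)
```

In this toy data the three spellings of "earthquake" fall into one group, while "earthworm" stays separate despite sharing the prefix "earth", because its vector points elsewhere. A suffix-stripping stemmer has no such contextual signal to draw on.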

Notes

1. The Gensim implementation of word2vec was used: https://radimrehurek.com/gensim/models/word2vec.html. The continuous bag-of-words model was used for training, along with hierarchical softmax, with the following parameter values: Vector size: 2000, Context size: 5, Learning rate: 0.05.
2. https://en.wikipedia.org/wiki/April_2015_Nepal_earthquake.
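
The settings listed in Note 1 might be expressed with Gensim roughly as follows. This is a sketch under assumptions, not the authors' training script: the keyword names follow Gensim 4.x (earlier releases call `vector_size` `size`), disabling negative sampling with `negative=0` is a guess since the note mentions only hierarchical softmax, and the helper name `train_embeddings` and the format of its input are hypothetical.

```python
# Parameter values taken from Note 1, mapped to Gensim 4.x keyword names.
W2V_PARAMS = {
    "vector_size": 2000,  # "Vector size: 2000"
    "window": 5,          # "Context size: 5"
    "alpha": 0.05,        # "Learning rate: 0.05" (initial learning rate)
    "sg": 0,              # 0 selects the continuous bag-of-words model
    "hs": 1,              # 1 enables hierarchical softmax
    "negative": 0,        # assumption: no negative sampling alongside hs
}


def train_embeddings(tokenized_posts):
    """Train word2vec on an iterable of token lists (hypothetical helper).

    Gensim is imported lazily so the parameter mapping above can be
    inspected even where Gensim is not installed.
    """
    from gensim.models import Word2Vec

    return Word2Vec(sentences=tokenized_posts, **W2V_PARAMS)
```

Other settings, such as the minimum word frequency, are not stated in the note and are left at Gensim's defaults here.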

Acknowledgement

This research was partially supported by a grant from the Information Technology Research Academy (ITRA), DeITY, Government of India (Ref. No.: ITRA/15 (58)/Mobile/DISARM/05).

Author information

Authors and Affiliations

Indian Institute of Engineering Science and Technology, Shibpur, Howrah, India
Moumita Basu, Anurag Roy & Saptarshi Ghosh
Indian Institute of Management, Calcutta, Kolkata, India
Moumita Basu & Somprakash Bandyopadhyay
Indian Institute of Technology, Kanpur, Kanpur, India
Kripabandhu Ghosh
Indian Institute of Technology, Kharagpur, Kharagpur, India
Saptarshi Ghosh

Corresponding author

Correspondence to Kripabandhu Ghosh.

Editor information

Editors and Affiliations

University of Glasgow, Glasgow, United Kingdom
Joemon M Jose
TU Delft - EWI/ST/WIS, Delft, The Netherlands
Claudia Hauff
Middle East Technical University, Ankara, Turkey
Ismail Sengor Altıngovde
Open University, Milton Keynes, United Kingdom
Dawei Song
Signal Media, London, United Kingdom
Dyaa Albakour
Toronto, Canada
Stuart Watt
JohnTait.net Ltd. and BCS IRSG, Sunderland, United Kingdom
John Tait

About this paper

Cite this paper

Basu, M., Roy, A., Ghosh, K., Bandyopadhyay, S., Ghosh, S. (2017). A Novel Word Embedding Based Stemming Approach for Microblog Retrieval During Disasters. In: Jose, J., et al. Advances in Information Retrieval. ECIR 2017. Lecture Notes in Computer Science, vol 10193. Springer, Cham. https://doi.org/10.1007/978-3-319-56608-5_53

DOI: https://doi.org/10.1007/978-3-319-56608-5_53
Published: 08 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56607-8
Online ISBN: 978-3-319-56608-5
eBook Packages: Computer Science, Computer Science (R0)
