Automatic Stopwords Identification from Very Small Corpora

Ferilli, Stefano; Izzi, Giovanni Luca; Franza, Tiziano

doi:10.1007/978-3-030-67148-8_3


Part of the book series: Studies in Computational Intelligence (SCI, volume 949)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems


Abstract

Natural Language Processing tools use language-specific linguistic resources that might be unavailable for many languages. Since manually building them is complex, it would be desirable to learn these resources automatically from sample texts. In this paper we focus on stopwords, i.e., terms that are not relevant to understanding the topic and content of a document. Specifically, we compare the performance of different techniques proposed in the literature when applied to very small corpora (even single documents), as may be the case for very local languages lacking a wide literature. Experiments show that simple term frequency is an extremely reliable indicator that outperforms other, more complex approaches. While the study is conducted on Italian, the approach is generic and applicable to other languages.
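The term-frequency idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the `top_k` cutoff, the function names, and the sample text are all assumptions made for illustration, since this page does not specify the preprocessing or the selection threshold used in the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters (accented letters included)."""
    # [^\W\d_]+ matches one or more Unicode letters, excluding digits and underscores.
    return re.findall(r"[^\W\d_]+", text.lower())


def stopword_candidates(text, top_k=10):
    """Return the top_k most frequent terms of a single document.

    top_k is a hypothetical cutoff chosen for illustration only.
    """
    counts = Counter(tokenize(text))
    return [term for term, _ in counts.most_common(top_k)]


# Hypothetical single-document "corpus" in Italian.
sample = (
    "Il gatto e il cane sono nel giardino e giocano. "
    "Il cane e il gatto sono amici. "
    "Nel giardino il gatto dorme e il cane gioca."
)

candidates = stopword_candidates(sample, top_k=2)
print(candidates)
```

On this toy text the two most frequent terms are the article 'il' and the conjunction 'e', which is the behaviour the frequency-based approach relies on. On real documents the cutoff would need to be tuned and the result checked against a reference list.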


Notes

1.
Of course, we mean applicable to languages having the same lexical and syntactic structure as English, Italian, etc. E.g., it would not be applicable to Vietnamese, where words are written with a space between each syllable. On the other hand, it might be applicable to inflected languages, assuming that inflected stopwords are sufficiently frequent so as to be selected by the algorithm.
2.
By Zipf’s law, the frequency of a term as a function of its rank \(r\) can be described very precisely by the relation \(F(r) = \frac{C}{r^{\alpha}}\), where \(\alpha \approx 1\) and \(C \approx 0.1\).
3.
According to [8], this approach is very similar to the one used in [16] for query expansion in IR by finding terms that have the same or similar meaning as a given term.
4.
Note that, if the term rarely occurs in the collection, the retrieved set of terms would be small. E.g., a term occurring in just one document would return only the other terms in that document. Selecting n terms should overcome the problem and yield a better sample that allows a better estimation of the distribution and importance of terms.
5.
The texts are the same as in [3], with the addition of HeG and AdA. So, performance reported in Table 4 for the TF approach on the other single texts is the same as in [3], as well. However, due to the addition of HeG and AdA, performance reported in Table 4 for the TF approach on NTT and All has changed with respect to [3]. On the other hand, all performances reported for the other approaches, and their comparison, are presented in this paper for the first time.
6.
It is a translation, not an original Italian text. Some might object that translations should not be considered in experiments concerning a language. We do not agree: we believe that translations produced by mother-tongue writers are in any case a direct expression of the target language, and thus can be in all respects considered as target language texts. Moreover, also using translated texts in the experiments may test the effectiveness and robustness of the methodology.
7.
http://snowball.tartarus.org/algorithms/italian/stop.
8.
E.g.: ‘essere’, the infinitive form of verb ‘to be’, is missing, but many inflected forms of that verb are in the list; ‘fra’ is not in the list, despite being a very common alternate form of preposition ‘tra’, which is in the list; some modal verbs are in the list, but some others are not; etc.
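
As a worked illustration of the relation in Note 2, the following sketch evaluates \(F(r) = C / r^{\alpha}\) with the approximate constants given there. It assumes that \(F(r)\) is read as the relative frequency of the term at rank \(r\); the function names are hypothetical, and the figures are those of the idealized law, not measurements from the paper.

```python
def zipf_frequency(rank, c=0.1, alpha=1.0):
    """Idealized relative frequency of the term at the given rank (rank >= 1)."""
    return c / rank ** alpha


def cumulative_share(top_n, c=0.1, alpha=1.0):
    """Share of all token occurrences covered by the top_n ranked terms."""
    return sum(zipf_frequency(r, c, alpha) for r in range(1, top_n + 1))


# Under the idealized law, the most frequent term covers about 10% of the tokens,
# the second about 5%, and the ten most frequent terms together about 29%.
for rank in (1, 2, 10):
    print(rank, round(zipf_frequency(rank), 4))
print("top 10 share:", round(cumulative_share(10), 4))
```

Under these assumptions a handful of top-ranked terms accounts for a large share of the text, which is consistent with the observation that raw term frequency is a useful signal for stopwords.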

References

1. Al-Shalabi, R., Kanaan, G., Jaam, J.M., Hasnah, A., Hilat, E.: Stop-word removal algorithm for Arabic language. In: Proceedings of the 2004 International Conference on Information and Communication Technologies: From Theory to Applications, pp. 545–549 (2004)
2. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Hoboken (1991)
3. Ferilli, S., Esposito, F.: On frequency-based approaches to learning stopwords and the reliability of existing resources – a study on Italian language. In: Serra, G., Tasso, C. (eds.) Digital Libraries and Multimedia Archives. IRCDL 2018, volume 806 of Communications in Computer and Information Science, pp. 69–80. Springer (2018)
4. Ferilli, S., Esposito, F., Grieco, D.: Automatic learning of linguistic resources for stopword removal and stemming from text. Procedia Comput. Sci. 38, 116–123 (2014)
5. Fox, C.: A stop list for general text. SIGIR Forum 24(1–2), 19–21 (1989)
6. Garg, U., Goyal, V.: Effect of stop word removal on document similarity for Hindi text. Eng. Sci. An Int. J. 2, 3 (2014)
7. Kaur, J., Buttar, P.K.: A systematic review on stopword removal algorithms. Int. J. Future Revolut. Comput. Sci. Commun. Eng. 4, 207–210 (2018)
8. Lo, R.T.-W., He, B., Ounis, I.: Automatically building a stopword list for an information retrieval system. In: Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop, vol. 5, pp. 17–24 (2005)
9. Luhn, H.P.: Keyword-in-context index for technical literature (KWIC index). J. Assoc. Inf. Sci. Technol. 11, 288–295 (1960)
10. Puri, R., Bedi, R.P.S., Goyal, V.: Automated stopwords identification in Punjabi documents. Eng. Sci. Int. J. 8, 119–125 (2013)
11. Robertson, S.E., Sparck-Jones, K.: Relevance weighting of search terms. J. Am. Soc. Inf. Sci. 27, 129–146 (1976)
12. Savoy, J.: A stemming procedure and stopword list for general French corpora. J. Assoc. Inf. Sci. Technol. 50, 944–952 (1999)
13. Sinka, M.P., Corne, D.W.: Evolving better stoplists for document clustering and web intelligence, pp. 1015–1023. IOS Press, NLD (2003)
14. Sparck-Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 28, 11–21 (1972)
15. Wilbur, W.J., Sirotkin, K.: The automatic identification of stop words. J. Inf. Sci. 18(1), 45–55 (1992)
16. Xu, J., Croft, W.B.: Query expansion using local and global document analysis. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 4–11 (1996)
17. Zou, F., Wang, F.L., Deng, X., Han, S.: Evaluation of stop word lists in Chinese language. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, May 2006, pp. 2504–2507. European Language Resources Association (ELRA) (2006)


Author information

Authors and Affiliations

University of Bari, Bari, Italy
Stefano Ferilli, Giovanni Luca Izzi & Tiziano Franza


Corresponding author

Correspondence to Stefano Ferilli.

Editor information

Editors and Affiliations

Graz University of Technology, Graz, Austria
Martin Stettinger
University of Klagenfurt, Klagenfurt, Austria
Gerhard Leitner
Graz University of Technology, Graz, Austria
Alexander Felfernig
University of North Carolina, Charlotte, NC, USA
Zbigniew W. Ras


Cite this paper

Ferilli, S., Izzi, G.L., Franza, T. (2021). Automatic Stopwords Identification from Very Small Corpora. In: Stettinger, M., Leitner, G., Felfernig, A., Ras, Z.W. (eds) Intelligent Systems in Industrial Applications. ISMIS 2020. Studies in Computational Intelligence, vol 949. Springer, Cham. https://doi.org/10.1007/978-3-030-67148-8_3


DOI: https://doi.org/10.1007/978-3-030-67148-8_3
Published: 04 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67147-1
Online ISBN: 978-3-030-67148-8
eBook Packages: Intelligent Technologies and Robotics (R0)
