Improving LocalMaxs Multiword Expression Statistical Extractor

Silva, Joaquim F.; Cunha, Jose C.

doi:10.1007/978-3-031-36021-3_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14074)

Included in the following conference series:

International Conference on Computational Science

Abstract

The LocalMaxs algorithm extracts relevant Multiword Expressions from text corpora using a statistical approach. However, statistical extractors face a greater challenge in obtaining good practical results than linguistic approaches, which benefit from language-specific syntactic and/or semantic knowledge. First, this paper contributes an improvement to the LocalMaxs algorithm, based on a more selective evaluation of the cohesion of each Multiword Expression candidate with respect to its neighbourhood, and on a filtering criterion guided by the location of stopwords within each candidate. Second, it presents a new language-independent method for the automatic self-identification of stopwords in corpora, requiring no external stopword lists or linguistic tools. The improved LocalMaxs reaches Precision values of about \(80\,\%\) for English, French, German and Portuguese, an increase of around \(12\text{--}13\,\%\) over the previous LocalMaxs version. The self-identification of stopwords reaches high Precision for top-ranked stopword candidates.

This work is supported by NOVA LINCS (UIDB/04516/2020) with the financial support of FCT.IP.

Notes

1. \(F_1 = \frac{2 \times \mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}\).
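
As a minimal sketch of how the F1 measure in this note combines Precision and Recall, the following Python computes all three for a set of extracted candidates against a reference set. The function names and the toy expressions are hypothetical, chosen only for illustration; they are not taken from the paper's evaluation.

```python
def precision_recall(extracted: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of an extracted set against a reference set."""
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    total = precision + recall
    if total == 0:
        return 0.0
    return 2 * precision * recall / total


# Toy data (hypothetical): four extracted candidates, five reference expressions.
extracted = {"climate change", "of the", "human rights", "prime minister"}
relevant = {"climate change", "human rights", "prime minister",
            "united nations", "global warming"}

precision, recall = precision_recall(extracted, relevant)
f1 = f1_score(precision, recall)
print(f"Precision={precision:.2f} Recall={recall:.2f} F1={f1:.2f}")
```

Here three of the four extracted candidates are in the reference set, so Precision is 3/4 and Recall is 3/5. Because F1 is a harmonic mean, it stays close to the lower of the two values.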

Author information

Authors and Affiliations

NOVA LINCS, NOVA School of Science and Technology, Costa da Caparica, Portugal
Joaquim F. Silva & Jose C. Cunha

Corresponding author

Correspondence to Joaquim F. Silva.

Editor information

Editors and Affiliations

Czech Technical University in Prague, Prague, Czech Republic
Jiří Mikyška
University of Amsterdam, Amsterdam, The Netherlands
Clélia de Mulatier
AGH University of Science and Technology, Krakow, Poland
Maciej Paszynski
University of Amsterdam, Amsterdam, The Netherlands
Valeria V. Krzhizhanovskaya
University of Tennessee at Knoxville, Knoxville, TN, USA
Jack J. Dongarra
University of Amsterdam, Amsterdam, The Netherlands
Peter M.A. Sloot

About this paper

Cite this paper

Silva, J.F., Cunha, J.C. (2023). Improving LocalMaxs Multiword Expression Statistical Extractor. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_13

DOI: https://doi.org/10.1007/978-3-031-36021-3_13
Published: 26 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36020-6
Online ISBN: 978-3-031-36021-3
eBook Packages: Computer Science; Computer Science (R0)