Evaluating Semantic Similarity Methods to Build Semantic Predictability Norms of Reading Data

Leal, Sidney; Casanova, Edresson; Paetzold, Gustavo; Aluísio, Sandra

doi:10.1007/978-3-030-83527-9_3

Sidney Leal¹, Edresson Casanova¹, Gustavo Paetzold² & Sandra Aluísio¹

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12848)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue


Abstract

Predictability corpora built via the Cloze task generally accompany eye-tracking data for the study of the processing costs of linguistic structures in reading-for-comprehension tasks. Two semantic measures are commonly calculated to evaluate expectations about forthcoming words: (i) the semantic fit of the target word with the previous context of a sentence, and (ii) semantic similarity scores that represent the semantic similarity between the target word and the Cloze task responses for it. For Brazilian Portuguese (BP), there were no large eye-tracking corpora with predictability norms. The goal of this paper is to present a method to calculate the two semantic measures used in the first BP corpus of eye movements during silent reading of short paragraphs by undergraduate students. The method was informed by a large evaluation of both static and contextualized word embeddings, trained on large corpora of texts. Here, we make publicly available: (i) a BP corpus for a sentence-completion task to evaluate semantic similarity, (ii) a new methodology to build this corpus based on the scores of Cloze data taken from our project, and (iii) a hybrid method to compute the two semantic measures in order to build predictability corpora in BP.
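As a rough illustration of the two measures named in the abstract, the sketch below computes both with cosine similarity over static word vectors. It is not the hybrid method of the paper, whose details are not given on this page. The averaging of context vectors, the function names, and the three-dimensional toy embeddings are all assumptions made for the example; a real setup would load pretrained BP embeddings instead.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))

def semantic_fit(target, context, emb):
    """Measure (i): cosine between the target vector and the mean of the
    vectors of the preceding context words (an assumed simplification)."""
    if target not in emb:
        return None
    vectors = [emb[w] for w in context if w in emb]
    if not vectors:
        return None
    return cosine(emb[target], np.mean(vectors, axis=0))

def response_similarity(target, responses, emb):
    """Measure (ii): mean cosine between the target and each Cloze response.
    A response given by several participants is counted once per occurrence."""
    if target not in emb:
        return None
    scores = [cosine(emb[target], emb[r]) for r in responses if r in emb]
    if not scores:
        return None
    return sum(scores) / len(scores)

# Hypothetical toy embeddings; the values are invented for illustration only.
emb = {
    "menino": np.array([0.9, 0.1, 0.0]),
    "comeu": np.array([0.2, 0.9, 0.1]),
    "bolo": np.array([0.1, 0.8, 0.2]),
    "pão": np.array([0.0, 0.9, 0.3]),
    "carro": np.array([0.1, 0.0, 0.9]),
}

if __name__ == "__main__":
    # Target "bolo" after the context "menino comeu", with three Cloze responses.
    print("semantic fit:", semantic_fit("bolo", ["menino", "comeu"], emb))
    print("response similarity:",
          response_similarity("bolo", ["pão", "pão", "carro"], emb))
```

Both functions return None when the target or every comparison word is out of vocabulary, so a caller can decide how to handle missing vectors rather than receiving a misleading zero.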


Notes

1. http://www.nilc.icmc.usp.br/nilc/index.php/rastros.
2. http://www.nilc.icmc.usp.br/embeddings.
3. https://www.inf.pucrs.br/linatural/wordpress/pucrs-bbp-embeddings/.
4. Dataset and evaluation sources are available at: https://github.com/sidleal/TSD2021.


Author information

Authors and Affiliations

Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo (USP), São Carlos, Brazil
Sidney Leal, Edresson Casanova & Sandra Aluísio
Universidade Tecnológica Federal do Paraná (UTFPR) - Campus Toledo, Toledo, Brazil
Gustavo Paetzold


Corresponding author

Correspondence to Sidney Leal.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
František Pártl
University of West Bohemia, Pilsen, Czech Republic
Miloslav Konopík


About this paper

Cite this paper

Leal, S., Casanova, E., Paetzold, G., Aluísio, S. (2021). Evaluating Semantic Similarity Methods to Build Semantic Predictability Norms of Reading Data. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_3


DOI: https://doi.org/10.1007/978-3-030-83527-9_3
Published: 30 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9
eBook Packages: Computer Science, Computer Science (R0)
