Evaluating Semantic Similarity Methods to Build Semantic Predictability Norms of Reading Data

  • Conference paper
Text, Speech, and Dialogue (TSD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12848)

Abstract

Predictability corpora built via the Cloze task generally accompany eye-tracking data for the study of the processing costs of linguistic structures in reading-for-comprehension tasks. Two semantic measures are commonly calculated to evaluate expectations about forthcoming words: (i) the semantic fit of the target word with the preceding context of a sentence, and (ii) semantic similarity scores between the target word and the Cloze task responses for it. For Brazilian Portuguese (BP), there was no large eye-tracking corpus with predictability norms. The goal of this paper is to present a method to calculate the two semantic measures used in the first BP corpus of eye movements during silent reading of short paragraphs by undergraduate students. The method was informed by a large evaluation of both static and contextualized word embeddings, trained on large corpora of texts. Here, we make publicly available: (i) a BP corpus for a sentence-completion task to evaluate semantic similarity, (ii) a new methodology to build this corpus based on the scores of Cloze data taken from our project, and (iii) a hybrid method to compute the two semantic measures in order to build predictability corpora in BP.
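
The abstract names the two semantic measures but does not give their formulas. As a rough illustration only, the sketch below assumes static word vectors and cosine similarity: semantic fit is taken as the cosine between the target vector and the mean vector of the preceding context, and the response score as the count-weighted mean cosine between the target and each Cloze response. The function names, the three-dimensional toy vectors in `TOY_EMB`, and the weighting scheme are all hypothetical and are not the authors' hybrid method.

```python
import math

# Hypothetical 3-d toy vectors; real work would load trained BP embeddings.
TOY_EMB = {
    "gato":  [0.9, 0.1, 0.0],
    "bebeu": [0.2, 0.9, 0.1],
    "leite": [0.3, 0.8, 0.2],
    "água":  [0.1, 0.9, 0.3],
    "carro": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def semantic_fit(target, context, emb):
    """Assumed measure (i): cosine between the target and the mean context vector.

    Out-of-vocabulary context words are skipped; returns None when the target
    or every context word is missing from the embedding table.
    """
    context_vectors = [emb[w] for w in context if w in emb]
    if target not in emb or not context_vectors:
        return None
    return cosine(emb[target], mean_vector(context_vectors))


def response_similarity(target, responses, emb):
    """Assumed measure (ii): count-weighted mean cosine between target and responses.

    `responses` maps each Cloze response word to how many participants gave it.
    Returns None when the target or all responses are out of vocabulary.
    """
    if target not in emb:
        return None
    weighted_sum = 0.0
    total_count = 0
    for word, count in responses.items():
        if word in emb:
            weighted_sum += count * cosine(emb[target], emb[word])
            total_count += count
    if total_count == 0:
        return None
    return weighted_sum / total_count


if __name__ == "__main__":
    # Made-up item: context "o gato bebeu", target "leite", invented response counts.
    print(semantic_fit("leite", ["o", "gato", "bebeu"], TOY_EMB))
    print(response_similarity("leite", {"leite": 6, "água": 3, "carro": 1}, TOY_EMB))
```

With contextualized embeddings, the lookup `emb[word]` would presumably be replaced by a vector computed from the word in its sentence; the sketch leaves that out.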

Notes

  1. http://www.nilc.icmc.usp.br/nilc/index.php/rastros.

  2. http://www.nilc.icmc.usp.br/embeddings.

  3. https://www.inf.pucrs.br/linatural/wordpress/pucrs-bbp-embeddings/.

  4. Dataset and evaluation sources are available at: https://github.com/sidleal/TSD2021.

Corresponding author

Correspondence to Sidney Leal.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Leal, S., Casanova, E., Paetzold, G., Aluísio, S. (2021). Evaluating Semantic Similarity Methods to Build Semantic Predictability Norms of Reading Data. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_3

  • DOI: https://doi.org/10.1007/978-3-030-83527-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83526-2

  • Online ISBN: 978-3-030-83527-9

  • eBook Packages: Computer Science, Computer Science (R0)
