LX-DSemVectors: Distributional Semantics Models for Portuguese

Rodrigues, João; Branco, António; Neale, Steven; Silva, João

doi:10.1007/978-3-319-41552-9_27

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9727)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language

Abstract

In this article we describe the creation and distribution of the first publicly available word embeddings for Portuguese. Our embeddings are evaluated on their own and also compared with the original English models on a well-known analogy task. To this end, we gathered a large Portuguese corpus of 1.7 billion tokens, developed the first distributional semantic analogies test set for Portuguese, and carried out the first parametrization and evaluation of Portuguese word embedding models.
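To make the analogy task mentioned in the abstract concrete, the following is a minimal sketch of how such a test set is commonly scored: for a question "a is to b as c is to ?", the answer is taken to be the vocabulary word whose vector is closest (by cosine similarity) to b − a + c. The toy vectors, the example questions, and the function names below are all hypothetical and chosen for illustration only; whether the chapter uses exactly this scoring rule is an assumption, since the abstract does not say.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to (b - a + c), excluding the three question words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Hypothetical 4-dimensional toy embeddings (not taken from the paper).
# Dimensions, informally: royalty, male, female, object.
TOY_VECTORS = {
    "rei":    [1.0, 1.0, 0.0, 0.0],
    "rainha": [1.0, 0.0, 1.0, 0.0],
    "homem":  [0.0, 1.0, 0.0, 0.0],
    "mulher": [0.0, 0.0, 1.0, 0.0],
    "mesa":   [0.0, 0.0, 0.0, 1.0],
}

# Hypothetical analogy questions: a is to b as c is to expected.
TOY_QUESTIONS = [
    ("homem", "rei", "mulher", "rainha"),
    ("rainha", "mulher", "rei", "homem"),
]

if __name__ == "__main__":
    print(solve_analogy(TOY_VECTORS, "homem", "rei", "mulher"))
    print(analogy_accuracy(TOY_VECTORS, TOY_QUESTIONS))
```

With real embeddings the same accuracy figure would be computed over the full analogy test set, usually broken down by question category, which is what allows a Portuguese model to be compared with an English one on the corresponding questions.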

Notes

1. For a more complete description of the evaluation methods, see [22].
2. http://code.google.com/p/word2vec/.
3. www.jornaldigital.com.
4. www.observador.pt.

References

1. Al-Rfou, R., Perozzi, B., Skiena, S.: Polyglot: distributed word representations for multilingual NLP. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 183–192. Association for Computational Linguistics, Sofia, August 2013
2. Barreto, F., Branco, A., Ferreira, E., Mendes, A., Nascimento, M.F., Nunes, F., Silva, J.: Open resources and tools for the shallow processing of Portuguese: the TagShare project. In: Proceedings of LREC 2006. Citeseer (2006)
3. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
4. Bowman, S.R., Potts, C., Manning, C.D.: Recursive neural networks can learn logical semantics. In: ACL-IJCNLP, p. 12 (2015)
5. Branco, A., Silva, J.: Evaluating solutions for the rapid development of state-of-the-art POS taggers for Portuguese. In: LREC (2004)
6. Cettolo, M., Girardi, C., Federico, M.: WIT3: Web inventory of transcribed and translated talks. In: Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pp. 261–268 (2012)
7. Fonseca, E.R., Rosa, J.L.G., Aluísio, S.M.: Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. J. Braz. Comput. Soc. 21(1), 1–14 (2015)
8. Garvin, P.L.: Computer participation in linguistic research. Language 38, 385–389 (1962)
9. Gaudio, R.D., Burchardt, A., Branco, A.: Evaluating machine translation in a usage scenario. In: Proceedings of LREC (to appear in print, 2016)
10. Hermann, K.M., Blunsom, P.: Multilingual models for compositional distributed semantics. arXiv preprint arXiv:1404.4641 (2014)
11. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: MT Summit, vol. 5, pp. 79–86. Citeseer (2005)
12. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 3, 211–225 (2015)
13. Li, J., Jurafsky, D.: Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070 (2015)
14. Ling, W., Luís, T., Marujo, L., Astudillo, R.F., Amir, S., Dyer, C., Black, A.W., Trancoso, I.: Finding function in form: compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096 (2015)
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
16. Mikolov, T., Kopecký, J., Burget, L., Glembek, O., Černocký, J.H.: Neural network based language models for highly inflective languages. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4725–4728. IEEE (2009)
17. do Nascimento, M.F.B., Pereira, L., Saramago, J.: Portuguese corpora at CLUL. PRAXIS 2(2.1/759), 95 (2000)
18. Pardo, T.A.S., Nunes, M.d.G.V.: A construção de um corpus de textos científicos em português do Brasil e sua marcação retórica. Tech. rep. (2003)
19. Rehurek, R., Sojka, P.: Gensim: Python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic (2011)
20. dos Santos, C., Guimarães, V.: Boosting named entity recognition with neural character embeddings. In: Proceedings of NEWS 2015 The Fifth Named Entities Workshop, p. 25 (2015)
21. Santos, C.D., Zadrozny, B.: Learning character-level representations for part-of-speech tagging. In: Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 1818–1826 (2014)
22. Schnabel, T., Labutov, I., Mimno, D., Joachims, T.: Evaluation methods for unsupervised word embeddings. In: Proceedings of EMNLP (2015)
23. Tiedemann, J.: News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In: Nicolov, N., Bontcheva, K., Angelova, G., Mitkov, R. (eds.) Recent Advances in Natural Language Processing, vol. V, pp. 237–248. John Benjamins, Amsterdam/Philadelphia (2009)
24. Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Calzolari, N., Choukri, K., Declerck, T., Dogan, M.U., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S. (eds.) Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA), Istanbul, May 2012

Acknowledgements

The results reported in this paper were partially supported by the Portuguese Government’s P2020 program under the grant 08/SI/2015/3279: ASSET-Intelligent Assistance for Everyone Everywhere, and by the EC’s FP7 program under the grant number 610516: QTLeap-Quality Translation by Deep Language Engineering Approaches.

Author information

Authors and Affiliations

Department of Informatics, Faculty of Sciences, University of Lisbon, Lisbon, Portugal
João Rodrigues, António Branco, Steven Neale & João Silva

Corresponding author

Correspondence to João Rodrigues.

Editor information

Editors and Affiliations

Universidade de Lisboa, Portugal
João Silva
ISCTE-IUL, Lisbon, Portugal
Ricardo Ribeiro
Universidade de Évora, Évora, Portugal
Paulo Quaresma
Universidade de Caxias do Sul, Caxias do Sul, Brazil
André Adami
Universidade de Lisboa, Lisboa, Portugal
António Branco

Cite this paper

Rodrigues, J., Branco, A., Neale, S., Silva, J. (2016). LX-DSemVectors: Distributional Semantics Models for Portuguese. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_27

DOI: https://doi.org/10.1007/978-3-319-41552-9_27
Published: 21 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41551-2
Online ISBN: 978-3-319-41552-9
eBook Packages: Computer Science, Computer Science (R0)
