brWaC: A WaCky Corpus for Brazilian Portuguese

Boos, Rodrigo; Prestes, Kassius; Villavicencio, Aline; Padró, Muntsa

doi:10.1007/978-3-319-09761-9_22


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8775)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language


Abstract

Initiatives for constructing very large corpora have increased in recent years, especially those using the Web as a corpus, since large corpora are crucial for many Natural Language Processing tasks. The WaCky (Web-As-Corpus Kool Yinitiative) methodology has been used to build very large corpora (over a billion words each) for languages such as English, Italian and German, among others. In this paper we present ongoing work on building brWaC, a massive Brazilian Portuguese corpus crawled from .br domains. At the moment, crawling and PoS tagging are finished, resulting in a tokenized and lemmatized corpus of 3 billion words. The next step is to parse the whole corpus.

We gratefully acknowledge the support of projects CNPq (PRONEM) 003/2011, CNPq 482520/2012-4, 312184/2012-3, 551964/2011-1, PNPD 2484/2009 and Capes-Cofecub 707/11.



Author information

Authors and Affiliations

Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
Rodrigo Boos, Kassius Prestes, Aline Villavicencio & Muntsa Padró


Editor information

Editors and Affiliations

FCHS, Universidade do Algarve, Campus de Gambelas, 8005-139, Faro, Portugal
Jorge Baptista
INESC-ID Lisboa, Lisbon, Portugal
Nuno Mamede
IT-University of Coimbra, Coimbra, Portugal
Sara Candeias
USP-EACH, São Paulo-SP, Brazil
Ivandré Paraboni
USP-ICMC, Universidade de São Paulo, São Carlos, SP, Brazil
Thiago A. S. Pardo
SCC-ICMC, University of São Paulo, São Carlos, SP, Brazil
Maria das Graças Volpe Nunes


About this paper

Cite this paper

Boos, R., Prestes, K., Villavicencio, A., Padró, M. (2014). brWaC: A WaCky Corpus for Brazilian Portuguese. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., Volpe Nunes, M.d.G. (eds) Computational Processing of the Portuguese Language. PROPOR 2014. Lecture Notes in Computer Science, vol 8775. Springer, Cham. https://doi.org/10.1007/978-3-319-09761-9_22


DOI: https://doi.org/10.1007/978-3-319-09761-9_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09760-2
Online ISBN: 978-3-319-09761-9
eBook Packages: Computer Science (R0)
