brWaC: A WaCky Corpus for Brazilian Portuguese

  • Conference paper
Computational Processing of the Portuguese Language (PROPOR 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8775)

Abstract

Initiatives for constructing very large corpora have increased in recent years, especially those using the Web as a corpus, since large corpora are crucial for many Natural Language Processing tasks. The WaCky (Web-As-Corpus Kool Yinitiative) methodology has been used to build very large corpora (over a billion words each) for languages such as English, Italian and German, among others. In this paper we present ongoing work on building brWaC, a massive Brazilian Portuguese corpus crawled from .br domains. At the moment, crawling and PoS tagging are finished, resulting in a tokenized and lemmatized corpus of 3 billion words. The next step is to parse the whole corpus.

We gratefully acknowledge the support of projects CNPq (PRONEM) 003/2011, CNPq 482520/2012-4, 312184/2012-3, 551964/2011-1, PNPD 2484/2009 and Capes-Cofecub 707/11.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Boos, R., Prestes, K., Villavicencio, A., Padró, M. (2014). brWaC: A WaCky Corpus for Brazilian Portuguese. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., Volpe Nunes, M.d.G. (eds) Computational Processing of the Portuguese Language. PROPOR 2014. Lecture Notes in Computer Science, vol 8775. Springer, Cham. https://doi.org/10.1007/978-3-319-09761-9_22

  • DOI: https://doi.org/10.1007/978-3-319-09761-9_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09760-2

  • Online ISBN: 978-3-319-09761-9

  • eBook Packages: Computer Science, Computer Science (R0)
