Abstract
Initiatives to build very large corpora have multiplied in recent years, many of them using the Web as a corpus, since large corpora are crucial for many Natural Language Processing tasks. The WaCky (Web-As-Corpus Kool Yinitiative) methodology has been used to build very large corpora (over a billion words each) for languages such as English, Italian and German, among others. In this paper we present ongoing work on building brWaC, a massive Brazilian Portuguese corpus crawled from .br domains. At present, crawling and PoS tagging are finished, resulting in a tokenized and lemmatized corpus of 3 billion words. The next step is to parse the whole corpus.
We gratefully acknowledge the support of projects CNPq (PRONEM) 003/2011, CNPq 482520/2012-4, 312184/2012-3, 551964/2011-1, PNPD 2484/2009 and Capes-Cofecub 707/11.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Boos, R., Prestes, K., Villavicencio, A., Padró, M. (2014). brWaC: A WaCky Corpus for Brazilian Portuguese. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., Volpe Nunes, M.d.G. (eds) Computational Processing of the Portuguese Language. PROPOR 2014. Lecture Notes in Computer Science, vol 8775. Springer, Cham. https://doi.org/10.1007/978-3-319-09761-9_22
DOI: https://doi.org/10.1007/978-3-319-09761-9_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09760-2
Online ISBN: 978-3-319-09761-9
eBook Packages: Computer Science, Computer Science (R0)