Old Needs, New Solutions: Comparable Corpora for Language Professionals

Building and Using Comparable Corpora

Abstract

Use of corpora by language service providers and language professionals remains limited because competing resources (e.g. translation memory software, term bases and so forth) are likely to be perceived as less demanding in terms of the time and effort required to obtain and (learn to) use them. These resources, however, have limitations that could be compensated for by integrating comparable corpora and corpus-building tools into the translator’s toolkit. This chapter provides an overview of the ways in which different types of comparable corpora can be used in translation teaching and practice. First, two traditional corpus typologies are presented, namely small and specialized “handmade” corpora collected by end users themselves for a specific task, and large and general “manufactured” corpora collected by expert teams and made available to end users. We suggest that striking a middle ground between these two extremes is vital for professional uptake. To this end, we show how the BootCaT toolkit can be used to construct largish and relatively specialized comparable corpora for a specific translation task, and how, by varying the search parameters in very simple ways, the size and usability of the corpora thus constructed can be further increased. The process is exemplified with reference to a simulated task (the translation of a patient information leaflet from English into Italian), and its efficacy is evaluated through an end-user questionnaire.

Notes

  1. This may change in the future, as more tools like Linguee (http://www.linguee.com/) provide access to the aligned web, and possibly to subsections of it.

  2. As suggested in the previous section, in this chapter we are not specifically discussing aligned parallel corpora. In our view these are more akin to TMs than to comparable corpora in terms of the technical issues involved in their construction and consultation, and of the type of insights translators can obtain from them; they are therefore not directly relevant here.

  3. “More advanced” corpus querying techniques, like extraction of keywords or computation of collocational scores, can of course be of great interest to translators. However, their relevance and usefulness may be hard to grasp for less corpus-savvy users, and hence they are not discussed here.

  4. See http://wacky.sslmit.unibo.it/doku.php for information about ukWaC and itWaC.

  5. Currently BootCaT uses Bing for URL retrieval, after both Google and Yahoo! discontinued their API services.

  6. http://www.antlab.sci.waseda.ac.jp/software.html

  7. In this paper we define genre (loosely based on Swales [34]) as a recognizable set of communicative events with a shared purpose and common formal features.

  8. We used the frontend developed by Eros Zanchetta [37] and available here: http://bootcat.sslmit.unibo.it/.

References

  1. Aston, G.: Corpus use and learning to translate. Textus 12, 289–314 (1999)

  2. Baroni, M., Bernardini, S.: BootCaT: Bootstrapping corpora and terms from the web. In: Proceedings of LREC 2004, pp. 1313–1316, Lisbon. ELDA (2004)

  3. Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The wacky wide web: a collection of very large linguistically processed web-crawled corpora. Lang. Resour. Eval. 43(3), 209–226 (2009)

  4. Bernardini, S., Castagnoli, S., Ferraresi, A., Gaspari, F., Zanchetta, E.: Introducing comparapedia: a new resource for corpus-based translation studies. Paper Presented at the International Symposium on Using Corpora in Contrastive and Translation Studies (UCCTS 2010), Edge Hill University, Ormskirk (July 2010)

  5. Biber, D., Conrad, S.: Lexical bundles in conversation and academic prose. In: Hasselgard, H., Oksefjell, S. (eds.) Out of Corpora: Studies in Honour of Stig Johansson, pp. 181–190. Rodopi, Amsterdam (1999)

  6. Bowker, L.: Computer-Aided Translation Technology: A Practical Introduction. University of Ottawa Press, Ottawa (2002)

  7. Bowker, L.: Examining the impact of corpora on terminographic practice in the context of translation. In: Kruger, A., Wallmach, K., Munday, J. (eds.) Corpus-Based Translation Studies, pp. 211–236. Continuum, London (2011)

  8. Castagnoli, S.: Using the web as a source of LSP corpora in the terminology classroom. In: Baroni, M., Bernardini, S. (eds.) Wacky! Working Papers on the Web as Corpus, pp. 159–172. GEDIT, Bologna (2006)

  9. Chama, Z.: From segment focus to context orientation. TC World, 2010. online: http://www.tcworld.info/index.php?id=167

  10. Christ, O.: A modular and flexible architecture for an integrated corpus query system. In: Proceedings of COMPLEX 1994, pp. 23–32, Budapest (1994)

  11. Cowie, A.P. (ed.): Phraseology: Theory, Analysis, and Applications. Oxford University Press, Oxford (2001)

  12. Crossley, S.A., Louwerse, M.M.: Multi-dimensional register classification using bi-grams. Int. J. Corpus Linguist. 12(4), 453–478 (2007)

  13. Crowston, K., Kwasnik, B.H.: A framework for creating a facetted classification for genres: addressing issues of multidimensionality. Hawaii International Conference on System Sciences, 4, 2004. online: http://doi.ieeecomputersociety.org/10.1109/HICSS.2004.1265268

  14. Désilets, A., Melançon, C., Patenaude, G., Brunette, L.: How translators use tools and resources to resolve translation problems: an ethnographic study. In: Proceedings of MT Summit XII-Workshop: Beyond Translation Memories, Ottawa (2009)

  15. Fantinuoli, C.: Specialized corpora from the web and term extraction for simultaneous interpreters. In: Baroni, M., Bernardini, S. (eds.) Wacky! Working Papers on the Web as Corpus, pp. 173–190. GEDIT, Bologna (2006)

  16. Ferraresi, A., Bernardini, S., Picci, G., Baroni, M.: Web corpora for bilingual lexicography: a pilot study of English-French collocation extraction and translation. In: Xiao, R. (ed.) Using Corpora in Contrastive and Translation Studies, pp. 337–359. Cambridge Scholars Publishing, Newcastle (2010)

  17. Gatto, M.: From Body to Web. An Introduction to the Web as Corpus. Laterza, Bari (2009)

  18. Gavioli, L.: Exploring Corpora for ESP Learning. Benjamins, Amsterdam (2005)

  19. Ghadessy, M., Henry, A., Roseberry, R.L. (eds.): Small Corpus Studies and ELT. Benjamins, Amsterdam (2001)

  20. Goeuriot, L., Morin, M., Daille, B.: Compilation of specialized comparable corpus in French and Japanese. In: Proceedings of the ACL-IJCNLP Workshop Building and Using Comparable Corpora (BUCC 2009) (2009)

  21. Gries, S.Th., Mukherjee, J.: Lexical gravity across varieties of English: an ICE-based study of n-grams in Asian Englishes. Int. J. Corpus Linguist. 15(4), 520–548 (2010)

  22. Heid, U.: Corpus linguistics and lexicography. In: Kytö, M., Lüdeling, A. (eds.) Corpus Linguistics: An International Handbook, pp. 131–153. Mouton de Gruyter, Berlin (2008)

  23. Hoey, M.: Lexical priming and translation. In: Kruger, A., Wallmach, K., Munday, J. (eds.) Corpus-Based Translation Studies, pp. 153–168. Continuum, London (2011)

  24. MeLLANGE: Corpora and e-learning questionnaire. Results summary - professionals. Internal Document (2006)

  25. MultiTrans. Multitrans 4(tm): Taking the multilingual textbase approach to new heights. MultiCorpora White Paper, online: http://www.multicorpora.com/lesNVIAdmin/File/MCwhitepaper1.pdf (August 2005)

  26. Munday, J.: Looming large: a cross-linguistic analysis of semantic prosodies in comparable reference corpora. In: Kruger, A., Wallmach, K., Munday, J. (eds.) Corpus-Based Translation Studies, pp. 169–186. Continuum, London (2011)

  27. Pearson, J.: Terms in Context. Benjamins, Amsterdam (1998)

  28. Pearson, J.: Using parallel texts in the translator training environment. In: Zanettin, F., Bernardini, S., Stewart, D. (eds.) Corpora in Translator Education, pp. 15–24. St Jerome, Manchester (2003)

  29. Philip, G.: Arriving at equivalence: Making a case for comparable general reference corpora in translation studies. In: Beeby, A., Rodríguez Inés, P., Sánchez-Gijón, P. (eds.) Corpus Use and Translating, pp. 59–73. Benjamins, Amsterdam (2009)

  30. Rinsche, A., Zanotti, N.P.: Study on the Size of the Language Industry in the EU. European Commission - Directorate General for Translation, Brussels (2009)

  31. Santini, M.: State-of-the-art on automatic genre identification. Technical Report ITRI-04-03, ITRI, University of Brighton, UK (2004)

  32. Serianni, L.: Grammatica Italiana. UTET, Torino (1991)

  33. Sharoff, S.: Creating general-purpose corpora using automated search engine queries. In: Baroni, M., Bernardini, S. (eds.) Wacky! Working Papers on the Web as Corpus, pp. 63–98. GEDIT, Bologna (2006)

  34. Swales, J.: Genre Analysis. English in Academic and Research Settings. Cambridge University Press, Cambridge (1990)

  35. Varantola, K.: Translators and disposable corpora. In: Zanettin, F., Bernardini, S., Stewart, D. (eds.) Corpora in Translator Education, pp. 55–70. St Jerome, Manchester (2003)

  36. Williams, I.A.: A translator’s reference needs: dictionaries or parallel texts. Target 8, 277–299 (1996)

  37. Zanchetta, E.: Corpora for the masses: the BootCaT front-end. Pecha Kucha Presented at the Corpus Linguistics 2011 Conference. University of Birmingham, Birmingham (July 2011)

Acknowledgments

We would like to thank the students and colleagues who kindly agreed to evaluate the URLs for us, Claudia Lecci for her expert insights about TM software, Federico Gaspari for fruitful lunchtime discussions on corpus construction strategies, and the anonymous reviewer and the editors of the volume for their valuable feedback and suggestions.

Author information

Corresponding author

Correspondence to Silvia Bernardini.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Bernardini, S., Ferraresi, A. (2013). Old Needs, New Solutions: Comparable Corpora for Language Professionals. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_16

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science (R0)
