Corpus Methods in a Digitized World

  • Conference paper
Computational and Corpus-Based Phraseology (EUROPHRAS 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10596)

Abstract

Data is available like never before. We believed that back in the 1990s, but corpora are even larger today than they were then, and they will continue to grow for some time to come. Thus far, corpus sizes have been limited by our ability to collect data, but we are rapidly approaching a fundamental limit on the supply of written and spoken language. There are only so many people in the world, and they have only so much time to communicate with one another. It is becoming feasible to digitize a non-trivial fraction of the world’s communication. This ability is creating opportunities for new audiences to join in on the fun. Google Ngrams makes it easy for anyone to apply corpus-based methods to half a trillion words (4% of all books ever printed). The popular press refers to corpus methods and Google Ngrams as “addictive.” Computer scientists are talking about “digital immortality” (recording much of human communication and storing it forever). Digital immortality may not be a reality just yet, but psychologists are already recording most of what children say and hear between 2 months and 2 years of age in order to better understand language acquisition. As the world becomes digitized, there will be many applications of corpus-based methods, including lexicography and much more.

Notes

  1. https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte.

  2. Storage on phones tends to use more expensive solid state disk. Those prices are also falling, though not as rapidly.

  3. https://www.google.com/earth.

  4. http://www.worldwidetelescope.org.

  5. https://www.google.com/intl/en/chrome/demos/speech.html.

  6. https://azure.microsoft.com/en-us/services/cognitive-services/speech/.

  7. https://speech-to-text-demo.mybluemix.net.

  8. Two examples of speech companies in the medical business are: https://www.nuance.com and https://mmodal.com.

  9. https://www.popuparchive.com/.

  10. https://catalog.ldc.upenn.edu/ldc2004t19.

  11. https://catalog.ldc.upenn.edu/ldc97s62.

  12. https://catalog.ldc.upenn.edu/ldc97s42.

  13. https://www.ted.com/talks/deb_roy_the_birth_of_a_word.

  14. https://www.media.mit.edu/cogmac/projects/hsp.html.

  15. http://bergelsonlab.com.

  16. http://darcle.org.

  17. http://homebank.talkbank.org.

  18. http://talkbank.org/.

  19. https://www.clarin.eu.

  20. https://nyu.databrary.org.

  21. http://aphasia.talkbank.org.

  22. https://en.wikipedia.org/wiki/As_We_May_Think.

  23. https://archive.org/web.

  24. https://en.wikipedia.org/wiki/Wayback_Machine.

  25. https://www.ted.com/talks/brewster_kahle_builds_a_free_digital_library.

  26. https://www.ted.com/talks/what_we_learned_from_5_million_books.

  27. https://www.ted.com/talks/anne_curzan_what_makes_a_word_real.

  28. http://www.networkworld.com/article/2197233/applications/google-s-ngram-viewer--clever-and-addictive.html.

  29. https://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-ngram.

  30. https://genius.com/Atodd-when-harvard-met-sally-n-gram-analysis-of-the-new-york-times-weddings-section-annotated.

  31. https://www.theatlantic.com/technology/archive/2013/10/googles-ngram-viewer-goes-wild/280601.

  32. https://corpus.byu.edu.

  33. It is suggested in https://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-ngram/ that the f-word appears more common in older books than it really is because of a common OCR error involving “f” and “s” discussed in [12]. While that might explain why the f-word appears to be so much more common in the 1700s than the 1800s, it doesn’t explain why so many other taboo 4-letter words are also more common in the 1700s than the 1800s.

  34. http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.

  35. It is reported in [10] that the collection contains over 5 million books and 500 billion words, but we find that the collection is about 10% smaller than that.

  36. https://code.google.com/archive/p/word2vec.

  37. https://nlp.stanford.edu/projects/glove.

  38. https://digitalsinology.org/when-n-grams-go-bad.

  39. https://books.google.com/ngrams/graph?content=France,America&year_start=1800&year_end=2000&corpus=17.

  40. https://books.google.com/ngrams/graph?content=France,America&year_start=1800&year_end=2000&corpus=18.

  41. http://www.natcorp.ox.ac.uk.

  42. http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-totalcounts-20120701.txt reports the number of words in the corpus by year between 1505 and 2008. Based on those numbers, the corpus is growing about 3% per year, or 35% per decade.
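
The compounding arithmetic behind the last note can be checked with a short sketch. It is only an illustration, assuming a constant annual growth rate; the function names are hypothetical and do not come from the paper, and no data from the totalcounts file is read here.

```python
def growth_over_period(annual_rate: float, years: int = 10) -> float:
    """Cumulative growth after `years` of compounding at a constant annual rate."""
    return (1.0 + annual_rate) ** years - 1.0


def annual_rate_from_period(period_growth: float, years: int = 10) -> float:
    """Constant annual rate that compounds to `period_growth` over `years`."""
    return (1.0 + period_growth) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # About 3% per year compounds to roughly a third per decade.
    print(f"decade growth at 3%/year: {growth_over_period(0.03):.1%}")
    # Conversely, 35% per decade corresponds to about 3% per year.
    print(f"annual rate for 35%/decade: {annual_rate_from_period(0.35):.2%}")
```

Under this assumption, 3% per year gives a little over 34% per decade, and 35% per decade corresponds to slightly more than 3% per year, so the two rounded figures in the note are consistent with each other.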

References

  1. Gemmell, J., Bell, G., Lueder, R., Drucker, S., Wong, C.: MyLifeBits: fulfilling the memex vision. In: Proceedings of the Tenth ACM International Conference on Multimedia, pp. 235–238 (2002)

  2. Bell, G., Gray, J.: Digital immortality. CACM 44(3), 28–31 (2001)

  3. Barclay, T., Gray, J., Slutz, D.: Microsoft TerraServer: a spatial data warehouse. ACM SIGMOD Record 29(2), 307–318 (2000)

  4. Szalay, A., Gray, J.: The World-wide Telescope. Science 293(5537), 2037–2040 (2001)

  5. Cieri, C., Graff, D., Kimball, O., Miller, D., Walker, K.: Fisher English Training Speech. Linguistic Data Consortium, Philadelphia (2004)

  6. Godfrey, J., Holliman, E., McDaniel, J.: SWITCHBOARD: telephone speech corpus for research and development. In: ICASSP, pp. 517–520 (1992)

  7. Canavan, A., Graff, D., Zipperlen, G.: Callhome American English Speech. Linguistic Data Consortium, Philadelphia (1997)

  8. Fausey, C., Jayaraman, S., Smith, L.: From faces to hands: changing visual input in the first two years. Cognition 152, 101–107 (2016)

  9. Bush, V.: As we may think. Atl. Monthly 176(1), 101–108 (1945)

  10. Michel, J., Shen, Y., et al.: Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176–182 (2011)

  11. Davies, M.: The Corpus of Contemporary American English as the first reliable monitor corpus of English. Lit. Linguist. Comput. 24(4), 447–464 (2010)

  12. Pechenick, E., Danforth, C., Dodds, P.: Characterizing the Google Books corpus: strong limits to inferences of socio-cultural and linguistic evolution. PLoS ONE 10(10), e0137041 (2015)

  13. Church, K., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)

  14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)

  15. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)

  16. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: NIPS, pp. 2177–2185 (2014)

  17. Firth, J.: A Synopsis of Linguistic Theory, 1930–1955. Studies in Linguistic Analysis. Basil Blackwell, Oxford (1957)

  18. Hamilton, W., Leskovec, J., Jurafsky, D.: Diachronic word embeddings reveal statistical laws of semantic change. In: ACL, pp. 1489–1501 (2016)

  19. Francis, N., Kucera, H.: Frequency Analysis of English Usage. Houghton Mifflin Company, Boston (1982)

  20. Sinclair, J.: Looking Up: An Account of the COBUILD Project in Lexical Computing and the Development of the Collins COBUILD English Language Dictionary. Collins, London (1987)

  21. Aijmer, K., Altenberg, B.: English Corpus Linguistics. Routledge, London (2014)

  22. Kilgarriff, A.: Googleology is bad science. Comput. Linguist. 33(1), 147–151 (2007)

  23. Chapman, R.: Roget’s International Thesaurus, 4th edn. Harper and Row, New York (1977)

  24. Chapman, R.: Roget’s International Thesaurus, 5th edn. Harper and Row, New York (1992)

  25. Fillmore, C., Atkins, B.: Toward a frame-based lexicon: the semantics of RISK and its neighbors. In: Frames, Fields, and Contrasts: New Essays in Semantic and Lexical Organization, pp. 75–102. Lawrence Erlbaum Associates, Hillsdale (1992)

Author information

Corresponding author

Correspondence to Kenneth Ward Church.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Church, K.W. (2017). Corpus Methods in a Digitized World. In: Mitkov, R. (ed.) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science (LNAI), vol 10596. Springer, Cham. https://doi.org/10.1007/978-3-319-69805-2_1

  • DOI: https://doi.org/10.1007/978-3-319-69805-2_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69804-5

  • Online ISBN: 978-3-319-69805-2

  • eBook Packages: Computer Science, Computer Science (R0)
