Corpus Methods in a Digitized World

Church, Kenneth Ward

doi:10.1007/978-3-319-69805-2_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10596)

Included in the following conference series:

International Conference on Computational and Corpus-Based Phraseology

Abstract

Data is available like never before. We believed that back in the 1990s, but corpora are even larger today than they were then, and they will continue to grow for some time to come. Thus far, corpus sizes have been limited by our ability to collect data, but we are rapidly approaching a fundamental limit on the supply of written and spoken language. There are only so many people in the world, and they have only so much time to communicate with one another. It is becoming feasible to digitize a non-trivial fraction of the world’s communication. This ability is creating opportunities for new audiences to join in on the fun. Google Ngrams makes it easy for anyone to apply corpus-based methods to half a trillion words (4% of all books ever printed). The popular press refers to corpus methods and Google Ngrams as “addictive.” Computer scientists are talking about “digital immortality” (recording much of human communication and storing it forever). Digital immortality may not be a reality just yet, but psychologists are already recording most of what children say and hear between 2 months and 2 years of age in order to better understand language acquisition. As the world becomes digitized, there will be many applications of corpus-based methods, including lexicography (and so much more).

Notes

1.
https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte.
2.
Storage on phones tends to use more expensive solid state disk. Those prices are also falling, though not as rapidly.
3.
https://www.google.com/earth.
4.
http://www.worldwidetelescope.org.
5.
https://www.google.com/intl/en/chrome/demos/speech.html.
6.
https://azure.microsoft.com/en-us/services/cognitive-services/speech/.
7.
https://speech-to-text-demo.mybluemix.net.
8.
Two examples of speech companies in the medical business are: https://www.nuance.com and https://mmodal.com.
9.
https://www.popuparchive.com/.
10.
https://catalog.ldc.upenn.edu/ldc2004t19.
11.
https://catalog.ldc.upenn.edu/ldc97s62.
12.
https://catalog.ldc.upenn.edu/ldc97s42.
13.
https://www.ted.com/talks/deb_roy_the_birth_of_a_word.
14.
https://www.media.mit.edu/cogmac/projects/hsp.html.
15.
http://bergelsonlab.com.
16.
http://darcle.org.
17.
http://homebank.talkbank.org.
18.
http://talkbank.org/.
19.
https://www.clarin.eu.
20.
https://nyu.databrary.org.
21.
http://aphasia.talkbank.org.
22.
https://en.wikipedia.org/wiki/As_We_May_Think.
23.
https://archive.org/web.
24.
https://en.wikipedia.org/wiki/Wayback_Machine.
25.
https://www.ted.com/talks/brewster_kahle_builds_a_free_digital_library.
26.
https://www.ted.com/talks/what_we_learned_from_5_million_books.
27.
https://www.ted.com/talks/anne_curzan_what_makes_a_word_real.
28.
http://www.networkworld.com/article/2197233/applications/google-s-ngram-viewer--clever-and-addictive.html.
29.
https://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-ngram.
30.
https://genius.com/Atodd-when-harvard-met-sally-n-gram-analysis-of-the-new-york-times-weddings-section-annotated.
31.
https://www.theatlantic.com/technology/archive/2013/10/googles-ngram-viewer-goes-wild/280601.
32.
https://corpus.byu.edu.
33.
It is suggested in https://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-ngram/ that the f-word appears more common in older books than it really is because of a common OCR error involving “f” and “s” discussed in [12]. While that might explain why the f-word appears to be so much more common in the 1700s than the 1800s, it doesn’t explain why so many other taboo 4-letter words are also more common in the 1700s than the 1800s.
34.
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.
35.
It is reported in [10] that the collection contains over 5 million books and 500 billion words, but we find that the collection is about 10% smaller than that.
36.
https://code.google.com/archive/p/word2vec.
37.
https://nlp.stanford.edu/projects/glove.
38.
https://digitalsinology.org/when-n-grams-go-bad.
39.
https://books.google.com/ngrams/graph?content=France,America&year_start=1800&year_end=2000&corpus=17.
40.
https://books.google.com/ngrams/graph?content=France,America&year_start=1800&year_end=2000&corpus=18.
41.
http://www.natcorp.ox.ac.uk.
42.
http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-totalcounts-20120701.txt reports the number of words in the corpus by year between 1505 and 2008. Based on those numbers, the corpus is growing about 3% per year, or 35% per decade.
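The growth figures in note 42 can be checked with a few lines of arithmetic. The sketch below is only an illustration, not the procedure used in the chapter: it assumes the total-counts file is a sequence of whitespace-separated records of the form year,match_count,page_count,volume_count, and the function names (parse_total_counts, annual_growth_rate, per_decade) are hypothetical. It fits a straight line to the logarithm of the yearly word counts and converts the slope to a yearly rate, then compounds that rate over ten years.

```python
import math


def parse_total_counts(text):
    """Map year -> word count, assuming records like
    'year,match_count,page_count,volume_count' separated by whitespace
    (the layout is an assumption about the total-counts file)."""
    counts = {}
    for record in text.split():
        fields = record.split(",")
        if len(fields) < 2:
            continue  # skip anything that is not a full record
        year, words = int(fields[0]), int(fields[1])
        if words > 0:  # log() below needs positive counts
            counts[year] = words
    return counts


def annual_growth_rate(counts):
    """Least-squares slope of log(count) against year, as a yearly rate."""
    years = sorted(counts)
    logs = [math.log(counts[y]) for y in years]
    mean_x = sum(years) / len(years)
    mean_y = sum(logs) / len(logs)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    den = sum((x - mean_x) ** 2 for x in years)
    return math.exp(num / den) - 1


def per_decade(rate):
    """Compound a yearly growth rate over ten years."""
    return (1 + rate) ** 10 - 1


# Synthetic records growing 3% per year, standing in for the real file.
sample = "\t".join(
    f"{1900 + i},{round(1_000_000 * 1.03 ** i)},0,0" for i in range(50)
)
rate = annual_growth_rate(parse_total_counts(sample))
print(f"{rate:.3%} per year, {per_decade(rate):.1%} per decade")
```

Compounding exactly 3% per year gives about 34% per decade, which is in line with the rounded figures quoted in the note. A fit over the real file would depend on which years are included, since the early centuries contain very few books.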

References

1. Gemmell, J., Bell, G., Lueder, R., Drucker, S., Wong, C.: MyLifeBits: fulfilling the memex vision. In: Proceedings of the Tenth ACM International Conference on Multimedia, pp. 235–238 (2002)
2. Bell, G., Gray, J.: Digital immortality. CACM 44(3), 28–31 (2001)
3. Barclay, T., Gray, J., Slutz, D.: Microsoft TerraServer: a spatial data warehouse. ACM SIGMOD Record 29(2), 307–318 (2000)
4. Szalay, A., Gray, J.: The World-wide Telescope. Science 293(5537), 2037–2040 (2001)
5. Cieri, C., Graff, D., Kimball, O., Miller, D., Walker, K.: Fisher English Training Speech. Linguistic Data Consortium, Philadelphia (2004)
6. Godfrey, J., Holliman, E., McDaniel, J.: SWITCHBOARD: telephone speech corpus for research and development. In: ICASSP, pp. 517–520 (1992)
7. Canavan, A., Graff, D., Zipperlen, G.: Callhome American English Speech. Linguistic Data Consortium, Philadelphia (1997)
8. Fausey, C., Jayaraman, S., Smith, L.: From faces to hands: changing visual input in the first two years. Cognition 152, 101–107 (2016)
9. Bush, V.: As we may think. Atl. Monthly 176(1), 101–108 (1945)
10. Michel, J., Shen, Y., et al.: Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176–182 (2011)
11. Davies, M.: The Corpus of Contemporary American English as the first reliable monitor corpus of English. Lit. Linguist. Comput. 24(4), 447–464 (2010)
12. Pechenick, E., Danforth, C., Dodds, P.: Characterizing the Google Books corpus: strong limits to inferences of socio-cultural and linguistic evolution. PLoS ONE 10(10), e0137041 (2015)
13. Church, K., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)
14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
15. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
16. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: NIPS, pp. 2177–2185 (2014)
17. Firth, J.: A Synopsis of Linguistic Theory, 1930–1955. Studies in Linguistic Analysis. Basil Blackwell, Oxford (1957)
18. Hamilton, W., Leskovec, J., Jurafsky, D.: Diachronic word embeddings reveal statistical laws of semantic change. In: ACL, pp. 1489–1501 (2016)
19. Francis, N., Kucera, H.: Frequency Analysis of English Usage. Houghton Mifflin Company, Boston (1982)
20. Sinclair, J.: Looking Up: An Account of the COBUILD Project in Lexical Computing and the Development of the Collins COBUILD English Language Dictionary. Collins, London (1987)
21. Aijmer, K., Altenberg, B.: English Corpus Linguistics. Routledge, London (2014)
22. Kilgarriff, A.: Googleology is bad science. Comput. Linguist. 33(1), 147–151 (2007)
23. Chapman, R.: Roget’s International Thesaurus, 4th edn. Harper and Row, New York (1977)
24. Chapman, R.: Roget’s International Thesaurus, 5th edn. Harper and Row, New York (1992)
25. Fillmore, C., Atkins, B.: Toward a frame-based lexicon: the semantics of RISK and its neighbors. In: Frames, Fields, and Contrasts: New Essays in Semantic and Lexical Organization, pp. 75–102. Lawrence Erlbaum Associates, Hillsdale (1992)

Author information

Authors and Affiliations

IBM, Yorktown Heights, NY, USA
Kenneth Ward Church

Corresponding author

Correspondence to Kenneth Ward Church.

Editor information

Editors and Affiliations

University of Wolverhampton, Wolverhampton, United Kingdom
Ruslan Mitkov

Cite this paper

Church, K.W. (2017). Corpus Methods in a Digitized World. In: Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science, vol 10596. Springer, Cham. https://doi.org/10.1007/978-3-319-69805-2_1

DOI: https://doi.org/10.1007/978-3-319-69805-2_1
Published: 26 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69804-5
Online ISBN: 978-3-319-69805-2
eBook Packages: Computer Science, Computer Science (R0)
