Measuring diachronic language distance using perplexity: Application to English, Portuguese, and Spanish

José Ramom Pichel Campos; Pablo Gamallo Otero; Iñaki Alegria Loinaz

doi:10.1017/S1351324919000378

Published online by Cambridge University Press: 24 July 2019

José Ramom Pichel Campos*: imaxin|software, Language Technologies, Galiza, Spain
Pablo Gamallo Otero: CiTIUS, University of Santiago de Compostela, Galiza, Spain. Email: pablo.gamallo@usc.es
Iñaki Alegria Loinaz: IXA group, Univ. of the Basque Country (UPV/EHU), Donostia/San Sebastián, Basque Country, Spain. Email: i.alegria@ehu.eus
*Corresponding author. Email: jramompichel@imaxin.com

Abstract

The objective of this work is to set out a corpus-driven methodology for automatically quantifying the diachronic language distance between chronological periods of several languages. We apply a perplexity-based measure to written texts representing different historical periods of three languages: European English, European Portuguese, and European Spanish. For this purpose, we have built historical corpora for each period, compiled from different open corpus sources containing texts as close as possible to their original spelling. The results of our experiments show that a diachronic language distance based on perplexity detects the linguistic evolution already described by historians of the three languages. Notably, the method is unsupervised and multilingual, and it needs only raw corpora organized by period.
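To make the idea of a perplexity-based distance concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a character trigram model with add-one smoothing and takes the distance between two periods as the mean of the perplexities measured in both directions. None of these choices is stated in the abstract, and the function names `char_ngrams`, `perplexity`, and `distance` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the list of overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def perplexity(train_text, test_text, n=3):
    """Perplexity of test_text under a character n-gram model of train_text.

    Assumption: add-one (Laplace) smoothing over the joint alphabet of both
    texts; the smoothing used in the article may differ.
    """
    test_grams = char_ngrams(test_text, n)
    if not test_grams:
        raise ValueError("test_text is shorter than n")

    # Count n-grams and their (n-1)-character histories in the training text.
    ngram_counts = Counter(char_ngrams(train_text, n))
    history_counts = Counter()
    for gram, count in ngram_counts.items():
        history_counts[gram[:-1]] += count

    alphabet_size = len(set(train_text) | set(test_text))

    # Sum log-probabilities of each test n-gram's last character given its history.
    log_prob = 0.0
    for gram in test_grams:
        p = (ngram_counts[gram] + 1) / (history_counts[gram[:-1]] + alphabet_size)
        log_prob += math.log(p)

    return math.exp(-log_prob / len(test_grams))


def distance(period_a, period_b, n=3):
    """Symmetric distance: mean of the two directed perplexities (an assumption)."""
    return (perplexity(period_a, period_b, n) + perplexity(period_b, period_a, n)) / 2
```

Under these assumptions, `distance(corpus_1500s, corpus_1900s)` would be called once per pair of period corpora (both variable names are placeholders for raw text strings), and a larger value would indicate that the two periods are further apart.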

Keywords

Corpus linguistics; Language resources; Similarity

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 4, July 2020, pp. 433–454

DOI: https://doi.org/10.1017/S1351324919000378
Copyright: © Cambridge University Press 2019
