
Measuring diachronic language distance using perplexity: Application to English, Portuguese, and Spanish

Published online by Cambridge University Press: 24 July 2019

José Ramom Pichel Campos*
Affiliation:
imaxin|software, Language Technologies, Galiza, Spain
Pablo Gamallo Otero
Affiliation:
CiTIUS, University of Santiago de Compostela, Galiza, Spain. Email: pablo.gamallo@usc.es
Iñaki Alegria Loinaz
Affiliation:
IXA group, Univ. of the Basque Country (UPV/EHU), Donostia/San Sebastián, Basque Country, Spain. Email: i.alegria@ehu.eus
*Corresponding author. Email: jramompichel@imaxin.com

Abstract

The objective of this work is to establish a corpus-driven methodology for automatically quantifying diachronic language distance between chronological periods of several languages. We apply a perplexity-based measure to written texts representing different historical periods of three languages: European English, European Portuguese, and European Spanish. For this purpose, we have built historical corpora for each period, compiled from different open corpus sources containing texts as close as possible to their original spelling. The results of our experiments show that a diachronic language distance based on perplexity detects the linguistic evolution already described by historians of the three languages. Notably, the method is unsupervised and multilingual, and it requires only raw corpora organized by period.
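To make the idea of a perplexity-based distance between periods more concrete, the following is a minimal sketch, not the authors' implementation. The abstract only states that perplexity is applied to period corpora; the character trigram model, the add-one smoothing, the symmetric averaging of the two directions, and all function names (`train_char_ngrams`, `perplexity`, `period_distance`) are assumptions made for illustration.

```python
import math
from collections import Counter


def train_char_ngrams(text, n=3):
    """Count character n-grams and their (n-1)-character contexts.

    Assumption: a character-level model; the source does not fix the unit.
    """
    padded = " " * (n - 1) + text
    positions = range(len(padded) - n + 1)
    ngrams = Counter(padded[i:i + n] for i in positions)
    contexts = Counter(padded[i:i + n - 1] for i in positions)
    vocab = set(text)
    return ngrams, contexts, vocab


def perplexity(model, text, n=3):
    """Perplexity of `text` under `model`, with add-one smoothing (an assumption)."""
    if not text:
        raise ValueError("text must be non-empty")
    ngrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot for characters unseen in training
    padded = " " * (n - 1) + text
    log_prob = 0.0
    count = 0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + v)
        log_prob += math.log(p)
        count += 1
    return math.exp(-log_prob / count)


def period_distance(text_a, text_b, n=3):
    """Hypothetical symmetric distance: mean of the two cross-perplexities."""
    pp_ab = perplexity(train_char_ngrams(text_a, n), text_b, n)
    pp_ba = perplexity(train_char_ngrams(text_b, n), text_a, n)
    return 0.5 * (pp_ab + pp_ba)


if __name__ == "__main__":
    # Toy stand-ins for two period corpora (illustrative strings only).
    older = "whan that aprille with his shoures soote the droghte of march hath perced to the roote"
    newer = "when april with its sweet showers has pierced the drought of march to the root"
    print(period_distance(older, newer))
```

In such a sketch, a higher value would indicate that a model trained on one period predicts the other period's text poorly. With real data, each string would be replaced by the full raw corpus of a period, and held-out text would normally be used for evaluation rather than the training text itself.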

Type
Article
Copyright
© Cambridge University Press 2019
