Abstract
In this paper we outline a method for finding texts in minor languages of Russia in social networks, using VKontakte as an example. We find language-specific markers: special tokens that contain letter combinations unique to a given language and that are highly frequent in texts in that language. We use Yandex.XML to generate lists of web pages that contain texts in these languages. We then download data from pages in the https://vk.com domain through the VKontakte API.
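The marker-selection step can be illustrated with a minimal sketch. It is not the procedure used in the paper: it simplifies the "unique letter combinations" criterion to whole tokens that are frequent in a target-language sample and absent from a contrast sample. The function names (`tokenize`, `find_markers`), the parameters `top_n` and `min_count`, and the toy strings are all hypothetical and stand in for real corpus data.

```python
import re
from collections import Counter

# Runs of Unicode letters; digits, underscores and punctuation split tokens.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and return its letter-only tokens."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def find_markers(target_texts, other_texts, top_n=5, min_count=2):
    """Return up to top_n candidate marker tokens for the target language.

    A candidate occurs at least min_count times in target_texts and never
    occurs in other_texts. This is a simplification (an assumption of this
    sketch), not the selection procedure described in the paper.
    """
    target_counts = Counter(
        tok for text in target_texts for tok in tokenize(text)
    )
    other_vocab = {tok for text in other_texts for tok in tokenize(text)}
    candidates = [
        tok
        for tok, count in target_counts.most_common()
        if count >= min_count and tok not in other_vocab
    ]
    return candidates[:top_n]


# Toy placeholder strings, not real language data.
target_sample = ["abba zed abba", "abba qux zed"]
contrast_sample = ["zed foo bar"]
markers = find_markers(target_sample, contrast_sample)
```

In this toy run `zed` is rejected because it also occurs in the contrast sample, and `qux` is rejected because it is too rare, so only `abba` remains. Tokens selected this way could then serve as search queries for collecting candidate pages.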
Acknowledgements
We thank Timofey Arkhangelskiy for pointing out the difficulties of language identification, using Udmurt as an example.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Krylova, I., Orekhov, B., Stepanova, E., Zaydelman, L. (2016). Languages of Russia: Using Social Networks to Collect Texts. In: Braslavski, P., et al. Information Retrieval. RuSSIR 2015. Communications in Computer and Information Science, vol 573. Springer, Cham. https://doi.org/10.1007/978-3-319-41718-9_11
DOI: https://doi.org/10.1007/978-3-319-41718-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41717-2
Online ISBN: 978-3-319-41718-9
eBook Packages: Computer Science (R0)