Languages of Russia: Using Social Networks to Collect Texts

Krylova, Irina; Orekhov, Boris; Stepanova, Ekaterina; Zaydelman, Lyudmila

doi:10.1007/978-3-319-41718-9_11

Part of the book series: Communications in Computer and Information Science (CCIS, volume 573)

Included in the following conference series:

Russian Summer School in Information Retrieval

Abstract

In this paper we outline a method for finding texts in minor languages of Russia on social networks, using VKontakte as an example. We first find language-specific markers: tokens that contain letter combinations unique to a given language and that occur with high frequency in texts in that language. We then use Yandex.XML to generate lists of web pages that contain texts in these languages. Finally, we download data from pages in the https://vk.com domain through the VKontakte API.
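
The marker-selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`tokenize`, `char_ngrams`, `find_markers`), the n-gram length limit, and the toy input strings are all assumptions made for illustration. The sketch treats a token as a marker candidate if it contains at least one character n-gram that never occurs in a set of contrast texts (for example, texts in Russian or in related languages), and ranks the candidates by their frequency in the target-language text.

```python
import re
from collections import Counter

# Runs of Unicode letters; digits and underscores are excluded.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into letter-only tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def char_ngrams(token, n_max=3):
    """Return all character n-grams of the token for n = 1..n_max."""
    return {
        token[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(token) - n + 1)
    }


def find_markers(target_text, contrast_texts, n_max=3, top_k=10):
    """Rank target-language tokens that contain an n-gram unseen in contrast texts.

    Hypothetical helper: the threshold-free ranking and the n-gram limit
    are illustrative choices, not taken from the paper.
    """
    target_counts = Counter(tokenize(target_text))

    # Collect every n-gram that appears in any contrast-language token.
    contrast_ngrams = set()
    for text in contrast_texts:
        for token in set(tokenize(text)):
            contrast_ngrams |= char_ngrams(token, n_max)

    # Keep tokens with at least one n-gram unique to the target text,
    # ordered from most to least frequent.
    markers = [
        (token, freq)
        for token, freq in target_counts.most_common()
        if char_ngrams(token, n_max) - contrast_ngrams
    ]
    return markers[:top_k]


if __name__ == "__main__":
    # Toy strings standing in for real corpora (made-up data).
    target = "abq abq xyz abq qq"
    contrast = ["ab xyz abc"]
    print(find_markers(target, contrast))
```

In a pipeline like the one described in the abstract, the top-ranked tokens would serve as search queries, and the resulting vk.com pages would then be fetched through the VKontakte API. Those two steps are omitted here because they depend on service-specific request formats.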

Acknowledgements

We thank Timofey Arkhangelskiy for pointing out the difficulties of language identification, using Udmurt as an example.

Author information

Authors and Affiliations

National Research University Higher School of Economics, Moscow, Russia
Irina Krylova, Boris Orekhov, Ekaterina Stepanova & Lyudmila Zaydelman

Corresponding author

Correspondence to Lyudmila Zaydelman.

Editor information

Editors and Affiliations

Ural Federal University, Yekaterinburg, Russia
Pavel Braslavski
University of Amsterdam, Amsterdam, The Netherlands
Ilya Markov
University of Florida, Gainesville, Florida, USA
Panos Pardalos
Eurecat, Barcelona, Spain
Yana Volkovich
National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov
National Research University Higher School of Economics, Saint Petersburg, Russia
Sergei Koltsov
National Research University Higher School of Economics, Saint Petersburg, Russia
Olessia Koltsova

About this chapter

Cite this chapter

Krylova, I., Orekhov, B., Stepanova, E., Zaydelman, L. (2016). Languages of Russia: Using Social Networks to Collect Texts. In: Braslavski, P., et al. Information Retrieval. RuSSIR 2015. Communications in Computer and Information Science, vol 573. Springer, Cham. https://doi.org/10.1007/978-3-319-41718-9_11

DOI: https://doi.org/10.1007/978-3-319-41718-9_11
Published: 26 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41717-2
Online ISBN: 978-3-319-41718-9
eBook Packages: Computer Science, Computer Science (R0)
