A New Corpus of the Russian Social Network News Feed Paraphrases: Corpus Construction and Linguistic Feature Analysis

Pronoza, Ekaterina; Yagunova, Elena; Pronoza, Anton

doi:10.1007/978-3-030-02840-4_11


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10633)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence


Abstract

In this paper we present a new Russian paraphrase corpus derived from the news feed of a social network and conduct an initial analysis of it. Most media agencies post their news reports on their social network pages, and the headlines of these posts are often identical to those of the corresponding articles on the agencies' official websites. Sometimes, however, the two headlines differ, and in such cases the social network headline can be considered a compression or a paraphrase of the original headline. In other words, news feeds from social networks are a rich source of textual entailment and, as shown in this paper, of various linguistic phenomena, e.g., irony, presupposition and attention-attracting markers. We collect such pairs of headlines and construct the Russian social network news feed paraphrase corpus from them. We then test a paraphrase detection model, trained on the existing Russian paraphrase corpus ParaPhraser.ru (collected from official news headlines only), against the constructed dataset, and explore the linguistic and pragmatic features of the new corpus.
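The pair-collection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the normalization rule, the Jaccard token overlap used as a rough similarity score, and the two English toy headline pairs are all assumptions made for illustration. The sketch keeps only those (website headline, social network headline) pairs that still differ after light normalization, since identical pairs carry no paraphrase information.

```python
import re

def normalize(headline: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (illustrative rule only)."""
    text = re.sub(r"[^\w\s]", " ", headline.lower())  # \w is Unicode-aware, so Cyrillic is kept
    return " ".join(text.split())

def jaccard(a: str, b: str) -> float:
    """Token-set overlap of two headlines after normalization."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def candidate_pairs(pairs):
    """Keep pairs whose headlines differ after normalization, with their overlap score."""
    return [
        (site, social, jaccard(site, social))
        for site, social in pairs
        if normalize(site) != normalize(social)
    ]

# Invented toy data: (official website headline, social network headline).
pairs = [
    ("Central bank keeps key rate unchanged",
     "Central bank keeps key rate unchanged!"),
    ("Central bank keeps key rate unchanged",
     "Guess what the central bank did with the key rate"),
]
candidates = candidate_pairs(pairs)
```

Here the first pair is discarded because the headlines differ only in punctuation, while the second is kept as a paraphrase candidate. A real corpus would still need manual or model-based annotation to decide whether each candidate is actually a paraphrase.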


Notes

1. http://paraphraser.ru/about.
2. There also exists another recently collected Turkish Paraphrase Corpus [7].
3. http://paraphraser.ru/download.
4. http://paraphraser.ru/contests/?contest_id=1.
5. http://scikit-learn.org.
6. A full list of the markers is not provided in this paper because some of them are either not present in the selected sample of paraphrases or are currently not of our main interest.
7. There are, of course, other irony modifiers in the corpus, apart from irony itself (e.g., sarcasm), but they are beyond the scope of this paper.

References

Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W.: *SEM 2013 shared task: Semantic Textual Similarity. In: The Second Joint Conference on Lexical and Computational Semantics (2013)
Braslavski, P., Ustalov, D., Mukhin, M.: A spinning wheel for YARN: user interface for a crowdsourced thesaurus. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 101–104. Gothenburg, Sweden (2014)
Chen, D.L., Dolan, W.B.: Collecting Highly Parallel Data for Paraphrase Evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 190–200. Portland, Oregon, USA (2011)
Demir, S., El-Kahlout, I.D., Unal, E., Kaya, H.: Turkish paraphrase corpus. In: LREC 2012, pp. 4081–4091 (2012)
Dolan, W.B., Quirk, C., Brockett, C.: Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland (2004)
Dzikovska, M.O., et al.: SemEval-2013 Task 7: the joint student response analysis and 8th recognizing textual entailment challenge. In: Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, 13–14 June 2013
Eyecioglu, A., Keller, B.: Constructing a Turkish Corpus for Paraphrase Identification and Semantic Similarity. In: Gelbukh, A. (ed.) CICLing 2016. LNCS, vol. 9623, pp. 588–599. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75477-2_42
Fernando, S., Stevenson, M.: A semantic similarity approach to paraphrase detection. In: Proceedings of Computational Linguistics UK (CLUK 2008) 11th Annual Research Colloquium (2008)
Knight, K., Marcu, D.: Summarization beyond sentence extraction: a probabilistic approach to sentence compression. Artif. Intell. 139(1), 91–107 (2002)
McCarthy, P.M., McNamara, D.S.: The user-language paraphrase corpus. In: Cross-Disciplinary Advances in Applied Natural Language Processing: Issues and Approaches, pp. 73–89 (2008)
Pivovarova, L., Pronoza, E., Yagunova, E., Pronoza, A.: ParaPhraser: Russian Paraphrase Corpus and Shared Task. In: Filchenkov, A., Pivovarova, L., Žižka, J. (eds.) AINL 2017. CCIS, vol. 789, pp. 211–225. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-71746-3_18
Pronoza, E., Yagunova, E., Pronoza, A.: Construction of a Russian Paraphrase Corpus: Unsupervised Paraphrase Extraction. In: Braslavski, P., Markov, I., Pardalos, P., Volkovich, Y., Ignatov, D.I., Koltsov, S., Koltsova, O. (eds.) RuSSIR 2015. CCIS, vol. 573, pp. 146–157. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41718-9_8
Pronoza, E., Yagunova, E.: Low-Level Features for Paraphrase Identification. In: Sidorov, G., Galicia-Haro, S.N. (eds.) MICAI 2015. LNCS (LNAI), vol. 9413, pp. 59–71. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27060-9_5
Pronoza, E., Yagunova, E.: Comparison of sentence similarity measures for Russian paraphrase identification. In: Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT), pp. 74–82 (2015)
Pronoza, E., Yagunova, E., Kochetkova, N.: Sentence Paraphrase Graphs: Classification Based on Predictive Models or Annotators’ Decisions? In: Sidorov, G., Herrera-Alcántara, O. (eds.) MICAI 2016. LNCS (LNAI), vol. 10061, pp. 41–52. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62434-1_4
Regneri, M., Wang, R., Pinkal, M.: Aligning predicate-argument structures for paraphrase fragment extraction. In: LREC 2014, pp. 4300–4307 (2014)
Sidorov, G., Gelbukh, A., Gómez-Adorno, H., Pinto, D.: Soft similarity and soft cosine measure: similarity of features in vector space model. Computación y Sistemas 18(3), 491–504 (2014)
Wubben, S., van den Bosch, A., Krahmer, E., Marsi, E.: Clustering and matching headlines for automatic paraphrase acquisition. In: Proceedings of the 12th European Workshop on Natural Language Generation, pp. 122–125. Athens, Greece (2009)
Xu, W., Ritter, A., Grishman, R.: Gathering and generating paraphrases from Twitter with application to normalization. In: Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, pp. 121–128. Sofia, Bulgaria (2013)
Tikhonov, A.: Slovoobrazovatelnij slovar’ russkogo yazika v dvuh tomah: Ok 145000 Slov. Russkiy Yazik, Moscow (1985)


Author information

Authors and Affiliations

St.-Petersburg State University, St.-Petersburg, Russian Federation
Ekaterina Pronoza & Elena Yagunova
Institute for Informatics and Automation of the Russian Academy of Sciences, St.-Petersburg, Russian Federation
Anton Pronoza


Corresponding authors

Correspondence to Ekaterina Pronoza, Elena Yagunova or Anton Pronoza.

Editor information

Editors and Affiliations

Universidad Autónoma del Estado de Hidalgo, Pachuca, Mexico
Félix Castro
INFOTEC Aguascalientes, Aguascalientes, Mexico
Sabino Miranda-Jiménez
Tecnológico de Monterrey, Atizapán de Zaragoza, Mexico
Miguel González-Mendoza


About this paper

Cite this paper

Pronoza, E., Yagunova, E., Pronoza, A. (2018). A New Corpus of the Russian Social Network News Feed Paraphrases: Corpus Construction and Linguistic Feature Analysis. In: Castro, F., Miranda-Jiménez, S., González-Mendoza, M. (eds) Advances in Computational Intelligence. MICAI 2017. Lecture Notes in Computer Science, vol 10633. Springer, Cham. https://doi.org/10.1007/978-3-030-02840-4_11


DOI: https://doi.org/10.1007/978-3-030-02840-4_11
Published: 01 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02839-8
Online ISBN: 978-3-030-02840-4
eBook Packages: Computer Science, Computer Science (R0)
