A New Corpus of the Russian Social Network News Feed Paraphrases: Corpus Construction and Linguistic Feature Analysis

  • Conference paper
Advances in Computational Intelligence (MICAI 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10633)

Abstract

In this paper we present a new Russian paraphrase corpus derived from the news feed of a social network and report a preliminary analysis of it. Most media agencies post their news reports on their social network pages, and the headlines of these posts are often identical to those of the corresponding articles on the agencies' official websites. Sometimes, however, the two headlines differ, and in such cases the social network headline can be considered a compression or a paraphrase of the original. In other words, the news feed of a social network is a rich source of textual entailment and, as we show in this paper, of various linguistic phenomena, e.g., irony, presupposition, and attention-attracting markers. We collect such pairs of headlines and build the Russian social network news feed paraphrase corpus from them. We take a paraphrase detection model trained on the other existing Russian paraphrase corpus, ParaPhraser.ru, which was collected from official news headlines only, test it on the constructed dataset, and explore the dataset's linguistic and pragmatic features.
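
The pair selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `normalize` and `candidate_pairs`, the normalization rule, and the English sample headlines are all assumptions made for illustration. The sketch keeps only those (website headline, social network headline) pairs whose texts still differ after case and punctuation are ignored.

```python
import re


def normalize(headline: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    Hypothetical normalization: it only serves to ignore trivial differences.
    """
    text = re.sub(r"[^\w\s]", " ", headline.lower())
    return " ".join(text.split())


def candidate_pairs(pairs):
    """Keep (site_headline, social_headline) pairs whose normalized forms differ."""
    return [
        (site, social)
        for site, social in pairs
        if normalize(site) != normalize(social)
    ]


# Invented sample data, not taken from the corpus.
pairs = [
    ("Central bank keeps key rate unchanged",
     "Central bank keeps key rate unchanged!"),
    ("Central bank keeps key rate unchanged",
     "Guess what the central bank did with the key rate"),
]
selected = candidate_pairs(pairs)
```

Under these assumptions only the second pair survives, since the first differs from its source by punctuation alone. A real collection step would still need manual or model-based judgment of whether a differing pair is actually a paraphrase.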

Notes

  1. http://paraphraser.ru/about.

  2. There also exists another recently collected Turkish Paraphrase Corpus [7].

  3. http://paraphraser.ru/download.

  4. http://paraphraser.ru/contests/?contest_id=1.

  5. http://scikit-learn.org.

  6. A full list of the markers is not provided in this paper because some of them are either not present in the selected sample of paraphrases or are currently not of our main interest.

  7. There are, of course, other irony modifiers in the corpus, apart from irony itself (e.g., sarcasm), but they are beyond the scope of this paper.

References

  1. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W.: *SEM 2013 shared task: Semantic Textual Similarity. In: The Second Joint Conference on Lexical and Computational Semantics (2013)

  2. Braslavski, P., Ustalov, D., Mukhin, M.: A spinning wheel for YARN: user interface for a crowdsourced thesaurus. In: Proceedings of the demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 101–104. Gothenburg, Sweden (2014)

  3. Chen, D.L., Dolan, W.B.: Collecting Highly Parallel Data for Paraphrase Evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 190–200. Portland, Oregon, USA (2011)

  4. Demir, S., El-Kahlout, I.D., Unal, E., Kaya, H.: Turkish paraphrase corpus. In: LREC 2012, pp. 4081–4091 (2012)

  5. Dolan, W.B., Quirk, C., Brockett, C.: Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland (2004)

  6. Dzikovska, M.O., et al.: SemEval-2013 Task 7: the joint student response analysis and 8th recognizing textual entailment challenge. In: Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, 13–14 June 2013

  7. Eyecioglu, A., Keller, B.: Constructing a Turkish Corpus for Paraphrase Identification and Semantic Similarity. In: Gelbukh, A. (ed.) CICLing 2016. LNCS, vol. 9623, pp. 588–599. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75477-2_42

  8. Fernando, S., Stevenson, M.: A semantic similarity approach to paraphrase detection. In: Proceedings of Computational Linguistics UK (CLUK 2008) 11th Annual Research Colloquium (2008)

  9. Knight, K., Marcu, D.: Summarization beyond sentence extraction: a probabilistic approach to sentence compression. Artif. Intell. 139(1), 91–107 (2002)

  10. McCarthy, P.M., McNamara, D.S.: The user-language paraphrase corpus. In: Cross-Disciplinary Advances in Applied Natural Language Processing: Issues and Approaches, pp. 73–89 (2008)

  11. Pivovarova, L., Pronoza, E., Yagunova, E., Pronoza, A.: ParaPhraser: Russian Paraphrase Corpus and Shared Task. In: Filchenkov, A., Pivovarova, L., Žižka, J. (eds.) AINL 2017. CCIS, vol. 789, pp. 211–225. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-71746-3_18

  12. Pronoza, E., Yagunova, E., Pronoza, A.: Construction of a Russian Paraphrase Corpus: Unsupervised Paraphrase Extraction. In: Braslavski, P., Markov, I., Pardalos, P., Volkovich, Y., Ignatov, D.I., Koltsov, S., Koltsova, O. (eds.) RuSSIR 2015. CCIS, vol. 573, pp. 146–157. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41718-9_8

  13. Pronoza, E., Yagunova, E.: Low-Level Features for Paraphrase Identification. In: Sidorov, G., Galicia-Haro, S.N. (eds.) MICAI 2015. LNCS (LNAI), vol. 9413, pp. 59–71. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27060-9_5

  14. Pronoza, E., Yagunova, E.: Comparison of sentence similarity measures for Russian paraphrase identification. In: Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT), pp. 74–82 (2015)

  15. Pronoza, E., Yagunova, E., Kochetkova, N.: Sentence Paraphrase Graphs: Classification Based on Predictive Models or Annotators’ Decisions? In: Sidorov, G., Herrera-Alcántara, O. (eds.) MICAI 2016. LNCS (LNAI), vol. 10061, pp. 41–52. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62434-1_4

  16. Regneri, M., Wang, R., Pinkal, M.: Aligning predicate-argument structures for paraphrase fragment extraction. In: LREC 2014, pp. 4300–4307 (2014)

  17. Sidorov, G., Gelbukh, A., Gómez-Adorno, H., Pinto, D.: Soft similarity and soft cosine measure: similarity of features in vector space model. Computación y Sistemas 18(3), 491–504 (2014)

  18. Wubben, S., van den Bosch, A., Krahmer, E., Marsi, E.: Clustering and matching headlines for automatic paraphrase acquisition. In: Proceedings of the 12th European Workshop on Natural Language Generation, pp. 122–125, Athens, Greece (2009)

  19. Xu, W., Ritter, A., Grishman, R.: Gathering and generating paraphrases from Twitter with application to normalization. In: Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, pp. 121–128. Sofia, Bulgaria (2013)

  20. Tikhonov, A.: Slovoobrazovatelnij slovar’ russkogo yazika v dvuh tomah: Ok 145000 Slov. Russkiy Yazik, Moscow (1985)

Author information

Correspondence to Ekaterina Pronoza, Elena Yagunova or Anton Pronoza.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Pronoza, E., Yagunova, E., Pronoza, A. (2018). A New Corpus of the Russian Social Network News Feed Paraphrases: Corpus Construction and Linguistic Feature Analysis. In: Castro, F., Miranda-Jiménez, S., González-Mendoza, M. (eds) Advances in Computational Intelligence. MICAI 2017. Lecture Notes in Computer Science, vol 10633. Springer, Cham. https://doi.org/10.1007/978-3-030-02840-4_11

  • DOI: https://doi.org/10.1007/978-3-030-02840-4_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-02839-8

  • Online ISBN: 978-3-030-02840-4

  • eBook Packages: Computer Science, Computer Science (R0)
