A New Russian Paraphrase Corpus. Paraphrase Identification and Classification Based on Different Prediction Models

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9623)

Abstract

Our main objectives are to construct a paraphrase corpus for Russian and to develop paraphrase identification and classification models based on this corpus. The corpus consists of pairs of news headlines from different media agencies, which are extracted and analyzed in real time. Paraphrase candidates are extracted using an unsupervised matrix similarity metric: if the metric value satisfies a certain threshold, the corresponding pair of sentences is included in the corpus. These sentence pairs are then annotated via crowdsourcing. We provide a user-friendly online interface for crowdsourced annotation, available at http://paraphraser.ru. The corpus currently contains 7480 annotated sentence pairs, and more are being added. The types and features of these sentence pairs are not disclosed to the annotators. We adopt a three-class classification of paraphrases and distinguish precise paraphrases (conveying the same meaning), loose paraphrases (conveying similar meaning) and non-paraphrases (conveying different meaning).
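
The threshold-based candidate extraction described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the matrix similarity metric or the threshold value, so this sketch substitutes a plain Dice coefficient over token sets as a stand-in and assumes that "satisfies a threshold" means "scores at or above it". The names dice_similarity and extract_candidates, and the default threshold of 0.5, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over lowercase token sets.

    Stand-in metric only: the paper's matrix similarity metric
    is not specified in the abstract.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


def extract_candidates(headlines, threshold=0.5, similarity=dice_similarity):
    """Return (headline_a, headline_b, score) for every pair of headlines
    whose similarity is at or above the (assumed) threshold."""
    candidates = []
    for a, b in combinations(headlines, 2):
        score = similarity(a, b)
        if score >= threshold:
            candidates.append((a, b, score))
    return candidates


if __name__ == "__main__":
    sample = [
        "president signs new budget law",
        "president signs budget law today",
        "storm hits northern coast",
    ]
    for a, b, score in extract_candidates(sample):
        print(f"{score:.2f}\t{a}\t{b}")
```

Passing the metric in as a parameter keeps the filtering step independent of the similarity function, so a different metric could be swapped in without changing the extraction loop. Pairs that pass the filter would then go on to manual annotation.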

Notes

  1. https://www.yworks.com/products/yed

  2. The notion of presupposition is the most important notion that came into linguistics from logic ([36]).

References

  1. Abusch, D.: Presupposition triggering from alternatives. J. Semant. 27(1), 37–80 (2010)

  2. Achananuparp, P., Hu, X., Shen, X.: The evaluation of sentence similarity measures. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 305–316. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85836-2_29

  3. Allwood, J.: Negation and the strength of presuppositions or there is more to speaking than words. In: Logical Grammar Reports 2 (1972). Revised version in Dahl, Ö. (ed.) Logic, Pragmatics, and Grammar, 1975, Department of Linguistics, University of Göteborg

  4. Apresjan, V.: Semanticheskaya struktura slova i ego vzaimodeystviye s otritsaniem (In Russian). In: Computational Linguistics and Intellectual Technologies, Papers from the Annual International Conference “Dialogue”, Bekasovo, 26–30 May 2010, Issue 9(16), pp. 13–19. RGGU, Moscow (2010)

  5. Bannard, C., Callison-Burch, C.: Paraphrasing with bilingual parallel corpora. In: Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604. ACL (2005)

  6. Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 238–247. ACL (2014)

  7. Boguslavskij, I.: Sfera deystviya leksicheskih edinits. Yazyki Russkoy Kultury, Moscow (1996). (In Russian)

  8. Braslavski, P., Ustalov, D., Mukhin, M.: A spinning wheel for YARN: user interface for a crowdsourced thesaurus. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 101–104. Gothenburg, Sweden (2014)

  9. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

  10. Brockett, C., Dolan, B.: Support vector machines for paraphrase identification and corpus construction. In: Proceedings of the 3rd International Workshop on Paraphrasing, pp. 1–8 (2005)

  11. Burrows, S., Potthast, M., Stein, B.: Paraphrase acquisition via crowdsourcing and machine learning. ACM Trans. Intell. Syst. Technol. 4(3), 43 (2013)

  12. Callison-Burch, C.: Paraphrasing and Translation. Institute for Communicating and Collaborative Systems, School of Informatics, University of Edinburgh (2007)

  13. Chitra, A., Kumar, S.: Paraphrase identification using machine learning techniques. In: Proceedings of the 12th International Conference on Networking, VLSI and Signal Processing, World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, pp. 245–249 (2010)

  14. Dahl, Ö.: In defense of a Strawsonian approach to presupposition. In: Klein, W., Levelt, W. (eds.) Synthese Language Library. Crossing the Boundaries in Linguistics. Studies Presented to Manfred Bierwisch, vol. 13, pp. 191–200. D. Reidel Pub. Co. (1981)

  15. Dahl, Ö.: Topic-comment structure revisited. In: Dahl, Ö. (ed.) Topic and Comment, Contextual Boundness and Focus, pp. 1–24. Helmut Buske, Hamburg (1974)

  16. Das, D., Smith, N.A.: Paraphrase identification as probabilistic quasi-synchronous recognition. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and of the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pp. 468–476. ACL (2009)

  17. Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)

  18. Dolan, W.B., Quirk, C., Brockett, C.: Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland (2004)

  19. Eyecioglu, A., Keller, B.: ASOBEK: Twitter paraphrase identification with simple overlap features and SVMs. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 64–69 (2015)

  20. Fernando, S., Stevenson, M.: A semantic similarity approach to paraphrase detection. In: Computational Linguistics UK (CLUK 2008) 11th Annual Research Colloquium, pp. 45–52 (2008)

  21. Friedman, J.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)

  22. Ganitkevitch, J., Callison-Burch, C.: The multilingual paraphrase database. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), European Language Resources Association (ELRA), pp. 4276–4283. Reykjavik, Iceland (2014)

  23. Heim, I.: Presupposition projection and the semantics of attitude verbs. J. Semant. 9(3), 183–221 (1992)

  24. Heim, I.: Presupposition projection. In: van der Sandt, R. (ed.) Reader for the Nijmegen Workshop on Discourse Processes, University of Nijmegen (1990). http://semanticsarchive.net/Archive/GFiMGNjN/Presupp%20projection%2090.pdf

  25. Ji, Y., Eisenstein, J.: Discriminative improvements to distributional sentence similarity. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 891–896. Seattle (2013)

  26. Karttunen, L.: Presuppositions and linguistic context. In: Theoretical Linguistics, vol. 1, pp. 181–194. Mouton De Gruyter, Berlin (1974)

  27. Kempson, R.: Presupposition and the Delimitation of Semantics. University Press, Cambridge (1975)

  28. Knight, K., Marcu, D.: Summarization beyond sentence extraction: a probabilistic approach to sentence compression. Artif. Intell. 139(1), 91–107 (2002)

  29. Kozareva, Z., Montoyo, A.: Paraphrase identification on the basis of supervised machine learning techniques. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds.) FinTAL 2006. LNCS (LNAI), vol. 4139, pp. 524–533. Springer, Heidelberg (2006). https://doi.org/10.1007/11816508_52

  30. McClendon, J.L., Mack, N.A., Hodges, L.F.: The use of paraphrase identification in the retrieval of appropriate responses for script based conversational agents. In: Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference, pp. 196–201. The AAAI Press, Menlo Park (2014)

  31. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space (2013). http://arxiv.org/abs/1301.3781/

  32. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119. Curran Associates Inc. (2013)

  33. Miller, G., Fellbaum, C.: Wordnet: An Electronic Lexical Database. MIT Press, Cambridge (1998)

  34. Python Data Analysis Library (pandas). http://pandas.pydata.org

  35. Paducheva, E.V.: Egotsentricheskie valentnosti i dekonstruktsiya govoryashego. In: Voprosy yazykoznaniya 2011, no. 3, pp. 3–18 (2011). http://lexicograph.ruslang.ru/TextPdf1/egocentricals.pdf. (In Russian)

  36. Paducheva, E.: Presuppositions and Semantic Typology of Projective Meanings. http://lexicograph.ruslang.ru/TextPdf1/PROJECTION-POSTER.pdf

  37. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011). http://scikit-learn.org

  38. Pronoza, E., Yagunova, E., Pronoza, A.: Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction. In: Proceedings of the 9th Summer School in Information Retrieval and Young Scientist Conference (2015, in press)

  39. Pronoza, E., Yagunova, E.: Comparison of sentence similarity measures for Russian paraphrase identification. In: Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT), pp. 74–82. IEEE, Piscataway (2015)

  40. Rajkumar, A., Chitra, A.: Paraphrase recognition using neural network classification. Int. J. Comput. Appl. 1(29), 42–47 (2010)

  41. Roberts, C., Simons, M., Beaver, D., Tonhauser, J.: Presupposition, conventional implicature, and beyond: a unified account of projection. In: Proceedings of the Workshop on Presupposition, ESSLLI 2009, Bordeaux, Universite de Bordeaux, Bordeaux (2009)

  42. Rus, V., McCarthy, P.M., Lintean, M.C.: Paraphrase identification with lexico-syntactic graph subsumption. In: Proceedings of the Twenty-First International FLAIRS Conference, pp. 201–206. The AAAI Press, Menlo Park (2008)

  43. Rus, V., Banjade, R., Lintean, M.: On paraphrase identification corpora. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pp. 2422–2429. European Language Resources Association (ELRA), Reykjavik (2014)

  44. Russell, B.: Mr. Strawson on referring. Mind 66, 385–389 (1957)

  45. Schmid, H.: Improvements in part-of-speech tagging with an application to German. In: Proceedings of the ACL SIGDAT-Workshop, vol. 21, pp. 1–9. ACL (1995)

  46. Sidorov, G.: Non-linear Construction of N-grams in Computational Linguistics: Syntactic, Filtered, and Generalized N-grams. Sociedad Mexicana de Inteligencia Artificial, Mexico (2013). (In Spanish)

  47. Sidorov, G., Gelbukh, A., Gómez-Adorno, H., Pinto, D.: Soft similarity and soft cosine measure: similarity of features in vector space model. Computación y Sistemas 18(3), 491–504 (2014)

  48. Sidorov, G., Gómez-Adorno, H., Markov, I., Pinto, D., Loya, N.: Computing Text Similarity using Tree Edit Distance, NAFIPS 2015 (2015, accepted)

  49. Socher, R., Huang, E.H., Pennington, J., Ng, A.Y., Manning, C.D.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: Proceedings of the Conference on Neural Information Processing Systems, vol. 24, pp. 801–809. MIT Press, Cambridge (2011)

  50. Stalnaker, R.: Presuppositions. J. Philos. Logic 2, 447–457 (1973)

  51. Stalnaker, R.: Common ground. Linguist. Philos. 25(5–6), 701–721 (2002)

  52. Wan, S., Dras, M., Dale, R., Paris, C.: Using dependency-based features to take the “Para-farce” out of paraphrase. In: Proceedings of the Australasian Language Technology Workshop, pp. 131–138 (2006)

  53. Wilson, D.: Presupposition and Non-Truth-Conditional Semantics. Academic Press, London (1975)

  54. Yin, W., Schütze, H.: Discriminative phrase embedding for paraphrase identification. In: Proceedings of Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, Denver, Colorado, 31 May–5 June, pp. 1368–1373. ACL (2015)

  55. Zhang, Y., Patrick, J.: Paraphrase identification by text canonicalization. In: Proceedings of the Australasian Language Technology Workshop, pp. 160–166 (2005)

Acknowledgments

We would like to thank Lilia Volkova for her invaluable help. The authors also acknowledge Saint-Petersburg State University for the research grant 30.38.305.2014.

Author information

Correspondence to Ekaterina Pronoza or Elena Yagunova.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Pronoza, E., Yagunova, E. (2018). A New Russian Paraphrase Corpus. Paraphrase Identification and Classification Based on Different Prediction Models. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_41

  • DOI: https://doi.org/10.1007/978-3-319-75477-2_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75476-5

  • Online ISBN: 978-3-319-75477-2

  • eBook Packages: Computer Science, Computer Science (R0)
