A New Russian Paraphrase Corpus. Paraphrase Identification and Classification Based on Different Prediction Models

Pronoza, Ekaterina; Yagunova, Elena

doi:10.1007/978-3-319-75477-2_41

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9623)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Our main objectives are to construct a paraphrase corpus for Russian and to develop paraphrase identification and classification models based on this corpus. The corpus consists of pairs of news headlines from different media agencies, which are extracted and analyzed in real time. Paraphrase candidates are extracted using an unsupervised matrix similarity metric: if the metric value satisfies a certain threshold, the corresponding pair of sentences is included in the corpus. These sentence pairs are then annotated via crowdsourcing. We provide a user-friendly online interface for crowdsourced annotation, available at http://paraphraser.ru. The corpus currently contains 7480 annotated sentence pairs, and more are being added. The types and features of these sentence pairs are not disclosed to the annotators. We adopt a three-class classification of paraphrases and distinguish precise paraphrases (conveying the same meaning), loose paraphrases (conveying similar meaning), and non-paraphrases (conveying different meaning).
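
The threshold-based candidate extraction described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the matrix similarity metric or the threshold value, so this sketch substitutes a plain Jaccard word-overlap score and an arbitrary cutoff of 0.5; the names `jaccard`, `extract_candidates`, and `ParaphraseClass`, the use of `>=` for "satisfies the threshold", and the sample headlines are all hypothetical and are not taken from the paper.

```python
from enum import Enum
from itertools import combinations


class ParaphraseClass(Enum):
    """The three annotation classes named in the abstract."""
    PRECISE = "precise paraphrase"      # same meaning
    LOOSE = "loose paraphrase"          # similar meaning
    NON_PARAPHRASE = "non-paraphrase"   # different meaning


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1].

    Assumption: a simple stand-in for the paper's matrix similarity metric.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def extract_candidates(headlines, threshold=0.5, similarity=jaccard):
    """Return (score, headline_1, headline_2) for pairs scoring >= threshold.

    `headlines` is a list of (agency, text) tuples. Only pairs coming from
    different agencies are compared, mirroring the corpus description.
    """
    candidates = []
    for (agency_1, text_1), (agency_2, text_2) in combinations(headlines, 2):
        if agency_1 == agency_2:
            continue  # the corpus pairs headlines from different agencies
        score = similarity(text_1, text_2)
        if score >= threshold:
            candidates.append((score, text_1, text_2))
    # Highest-scoring candidates first.
    return sorted(candidates, key=lambda item: item[0], reverse=True)


# Hypothetical sample data, for illustration only.
sample = [
    ("agency_a", "Central bank raises key interest rate"),
    ("agency_b", "Central bank raises its key interest rate"),
    ("agency_b", "Local team wins national championship"),
]
candidates = extract_candidates(sample)
```

In this sketch, only the first two headlines form a candidate pair. Candidates selected this way would then go to annotators, who assign one of the three `ParaphraseClass` labels.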

Notes

1. https://www.yworks.com/products/yed
2. The notion of presupposition is the most important notion that came into linguistics from logic [36].

References

Abusch, D.: Presupposition triggering from alternatives. J. Semant. 27(1), 37–80 (2010)
Achananuparp, P., Hu, X., Shen, X.: The evaluation of sentence similarity measures. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 305–316. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85836-2_29
Allwood, J.: Negation and the strength of presuppositions or there is more to speaking than words. In: Logical Grammar Reports 2 (1972). Revised version in Dahl, Ö. (ed.) Logic, Pragmatics, and Grammar, 1975, Department of Linguistics, University of Göteborg
Apresjan, V.: Semanticheskaya struktura slova i ego vzaimodeystviye s otritsaniem (In Russian). In: Computational Linguistics and Intellectual Technologies, Papers from the Annual International Conference “Dialogue”, Bekasovo, 26–30 May 2010, Issue 9(16), pp. 13–19. RGGU, Moscow (2010)
Bannard, C., Callison-Burch, C.: Paraphrasing with bilingual parallel corpora. In: Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604. ACL (2005)
Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 238–247. ACL (2014)
Boguslavskij, I.: Sfera deystviya leksicheskih edinits. Yazyki Russkoy Kultury, Moscow (1996). (In Russian)
Braslavski, P., Ustalov, D., Mukhin, M.: A spinning wheel for YARN: user interface for a crowdsourced thesaurus. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 101–104. Gothenburg, Sweden (2014)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Brockett, C., Dolan, B.: Support vector machines for paraphrase identification and corpus construction. In: Proceedings of the 3rd International Workshop on Paraphrasing, pp. 1–8 (2005)
Burrows, S., Potthast, M., Stein, B.: Paraphrase acquisition via crowdsourcing and machine learning. ACM Trans. Intell. Syst. Technol. 4(3), 43 (2013)
Callison-Burch, C.: Paraphrasing and Translation. Institute for Communicating and Collaborative Systems, School of Informatics, University of Edinburgh (2007)
Chitra, A., Kumar, S.: Paraphrase identification using machine learning techniques. In: Proceedings of the 12th International Conference on Networking, VLSI and Signal Processing, World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, pp. 245–249 (2010)
Dahl, Ö.: In defense of a Strawsonian approach to presupposition. In: Klein, W., Levelt, W. (eds.) Synthese Language Library. Crossing the Boundaries in Linguistics. Studies Presented to Manfred Bierwisch, vol. 13, pp. 191–200. D. Reidel Pub. Co. (1981)
Dahl, Ö.: Topic-comment structure revisited. In: Dahl, Ö. (ed.) Topic and Comment, Contextual Boundness and Focus, pp. 1–24. Helmut Buske, Hamburg (1974)
Das, D., Smith, N.A.: Paraphrase identification as probabilistic quasi-synchronous recognition. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and of the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pp. 468–476. ACL (2009)
Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
Dolan, W.B., Quirk, C., Brockett, C.: Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland (2004)
Eyecioglu, A., Keller, B.: ASOBEK: Twitter paraphrase identification with simple overlap features and SVMs. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 64–69 (2015)
Fernando, S., Stevenson, M.: A semantic similarity approach to paraphrase detection. In: Computational Linguistics UK (CLUK 2008) 11th Annual Research Colloquium, pp. 45–52 (2008)
Friedman, J.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
Ganitkevitch, J., Callison-Burch, C.: The multilingual paraphrase database. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), European Language Resources Association (ELRA), pp. 4276–4283. Reykjavik, Iceland (2014)
Heim, I.: Presupposition projection and the semantics of attitude verbs. J. Semant. 9(3), 183–221 (1992)
Heim, I.: Presupposition projection. In: van der Sandt, R. (ed.) Reader for the Nijmegen Workshop on Discourse Processes, University of Nijmegen (1990). http://semanticsarchive.net/Archive/GFiMGNjN/Presupp%20projection%2090.pdf
Ji, Y., Eisenstein, J.: Discriminative improvements to distributional sentence similarity. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 891–896. Seattle (2013)
Karttunen, L.: Presuppositions and linguistic context. In: Theoretical Linguistics, vol. 1, pp. 181–194. Mouton De Gruyter, Berlin (1974)
Kempson, R.: Presupposition and the Delimitation of Semantics. University Press, Cambridge (1975)
Knight, K., Marcu, D.: Summarization beyond sentence extraction: a probabilistic approach to sentence compression. Artif. Intell. 139(1), 91–107 (2002)
Kozareva, Z., Montoyo, A.: Paraphrase identification on the basis of supervised machine learning techniques. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds.) FinTAL 2006. LNCS (LNAI), vol. 4139, pp. 524–533. Springer, Heidelberg (2006). https://doi.org/10.1007/11816508_52
McClendon, J.L., Mack, N.A., Hodges, L.F.: The use of paraphrase identification in the retrieval of appropriate responses for script based conversational agents. In: Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference, pp. 196–201. The AAAI Press, Menlo Park (2014)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space (2013). http://arxiv.org/abs/1301.3781/
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119. Curran Associates Inc. (2013)
Miller, G., Fellbaum, C.: Wordnet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
Python Data Analysis Library (pandas). http://pandas.pydata.org
Paducheva, E.V.: Egotsentricheskie valentnosti i dekonstruktsiya govoryashego. In: Voprosy yazykoznaniya 2011, no. 3, pp. 3–18 (2011). http://lexicograph.ruslang.ru/TextPdf1/egocentricals.pdf. (In Russian)
Paducheva, E.: Presuppositions and Semantic Typology of Projective Meanings. http://lexicograph.ruslang.ru/TextPdf1/PROJECTION-POSTER.pdf
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011). http://scikit-learn.org
Pronoza, E., Yagunova, E., Pronoza, A.: Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction. In: Proceedings of the 9th Summer School in Information Retrieval and Young Scientist Conference (2015, in press)
Pronoza, E., Yagunova, E.: Comparison of sentence similarity measures for Russian paraphrase identification. In: Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT), pp. 74–82. IEEE, Piscataway (2015)
Rajkumar, A., Chitra, A.: Paraphrase recognition using neural network classification. Int. J. Comput. Appl. 1(29), 42–47 (2010)
Roberts, C., Simons, M., Beaver, D., Tonhauser, J.: Presupposition, conventional implicature, and beyond: a unified account of projection. In: Proceedings of the Workshop on Presupposition, ESSLLI 2009, Bordeaux, Universite de Bordeaux, Bordeaux (2009)
Rus, V., McCarthy, P.M., Lintean, M.C.: Paraphrase identification with lexico-syntactic graph subsumption. In: Proceedings of the Twenty-First International FLAIRS Conference, pp. 201–206. The AAAI Press, Menlo Park (2008)
Rus, V., Banjade, R., Lintean, M.: On paraphrase identification corpora. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pp. 2422–2429. European Language Resources Association (ELRA), Reykjavik (2014)
Russell, B.: Mr. Strawson on referring. Mind 66, 385–389 (1957)
Schmid, H.: Improvements in part-of-speech tagging with an application to German. In: Proceedings of the ACL SIGDAT-Workshop, vol. 21, pp. 1–9. ACL (1995)
Sidorov, G.: Non-linear Construction of N-grams in Computational Linguistics: Syntactic, Filtered, and Generalized N-grams. Sociedad Mexicana de Inteligencia Artificial, Mexico (2013). (In Spanish)
Sidorov, G., Gelbukh, A., Gómez-Adorno, H., Pinto, D.: Soft similarity and soft cosine measure: similarity of features in vector space model. Computación y Sistemas 18(3), 491–504 (2014)
Sidorov, G., Gómez-Adorno, H., Markov, I., Pinto, D., Loya, N.: Computing Text Similarity using Tree Edit Distance, NAFIPS 2015 (2015, accepted)
Socher, R., Huang, E.H., Pennington, J., Ng, A.Y., Manning, C.D.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: Proceedings of the Conference on Neural Information Processing Systems, vol. 24, pp. 801–809. MIT Press, Cambridge (2011)
Stalnaker, R.: Presuppositions. J. Philos. Logic 2, 447–457 (1973)
Stalnaker, R.: Common ground. Linguist. Philos. 25(5–6), 701–721 (2002)
Wan, S., Dras, M., Dale, R., Paris, C.: Using dependency-based features to take the “Para-farce” out of paraphrase. In: Proceedings of the Australasian Language Technology Workshop, pp. 131–138 (2006)
Wilson, D.: Presupposition and Non-Truth-Conditional Semantics. Academic Press, London (1975)
Yin, W., Schutze, H.: Discriminative phrase embedding for paraphrase identification. In: Proceedings of Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, Denver, Colorado, 31 May–5 June, pp. 1368–1373. ACL (2015)
Zhang, Y., Patrick, J.: Paraphrase identification by text canonicalization. In: Proceedings of the Australasian Language Technology Workshop, pp. 160–166 (2005)

Acknowledgments

We would like to thank Lilia Volkova for her invaluable help. The authors also acknowledge Saint-Petersburg State University for the research grant 30.38.305.2014.

Author information

Authors and Affiliations

Saint Petersburg State University, 7/9 Universitetskaya nab., St. Petersburg, 199034, Russia
Ekaterina Pronoza & Elena Yagunova

Corresponding authors

Correspondence to Ekaterina Pronoza or Elena Yagunova.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Pronoza, E., Yagunova, E. (2018). A New Russian Paraphrase Corpus. Paraphrase Identification and Classification Based on Different Prediction Models. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_41

DOI: https://doi.org/10.1007/978-3-319-75477-2_41
Published: 21 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75476-5
Online ISBN: 978-3-319-75477-2
eBook Packages: Computer Science; Computer Science (R0)
