Abstract
Our main objectives are constructing a paraphrase corpus for Russian and developing paraphrase identification and classification models based on this corpus. The corpus consists of pairs of news headlines from different media agencies, which are extracted and analyzed in real time. Paraphrase candidates are extracted using an unsupervised matrix similarity metric: if the metric value satisfies a certain threshold, the corresponding pair of sentences is included in the corpus. These pairs of sentences are further annotated via crowdsourcing. We provide a user-friendly online interface for crowdsourced annotation, which is available at http://paraphraser.ru. There are 7480 annotated sentence pairs in the corpus at the moment, and more are being added. The types and the features of these sentence pairs are not disclosed to the annotators. We adopt a three-class classification of paraphrases and distinguish precise paraphrases (conveying the same meaning), loose paraphrases (conveying similar meaning) and non-paraphrases (conveying different meaning).
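The threshold-based candidate extraction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the matrix similarity metric, so a plain Dice coefficient over token sets is swapped in as a stand-in, and the function names, the threshold value of 0.5, the "at or above" reading of the threshold condition, and the English example headlines are all assumptions made for illustration.

```python
from itertools import combinations


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over lowercase token sets.

    Stand-in metric (assumption): the paper's matrix similarity metric
    is not specified in the abstract.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


def extract_candidates(headlines, threshold=0.5):
    """Return (headline_1, headline_2, score) for pairs scoring at or above threshold."""
    candidates = []
    for first, second in combinations(headlines, 2):
        score = dice_similarity(first, second)
        if score >= threshold:  # hypothetical threshold rule
            candidates.append((first, second, score))
    return candidates


# Made-up headlines, for illustration only.
headlines = [
    "central bank raises key rate",
    "central bank raises interest rate",
    "football team wins national cup",
]
pairs = extract_candidates(headlines, threshold=0.5)
```

In this sketch only the first two headlines form a candidate pair; in the workflow the abstract describes, such pairs would then go to crowdsourced annotation into the three classes.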
Notes
- 1.
- 2. The notion of presupposition is the most important notion that came into linguistics from logic ([36]).
References
Abusch, D.: Presupposition triggering from alternatives. J. Semant. 27(1), 37–80 (2010)
Achananuparp, P., Hu, X., Shen, X.: The evaluation of sentence similarity measures. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 305–316. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85836-2_29
Allwood, J.: Negation and the strength of presuppositions or there is more to speaking than words. In: Logical Grammar Reports 2 (1972). Revised version in Dahl, Ö. (ed.) Logic, Pragmatics, and Grammar, 1975, Department of Linguistics, University of Göteborg
Apresjan, V.: Semanticheskaya struktura slova i ego vzaimodeystviye s otritsaniem (In Russian). In: Computational Linguistics and Intellectual Technologies, Papers from the Annual International Conference “Dialogue”, Bekasovo, 26–30 May 2010, Issue 9(16), pp. 13–19. RGGU, Moscow (2010)
Bannard, C., Callison-Burch, C.: Paraphrasing with bilingual parallel corpora. In: Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604. ACL (2005)
Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 238–247. ACL (2014)
Boguslavskij, I.: Sfera deystviya leksicheskih edinits. Yazyki Russkoy Kultury, Moscow (1996). (In Russian)
Braslavski, P., Ustalov, D., Mukhin, M.: A spinning wheel for YARN: user interface for a crowdsourced thesaurus. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 101–104. Gothenburg, Sweden (2014)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Brockett, C., Dolan, B.: Support vector machines for paraphrase identification and corpus construction. In: Proceedings of the 3rd International Workshop on Paraphrasing, pp. 1–8 (2005)
Burrows, S., Potthast, M., Stein, B.: Paraphrase acquisition via crowdsourcing and machine learning. ACM Trans. Intell. Syst. Technol. 4(3), 43 (2013)
Callison-Burch, C.: Paraphrasing and Translation. Institute for Communicating and Collaborative Systems, School of Informatics, University of Edinburgh (2007)
Chitra, A., Kumar, S.: Paraphrase identification using machine learning techniques. In: Proceedings of the 12th International Conference on Networking, VLSI and Signal Processing, World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, pp. 245–249 (2010)
Dahl, Ö.: In defense of a Strawsonian approach to presupposition. In: Klein, W., Levelt, W. (eds.) Synthese Language Library. Crossing the Boundaries in Linguistics. Studies Presented to Manfred Bierwisch, vol. 13, pp. 191–200. D. Reidel Pub. Co. (1981)
Dahl, Ö.: Topic-comment structure revisited. In: Dahl, Ö. (ed.) Topic and Comment, Contextual Boundness and Focus, pp. 1–24. Helmut Buske, Hamburg (1974)
Das, D., Smith, N.A.: Paraphrase identification as probabilistic quasi-synchronous recognition. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and of the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pp. 468–476. ACL (2009)
Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
Dolan, W.B., Quirk, C., Brockett, C.: Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland (2004)
Eyecioglu, A., Keller, B.: ASOBEK: Twitter paraphrase identification with simple overlap features and SVMs. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 64–69 (2015)
Fernando, S., Stevenson, M.: A semantic similarity approach to paraphrase detection. In: Computational Linguistics UK (CLUK 2008) 11th Annual Research Colloquium, pp. 45–52 (2008)
Friedman, J.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
Ganitkevitch, J., Callison-Burch, C.: The multilingual paraphrase database. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), European Language Resources Association (ELRA), pp. 4276–4283. Reykjavik, Iceland (2014)
Heim, I.: Presupposition projection and the semantics of attitude verbs. J. Semant. 9(3), 183–221 (1992)
Heim, I.: Presupposition projection. In: van der Sandt, R. (ed.) Reader for the Nijmegen Workshop on Discourse Processes, University of Nijmegen (1990). http://semanticsarchive.net/Archive/GFiMGNjN/Presupp%20projection%2090.pdf
Ji, Y., Eisenstein, J.: Discriminative improvements to distributional sentence similarity. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 891–896. Seattle (2013)
Karttunen, L.: Presuppositions and linguistic context. In: Theoretical Linguistics, vol. 1, pp. 181–194. Mouton De Gruyter, Berlin (1974)
Kempson, R.: Presupposition and the Delimitation of Semantics. University Press, Cambridge (1975)
Knight, K., Marcu, D.: Summarization beyond sentence extraction: a probabilistic approach to sentence compression. Artif. Intell. 139(1), 91–107 (2002)
Kozareva, Z., Montoyo, A.: Paraphrase identification on the basis of supervised machine learning techniques. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds.) FinTAL 2006. LNCS (LNAI), vol. 4139, pp. 524–533. Springer, Heidelberg (2006). https://doi.org/10.1007/11816508_52
McClendon, J.L., Mack, N.A., Hodges, L.F.: The use of paraphrase identification in the retrieval of appropriate responses for script based conversational agents. In: Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference, pp. 196–201. The AAAI Press, Menlo Park (2014)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space (2013). http://arxiv.org/abs/1301.3781/
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119. Curran Associates Inc. (2013)
Miller, G., Fellbaum, C.: Wordnet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
Python Data Analysis Library (pandas). http://pandas.pydata.org
Paducheva, E.V.: Egotsentricheskie valentnosti i dekonstruktsiya govoryashego. In: Voprosy yazykoznaniya 2011, no. 3, pp. 3–18 (2011). http://lexicograph.ruslang.ru/TextPdf1/egocentricals.pdf. (In Russian)
Paducheva, E.: Presuppositions and Semantic Typology of Projective Meanings. http://lexicograph.ruslang.ru/TextPdf1/PROJECTION-POSTER.pdf
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011). http://scikit-learn.org
Pronoza, E., Yagunova, E., Pronoza, A.: Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction. In: Proceedings of the 9th Summer School in Information Retrieval and Young Scientist Conference (2015, in press)
Pronoza, E., Yagunova, E.: Comparison of sentence similarity measures for Russian paraphrase identification. In: Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT), pp. 74–82. IEEE, Piscataway (2015)
Rajkumar, A., Chitra, A.: Paraphrase recognition using neural network classification. Int. J. Comput. Appl. 1(29), 42–47 (2010)
Roberts, C., Simons, M., Beaver, D., Tonhauser, J.: Presupposition, conventional implicature, and beyond: a unified account of projection. In: Proceedings of the Workshop on Presupposition, ESSLLI 2009, Bordeaux, Universite de Bordeaux, Bordeaux (2009)
Rus, V., McCarthy, P.M., Lintean, M.C.: Paraphrase identification with lexico-syntactic graph subsumption. In: Proceedings of the Twenty-First International FLAIRS Conference, pp. 201–206. The AAAI Press, Menlo Park (2008)
Rus, V., Banjade, R., Lintean, M.: On paraphrase identification corpora. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pp. 2422–2429. European Language Resources Association (ELRA), Reykjavik (2014)
Russell, B.: Mr. Strawson on referring. Mind 66, 385–389 (1957)
Schmid, H.: Improvements in part-of-speech tagging with an application to German. In: Proceedings of the ACL SIGDAT-Workshop, vol. 21, pp. 1–9. ACL (1995)
Sidorov, G.: Non-linear Construction of N-grams in Computational Linguistics: Syntactic, Filtered, and Generalized N-grams. Sociedad Mexicana de Inteligencia Artificial, Mexico (2013). (In Spanish)
Sidorov, G., Gelbukh, A., Gómez-Adorno, H., Pinto, D.: Soft similarity and soft cosine measure: similarity of features in vector space model. Computación y Sistemas 18(3), 491–504 (2014)
Sidorov, G., Gómez-Adorno, H., Markov, I., Pinto, D., Loya, N.: Computing Text Similarity using Tree Edit Distance, NAFIPS 2015 (2015, accepted)
Socher, R., Huang, E.H., Pennington, J., Ng, A.Y., Manning, C.D.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: Proceedings of the Conference on Neural Information Processing Systems, vol. 24, pp. 801–809. MIT Press, Cambridge (2011)
Stalnaker, R.: Presuppositions. J. Philos. Logic 2, 447–457 (1973)
Stalnaker, R.: Common ground. Linguist. Philos. 25(5–6), 701–721 (2002)
Wan, S., Dras, M., Dale, R., Paris, C.: Using dependency-based features to take the “Para-farce” out of paraphrase. In: Proceedings of the Australasian Language Technology Workshop, pp. 131–138 (2006)
Wilson, D.: Presupposition and Non-Truth-Conditional Semantics. Academic Press, London (1975)
Yin, W., Schütze, H.: Discriminative phrase embedding for paraphrase identification. In: Proceedings of Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, Denver, Colorado, 31 May–5 June, pp. 1368–1373. ACL (2015)
Zhang, Y., Patrick, J.: Paraphrase identification by text canonicalization. In: Proceedings of the Australasian Language Technology Workshop, pp. 160–166 (2005)
Acknowledgments
We would like to thank Lilia Volkova for her invaluable help. The authors also acknowledge Saint-Petersburg State University for the research grant 30.38.305.2014.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Pronoza, E., Yagunova, E. (2018). A New Russian Paraphrase Corpus. Paraphrase Identification and Classification Based on Different Prediction Models. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol. 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_41
DOI: https://doi.org/10.1007/978-3-319-75477-2_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75476-5
Online ISBN: 978-3-319-75477-2
eBook Packages: Computer Science, Computer Science (R0)