Abstract
In this paper, we present a method for paraphrase extraction in Ancient Greek that can be applied to huge text corpora in interactive humanities applications. Since lexical databases and POS tagging are either unavailable or do not achieve sufficient accuracy for ancient languages, our approach is based purely on word embeddings and the word mover’s distance (WMD) [20]. We show how to adapt the WMD approach to paraphrase searching such that the expensive WMD computation has to be carried out for only a small fraction of the text segments contained in the corpus. Formally, the time complexity will be reduced from O(N·K3·log to
Funding statement: The project was funded by the Volkswagen Foundation within the framework of the Open – for the Extraordinary funding line from 2016 to 2019.
About the authors
Marcus Pöckelmann studied computer science at the Martin Luther University Halle-Wittenberg (Master of Science 2013) and has been a member of the research group Molitor/Ritter since 2013. In several interdisciplinary research projects, he develops web-based applications for the investigation of intertextuality together with colleagues from different disciplines of the humanities. These include the working environments LERA, for the analysis of complex text variants for scholarly editions, and Paraphrasis, for the retrieval and evaluation of paraphrased text passages in ancient Greek literature.
Janis Dähne is a student of computer science at the Martin Luther University Halle-Wittenberg. He was jointly responsible for the implementation of the approach presented here. In his master’s thesis, Janis Dähne deals with the alignment of text variants based on linear programs.
Jörg Ritter studied Computer Science at the University of Saarland (Diplom 1997) and received his doctorate in technical computer science with a thesis on a pipelined architecture for partitioned discrete wavelet transformation based lossy image compression using FPGAs. Since 1997, he has been a research associate in the group headed by Prof. Dr. Molitor at the Institute of Computer Science of Martin Luther University Halle-Wittenberg. For more than 10 years, Jörg Ritter has been working with colleagues from philology in interdisciplinary projects dealing with humanities issues, e.g., Digital Plato. Tradition and Reception (2016–2019), A New Supplement Dictionary of Sanskrit (2013–2016), Epistolary Networks. Visualizing multi-dimensional information structures in correspondence corpora (2013–2016), and Semi-automatic Difference Analysis of Complex Text Variants (2012–2015). Jörg Ritter has been and remains significantly involved in the establishment of a research focus “eHumanities” at the Institute of Computer Science of the Martin Luther University Halle-Wittenberg.
Paul Molitor studied Computer Science and Mathematics at the University of Saarland (Diplom 1982, Promotion 1986, Habilitation 1992). He was a member of the scientific staff of Prof. Dr. Gunter Hotz (1982–1994), where he led a project in the National Research Center 124 VLSI and Parallelism (1992–1994). In 1993, he was with the Humboldt University of Berlin as Associate Professor for Circuit Technology. Since 1994, he has been a Full Professor for Technical Computer Science at Martin Luther University Halle-Wittenberg. In addition to technical computer science, Paul Molitor’s interests lie in combinatorial optimization and computational humanities/eHumanities. Together with Jörg Ritter and colleagues from the humanities, he has led several interdisciplinary third-party funded projects in the field of Digital Humanities, in particular with colleagues from the fields of German studies, Romance studies, Jewish studies, Sanskrit studies, and ancient history / ancient Greek studies.
Acknowledgment
We would like to thank Prof. Dr. Charlotte Schubert, Professor for Ancient History at the University of Leipzig, her research fellows Dr. Roxana Kath and Dr. Michaela Rücker, and Dr. Eva Wöckener-Gade of the Institute of Classical Philology and Comparative Literature at the University of Leipzig for the great collaboration.
We thank the anonymous reviewers for their careful reading of our manuscript and their helpful comments and suggestions.
We especially thank the Volkswagen Foundation for supporting the Digital Plato project.
References
1. B. Agarwal, H. Ramampiaro, H. Langseth, and M. Ruocco. A Deep Network Model for Paraphrase Detection in Short Text Messages. In: Information Processing & Management, vol. 54, issue 6, pages 922–937, 2018. DOI: 10.1016/j.ipm.2018.06.005.
2. I. Androutsopoulos and P. Malakasiotis. A Survey of Paraphrasing and Textual Entailment Methods. In: Journal of Artificial Intelligence Research, Vol. 38, pages 135–187, 2010. DOI: 10.1613/jair.2985.
3. K. Atasu, T. Parnell, C. Dünner, M. Sifalakis, H. Pozidis, V. Vasileiadis, M. Vlachos, C. Berrospi, and A. Labbi. Linear-Complexity Relaxed Word Mover’s Distance with GPU Acceleration. In: IEEE International Conference on Big Data, Big Data 2017, Boston, MA, USA, pp. 889–896, December 11–14, 2017. DOI: 10.1109/BigData.2017.8258005.
4. Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In: Journal of Machine Learning Research, Vol. 3, pages 1137–1155, 2003. DOI: 10.1007/3-540-33486-6_6.
5. Y. Bizzoni, R. Del Gratta, F. Boschetti, and M. Reboul. Enhancing the Accuracy of Ancient Greek WordNet by Multilingual Distributional Semantics. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 1140–1147, 2014.
6. Y. Bizzoni, R. Del Gratta, H. Diakoff, M. Monachini, and G. Crane. The Making of Ancient Greek WordNet. In: Proceedings of the Second Italian Conference on Computational Linguistics (CLIC-IT 2015), pages 47–50, 2015. DOI: 10.4000/books.aaccademia.1312.
7. C. Blackwell and N. Smith. The Canonical Text Services protocol. Online: https://github.com/cite-architecture/cts_spec/blob/master/md/specification.md.
8. C. Blackwell and N. Smith. The CITE Architecture (CTS/CITE) for Analysis and Alignment. In: Special issue Digital Methods for Intertextuality Studies, P. Molitor and J. Ritter (Eds.), it – Information Technology, Volume 62, 2020. de Gruyter, Berlin/Boston. DOI: 10.1515/itit-2019-0044.
9. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. In: Journal of Machine Learning Research, Vol. 3, pages 993–1022, 2003.
10. T. Bremer, P. Molitor, M. Pöckelmann, J. Ritter, and S. Schütz. Zum Einsatz digitaler Methoden bei der Erstellung und Nutzung genetischer Editionen gedruckter Texte mit verschiedenen Fassungen – Das Fallbeispiel der Histoire philosophique des deux Indes von Guillaume Thomas Raynal (in German). In: Editio, R. v. Nutt-Kofoth, B. Plachta, and W. Woesler (Eds.), Volume 29, Issue 1, pages 29–51, Walter de Gruyter, Berlin/Boston, 2015. DOI: 10.1515/editio-2015-004.
11. G. Celano, G. Crane, and S. Majidi. Part of Speech Tagging for Ancient Greek. In: Open Linguistics, Vol. 2, pages 393–399, De Gruyter Open, 2016. DOI: 10.1515/opli-2016-0020.
12. N. Coffee, J.-P. Koenig, S. Poornima, R. Ossewaarde, C. Forstall, and S. Jacobson. Intertextuality in the Digital Age. In: Transactions of the American Philological Association, Vol. 142, pages 383–422, 2012. DOI: 10.1353/apa.2012.0010.
13. N. Coffee, W. J. Scheirer, and J. P. Koenig. Tesserae. Collaborative project of the University at Buffalo Department of Classics and Department of Linguistics, the Department of Computer Science and Engineering of the University of Notre Dame, and the Département des Sciences de l’Antiquité of the University of Geneva. Online: http://tesserae.caset.buffalo.edu/about.php.
14. R. H. Dekker and G. Middell. CollateX – Software for Collating Textual Sources. Online: https://collatex.net.
15. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. In: Journal of the American Society of Information Science, 41(6):391–407, 1990. DOI: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
16. E. Hamilton and H. Cairns (Eds.). The collected dialogues of Plato. Bollingen Series LXXI, Princeton University Press, 1961. DOI: 10.2307/j.ctt1c84fb0.
17. Z. Harris. Distributional Structure. In: J. Katz and J. Fodor (Eds.), The Philosophy of Linguistics, pages 33–49, Oxford University Press, 1964.
18. C. Ho, M. A. Murad, S. Doraisamy, and R. Abdul Kadir. Extracting lexical and phrasal paraphrases: a review of the literature. In: Artificial Intelligence Review, 42, pages 851–894, 2014. DOI: 10.1007/s10462-012-9357-8.
19. B. Korte and J. Vygen. Combinatorial optimization: Theory and Algorithms. Springer, 2000. DOI: 10.1007/978-3-662-21708-5.
20. M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger. From word embeddings to document distances. In: Proceedings of the 32nd International Conference on Machine Learning 2015, pp. 957–966, 06.–11.07.2015, Lille, France.
21. T. Mikolov, W. T. Yih, and G. Zweig. Linguistic Regularities in Continuous Space Word Representations. In: Proceedings of North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 746–751, Atlanta, Georgia, 9–14 June 2013.
22. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations, 02.–04.05.2013, Scottsdale, USA.
23. M. Mohamed and M. Oussalah. A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics. In: Language Resources & Evaluation, 2019. DOI: 10.1007/s10579-019-09466-4.
24. J. B. Orlin. A faster strongly polynomial minimum cost flow algorithm. In: Operations Research 41(2):338–350, March–April 1993. Operations Research Society of America. DOI: 10.21236/ADA457044.
25. M. Pantelia (Project Director). Thesaurus Linguae Graecae. A Digital Library of Greek Literature. Online: http://stephanus.tlg.uci.edu.
26. O. Pele and M. Werman. Fast and robust Earth Mover’s Distances. In: Proceedings of the 12th IEEE International Conference on Computer Vision, 29 September–2 October 2009, Kyoto, Japan, 2009. DOI: 10.1109/ICCV.2009.5459199.
27. M. Pöckelmann, J. Ritter, E. Wöckener-Gade, and Ch. Schubert. Paraphrasensuche mittels word2vec und der Word Mover’s Distance im Altgriechischen (in German). In: Digital Classics Online, DCO 3(3):24–36, 2017. DOI: 10.11588/dco.2017.0.40185.
28. M. Pöckelmann, J. Ritter, and P. Molitor. Word Mover’s Distance angewendet auf die Paraphrasenextraktion im Altgriechischen (in German). In: C. Schubert, P. Molitor, J. Ritter, J. Scharloth, and K. Sier (Eds.), Platon Digital: Tradition und Rezeption, pp. 45–60. Heidelberg: Propylaeum (Digital Classics Books, Band 3), 2019. DOI: 10.11588/propylaeum.451.
29. M. Regneri, R. Wang, and M. Pinkal. Aligning Predicate-Argument Structures for Paraphrase Fragment Extraction. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 4300–4307, 2014.
30. G. Reale. A History of Ancient Philosophy II: Plato and Aristotle. State University of New York Press, New York, November 1990.
31. S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In: Proceedings of the 3rd Text Retrieval Conference (TREC), p. 109, 02.–04.11.1994, Maryland, USA.
32. Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In: Proceedings of the 6th International Conference on Computer Vision, Bombay, 2–7 January 1998, pp. 59–66, 1998.
33. G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. In: Information Processing & Management, 24(5):513–523, 1988. DOI: 10.1016/0306-4573(88)90021-0.
34. J. Scharloth, F. Keilholz, S. Meier-Vieracker, X. Yu, and R. Doniok. Datengeleitete Kategorienbildung in den Digital Humanities: Paraphrasen aus korpus- und computerlinguistischer Perspektive (in German). In: C. Schubert, P. Molitor, J. Ritter, J. Scharloth, and K. Sier (Eds.), Platon Digital: Tradition und Rezeption, pp. 61–88. Heidelberg: Propylaeum (Digital Classics Books, Band 3), 2019. DOI: 10.11588/propylaeum.451.
35. C. Schubert, P. Molitor, J. Ritter, J. Scharloth, and K. Sier (Eds.). Platon Digital: Tradition und Rezeption (in German). Heidelberg: Propylaeum (Digital Classics Books, Band 3), 2019. DOI: 10.11588/propylaeum.451.
36. K. Sier, J. Scharloth, C. Schubert, J. Ritter, and P. Molitor. Digital Plato: Tradition and Reception. Project Application, Volkswagen Foundation, March 2015.
37. M. Spariosu. God of many names: Play, poetry, and power in Hellenic thought from Homer to Aristotle. Duke University Press, March 1991.
38. K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of Human Language Technology Conference and The Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 252–259, 2003. DOI: 10.3115/1073445.1073478.
39. M. Werman, S. Peleg, and A. Rosenfeld. A distance metric for multidimensional histograms. In: Computer Vision, Graphics and Image Processing, 32:328–336, December 1985. DOI: 10.1016/0734-189X(85)90055-6.
40. E. Wöckener-Gade, S. Jödicke, H. Ohst, E. Pulz, K. Protze, J. Rautenberg, F. Schellhardt, F. Schulze, and A. L. Visinoni. Variantensensible und formgenaue Stoppwortliste für das Altgriechische (in German). In: C. Schubert, P. Molitor, J. Ritter, J. Scharloth, and K. Sier (Eds.), Platon Digital: Tradition und Rezeption, pp. 325–341. Heidelberg: Propylaeum (Digital Classics Books, Band 3), 2019. DOI: 10.11588/propylaeum.451.
41. E. Wöckener-Gade, S. Jödicke, H. Ohst, E. Pulz, K. Protze, J. Rautenberg, F. Schellhardt, F. Schulze, and A. L. Visinoni. Ein Parallelkorpus von Paraphrasen auf Platon: Der ‘Goldstandard’ des Projekts Platon Digital (in German). In: C. Schubert, P. Molitor, J. Ritter, J. Scharloth, and K. Sier (Eds.), Platon Digital: Tradition und Rezeption, pp. 275–323. Heidelberg: Propylaeum (Digital Classics Books, Band 3), 2019. DOI: 10.11588/propylaeum.451.
42. E. Wöckener-Gade and M. Pöckelmann. Bridging the Gap between Plato and His Successors: Towards an Annotated Gold Standard of Intertextual References to Plato in Ancient Greek literature. EADH 2018: Data in Digital Humanities, Galway, 07.–09.12.2018. Online: https://eadh2018.exordo.com/programme/presentation/27.
43. Z. Harris. Distributional Structure. In: Word, 10(2/3):146–162, 1954. DOI: 10.1080/00437956.1954.11659520.
© 2020 Walter de Gruyter GmbH, Berlin/Boston