Fast paraphrase extraction in Ancient Greek literature

Marcus Pöckelmann; Janis Dähne; Jörg Ritter; Paul Molitor

doi:10.1515/itit-2019-0042

Published by De Gruyter Oldenbourg March 6, 2020


From the journal it - Information Technology



Abstract

In this paper, we present a method for paraphrase extraction in Ancient Greek that can be applied to huge text corpora in interactive humanities applications. Since lexical databases and POS tagging are either unavailable or do not achieve sufficient accuracy for ancient languages, our approach is based purely on word embeddings and the word mover’s distance (WMD) [20]. We show how to adapt the WMD approach to paraphrase searching such that the expensive WMD computation has to be carried out for only a small fraction of the text segments contained in the corpus. Formally, the time complexity is reduced from $O(N \cdot K^3 \cdot \log K)$ to $O(N + K^3 \cdot \log K)$ compared to the brute-force approach, which computes the WMD between each text segment of the corpus and the search query. Here, N is the length of the corpus and K the size of its vocabulary. The method, which searches not only for paraphrases of the same length as the search query but also for paraphrases of varying lengths, was evaluated on the Thesaurus Linguae Graecae® (TLG®) [25]. The TLG consists of about $75 \cdot 10^6$ Greek words. We searched the whole TLG for paraphrases of given passages of Plato. The experimental results show that our method and the brute-force approach, with only very few exceptions, propose the same text passages in the TLG as possible paraphrases. The computation times of our method are in a range that allows its application in interactive systems and lets humanities scholars work productively and smoothly.
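The abstract does not say how the candidate segments are narrowed down before the exact WMD is computed. Purely as an illustration of the general filter-then-verify idea, the following sketch uses the relaxed WMD lower bound that was introduced together with the WMD in [20]; this is an assumption for illustration and not necessarily the filter the authors use. The toy two-dimensional embeddings, the English tokens, the threshold, and all function names are hypothetical. The sketch only handles windows of the same length as the query and makes no attempt to reproduce the stated complexity bound.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag of words: word -> relative frequency in the segment."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def relaxed_wmd(a_tokens, b_tokens, emb):
    """Cheap lower bound on the WMD between two segments.

    Each word sends all of its mass to the nearest word of the other
    segment; the larger of the two directions is returned.
    """
    a, b = nbow(a_tokens), nbow(b_tokens)

    def one_way(src, dst):
        return sum(
            weight * min(math.dist(emb[w], emb[v]) for v in dst)
            for w, weight in src.items()
        )

    return max(one_way(a, b), one_way(b, a))


def candidate_windows(corpus, query, emb, threshold):
    """Slide a query-sized window over the corpus and keep only the
    windows whose lower bound does not exceed the threshold."""
    k = len(query)
    hits = []
    for start in range(len(corpus) - k + 1):
        bound = relaxed_wmd(query, corpus[start:start + k], emb)
        if bound <= threshold:
            hits.append((start, bound))
    return hits


# Hypothetical toy embeddings; a real system would load trained word2vec vectors.
emb = {
    "soul": (0.0, 0.0), "spirit": (0.1, 0.0),
    "immortal": (1.0, 1.0), "deathless": (1.0, 0.9),
    "ship": (5.0, 5.0), "sails": (5.0, 6.0),
}
query = ["soul", "immortal"]
corpus = ["ship", "sails", "spirit", "deathless", "ship"]

hits = candidate_windows(corpus, query, emb, threshold=0.5)
print(hits)  # only the window starting at index 2 ("spirit deathless") survives
```

Because the bound never exceeds the true WMD, a window discarded by this filter could not have passed the same threshold under the exact distance, so only the surviving windows would need the expensive computation.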

Keywords: Intertextuality; paraphrases; paraphrase searching; paraphrase extraction; Word Mover’s Distance; word2vec; ancient languages; Ancient Greek; Plato; Plato digital


Funding statement: The project was funded by the Volkswagen Foundation within the framework of the Open – for the Extraordinary funding line from 2016 to 2019.

About the authors

Marcus Pöckelmann

Marcus Pöckelmann studied computer science at the Martin Luther University Halle-Wittenberg (Master of Science 2013) and has been a member of the research group Molitor/Ritter since 2013. Within several interdisciplinary research projects, he develops web-based applications for the investigation of intertextuality together with colleagues from different disciplines of the humanities. These include the working environments LERA, for the analysis of complex text variants for scholarly editions, and Paraphrasis, for the retrieval and evaluation of paraphrased text passages in ancient Greek literature.

Janis Dähne

Janis Dähne is a student of computer science at the Martin Luther University Halle-Wittenberg. He was jointly responsible for the implementation of the approach presented here. In his master’s thesis, Janis Dähne deals with the alignment of text variants based on linear programs.

Dr. Jörg Ritter

Jörg Ritter studied Computer Science at the University of Saarland (Diplom 1997). He received his doctorate in technical computer science with a thesis on a pipelined architecture for partitioned discrete wavelet transformation based lossy image compression using FPGAs. Since 1997, he has been a research associate in the group headed by Prof. Dr. Molitor at the Institute of Computer Science of Martin Luther University Halle-Wittenberg. For more than 10 years, Jörg Ritter has been working with colleagues from philology in interdisciplinary projects dealing with humanities issues, e.g., Digital Plato. Tradition and Reception (2016-2019), A New Supplement Dictionary of Sanskrit (2013-2016), Epistolary Networks. Visualizing multi-dimensional information structures in correspondence corpora (2013-2016), and Semi-automatic Difference Analysis of Complex Text Variants (2012-2015). Jörg Ritter has been significantly involved in establishing the research focus “eHumanities” at the Institute of Computer Science of the Martin Luther University Halle-Wittenberg.

Prof. Dr. Paul Molitor

Paul Molitor studied Computer Science and Mathematics at the University of Saarland (Diplom 1982, Promotion 1986, Habilitation 1992). He was a member of the scientific staff of Prof. Dr. Gunter Hotz (1982-1994), where he led a project in the National Research Center 124 VLSI and Parallelism (1992-1994). In 1993, he was with the Humboldt University of Berlin as Associate Professor for Circuit Technology. Since 1994, he has been a Full Professor for Technical Computer Science at Martin Luther University Halle-Wittenberg. In addition to technical computer science, Paul Molitor’s interests lie in combinatorial optimization and computational humanities/eHumanities. Together with Jörg Ritter and colleagues from the humanities, he has led several interdisciplinary third-party funded projects in the field of Digital Humanities, in particular with colleagues from the fields of German studies, Romance studies, Jewish studies, Sanskrit studies, and ancient history / ancient Greek studies.

Acknowledgment

We would like to thank Prof. Dr. Charlotte Schubert, Professor of Ancient History at the University of Leipzig, her research fellows Dr. Roxana Kath and Dr. Michaela Rücker, and Dr. Eva Wöckener-Gade of the Institute of Classical Philology and Comparative Literature at the University of Leipzig, for the great collaboration.

We thank the anonymous reviewers for their careful reading of our manuscript and their helpful comments and suggestions.

We especially thank the Volkswagen Foundation for supporting the Digital Plato project.

References

1. B. Agarwal, H. Ramampiaro, H. Langseth, and M. Ruocco. A Deep Network Model for Paraphrase Detection in Short Text Messages. In: Information Processing & Management, Vol. 54, Issue 6, pages 922–937, 2018. doi:10.1016/j.ipm.2018.06.005

2. I. Androutsopoulos and P. Malakasiotis. A Survey of Paraphrasing and Textual Entailment Methods. In: Journal of Artificial Intelligence Research, Vol. 38, pages 135–187, 2010. doi:10.1613/jair.2985

3. K. Atasu, T. Parnell, C. Dünner, M. Sifalakis, H. Pozidis, V. Vasileiadis, M. Vlachos, C. Berrospi, and A. Labbi. Linear-Complexity Relaxed Word Mover’s Distance with GPU Acceleration. In: IEEE International Conference on Big Data, Big Data 2017, Boston, MA, USA, pp. 889–896, December 11–14, 2017. doi:10.1109/BigData.2017.8258005

4. Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In: Journal of Machine Learning Research, Vol. 3, pages 1137–1155, 2003. doi:10.1007/3-540-33486-6_6

5. Y. Bizzoni, R. Del Gratta, F. Boschetti, and M. Reboul. Enhancing the Accuracy of Ancient Greek WordNet by Multilingual Distributional Semantics. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 1140–1147, 2014.

6. Y. Bizzoni, R. Del Gratta, H. Diakoff, M. Monachini, and G. Crane. The Making of Ancient Greek WordNet. In: Proceedings of the Second Italian Conference on Computational Linguistics (CLIC-IT 2015), pages 47–50, 2015. doi:10.4000/books.aaccademia.1312

7. C. Blackwell and N. Smith. The Canonical Text Services protocol. Online: https://github.com/cite-architecture/cts_spec/blob/master/md/specification.md.

8. C. Blackwell and N. Smith. The CITE Architecture (CTS/CITE) for Analysis and Alignment. In: Special issue Digital Methods for Intertextuality Studies, P. Molitor and J. Ritter (Eds.), it - Information Technology, Volume 62, 2020. De Gruyter, Berlin/Boston. doi:10.1515/itit-2019-0044

9. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. In: Journal of Machine Learning Research, Vol. 3, pages 993–1022, 2003.

10. T. Bremer, P. Molitor, M. Pöckelmann, J. Ritter, and S. Schütz. Zum Einsatz digitaler Methoden bei der Erstellung und Nutzung genetischer Editionen gedruckter Texte mit verschiedenen Fassungen – Das Fallbeispiel der Histoire philosophique des deux Indes von Guillaume Thomas Raynal (in German). In: Editio, R. v. Nutt-Kofoth, B. Plachta, and W. Woesler (Eds.), Volume 29, Issue 1, pages 29–51, Walter de Gruyter, Berlin/Boston, 2015. doi:10.1515/editio-2015-004

11. G. Celano, G. Crane, and S. Majidi. Part of Speech Tagging for Ancient Greek. In: Open Linguistics, Vol. 2, pages 393–399, De Gruyter Open, 2016. doi:10.1515/opli-2016-0020

12. N. Coffee, J.-P. Koenig, S. Poornima, R. Ossewaarde, C. Forstall, and S. Jacobson. Intertextuality in the Digital Age. In: Transactions of the American Philological Association, Vol. 142, pages 383–422, 2012. doi:10.1353/apa.2012.0010

13. N. Coffee, W. J. Scheirer, and J. P. Koenig. Tesserae. Collaborative project of the University at Buffalo Department of Classics and Department of Linguistics, the Department of Computer Science and Engineering of the University of Notre Dame, and the Département des Sciences de l’Antiquité of the University of Geneva. Online: http://tesserae.caset.buffalo.edu/about.php.

14. R. H. Dekker and G. Middell. CollateX – Software for Collating Textual Sources. Online: https://collatex.net.

15. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. In: Journal of the American Society of Information Science, 41(6):391–407, 1990. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

16. E. Hamilton and H. Cairns (Eds.). The collected dialogues of Plato. Bollingen Series LXXI, Princeton University Press, 1961. doi:10.2307/j.ctt1c84fb0

17. Z. Harris. Distributional Structure. In: J. Katz and J. Fodor (Eds.), The Philosophy of Linguistics, pages 33–49, Oxford University Press, 1964.

18. C. Ho, M. A. Murad, S. Doraisamy, and R. Abdul Kadir. Extracting lexical and phrasal paraphrases: a review of the literature. In: Artificial Intelligence Review, Vol. 42, pages 851–894, 2014. doi:10.1007/s10462-012-9357-8

19. B. Korte and J. Vygen. Combinatorial optimization: Theory and Algorithms. Springer, 2000. doi:10.1007/978-3-662-21708-5

20. M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger. From word embeddings to document distances. In: Proceedings of the 32nd International Conference on Machine Learning 2015, pp. 957–966, 06.–11.07.2015, Lille, France.

21. T. Mikolov, W. T. Yih, and G. Zweig. Linguistic Regularities in Continuous Space Word Representations. In: Proceedings of North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 746–751, Atlanta, Georgia, 9–14 June 2013.

22. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations, 02.–04.05.2013, Scottsdale, USA.

23. M. Mohamed and M. Oussalah. A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics. In: Language Resources & Evaluation, 2019. doi:10.1007/s10579-019-09466-4

24. J. B. Orlin. A faster strongly polynomial minimum cost flow algorithm. In: Operations Research, 41(2):338–350, March–April 1993. Operations Research Society of America. doi:10.21236/ADA457044

25. M. Pantelia (Project Director). Thesaurus Linguae Graecae. A Digital Library of Greek Literature. Online: http://stephanus.tlg.uci.edu.

26. O. Pele and M. Werman. Fast and robust Earth Mover’s Distances. In: Proceedings of the 12th IEEE International Conference on Computer Vision, 29 September–2 October 2009, Kyoto, Japan. doi:10.1109/ICCV.2009.5459199

27. M. Pöckelmann, J. Ritter, E. Wöckener-Gade, and Ch. Schubert. Paraphrasensuche mittels word2vec und der Word Mover’s Distance im Altgriechischen (in German). In: Digital Classics Online, DCO 3(3):24–36, 2017. doi:10.11588/dco.2017.0.40185

28. M. Pöckelmann, J. Ritter, and P. Molitor. Word Mover’s Distance angewendet auf die Paraphrasenextraktion im Altgriechischen (in German). In: C. Schubert, P. Molitor, J. Ritter, J. Scharloth, and K. Sier (Eds.), Platon Digital: Tradition und Rezeption, pp. 45–60. Heidelberg: Propylaeum (Digital Classics Books, Band 3), 2019. doi:10.11588/propylaeum.451

29. M. Regneri, R. Wang, and M. Pinkal. Aligning Predicate-Argument Structures for Paraphrase Fragment Extraction. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 4300–4307, 2014.

30. G. Reale. A History of Ancient Philosophy II: Plato and Aristotle. State University of New York Press, New York, November 1990.

31. S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In: Proceedings of the 3rd Text Retrieval Conference (TREC), p. 109, 02.–04.11.1994, Maryland, USA.

32. Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In: Proceedings of the 6th International Conference on Computer Vision, Bombay, 2–7 January 1998, pp. 59–66.

33. G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. In: Information Processing & Management, 24(5):513–523, 1988. doi:10.1016/0306-4573(88)90021-0

34. J. Scharloth, F. Keilholz, S. Meier-Vieracker, X. Yu, and R. Doniok. Datengeleitete Kategorienbildung in den Digital Humanities: Paraphrasen aus korpus- und computerlinguistischer Perspektive (in German). In: C. Schubert, P. Molitor, J. Ritter, J. Scharloth, and K. Sier (Eds.), Platon Digital: Tradition und Rezeption, pp. 61–88. Heidelberg: Propylaeum (Digital Classics Books, Band 3), 2019. doi:10.11588/propylaeum.451

35. C. Schubert, P. Molitor, J. Ritter, J. Scharloth, and K. Sier (Eds.). Platon Digital: Tradition und Rezeption (in German). Heidelberg: Propylaeum (Digital Classics Books, Band 3), 2019. doi:10.11588/propylaeum.451

36. K. Sier, J. Scharloth, C. Schubert, J. Ritter, and P. Molitor. Digital Plato: Tradition and Reception. Project Application, Volkswagen Foundation, March 2015.

37. M. Spariosu. God of many names: Play, poetry, and power in Hellenic thought from Homer to Aristotle. Duke University Press, March 1991.

38. K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of the Human Language Technology Conference and the Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 252–259, 2003. doi:10.3115/1073445.1073478

39. M. Werman, S. Peleg, and A. Rosenfeld. A distance metric for multidimensional histograms. In: Computer Vision, Graphics and Image Processing, 32:328–336, December 1985. doi:10.1016/0734-189X(85)90055-6

40. E. Wöckener-Gade, S. Jödicke, H. Ohst, E. Pulz, K. Protze, J. Rautenberg, F. Schellhardt, F. Schulze, and A. L. Visinoni. Variantensensible und formgenaue Stoppwortliste für das Altgriechische (in German). In: C. Schubert, P. Molitor, J. Ritter, J. Scharloth, and K. Sier (Eds.), Platon Digital: Tradition und Rezeption, pp. 325–341. Heidelberg: Propylaeum (Digital Classics Books, Band 3), 2019. doi:10.11588/propylaeum.451

41. E. Wöckener-Gade, S. Jödicke, H. Ohst, E. Pulz, K. Protze, J. Rautenberg, F. Schellhardt, F. Schulze, and A. L. Visinoni. Ein Parallelkorpus von Paraphrasen auf Platon: Der ‘Goldstandard’ des Projekts Platon Digital (in German). In: C. Schubert, P. Molitor, J. Ritter, J. Scharloth, and K. Sier (Eds.), Platon Digital: Tradition und Rezeption, pp. 275–323. Heidelberg: Propylaeum (Digital Classics Books, Band 3), 2019. doi:10.11588/propylaeum.451

42. E. Wöckener-Gade and M. Pöckelmann. Bridging the Gap between Plato and His Successors: Towards an Annotated Gold Standard of Intertextual References to Plato in Ancient Greek literature. EADH 2018: Data in Digital Humanities, Galway, 07.–09.12.2018. Online: https://eadh2018.exordo.com/programme/presentation/27.

43. Z. Harris. Distributional Structure. In: Word, 10(2/3):146–162, 1954. doi:10.1080/00437956.1954.11659520

Received: 2019-10-31

Revised: 2020-02-21

Accepted: 2020-02-23

Published Online: 2020-03-06

Published in Print: 2020-04-26
