An Orthographic Similarity Measure for Graph-Based Text Representations

Deforche, Maxime; De Vos, Ilse; Bronselaer, Antoon; De Tré, Guy

doi:10.1007/978-3-031-42935-4_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14113)

Included in the following conference series:

International Conference on Flexible Query Answering Systems

Abstract

Computing the orthographic similarity between words, sentences, paragraphs and texts has become a basic functionality of many text mining and flexible querying systems, and the resulting similarity scores are often used to discover similar text documents. This task becomes complex, however, for a corpus known for its orthographic inconsistencies and for being intricately interconnected on multiple levels (words, verses and full texts), as is the case with Byzantine book epigrams. In this paper, we propose a technique that tackles these two challenges by representing text as a graph and by computing similarity scores hierarchically between multiple levels of the text, each modelled as subgraphs. The similarity between all words is computed first; the similarity between all verses is then calculated from the word-level scores, and the similarity between full texts from the verse-level scores. The resulting similarities, on each level, give deeper insight into the interconnected nature of (parts of) text collections, indicating how and to what degree the texts are related to each other.
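
The bottom-up order described in the abstract (words, then verses, then full texts) can be illustrated with a minimal Python sketch. It is not the authors' measure: the abstract does not state which word-level comparison or which aggregation is used, so the normalised Levenshtein distance and the symmetric best-match average below are assumptions chosen only for illustration. The function names and the sample verses are hypothetical, and the graph representation is replaced by plain tuples for brevity.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@lru_cache(maxsize=None)
def word_similarity(a: str, b: str) -> float:
    """Level 1 (assumed measure): normalised edit similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def aggregate(xs, ys, sim) -> float:
    """Assumed aggregation: symmetric average of best-match scores."""
    if not xs or not ys:
        return 1.0 if not xs and not ys else 0.0
    forward = sum(max(sim(x, y) for y in ys) for x in xs) / len(xs)
    backward = sum(max(sim(x, y) for x in xs) for y in ys) / len(ys)
    return (forward + backward) / 2


def verse_similarity(v1: tuple, v2: tuple) -> float:
    """Level 2: a verse is a tuple of words; reuses the word-level scores."""
    return aggregate(v1, v2, word_similarity)


def text_similarity(t1: tuple, t2: tuple) -> float:
    """Level 3: a text is a tuple of verses; reuses the verse-level scores."""
    return aggregate(t1, t2, verse_similarity)


if __name__ == "__main__":
    # Hypothetical toy data: two spellings of the same two-verse text.
    text_a = (("the", "scribe", "writes"), ("the", "book", "ends"))
    text_b = (("the", "skribe", "writes"), ("the", "booke", "ends"))
    print(round(text_similarity(text_a, text_b), 3))
```

Each level only consumes the scores of the level beneath it, so a different word-level measure can be swapped in without touching the verse and text levels.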

Notes

1. https://dbbe.ugent.be
2. https://neo4j.com/

Author information

Authors and Affiliations

Department of Telecommunications and Information Processing, Ghent University, St.-Pietersnieuwstraat 41, 9000, Ghent, Belgium
Maxime Deforche, Antoon Bronselaer & Guy De Tré
Department of Linguistics, Ghent University, Blandijnberg 2, 9000, Ghent, Belgium
Ilse De Vos

Corresponding author

Correspondence to Maxime Deforche.

Editor information

Editors and Affiliations

Legind Technologies, Esbjerg V, Denmark
Henrik Legind Larsen
University of Granada, Granada, Spain
Maria J. Martin-Bautista
University of Granada, Granada, Spain
M. Dolores Ruiz
Roskilde University, Roskilde, Denmark
Troels Andreasen
Consiglio Nazionale delle Ricerche, Milano, Italy
Gloria Bordogna
Ghent University, Gent, Belgium
Guy De Tré

About this paper

Cite this paper

Deforche, M., De Vos, I., Bronselaer, A., De Tré, G. (2023). An Orthographic Similarity Measure for Graph-Based Text Representations. In: Larsen, H.L., Martin-Bautista, M.J., Ruiz, M.D., Andreasen, T., Bordogna, G., De Tré, G. (eds) Flexible Query Answering Systems. FQAS 2023. Lecture Notes in Computer Science, vol 14113. Springer, Cham. https://doi.org/10.1007/978-3-031-42935-4_17

DOI: https://doi.org/10.1007/978-3-031-42935-4_17
Published: 07 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42934-7
Online ISBN: 978-3-031-42935-4
eBook Packages: Computer Science, Computer Science (R0)
