Abstract
Computing the orthographic similarity between words, sentences, paragraphs and texts has become a basic functionality of many text mining and flexible querying systems, and the resulting similarity scores are often used to discover similar text documents. However, this task becomes complex for a corpus that is known for its orthographic inconsistencies and that is intricately interconnected on multiple levels (words, verses and full texts), as is the case with Byzantine book epigrams. In this paper, we propose a technique that tackles these two challenges by representing text as a graph and by computing similarity scores hierarchically between multiple levels of the text, each modelled as subgraphs. The similarity between all words is computed first. The similarity between all verses is then calculated from the word-level scores, and the similarity between all full texts from the verse-level scores. The resulting similarities on each level give a deeper insight into the interconnected nature of (parts of) text collections, indicating how and to what degree the texts are related to each other.
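The hierarchical idea can be illustrated with a minimal Python sketch. It is not the measure proposed in the paper: the abstract does not specify the word-level measure or the aggregation, so the normalised Levenshtein similarity, the symmetric best-match average and all function names below are assumptions chosen for illustration, and plain tuples stand in for the graph representation.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


@lru_cache(maxsize=None)
def word_similarity(a: str, b: str) -> float:
    """Level 1 (assumed measure): 1 minus the length-normalised edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def aggregate(xs, ys, sim) -> float:
    """Assumed aggregation: symmetric average of each item's best match."""
    if not xs or not ys:
        return 1.0 if not xs and not ys else 0.0
    forward = sum(max(sim(x, y) for y in ys) for x in xs) / len(xs)
    backward = sum(max(sim(x, y) for x in xs) for y in ys) / len(ys)
    return (forward + backward) / 2


def verse_similarity(v1: tuple, v2: tuple) -> float:
    """Level 2: a verse is a tuple of words; reuses the word-level scores."""
    return aggregate(v1, v2, word_similarity)


def text_similarity(t1: tuple, t2: tuple) -> float:
    """Level 3: a text is a tuple of verses; reuses the verse-level scores."""
    return aggregate(t1, t2, verse_similarity)


if __name__ == "__main__":
    # Toy input with a spelling variant; not taken from the epigram corpus.
    text_a = (("colour", "of", "the", "sky"),)
    text_b = (("color", "of", "the", "sky"),)
    print(text_similarity(text_a, text_b))
```

Each level only consumes the scores of the level below it, so a different word-level measure can be swapped in without touching the verse or text functions.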
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Deforche, M., De Vos, I., Bronselaer, A., De Tré, G. (2023). An Orthographic Similarity Measure for Graph-Based Text Representations. In: Larsen, H.L., Martin-Bautista, M.J., Ruiz, M.D., Andreasen, T., Bordogna, G., De Tré, G. (eds) Flexible Query Answering Systems. FQAS 2023. Lecture Notes in Computer Science, vol 14113. Springer, Cham. https://doi.org/10.1007/978-3-031-42935-4_17
DOI: https://doi.org/10.1007/978-3-031-42935-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42934-7
Online ISBN: 978-3-031-42935-4
eBook Packages: Computer Science, Computer Science (R0)