An Orthographic Similarity Measure for Graph-Based Text Representations

  • Conference paper
Flexible Query Answering Systems (FQAS 2023)

Abstract

Computing the orthographic similarity between words, sentences, paragraphs and texts has become a basic functionality of many text mining and flexible querying systems, and the resulting similarity scores are often used to discover similar text documents. However, this task becomes complex for a corpus that is known for its orthographic inconsistencies and for its intricate interconnected nature on multiple levels (words, verses and full texts), as is the case with Byzantine book epigrams. In this paper, we propose a technique that tackles these two challenges by representing text in a graph and by computing a similarity score between multiple levels of the text, modelled as subgraphs, in a hierarchical manner. The similarity between all words is computed first, followed by the similarity between all verses (resp. full texts), which is calculated from the previously determined similarity scores between the words (resp. verses). The resulting similarities, on each level, allow for a deeper insight into the interconnected nature of (parts of) text collections, indicating how and to what degree the texts are related to each other.
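The hierarchical idea in the abstract (word scores first, then verse scores built from word scores, then text scores built from verse scores) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the word-level measure is a normalised Levenshtein similarity, the aggregation is a symmetric average of best matches, plain tuples and lists stand in for the graph representation, and the names `word_similarity`, `sequence_similarity`, `verse_similarity` and `text_similarity` are hypothetical.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


@lru_cache(maxsize=None)
def word_similarity(a: str, b: str) -> float:
    """Assumed word-level measure: 1 - distance / length of the longer word."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def sequence_similarity(xs, ys, unit_similarity) -> float:
    """Assumed aggregation: average best match, taken in both directions."""
    if not xs and not ys:
        return 1.0
    if not xs or not ys:
        return 0.0
    forward = sum(max(unit_similarity(x, y) for y in ys) for x in xs) / len(xs)
    backward = sum(max(unit_similarity(x, y) for x in xs) for y in ys) / len(ys)
    return (forward + backward) / 2


def verse_similarity(v1: tuple, v2: tuple) -> float:
    """Verse level: reuses the word-level scores (a verse is a tuple of words)."""
    return sequence_similarity(v1, v2, word_similarity)


def text_similarity(t1: list, t2: list) -> float:
    """Text level: reuses the verse-level scores (a text is a list of verses)."""
    return sequence_similarity(t1, t2, verse_similarity)


if __name__ == "__main__":
    # Hypothetical transliterated verses with one spelling variant.
    text_a = [("kyrie", "eleison"), ("doxa", "soi")]
    text_b = [("kirie", "eleison"), ("doxa", "soi")]
    print(word_similarity("kyrie", "kirie"))
    print(verse_similarity(text_a[0], text_b[0]))
    print(text_similarity(text_a, text_b))
```

Caching the word-level scores mirrors the bottom-up order described in the abstract: each word pair is scored once and the result is reused at the verse and text levels. A graph database would store these scores on edges between nodes rather than in an in-memory cache.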

Notes

  1. https://dbbe.ugent.be.

  2. https://neo4j.com/.

Author information

Corresponding author

Correspondence to Maxime Deforche.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Deforche, M., De Vos, I., Bronselaer, A., De Tré, G. (2023). An Orthographic Similarity Measure for Graph-Based Text Representations. In: Larsen, H.L., Martin-Bautista, M.J., Ruiz, M.D., Andreasen, T., Bordogna, G., De Tré, G. (eds) Flexible Query Answering Systems. FQAS 2023. Lecture Notes in Computer Science, vol 14113. Springer, Cham. https://doi.org/10.1007/978-3-031-42935-4_17

  • DOI: https://doi.org/10.1007/978-3-031-42935-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42934-7

  • Online ISBN: 978-3-031-42935-4

  • eBook Packages: Computer Science, Computer Science (R0)
