Cross-document word matching for segmentation and retrieval of Ottoman divans

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

Motivated by the need to automatically index and analyze the huge number of documents in Ottoman divan poetry, and to discover new knowledge that preserves this heritage and keeps it alive, we propose a novel method for segmenting and retrieving words in Ottoman divans. Ottoman documents are difficult to segment into words without prior knowledge of the words. Divans exist in multiple copies (versions) produced by different writers in different writing styles, and word segmentation may be considerably easier in some versions than in others. We exploit this: versions that are difficult, if not impossible, to segment with traditional techniques are segmented using information carried over from a simpler version. One version of a document is used as the source dataset and another version of the same document as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset in order to detect word boundaries. We present the idea of cross-document word matching for the novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequences of sub-words, and we improve the performance of simple features by considering words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising both for finding word boundaries and for retrieving words across documents.
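The matching scheme over combinations of consecutive sub-words can be illustrated with a minimal sketch. The abstract does not specify the features or the distance measure, so everything concrete here is an assumption made for illustration: each sub-word is represented by a hypothetical 1-D profile (a list of numbers), consecutive sub-words are merged by concatenation, and dynamic time warping is swapped in as the distance. The names `dtw_distance`, `best_subword_span` and `max_span` are illustrative and not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical 1-D feature per sub-word (for example a projection profile).
Profile = List[float]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping cost between two 1-D sequences (assumed distance)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def best_subword_span(
    query: Profile,
    target_subwords: List[Profile],
    max_span: int = 3,
) -> Tuple[int, int, float]:
    """Try every run of 1..max_span consecutive target sub-words and return
    (start, end_exclusive, distance) for the run closest to the query word."""
    best = (-1, -1, float("inf"))
    for start in range(len(target_subwords)):
        last_end = min(start + max_span, len(target_subwords))
        for end in range(start + 1, last_end + 1):
            # Merge the candidate sub-words into one candidate word profile.
            merged = [v for sw in target_subwords[start:end] for v in sw]
            d = dtw_distance(query, merged)
            if d < best[2]:
                best = (start, end, d)
    return best


if __name__ == "__main__":
    # Toy target line made of four sub-words; the query word spans the middle two.
    line = [[1, 1], [5, 6], [7, 5], [2, 2]]
    print(best_subword_span([5, 6, 7, 5], line))
```

In this sketch the returned span doubles as a segmentation hypothesis: the sub-words between `start` and `end_exclusive` are grouped into one word, so the span's edges are candidate word boundaries in the target version.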

Notes

  1. Ottoman Text Archive Project (OTAP), url: http://courses.washington.edu/otap/.

  2. State Archives Office of Turkey, url: http://www.devletarsivleri.gov.tr/.

Author information

Corresponding author

Correspondence to Pinar Duygulu.

Additional information

This work was done while the second author was a graduate student in the Department of Computer Engineering, Bilkent University, Ankara, 06800 Turkey.

About this article

Cite this article

Duygulu, P., Arifoglu, D. & Kalpakli, M. Cross-document word matching for segmentation and retrieval of Ottoman divans. Pattern Anal Applic 19, 647–663 (2016). https://doi.org/10.1007/s10044-014-0420-8

  • DOI: https://doi.org/10.1007/s10044-014-0420-8
