Abstract
Motivated by the need to automatically index and analyze the huge number of documents in Ottoman divan poetry, and to discover new knowledge that preserves this heritage and keeps it alive, we propose a novel method for segmenting and retrieving words in Ottoman divans. Ottoman documents are difficult to segment into words without prior knowledge of the words. Divans exist in multiple copies (versions) written by different writers in different writing styles, and word segmentation may be relatively easier in some of those versions than in others. Building on this idea, we segment the versions that are difficult, if not impossible, to segment with traditional techniques by using information carried over from the simpler version. One version of a document is used as the source dataset and another version of the same document as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset in order to detect word boundaries. We present the idea of cross-document word matching for the novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequences of sub-words, and we improve the performance of simple features by considering the words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising for finding word boundaries and for retrieving words across documents.
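The matching idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each sub-word is reduced to a toy one-dimensional feature profile, that dynamic time warping (DTW) serves as the matching cost, and that source words are consumed greedily in reading order. The names `dtw_distance`, `segment_line`, and `max_merge` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Profile = List[float]  # assumption: a toy 1-D feature sequence per sub-word


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping cost with |x - y| as the local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the DTW table
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[len(b)]


def segment_line(
    queries: List[Profile],
    subwords: List[Profile],
    max_merge: int = 3,  # hypothetical cap on sub-words per word
) -> List[Tuple[int, int]]:
    """Greedy sketch: for each source word (query), in order, try merging
    1..max_merge consecutive target sub-words starting at the cursor and
    keep the merge with the lowest length-normalized DTW cost.
    Returns half-open (start, end) sub-word index ranges, one per word."""
    boundaries: List[Tuple[int, int]] = []
    cursor = 0
    for query in queries:
        if cursor >= len(subwords):
            break  # target line exhausted
        best_end, best_cost = cursor + 1, float("inf")
        merged: Profile = []
        last = min(cursor + max_merge, len(subwords))
        for end in range(cursor + 1, last + 1):
            merged = merged + subwords[end - 1]
            cost = dtw_distance(query, merged) / (len(query) + len(merged))
            if cost < best_cost:
                best_end, best_cost = end, cost
        boundaries.append((cursor, best_end))
        cursor = best_end
    return boundaries


# Toy data: three source words, five target sub-words.
source_words = [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0], [0.0, 1.0, 2.0, 0.0]]
target_subwords = [[1.0, 2.0], [3.0], [5.0, 5.0, 5.0], [0.0, 1.0], [2.0, 0.0]]
word_boundaries = segment_line(source_words, target_subwords)
```

In this toy run the first and third source words each span two target sub-words, so the recovered boundaries group sub-words 0-1, 2, and 3-4. A greedy left-to-right pass is only one simple way to use word order as context. The paper's actual features and matching cost may differ.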
Notes
Ottoman Text Archive Project (OTAP), url: http://courses.washington.edu/otap/.
State Archives Office of Turkey, url: http://www.devletarsivleri.gov.tr/.
Additional information
This work was done while the second author was a graduate student in the Department of Computer Engineering, Bilkent University, Ankara, 06800 Turkey.
Cite this article
Duygulu, P., Arifoglu, D. & Kalpakli, M. Cross-document word matching for segmentation and retrieval of Ottoman divans. Pattern Anal Applic 19, 647–663 (2016). https://doi.org/10.1007/s10044-014-0420-8
DOI: https://doi.org/10.1007/s10044-014-0420-8