Abstract
With the accelerating pace of digitization, a vast collection of Ottoman documents has become accessible to researchers and the general public. However, most users interested in these documents cannot read them, because the text is Turkish written in the Arabic-Persian script. Manually transcribing such a large volume of documents is also beyond the capacity of human experts. Building on advances in deep learning, we present a solution to the long-standing problem of automatic transcription of printed Ottoman documents. We evaluate three decoding strategies, including Word Beam Search, which allows a recognition lexicon and n-gram statistics to be used during decoding. We also evaluate the effect of lexicon size and coverage, and of language modelling via character or word n-grams. Using a large general-purpose lexicon of the Ottoman era (260K words, 86% test coverage), the system achieves a character error rate of \(6.59\%\) and a word error rate of \(28.46\%\) on a test set of 6,828 text lines.
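The three quantities reported above (character error rate, word error rate, and lexicon coverage) can be illustrated with a minimal sketch. It uses the standard edit-distance definitions and assumes that coverage is the fraction of test word tokens found in the lexicon; the exact evaluation protocol used in this work is not stated here, and the function names and toy Latin-script strings are hypothetical.

```python
from typing import Iterable, Sequence, Set


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(refs: Iterable[str], hyps: Iterable[str], unit: str) -> float:
    """Total edit distance divided by total reference length.

    unit="char" gives a character error rate, unit="word" a word error rate.
    """
    split = list if unit == "char" else str.split
    refs, hyps = list(refs), list(hyps)
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return errors / total


def lexicon_coverage(refs: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of reference word tokens present in the lexicon (assumed definition)."""
    tokens = [w for r in refs for w in r.split()]
    return sum(w in lexicon for w in tokens) / len(tokens)


if __name__ == "__main__":
    # Toy, made-up lines; not taken from the paper's test set.
    refs = ["kitap okudum", "eve geldi"]
    hyps = ["kitab okudum", "eve geldi"]
    print("CER:", error_rate(refs, hyps, "char"))
    print("WER:", error_rate(refs, hyps, "word"))
    print("coverage:", lexicon_coverage(refs, {"kitap", "eve", "geldi"}))
```

A single wrong character makes the whole word count as an error, which is one reason a word error rate can be several times larger than the character error rate on the same output.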
Berrin Yanikoglu: Part of this work was done while Z. Tandoğan, S. D. Akansu and F. Kızılırmak were students at Sabancı University.
Notes
1. A demo of the current system is available at https://demos.sabanciuniv.edu.
2. The test subset is publicly available at https://github.com/verimsu/Akis-Dataset.
Acknowledgement
This study was supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under Grant Number 122E399. The authors thank TUBITAK for its support.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tasdemir, E.F.B. et al. (2024). Automatic Transcription of Ottoman Documents Using Deep Learning. In: Sfikas, G., Retsinas, G. (eds) Document Analysis Systems. DAS 2024. Lecture Notes in Computer Science, vol 14994. Springer, Cham. https://doi.org/10.1007/978-3-031-70442-0_26
DOI: https://doi.org/10.1007/978-3-031-70442-0_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70441-3
Online ISBN: 978-3-031-70442-0
eBook Packages: Computer Science, Computer Science (R0)