Dating the Historical Documents from Digitalized Books by Orthography Recognition

Brodić, Darko; Amelio, Alessia

doi:10.1007/978-3-319-68130-6_10


Part of the book series: Communications in Computer and Information Science (CCIS, volume 733)

Included in the following conference series:

Italian Research Conference on Digital Libraries


Abstract

This paper introduces a new method for automatically dating Serbian and Croatian historical documents. It rests on the observation that documents in a given script or language, as it evolves across historical periods, differ in their orthography rules. To capture these differences, we propose three stages: script coding, texture analysis, and classification. First, the input document is transformed into a sequence of numerical codes, each representing an intensity value, which together define an image. Then, texture analysis extracts features from the image to form a feature vector. Finally, the feature vector is classified for orthography recognition. Results obtained on two databases of historical documents extracted from digitalized books, in angular Glagolitic script and in the Slavonic-Serbian and Serbian languages, demonstrate the efficacy of the proposed method.
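
The three stages named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the coding table, the texture descriptor, or the classifier, so everything below is an assumption chosen for illustration: letters are coded by a hypothetical three-level scheme based on vertical extent (Latin alphabet only), texture is summarized by the code histogram plus two run-length statistics, and classification uses a single nearest neighbor. The function names and the example labels are hypothetical as well.

```python
from collections import Counter
from itertools import groupby
import math

# Assumed coding table (not taken from the paper): letters grouped by vertical extent.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")

def code_text(text):
    """Stage 1, script coding: map each letter to an integer intensity code."""
    codes = []
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation
        if ch.isupper() or ch in ASCENDERS:
            codes.append(1)
        elif ch in DESCENDERS:
            codes.append(2)
        else:
            codes.append(0)
    return codes

def texture_features(codes, levels=3):
    """Stage 2, texture analysis of the 1-D code image.

    Returns the normalized code histogram followed by short-run and
    long-run emphasis computed over runs of identical codes.
    """
    if not codes:
        raise ValueError("no codes to analyse")
    run_lengths = [len(list(group)) for _, group in groupby(codes)]
    n_runs = len(run_lengths)
    short_run = sum(1 / (n * n) for n in run_lengths) / n_runs
    long_run = sum(n * n for n in run_lengths) / n_runs
    hist = Counter(codes)
    return [hist[v] / len(codes) for v in range(levels)] + [short_run, long_run]

def classify(features, labelled):
    """Stage 3, classification: label of the nearest labelled feature vector."""
    return min(labelled, key=lambda item: math.dist(features, item[0]))[1]

if __name__ == "__main__":
    # Toy labelled set; the period labels are placeholders, not real data.
    training = [
        (texture_features(code_text("the quick brown fox")), "period A"),
        (texture_features(code_text("jumping gypsy query")), "period B"),
    ]
    sample = texture_features(code_text("happy puppy jogging"))
    print(classify(sample, training))
```

A real system would need a coding table for the target script (Glagolitic or Cyrillic rather than Latin) and a richer texture descriptor; the sketch only shows how the stages connect.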



Author information

Authors and Affiliations

Technical Faculty in Bor, University of Belgrade, V.J. 12, 19210, Bor, Serbia
Darko Brodić
DIMES, University of Calabria, Via Pietro Bucci Cube 44, 87036, Rende, CS, Italy
Alessia Amelio

Corresponding author

Correspondence to Darko Brodić.

Editor information

Editors and Affiliations

University of Modena and Reggio Emilia, Modena, Italy
Costantino Grana
University of Modena and Reggio Emilia, Modena, Italy
Lorenzo Baraldi

About this paper

Cite this paper

Brodić, D., Amelio, A. (2017). Dating the Historical Documents from Digitalized Books by Orthography Recognition. In: Grana, C., Baraldi, L. (eds) Digital Libraries and Archives. IRCDL 2017. Communications in Computer and Information Science, vol 733. Springer, Cham. https://doi.org/10.1007/978-3-319-68130-6_10

DOI: https://doi.org/10.1007/978-3-319-68130-6_10
Published: 11 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68129-0
Online ISBN: 978-3-319-68130-6
eBook Packages: Computer Science (R0)
