Establishing the provenance of historical manuscripts with a novel distance measure

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

The recent digitization of more than 20 million books has been led by initiatives from countries wishing to preserve their cultural heritage and by several commercial endeavors, including the Google Print Library Project. It is expected that within a few years a significant fraction of the world’s books will be online. However, for millions of complete books and tens of millions of loose pages, the provenance of the manuscripts may be completely unknown or disputed, thus denying historians an understanding of the context in which the content was created. In a handful of cases, it may be possible for experts to regain the provenance by examining linguistic, cultural and/or stylistic clues. However, such experts are a rarity and these investigations are time-consuming and expensive. One technique used by experts to establish provenance is the examination of the ornate initial letters appearing in the questioned manuscript. By comparing the initial letters in the manuscript to annotated initial letters whose origin is known, the provenance can be determined. In this work, we show for the first time that we can reproduce this ability with a computer algorithm. We use a recently introduced technique to measure texture similarity and show that it can recognize initial letters with an accuracy that rivals or exceeds human performance. A brute force implementation of this measure would require several months to process a single large book; however, we introduce a novel lower bound that allows us to process the books in hours or minutes.
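The speedup described in the abstract rests on a general idea: if a cheap lower bound on an expensive distance already exceeds the best match found so far, the expensive computation can be skipped without changing the answer. The abstract does not give the texture measure or the bound itself, so the following is only a minimal sketch of that pruning pattern. Euclidean distance stands in for the costly measure and the reverse triangle inequality supplies the bound. The function names `expensive_distance`, `cheap_lower_bound`, and `nearest_neighbor` are hypothetical and are not taken from the paper.

```python
import math


def expensive_distance(a, b):
    """Stand-in for a costly measure (assumption: plain Euclidean distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cheap_lower_bound(a, b):
    """Cheap bound that never exceeds expensive_distance(a, b).

    Follows from the reverse triangle inequality: | ||a|| - ||b|| | <= ||a - b||.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return abs(norm_a - norm_b)


def nearest_neighbor(query, candidates):
    """Return (best_index, best_distance, full_computations).

    A candidate is skipped whenever its lower bound is already no better
    than the best distance found so far, so the result matches an
    exhaustive search while doing fewer expensive computations.
    """
    best_index, best_distance, full_computations = -1, math.inf, 0
    for i, candidate in enumerate(candidates):
        if cheap_lower_bound(query, candidate) >= best_distance:
            continue  # pruned: this candidate cannot beat the current best
        full_computations += 1
        d = expensive_distance(query, candidate)
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index, best_distance, full_computations


if __name__ == "__main__":
    annotated = [(3.0, 4.0), (30.0, 40.0), (0.0, 1.0), (6.0, 8.0)]
    print(nearest_neighbor((3.0, 5.0), annotated))
```

The tighter the bound, the more candidates are pruned; a bound that is loose or nearly as costly as the full measure yields little benefit.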

Notes

  1. We defer a detailed discussion of our experimental philosophy until Sect. 5; however, we briefly note that all our experiments are reproducible, and all code and data are available at [11].

  2. Just the unique initial letters were deleted.

  3. SIFT was faster by about 30% when given unlimited main memory. If we force it to use a smaller memory footprint, it becomes significantly slower [36].

References

  1. Antonacopoulos A, Downton AC (2007) Special issue on the analysis of historical documents. IJDAR 9(2)

  2. Alabert A, Rangel LM (2011) Classifying the typefaces of the Gutenberg 42-line Bible. IJDAR 14(4)

  3. Coustaty M, Pareti R, Vincent N, Ogier JM (2011) Towards historical document indexing: extraction of drop cap letters. IJDAR 14(3)

  4. Consortium of European Research Libraries (2011) www.cerl.org/web/

  5. Ornaments typographical. www.ornements-typo-mouriau.be/

  6. Virtual Library Humanist Program (2011) www.bvh.univ-tours.fr/index.htm

  7. Agam G, Argamon S, Frieder O, Grossman D, Lewis D (2006) The Complex Document Image Processing (CDIP) Test Collection Project. Illinois Institute of Technology. http://ir.iit.edu/projects/CDIP.html

  8. Bronner E (2008) Stolen manuscripts plague Israeli archives. New York Times

  9. Calvani S (2008) Frequency and figures of organised crime in art and antiquities. ISPAC

  10. Victoria and Albert Museum: Woodcut Printing (video). www.youtube.com/watch?v=mgCYovlFRNY

  11. Hu B Supporting URL for this paper. www.cs.ucr.edu/bhu002/IL/IL.html. This URL contains all data and code used in this paper

  12. Alderman K (2009) Thieves take a page out of rare books and manuscripts. Art Cult Heritage Law Newsl I(V)

  13. INTERPOL (2011) Stolen works of art. www.interpol.int/Public/WorkOfArt/woafaq.asp. Accessed 7 July 2011

  14. Atran S, Henrich J (2010) The evolution of religion: how cognitive by-products, adaptive learning heuristics, ritual displays, and group competition generate deep commitments to prosocial religions. Biological theory: integrating development, evolution, and cognition, vol 5, pp 18–30

  15. Landre J, Morain-Nicolier F (2009) Retrieval of the ornaments from the hand-press period: an overview. In: 10th ICDAR

  16. Campana B, Keogh E (2010) A compression based distance measure for texture. SDM

  17. Maltoni D, Maio D, Jain AK, Prabhakar S (2003) Handbook of fingerprint recognition, Springer, Berlin

  18. Ogier JM, Tombre K (2006) Document image analysis techniques for cultural heritage documents. In: Proceedings of 1st EVA conference, pp 107–114

  19. Basa P, Sabari PS, Nishikanta R, Ramakrishnan AG (2004) Gabor filters for document analysis in Indian bilingual documents. In: International conference on intelligent sensing and information processing, pp 123–126

  20. Delalandre M, Ogier JM, Llados J (2008) A fast CBIR system of old ornamental letter. In: Workshop on graphics recognition, LNCS, pp 135–144

  21. Fauzi MFA, Lewis PH (2008) A multiscale approach to texture-based image retrieval. J Pattern Anal Appl 11(2)

  22. Garz A, Diem M, Sablatnig R (2010) Local descriptors for document layout analysis. In: Proceedings of Addison-Wesley series in statistics, pp 29–38

  23. Ramel JY, Leriche S, Demonet ML, Busson S (2007) User-driven page layout analysis of historical printed books. IJDAR 243–261

  24. Su Z, Cao Z, Wang Y, Zhen X (2011) Identification of unreliable segments to improve skeletonization of handwriting images. J Pattern Anal Appl 14(1)

  25. Tseng YH, Lee HJ (2008) Document image binarization by two-stage block extraction and background intensity determination. J Pattern Anal Appl 11(1)

  26. Tu SF, Hsu CS (2006) A DCT-based ownership identification method with gray-level and colorful signatures. J Pattern Anal Appl 9(2–3)

  27. Journet N, Eglin V, Ramel JY, Mullot R (2006) Dedicated texture based tools for characterization of old books. In: Proceedings of the 2nd DIAL, April 2006

  28. Moghaddam RF, Cheriet M (2009) Low quality document image modeling and enhancement. IJDAR 11(4)

  29. Hénault DR, Moghaddam RF, Cheriet M (2011) A local linear level set method for the binarization of degraded historical document image. IJDAR 14

  30. Zhu Q, Keogh E (2010) Mother fugger: mining historical manuscripts with local color patches. ICDM 699–708

  31. Li M, Chen X, Li X, Ma B, Vitányi PMB (2004) The similarity metric. IEEE Trans Inf Theory 50(12):3250–3264

  32. Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee S, Handley J (2007) Compression-based data mining of sequential data. Data Min Knowl Discov 14(1):99–129

  33. Baudrier E, Busson S, Corsini S, Delalandre M (2009) Retrieval of the ornaments from the hand-press period: an overview. In: 10th ICDAR 2009

  34. Vedaldi A (2011) http://www.vlfeat.org/~vedaldi/index.html

  35. Garz A, Diem M, Sablatnig R (2011) Layout analysis of ancient manuscripts using local features. In: Eikonopoiia: digital imaging of ancient textual heritage

  36. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60:91–110

  37. Ancient Greek Manuscripts Hit the Internet (2010) www.foxnews.com/scitech/2010/09/27/british-library-posts-greek-manuscripts-web/. Accessed 27 Sep 2010

  38. Keogh E (2002) Exact indexing of dynamic time warping. In: VLDB, pp 406–417

  39. Rubner Y, Tomasi C, Guibas L (1998) A metric for distributions with applications to image databases. In: Proceedings of the IEEE ICCV, pp 59–66

  40. Tang Q, Nasiopoulos P (2010) Efficient motion re-estimation with rate-distortion optimization for MPEG-2 to H.264/AVC transcoding. IEEE Trans Circuits Syst Video Technol 20:262–274

  41. Pigeon S, Coulombe S (2008) Very low cost algorithms for predicting the file size of JPEG images subject to changes of quality factor and scaling. In: DCC

  42. Wang X, Ye L, Keogh EJ, Shelton CR Annotating historical archives of images. JCDL 341–350

  43. Hu B, Rakthanmanon T, Campana B, Mueen A, Keogh E (2012) Image mining of historical manuscripts to establish provenance. In: SIAM conference on data mining (SDM)

  44. Justin TP (1559) Histoire Universelle de Trogues Pompée, Réduite En Abrégé par Justin

  45. Lewis D, Agam G, Argamon S, Frieder O, Grossman D, Heard J (2006) Building a test collection for complex document information processing. In: Proceedings of the 29th annual international ACM SIGIR conference, pp 665–666

  46. Journet N, Ramel J, Mullot R, Eglin V (2008) Document image characterization using multi-resolution analysis of the texture: application to old documents. IJDAR 11:9–18

  47. Marinai S (2011) Text retrieval from early printed books. IJDAR 14(2):117–129

  48. Plötz T, Fink G (2009) Markov models for offline handwriting recognition: a survey. IJDAR 12:269–298

  49. The Legacy Tobacco Document Library (LTDL) (2007) University of California, San Francisco. http://legacy.library.ucsf.edu/

  50. Tobacco800 Signature and Logos. http://lampsrv02.umiacs.umd.edu/projdb/project.php?id=52

  51. Rusiñol M, Lladó J (2010) Efficient logo retrieval through hashing shape context descriptors. In: Proceedings of the ninth IAPR international workshop on document analysis systems (DAS10), pp 215–222

  52. Zhu G, Zheng Y, Doermann D, Jaeger S (2009) Signature detection and matching for document image retrieval. IEEE Trans Pattern Anal Mach Intell 2015–2031

  53. Zhu G, Doermann D (2007) Automatic document logo detection. IJDAR 864–868

  54. Zhu G, Jaeger S, Doermann D (2006) A robust stamp detection framework on degraded documents. IJDAR XIII:1–9

  55. Jouili S, Coustaty M, Tabbone S, Ogier JM (2010) NAVIDOMASS: structural-based approaches towards handling historical documents. In: ICPR, pp 946–949

  56. Wei L, Keogh E, Van Herle H, Mafra-Neto A (2005) Atomic wedgie: efficient query filtering for streaming times series. ICDM 490–497

  57. Fornés A, Dutta A, Gordo A, Lladó J (2011) CVC-MUSCIMA: A ground truth of handwritten music score images for writer identification and staff removal. IJDAR 14

  58. Renou J (1626) Les Oeuvres Pharmaceutiques du Sr Jean de Renou, Conseiller & Medecin du Roy

Acknowledgments

We thank all the digital archivists who produced the vast amounts of data that made this work possible, especially the NaviDoMass group who did exceptional work preparing and annotating the data. This work was funded by NSF awards 0803410 and 0808770. We also thank the reviewers for their useful comments.

Corresponding author

Correspondence to Bing Hu.

About this article

Cite this article

Hu, B., Rakthanmanon, T., Campana, B.J.L. et al. Establishing the provenance of historical manuscripts with a novel distance measure. Pattern Anal Applic 18, 313–331 (2015). https://doi.org/10.1007/s10044-013-0332-z

  • DOI: https://doi.org/10.1007/s10044-013-0332-z
