Mining historical manuscripts with local color patches

  • Regular Paper
Knowledge and Information Systems

Abstract

Initiatives such as the Google Print Library Project and the Million Book Project have already archived more than twelve million books in digital format, and within the next decade, the majority of the world’s books will be online. Although most of the data will naturally be text, there will also be tens of millions of pages of images, many in color. While there is an active research community pursuing data mining of text from historical manuscripts, there has been very little work that exploits the rich color information that is often present. In this work, we introduce a simple color measure that both addresses and exploits typical features of historical manuscripts. To enable the efficient mining of massive archives, we propose a tight lower bound to the measure. Beyond fast similarity search, we show how this lower bound allows us to build several higher-level data mining tools, including motif discovery and link analysis. We demonstrate our ideas in several data mining tasks on manuscripts dating back to the fifteenth century.
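
The abstract does not define the color measure or its lower bound, so the following is only a generic sketch of how a lower bound can speed up similarity search. It assumes plain Euclidean distance between feature vectors as a stand-in for the paper's measure and uses the reverse triangle inequality (the difference of the two vector norms) as a stand-in bound. The function name `nearest_with_lower_bound` and the sample data are hypothetical.

```python
import math


def euclidean(a, b):
    """Full (expensive) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def norm(a):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in a))


def nearest_with_lower_bound(query, database):
    """Nearest-neighbor search that skips candidates using a cheap lower bound.

    Returns (best_index, best_distance, number_of_full_distance_computations).
    Stand-in assumption: the bound is | ||q|| - ||v|| |, which never exceeds
    the Euclidean distance between q and v (reverse triangle inequality).
    """
    q_norm = norm(query)
    norms = [norm(v) for v in database]  # in practice, computed once offline
    best_idx, best_dist, full = -1, math.inf, 0
    for i, v in enumerate(database):
        lower_bound = abs(q_norm - norms[i])
        if lower_bound >= best_dist:
            continue  # this candidate cannot beat the best match so far
        full += 1
        d = euclidean(query, v)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist, full


if __name__ == "__main__":
    db = [[1, 0, 0], [10, 10, 10], [0, 1.1, 0], [50, 0, 0]]
    idx, dist, full = nearest_with_lower_bound([0, 1, 0], db)
    print(idx, round(dist, 6), full)
```

Because the bound never overestimates the true distance, pruning on it cannot discard the true nearest neighbor; the tighter the bound, the more full distance computations are skipped.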



Author information

Corresponding author

Correspondence to Qiang Zhu.

About this article

Cite this article

Zhu, Q., Keogh, E. Mining historical manuscripts with local color patches. Knowl Inf Syst 30, 637–665 (2012). https://doi.org/10.1007/s10115-011-0401-9

  • DOI: https://doi.org/10.1007/s10115-011-0401-9
