User-configurable OCR enhancement for online natural history archives

  • Original Paper
  • International Journal of Document Analysis and Recognition (IJDAR)

Abstract

The creation of structured digital libraries from paper-based archives is an area of growing demand in many scientific and cultural fields, but this demand is met neither by off-the-shelf OCR nor by commercial form-processing systems. This paper describes and evaluates a configurable archive construction system, which integrates document image pre-processing and analysis with text post-processing tools and a standard OCR package to meet digital archiving requirements. The prototype system is currently being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera (Butterflies and Moths) and Coleoptera (Beetles) to searchable digital archives. Evaluation results covering different aspects of the system, from card scanning to overall word recognition rates for different database fields, are summarised for two datasets comprising over 5,000 cards selected from different parts of these archives. First-pass end-to-end word recognition rates of 70–90% are reported for key data fields, subject to the availability of suitable electronic dictionaries. Further validation and correction are supported through web-editing of the online digital archive.

Author information

Correspondence to Andy Downton.

About this article

Cite this article

Downton, A., He, J. & Lucas, S. User-configurable OCR enhancement for online natural history archives. IJDAR 9, 263–279 (2007). https://doi.org/10.1007/s10032-006-0022-0

  • DOI: https://doi.org/10.1007/s10032-006-0022-0
