Abstract
The creation of structured digital libraries from paper-based archives is in growing demand in many scientific and cultural fields, a demand met neither by off-the-shelf OCR nor by commercial form-processing systems. This paper describes and evaluates a configurable archive construction system that integrates document image pre-processing and analysis with text post-processing tools and a standard OCR package to meet digital archiving requirements. The prototype system is currently being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera (butterflies and moths) and Coleoptera (beetles) into searchable digital archives. Evaluation results covering different aspects of the system, from card scanning to overall word recognition rates for different database fields, are summarised for two datasets comprising over 5,000 cards selected from different parts of these archives. First-pass end-to-end word recognition rates of 70–90% are reported for key data fields, subject to the availability of suitable electronic dictionaries. Further validation and correction are supported through web-editing of the online digital archive.
Cite this article
Downton, A., He, J. & Lucas, S. User-configurable OCR enhancement for online natural history archives. IJDAR 9, 263–279 (2007). https://doi.org/10.1007/s10032-006-0022-0