ABSTRACT
Censuses comprise a wealth of information at a large (national) scale that gives governments (who commission them) and the public a detailed snapshot of how people live (geographical distribution and characteristics). In addition to underpinning socio-economic research, the study of historical Census statistics provides a unique opportunity to understand a country's characteristics and its heritage. This paper gives an overview of the background, the challenges, the implemented pre-processing, recognition and post-processing pipeline, and the information-rich results obtained through a pilot digitisation project on the 1961 Census of England and Wales. For this census, computers were used for the first time to process data and output very detailed information, a vital part of which is only available in the form of degraded historical computer printouts. The experience gained and the resulting methodology can also be used for digitising and understanding tabular information in a large variety of application scenarios.
Index Terms
- Unearthing the Recent Past: Digitising and Understanding Statistical Information from Census Tables