
Unearthing the Recent Past: Digitising and Understanding Statistical Information from Census Tables

Authors:
Christian Clausner, PRImA Research Lab, The University of Salford, United Kingdom
Justin Hayes, Jisc, Manchester, UK
Apostolos Antonacopoulos, PRImA Research Lab, The University of Salford, United Kingdom
Stefan Pletschacher, PRImA Research Lab, The University of Salford, United Kingdom

DATeCH2017: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, June 2017, Pages 149–154, https://doi.org/10.1145/3078081.3078106

Published: 01 June 2017

ABSTRACT

Censuses comprise a wealth of information at a large (national) scale, allowing governments (who commission them) and the public to obtain a detailed snapshot of how people live (geographical distribution and characteristics). In addition to underpinning socio-economic research, the study of historical Census statistics provides a unique opportunity to understand several characteristics of a country and its heritage. This paper presents a complete account of the background, the challenges, the implemented preprocessing, recognition and post-processing pipeline, and the information-rich results obtained through a pilot digitisation project on the 1961 Census of England and Wales. That census was the first in which computers were used to process data and output very detailed information, a vital part of which is only available in the form of degraded historical computer printouts. The experience gained and the resulting methodology can also be used for digitising and understanding tabular information in a large variety of application scenarios.

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture

Published in

DATeCH2017: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage
June 2017
179 pages
ISBN: 9781450352659
DOI: 10.1145/3078081

Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Census
Cultural Heritage
Digitisation
Historical
Post-processing
Preprocessing
Printed documents
Recognition
Tabular data
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
DATeCH2017 Paper Acceptance Rate: 29 of 37 submissions, 78%. Overall Acceptance Rate: 60 of 86 submissions, 70%.