Automatic Methods for Coding Historical Occupation Descriptions to Standard Classifications

Kirby, Graham; Carson, Jamie; Dunlop, Fraser; Dibben, Chris; Dearle, Alan; Williamson, Lee; Garrett, Eilidh; Reid, Alice

doi:10.1007/978-3-319-19884-2_3


Abstract

The increasing availability of digitised registration records presents a significant opportunity for research in many fields including those of human geography, genealogy and medicine. Re-examining original records allows researchers to study relationships between factors such as occupation, cause of death, illness and geographic region. This can be facilitated by coding these factors to standard classifications. This chapter describes work to develop a method for automatically coding the occupations from 29 million Scottish birth, death and marriage records, containing around 50 million occupation descriptions, to standard classifications. A range of approaches using text processing and supervised machine learning is evaluated, achieving classification performance of 75 % micro-precision/recall, 61 % macro-precision and 66 % macro-recall on a smaller test set. Further development that may be needed for classification of the full data set is discussed.
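
The difference between the micro- and macro-averaged figures quoted above can be illustrated with a minimal sketch. It assumes each description receives exactly one true code and one predicted code; the function name, the toy labels "A" and "B", and the convention of scoring an undefined per-class ratio as zero are illustrative assumptions, not details taken from the chapter.

```python
from collections import Counter


def micro_macro_scores(true_codes, predicted_codes):
    """Micro- and macro-averaged precision and recall for single-label coding."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_codes, predicted_codes):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the predicted code gains a false positive
            fn[truth] += 1  # the true code gains a false negative

    classes = sorted(set(true_codes) | set(predicted_codes))
    if not classes:
        raise ValueError("no records to score")

    def ratio(numerator, denominator):
        # Assumption: an undefined per-class ratio is scored as 0.
        return numerator / denominator if denominator else 0.0

    total_tp = sum(tp.values())
    return {
        # Micro: pool the counts over all classes, then divide once.
        "micro_precision": ratio(total_tp, total_tp + sum(fp.values())),
        "micro_recall": ratio(total_tp, total_tp + sum(fn.values())),
        # Macro: compute the ratio per class, then take the unweighted mean.
        "macro_precision": sum(ratio(tp[c], tp[c] + fp[c]) for c in classes) / len(classes),
        "macro_recall": sum(ratio(tp[c], tp[c] + fn[c]) for c in classes) / len(classes),
    }


# Toy example with hypothetical codes: three records truly "A", one truly "B".
scores = micro_macro_scores(["A", "A", "A", "B"], ["A", "A", "B", "B"])
print(scores)
```

When every record has one true and one predicted code, each error counts once as a false positive and once as a false negative, so micro-precision and micro-recall coincide, which is consistent with the single micro figure reported. The macro averages weight every class equally and can therefore differ from the micro figure when rare classes are coded less accurately.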


Notes

1. The 50 % threshold was arbitrarily chosen; the effect of varying this may be investigated in future work.
2. For the string similarity classifier, which does not involve any training, these records were ignored.
3. Where a record had been originally multiple-coded, one of the codes was arbitrarily selected as ‘correct’.

References

Apache Software Foundation. (2011). Apache Mahout: Scalable machine learning and data mining. http://mahout.apache.org/. Accessed 1 Nov 2014.
Bottero, W., & Prandy, K. (2001). Women’s occupations and the social order in nineteenth century Britain. Sociological Research Online, 6(2).
Canadian Families Project. (2002). National sample of the 1901 census of Canada. Victoria: University of Victoria.
Carson, J. K., Kirby, G. N. C., Dearle, A., et al. (2013). Exploiting historical registers: Automatic methods for coding C19th and C20th cause of death descriptions to standard classifications. New Techniques and Technologies for Statistics, 598–607.
Darroch, G., & Ornstein, M. (1979). Canadian historical social mobility project. National sample of the 1871 census of Canada [computer file]. Toronto: York Institute for Social Research and Department of Sociology, York University.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In: Multiple classifier systems. Lecture notes in computer science, Vol. 1857 (pp. 1–15). Heidelberg: Springer.
Dillon, L. (2008). 1881 Canadian census project, North Atlantic population project, and Minnesota population center. National Sample of the 1881 Census of Canada (version 2.0). Montréal: Département de Démographie, Université de Montréal [distributor].
Efron, B., & Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3), 435–447.
HISCO. (2013). HISCO tree of occupational groups. http://historyofwork.iisg.nl/major.php. Accessed 1 Nov 2014.
Historical Sample of the Netherlands (HSN). (2010). Data set life courses release 2010.01.
Kirby, G. N. C., Carson, J. K., Dunlop, F. R. J., et al. (2014). Automatic methods for coding historical occupation descriptions to standard classifications. In: Proceedings Workshop on Population Reconstruction, Amsterdam, February 2014. International Institute of Social History.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings 14th International Joint Conference on Artificial Intelligence (pp. 1137–1143). Burlington: Morgan Kaufmann Publishers Inc.
Komarek, P. (2004). Logistic regression for data mining and high-dimensional classification. PhD Thesis, Carnegie Mellon University.
Langley, P., Iba, W., & Thompson, K. (1992). An analysis of Bayesian classifiers. In: Proceedings AAAI-92 (pp. 223–228).
Leonard, S. H., Anderton, D. L., & Swedlund, A. C. (2012). Grammars of death. University of Michigan/ICPSR. https://sites.google.com/a/umich.edu/grammars-of-death/home. Accessed 1 Nov 2014.
Minnesota Population Center. (2008). North Atlantic population project: Complete count microdata. Version 2.0 [Machine-readable database]. Minneapolis.
National Records Scotland. (2014). Set of cause of death records.
Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Proceedings Neural Information Processing Systems (pp. 841–848).
Norwegian Digital Archive (The National Archive), Norwegian Historical Data Centre (University of Tromsø) and the Minnesota Population Center. (2008a). National sample of the 1865 census of Norway, Version 2.0. Tromsø.
Norwegian Digital Archive (The National Archive), Norwegian Historical Data Centre (University of Tromsø) and the Minnesota Population Center. (2008b). National sample of the 1900 census of Norway, Version 2.0. Tromsø.
Norwegian Historical Data Centre (University of Tromsø) and the Minnesota Population Center. (2008). National sample of the 1875 census of Norway, Version 2.0. Tromsø.
Prandy, K., & Lambert, P. (2012). CAMSIS: Bibliographic review. http://www.camsis.stir.ac.uk/review.html. Accessed 1 Nov 2014.
Ruggles, S., et al. (2010). Integrated public use microdata series: Version 5.0 [Machine-readable database]. Minneapolis: University of Minnesota.
Schürer, K., & Woollard, M. (2003). National sample from the 1881 census of Great Britain [computer file], Colchester: History Data Service, UK Data Archive [distributor].
Schürer, K., & Woollard, M. (2008). National sample from the 1851 census of Great Britain [computer file], Colchester: History Data Service, UK Data Archive [distributor].
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 486–494.
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4), 427–437.
US Bureau of Labor Statistics. (2010). Standard occupational classification (SOC) system. http://www.bls.gov/soc/. Accessed 1 Nov 2014.
van Leeuwen, M. H. D., Maas, I., & Miles, A. (2002). HISCO: Historical international standard classification of occupations. Leuven: Leuven University Press.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Burlington: Morgan Kaufmann.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In: Proceedings 14th International Conference on Machine Learning, ACM (pp. 412–420).
Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings 21st International Conference on Machine Learning, ACM (pp. 116–123).


Acknowledgments

This work was supported by the Economic and Social Research Council, grant numbers ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Centre – Scotland).

This chapter is an expanded version of a paper presented at the International Workshop on Population Reconstruction in Amsterdam, February 2014 (Kirby et al. 2014). We thank the original anonymous referee, many workshop participants and the book editors for their helpful comments.

We thank Richard Zijdeman for advice on historical coding and access to data from the Netherlands.

Author information

Authors and Affiliations

University of St Andrews, St Andrews, UK
Graham Kirby, Jamie Carson, Fraser Dunlop & Alan Dearle
University of Edinburgh, Edinburgh, UK
Chris Dibben & Lee Williamson
University of Cambridge, Cambridge, UK
Eilidh Garrett & Alice Reid


Corresponding author

Correspondence to Graham Kirby.

Editor information

Editors and Affiliations

Utrecht University, Utrecht, The Netherlands
Gerrit Bloothooft
The Australian National University, Canberra, Australian Capital Territory, Australia
Peter Christen
International Institute of Social History, Amsterdam, The Netherlands
Kees Mandemakers
Leiden University, Leiden, The Netherlands
Marijn Schraagen


About this chapter

Cite this chapter

Kirby, G. et al. (2015). Automatic Methods for Coding Historical Occupation Descriptions to Standard Classifications. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_3


DOI: https://doi.org/10.1007/978-3-319-19884-2_3
Published: 23 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19883-5
Online ISBN: 978-3-319-19884-2
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
