Automatic Methods for Coding Historical Occupation Descriptions to Standard Classifications

Chapter in Population Reconstruction

Abstract

The increasing availability of digitised registration records presents a significant opportunity for research in many fields including those of human geography, genealogy and medicine. Re-examining original records allows researchers to study relationships between factors such as occupation, cause of death, illness and geographic region. This can be facilitated by coding these factors to standard classifications. This chapter describes work to develop a method for automatically coding the occupations from 29 million Scottish birth, death and marriage records, containing around 50 million occupation descriptions, to standard classifications. A range of approaches using text processing and supervised machine learning is evaluated, achieving classification performance of 75 % micro-precision/recall, 61 % macro-precision and 66 % macro-recall on a smaller test set. Further development that may be needed for classification of the full data set is discussed.
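The micro- and macro-averaged figures quoted in the abstract can be illustrated with a minimal sketch. It assumes single-label coding (one gold code and one predicted code per record), counts a class with an empty denominator as 0, and averages over every class seen in either list; these conventions, the function name `micro_macro_scores` and the occupation labels are illustrative assumptions, not the chapter's own implementation or data.

```python
from collections import Counter


def micro_macro_scores(gold, predicted):
    """Micro- and macro-averaged precision and recall for single-label coding."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[g] += 1  # the true class gains a false negative

    def ratio(num, den):
        # Assumed convention: an undefined ratio (zero denominator) counts as 0.
        return num / den if den else 0.0

    classes = sorted(set(gold) | set(predicted))
    n_tp, n_fp, n_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return {
        # Micro: pool the counts over all classes, then take one ratio.
        "micro_precision": ratio(n_tp, n_tp + n_fp),
        "micro_recall": ratio(n_tp, n_tp + n_fn),
        # Macro: take the ratio per class, then average with equal class weight.
        "macro_precision": ratio(
            sum(ratio(tp[c], tp[c] + fp[c]) for c in classes), len(classes)
        ),
        "macro_recall": ratio(
            sum(ratio(tp[c], tp[c] + fn[c]) for c in classes), len(classes)
        ),
    }


# Hypothetical labels, invented purely for illustration.
gold = ["farmer", "farmer", "farmer", "weaver", "miner"]
predicted = ["farmer", "farmer", "weaver", "weaver", "farmer"]
scores = micro_macro_scores(gold, predicted)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Under the single-label assumption every misclassified record adds exactly one false positive and one false negative, so micro-precision and micro-recall coincide, which is consistent with the abstract reporting them as a single figure. Macro averages weight rare and common classes equally, so they can differ markedly from the micro figure when performance on infrequent classes is weak.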

Notes

  1. The 50 % threshold was arbitrarily chosen; the effect of varying this may be investigated in future work.

  2. For the string similarity classifier, which does not involve any training, these records were ignored.

  3. Where a record had been originally multiple-coded, one of the codes was arbitrarily selected as ‘correct’.

References

  • Apache Software Foundation. (2011). Apache Mahout: Scalable machine learning and data mining. http://mahout.apache.org/. Accessed 1 Nov 2014.

  • Bottero, W., & Prandy, K. (2001). Women’s occupations and the social order in nineteenth century Britain. Sociological Research Online, 6(2).

  • Canadian Families Project. (2002). National sample of the 1901 census of Canada. Victoria: University of Victoria.

  • Carson, J. K., Kirby, G. N. C., Dearle, A., et al. (2013). Exploiting historical registers: Automatic methods for coding C19th and C20th cause of death descriptions to standard classifications. New Techniques and Technologies for Statistics, 598–607.

  • Darroch, G., & Ornstein, M. (1979). Canadian historical social mobility project. National sample of the 1871 census of Canada [computer file]. Toronto: York Institute for Social Research and Department of Sociology, York University.

  • Dietterich, T. G. (2000). Ensemble methods in machine learning. In: Multiple classifier systems. Lecture notes in computer science, Vol. 1857 (pp. 1–15). Heidelberg: Springer.

  • Dillon, L. (2008). 1881 Canadian census project, North Atlantic population project, and Minnesota population center. National Sample of the 1881 Census of Canada (version 2.0). Montréal: Département de Démographie, Université de Montréal [distributor].

  • Efron, B., & Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3), 435–447.

  • HISCO. (2013). HISCO tree of occupational groups. http://historyofwork.iisg.nl/major.php. Accessed 1 Nov 2014.

  • Historical Sample of the Netherlands (HSN). (2010). Data set life courses release 2010.01.

  • Kirby, G. N. C., Carson, J. K., Dunlop, F. R. J., et al. (2014). Automatic methods for coding historical occupation descriptions to standard classifications. In: Proceedings Workshop on Population Reconstruction, Amsterdam, February 2014. International Institute of Social History.

  • Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings 14th International Joint Conference on Artificial Intelligence (pp. 1137–1143). Burlington: Morgan Kaufmann Publishers Inc.

  • Komarek, P. (2004). Logistic regression for data mining and high-dimensional classification. PhD Thesis, Carnegie Mellon University.

  • Langley, P., Iba, W., & Thompson, K. (1992). An analysis of Bayesian classifiers. In: Proceedings AAAI-92 (pp. 223–228).

  • Leonard, S. H., Anderton, D. L., & Swedlund, A. C. (2012). Grammars of death. University of Michigan/ICPSR. https://sites.google.com/a/umich.edu/grammars-of-death/home. Accessed 1 Nov 2014.

  • Minnesota Population Center. (2008). North Atlantic population project: Complete count microdata. Version 2.0 [Machine-readable database]. Minneapolis.

  • National Records Scotland. (2014). Set of cause of death records.

  • Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Proceedings Neural Information Processing Systems (pp. 841–848).

  • Norwegian Digital Archive (The National Archive), Norwegian Historical Data Centre (University of Tromsø) and the Minnesota Population Center. (2008a). National sample of the 1865 census of Norway, Version 2.0. Tromsø.

  • Norwegian Digital Archive (The National Archive), Norwegian Historical Data Centre (University of Tromsø) and the Minnesota Population Center. (2008b). National sample of the 1900 census of Norway, Version 2.0. Tromsø.

  • Norwegian Historical Data Centre (University of Tromsø) and the Minnesota Population Center. (2008). National sample of the 1875 census of Norway, Version 2.0. Tromsø.

  • Prandy, K., & Lambert, P. (2012). CAMSIS: Bibliographic review. http://www.camsis.stir.ac.uk/review.html. Accessed 1 Nov 2014.

  • Ruggles, S., et al. (2010). Integrated public use microdata series: Version 5.0 [Machine-readable database]. Minneapolis: University of Minnesota.

  • Schürer, K., & Woollard, M. (2003). National sample from the 1881 census of Great Britain [computer file], Colchester: History Data Service, UK Data Archive [distributor].

  • Schürer, K., & Woollard, M. (2008). National sample from the 1851 census of Great Britain [computer file], Colchester: History Data Service, UK Data Archive [distributor].

  • Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 486–494.

  • Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4), 427–437.

  • US Bureau of Labor Statistics. (2010). Standard occupational classification (SOC) system. http://www.bls.gov/soc/. Accessed 1 Nov 2014.

  • van Leeuwen, M. H. D., Maas, I., & Miles, A. (2002). HISCO: Historical international standard classification of occupations. Leuven: Leuven University Press.

  • Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Burlington: Morgan Kaufmann.

  • Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In: Proceedings 14th International Conference on Machine Learning, ACM (pp. 412–420).

  • Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings 21st International Conference on Machine Learning, ACM (pp. 116–123).

Acknowledgments

This work was supported by the Economic and Social Research Council, grant numbers ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Centre - Scotland).

This chapter is an expanded version of a paper presented at the International Workshop on Population Reconstruction in Amsterdam, February 2014 (Kirby et al. 2014). We thank the original anonymous referee, many workshop participants and the book editors for their helpful comments.

We thank Richard Zijdeman for advice on historical coding and access to data from the Netherlands.

Corresponding author

Correspondence to Graham Kirby.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Kirby, G. et al. (2015). Automatic Methods for Coding Historical Occupation Descriptions to Standard Classifications. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_3
