Abstract
The increasing availability of digitised registration records presents a significant opportunity for research in many fields including those of human geography, genealogy and medicine. Re-examining original records allows researchers to study relationships between factors such as occupation, cause of death, illness and geographic region. This can be facilitated by coding these factors to standard classifications. This chapter describes work to develop a method for automatically coding the occupations from 29 million Scottish birth, death and marriage records, containing around 50 million occupation descriptions, to standard classifications. A range of approaches using text processing and supervised machine learning is evaluated, achieving classification performance of 75 % micro-precision/recall, 61 % macro-precision and 66 % macro-recall on a smaller test set. Further development that may be needed for classification of the full data set is discussed.
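The distinction between the micro- and macro-averaged figures quoted above can be illustrated with a small sketch. It assumes that each record receives exactly one predicted code and has exactly one gold-standard code, in which case micro-precision and micro-recall coincide, consistent with the single micro figure reported. The occupation labels, the function name `precision_recall` and the convention of scoring an undefined per-class ratio as zero are illustrative assumptions, not details taken from the chapter.

```python
from collections import Counter


def precision_recall(gold, predicted):
    """Micro- and macro-averaged precision/recall for single-label classification.

    Assumption: a class with a zero denominator (e.g. never predicted)
    contributes 0.0 to the macro average; other conventions exist.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # class p was predicted wrongly
            fn[g] += 1  # gold class g was missed

    classes = sorted(set(gold) | set(predicted))

    def ratio(num, den):
        return num / den if den else 0.0

    total_tp = sum(tp.values())
    return {
        # Micro: pool the counts over all classes, then take the ratio.
        "micro_precision": ratio(total_tp, total_tp + sum(fp.values())),
        "micro_recall": ratio(total_tp, total_tp + sum(fn.values())),
        # Macro: take the ratio per class, then average with equal class weight.
        "macro_precision": sum(ratio(tp[c], tp[c] + fp[c]) for c in classes) / len(classes),
        "macro_recall": sum(ratio(tp[c], tp[c] + fn[c]) for c in classes) / len(classes),
    }


# Hypothetical labels standing in for occupation codes.
gold = ["farmer", "farmer", "farmer", "weaver", "miner"]
predicted = ["farmer", "farmer", "weaver", "weaver", "farmer"]
scores = precision_recall(gold, predicted)
```

Because macro-averaging weights every class equally, rare classes that are classified poorly lower the macro figures more than the micro figures, which is one plausible reason for macro scores falling below micro scores on a skewed label distribution.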
Notes
1. The 50 % threshold was arbitrarily chosen; the effect of varying this may be investigated in future work.
2. For the string similarity classifier, which does not involve any training, these records were ignored.
3. Where a record had been originally multiple-coded, one of the codes was arbitrarily selected as ‘correct’.
References
Apache Software Foundation. (2011). Apache Mahout: Scalable machine learning and data mining. http://mahout.apache.org/. Accessed 1 Nov 2014.
Bottero, W., & Prandy, K. (2001). Women’s occupations and the social order in nineteenth century Britain. Sociological Research Online, 6(2).
Canadian Families Project. (2002). National sample of the 1901 census of Canada. Victoria: University of Victoria.
Carson, J. K., Kirby, G. N. C., Dearle, A., et al. (2013). Exploiting historical registers: Automatic methods for coding C19th and C20th cause of death descriptions to standard classifications. New Techniques and Technologies for Statistics, 598–607.
Darroch, G., & Ornstein, M. (1979). Canadian historical social mobility project. National sample of the 1871 census of Canada [computer file]. Toronto: York Institute for Social Research and Department of Sociology, York University.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In: Multiple classifier systems. Lecture notes in computer science, Vol. 1857 (pp. 1–15). Heidelberg: Springer.
Dillon, L. (2008). 1881 Canadian census project, North Atlantic population project, and Minnesota population center. National Sample of the 1881 Census of Canada (version 2.0). Montréal: Département de Démographie, Université de Montréal [distributor].
Efron, B., & Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3), 435–447.
HISCO. (2013). HISCO tree of occupational groups. http://historyofwork.iisg.nl/major.php. Accessed 1 Nov 2014.
Historical Sample of the Netherlands (HSN). (2010). Data set life courses release 2010.01.
Kirby, G. N. C., Carson, J. K., Dunlop, F. R. J., et al. (2014). Automatic methods for coding historical occupation descriptions to standard classifications. In: Proceedings Workshop on Population Reconstruction, Amsterdam, February 2014. International Institute of Social History.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings 14th International Joint Conference on Artificial Intelligence (pp. 1137–1143). Burlington: Morgan Kaufmann Publishers Inc.
Komarek, P. (2004). Logistic regression for data mining and high-dimensional classification. PhD Thesis, Carnegie Mellon University.
Langley, P., Iba, W., & Thompson, K. (1992). An analysis of Bayesian classifiers. In: Proceedings AAAI-92 (pp. 223–228).
Leonard, S. H., Anderton, D. L., & Swedlund, A. C. (2012). Grammars of death. University of Michigan/ICPSR. https://sites.google.com/a/umich.edu/grammars-of-death/home. Accessed 1 Nov 2014.
Minnesota Population Center. (2008). North Atlantic population project: Complete count microdata. Version 2.0 [Machine-readable database]. Minneapolis.
National Records Scotland. (2014). Set of cause of death records.
Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Proceedings Neural Information Processing Systems (pp. 841–848).
Norwegian Digital Archive (The National Archive), Norwegian Historical Data Centre (University of Tromsø) and the Minnesota Population Center. (2008a). National sample of the 1865 census of Norway, Version 2.0. Tromsø.
Norwegian Digital Archive (The National Archive), Norwegian Historical Data Centre (University of Tromsø) and the Minnesota Population Center. (2008b). National sample of the 1900 census of Norway, Version 2.0. Tromsø.
Norwegian Historical Data Centre (University of Tromsø) and the Minnesota Population Center. (2008). National sample of the 1875 census of Norway, Version 2.0. Tromsø.
Prandy, K., & Lambert, P. (2012). CAMSIS: Bibliographic review. http://www.camsis.stir.ac.uk/review.html. Accessed 1 Nov 2014.
Ruggles, S., et al. (2010). Integrated public use microdata series: Version 5.0 [Machine-readable database]. Minneapolis: University of Minnesota.
Schürer, K., & Woollard, M. (2003). National sample from the 1881 census of Great Britain [computer file], Colchester: History Data Service, UK Data Archive [distributor].
Schürer, K., & Woollard, M. (2008). National sample from the 1851 census of Great Britain [computer file], Colchester: History Data Service, UK Data Archive [distributor].
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 486–494.
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4), 427–437.
US Bureau of Labor Statistics. (2010). Standard occupational classification (SOC) system. http://www.bls.gov/soc/. Accessed 1 Nov 2014.
van Leeuwen, M. H. D., Maas, I., & Miles, A. (2002). HISCO: Historical international standard classification of occupations. Leuven: Leuven University Press.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Burlington: Morgan Kaufmann.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In: Proceedings 14th International Conference on Machine Learning, ACM (pp. 412–420).
Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings 21st International Conference on Machine Learning, ACM (pp. 116–123).
Acknowledgments
This work was supported by the Economic and Social Research Council grant numbers ES/K00574X/1 (Digitising Scotland), ES/L007487/1 (Administrative Data Research Centre—Scotland).
This chapter is an expanded version of a paper presented at the International Workshop on Population Reconstruction in Amsterdam, February 2014 (Kirby et al. 2014). We thank the original anonymous referee, many workshop participants and the book editors for their helpful comments.
We thank Richard Zijdeman for advice on historical coding and access to data from the Netherlands.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Kirby, G. et al. (2015). Automatic Methods for Coding Historical Occupation Descriptions to Standard Classifications. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_3
DOI: https://doi.org/10.1007/978-3-319-19884-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19883-5
Online ISBN: 978-3-319-19884-2
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)