DOI: 10.1145/3219819.3219842
research-article

Lessons Learned from Developing and Deploying a Large-Scale Employer Name Normalization System for Online Recruitment

Published: 19 July 2018

Abstract

Employer name normalization, or linking employer names in job postings or resumes to entities in an employer knowledge base (KB), is important for many downstream applications in the online recruitment domain. Key challenges for employer name normalization include handling employer names from both job postings and resumes, leveraging the corresponding location and URL context, and handling name variations and duplicates in the KB. In this paper, we describe the CompanyDepot system developed at CareerBuilder, which uses machine learning techniques to address these challenges. We discuss the main challenges and share our lessons learned in deployment, maintenance, and utilization of the system over the past two years. We also share several examples of how the system has been used in applications at CareerBuilder to deliver value to end customers.
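
To make the task concrete, here is a minimal sketch of linking a raw employer name to a KB entity using string similarity plus optional location and URL context. It is not the CompanyDepot method, which the abstract describes only as using machine learning techniques. The toy KB, the suffix list, the bonus weights, the threshold, and the function names `clean`, `score`, and `link` are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical toy knowledge base; ids, names, cities, and domains are invented.
KB = [
    {"id": "E1", "name": "Acme Corporation", "city": "Atlanta", "domain": "acme.example"},
    {"id": "E2", "name": "Acme Staffing", "city": "Chicago", "domain": "acmestaffing.example"},
]

# Assumed list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "co"}


def clean(name):
    """Lowercase, replace punctuation with spaces, and drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def score(query, entity, city=None, url=None):
    """Name similarity in [0, 1] plus hand-picked bonuses for matching context."""
    s = SequenceMatcher(None, clean(query), clean(entity["name"])).ratio()
    if city and city.lower() == entity["city"].lower():
        s += 0.2  # assumed weight for a location match
    if url and entity["domain"] in url.lower():
        s += 0.3  # assumed weight for a URL match
    return s


def link(query, city=None, url=None, threshold=0.6):
    """Return the id of the best-scoring KB entity, or None if below threshold."""
    best = max(KB, key=lambda e: score(query, e, city, url))
    return best["id"] if score(query, best, city, url) >= threshold else None


print(link("ACME Corp.", city="Atlanta"))
print(link("Acme Staffing LLC", url="https://www.acmestaffing.example/jobs"))
print(link("Zebra Widgets"))
```

A production system would replace the hand-picked weights with a learned model and would also need to handle duplicate entities in the KB, which this sketch ignores.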

Cited By

  • (2018) Automatically Detecting Errors in Employer Industry Classification Using Job Postings. Data Science and Engineering 3:3 (221-231). DOI: 10.1007/s41019-018-0071-7. Online publication date: 19-Aug-2018

        Published In

        KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
        July 2018
        2925 pages
        ISBN:9781450355520
        DOI:10.1145/3219819
        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. deployment
        2. employer name normalization
        3. entity linking

        Qualifiers

        • Research-article

        Conference

        KDD '18
        Acceptance Rates

        KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;
        Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
