DOI: 10.1145/3371158.3371203
research-article

Canonicalizing Organization Names for Recruitment Domain

Published: 15 January 2020

Abstract

The online recruitment industry relies on various Knowledge Bases (KBs) to enable search and recommendation systems. These KBs comprise a large volume of diverse, non-standard named entities, as they are created from vast amounts of unstructured user-generated content (mostly CVs). Such non-standard representation of each entity causes a significant vocabulary gap in the KB, which results in redundancy, incompleteness, and ambiguity in the retrieved information. The problem is even more challenging in domains where external sources of context do not exist.
To address these challenges, we propose a two-tier architecture that (a) finds the distance parameter for clustering entities using a novel pairwise similarity between all entity mentions, and (b) then uses these similarity scores to create canonical clusters, each representing a unique entity in the KB. Our experiments on proprietary data of 25,602 unique companies and 23,690 unique institutes show that the pairwise similarity score using a Siamese network (97% and 82% F1-score) outperforms standard string similarity measures. Finally, clustering methods over the similarity scores achieve 90% and 80% micro F1-score.
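
The two-tier idea in the abstract (score every pair of mentions, then cluster on those scores) can be illustrated with a minimal sketch. Everything in it is an assumption for illustration only: `difflib.SequenceMatcher` stands in for the paper's Siamese network, the 0.8 threshold is hypothetical rather than the learned distance parameter, the single-linkage grouping via union-find is not necessarily the clustering method the authors use, and the company names are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in pairwise similarity in [0, 1].

    Assumption: a plain string-similarity ratio, NOT the paper's Siamese network.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def canonical_clusters(mentions, threshold=0.8):
    """Group mentions whose pairwise similarity reaches `threshold`.

    Assumption: single-linkage grouping with union-find; `threshold` is a
    hypothetical value, not the distance parameter found in the paper.
    """
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Tier 1: score every pair of mentions.
    for i, j in combinations(range(len(mentions)), 2):
        if similarity(mentions[i], mentions[j]) >= threshold:
            # Tier 2: merge the two clusters the pair belongs to.
            parent[find(j)] = find(i)

    clusters = {}
    for idx, mention in enumerate(mentions):
        clusters.setdefault(find(idx), []).append(mention)
    return list(clusters.values())


# Made-up mentions, for illustration only.
mentions = ["Acme Corp", "ACME Corp.", "Globex Inc", "Globex Inc."]
clusters = canonical_clusters(mentions)
print(clusters)
```

Scoring all pairs is quadratic in the number of mentions, so a sketch like this only suits small inputs; at the scale reported in the abstract, some form of candidate pruning would likely be needed.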

Cited By

  • (2023) Text Classification In The Wild: A Large-Scale Long-Tailed Name Normalization Dataset. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. DOI: 10.1109/ICASSP49357.2023.10096769. Online publication date: 4-Jun-2023
  • (2020) Canonicalizing Knowledge Bases for Recruitment Domain. Advances in Knowledge Discovery and Data Mining, 500-513. DOI: 10.1007/978-3-030-47436-2_38. Online publication date: 11-May-2020

Published In

CoDS COMAD 2020: Proceedings of the 7th ACM IKDD CoDS and 25th COMAD
January 2020
399 pages
ISBN:9781450377386
DOI:10.1145/3371158
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. datasets
  2. gaze detection
  3. neural networks
  4. text tagging

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CoDS COMAD 2020
CoDS COMAD 2020: 7th ACM IKDD CoDS and 25th COMAD
January 5 - 7, 2020
Hyderabad, India

Acceptance Rates

CoDS COMAD 2020 Paper Acceptance Rate 78 of 275 submissions, 28%;
Overall Acceptance Rate 197 of 680 submissions, 29%

Article Metrics

  • Downloads (Last 12 months): 6
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 20 Feb 2025
