research-article

Canonicalizing Organization Names for Recruitment Domain

Authors:

Nausheen Fatma,

Niharika Sachdeva,

Nitendra Rajput

CoDS COMAD 2020: Proceedings of the 7th ACM IKDD CoDS and 25th COMAD

Pages 296 - 301

https://doi.org/10.1145/3371158.3371203

Published: 15 January 2020

Abstract

The online recruitment industry relies on various Knowledge Bases (KBs) to enable search and recommendation systems. These KBs comprise a large volume of diverse, non-standard named entities because they are created from vast amounts of unstructured user-generated content (mostly CVs). Such non-standard representation of each entity causes a significant vocabulary gap in the KB, which results in redundancy, incompleteness, and ambiguity in the retrieved information. The problem is even more challenging in domains where external sources of context do not exist.

To address these challenges, we propose a two-tier architecture that (a) finds the distance parameter for clustering entities using a novel pairwise similarity between all entity mentions, and (b) then uses these similarity scores to create canonical clusters, each representing a unique entity in the KB. Our experiments on proprietary data of 25,602 unique companies and 23,690 unique institutes show that the pairwise similarity score computed with a Siamese network outperforms standard string similarity measures, reaching F1-scores of 97% and 82%. Finally, clustering methods over the similarity scores achieve 90% and 80% micro F1-score.
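The two-tier idea (score every pair of mentions, then cluster on those scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the similarity function below is a plain character-level ratio from Python's standard library standing in for the Siamese network, the clustering is a simple single-linkage grouping, and the `threshold` value, the function names, and the sample mentions are all hypothetical. The paper derives its distance parameter from the similarity scores, whereas this sketch fixes it by hand.

```python
from __future__ import annotations

from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in pairwise score in [0, 1] (assumption: replaces the Siamese network)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def canonical_clusters(mentions: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Group mentions linked by a pairwise score >= threshold (single linkage)."""
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Tier 1: score all pairs. Tier 2: merge pairs that are similar enough.
    for i, j in combinations(range(len(mentions)), 2):
        if similarity(mentions[i], mentions[j]) >= threshold:
            parent[find(i)] = find(j)

    groups: dict[int, list[str]] = {}
    for idx, mention in enumerate(mentions):
        groups.setdefault(find(idx), []).append(mention)
    return sorted(groups.values())


# Hypothetical mentions, invented for illustration only.
mentions = ["Acme Corp", "ACME Corp.", "acme corp", "Globex Inc", "Globex Inc."]
clusters = canonical_clusters(mentions)
for cluster in clusters:
    print(cluster)
```

Comparing all pairs is quadratic in the number of mentions, so a sketch like this only suits small inputs; at the scale reported in the abstract, some form of candidate blocking or an index would likely be needed.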



Published In


CoDS COMAD 2020: Proceedings of the 7th ACM IKDD CoDS and 25th COMAD

January 2020

399 pages

ISBN:9781450377386

DOI:10.1145/3371158

General Chairs:
Vasudeva Varma,
Subbarao Kambhampati,
Program Chairs:
Arnab Bhattacharya,
Sriraam Natarajan,
Publications Chair:
Rishiraj Saha Roy

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference

CoDS COMAD 2020

CoDS COMAD 2020: 7th ACM IKDD CoDS and 25th COMAD

January 5 - 7, 2020

Hyderabad, India

Acceptance Rates

CoDS COMAD 2020 Paper Acceptance Rate 78 of 275 submissions, 28%;

Overall Acceptance Rate 197 of 680 submissions, 29%
