DOI: 10.1145/3486622.3493928
research-article

A Framework for Duplicate Detection from Online Job Postings

Published: 13 April 2022

Abstract

Online job boards have greatly improved the efficiency of job searching and have also provided valuable data for labour market research. However, there is a high proportion of duplicate job postings on most (if not all) job boards, because recruiters and job boards seek to improve their coverage of the market by integrating job postings from many different sources. These duplicate postings undermine the usability of job boards and the quality of labour market analytics derived from them. In this paper, we tackle the challenging problem of duplicate detection from online job postings. Specifically, we design a framework for duplicate detection from online job postings and, under the framework, implement and test 24 methods built with four different tokenisers, three vectorisers and six similarity measures. We conduct a comparative study and experimental evaluation of the 24 methods and compare their performance with a baseline approach. All methods are tested on a real-world dataset from a job board platform and are evaluated with six performance metrics. The experiment reveals that the top two methods are Overlap with skip-gram (OS) and Overlap with n-gram (OG), followed by TFIDF-cosine with n-gram (TCG) and TFIDF-cosine with skip-gram (TCS), and that all four of these methods outperform the baseline approach in detecting duplicates.
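
To make the two top-ranked combinations more concrete, the following is a minimal sketch of how an overlap similarity could be computed over n-gram and skip-gram token sets. It is not the paper's implementation: the abstract does not define the tokenisers or the measure, so the word-level tokens, the choice of n = 2 and k = 1, the overlap coefficient |A ∩ B| / min(|A|, |B|), and the function names `ngrams`, `skipgrams` and `overlap` are all assumptions made for illustration.

```python
from itertools import combinations


def ngrams(tokens, n=2):
    """Return the set of contiguous word n-grams (assumed tokeniser)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def skipgrams(tokens, n=2, k=1):
    """Return the set of n-token subsequences that skip at most k tokens in total.

    Each gram starts at a token and picks the remaining n - 1 tokens, in order,
    from the next n - 1 + k tokens (assumed definition of a k-skip-n-gram).
    """
    grams = set()
    for start in range(len(tokens)):
        span = tokens[start:start + n + k]
        if len(span) < n:
            break  # not enough tokens left to form a gram
        head, tail = span[0], span[1:]
        for rest in combinations(tail, n - 1):
            grams.add((head, *rest))
    return grams


def overlap(a, b):
    """Overlap coefficient |A ∩ B| / min(|A|, |B|); 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Two hypothetical posting titles that differ by one word.
post_a = "senior data analyst in sydney".split()
post_b = "senior data analyst sydney".split()

og_score = overlap(ngrams(post_a), ngrams(post_b))        # Overlap with n-gram
os_score = overlap(skipgrams(post_a), skipgrams(post_b))  # Overlap with skip-gram
print(f"OG = {og_score:.3f}, OS = {os_score:.3f}")
```

In this toy pair the skip-gram sets share more elements than the contiguous n-gram sets, because a skip-gram can bridge the inserted word. A pair would then be flagged as a duplicate when its score exceeds a chosen cutoff, which this sketch leaves unspecified.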

Published In

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
December 2021
698 pages

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document analysis
  2. duplicate detection
  3. job posting
  4. text mining

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
December 14-17, 2021
Melbourne, VIC, Australia
