research-article

LLM Driven Web Profile Extraction for Identical Names

Authors:

Prateek Sancheti,

Kamalakar Karlapalem,

Kavita Vemuri

WWW '24: Companion Proceedings of the ACM Web Conference 2024

Pages 1616 - 1625

https://doi.org/10.1145/3589335.3651946

Published: 13 May 2024

Abstract

The number of individuals with identical names on the internet is increasing, which makes searching for a specific individual tedious. The user must sift through many profiles with identical names to reach the individual of interest. An individual's online presence forms that individual's profile. We need a solution that helps users by consolidating the profiles of such individuals: retrieving factual information available on the web and presenting it as a single result. We present a novel solution that retrieves web profiles belonging to people who share an identical full name through an end-to-end pipeline. Our solution involves information retrieval from the web (extraction), LLM-driven Named Entity Extraction (retrieval), and standardization of facts using Wikipedia, which returns profiles with fourteen multi-valued attributes. After that, profiles that correspond to the same real-world individual are determined. We accomplish this by identifying similarities among profiles based on the extracted facts using a Prefix Tree inspired data structure (validation) and by utilizing ChatGPT's contextual comprehension (revalidation). The system offers three levels of strictness when consolidating these profiles: strict, relaxed, and loose matching. The novelty of our solution lies in the innovative use of GPT, a powerful yet unpredictable tool, for such a nuanced task. A study involving twenty participants, along with other results, found that one could effectively retrieve information for a specific individual.
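The validation step can be pictured with a small sketch. The abstract does not specify the data structure or define the three matching levels, so everything below is an illustrative assumption rather than the authors' method: a character-level prefix tree indexes normalized `attribute=value` facts, and two profiles are merged when they agree on at least a hypothetical minimum number of attributes (`MIN_SHARED`). The class name `FactTrie`, the function `consolidate`, and the thresholds are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical thresholds (not from the paper): the minimum number of
# attributes on which two profiles must share a value to be merged.
MIN_SHARED = {"strict": 3, "relaxed": 2, "loose": 1}


class FactTrie:
    """Character-level prefix tree mapping 'attribute=value' keys to profile ids."""

    def __init__(self):
        self.children = {}
        self.profile_ids = set()

    def insert(self, key, profile_id):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, FactTrie())
        node.profile_ids.add(profile_id)

    def terminals(self):
        """Yield (key, profile_ids) for every key stored in the tree."""
        stack = [("", self)]
        while stack:
            prefix, node = stack.pop()
            if node.profile_ids:
                yield prefix, node.profile_ids
            for ch, child in node.children.items():
                stack.append((prefix + ch, child))


def consolidate(profiles, mode="strict"):
    """Group profile ids that share enough facts under the chosen mode."""
    trie = FactTrie()
    for pid, attrs in profiles.items():
        for attr, values in attrs.items():
            for value in values:
                # Normalize values so trivially different spellings collide.
                trie.insert(f"{attr}={value.strip().lower()}", pid)

    # For each pair of profiles, collect the attributes on which they agree.
    shared = defaultdict(set)
    for key, pids in trie.terminals():
        attr = key.split("=", 1)[0]
        for a, b in combinations(sorted(pids), 2):
            shared[(a, b)].add(attr)

    # Union-find: merge every pair that meets the threshold for this mode.
    parent = {pid: pid for pid in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), attrs in shared.items():
        if len(attrs) >= MIN_SHARED[mode]:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for pid in profiles:
        groups[find(pid)].add(pid)
    return sorted(sorted(g) for g in groups.values())
```

Here `profiles` maps a profile id to a dictionary of multi-valued attributes, for example `{"p1": {"employer": {"IIIT Hyderabad"}}}`. Calling `consolidate(profiles, mode="loose")` merges more aggressively than `mode="strict"`. In the described system an LLM-based revalidation pass follows this step; the sketch does not attempt to model it.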

Index Terms

LLM Driven Web Profile Extraction for Identical Names
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval models and ranking
      1. Language models
  2. World Wide Web
    1. Web mining
      1. Data extraction and integration
        1. Search results deduplication
    2. Web searching and information discovery

Published In

WWW '24: Companion Proceedings of the ACM Web Conference 2024

May 2024

1928 pages

ISBN: 9798400701726

DOI: 10.1145/3589335

General Chairs: Tat-Seng Chua (National University of Singapore), Chong-Wah Ngo (Singapore Management University)

Program Chairs: Ravi Kumar (Google), Hady W. Lauw (Singapore Management University), Roy Ka-Wei Lee (Singapore University of Technology and Design)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WWW '24

Sponsor:

SIGWEB

WWW '24: The ACM Web Conference 2024

May 13 - 17, 2024

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
