DOI: 10.1145/3589335.3651946
research-article

LLM Driven Web Profile Extraction for Identical Names

Published: 13 May 2024

Abstract

The number of individuals with identical names on the internet is increasing, which makes searching for a specific individual tedious: the user must sift through many profiles with identical names to reach the individual of interest. An individual's online presence forms that individual's profile. We need a solution that helps users by consolidating such profiles, retrieving factual information available on the web and presenting it as a single result. We present a novel solution that retrieves web profiles belonging to people bearing identical full names through an end-to-end pipeline. Our solution involves information retrieval from the web (extraction), LLM-driven Named Entity Extraction (retrieval), and standardization of facts using Wikipedia, and it returns profiles with fourteen multi-valued attributes. Profiles that correspond to the same real-world individual are then determined. We accomplish this by identifying similarities among profiles based on the extracted facts using a Prefix Tree inspired data structure (validation) and by utilizing ChatGPT's contextual comprehension (revalidation). The system offers three levels of strictness when consolidating these profiles: strict, relaxed, and loose matching. The novelty of our solution lies in the innovative use of GPT, a highly powerful yet unpredictable tool, for such a nuanced task. A study involving twenty participants, along with other results, found that one could effectively retrieve information for a specific individual.
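
To make the validation step more concrete, the following is a minimal illustrative sketch of how profiles with multi-valued attributes might be grouped at different strictness levels. It is not the authors' implementation: the abstract does not specify the data structure or the matching criteria, so the two-level attribute-to-value index, the attribute names, the `MIN_SHARED` thresholds, and all function names are assumptions made purely for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical thresholds: the minimum number of attributes on which two
# profiles must share a value. The paper's actual criteria are not stated.
MIN_SHARED = {"strict": 3, "relaxed": 2, "loose": 1}


def build_index(profiles):
    """Two-level prefix structure: attribute -> normalized value -> profile ids."""
    index = defaultdict(lambda: defaultdict(set))
    for pid, attrs in profiles.items():
        for attr, values in attrs.items():
            for value in values:
                index[attr][value.strip().lower()].add(pid)
    return index


def shared_attributes(index):
    """Map each profile pair to the attributes on which they share a value."""
    shared = defaultdict(set)
    for attr, values in index.items():
        for pids in values.values():
            for a, b in combinations(sorted(pids), 2):
                shared[(a, b)].add(attr)
    return shared


def consolidate(profiles, level="strict"):
    """Group profile ids whose overlap meets the chosen level (union-find)."""
    parent = {pid: pid for pid in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), attrs in shared_attributes(build_index(profiles)).items():
        if len(attrs) >= MIN_SHARED[level]:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for pid in profiles:
        groups[find(pid)].add(pid)
    return sorted(sorted(g) for g in groups.values())


# Toy profiles with made-up attributes and values.
profiles = {
    "p1": {"occupation": {"Professor"}, "employer": {"MIT"}, "city": {"Boston"}},
    "p2": {"occupation": {"professor"}, "employer": {"MIT"}, "city": {"Cambridge"}},
    "p3": {"occupation": {"Chef"}, "employer": {"Bistro"}, "city": {"Boston"}},
}

for level in ("strict", "relaxed", "loose"):
    print(level, consolidate(profiles, level))
```

In this toy data, `p1` and `p2` share values on two attributes and `p1` and `p3` on one, so lowering the threshold merges progressively more profiles. A real system would also need the revalidation step the abstract describes, which this sketch does not attempt.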

Published In

WWW '24: Companion Proceedings of the ACM Web Conference 2024
May 2024
1928 pages
ISBN: 9798400701726
DOI: 10.1145/3589335
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. generative pre-trained transformer
  2. identical names
  3. large language model
  4. name entity recognition
  5. web profile extraction

Conference

WWW '24: The ACM Web Conference 2024
May 13 - 17, 2024
Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
