DOI: 10.1145/3477495.3531876
short-paper

End-to-end Distantly Supervised Information Extraction with Retrieval Augmentation

Published: 07 July 2022

Abstract

Distant supervision (DS) is a prevalent approach to generating labeled data for information extraction (IE) tasks. However, DS often suffers from noisy labels, because the labels are extracted from a knowledge base (KB) regardless of the input context. Many efforts have been devoted to designing denoising mechanisms, but most strategies are designed for one specific task and cannot be directly adapted to others. We propose a general paradigm, Dasiera, to resolve issues in KB-based DS. Labels from a KB can be viewed as universal labels of a target entity or an entity pair, whereas the given context for an IE task may contain only partial or no information about the target entities, or the information it entails may be vague. Hence IE training datasets can contain mismatches between the given context and the KB labels, i.e., the given context has insufficient information to infer the DS labels. To solve this problem, Dasiera leverages a retrieval-augmentation mechanism during training to complete the missing information of the given context, seamlessly integrating a neural retriever and a general predictor in an end-to-end framework. During inference, the retrieval component can be kept or removed depending on whether predictions should rely solely on the given context. We evaluate Dasiera on two IE tasks under the DS setting: named entity typing and relation extraction. Experimental results show that Dasiera outperforms the baselines on both tasks.
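
The abstract does not spell out how the retriever and the predictor are combined, so the following is only a minimal sketch of one common retrieval-augmentation pattern, not Dasiera's actual architecture. It assumes a dense inner-product retriever and a predictor whose label distribution is marginalized over the top-k retrieved passages. All names (`retrieve`, `predict`, `label_distribution`, `use_retrieval`) and the toy vectors are hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, passage_vecs, k):
    """Hypothetical dense retriever: top-k passages by inner product,
    with a softmax over their scores as retrieval probabilities."""
    order = sorted(range(len(passage_vecs)),
                   key=lambda i: dot(query_vec, passage_vecs[i]),
                   reverse=True)[:k]
    probs = softmax([dot(query_vec, passage_vecs[i]) for i in order])
    return order, probs


def predict(context_vec, label_vecs, passage_vec=None):
    """Toy predictor (assumption): adds the passage vector to the context
    vector, then scores each label by inner product."""
    fused = context_vec
    if passage_vec is not None:
        fused = [c + p for c, p in zip(context_vec, passage_vec)]
    return softmax([dot(fused, label) for label in label_vecs])


def label_distribution(context_vec, label_vecs, passage_vecs,
                       k=2, use_retrieval=True):
    """p(y | x) = sum_z p(z | x) * p(y | x, z) over the top-k passages z.
    With use_retrieval=False the prediction uses the given context only."""
    if not use_retrieval:
        return predict(context_vec, label_vecs)
    indices, p_z = retrieve(context_vec, passage_vecs, k)
    mixed = [0.0] * len(label_vecs)
    for i, pz in zip(indices, p_z):
        p_y = predict(context_vec, label_vecs, passage_vecs[i])
        for j, py in enumerate(p_y):
            mixed[j] += pz * py
    return mixed


# Toy data: a 2-d context, two candidate labels, three passages.
context = [1.0, 0.0]
labels = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.5, 2.0], [0.9, 1.5], [-1.0, 0.0]]

top_indices, _ = retrieve(context, passages, k=2)
with_retrieval = label_distribution(context, labels, passages, k=2)
context_only = label_distribution(context, labels, passages,
                                  use_retrieval=False)
print(top_indices, with_retrieval, context_only)
```

In this toy setup the retrieved passages carry evidence for the second label, so its probability is higher with retrieval than with the context alone. The `use_retrieval` flag mirrors the option of keeping or removing the retrieval component at inference time.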

Supplementary Material

MP4 File (SIGIR22_sp1915.mp4)
Presentation video.


Cited By

  • (2024) Reading Broadly to Open Your Mind: Improving Open Relation Extraction With Search Documents Under Self-Supervisions. IEEE Transactions on Knowledge and Data Engineering 36:5 (2026-2040). DOI: 10.1109/TKDE.2023.3317139. Online publication date: May-2024
  • (2024) Improving Distantly-Supervised Relation Extraction through Label Prompt. 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD) (606-611). DOI: 10.1109/CSCWD61410.2024.10579999. Online publication date: 8-May-2024
  • (2023) Do-GOOD: Towards Distribution Shift Evaluation for Pre-Trained Visual Document Understanding Models. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (569-579). DOI: 10.1145/3539618.3591670. Online publication date: 19-Jul-2023

    Published In

    SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2022
    3569 pages
    ISBN:9781450387323
    DOI:10.1145/3477495
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. distant supervision
    2. information extraction
    3. named entity typing
    4. relation extraction
    5. retrieval augmentation

    Qualifiers

    • Short-paper

    Conference

    SIGIR '22
    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
