DOI: 10.1145/3394486.3406468
tutorial

Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web

Published: 20 August 2020

Abstract

How do we surface the large amount of information present in HTML documents on the Web, from news articles to Rotten Tomatoes pages to tables of sports scores? Such information can enable a variety of applications including knowledge base construction, question answering, recommendation, and more. In this tutorial, we present approaches for information extraction (IE) from Web data that can be differentiated along two key dimensions: 1) the diversity in data modality that is leveraged, e.g. text, visual, XML/HTML, and 2) the thrust to develop scalable approaches with zero to limited human supervision.

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020
3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. information extraction
  2. semi-structured data
  3. web extraction
  4. web mining

Qualifiers

  • Tutorial

Conference

KDD '20

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2023) Evolution of Big Data Models from Hierarchical Models to Knowledge Graphs. 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC), 1325-1330. DOI: 10.1109/COMPSAC57700.2023.00201. Online publication date: Jun-2023
  • (2022) Nested Named Entity Recognition: A Survey. ACM Transactions on Knowledge Discovery from Data 16:6, 1-29. DOI: 10.1145/3522593. Online publication date: 30-Jul-2022
  • (2022) A coral-reef approach to extract information from HTML tables. Applied Soft Computing 115:C. DOI: 10.1016/j.asoc.2021.107980. Online publication date: 6-May-2022
  • (2022) Table understanding: Problem overview. WIREs Data Mining and Knowledge Discovery 13:1. DOI: 10.1002/widm.1482. Online publication date: 21-Nov-2022
  • (2021) Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 3231-3239. DOI: 10.1145/3447548.3467196. Online publication date: 14-Aug-2021
  • (2021) TCN: Table Convolutional Network for Web Table Interpretation. Proceedings of the Web Conference 2021, 4020-4032. DOI: 10.1145/3442381.3450090. Online publication date: 19-Apr-2021
  • (2020) Analytics of Similar-Sounding Names from the Web with Phonetic Based Clustering. 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 580-585. DOI: 10.1109/WIIAT50758.2020.00087. Online publication date: Dec-2020
