Steered Training Data Generation for Learned Semantic Type Detection

Authors:

Sven Langenecker,

Christoph Sturm,

Christian Schalles,

Carsten Binnig

Proceedings of the ACM on Management of Data, Volume 1, Issue 2

Article No.: 201, Pages 1 - 25

https://doi.org/10.1145/3589786

Published: 20 June 2023

Abstract

In this paper, we introduce STEER to adapt learned semantic type extraction approaches to a new, unseen data lake. STEER provides a data programming framework for semantic labeling that is used to generate new labeled training data with minimal overhead. At its core, STEER comes with a novel training data generation procedure called Steered-Labeling that can generate high-quality training data not only for non-numeric but also for numeric columns. With this generated training data, STEER can fine-tune existing learned semantic type extraction models. We evaluate our approach on four different data lakes and show that we can significantly improve the performance of two different types of learned models across all data lakes.

Supplemental Material

MP4 File

Presentation video of the paper "Steered Training Data Generation for Learned Semantic Type Detection" by Sven Langenecker, Christoph Sturm, Christian Schalles, and Carsten Binnig. The presentation covers a new labeling framework for semantic type detection that comes with a novel Steered-Labeling procedure to generate high-quality training data for non-numeric as well as numeric table columns.

Index Terms

Steered Training Data Generation for Learned Semantic Type Detection
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Data management systems
    1. Information integration
      1. Extraction, transformation and loading

Information & Contributors

Information

Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 2

PACMMOD

June 2023

2310 pages

EISSN:2836-6573

DOI:10.1145/3605748

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 June 2023

Published in PACMMOD Volume 1, Issue 2

Qualifiers

Research-article
