research-article

Steered Training Data Generation for Learned Semantic Type Detection

Published: 20 June 2023

Abstract

In this paper, we introduce STEER, which adapts learned semantic type extraction approaches to a new, unseen data lake. STEER provides a data programming framework for semantic labeling that generates new labeled training data with minimal overhead. At its core, STEER comes with a novel training data generation procedure called Steered-Labeling that can generate high-quality training data not only for non-numeric but also for numerical columns. With this generated training data, STEER is able to fine-tune existing learned semantic type extraction models. We evaluate our approach on four different data lakes and show that we can significantly improve the performance of two different types of learned models across all data lakes.
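To make the data programming idea concrete, the following is a minimal, generic sketch of how labeling functions can vote on a column's semantic type. It is not the Steered-Labeling procedure from the paper, whose details are not given in this abstract. The function names, the type labels ("email", "year"), the thresholds, and the majority-vote aggregation are all assumptions chosen for illustration.

```python
import re
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical labeling functions: each takes a column header and its values
# and either votes for a semantic type or abstains.

def lf_email(header, values):
    """Vote 'email' if more than 80% of the values look like e-mail addresses."""
    if not values:
        return ABSTAIN
    pattern = r"[^@\s]+@[^@\s]+\.[^@\s]+"
    hits = sum(bool(re.fullmatch(pattern, v)) for v in values)
    return "email" if hits / len(values) > 0.8 else ABSTAIN

def lf_year_values(header, values):
    """Vote 'year' if every value is an integer in an assumed plausible range."""
    if not values:
        return ABSTAIN
    try:
        return "year" if all(1000 <= int(v) <= 2100 for v in values) else ABSTAIN
    except ValueError:  # at least one value is not an integer
        return ABSTAIN

def lf_year_header(header, values):
    """Vote 'year' if the header mentions the word 'year'."""
    return "year" if "year" in header.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_email, lf_year_values, lf_year_header]

def label_column(header, values, lfs=LABELING_FUNCTIONS):
    """Aggregate the non-abstaining votes by simple majority (an assumption)."""
    votes = [lf(header, values) for lf in lfs]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    print(label_column("contact", ["a@x.org", "b@y.com"]))
    print(label_column("founded", ["1999", "2004"]))
    print(label_column("misc", ["foo", "bar"]))
```

Columns that receive a label this way could serve as weakly labeled training examples for fine-tuning a model, while columns on which every function abstains stay unlabeled. Data programming systems typically replace the plain majority vote with a learned model of labeling function accuracies.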

Supplemental Material

MP4 File
Presentation video of the paper "Steered Training Data Generation For Learned Semantic Type Detection" by Sven Langenecker, Christoph Sturm, Christian Schalles, and Carsten Binnig. The presentation is about a new labeling framework for the task of semantic type detection that comes with a novel Steered-Labeling procedure to generate high-quality training data for non-numeric as well as numeric table columns.


Published In

Proceedings of the ACM on Management of Data  Volume 1, Issue 2
PACMMOD
June 2023
2310 pages
EISSN:2836-6573
DOI:10.1145/3605748
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data discovery
  2. data lakes
  3. semantic type detection
