short-paper

NumJoin: Discovering Numeric Joinable Tables with Semantically Related Columns

Authors:
Pranav Subramaniam, The University of Chicago, Chicago, IL, USA (ORCID: 0009-0005-4684-0501)
Udayan Khurana, IBM Research, Yorktown Heights, NY, USA (ORCID: 0000-0001-8113-1210)
Kavitha Srinivas, IBM Research, Yorktown Heights, NY, USA (ORCID: 0000-0003-4610-967X)
Horst Samulowitz, IBM Research, Yorktown Heights, NY, USA (ORCID: 0000-0002-6780-3217)

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, October 2023, Pages 5096–5100. https://doi.org/10.1145/3583780.3614750

Published: 21 October 2023

ABSTRACT

Join discovery is a crucial part of data lake exploration. It often involves finding joinable tables that are semantically relevant. However, data lakes often contain numeric tables with unreliable column headers, and ID columns whose text names have been lost, so finding semantically relevant joins over numeric tables is a challenge. State-of-the-art approaches perform join discovery using semantic similarity, but do not consider purely numeric tables. In this paper, we describe a system, NumJoin, that includes two novel approaches for discovering joinable tables in a data lake: one that maps tables to knowledge graphs, and another that leverages numeric types and distributions. We demonstrate the effectiveness of NumJoin on a large data lake that includes transportation and finance data.

Index Terms

1. Information systems
  1. Data management systems
    1. Information integration

Published in
CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
October 2023
5508 pages
ISBN: 9798400701245
DOI: 10.1145/3583780
General Chairs:
Ingo Frommholz, University of Wolverhampton, UK
Frank Hopfgartner, University of Koblenz, Germany
Mark Lee, University of Birmingham, UK
Michael Oakes, University of Birmingham, UK
Program Chairs:
Mounia Lalmas, Spotify, UK
Min Zhang, Tsinghua University, China
Rodrygo Santos, Federal University of Minas Gerais, Brazil
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
numeric data integration
semantic join discovery
tabular data
Qualifiers
- short-paper
Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%