Duplicate Table Discovery with Xash

Data lakes are typically lightly curated and therefore prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify tables that contain the same data. Comparing tables is generally expensive, because the order of rows and columns may differ between otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, to duplicate table detection. With Xash, it is possible to generate a so-called super key, which works like a Bloom filter and quickly indicates whether particular cell values are present. We show that Xash speeds up the duplicate table detection process significantly. Compared to other hash functions, such as SimHash, Xash yields fewer false positive candidates.
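The super key idea can be illustrated with a minimal sketch. This is not the Xash function itself: the per-cell hash below is a generic SHA-256 stand-in, and the key width `BITS`, the bit count `k`, and all function names are assumptions made for illustration. The sketch shows only the general principle: each cell sets a few bits, the bits of a row are OR-ed into one key, and comparing the sorted row keys of two tables gives a candidate filter that ignores row and column order.

```python
import hashlib

BITS = 128  # assumed super-key width, chosen for illustration


def cell_bits(value: str, k: int = 3) -> int:
    """Map one cell value to a BITS-wide mask with at most k bits set.

    Generic stand-in for the per-cell hash; NOT the actual Xash function.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    mask = 0
    for i in range(k):
        pos = int.from_bytes(digest[2 * i:2 * i + 2], "big") % BITS
        mask |= 1 << pos
    return mask


def super_key(row) -> int:
    """OR the masks of all cells in a row; the result ignores column order."""
    key = 0
    for cell in row:
        key |= cell_bits(str(cell))
    return key


def may_contain(key: int, value) -> bool:
    """Bloom-filter-style test: False means the value is definitely absent."""
    mask = cell_bits(str(value))
    return key & mask == mask


def table_signature(rows) -> list:
    """Sorted list of row super keys; the result ignores row order."""
    return sorted(super_key(row) for row in rows)


def is_duplicate_candidate(table_a, table_b) -> bool:
    """Equal signatures are necessary for duplicates, but not sufficient."""
    return table_signature(table_a) == table_signature(table_b)


t1 = [["alice", 30, "berlin"], ["bob", 25, "bonn"]]
t2 = [["bonn", "bob", 25], ["berlin", "alice", 30]]  # rows and columns shuffled
print(is_duplicate_candidate(t1, t2))
```

As with any Bloom-filter-like structure, a positive answer can be a false positive, so candidate pairs would still need an exact comparison before being reported as duplicates.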

Koch, Maximilian; Esmailoghli, Mahdi; Auer, Sören; Abedjan, Ziawasch (2023): Duplicate Table Discovery with Xash. BTW 2023. DOI: 10.18420/BTW2023-18. Bonn: Gesellschaft für Informatik e.V. ISBN: 978-3-88579-725-8. pp. 367-390. Dresden, Germany. 6-10 March 2023

Keywords

data discovery, data lakes, duplicate table detection

DOI

10.18420/BTW2023-18

Collections

P331 - BTW2023 - Datenbanksysteme für Business, Technologie und Web
