DOI: 10.1145/3185768.3186307

DIBS: A Data Integration Benchmark Suite

Published: 02 April 2018

Abstract

As the generation of data becomes more prolific, the amount of time and resources necessary to perform analyses on these data increases. What is less understood, however, are the data preprocessing steps that must be applied before any meaningful analysis can begin. This problem of taking data in some initial form and transforming it into a desired one is known as data integration. Here, we introduce the Data Integration Benchmarking Suite (DIBS), a suite of applications that are representative of data integration workloads across many disciplines. We apply a comprehensive characterization to these applications to better understand the general behavior of data integration tasks. Drawing on the benchmark suite and our characterization methods, we offer insights into data integration tasks that can guide other researchers designing solutions in this area.

    Published In

    ICPE '18: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering
    April 2018
    212 pages
    ISBN: 9781450356299
    DOI: 10.1145/3185768
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. big data
    2. data integration
    3. data wrangling

    Qualifiers

    • Research-article

    Conference

    ICPE '18

    Acceptance Rates

    Overall Acceptance Rate 252 of 851 submissions, 30%

    Article Metrics

    • Downloads (Last 12 months): 130
    • Downloads (Last 6 weeks): 34
    Reflects downloads up to 10 Feb 2025

    Cited By

    • (2025) Application of Network Calculus Models to Heterogeneous Streaming Applications. International Journal of Networking and Computing 15:1 (51-63). DOI: 10.15803/ijnc.15.1_51. Online publication date: 2025
    • (2022) Executing Data Integration Effectively and Efficiently Near the Memory. IEEE Design & Test 39:2 (65-73). DOI: 10.1109/MDAT.2021.3069957. Online publication date: Apr-2022
    • (2022) Interruptible Nodes: Reducing Queueing Costs in Irregular Streaming Dataflow Applications on Wide-SIMD Architectures. International Journal of Parallel Programming 51:1 (43-60). DOI: 10.1007/s10766-022-00745-2. Online publication date: 5-Dec-2022
    • (2021) Platform Agnostic Streaming Data Application Performance Models. 2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA) (17-26). DOI: 10.1109/RSDHA54838.2021.00008. Online publication date: Nov-2021
    • (2021) Evaluation of Data Integration Plans based on Graph Data. Procedia Computer Science 192 (1041-1050). DOI: 10.1016/j.procs.2021.08.107. Online publication date: 2021
    • (2020) Reducing Queuing Impact in Irregular Data Streaming Applications. 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3) (22-30). DOI: 10.1109/IA351965.2020.00009. Online publication date: Nov-2020
    • (2020) Chip-to-chip Optical Data Communications using Polarization Division Multiplexing. 2020 IEEE High Performance Extreme Computing Conference (HPEC) (1-8). DOI: 10.1109/HPEC43674.2020.9286227. Online publication date: 22-Sep-2020
    • (2020) Design and Performance Evaluation of Optimizations for OpenCL FPGA Kernels. 2020 IEEE High Performance Extreme Computing Conference (HPEC) (1-7). DOI: 10.1109/HPEC43674.2020.9286221. Online publication date: 22-Sep-2020
    • (2020) Designing Domain Specific Computing Systems. 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (221-221). DOI: 10.1109/FCCM48280.2020.00052. Online publication date: May-2020
    • (2019) Data Integration Tasks on Heterogeneous Systems Using OpenCL. Proceedings of the International Workshop on OpenCL (1-1). DOI: 10.1145/3318170.3318187. Online publication date: 13-May-2019