DOI: 10.1145/3583678.3596898
research-article

An exploratory analysis of methods for real-time data deduplication in streaming processes

Published: 27 June 2023

Abstract

Modern stream processing systems typically need to ingest and correlate data from multiple sources. However, these sources lie outside the system's control and are prone to software errors and unavailability, causing data anomalies that must be remedied before the data is processed. In this context, anomalies such as data duplication are among the most prominent challenges of stream processing, since duplicates can hinder real-time analysis of data for decision making. This paper investigates these challenges and performs an experimental analysis of operators and auxiliary tools that help with data deduplication. The results show that data delivery time increases when external mechanisms are used. However, these mechanisms are essential for an ingestion process to guarantee that no data is lost and that no duplicates are persisted.
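To make the idea of a deduplication operator concrete, the following is a minimal sketch in Python. It is not the method evaluated in the paper. It assumes each event carries a unique identifier and a non-decreasing timestamp, and it keeps the identifiers it has seen in process memory for a fixed time window. The names `deduplicate` and `ttl_seconds` and the event layout are hypothetical, chosen only for illustration.

```python
from collections import OrderedDict
from typing import Hashable, Iterable, Iterator, Tuple

# Hypothetical event layout: (event_id, timestamp_seconds, payload).
Event = Tuple[Hashable, float, object]


def deduplicate(events: Iterable[Event], ttl_seconds: float = 60.0) -> Iterator[Event]:
    """Yield each event once per time window, dropping repeats of an id.

    Assumes timestamps arrive in non-decreasing order, so the oldest
    remembered id is always at the front of the ordered dict.
    """
    seen: "OrderedDict[Hashable, float]" = OrderedDict()  # id -> first-seen timestamp
    for event_id, ts, payload in events:
        # Evict ids whose first sighting is older than the window,
        # so the in-memory state stays bounded.
        while seen:
            oldest_id, oldest_ts = next(iter(seen.items()))
            if ts - oldest_ts > ttl_seconds:
                seen.popitem(last=False)
            else:
                break
        if event_id in seen:
            continue  # duplicate inside the window: drop it
        seen[event_id] = ts
        yield event_id, ts, payload


if __name__ == "__main__":
    stream = [("a", 0.0, "x"), ("b", 1.0, "y"), ("a", 2.0, "x"), ("a", 100.0, "x")]
    for event in deduplicate(stream, ttl_seconds=60.0):
        print(event)
```

State held this way lives only inside the running process, so it would not survive a restart. Keeping the seen identifiers in an external store instead is one plausible reading of the "external mechanisms" the abstract mentions, at the cost of an extra lookup per event.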

      Published In

      DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-based Systems
      June 2023
      221 pages
      ISBN:9798400701221
      DOI:10.1145/3583678

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article

      Funding Sources

      • Independent Research Fund Denmark

      Conference

      DEBS '23

      Acceptance Rates

      Overall Acceptance Rate 145 of 583 submissions, 25%
