DOI: 10.1145/3627673.3680097
research-article
Open access

DAMOCRO: A Data Migration Framework Using Online Classification and Reordering

Published: 21 October 2024

Abstract

This paper introduces DAMOCRO, a <u>da</u>ta <u>m</u>igration framework that uses <u>o</u>nline <u>c</u>lassification and tuple <u>r</u>e<u>o</u>rdering to improve throughput and reduce the cost of data migration. The DAMOCRO workflow consists of four main steps. First, it classifies records into subgroups to maximize the similarity within each group. Next, it reorders the tuples within each group so that similar tuples are adjacent. Then, it applies column-wise compression to each group. Finally, it transfers the compressed data from the source machine to the target machine. The first two steps improve the compression ratio, which in turn boosts throughput and reduces cost. Our evaluations on five real-world datasets and two benchmark datasets show that the online classification process in DAMOCRO improves throughput by more than 24% and reduces costs by over 19% compared to baselines. In addition, reordering based on functional dependencies yields a further cost reduction of 10% to 60% while also enhancing throughput.
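
The four steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `classify` function is a hypothetical stand-in for the online classifier (it simply buckets by the first field), plain lexicographic sorting stands in for the reordering step, and `zlib` stands in for whatever compressor the framework actually uses. Records are assumed to be tuples of strings that contain no newline characters.

```python
import zlib
from collections import defaultdict


def classify(record):
    # Hypothetical stand-in for the online classifier:
    # bucket each record by its first field.
    return record[0]


def compress_group(rows):
    # Step 3: column-wise compression. Each column is serialized
    # and compressed on its own, so similar values sit together.
    columns = zip(*rows)
    return [zlib.compress("\n".join(col).encode("utf-8")) for col in columns]


def decompress_group(blobs):
    # Inverse of compress_group: rebuild the columns, then the rows.
    columns = [zlib.decompress(b).decode("utf-8").split("\n") for b in blobs]
    return [tuple(row) for row in zip(*columns)]


def migrate(records):
    # Step 1: classify records into groups.
    groups = defaultdict(list)
    for rec in records:
        groups[classify(rec)].append(rec)

    payload = {}
    for label, rows in groups.items():
        # Step 2: reorder tuples so similar ones are adjacent
        # (assumption: lexicographic order as a simple proxy).
        rows.sort()
        payload[label] = compress_group(rows)

    # Step 4: the payload is what would be sent to the target machine.
    return payload


def receive(payload):
    # Target side: decompress every group back into rows.
    rows = []
    for blobs in payload.values():
        rows.extend(decompress_group(blobs))
    return rows


records = [
    ("NY", "theft", "open"),
    ("CA", "fraud", "closed"),
    ("NY", "theft", "closed"),
    ("CA", "fraud", "open"),
]
restored = receive(migrate(records))
```

The sketch preserves the set of records but not their original order, which is the trade-off that grouping and reordering make in exchange for a better compression ratio.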

    Published In

    CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
    October 2024
    5705 pages
    ISBN: 9798400704369
    DOI: 10.1145/3627673
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cloud computing
    2. data compression
    3. data migration
    4. functional dependency
    5. online clustering

    Qualifiers

    • Research-article

    Funding Sources

    • IBM CAS Fellowship
    • NSERC Discovery Grants

    Conference

    CIKM '24

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
