Abstract
Data are now an extremely valuable resource. They come from many sources: the government of a country, an organization, a company, or a private individual. Their content is just as varied: a dataset could cover primary education in the U.K., medical care in the U.S., or agriculture in Vietnam. It is reasonable to assume that some of these datasets concern the same topic and, moreover, share the same or at least similar structures. Unioning such datasets into larger, more meaningful ones is beneficial: the unioned datasets contain the collective information of their sources, and users and scientists no longer have to spend time searching for and combining the datasets themselves. In this paper, we propose a data union method for JSON-format data based on hierarchical clustering and Set Unionability. We also report experiments that evaluate the method and demonstrate its feasibility.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
This article is part of the topical collection “Future Data and Security Engineering 2021” guest edited by Tran Khanh Dang.
About this article
Cite this article
Dang, T.K., Ta, M.H. A Deeper Analysis of the Hierarchical Clustering and Set Unionability-Based Data Union Method. SN COMPUT. SCI. 3, 486 (2022). https://doi.org/10.1007/s42979-022-01384-7
DOI: https://doi.org/10.1007/s42979-022-01384-7