A Deeper Analysis of the Hierarchical Clustering and Set Unionability-Based Data Union Method

Dang, Tran Khanh; Ta, Manh Huy

doi:10.1007/s42979-022-01384-7

Original Research
Published: 21 September 2022

Volume 3, article number 486 (2022)

SN Computer Science

Abstract

Data are nowadays an extremely valuable resource. They come from many sources: the government of a country, an organization, a company, or a private individual. Their content also varies widely: a dataset could be about primary education in the U.K., medical care in the U.S., or agriculture in Vietnam. It is reasonable to assume that some of these datasets are about the same topic and have the same, or at least similar, structures. Unioning such datasets into more meaningful ones is beneficial: the unioned datasets contain the collective information of their sources, and users and scientists no longer have to spend a lot of time searching for and combining the datasets themselves. In this paper, we propose a data union method for JSON-format data based on hierarchical clustering and Set Unionability. We also report experiments that evaluate the method and demonstrate its feasibility.

Notes

http://github.com/ligthsworn/table_union_benchmark.

Author information

Authors and Affiliations

Ho Chi Minh City University of Food Industry (HUFI), 140 Le Trong Tan Street, Tan Phu District, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Ho Chi Minh City University of Technology, VNU-HCM, 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam
Manh Huy Ta

Corresponding author

Correspondence to Tran Khanh Dang.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Future Data and Security Engineering 2021” guest edited by Tran Khanh Dang.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dang, T.K., Ta, M.H. A Deeper Analysis of the Hierarchical Clustering and Set Unionability-Based Data Union Method. SN COMPUT. SCI. 3, 486 (2022). https://doi.org/10.1007/s42979-022-01384-7

Received: 06 April 2022
Accepted: 19 August 2022
Published: 21 September 2022
DOI: https://doi.org/10.1007/s42979-022-01384-7

Part of a collection:

Future Data and Security Engineering 2021
