An efficient automated incremental density-based algorithm for clustering and classification
Introduction
Clustering is an analytical method used to group data objects according to their similarity without any knowledge of the ground-truth clusters [1]. Cluster analysis has been used in many different areas [2], [3], [4], [5]. In query optimization [6], clustering is used to identify groups of queries with similar Query Execution Plans (QEPs) based on the semantic representations of the queries. Massive volumes of data are usually stored in distributed databases [7]. Sequences of queries are issued to distributed databases as the primary unit of interaction between a database and its users [8]. The same QEP can be used to execute queries with a similar structure [9]. There are different clustering methods, including density-based clustering [10], [11], [12], fuzzy clustering [13], hierarchical clustering, partitioning clustering, and model-based clustering.
Various factors affect the quality of these clustering methods, including the size of the data, the shapes of the clusters, noise in the data, and the number of input parameters. A well-known method for cluster analysis is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [14]. It can find clusters with arbitrary shapes and handle noise. The static DBSCAN algorithm has been used to cluster static datasets, in which all data objects have to be collected before running the algorithm [15]. Furthermore, the static DBSCAN algorithm re-clusters all data objects whenever new data arrive. In dynamic environments, all data cannot be collected before performing clustering: such datasets are collected over large environments and evolve quickly. Therefore, incremental DBSCAN clustering is preferable to traditional static DBSCAN. Incremental DBSCAN [15] can incrementally create and update arbitrarily shaped clusters in large dynamic datasets. The performance of the algorithm is influenced by its two input parameters, Epsilon (Eps) and Minimum points (MinPts). Nevertheless, it is quite challenging to apply incremental DBSCAN because of the difficulty of determining these input parameters, especially when handling massive volumes of data or data whose characteristics are poorly understood. This motivates automatic methods for calculating the values of these parameters.
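The sensitivity to Eps and MinPts described above is easy to reproduce. The following sketch (using scikit-learn's static DBSCAN on a synthetic two-moons dataset, not the paper's data or implementation) shows how the same points yield very different partitionings as Eps varies:

```python
# Sketch: how DBSCAN's two global parameters (eps, min_samples) change the
# clustering outcome on identical data. Dataset and parameter values are
# illustrative only, not taken from the paper.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

for eps in (0.05, 0.2, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))  # DBSCAN marks noise with label -1
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

A small Eps fragments the moons into many clusters and noise, while a large Eps merges them; there is no single value that is obviously correct a priori, which is exactly the tuning problem the paper targets.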
Recently, several studies have combined meta-heuristic optimization algorithms with clustering algorithms to improve the quality of clustering [16], [17], [18], [19], [20], [21], [22], [23]. For example, in [20], a hybrid clustering method was presented that combined Differential Evolution (DE) with DBSCAN to automatically detect the best combination of Eps and MinPts. In [17], Particle Swarm Optimization (PSO) was applied as a parameter-tuning tool for DBSCAN in both supervised and unsupervised learning. However, these single-objective approaches can produce unbalanced results, and, to the best of our knowledge, no work has applied multi-objective optimization techniques to density-based cluster parameter optimization. Therefore, this paper proposes a new hybrid approach, NSGA-II based Density-Based Clustering and Classification (NSGA-II/DBCC), to improve the clustering quality of the incremental DBSCAN algorithm by identifying the ideal parameter configurations through a search of the whole parameter space with NSGA-II. Four kinds of fitness functions have been designed based on internal and external clustering validation indices to determine the best configuration of parameters for both labeled and unlabeled datasets [17] using NSGA-II/DBCC. As far as we know, no other authors have used NSGA-II to optimize the DBSCAN parameters with the fitness functions introduced in this paper. Moreover, the present article develops a parallel version of the Non-dominated Sorting Genetic Algorithm II (pNSGA-II) to speed up the computation of the fitness functions. The main contributions of this article are as follows:
- Presenting a novel technique based on the integration of NSGA-II and incremental DBSCAN for automatically determining the appropriate number of clusters and enhancing the quality of the clusters.
- Using NSGA-II as a parameter-tuning tool to reduce the number of incorrectly partitioned data points in incremental DBSCAN.
- Using multiple internal validation indices to choose the most appropriate number of clusters in unlabeled datasets.
- Using multiple external validation indices to generate an efficient data partitioning in labeled datasets.
- Performing the fitness evaluations of the different NSGA-II individuals in parallel.
The article is structured as follows: Section 2 reviews previous work. Section 3 presents the incremental DBSCAN clustering algorithm. Section 4 presents the proposed pNSGA-II/DBCC approach. Section 5 proposes various fitness functions for supervised and unsupervised pNSGA-II/DBCC. Sections 6 and 7 present the experimental results and the conclusion, respectively.
Section snippets
Related work
This section reviews the relevant literature on using Evolutionary and Swarm Algorithms (ESAs) to improve the functionality of incremental DBSCAN.
PSO is a practical algorithm for dealing with a wide range of optimization problems [24], [25], [26], [27], [28]. Guan and Yuen [17] introduced a novel combined method called Particle swarm Optimized Density-based Clustering and Classification (PODCC) to identify the parameters used by DBSCAN to provide more accurate…
Background
In this section, related technologies and the required notions are reviewed.
Algorithm
The following section details the scheme of the parallel NSGA-II based density-based clustering and classification algorithm.
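Because the fitness of each individual in the population is independent of the others, the evaluations can be distributed across worker processes; this is the parallelism that distinguishes pNSGA-II from the sequential version. A minimal sketch of that pattern follows, using Python's standard process pool. The fitness function here is a hypothetical stand-in: in pNSGA-II/DBCC it would run incremental DBSCAN with the individual's (Eps, MinPts) and score the result with the validity indices.

```python
# Sketch: evaluating a population's fitness values in parallel, as pNSGA-II
# does. The fitness function is a placeholder, not the paper's objective.
from concurrent.futures import ProcessPoolExecutor

def fitness(individual):
    eps, min_pts = individual
    # Placeholder cost surface; the real algorithm would cluster with
    # incremental DBSCAN(eps, min_pts) and compute validity indices here.
    return (eps - 0.5) ** 2 + (min_pts - 5) ** 2

if __name__ == "__main__":
    population = [(0.1 * i, i) for i in range(1, 11)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        scores = list(pool.map(fitness, population))
    print(scores)
```

Since selection, crossover, and mutation are cheap relative to clustering, farming out only the fitness evaluations typically captures most of the available speed-up.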
Objective functions
Multi-objective optimization is the optimization of conflicting objectives within given constraints. In such problems, optimal decisions must trade off two or more conflicting objectives. Four objective functions have been considered for optimization. The first two are internal cluster validity indices that depend only on inherent characteristics of the datasets. The other two measure the violation of existing supervised data; they are also known as the external cluster…
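The two internal indices named later in the paper, the Dunn index (to be maximized) and the Davies–Bouldin index (to be minimized), can be computed as below. The Dunn implementation is a straightforward textbook version (minimum inter-cluster distance over maximum intra-cluster diameter), not necessarily the exact variant the authors used; the data and DBSCAN parameters are illustrative.

```python
# Hedged sketch of two internal cluster validity indices: Dunn (maximize)
# and Davies-Bouldin (minimize). Textbook Dunn variant; illustrative data.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

def dunn_index(X, labels):
    """Min inter-cluster distance divided by max intra-cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels) if c != -1]
    diameter = max(cdist(c, c).max() for c in clusters)
    separation = min(cdist(a, b).min()
                     for i, a in enumerate(clusters) for b in clusters[i + 1:])
    return separation / diameter

X, _ = make_blobs(n_samples=300, centers=[(0, 0), (6, 0), (0, 6)],
                  cluster_std=0.6, random_state=0)
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
core = labels != -1  # validity indices are computed on non-noise points
print("Dunn:", dunn_index(X[core], labels[core]))
print("Davies-Bouldin:", davies_bouldin_score(X[core], labels[core]))
```

Because the two indices reward different properties (separation vs. compactness), a parameter setting that is best under one is not necessarily best under the other, which is what makes the problem genuinely multi-objective.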
Experimental setup and results
In this research, several experiments have been performed to assess the performance of the introduced pNSGA-II/DBCC algorithm. The clustering and classification results of the pNSGA-II/DBCC have been compared to a standard PSO-based algorithm and some established methods. The results obtained in artificial and real data problems are described in Section 6.4 (Sections 6.4.1 to 6.4.3). Furthermore, the efficiency of the proposed algorithm is compared with a sequential algorithm in Section 6.4.4.
Conclusion
The present article has introduced a novel parallel technique named NSGA-II based Density-Based Clustering and Classification (pNSGA-II/DBCC) to handle the setting of the two global input parameters of incremental DBSCAN. To address this problem, an NSGA-II-based method has been used to explore the whole parameter space for incremental DBSCAN. Two objectives have been considered in this paper: the maximization of the Dunn index and the minimization of the Davies–Bouldin index. On the…
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (40)
- et al., Application of a density based clustering technique on biomedical datasets, Appl. Soft Comput. (2018)
- et al., Data aggregation mechanisms in the Internet of things: A systematic review of the literature and recommendations for future research, J. Netw. Comput. Appl. (2017)
- Application of clustering-based decision tree approach in SQL query error database, Future Gener. Comput. Syst. (2019)
- et al., A systematic literature review of the data replication techniques in the cloud environments, Big Data Res. (2017)
- et al., Hierarchical density-based cluster analysis framework for atom probe tomography data, Ultramicroscopy (2019)
- A density-based clustering algorithm for earthquake zoning, Comput. Geosci. (2018)
- et al., Intrusion detection for cloud computing using neural networks and artificial bee colony optimization algorithm, ICT Express (2019)
- et al., Efficient incremental density-based algorithm for clustering large datasets, Alexandria Eng. J. (2015)
- et al., Particle swarm Optimized Density-based Clustering and Classification: Supervised and unsupervised learning approaches, Swarm Evol. Comput. (2019)
- A new hybrid method based on partitioning-based DBSCAN and ant clustering, Expert Syst. Appl. (2011)
- Sampling approaches for applying DBSCAN to large datasets, Pattern Recognit. Lett.
- Service allocation in the cloud environments using multi-objective particle swarm optimization algorithm based on crowding distance, Swarm Evol. Comput.
- Multi-objective energy management in a micro-grid, Energy Rep.
- Fast density-based clustering through dataset partition using graphics processing units, Inform. Sci.
- Towards data analysis for weather cloud computing, Knowl.-Based Syst.
- Speed up genetic algorithms in the cloud using software containers, Future Gener. Comput. Syst.
- Data Mining: Concepts and Techniques
- Towards efficient data collection mechanisms in the vehicular ad hoc networks, Int. J. Commun. Syst.
- Join query optimization in the distributed database system using an artificial bee colony algorithm and genetic operators, Concurr. Comput.: Pract. Exper.
- Similarity metrics for sql query clustering, IEEE Trans. Knowl. Data Eng.
Elham Azhir is a Ph.D. student in Computer Engineering at Science and Research Branch, Islamic Azad University, Tehran, Iran from September 2016. She received the B.S. and M.Sc. degree from Qazvin Islamic Azad University, Qazvin, Iran in 2010 and 2014, all in Software Engineering. Her main research interests are query optimization, cloud computing, distributed systems and programming.
Nima Jafari Navimipour received his B.S. in computer engineering, software engineering, from Tabriz Branch, Islamic Azad University, Tabriz, Iran, in 2007; the M.S. in computer engineering, computer architecture, from Tabriz Branch, Islamic Azad University, Tabriz, Iran, in 2009; the Ph.D. in computer engineering, computer architecture, from Science and Research Branch, Islamic Azad University, Tehran, Iran in 2014. His research interests include SDN, cloud computing, grid systems, computational intelligence, evolutionary computing, and wireless networks.
Mehdi HosseinZadeh received his B.S. degree in computer hardware engineering, from Islamic Azad University, Dezfol branch, Iran in 2003. He also received his M.Sc. and the Ph.D. degree in computer system architecture from the Science and Research Branch, Islamic Azad University, Tehran, Iran in 2005 and 2008, respectively. He is currently an Associate professor in Iran University of Medical Sciences (IUMS), Tehran, Iran, and his research interests include SDN, Information Technology, Data Mining, Big data analytics, E-Commerce, E-Marketing, and Social Networks.
Arash Sharifi received the B.S. degree in computer hardware engineering from IAU South Tehran Branch, M.S degree and Ph.D. degree in artificial intelligence from IAU science and research branch, in 2007 and 2012 respectively. He is currently head of computer engineering department of SRBIAU. His current research interests include image processing, machine learning and deep learning.
Aso Darwesh received the B.S. in Mathematics in University of Sulaimani, Iraq 2001, M.S. degrees in Computer Science in University of Rene Descartes, France 2007 and Ph.D. in Computer Science, University of Pierre and Mari Curie, France 2010. Currently, he is Associate Professor in the Information Technology Department, University of Human Development, Sulaymaniyah, Iraq. His research interests include Serious Games, Adaptive Learning Cognitive Diagnosis in E-Learning, Learning Systems, Computer Networks, Networking Security, and Data Mining.