An efficient automated incremental density-based algorithm for clustering and classification

https://doi.org/10.1016/j.future.2020.08.031

Highlights

  • Presenting a novel technique based on the integration of NSGA-II and incremental DBSCAN for improving the quality of clustering.

  • Using NSGA-II as a parameter tuning tool for reducing incorrectly-partitioned data points of incremental DBSCAN.

  • Using the internal validation indices for choosing the most appropriate number of clusters in unlabeled datasets.

  • Using the external validation indices for generating an efficient data partitioning in the labeled datasets.

Abstract

Data clustering divides a dataset into different groups. Incremental Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a well-known density-based clustering technique able to find clusters of variable sizes and shapes. The quality of incremental DBSCAN results is influenced by two input parameters: MinPts (Minimum Points) and Eps (Epsilon). Parameter setting is therefore one of the major problems of incremental DBSCAN. In the present article, an improved incremental DBSCAN based on the Non-dominated Sorting Genetic Algorithm II (NSGA-II) is presented to address this issue. The proposed algorithm adjusts the two parameters (MinPts and Eps) of incremental DBSCAN through iteration and fitness functions to enhance clustering precision. Moreover, the proposed method introduces suitable fitness functions for both labeled and unlabeled datasets. We have also improved the efficiency of the proposed hybrid algorithm by parallelizing the optimization process. The introduced method has been evaluated on several textual and numerical datasets with different shapes, sizes, and dimensions. According to the experimental results, the proposed algorithm provides better results than Multi-Objective Particle Swarm Optimization (MOPSO) based incremental DBSCAN and several well-known techniques, particularly on shaped and balanced datasets. In addition, the parallel model achieves good speed-up compared with the serial version of the algorithm.

Introduction

Clustering is an analytical method used to group data objects according to their similarity without any knowledge of ground-truth clusters [1]. Cluster analysis has been used in many different areas [2], [3], [4], [5]. In query optimization [6], clustering is used to identify groups of queries with similar Query Execution Plans (QEPs) based on the semantic representations of the queries. Large masses of data are usually stored in distributed databases [7]. Sequences of queries issued to distributed databases form the primary unit of interaction between a database and its users [8]. The same QEP can be used to execute queries with a similar structure [9]. There are various clustering methods, including density-based clustering [10], [11], [12], fuzzy clustering [13], hierarchical clustering, partitioning clustering, and model-based clustering.

Various factors, including the size of the data, the shapes of the clusters, noise in the data, and the number of input parameters, affect the quality of these clustering methods. A well-known method for data cluster analysis is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [14]. It can find clusters with arbitrary shapes and handle noise. The static DBSCAN algorithm is used to cluster static datasets, in which all data objects have to be collected before running the algorithm [15]. Furthermore, static DBSCAN re-clusters all data objects whenever new data arrive. In dynamic environments, all the data cannot be collected before clustering is performed: such datasets are collected over large environments and evolve quickly. Therefore, incremental DBSCAN clustering is preferable to traditional static DBSCAN. Incremental DBSCAN [15] can incrementally create and update arbitrarily shaped clusters in large dynamic datasets. In this algorithm, the clustering performance is influenced by its two input parameters, Epsilon (Eps) and Minimum Points (MinPts). Nevertheless, applying incremental DBSCAN is quite challenging because determining these input parameters is difficult, especially when handling massive volumes of data or data whose characteristics are poorly known. This is why automatic methods for setting the values of these parameters are needed.
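To make the role of the two parameters concrete, the following is a minimal, self-contained sketch of the classic (static) DBSCAN density test, not the authors' incremental implementation; function names and the toy data are illustrative. Points within Eps of a point form its neighborhood, and a point is a core point only if that neighborhood contains at least MinPts points (counting the point itself here).

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Basic DBSCAN: returns one cluster label per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:   # not a core point
            labels[i] = -1             # tentatively noise
            continue
        cluster += 1                   # start a new cluster from this core point
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                   # expand the cluster through density-reachability
            j = seeds.pop()
            if labels[j] == -1:        # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also a core point
                seeds.extend(j_neighbors)
    return labels
```

With a small Eps or a large MinPts, more points fail the core-point test and fall out as noise, which is exactly the sensitivity the parameter-tuning problem addresses.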

Recently, several studies have combined meta-heuristic optimization algorithms with clustering algorithms to improve the quality of clustering [16], [17], [18], [19], [20], [21], [22], [23]. For example, in [20], a hybrid clustering method was presented that combined Differential Evolution (DE) with DBSCAN to automatically detect the best combinations of Eps and MinPts. In [17], Particle Swarm Optimization (PSO) was applied as a parameter tuning tool for DBSCAN in both supervised and unsupervised learning. However, these single-objective approaches can produce unbalanced results, and to the best of our knowledge, no prior work has applied multi-objective optimization techniques to density-based cluster parameter optimization. Therefore, this paper proposes a new hybrid approach, NSGA-II based Density-Based Clustering and Classification (NSGA-II/DBCC), to improve the clustering quality of the incremental DBSCAN algorithm by identifying the ideal parameter configurations through a search of the whole parameter space with NSGA-II. Four kinds of fitness functions have been designed, based on internal and external clustering validation indices, to determine the best parameter configuration for both labeled and unlabeled datasets [17] using NSGA-II/DBCC. As far as we know, no other authors have used NSGA-II to optimize the DBSCAN parameters with the fitness functions introduced in this paper. Moreover, the present article develops a parallel version of the Non-dominated Sorting Genetic Algorithm II (pNSGA-II) to speed up the computation of the fitness functions. The main contributions of this article are as follows:

  • Presenting a novel technique based on the integration of NSGA-II and incremental DBSCAN for the automatic determination of the appropriate number of clusters and the enhancement of the quality of clusters.

  • Using NSGA-II as a parameter tuning tool for reducing incorrectly-partitioned data points of incremental DBSCAN.

  • Using multiple internal validation indices for choosing the most appropriate number of clusters in unlabeled datasets.

  • Using multiple external validation indices for generating an efficient data partitioning in the labeled datasets.

  • Performing the fitness evaluations of the NSGA-II different individuals in parallel.
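The last contribution, parallel fitness evaluation, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `fitness` function below is a placeholder (in pNSGA-II/DBCC each evaluation would run incremental DBSCAN with a candidate (Eps, MinPts) pair and score the resulting partition with a validity index), and a thread pool is used here for brevity, whereas CPU-bound evaluations would normally use a process pool.

```python
from concurrent.futures import ThreadPoolExecutor

def fitness(params):
    # Placeholder objective: in pNSGA-II/DBCC this would cluster the data
    # with (eps, min_pts) and return a clustering validity score.
    eps, min_pts = params
    return (eps - 0.5) ** 2 + (min_pts - 4) ** 2

def evaluate_population(population, workers=4):
    """Evaluate every individual's fitness concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs,
        # so each individual stays paired with its fitness value.
        return list(pool.map(fitness, population))
```

Because each individual's fitness is independent of the others', the evaluations of one generation can run concurrently without changing the algorithm's results.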

The article is structured as follows: Section 2 reviews previous work. Section 3 presents the incremental DBSCAN clustering algorithm. The proposed pNSGA-II/DBCC approach is presented in Section 4. Section 5 proposes various fitness functions for supervised and unsupervised pNSGA-II/DBCC. The experimental results and the conclusion are presented in Sections 6 and 7, respectively.

Section snippets

Related work

This section reviews the relevant literature on using Evolutionary and Swarm Algorithms (ESAs) to improve the functionality of incremental DBSCAN.

PSO is a practical algorithm for dealing with a wide range of optimization problems [24], [25], [26], [27], [28]. Guan and Yuen [17] introduced a novel combined method called Particle swarm Optimized Density-based Clustering and Classification (PODCC) to identify the parameters used by DBSCAN to provide more accurate

Background

In this section, related technologies and the required notions are reviewed.

Algorithm

The following section details the scheme of the parallel NSGA-II based density-based clustering and classification algorithm.

Objective functions

Multi-objective optimization is the optimization of conflicting objectives within given constraints; in such problems, optimal decisions must trade off two or more conflicting objectives. Four objective functions have been considered for optimization. The first two are internal cluster validity indexes that depend on inherent characteristics of the datasets. The other two measure the violation of existing supervised data; they are also known as the external cluster
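As a concrete illustration of an internal validity index of the kind used as an objective here, below is a small sketch of the Dunn index (smallest inter-cluster distance divided by largest cluster diameter; higher is better). The function names and toy data are illustrative; this is not the paper's implementation.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Dunn index for a partition given as a list of clusters (lists of points).
    Assumes at least two clusters and at least one cluster with two points."""
    # Diameter of each cluster: largest pairwise distance inside it.
    diameters = [max((math.dist(p, q) for p, q in combinations(c, 2)), default=0.0)
                 for c in clusters]
    # Separation between each pair of clusters: smallest cross-cluster distance.
    separations = [min(math.dist(p, q) for p in a for q in b)
                   for a, b in combinations(clusters, 2)]
    return min(separations) / max(diameters)
```

A partition with compact, well-separated clusters scores high, so the index rewards parameter settings that neither merge distinct groups nor shatter them.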

Experimental setup and results

In this research, several experiments have been performed to assess the performance of the introduced pNSGA-II/DBCC algorithm. The clustering and classification results of the pNSGA-II/DBCC have been compared to a standard PSO-based algorithm and some established methods. The results obtained in artificial and real data problems are described in Section 6.4 (Sections 6.4.1 to 6.4.3). Furthermore, the efficiency of the proposed algorithm is compared with a sequential algorithm in Section 6.4.4.

Conclusion

The present article has introduced a novel parallel technique named parallel NSGA-II based Density-Based Clustering and Classification (pNSGA-II/DBCC) to handle the setting of the two global input parameters of incremental DBSCAN. To address this problem, an NSGA-II based method has been used to explore the whole parameter space of incremental DBSCAN. Two objectives have been considered in this paper: maximization of the Dunn index and minimization of the Davies–Bouldin index. On the
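The selection step that balances the two objectives rests on Pareto non-domination, the core comparison inside NSGA-II. Below is a minimal sketch, assuming both objectives are expressed as minimization (e.g., the negated Dunn index and the Davies–Bouldin index); the names and toy objective values are illustrative.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized):
    a is no worse on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the non-dominated subset of (params, objectives) pairs."""
    return [c for c in candidates
            if not any(dominates(o[1], c[1]) for o in candidates if o is not c)]
```

Candidate parameter settings surviving this filter form the front from which the next generation is drawn; no survivor can be improved on one index without worsening the other.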

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Elham Azhir is a Ph.D. student in Computer Engineering at Science and Research Branch, Islamic Azad University, Tehran, Iran from September 2016. She received the B.S. and M.Sc. degree from Qazvin Islamic Azad University, Qazvin, Iran in 2010 and 2014, all in Software Engineering. Her main research interests are query optimization, cloud computing, distributed systems and programming.

References (40)

  • Luchi, D., et al., Sampling approaches for applying DBSCAN to large datasets, Pattern Recognit. Lett. (2019)

  • Sheikholeslami, F., et al., Service allocation in the cloud environments using multi-objective particle swarm optimization algorithm based on crowding distance, Swarm Evol. Comput. (2017)

  • Aghajani, G., et al., Multi-objective energy management in a micro-grid, Energy Rep. (2018)

  • Loh, W.-K., et al., Fast density-based clustering through dataset partition using graphics processing units, Inform. Sci. (2015)

  • Chang, V., Towards data analysis for weather cloud computing, Knowl.-Based Syst. (2017)

  • Salza, P., et al., Speed up genetic algorithms in the cloud using software containers, Future Gener. Comput. Syst. (2019)

  • Han, J., et al., Data Mining: Concepts and Techniques (2011)

  • Pourghebleh, B., et al., Towards efficient data collection mechanisms in the vehicular ad hoc networks, Int. J. Commun. Syst. (2019)

  • Panahi, V., et al., Join query optimization in the distributed database system using an artificial bee colony algorithm and genetic operators, Concurr. Comput.: Pract. Exper. (2019)

  • Kul, G., Similarity metrics for SQL query clustering, IEEE Trans. Knowl. Data Eng. (2018)


    Nima Jafari Navimipour received his B.S. in computer engineering (software engineering) from Tabriz Branch, Islamic Azad University, Tabriz, Iran, in 2007; his M.S. in computer engineering (computer architecture) from Tabriz Branch, Islamic Azad University, Tabriz, Iran, in 2009; and his Ph.D. in computer engineering (computer architecture) from Science and Research Branch, Islamic Azad University, Tehran, Iran, in 2014. His research interests include SDN, cloud computing, grid systems, computational intelligence, evolutionary computing, and wireless networks.

    Mehdi HosseinZadeh received his B.S. degree in computer hardware engineering from Islamic Azad University, Dezful Branch, Iran, in 2003. He received his M.Sc. and Ph.D. degrees in computer system architecture from the Science and Research Branch, Islamic Azad University, Tehran, Iran, in 2005 and 2008, respectively. He is currently an Associate Professor at Iran University of Medical Sciences (IUMS), Tehran, Iran, and his research interests include SDN, information technology, data mining, big data analytics, e-commerce, e-marketing, and social networks.

    Arash Sharifi received the B.S. degree in computer hardware engineering from IAU South Tehran Branch, and the M.S. and Ph.D. degrees in artificial intelligence from IAU Science and Research Branch, in 2007 and 2012, respectively. He is currently head of the computer engineering department of SRBIAU. His current research interests include image processing, machine learning, and deep learning.

    Aso Darwesh received the B.S. in Mathematics from the University of Sulaimani, Iraq, in 2001, the M.S. degree in Computer Science from the University of René Descartes, France, in 2007, and the Ph.D. in Computer Science from the University of Pierre and Marie Curie, France, in 2010. Currently, he is an Associate Professor in the Information Technology Department, University of Human Development, Sulaymaniyah, Iraq. His research interests include serious games, adaptive learning, cognitive diagnosis in e-learning, learning systems, computer networks, network security, and data mining.
