An efficient automated incremental density-based algorithm for clustering and classification

https://doi.org/10.1016/j.future.2020.08.031

Highlights

  • Presenting a novel technique based on the integration of NSGA-II and incremental DBSCAN for improving the quality of clustering.

  • Using NSGA-II as a parameter tuning tool for reducing incorrectly-partitioned data points of incremental DBSCAN.

  • Using the internal validation indices for choosing the most appropriate number of clusters in unlabeled datasets.

  • Using the external validation indices for generating an efficient data partitioning in the labeled datasets.

Abstract

Data clustering divides a dataset into different groups. Incremental Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a well-known density-based clustering technique able to find clusters of variable sizes and shapes. The quality of incremental DBSCAN results is influenced by two input parameters: MinPts (Minimum Points) and Eps (Epsilon). Parameter setting is therefore one of the major problems of incremental DBSCAN. In the present article, an improved incremental DBSCAN based on the Non-dominated Sorting Genetic Algorithm II (NSGA-II) is presented to address this issue. The proposed algorithm adjusts the two parameters (MinPts and Eps) of incremental DBSCAN through iteration and fitness functions to enhance clustering precision. Moreover, the proposed method introduces suitable fitness functions for both labeled and unlabeled datasets. We have also improved the efficiency of the proposed hybrid algorithm by parallelizing the optimization process. The introduced method has been evaluated on several textual and numerical datasets with different shapes, sizes, and dimensions. According to the experimental results, the proposed algorithm provides better results than Multi-Objective Particle Swarm Optimization (MOPSO) based incremental DBSCAN and several well-known techniques, particularly on shaped and balanced datasets. In addition, the parallel model achieves good speed-up compared with the serial version of the algorithm.

Introduction

Clustering is an analytical method used to group data objects according to their similarity without any knowledge of ground-truth clusters [1]. Cluster analysis has been used in many different areas [2], [3], [4], [5]. In query optimization [6], clustering is used to identify groups of queries with similar Query Execution Plans (QEPs) based on the semantic representations of the queries. Large masses of data are usually stored in distributed databases [7]. Sequences of queries issued to distributed databases form the primary unit of interaction between a database and its users [8]. The same QEP can be used to execute queries with a similar structure [9]. There are various clustering methods, including density-based clustering [10], [11], [12], fuzzy clustering [13], hierarchical clustering, partitioning clustering, and model-based clustering.

Various factors, including the size of the data, the shapes of the clusters, noise in the data, and the number of input parameters, affect the quality of these clustering methods. A well-known method for data cluster analysis is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [14]. It can find clusters with arbitrary shapes and handle noise. The static DBSCAN algorithm is used to cluster static datasets, in which all data objects have to be collected before running the algorithm [15]. Furthermore, static DBSCAN re-clusters all data objects whenever new data arrive. In dynamic environments, all the data cannot be collected before clustering is performed: such datasets are collected over large environments and evolve quickly. Therefore, incremental DBSCAN clustering is preferable to traditional static DBSCAN. Incremental DBSCAN [15] can incrementally create and update arbitrarily shaped clusters in large dynamic datasets. In this algorithm, the clustering performance is influenced by its two input parameters, Epsilon (Eps) and Minimum Points (MinPts). Nevertheless, applying incremental DBSCAN is quite challenging because determining these input parameters is difficult, especially when handling massive volumes of data or data whose characteristics are poorly known. This is why automatic methods for setting the values of these parameters are needed.
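To make the role of the two parameters concrete, the following is a minimal, self-contained sketch of the classic (static) DBSCAN density test, not the authors' incremental implementation; function names and the toy data are illustrative. Points within Eps of a point form its neighborhood, and a point is a core point only if that neighborhood contains at least MinPts points (counting the point itself here).

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Basic DBSCAN: returns one cluster label per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:   # not a core point
            labels[i] = -1             # tentatively noise
            continue
        cluster += 1                   # start a new cluster from this core point
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                   # expand the cluster through density-reachability
            j = seeds.pop()
            if labels[j] == -1:        # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also a core point
                seeds.extend(j_neighbors)
    return labels
```

With a small Eps or a large MinPts, more points fail the core-point test and fall out as noise, which is exactly the sensitivity the parameter-tuning problem addresses.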

Recently, several studies have combined meta-heuristic optimization algorithms with clustering algorithms to improve the quality of clustering [16], [17], [18], [19], [20], [21], [22], [23]. For example, in [20], a hybrid clustering method was presented that combined Differential Evolution (DE) with DBSCAN to automatically detect the best combinations of Eps and MinPts. In [17], Particle Swarm Optimization (PSO) was applied as a parameter tuning tool for DBSCAN in both supervised and unsupervised learning. However, these single-objective approaches can produce unbalanced results, and to the best of our knowledge, no prior work has applied multi-objective optimization techniques to density-based cluster parameter optimization. Therefore, this paper proposes a new hybrid approach, NSGA-II based Density-Based Clustering and Classification (NSGA-II/DBCC), to improve the clustering quality of the incremental DBSCAN algorithm by identifying the ideal parameter configurations through a search of the whole parameter space with NSGA-II. Four kinds of fitness functions have been designed, based on internal and external clustering validation indices, to determine the best parameter configuration for both labeled and unlabeled datasets [17] using NSGA-II/DBCC. As far as we know, no other authors have used NSGA-II to optimize the DBSCAN parameters with the fitness functions introduced in this paper. Moreover, the present article develops a parallel version of the Non-dominated Sorting Genetic Algorithm II (pNSGA-II) to speed up the computation of the fitness functions. The main contributions of this article are as follows:

  • Presenting a novel technique based on the integration of NSGA-II and incremental DBSCAN for the automatic determination of the appropriate number of clusters and the enhancement of the quality of clusters.

  • Using NSGA-II as a parameter tuning tool for reducing incorrectly-partitioned data points of incremental DBSCAN.

  • Using multiple internal validation indices for choosing the most appropriate number of clusters in unlabeled datasets.

  • Using multiple external validation indices for generating an efficient data partitioning in the labeled datasets.

  • Performing the fitness evaluations of the NSGA-II different individuals in parallel.
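The last contribution, parallel fitness evaluation, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `fitness` function below is a placeholder (in pNSGA-II/DBCC each evaluation would run incremental DBSCAN with a candidate (Eps, MinPts) pair and score the resulting partition with a validity index), and a thread pool is used here for brevity, whereas CPU-bound evaluations would normally use a process pool.

```python
from concurrent.futures import ThreadPoolExecutor

def fitness(params):
    # Placeholder objective: in pNSGA-II/DBCC this would cluster the data
    # with (eps, min_pts) and return a clustering validity score.
    eps, min_pts = params
    return (eps - 0.5) ** 2 + (min_pts - 4) ** 2

def evaluate_population(population, workers=4):
    """Evaluate every individual's fitness concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs,
        # so each individual stays paired with its fitness value.
        return list(pool.map(fitness, population))
```

Because each individual's fitness is independent of the others', the evaluations of one generation can run concurrently without changing the algorithm's results.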

The article is structured as follows: Section 2 reviews previous work. Section 3 presents the incremental DBSCAN clustering algorithm. The proposed pNSGA-II/DBCC approach is presented in Section 4. Section 5 proposes various fitness functions for supervised and unsupervised pNSGA-II/DBCC. The experimental results and the conclusion are presented in Sections 6 and 7, respectively.

Section snippets

Related work

This section reviews the relevant literature on using Evolutionary and Swarm Algorithms (ESAs) to improve the functionality of incremental DBSCAN.

PSO is a practical algorithm for dealing with a wide range of optimization problems [24], [25], [26], [27], [28]. Guan and Yuen [17] introduced a novel combined method called Particle swarm Optimized Density-based Clustering and Classification (PODCC) to identify the parameters used by DBSCAN to provide more accurate

Background

In this section, related technologies and the required notions are reviewed.

Algorithm

The following section details the scheme of the parallel NSGA-II based density-based clustering and classification algorithm.

Objective functions

Multi-objective optimization is the optimization of conflicting objectives within given constraints; in such problems, optimal decisions must trade off two or more conflicting objectives. Four objective functions have been considered for optimization. The first two are internal cluster validity indexes that depend on inherent characteristics of the datasets. The other two measure the violation of existing supervised data; they are also known as the external cluster
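As a concrete illustration of an internal validity index of the kind used as an objective here, below is a small sketch of the Dunn index (smallest inter-cluster distance divided by largest cluster diameter; higher is better). The function names and toy data are illustrative; this is not the paper's implementation.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Dunn index for a partition given as a list of clusters (lists of points).
    Assumes at least two clusters and at least one cluster with two points."""
    # Diameter of each cluster: largest pairwise distance inside it.
    diameters = [max((math.dist(p, q) for p, q in combinations(c, 2)), default=0.0)
                 for c in clusters]
    # Separation between each pair of clusters: smallest cross-cluster distance.
    separations = [min(math.dist(p, q) for p in a for q in b)
                   for a, b in combinations(clusters, 2)]
    return min(separations) / max(diameters)
```

A partition with compact, well-separated clusters scores high, so the index rewards parameter settings that neither merge distinct groups nor shatter them.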

Experimental setup and results

In this research, several experiments have been performed to assess the performance of the introduced pNSGA-II/DBCC algorithm. The clustering and classification results of the pNSGA-II/DBCC have been compared to a standard PSO-based algorithm and some established methods. The results obtained in artificial and real data problems are described in Section 6.4 (Sections 6.4.1 to 6.4.3). Furthermore, the efficiency of the proposed algorithm is compared with a sequential algorithm in Section 6.4.4.

Conclusion

The present article has introduced a novel parallel technique named parallel NSGA-II based Density-Based Clustering and Classification (pNSGA-II/DBCC) to handle the setting of the two global input parameters of incremental DBSCAN. To address this problem, an NSGA-II based method has been used to explore the whole parameter space of incremental DBSCAN. Two objectives have been considered in this paper: maximization of the Dunn index and minimization of the Davies–Bouldin index. On the
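The selection step that balances the two objectives rests on Pareto non-domination, the core comparison inside NSGA-II. Below is a minimal sketch, assuming both objectives are expressed as minimization (e.g., the negated Dunn index and the Davies–Bouldin index); the names and toy objective values are illustrative.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized):
    a is no worse on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the non-dominated subset of (params, objectives) pairs."""
    return [c for c in candidates
            if not any(dominates(o[1], c[1]) for o in candidates if o is not c)]
```

Candidate parameter settings surviving this filter form the front from which the next generation is drawn; no survivor can be improved on one index without worsening the other.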

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Elham Azhir is a Ph.D. student in Computer Engineering at Science and Research Branch, Islamic Azad University, Tehran, Iran from September 2016. She received the B.S. and M.Sc. degree from Qazvin Islamic Azad University, Qazvin, Iran in 2010 and 2014, all in Software Engineering. Her main research interests are query optimization, cloud computing, distributed systems and programming.

References (40)

  • Luchi, D., et al., Sampling approaches for applying DBSCAN to large datasets, Pattern Recognit. Lett. (2019)

  • Sheikholeslami, F., et al., Service allocation in the cloud environments using multi-objective particle swarm optimization algorithm based on crowding distance, Swarm Evol. Comput. (2017)

  • Aghajani, G., et al., Multi-objective energy management in a micro-grid, Energy Rep. (2018)

  • Loh, W.-K., et al., Fast density-based clustering through dataset partition using graphics processing units, Inform. Sci. (2015)

  • Chang, V., Towards data analysis for weather cloud computing, Knowl.-Based Syst. (2017)

  • Salza, P., et al., Speed up genetic algorithms in the cloud using software containers, Future Gener. Comput. Syst. (2019)

  • Han, J., et al., Data Mining: Concepts and Techniques (2011)

  • Pourghebleh, B., et al., Towards efficient data collection mechanisms in the vehicular ad hoc networks, Int. J. Commun. Syst. (2019)

  • Panahi, V., et al., Join query optimization in the distributed database system using an artificial bee colony algorithm and genetic operators, Concurr. Comput.: Pract. Exper. (2019)

  • Kul, G., Similarity metrics for SQL query clustering, IEEE Trans. Knowl. Data Eng. (2018)


    Nima Jafari Navimipour received his B.S. in computer engineering (software engineering) from Tabriz Branch, Islamic Azad University, Tabriz, Iran, in 2007; his M.S. in computer engineering (computer architecture) from Tabriz Branch, Islamic Azad University, Tabriz, Iran, in 2009; and his Ph.D. in computer engineering (computer architecture) from Science and Research Branch, Islamic Azad University, Tehran, Iran, in 2014. His research interests include SDN, cloud computing, grid systems, computational intelligence, evolutionary computing, and wireless networks.

    Mehdi HosseinZadeh received his B.S. degree in computer hardware engineering from Islamic Azad University, Dezful Branch, Iran, in 2003. He received his M.Sc. and Ph.D. degrees in computer system architecture from the Science and Research Branch, Islamic Azad University, Tehran, Iran, in 2005 and 2008, respectively. He is currently an Associate Professor at Iran University of Medical Sciences (IUMS), Tehran, Iran, and his research interests include SDN, information technology, data mining, big data analytics, e-commerce, e-marketing, and social networks.

    Arash Sharifi received the B.S. degree in computer hardware engineering from IAU South Tehran Branch, and the M.S. and Ph.D. degrees in artificial intelligence from IAU Science and Research Branch, in 2007 and 2012, respectively. He is currently head of the computer engineering department of SRBIAU. His current research interests include image processing, machine learning, and deep learning.

    Aso Darwesh received the B.S. in Mathematics from the University of Sulaimani, Iraq, in 2001, the M.S. degree in Computer Science from the University of René Descartes, France, in 2007, and the Ph.D. in Computer Science from the University of Pierre and Marie Curie, France, in 2010. Currently, he is an Associate Professor in the Information Technology Department, University of Human Development, Sulaymaniyah, Iraq. His research interests include serious games, adaptive learning, cognitive diagnosis in e-learning, learning systems, computer networks, network security, and data mining.
