A new swarm-based efficient data clustering approach using KHM and fuzzy logic

  • Methodologies and Application
  • Published in Soft Computing

Abstract

Clustering is a useful technique for creating groups of objects on the basis of their nature: objects in the same group are similar in nature, while they differ from objects in other groups. Clustering has proved its importance in various fields such as information retrieval, bioinformatics and image processing, among many others. In this paper, the particle swarm optimization (PSO) technique is used with K-harmonic means (KHM) for clustering. PSO overcomes limitations of KHM such as its tendency to converge to local optima. Fuzzy logic is also employed to make PSO adaptive in nature by controlling its various parameters. The performance of the proposed approach is validated on five benchmark datasets in terms of inter-cluster distance, intra-cluster distance, F-measure and fitness value. The results of the proposed approach are compared with well-known conventional clustering techniques such as K-means, KHM and fuzzy C-means, along with different state-of-the-art clustering approaches. Two text-based benchmark datasets, CACM and CISI, are also used to test the performance of all clustering approaches. The proposed clustering approach gives better results than the other clustering approaches, as is clear from both the experimental and statistical analyses.
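
As a rough illustration of two of the validation criteria mentioned in the abstract, the sketch below computes intra-cluster and inter-cluster distances for a labelled clustering. The definitions used here (mean point-to-centroid distance and minimum pairwise centroid distance) are common choices assumed for illustration only and may differ from the exact formulas used in the full text.

# Hedged sketch of two evaluation criteria named in the abstract, assuming
# common definitions: intra-cluster distance = mean distance of each point to
# its own centroid; inter-cluster distance = minimum distance between centroids.
# Labels are assumed to be integers 0..k-1. The paper's exact definitions may differ.
import numpy as np

def intra_inter(X, labels):
    centroids = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    # mean distance of every point to the centroid of its own cluster
    intra = np.mean([np.linalg.norm(x - centroids[c]) for x, c in zip(X, labels)])
    # minimum distance over all distinct centroid pairs
    diffs = centroids[:, None, :] - centroids[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2)
    inter = pairwise[np.triu_indices(len(centroids), k=1)].min()
    return intra, inter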

References

  • Abraham A, Das S, Konar A (2006) Document clustering using differential evolution. In: Proceedings of the 2006 IEEE congress on evolutionary computation (CEC 2006), Vancouver, pp 1784–1791

  • Alguliev R, Aliguliyev R (2005) Fast genetic algorithm for clustering of text documents. Artif Intell 3:698–707

  • Aliguliyev R (2006) A clustering method for document collections and algorithm for estimation the optimal number of clusters. Artif Intell 4:651–659

  • Aupetit S, Monmarché N, Slimane M (2007) Hidden Markov models training by a particle swarm optimization algorithm. J Math Model Algorithms 6:175–193

  • Azzag H, Venturini G, Oliver A, Guinot C (2007) A hierarchical ant based clustering algorithm and its use in three real-world applications. Eur J Oper Res 179:906–922

  • Bergh F, Engelbrecht A (2001) Effect of swarm size on cooperative particle swarm optimizers. In: Proceedings of genetic evolutionary computation conference (GECCO-2001), San Francisco, pp 892–899

  • Bezdek J (1974) Fuzzy mathematics in pattern classification. PhD thesis, Cornell University, Ithaca

  • Chang P, Liu C, Fan C (2009) Data clustering and fuzzy neural network for sales forecasting: a case study in printed circuit board industry. Knowl-Based Syst 22(5):344–355

  • Cui X, Potok T, Palathingal P (2005) Document clustering using particle swarm optimization. In: Proceedings of the 2005 IEEE swarm intelligence symposium, Pasadena, pp 186–191

  • Das S, Abraham A, Konar A (2008a) Automatic clustering with a multi-elitist particle swarm optimization algorithm. Pattern Recogn Lett 29:688–699

  • Das S, Abraham A, Konar A (2008b) Automatic clustering using an improved differential evolution algorithm. IEEE Trans Syst Man Cybern Part A Syst Hum 38:218–237

  • ElAlami M (2011) Supporting image retrieval framework with rule base system. Knowl-Based Syst 24(2):331–340

  • Fraley C, Raftery A (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631

  • Garai G, Chaudhuri B (2004) A novel genetic algorithm for automatic clustering. Pattern Recogn Lett 25:173–187

  • Gath I, Geva G (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 11:773–781

  • Güngör Z, Ünler A (2008) K-harmonic means data clustering with tabu search method. Appl Math Model 32:1115–1125

  • Gupta Y, Saini A (2015) An efficient clustering approach based on hybridization of PSO, fuzzy logic and K-harmonic means. In: IEEE workshop on computational intelligence: theories, applications and future directions (WCI). IIT Kanpur

  • Hadavandi E, Shavandi H, Ghanbari A (2010) Integration of genetic fuzzy systems and artificial neural networks for stock price forecasting. Knowl-Based Syst 23(8):800–808

  • Hammerly G, Elkan C (2002) Alternatives to the k-means algorithm that find better clusterings. In: Proceedings of the 11th international conference on information and knowledge management, pp 600–607

  • Han J, Kamber M, Pei P (2006) Data mining: concepts and techniques. Morgan Kaufmann, Los Altos

  • Hartmann V (2005) Ant colony optimization and swarm intelligence: evolving agent swarms for clustering and sorting. In: Proceedings of the 2005 conference on genetic and evolutionary computation (GECCO’05), Washington, DC, pp 217–224

  • Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31:264–323

  • Kalyani S, Swarup K (2011) Particle swarm optimization based K-means clustering approach for security assessment in power systems. Expert Syst Appl 38(9):10839–10846

  • Karypis G, Han E, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. J Comput 32(8):68–75

  • Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis, vol 39. Wiley, London

  • Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of the 1995 IEEE international conference on neural networks, Englewood Cliffs, pp 1942–1948

  • Khan M, Khor S (2004) Web document clustering using a hybrid neural network. Appl Soft Comput 4:423–432

  • Khy S, Ishikawa Y, Kitagawa H (2008) A novelty-based clustering method for on-line documents. World Wide Web 11:1–37

  • Laszlo M, Mukherjee S (2006) A genetic algorithm using hyper-quadtrees for low-dimensional k-means clustering. IEEE Trans Pattern Anal Mach Intell 28:533–543

  • Laszlo M, Mukherjee S (2007) A genetic algorithm that exchanges neighboring centers for k-means clustering. Pattern Recogn Lett 28:2359–2366

  • Li Y, Chung S, Holt J (2008) Text document clustering based on frequent word meaning sequences. Data Knowl Eng 64(1):381–404

  • Liao C, Tseng C, Luarn P (2007) A discrete version of particle swarm optimization for flowshop scheduling problems. Comput Oper Res 34:3099–3111

  • Lin H, Yang F, Kao Y (2005) An efficient GA-based clustering technique. Tamkang J Sci Eng 8(2):113–122

  • MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, Berkeley

  • Martin-Guerrero J, Palomares A, Balaguer-Ballester E, Soria-Olivas E, Gomez-Sanchis J, Soriano-Asensi A (2006) Studying the feasibility of a recommender in a citizen web portal based on user modeling and clustering algorithms. Expert Syst Appl 30(2):299–312

  • Nock R, Nielsen F (2006) On weighting clustering. IEEE Trans Pattern Anal Mach Intell 28:1223–1235

  • Ponomarenko J, Merkulova T, Orlova G, Fokin O, Gorshkov E, Ponomarenko M (2002) Mining DNA sequences to predict sites which mutations cause genetic diseases. Knowl-Based Syst 15(4):225–233

  • Sander J, Ester M, Kriegel M, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Disc 2(2):169–194

  • Sebiskveradze D, Vrabie V, Gobinet C, Durlach A, Bernard P, Ly E, Manfait M, Jeannesson P, Piot O (2011) Automation of an algorithm based on fuzzy clustering for analyzing tumoral heterogeneity in human skin carcinoma tissue sections. Lab Invest 91(5):799–811

  • Shi J, Luo Z (2010) Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples. Comput Biol Med 40(8):723–732

  • Subramanyam V, Sett S (2008) Knowledge-based image retrieval system. Knowl-Based Syst 21(2):89–100

  • Suganthan P (1999) Particle swarm optimizer with neighborhood operator. In: Proceedings of IEEE international conference on evolutionary computation, vol 3, pp 1958–1962

  • Thakare A, Hanchate R (2014) Introducing hybrid model for data clustering using K-harmonic means and Gravitational search algorithms. Int J Comput Appl 88(17):18–22

  • Verma N, Roy A (2014) Self-optimal clustering technique using optimized threshold function. IEEE Syst J 8(4):1213–1226

  • Vesanto W, Alhoniemi E (2000) Clustering of the self-organizing map. IEEE Trans Neural Netw 11(3):586–600

  • Wang W, Yang J, Muntz R (1997) STING: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd international conference on very large databases, Greece, pp 186–195

  • Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678

  • Yang F, Sun T, Zhang C (2009) Efficient hybrid data clustering method based on K-harmonic means and particle swarm optimization. Expert Syst Appl 36(6):9847–9852

  • Zadeh L (1965) Fuzzy sets. Inf Control 8:338–353

  • Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: ACM SIGMOD conference on management of data, Canada, pp 103–114

  • Zhang B, Hsu M, Dayal U (1999) K-harmonic means—a data clustering algorithm. Technical Report HPL-1999-124, Hewlett-Packard Laboratories

  • Zhang B, Hsu M, Dayal U (2000) K-harmonic means. In: International workshop on temporal, spatial and spatio-temporal data mining. TSDM 2000, Lyon

Author information

Corresponding author

Correspondence to Ashish Saini.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Appendices

Appendix A: K-harmonic means clustering algorithm

This algorithm was proposed by Zhang et al. (1999, 2000), and variants of KHM were later proposed by Hammerly and Elkan (2002). KHM assigns a dynamic weight to each data point based on the harmonic mean of the distances from that data point to all centers. The harmonic average assigns a large weight to a data point that is not close to any center and a small weight to a data point that is close to one or more centers. Therefore, KHM is less sensitive to initialization than K-means. Before discussing the algorithm, the notations used in KHM are described in Table 12:

Table 12 Notations and descriptions used in KHM

The details of the KHM clustering algorithm are as follows:

  1. Initially, select the centers randomly.

  2. Determine the objective function value by (A.1), defined below:

     $$ \mathrm{KHM}\left( X, C \right) = \sum\limits_{i = 1}^{n} \frac{k}{\sum\nolimits_{j = 1}^{k} \frac{1}{\left\| x_{i} - c_{j} \right\|^{p}}} $$
     (A.1)

     where p is an input parameter, typically p ≥ 2.

  3. Compute the membership value m(c_j/x_i) of each data point x_i in each center c_j by (A.2), defined below:

     $$ m\left( c_{j} / x_{i} \right) = \frac{\left\| x_{i} - c_{j} \right\|^{-p-2}}{\sum\nolimits_{l = 1}^{k} \left\| x_{i} - c_{l} \right\|^{-p-2}}. $$
     (A.2)

  4. Compute the weight w(x_i) of each data point x_i by (A.3) as follows:

     $$ w\left( x_{i} \right) = \frac{\sum\nolimits_{j = 1}^{k} \left\| x_{i} - c_{j} \right\|^{-p-2}}{\left( \sum\nolimits_{j = 1}^{k} \left\| x_{i} - c_{j} \right\|^{-p} \right)^{2}}. $$
     (A.3)
  5. Re-compute the location of each center c_j from all the data points x_i according to their memberships and weights, as defined by (A.4):

     $$ c_{j} = \frac{\sum\nolimits_{i = 1}^{n} m\left( c_{j} / x_{i} \right)\, w\left( x_{i} \right)\, x_{i}}{\sum\nolimits_{i = 1}^{n} m\left( c_{j} / x_{i} \right)\, w\left( x_{i} \right)} $$
     (A.4)

  6. Repeat steps 2–5 for a predefined number of iterations or until KHM(X, C) does not change significantly.

  7. Assign each data point x_i to the cluster j with the largest m(c_j/x_i).

It has been demonstrated that KHM is essentially insensitive to the initialization of the centers (Zhang et al. 1999), although it tends to converge to local optima (Güngör and Ünler 2008).
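
For concreteness, the following is a minimal sketch of steps 1–7 above in Python, assuming NumPy, Euclidean distances and a small constant guarding against division by zero; the function name khm, the default value of p and the convergence tolerance are illustrative assumptions rather than the paper's implementation.

# Minimal sketch of the KHM steps in Appendix A (illustrative, not the
# authors' code). X is an (n, d) data matrix, k the number of clusters.
import numpy as np

def khm(X, k, p=3.5, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]            # step 1: random centers
    prev_obj = np.inf
    for _ in range(max_iter):
        # pairwise distances ||x_i - c_j||, clipped to avoid division by zero
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)

        obj = np.sum(k / np.sum(dist ** (-p), axis=1))           # step 2: KHM(X, C), Eq. (A.1)

        d_pow = dist ** (-p - 2)
        m = d_pow / d_pow.sum(axis=1, keepdims=True)             # step 3: memberships, Eq. (A.2)
        w = d_pow.sum(axis=1) / (dist ** (-p)).sum(axis=1) ** 2  # step 4: weights, Eq. (A.3)

        mw = m * w[:, None]                                      # step 5: update centers, Eq. (A.4)
        centers = (mw[:, :, None] * X[:, None, :]).sum(axis=0) / mw.sum(axis=0)[:, None]

        if abs(prev_obj - obj) < tol:                            # step 6: stop when KHM(X, C) stabilizes
            break
        prev_obj = obj

    labels = np.argmax(m, axis=1)                                # step 7: assign by largest m(c_j / x_i)
    return centers, labels, obj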

Appendix B: Particle swarm optimization

PSO is an evolutionary, population-based approach that has been successfully applied in science and in many practical fields (Aupetit et al. 2007; Liao et al. 2007). It is a sociologically inspired optimization algorithm. Each particle in PSO represents an individual, and all the particles together form a swarm. The solution space of a problem is formulated as the search space in PSO, and each position in the search space represents a candidate solution to the problem. Each particle moves according to its velocity, and the movement of a particle is computed by (B.1) and (B.2):

$$ x_{i}\left( t + 1 \right) \leftarrow x_{i}\left( t \right) + v_{i}\left( t + 1 \right) $$
(B.1)
$$ v_{i}\left( t + 1 \right) \leftarrow \omega\, v_{i}\left( t \right) + c_{1}\,\mathrm{rand}_{1}\left( pbest_{i}\left( t \right) - x_{i}\left( t \right) \right) + c_{2}\,\mathrm{rand}_{2}\left( gbest\left( t \right) - x_{i}\left( t \right) \right) $$
(B.2)

where x_i(t) is the position of particle i at time t, v_i(t) is its velocity at time t, ω is an inertia weight that scales the previous velocity, rand_1 and rand_2 are uniformly distributed random numbers in [0, 1], pbest_i(t) is the best position found by particle i so far, gbest(t) is the best position found by the whole swarm so far, and c_1 and c_2 are acceleration coefficients that scale the influence of pbest_i(t) and gbest(t), respectively.
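
A minimal sketch of these update rules is given below, assuming a generic real-valued objective and commonly used parameter settings (ω = 0.72, c1 = c2 = 1.49); the function name pso, the bounds and the sphere objective in the usage example are illustrative assumptions, not the configuration used in the paper.

# Illustrative PSO sketch implementing the update rules (B.1) and (B.2).
import numpy as np

def pso(objective, dim, n_particles=30, max_iter=200,
        w=0.72, c1=1.49, c2=1.49, bounds=(-10.0, 10.0), seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))      # particle positions
    v = np.zeros((n_particles, dim))                      # particle velocities
    pbest = x.copy()                                      # best position of each particle
    pbest_val = np.apply_along_axis(objective, 1, x)
    gbest = pbest[np.argmin(pbest_val)].copy()            # best position of the swarm

    for _ in range(max_iter):
        # random coefficients drawn per dimension (a common choice)
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Eq. (B.2): inertia, cognitive and social terms
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        # Eq. (B.1): position update, kept inside the bounds
        x = np.clip(x + v, lo, hi)

        vals = np.apply_along_axis(objective, 1, x)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        if pbest_val.min() < objective(gbest):
            gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, objective(gbest)

# Usage example: minimise the sphere function in 5 dimensions.
best_x, best_f = pso(lambda z: float(np.sum(z ** 2)), dim=5)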

Cite this article

Gupta, Y., Saini, A. A new swarm-based efficient data clustering approach using KHM and fuzzy logic. Soft Comput 23, 145–162 (2019). https://doi.org/10.1007/s00500-018-3514-1
