Applied Soft Computing

Volume 46, September 2016, Pages 886-923

Hadoop based uncertain possibilistic kernelized c-means algorithms for image segmentation and a comparative analysis

https://doi.org/10.1016/j.asoc.2016.01.045

Abstract

Over the years, data clustering algorithms have been used for image segmentation. Owing to the presence of uncertainty in real-life datasets, several uncertainty based data clustering algorithms have been developed; the c-means clustering algorithms form one such family. Starting with the fuzzy c-means (FCM), a basic subfamily of this family comprises rough c-means (RCM), intuitionistic fuzzy c-means (IFCM) and their hybrids, rough fuzzy c-means (RFCM) and rough intuitionistic fuzzy c-means (RIFCM). In this basic subfamily, the Euclidean distance was used to measure the similarity of data. However, the subfamily of algorithms obtained by replacing the Euclidean distance with kernel based similarities produced better results; in particular, these algorithms can viably cluster data points which are linearly inseparable in the original input space. During this period, Krishnapuram and Keller inferred that the membership constraints in rudimentary uncertainty based clustering techniques such as fuzzy c-means impart them a probabilistic nature, and hence suggested a possibilistic version; all the other member algorithms of the basic subfamily have since been extended to incorporate this new notion. Currently, the use of image data is growing vigorously and constantly, amounting to huge figures and leading to big data. Moreover, since image segmentation happens to be one of the most time consuming processes, industries need algorithms which can solve this problem at a rapid pace and with high accuracy. In this paper, we propose to combine the notions of kernels and the possibilistic approach in the distributed environment provided by Apache™ Hadoop. We integrate this combined notion with the map-reduce paradigm of Hadoop and put forth three novel algorithms, Hadoop based possibilistic kernelized rough c-means (HPKRCM), Hadoop based possibilistic kernelized rough fuzzy c-means (HPKRFCM) and Hadoop based possibilistic kernelized rough intuitionistic fuzzy c-means (HPKRIFCM), and study their efficiency in image segmentation. We compare their running times and analyze their efficiencies against the corresponding algorithms from the other three subfamilies on four different types of images, three different kernels and six different efficiency measures: the Davies-Bouldin index (DB), Dunn index (D), alpha index (α), rho index (ρ), alpha star index (α*) and gamma index (γ). Our analysis shows that Hadoop based possibilistic kernelized rough intuitionistic fuzzy c-means with the hyper-tangent kernel is the best of these clustering algorithms for image segmentation. Also, the times taken by the proposed algorithms to render segmented images are drastically lower than those of the other algorithms. The algorithms have been implemented in Java; for the proposed algorithms we used the Hadoop framework installed on CentOS, and for statistical plotting we used matplotlib (a Python library).

Introduction

Clustering is the task of grouping data points such that similar points are grouped together and dissimilar points fall into different subclasses. Although classification also splits data points into separate groups, clustering is an altogether distinct technique: in contrast to classification, where the algorithms first learn the association between attributes and target variables, clustering is performed solely on the basis of the similarity of data points to one another. It is highly important to examine the huge quantities of data imported from various sources and explore them so that information of interest can be extracted and utilized for specialized tasks; more important still is conducting this whole operation efficiently and effectively to ensure the maximum output for the task performed. Therefore, methodologies that can filter datasets and segregate the classes of interest are essential. For analyzing statistical data, clustering algorithms are useful in fields such as medical imaging, search engines, pattern recognition, text mining and voice mining, and clustering is one of the vital tasks in machine learning. A comprehensive set of clustering techniques that comply with the necessities of each of these fields has been proposed over time.

One of the cardinal tasks associated with computer vision and image analysis is image segmentation. Its main aim is the segregation of an image into disjoint sections possessing common and consistent features such as hue, shade and intensity. Because of the image acquisition process, the boundaries between entities in images are often indistinct and partial; the definitions of entities are in general not certain, and the information pertaining to entities in different sections may be ambiguous. Theories of fuzzy, intuitionistic fuzzy and rough sets are ideal for handling such imprecise and vague conditions; hence, uncertainty based clustering algorithms are crucial in the segmentation of these kinds of images. In image analysis, clustering techniques can support segmentation in applications such as diagnosing brain tumors from magnetic resonance imaging (MRI) scans, blood cancer detection by segmenting cancerous blood cells from healthy cells, fake coin detection and segregating drought prone areas from water valleys.

The first of the clustering algorithms is hard c-means (HCM), which is based on crisp concepts. The models that handle uncertain concepts are fuzzy sets, introduced by Zadeh in 1965 [13], intuitionistic fuzzy sets, introduced by Atanassov in 1986 [12], and rough sets, introduced by Pawlak in 1982 [31]. Using these concepts, several clustering algorithms have been introduced in the literature, such as fuzzy c-means (FCM) [8], [10], intuitionistic fuzzy c-means (IFCM) [29] and rough c-means (RCM) [17]. The imprecise models of fuzzy sets and rough sets were supposed to be rivals until 1990, when Dubois and Prade [6] showed that, far from being rivals, the two models complement each other. Moving a step further, they combined the two models to frame hybrid models in the form of rough fuzzy sets and fuzzy rough sets [3], [27]. In general, hybrid models have been observed to be more efficient than their individual components; this is achieved by taking the strong points of each constituent model and neglecting their weaker points. Continuing this hybridization process, intuitionistic fuzzy rough sets and rough intuitionistic fuzzy sets [24] were introduced later. Using these hybrid models, rough fuzzy c-means (RFCM) [18], [27] and rough intuitionistic fuzzy c-means (RIFCM) [21] were introduced, and it was experimentally established that RIFCM is the best of this sequence for both numeric and image data. In all the above algorithms, the Euclidean distance between elements was used to measure similarity. A major problem with this metric is that it can only segregate points that are linearly separable. To overcome this lacuna, kernel functions are used in later algorithms, with the advantage that data points are separated by non-linear separators [7], [30]. Several algorithms have been developed in this direction: kernel based fuzzy c-means (KFCM), kernel based intuitionistic fuzzy c-means (KIFCM), kernel based rough c-means (KRCM), kernel based rough fuzzy c-means (KRFCM) and kernel based rough intuitionistic fuzzy c-means (KRIFCM) [4]. Again, using different types of datasets, it was established that KRIFCM is the best in this family and that each member of the kernel based family is better than the corresponding Euclidean distance based algorithm. In the analysis of the efficiency of these algorithms, two standard indices, the Davies-Bouldin (DB) [5] and Dunn (D) [11] indices, are used. Various applications of uncertain hybrid clustering techniques can be found in Refs. [1], [4], [19], [20]. Besides these, specific methods for image segmentation in the literature include the modified optimum boundary point detection (OPBD) method presented by Agrawal et al. [25], who used OPBD to obtain the final centroids in the FCM algorithm and concluded that the proposed method gives improved results compared with genetic algorithm (GA) and particle swarm optimization (PSO) based algorithms. Kaur et al. [16] introduced a KIFCM algorithm and described its effectiveness in the segmentation of noisy medical images. Talukdar et al. [15] used a fuzzy inference system (FIS) involving image segmentation to segregate blood cells and produce an automated blood cancer detection technique. Modi et al. [26] proposed a coin recognition system using neural networks that was more than 97% accurate and capable of recognizing a coin from both sides. Berhan et al. [9] extracted information from satellite images and utilized it for decisions related to drought affected locations.
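
As an illustration of how a kernel-induced distance replaces the Euclidean metric in these algorithms, here is a minimal Java sketch; the Gaussian kernel and all identifiers are our own illustrative choices, not the paper's implementation:

```java
// Sketch: kernel-induced distance replacing the Euclidean metric in c-means.
// The kernel choice (Gaussian) and names are illustrative assumptions.
public final class KernelDistance {

    // Gaussian kernel K(x, y) = exp(-||x - y||^2 / sigma^2), so K(x, x) = 1.
    static double gaussianKernel(double[] x, double[] y, double sigma) {
        double sq = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sq += d * d;
        }
        return Math.exp(-sq / (sigma * sigma));
    }

    // In feature space, ||phi(x) - phi(v)||^2 = K(x,x) - 2K(x,v) + K(v,v),
    // which reduces to 2(1 - K(x,v)) for any kernel with K(x,x) = 1.
    static double kernelDistanceSq(double[] x, double[] v, double sigma) {
        return 2.0 * (1.0 - gaussianKernel(x, v, sigma));
    }

    public static void main(String[] args) {
        double[] p = {120.0, 64.0, 33.0};   // e.g. an RGB pixel
        double[] c = {118.0, 70.0, 30.0};   // a cluster centroid
        System.out.println(kernelDistanceSq(p, c, 150.0));
    }
}
```

Because K(x, x) = 1 for such kernels, the feature-space distance needs only one kernel evaluation per point-centroid pair, so swapping it in for the Euclidean distance changes neither the structure nor the cost profile of the c-means iteration.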

Analyzing the FCM algorithm, one observes that the cumulative membership of every object over the classes is 1. This condition prevents meaningless outcomes by ruling out a solution in which all memberships are 0, and it also enables memberships to be interpreted as probabilities. On the other hand, the constraint ties the membership values to one another, which makes it inappropriate for situations where membership values symbolize "typicality". A different approach was provided by Krishnapuram and Keller [22], [23], who used the notion of typicality so that the sum of the memberships of any object is only constrained to lie between 0 and a constant c; the membership rating of an object in a cluster is called the typicality of that object. Using this concept, Pal et al. in 1997 [14] introduced a fuzzy possibilistic c-means (FPCM) algorithm which produces typicality as well as membership values during clustering. In FPCM, the constraint applied to typicality ensures that, for each cluster, the sum of the typicalities over all instances is 1. This makes FPCM less susceptible to the coincident clusters problem associated with PCM and the probabilistic restriction problem associated with FCM. Later, PRCM, PRFCM [3] and PRIFCM [2] were introduced and studied; as in the earlier cases, in most situations PRIFCM has been observed to outperform the other algorithms in this family.
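
This contrast can be made explicit in standard notation (a sketch using common formulations from the literature; the paper's own equations appear in Section 2). With c clusters, n objects, memberships u_ij and typicalities t_ij:

$$\text{FCM:}\quad \sum_{i=1}^{c} u_{ij} = 1 \;\;\forall j, \qquad \text{possibilistic:}\quad t_{ij}\in[0,1],\;\; 0 < \sum_{i=1}^{c} t_{ij} \le c,$$

$$\text{FPCM:}\quad \sum_{i=1}^{c} u_{ij} = 1 \;\;\forall j \quad\text{and}\quad \sum_{j=1}^{n} t_{ij} = 1 \;\;\forall i.$$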

Instead of merely displaying images to users, treating huge numbers of images as data is a new idea. Storage and processing of images has been considerably expensive, and a major problem today is to process millions of images in the shortest amount of time at very low cost. Industries need cheap frameworks that can perform this cumbersome task, and in this regard Apache™ Hadoop [32], [35] comes to the rescue. Apache™ Hadoop is an open source framework for distributed computing over very large datasets on commodity hardware, known for its capability to handle hardware failures efficiently. Each system in the distributed environment is known as a node, and the combination of all the nodes in the system is called a cluster. The framework is developed in Java and follows the map-reduce paradigm for processing data. In this paradigm, the first step is fragmentation of the data, which distributes it over multiple nodes in the cluster. Each node then performs its task separately, and the processed data is integrated as a whole in the reduce step. Problems such as node failures are handled automatically by the framework itself during processing. All this makes Hadoop a tremendously versatile and flexible platform.
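
The following is a minimal sketch of how one c-means iteration can be organized in this paradigm. It rests on our own assumptions, not the paper's code: points arrive as comma-separated text lines, and the current centroids are broadcast through the job Configuration; a crisp nearest-centroid assignment is used for brevity, whereas the proposed algorithms would emit possibilistic membership information instead.

```java
// Minimal sketch of one hard c-means iteration on Hadoop map-reduce.
// Assumptions (illustrative, not from the paper): points are comma-separated
// text lines; centroids are broadcast via the Configuration key "centroids".
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CMeansIteration {

    public static class AssignMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] centroids;

        @Override
        protected void setup(Context ctx) {
            // e.g. "1.0,2.0;3.0,4.0" encodes two 2-d centroids
            String[] rows = ctx.getConfiguration().get("centroids").split(";");
            centroids = new double[rows.length][];
            for (int i = 0; i < rows.length; i++) {
                String[] f = rows[i].split(",");
                centroids[i] = new double[f.length];
                for (int j = 0; j < f.length; j++)
                    centroids[i][j] = Double.parseDouble(f[j]);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            double[] p = new double[f.length];
            for (int j = 0; j < f.length; j++) p[j] = Double.parseDouble(f[j]);
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < centroids.length; i++) {
                double d = 0.0;
                for (int j = 0; j < p.length; j++) {
                    double t = p[j] - centroids[i][j];
                    d += t * t;
                }
                if (d < bestDist) { bestDist = d; best = i; }
            }
            ctx.write(new IntWritable(best), value); // fragment-local work
        }
    }

    public static class CentroidReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double[] sum = null;
            long n = 0;
            for (Text v : values) {
                String[] f = v.toString().split(",");
                if (sum == null) sum = new double[f.length];
                for (int j = 0; j < f.length; j++)
                    sum[j] += Double.parseDouble(f[j]);
                n++;
            }
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < sum.length; j++) {
                if (j > 0) sb.append(',');
                sb.append(sum[j] / n);  // mean of assigned points
            }
            ctx.write(key, new Text(sb.toString())); // new centroid
        }
    }
}
```

Each mapper works only on its own input fragment and the reducers recompute the centroids; a driver program would re-broadcast the updated centroids and repeat until convergence.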

One of the most challenging aspects of data mining is the analysis of unstructured data. Image data is currently growing at a brisk rate, and we need algorithms that can find meaningful insights in this growing volume of primarily unstructured data. A medium resolution satellite image of 2000 × 2000 pixels contains 4 million data points; if even 100,000 such images have to be analyzed collectively, tremendous processing power is required [33]. If the analysis is primarily a clustering task, then high accuracy becomes an equally important requirement. Hence, if efficient machine learning techniques can be applied in a distributed framework, we will be able to extract resourceful information from huge volumes of images in considerably less time. Not all algorithms can be devised to work with the map-reduce paradigm of Hadoop; however, an algorithm organized to harmonize well with this paradigm adds great value to processing efficiency.
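
Concretely, the figures quoted above amount to

$$2000 \times 2000 = 4\times 10^{6}\ \text{pixels per image},\qquad 10^{5}\ \text{images}\times 4\times 10^{6}\ \text{pixels} = 4\times 10^{11}\ \text{data points}.$$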

Moving a step further in the hybridization of techniques and models, the kernel as a similarity measure, the possibilistic approach and the uncertainty based models (fuzzy sets, intuitionistic fuzzy sets and rough sets) can be combined. If the resulting hybrid is made to synchronize with the distributed architecture of Hadoop, an immensely efficient clustering technique can be realized. To this end, in this paper we introduce three such algorithms: Hadoop based possibilistic kernelized rough c-means (HPKRCM), Hadoop based possibilistic kernelized rough fuzzy c-means (HPKRFCM) and Hadoop based possibilistic kernelized rough intuitionistic fuzzy c-means (HPKRIFCM). We also try three variations of each of these algorithms by taking three different kernels: the radial basis, Gaussian and hyper-tangent kernels. The basic aim is to compare the efficiency of the algorithms and find out which kernel is most appropriate for which real-life applications, and also to compare these algorithms with the basic hybrid algorithms discussed earlier. To make our study of images extensive, we have selected images from four different categories of real-life set-ups: a metal coin, an MRI image, cancerous blood cells and a satellite image.
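
For reference, commonly used forms of these three kernels are sketched below (the paper's exact parameterizations appear in Section 2; the radial basis form shown here, with tunable exponents a and b, reduces to the Gaussian for a = 1, b = 2). All satisfy K(x, x) = 1, so the kernel-induced squared distance is 2(1 − K(x, y)):

$$K_{\text{RB}}(x,y)=\exp\!\left(-\frac{\sum_{i=1}^{d}\lvert x_i^{\,a}-y_i^{\,a}\rvert^{b}}{\sigma^{2}}\right),\quad K_{\text{G}}(x,y)=\exp\!\left(-\frac{\lVert x-y\rVert^{2}}{\sigma^{2}}\right),\quad K_{\text{HT}}(x,y)=1-\tanh\!\left(\frac{\lVert x-y\rVert^{2}}{\sigma^{2}}\right).$$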

For the comparison of the efficiencies of the algorithms we use six measuring indices, namely DB, D, α, ρ, α* and γ [5], [11], [18]. The first two are general indices for any clustering algorithm, whereas the other four apply to rough set based clustering algorithms. To bring further variety to the experiments, we consider two cases for the number of clusters to be formed, taking it to be 4 or 5. Another aspect of our study is that we provide pixel clustering images for a brain MRI image to show the cluster formation in each of the cases taken for comparison. Among the several observations obtained, we find that the HPKRIFCM algorithm with the hyper-tangent kernel performs the best for MRI images.
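
For completeness, the two general indices have the standard forms below (the rough-set-specific indices α, ρ, α* and γ are defined in Section 2 of the paper); here v_i is the centroid of cluster C_i, S_i its average within-cluster distance, d(·,·) an inter-cluster distance and Δ(C_k) the diameter of C_k. A lower DB value and a higher D value indicate better clustering:

$$\mathrm{DB}=\frac{1}{c}\sum_{i=1}^{c}\max_{j\neq i}\frac{S_i+S_j}{d(v_i,v_j)},\qquad \mathrm{D}=\min_{1\le i\le c}\;\min_{j\neq i}\;\frac{d(C_i,C_j)}{\max_{1\le k\le c}\Delta(C_k)}.$$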

The paper comprises five sections. In the next section, we present the notations and definitions of the various concepts, similarity functions, the map-reduce paradigm and the accuracy measures which form an integral part of our study. In Section 3, we introduce the three algorithms HPKRCM, HPKRFCM and HPKRIFCM; it is noted that the other algorithms can be obtained as special cases of these three. A major part of our work, the experimentation and analysis of the results obtained from the different algorithms, inputs and kernels, is then presented with critical analysis, along with diagrammatic representations in the form of bar diagrams for easy visualization and comparison of the results. In Section 5, we conclude the work done in the paper, followed by an extensive bibliography of the sources referred to during its compilation.

Section snippets

Notations and definitions

In this section we introduce some definitions and notations to be used in this paper, beginning with the three uncertainty based models used in the next subsection.

Kernelized c-means algorithms with the proposed and ordinary approaches

Firstly, the notion of the possibilistic approach in c-means techniques is described, and then its integration with Hadoop's map-reduce paradigm is elaborated. The proposed HPKRCM, HPKRFCM and HPKRIFCM algorithms are then described, followed by brief descriptions of the ordinary approaches, namely KRCM, KRFCM and KRIFCM, in terms of their corresponding Hadoop based possibilistic counterparts.

In rudimentary clustering techniques like FCM, the membership of a data point to a particular cluster is…

Results and analysis

Development has been carried out in Java with the Eclipse Luna IDE. Implementation of the ordinary approaches was conducted on a Dell Inspiron machine powered by a 4th Generation Intel(R) Core™ i3-4170 processor, 4 GB of memory and a 400 GB hard disk. The deployment of the proposed algorithms was carried out on a 4 node cluster with 4 data nodes, 1 node working as master node and data node simultaneously. Each node comprised a Dell Inspiron machine powered by a 4th Generation Intel(R) Core™ i3-4170…

Conclusion

Recently, the prospect of developing more refined and powerful algorithms to index, categorize and operate on images has posed a remarkable challenge. This critical problem has been intensified by the outburst of image data on the internet. In this work, we have introduced three novel algorithms, HPKRCM, HPKRFCM and HPKRIFCM, which are based on the map-reduce paradigm of Apache™ Hadoop. The algorithms have been obtained as a hybridization of the imprecise models of fuzzy sets, intuitionistic fuzzy sets,…

References (35)

  • E.H. Ruspini

    A new approach to clustering

    Inf. Control

    (1969)
  • K.T. Atanassov

    Intuitionistic fuzzy sets

    Fuzzy Sets Syst.

    (1986)
  • L.A. Zadeh

    Fuzzy sets

    Inf. Control

    (1965)
  • S. Agrawal et al.

    A study on fuzzy clustering for magnetic resonance brain image segmentation using soft computing approaches

    Appl. Soft Comput.

    (2014)
  • B.K. Tripathy et al.

    Data Clustering Algorithms Using Rough Sets. Handbook of Research on Computational Intelligence for Engineering, Science, and Business

    (2012)
  • B.K. Tripathy et al.

    On PRIFCM algorithm for data clustering, image segmentation and comparative analysis

  • B.K. Tripathy et al.

    Possibilistic rough fuzzy c-means algorithm in data clustering and image segmentation

  • B.K. Tripathy et al.

    On kernel based rough intuitionistic fuzzy c-means algorithm and a comparative analysis

    (2014)
  • D.L. Davies et al.

    A cluster separation measure

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1979)
  • D. Dubois et al.

    Rough fuzzy sets model

    Int. J. Gen. Syst.

    (1990)
  • D. Zhang et al.

    Fuzzy clustering using kernel method

  • G. Berhan et al.

    Using satellite images for drought monitoring: a knowledge discovery approach

    J. Strateg. Innov. Sustain.

    (2011)
  • J.C. Bezdek

    Pattern Recognition with Fuzzy Objective Function Algorithms

    (1981)
  • J.C. Dunn

    A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters

    (1973)
  • N.R. Pal et al.

    A possibilistic fuzzy c-means clustering algorithm

    IEEE Trans. Fuzzy Syst.

    (2005)
  • N.A. Talukdar et al.

    Blood cancer detection using image processing based on fuzzy system

    Int. J. Adv. Res. Comput. Sci. Softw. Eng.

    (2014)
  • P. Kaur et al.

    A robust kernelized intuitionistic fuzzy c-means clustering algorithm in segmentation of noisy medical images

    Pattern Recogn. Lett.

    (2013)

B.K. Tripathy is a senior professor in SCSE, VIT University, Vellore, India. He has received fellowships from the UGC, DST, SERC and DOE of the Govt. of India. He has published more than 300 technical papers and has produced 21 PhDs, 13 MPhils and 2 MS (by research) under his supervision. Dr. Tripathy has published a textbook on soft computing and has edited two research volumes for IGI publications. He is a life-time/senior member of IEEE, ACM, IRSS, CSI and IMS. He is an editorial board member/reviewer of more than 60 journals. His research interests include fuzzy sets and systems, rough sets and knowledge engineering, data clustering, social network analysis, soft computing, granular computing, content based learning, neighborhood systems, soft set theory and applications, multiset theory, list theory and multi-criteria decision making.

Dishant Mittal received the BTech degree in computer science and engineering in 2015 and is currently working for Johnson Controls, India. His research interests include artificial intelligence, image processing, machine learning and data analytics.
