Genetic algorithm for text clustering using ontology and evaluating the validity of various semantic similarity measures

https://doi.org/10.1016/j.eswa.2008.12.046

Abstract

This paper proposes a self-organized genetic algorithm for text clustering based on an ontology method. A common problem in text clustering is that a document is represented as a bag of words, so conceptual similarity is ignored. We take advantage of thesaurus-based and corpus-based ontologies to overcome this problem. However, the traditional corpus-based method is rather difficult to handle. A transformed latent semantic indexing (LSI) model, which can appropriately capture associative semantic similarity, is proposed and demonstrated as the corpus-based ontology in this article. To investigate how ontology methods can be used effectively in text clustering, two hybrid strategies using various similarity measures are implemented. Experimental results show that our genetic algorithm in conjunction with the ontology strategy, which combines the transformed LSI-based measure with the thesaurus-based measure, clearly outperforms the same algorithm with traditional similarity measures. Our clustering algorithm also improves performance in comparison with the standard GA and k-means in the same similarity environments.

Introduction

With the abundance of text documents available on the internet, the automatic partitioning of texts into previously unseen categories ranks high on the priority list for Information Retrieval (IR) and Pattern Recognition. However, the polysemy and synonymy of words in natural language have always been a challenge in the fields of IR and data mining. In many cases, humans have little difficulty in determining the intended meaning of an ambiguous word, while it is extremely difficult to replicate this process computationally. One main reason is that existing retrieval solutions only relate documents that use identical terminology and ignore the conceptual similarity of terms.

To address this problem, we first turn to clustering. Clustering is a popular unsupervised classification technique that partitions the input space into K regions based on some similarity or dissimilarity metric. The partition is made such that patterns within a group are more similar to each other than to patterns belonging to different groups (Frigui and Krishnapuram, 1999, Koontz et al., 1975a, Koontz et al., 1975b). Clusters are formed at run time during the partition process rather than being pre-defined as in text categorization, which commonly refers to the supervised partitioning of documents into “labeled” sets (Xia, Wang, & Yoshida, 2006). Document clustering is both difficult and intensively studied in the literature. A branch and bound algorithm uses a tree search technique to search the entire solution space (Koontz et al., 1975a, Koontz et al., 1975b). It employs a criterion for eliminating subtrees that do not contain the optimal result. In this scheme, the number of nodes to be searched becomes huge as the dataset grows. The k-means algorithm, one of the most widely used, attempts to partition the data into a fixed number of clusters K known in advance (Selim & Ismail, 1984). It is an iterative hill-climbing algorithm, and its solution may be sub-optimal, depending on the choice of the initial clustering distribution. Since stochastic optimization approaches can avoid convergence to a local optimum, they can be used to search for a globally optimal solution. The genetic algorithm (GA) belongs to the family of search techniques that mimic the principles of natural selection and heredity. It searches complex, large, and multimodal landscapes and provides near-optimal solutions for an objective or fitness function (Bandyopadhyay et al., 2004, Maulik and Bandyopadhyay, 2000).

However, most of these clustering algorithms rely solely on the vector space model (VSM) to represent text. That is, each unique term in the vocabulary represents one dimension in the feature space. The bag-of-words representation used by these clustering methods is often unsatisfactory because it ignores relationships between important terms that do not co-occur literally. This is due to the nature of text, where the same concept can be expressed by many different words and words can have ambiguous meanings. Moreover, the direct representation of text lacks the more general concepts that can help identify related topics. For example, traditional clustering algorithms may not relate a document about “canine” to a document about “feline” if “canine” and “feline” are the only terms in the respective vectors. But if we add the more general concept “carnivore” to both documents, their semantic relationship is revealed. Thus, a document clustering algorithm should be regarded as a data clustering method combined with an appropriate document similarity measure.
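The “canine” and “feline” example can be made concrete with a small sketch. It assumes plain term-frequency vectors and cosine similarity; the `HYPERNYMS` table and the helper names are hypothetical and stand in for a real thesaurus lookup, so this is an illustration of the idea rather than the representation used in the paper.

```python
import math
from collections import Counter

# Hypothetical hypernym table for illustration only; a real system would
# look these relations up in a thesaurus such as WordNet.
HYPERNYMS = {"canine": "carnivore", "feline": "carnivore"}


def bag_of_words(tokens):
    """Term-frequency vector stored as a Counter."""
    return Counter(tokens)


def enrich(tokens):
    """Append the (assumed) more general concept of each known token."""
    return tokens + [HYPERNYMS[t] for t in tokens if t in HYPERNYMS]


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_a = ["canine"]
doc_b = ["feline"]

# No shared term, so the plain bag-of-words similarity is zero.
plain = cosine(bag_of_words(doc_a), bag_of_words(doc_b))

# After adding "carnivore" to both documents they share one dimension.
enriched = cosine(bag_of_words(enrich(doc_a)), bag_of_words(enrich(doc_b)))
```

Here `plain` is zero because the two vectors have no dimension in common, while `enriched` is roughly one half because “carnivore” now appears in both.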

In this paper we propose a modified genetic algorithm based on ontology for text clustering. We take advantage of a thesaurus-based ontology and a corpus-based ontology to provide a more accurate assessment of the similarity between documents. The lexical taxonomy WordNet is organized as a tree-like hierarchy that goes from many specific terms at the lower levels to a few generic terms at the top (Hotho et al., 2003, Miller, 1995). We can use its hierarchical structure and broad-coverage taxonomy as the thesaurus-based ontology. In addition, a novel transformation of the original latent semantic indexing (LSI) model is proposed and demonstrated as the corpus-based ontology, which can appropriately depict associative semantic relationships in this study.
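To illustrate how a tree-like taxonomy can yield a similarity score, the following sketch counts is-a edges between two terms. Edge counting is one of the simplest thesaurus-based measures; the text above does not state which measure the paper uses, and the `PARENT` table is a toy taxonomy invented for this example, not an extract of WordNet.

```python
# Toy is-a taxonomy (child -> parent). The entries are illustrative and are
# not taken from WordNet itself.
PARENT = {
    "dog": "canine",
    "cat": "feline",
    "canine": "carnivore",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def ancestors(term):
    """Return [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Number of is-a edges on the shortest path between two terms."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    # The first node on a's upward chain that also lies on b's chain is
    # their lowest common ancestor in a tree.
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("terms share no common ancestor")


def path_similarity(a, b):
    """Map the edge count into (0, 1]; identical terms score 1."""
    return 1.0 / (1.0 + path_length(a, b))
```

In this toy hierarchy, “dog” and “cat” meet at “carnivore” after two edges each, so they are closer to each other than either is to the root “animal”.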

A variable-string-length GA that uses gene indices to encode chromosomes is developed to find the proper number of clusters. In addition, considering the interplay between population diversity and selective pressure, a self-organized evolution process is put forward in this article.
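A minimal sketch of a variable-length GA for clustering follows. Everything in it is an assumption made for illustration: each gene is taken to be the index of a data point acting as a cluster center, the Davies-Bouldin index (lower is better) serves as fitness, selection is plain truncation with elitism, and 2-D points replace document vectors. The self-organized evolution process mentioned above is not reproduced.

```python
import math
import random


def assign(points, centers):
    """Label each point with the position of its nearest center."""
    return [min(range(len(centers)),
                key=lambda c: math.dist(p, points[centers[c]]))
            for p in points]


def davies_bouldin(points, centers):
    """Davies-Bouldin index of the partition induced by the centers."""
    labels = assign(points, centers)
    k = len(centers)
    scatter = []
    for c in range(k):
        members = [points[i] for i, lab in enumerate(labels) if lab == c]
        if not members:  # only possible with duplicate coordinates
            return math.inf
        center = points[centers[c]]
        scatter.append(sum(math.dist(m, center) for m in members) / len(members))
    total = 0.0
    for i in range(k):
        worst = 0.0
        for j in range(k):
            if i == j:
                continue
            sep = math.dist(points[centers[i]], points[centers[j]])
            ratio = (scatter[i] + scatter[j]) / sep if sep else math.inf
            worst = max(worst, ratio)
        total += worst
    return total / k


def random_chromosome(n, k_max, rng):
    """Sorted list of 2..k_max distinct point indices."""
    return sorted(rng.sample(range(n), rng.randint(2, k_max)))


def crossover(a, b, k_max, rng):
    """Pool both parents' genes and draw a child of random valid length."""
    pool = sorted(set(a) | set(b))
    k = rng.randint(2, min(k_max, len(pool)))
    return sorted(rng.sample(pool, k))


def mutate(chrom, n, rng, rate=0.2):
    """With probability `rate`, swap one gene for an unused index."""
    unused = [i for i in range(n) if i not in chrom]
    if rng.random() >= rate or not unused:
        return chrom
    child = list(chrom)
    child[rng.randrange(len(child))] = rng.choice(unused)
    return sorted(child)


def evolve(points, k_max=4, pop_size=30, generations=40, seed=0):
    """Return the best chromosome found; assumes 2 <= k_max <= len(points)."""
    rng = random.Random(seed)
    n = len(points)

    def fitness(chrom):
        return davies_bouldin(points, chrom)  # lower is better

    pop = [random_chromosome(n, k_max, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[: pop_size // 2]  # keep the better half unchanged
        children = [mutate(crossover(*rng.sample(elite, 2), k_max, rng), n, rng)
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return min(pop, key=fitness)


# Three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
best = evolve(points, k_max=4)
```

Because chromosome length varies between 2 and `k_max`, the number of clusters is part of what the search optimizes. On very small data the Davies-Bouldin index can reward singleton clusters, which is why `k_max` is kept small in this example.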

In the next section we give a brief review of ontology-based semantic similarity and describe how we compute it in WordNet. In Section 3 a transformed LSI model is proposed for corpus-based text representation, which is then used in conjunction with the thesaurus-based method as a hybrid strategy for measuring document similarity. The details of the genetic algorithm for text clustering based on the ontology are described in Section 4. Experimental results are given in Section 5. Conclusions and future work are given in Section 6.

Section snippets

Ontology-based semantic similarity

Semantic similarity is a generic issue in a variety of application areas of Artificial Intelligence (AI) and Natural Language Processing (NLP). Similarity between two words is often represented by the similarity between the concepts related to the two words. A number of semantic similarity methods have been developed in the literature, and various methods have proven useful in specific applications (Hotho and Maedche, 2001, Rada et al., 1989). In general, the semantic similarity

LSI for semantic similarity calculation

Latent semantic indexing is an automatic method that uses singular value decomposition (SVD) to decompose the original term-by-document matrix into a set of k orthogonal factors (Bellegarda et al., 1996, Deerwester et al., 1990). In this semantic structure, we can find associative relationships even when two documents do not share any common words, because documents with similar contexts will have similar vectors in the semantic space.
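For reference, the standard (untransformed) LSI decomposition can be written as below. The symbols are assumptions introduced here: A is the t-by-d term-by-document matrix, k is the number of retained factors, and v_i is the i-th row of V_k written as a column vector. The transformed model proposed in the paper is not described in the text above and is not reproduced.

```latex
% Rank-k truncated SVD of the term-by-document matrix A (t terms, d documents)
A \;\approx\; A_k \;=\; U_k \,\Sigma_k\, V_k^{\top},
\qquad U_k \in \mathbb{R}^{t \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1,\dots,\sigma_k),\quad
V_k \in \mathbb{R}^{d \times k}

% Document i is represented by the k-dimensional vector \Sigma_k v_i,
% and two documents are compared by the cosine of their vectors
\operatorname{sim}(d_i, d_j)
  \;=\; \frac{(\Sigma_k v_i)^{\top} (\Sigma_k v_j)}
             {\lVert \Sigma_k v_i \rVert \,\lVert \Sigma_k v_j \rVert}
```

Because the comparison takes place in the k-dimensional factor space rather than the original term space, two documents with no terms in common can still receive a nonzero similarity.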

Genetic algorithm for document clustering

Genetic algorithms (GAs) are randomized search and optimization techniques guided by the principles of natural selection and heredity. They are efficient, adaptive, and robust search processes that can provide near-optimal solutions for the objective or fitness function of an optimization problem. However, traditional clustering algorithms assume that the number of clusters is fixed in advance. Here we attempt to automatically evolve the appropriate number of clusters as well as the fuzzy

Experiments results and analysis

In this section we apply our genetic algorithm for text clustering to the Reuters-21578 corpus, one of the most widely adopted benchmark datasets in text mining. For the current test, data set 1 with 200 documents from 4 topics (coffee 50, trade 50, crude 50, and sugar 50) and data set 2 with 600 documents from 6 topics (coffee 100, trade 100, crude 100, sugar 100, grain 100, ship 100) are selected. After being processed by word extraction, stop word removal, and

Conclusions and future works

In this article a modified genetic algorithm for document clustering based on ontology is proposed. The problem in the field of document clustering is that documents are represented solely as vectors of identical terms, while the conceptual similarity between each pair of documents is ignored. We take advantage of thesaurus-based and corpus-based semantic similarity measures to overcome this problem. Whereas, in general, the corpus-based method is rather difficult

Acknowledgements

This work was partially supported by the Korea Research Foundation Grant (KRF-2006-321-A00012) and partially supported by the program of the third stage of Brain Korea 21.

References (26)

  • U. Maulik et al. (2000). Genetic algorithm-based clustering technique. Pattern Recognition.
  • S. Bandyopadhyay et al. (2001). Nonparametric genetic clustering: Comparison of validity indices. IEEE Transactions on Systems, Man and Cybernetics-C. Applications and Reviews.
  • S. Bandyopadhyay et al. (2004). Multi-objective GAs, quantitative indices and pattern classification. IEEE Transactions on Systems, Man and Cybernetics-B.
  • Bellegarda, J., Butzberger, J., Chow, Y. (1996). A novel word clustering algorithm based on latent semantic analysis....
  • D. Davies et al. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • S. Deerwester et al. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science.
  • Francis, W., & Kucera, H. (1997). Brown corpus manual-revised and amplified. Department of Linguistics, Brown...
  • H. Frigui et al. (1999). A robust competitive clustering algorithm with application in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Hotho, A., & Maedche, A. (2001). Ontology-based text clustering. In Proceedings of the IJCAI workshop text learning:...
  • Hotho, A., & Stumme, G. (2002). Conceptual clustering of text clusters. In Proceedings of the FGML...
  • Hotho, A., Staab, S., & Stumme, G. (2003). Wordnet improves text document clustering. In Proceedings of the 26th annual...
  • Jiang, J., & Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of...
  • W. Koontz et al. (1975). A graph theoretic approach to nonparametric cluster analysis. IEEE Transactions on Computers.