Elsevier

Neural Networks

Volume 16, Issues 5–6, June–July 2003, Pages 771-778

2003 Special issue
On the quality of ART1 text clustering

https://doi.org/10.1016/S0893-6080(03)00088-1

Abstract

There is a large and continually growing quantity of electronic text available, which contains essential human and organizational knowledge. An important research endeavor is to study and develop better ways to access this knowledge. Text clustering is a popular approach to automatically organize textual document collections by topic, helping users find the information they need. Adaptive Resonance Theory (ART) neural networks possess several interesting properties that make them appealing for text clustering. Although ART has been used as a text clustering tool in several research works, the quality of the resulting document clusters has not yet been clearly established. In this paper, we present experimental results with binary ART that address this issue by determining how close its clustering quality is to an upper bound on clustering quality.

Introduction

We consider the application of clustering to the self-organization of a textual document collection. Clustering is the operation by which similar objects are grouped together in an unsupervised manner (Jain et al., 1999, Kaufman and Rousseeuw, 1990). Hence, when clustering textual documents, one hopes to form sets of documents with similar content. Instead of exploring the whole collection of documents, a user can then browse the resulting clusters to identify and retrieve relevant documents. As such, clustering provides a summarized view of the information space by grouping documents by topic. Clustering is often the only viable way to organize large text collections into topics: its advantage is realized when a training set and class definitions are unavailable, or when creating them is either cost-prohibitive due to the collection's sheer size or unrealistic due to the rapidly changing nature of the collection.

We specifically study text clustering with Adaptive Resonance Theory (ART) (Carpenter and Grossberg, 1995, Grossberg, 1976) neural networks. ART neural networks are known for their ability to perform on-line, incremental clustering of dynamic datasets. Unlike most other types of artificial neural networks, such as the popular Back-propagation Multi-Layer Perceptron (MLP) (Rumelhart, Hinton, & Williams, 1986), ART is unsupervised and allows for plastic yet stable learning. ART detects similarities among data objects, typically data points in an N-dimensional metric space. When novelty is detected, ART adaptively and autonomously creates a new category. Another advantageous and distinguishing feature of ART is its ability to discover patterns at various levels of generality. This is achieved by setting the value of a parameter known as vigilance, denoted ρ, with ρ∈(0,1]. ART's stability and plasticity properties, as well as its ability to process dynamic data efficiently, make it an attractive candidate for clustering large, rapidly changing text collections in real-life environments. Although ART has been investigated previously as a means of clustering text data, numerous variations in ART implementations, experimental data sets and quality evaluation methodologies have left it unclear whether ART performs well in this type of application. Since ART appears to be a logical and appealing solution to the rapidly growing amount of textual electronic information processed by organizations, it is important to eliminate the confusion surrounding the quality of the text clusters it produces. In this paper, we present experimental results with a binary ART neural network (ART1) that address this issue by determining how close the clustering quality achieved with ART is to an expected upper bound on clustering quality. We will consider other versions of ART in future work.
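To make the mechanism concrete, the following is a minimal fast-learning ART1 sketch in Python. It is an illustration under common simplifying assumptions (binary inputs, fast learning by prototype intersection), not the implementation used in the paper; the function and parameter names are ours.

```python
import numpy as np

def art1_cluster(docs, rho=0.5, beta=1.0):
    """Cluster binary document vectors with a simplified fast-learning ART1.

    docs : int array of shape (n_docs, n_features), entries in {0, 1}
    rho  : vigilance in (0, 1]; higher values yield more, tighter clusters
    beta : choice parameter (> 0), biasing ties toward larger prototypes
    """
    prototypes = []    # one binary prototype (template) per category
    assignments = []
    for x in docs:
        # Rank existing categories by the choice (activation) function.
        order = sorted(
            range(len(prototypes)),
            key=lambda j: -(np.sum(x & prototypes[j]) /
                            (beta + np.sum(prototypes[j]))))
        winner = None
        for j in order:
            # Vigilance test: fraction of the input matched by the prototype.
            match = np.sum(x & prototypes[j]) / max(np.sum(x), 1)
            if match >= rho:
                # Resonance: fast learning intersects prototype with input.
                prototypes[j] = x & prototypes[j]
                winner = j
                break
            # Otherwise reset and try the next-best category.
        if winner is None:
            # Novelty detected: create a new category from the input itself.
            prototypes.append(x.copy())
            winner = len(prototypes) - 1
        assignments.append(winner)
    return assignments, prototypes
```

Higher vigilance tightens the match test and produces more, narrower categories; lower vigilance yields fewer, broader ones, which is how ART discovers patterns at various levels of generality.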

Section snippets

Related work

We consider one of the many applications of text clustering in the field of Information Retrieval (IR) (van Rijsbergen, 1979), namely clustering that aims at self-organizing textual document collections. This application of text clustering can be seen as a form of classification by topics, making it the unsupervised counterpart to Text Categorization (TC) (Sebastiani, 2002). Text self-organization has become increasingly popular due to the availability of large document collections that

Experimental settings

We selected two well-established cluster quality evaluation measures: Jaccard (JAC) (Downton & Brennan, 1980) and Fowlkes–Mallows (FM) (Fowlkes & Mallows, 1983):

JAC = a / (a + b + c)

FM = a / √((a + b)(a + c))

where

  • a is the pair-wise number of true positives, i.e. the total number of document pairs grouped together in the expected solution and that are indeed clustered together by the clustering algorithm;

  • b is the pair-wise number of false positives, i.e. the number of document pairs not expected to be grouped together but that are nevertheless clustered together by the algorithm;

  • c is the pair-wise number of false negatives, i.e. the number of document pairs grouped together in the expected solution but that the algorithm does not cluster together (a computational sketch of both measures follows this list).
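Both measures derive from the same three pair-wise counts. The following short Python sketch (ours, for illustration; the label vectors are hypothetical) computes a, b and c by enumerating document pairs and then evaluates JAC and FM:

```python
from itertools import combinations

def pairwise_counts(expected, predicted):
    """Pair-wise true positives (a), false positives (b) and
    false negatives (c) between two flat clusterings, given as
    cluster-label lists aligned by document index."""
    a = b = c = 0
    for i, j in combinations(range(len(expected)), 2):
        same_true = expected[i] == expected[j]
        same_pred = predicted[i] == predicted[j]
        if same_true and same_pred:
            a += 1          # pair correctly clustered together
        elif same_pred:
            b += 1          # clustered together but not expected to be
        elif same_true:
            c += 1          # expected together but split apart
    return a, b, c

def jaccard(a, b, c):
    return a / (a + b + c)

def fowlkes_mallows(a, b, c):
    return a / ((a + b) * (a + c)) ** 0.5

# Hypothetical example: 4 documents, 2 expected topics.
a, b, c = pairwise_counts([0, 0, 1, 1], [0, 0, 0, 1])
print(jaccard(a, b, c), fowlkes_mallows(a, b, c))
```

Enumerating all pairs is O(n²) in the number of documents, which is acceptable for evaluation-sized collections.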

Experimental results

We eliminated words that appear in 10, 20, 40 and 60 or fewer documents. In the first case, a total of 2282 term features were retained, while in the last only 466 were. Our experiments indicated that less radical feature selection not only increased the number of features, and consequently the processing time, but also resulted in lower quality clusters in some cases (Fig. 1). Best quality is achieved at a vigilance value of 0.05, with 106 clusters, a number close to the expected number of topics
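The feature-selection step described above is plain document-frequency thresholding. A minimal sketch, assuming a binary document-term matrix and an illustrative min_df cut-off (the paper's own code is not given):

```python
import numpy as np

def prune_rare_terms(doc_term, min_df=10):
    """Drop terms that appear in min_df or fewer documents.

    doc_term : binary document-term matrix of shape (n_docs, n_terms)
    min_df   : document-frequency cut-off (e.g. 10, 20, 40 or 60)
    """
    df = np.count_nonzero(doc_term, axis=0)  # docs containing each term
    keep = df > min_df                       # retain terms above the cut-off
    return doc_term[:, keep], keep
```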

Conclusions and future work

Text clustering work conducted with ART up to now has used many different forms of ART-based architectures, as well as different, non-comparable text collections and evaluation methods. This situation has resulted in confusion as to the level of clustering quality achievable with ART. As a first step towards resolving this situation, we have tested a simple ART1 network implementation and evaluated its text clustering quality on the benchmark Reuters data set and with the standard F1 measure.

References (37)

  • E. Fowlkes et al., A method for comparing two hierarchical clusterings, Journal of the American Statistical Association (1983)
  • M. Georgiopoulos et al., Convergence properties of learning in ART1, Neural Computation (1990)
  • S. Grossberg, Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors, Biological Cybernetics (1976)
  • U. Heuser, W. Rosenstiel, Automatic construction of local internet directories using hierarchical... (2000)
  • A.K. Jain et al., Data clustering: a review, ACM Computing Surveys (1999)
  • L. Kaufman et al., Finding groups in data: An introduction to cluster analysis (1990)
  • T. Kohonen, Self-organizing maps, Springer Series in Information Sciences (2001)
  • T. Kohonen et al., Self organization of a massive document collection, IEEE Transactions on Neural Networks (2000)