Applying genetic algorithms to query optimization in document retrieval

https://doi.org/10.1016/S0306-4573(00)00008-X

Abstract

This paper proposes a novel approach to retrieving keywords automatically and then uses genetic algorithms to adapt the keyword weights. One contribution of the paper is to combine the bigram model (Chen, He, Xu, Gey & Meggs, 1997; Yang, Chang & Chen, 1993) with the PAT-tree structure (Chien, Huang & Chien, 1997) to retrieve keywords. The approach extracts bigrams from documents and uses them to construct a PAT tree from which keywords are retrieved. The proposed approach can retrieve keywords of any type, such as technical terms and personal names. Its effectiveness is demonstrated by comparing the keywords found by this approach with those found by the PAT-tree based approach; the comparison reveals that our keyword retrieval approach is as accurate as the PAT-tree based approach, yet faster and less memory-intensive. The study then applies genetic algorithms to tune the weights of the retrieved keywords. Moreover, several documents obtained from web sites are tested, and the experimental results are compared with those of other approaches, indicating that the proposed approach is highly promising for applications.

Introduction

Recently, people have had to deal with an increasing number of electronic documents on information networks, and finding the documents users need among all those available is an important issue. One approach to this problem is to categorize documents, as in a library, where books of the same class are shelved in their own bookcase. Traditionally, document categorization has been done by humans. This is problematic, however: different people may categorize the same documents differently, and even the same people may produce different results from one day to the next. The most natural solution today is to use a computer to help people categorize documents consistently, so that items of interest can be retrieved easily.

Many retrieval methods assume that a query and a document are related only if they share words. A match between documents in the same class that contain the same keywords is called a “surface-based match” (Yang & Chute, 1994). Yang and Chute (1994) presented a document retrieval method based on this idea. Their method learns the keyword-category association from documents that are already categorized, using a Linear Least Squares Fit (LLSF) technique to estimate the association. The algorithm has a time complexity of O(m²n), where m denotes the number of pairs in the training set and n the number of distinct words in the source space (Yang & Chute, 1994). Because the LLSF algorithm relies heavily on matrix operations, its computational cost is large.

The document retrieval approach of Liddy, Paik and Yu (1994) attempts to parse the contents of a document. The method uses the “Subject Field Codes” (SFCs) from Longman’s Dictionary of Contemporary English. Because the SFCs provide many keywords from Longman’s Dictionary that are already classified, the method easily categorizes documents into appropriate classes by parsing their contents. However, this approach is limited in that the keywords must be classified by humans, which is a laborious task.

Most document retrieval systems use keywords to retrieve documents. Such systems first extract keywords from documents and then assign weights to them using various approaches. A system of this kind faces two major problems: how to extract keywords precisely (Baeza-Yates, 1992; Chen, He, Xu, Gey & Meggs, 1997; Chien, Huang & Chien, 1997; He et al., 1996; Kwok, 1997; Nie, Brisebois & Ren, 1996; Zhai, Tong, Milic-Frayling & Evans, 1996), and how to decide the weight of each keyword (Gordon, 1988; Lewis et al., 1996). This paper addresses both problems.

Retrieving keywords from Chinese documents is especially difficult since Chinese sentences lack explicit word boundaries. Chien et al. (1997) used an automatic statistics-based approach that efficiently extracts significant lexical patterns from a set of relevant documents. This approach uses a data structure called the PAT tree (Gonnet, Baeza-Yates & Snider, 1992) to index full-text documents: all possible strings of the documents are inserted into a PAT tree, which makes searching for words fast and easy. Retrieving keywords from the tree, however, is time-consuming.
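The first stage of such a pipeline, collecting overlapping character bigrams as candidate keywords, can be sketched in a few lines. This is a minimal illustration, not the paper's code: the PAT-tree filtering step is omitted, and the toy documents and the recurrence threshold are invented for the example.

```python
from collections import Counter

def extract_bigrams(text):
    # Overlapping character bigrams: the candidate units collected
    # before any tree-based filtering.
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Toy documents (illustrative only).
docs = ["信息检索系统", "检索系统设计"]

counts = Counter()
for d in docs:
    counts.update(extract_bigrams(d))

# Bigrams recurring across the collection are kept as candidate keywords;
# the threshold of 2 is an arbitrary illustrative choice.
candidates = [b for b, c in counts.items() if c >= 2]
```

In a full system the candidate bigrams would then be inserted into a PAT tree and filtered statistically, as in Chien et al. (1997).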

Nie et al. (1996) proposed a hybrid segmentation approach which combines several commonly used approaches, including statistical and dictionary approaches. In the hybrid approach, each word for which statistical information is available is used with priority, while the others are stored in the dictionary and assigned a default probability. The segmentation whose words' probabilities have the highest product is taken as the best solution (Nie et al., 1996). The hybrid approach finds exact keywords with the dictionary approach and extracts new keywords with the statistical approach. To be put into practice, however, the approach needs resources such as a dictionary and a set of heuristic rules.
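The core of such a hybrid segmenter, choosing the segmentation whose word probabilities have the highest product, can be sketched with dynamic programming over log-probabilities. The lexicon, probability values, and default below are invented for illustration and are not from Nie et al.'s system.

```python
import math

# Toy lexicon: known words carry statistical probabilities; out-of-lexicon
# single characters fall back to a small default (all values invented).
probs = {"信息": 0.4, "检索": 0.3, "系统": 0.3, "信": 0.01, "息": 0.01}
DEFAULT = 1e-4  # default probability for unknown single characters

def best_segmentation(text, max_len=2):
    # best[i] = (max log-probability of segmenting text[:i], split point)
    best = [(-math.inf, None)] * (len(text) + 1)
    best[0] = (0.0, None)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            p = probs.get(word, DEFAULT if len(word) == 1 else None)
            if p is None:
                continue  # unknown multi-character string: not a word
            score = best[j][0] + math.log(p)
            if score > best[i][0]:
                best[i] = (score, j)
    # Trace the best split points back into a word list.
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))
```

Maximizing the sum of log-probabilities is equivalent to maximizing the product of the probabilities, but avoids numerical underflow.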

As for deciding the weight of each term, the simplest way is to take as the weight the frequency with which the term occurs (TF) in the documents. Longer documents, however, tend to yield higher term frequencies, so a normalization typically used in weighting algorithms compensates for the number of words an item has. Buckley, Singhal and Mitra (1995) presented the following pivoted term frequency weighting formula:

((1 + log(TF)) / (1 + log(average(TF)))) / ((1 − slope) × pivot + slope × number-of-unique-terms),

where slope was set to 0.2 and pivot was set to the average number of unique terms over the documents. Jones (1972) presented the following Inverse Document Frequency (IDF) measure:

IDF_i = log2(N / n_i) + 1,

where N is the number of documents and n_i is the number of documents containing term i. Several methods combine TF with the IDF measure. The approach proposed by Salton and Buckley (1988) is given below:

w_ij = (0.5 + 0.5 × freq_ij / maxfreq_j) × IDF_i,

where freq_ij is the frequency of term i in document j and maxfreq_j is the maximum frequency of any term in document j. Lochbaum and Streeter (1989) presented the following entropy measure in several experiments:

entropy_i = 1 − [Σ_{k=1}^{N} (Freq_ik / TFreq_i) × log2(TFreq_i / Freq_ik)] / log2(N),

where N is the number of documents, Freq_ik is the frequency of term i in document k, and TFreq_i is the total frequency of term i.
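The IDF and Salton-Buckley weights can be computed directly. The sketch below assumes the formulas as reconstructed above; the toy corpus is an invented illustration.

```python
import math

# Toy corpus: each document as a bag of raw term frequencies (illustrative).
docs = [
    {"query": 3, "optimization": 1},
    {"genetic": 2, "algorithm": 2, "query": 1},
    {"retrieval": 4},
]
N = len(docs)

def idf(term):
    # IDF_i = log2(N / n_i) + 1, with n_i the number of documents
    # containing the term.
    n_i = sum(1 for d in docs if term in d)
    return math.log2(N / n_i) + 1

def weight(term, doc):
    # Salton & Buckley: w_ij = (0.5 + 0.5 * freq_ij / maxfreq_j) * IDF_i.
    maxfreq = max(doc.values())
    tf = 0.5 + 0.5 * doc.get(term, 0) / maxfreq
    return tf * idf(term)
```

Note how the augmented TF component is bounded in [0.5, 1], so a term that dominates its document cannot outweigh its rarity across the collection.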

The measures above are used in traditional IR; approaches such as TF and IDF compute fixed weights for terms. Yang and Korfhage (1993) were the first to use genetic algorithms for query optimization in information retrieval. Their work emphasizes only term weight modification and does not expand queries. This paper presents an alternative approach to finding the keywords of documents and then applies a genetic approach to adapt their weights.

The remainder of the paper is organized as follows. Section 2 presents our system framework. Section 3 combines the bigram model (Chen et al., 1997; Yang, Chang & Chen, 1993) and the PAT-tree based filter algorithm (Chien et al., 1997) to retrieve keywords. Section 4 presents a training algorithm that uses genetic algorithms (Chang & Hsu, 1997, 1999; Goldberg, 1989; Yang & Korfhage, 1993) to adapt the weights of keywords. Section 5 presents a relevance feedback mechanism. Section 6 tests the performance of our approach on several kinds of Chinese documents. Section 7 discusses the results, and conclusions are drawn in Section 8.

Section snippets

System framework

The proposed system is based on a vector space model (Salton & McGill, 1983) in which both documents and queries are represented as vectors. The components of each vector are keywords extracted from documents or queries. The system ranks the documents according to their degree of similarity to the query vector: the higher the similarity value, the closer the document is to the query vector. If the similarity value is sufficiently high, the …
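The ranking described above can be illustrated with a minimal sketch, assuming simple {keyword: weight} vectors and the standard cosine similarity; the threshold parameter is an illustrative stand-in for the paper's cut-off.

```python
import math

def cosine(q, d):
    # Cosine similarity between two {keyword: weight} vectors.
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def rank(query, documents, threshold=0.0):
    # Rank documents by similarity to the query vector, keeping only
    # those whose similarity exceeds the (illustrative) threshold.
    scored = [(cosine(query, d), d) for d in documents]
    scored = [sd for sd in scored if sd[0] > threshold]
    return sorted(scored, key=lambda sd: sd[0], reverse=True)
```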

Keyword extraction

This section presents an approach to retrieving Chinese keywords. Automatic keyword extraction is difficult in Chinese documents because the Chinese language has no explicit word boundary markers. The easiest solution is to look words up in a dictionary: if a string appears in the dictionary, it is taken to be a correct lexicon and confirmed as a keyword. One popular dictionary-based word segmentation approach is the maximum matching method. In such an approach, keyword …
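The maximum matching method mentioned above can be sketched as follows. The toy lexicon is an assumption for illustration; real systems use large dictionaries.

```python
# Toy dictionary (illustrative only).
lexicon = {"信息", "检索", "信息检索", "系统"}

def max_match(text, max_len=4):
    # Forward maximum matching: at each position take the longest
    # dictionary word; fall back to a single character.
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in lexicon or length == 1:
                words.append(text[i:i + length])
                i += length
                break
    return words
```

The greedy longest-first choice is what makes this a "maximum" matching method: shorter dictionary words are tried only after longer ones fail.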

Our genetic approach

Once significant keywords are extracted from training data including relevant and irrelevant documents, weights are assigned to the keywords. The weights of the keywords are formed as a query vector.

This section presents a novel genetic algorithm to tune the weights of keywords, with the aim of producing an optimal or near-optimal query vector. The query vector is encoded as a chromosome, and real numbers (the keyword weights) are used as genes. The first step is to generate a random …
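A hedged sketch of such a genetic loop with real-valued genes is given below. The fitness function, generic two-point crossover, and random-replacement mutation are illustrative stand-ins on invented data; the paper's own weight-selection crossover and natural mutation operators differ.

```python
import random

random.seed(0)  # reproducible illustration

# Illustrative stand-in fitness: how well a candidate weight vector
# separates one relevant from one irrelevant document (invented data).
relevant = [1.0, 0.8, 0.0, 0.2]
irrelevant = [0.1, 0.0, 0.9, 0.7]

def fitness(w):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return dot(w, relevant) - dot(w, irrelevant)

def crossover(p1, p2):
    # Generic two-point crossover over real-valued genes (keyword weights).
    a, b = sorted(random.sample(range(len(p1)), 2))
    return p1[:a] + p2[a:b] + p1[b:]

def mutate(w, rate=0.1):
    # Random-replacement mutation: overwrite a gene with a fresh weight.
    return [random.random() if random.random() < rate else g for g in w]

def evolve(pop_size=20, genes=4, generations=50):
    pop = [[random.random() for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]  # elitist selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)
```

Keeping the best half of each generation (elitism) guarantees the best fitness never decreases across generations.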

Relevance feedback

Relevance feedback is an important mechanism in information retrieval (Allan, 1996; Buckley & Salton, 1995; Harman, 1992; Lundquist et al., 1997). Relevance feedback approaches add a certain value to, or subtract it from, the weight of each retrieved keyword, regulating the weights step by step to improve retrieval performance (Harman, 1992).
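The add/subtract adjustment described above can be illustrated with the standard Rocchio update, one common instance of relevance feedback (not necessarily the paper's exact rule); the default coefficients are conventional values, not the paper's.

```python
def feedback(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    # Rocchio update over {keyword: weight} vectors:
    # q' = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant)
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)

    def mean(docs, t):
        return sum(d.get(t, 0.0) for d in docs) / len(docs) if docs else 0.0

    new = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * mean(relevant, t)
             - gamma * mean(nonrelevant, t))
        new[t] = max(w, 0.0)  # keep weights non-negative
    return new
```

Terms that appear only in relevant documents gain weight, while terms that appear only in non-relevant documents are pushed toward zero.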

Many different techniques (Lundquist et al., 1997) have been used to improve the results obtained from relevance feedback. Some methods are given …

Experimental results

The experiment uses documents from the web site http://vita.fju.edu.tw/, which collects categorized documents including Social Tender Sentiments, The Love of Univ., The Spring of Education, The Contribution of Medical Treatment, and so on. The daily news articles are from http://tw.yahoo.com/headlines and are grouped manually into eight sections, namely business, China, entertainment, international news, life, politics, society, and sport. The documents were obtained …

Discussion

In Experiment 3, our genetic algorithm performs better than the Y.K. algorithm (Yang & Korfhage, 1993). The Y.K. algorithm uses two-point crossover and random mutation operators. Its two-point crossover operator is the same as our weight-selection crossover, and its random mutation operator selects a random real number to replace the value of a gene of a chromosome. However, our proposed natural crossover and mutation operators can guide the chromosome to a better solution, and …

Conclusions

This paper proposes a novel approach to retrieving keywords automatically and then uses genetic algorithms to adapt their weights. The advantage of this approach is that it does not need a dictionary, and it can retrieve keywords of any type, including technical terms and people's names. The precision of document retrieval with this approach equals that of the PAT-tree based approach, while the approach outlined in this paper requires less time and memory than the …

Acknowledgements

A grant from the National Science Council, NSC 89-2213-E-008-006, partially supported this research. We would also like to thank the referees for their helpful comments and suggestions.

References (33)

  • U. Manber et al.

    An algorithm for string matching with a sequence of don’t cares

    Information Processing Letters

    (1991)
  • G. Salton et al.

    Term-weighting approaches in automatic text retrieval

    Information Processing and Management

    (1988)
  • D.M. Tate et al.

A genetic approach to the quadratic assignment problem

    Computers and Operations Research

    (1995)
  • J. Allan

    Incremental relevance feedback for information filtering

  • R.A. Baeza-Yates

    Text retrieval: theory and practice

  • C. Buckley et al.

    Automatic query expansion using SMART: TREC 3

  • C. Buckley et al.

    Optimization of relevance feedback weights

  • C. Buckley et al.

    New retrieval approaches using SMART: TREC 4

  • Chang, C.-H., & Hsu, C.-C. (1997). Information searching and exploring agent applying clustering and genetic algorithm....
  • Chang, C.-H., & Hsu, C.-C. (1999). The design of an information system for hypertext retrieval and automatic discovery...
  • Chen, A., He, J., Xu, L., Gey, F. C., & Meggs, J. (1997). Chinese text retrieval without using a dictionary. ACM...
  • Chien, L.-F., Huang, T.-I., & Chien, M.-C. (1997). Pat-tree-based keyword extraction for Chinese information retrieval....
  • Fung, P., & Wu, D. (1994). Statistical augmentation of a Chinese machine-readable dictionary. In Second Annual Workshop...
  • D.E. Goldberg

    Genetic algorithms in search, optimization and machine learning

    (1989)
  • G.H. Gonnet et al.

    New indices for text PAT trees and PAT arrays

  • M.D. Gordon

    Probabilistic and genetic algorithms in document retrieval

    Communications of the ACM

    (1988)