Applying genetic algorithms to query optimization in document retrieval

https://doi.org/10.1016/S0306-4573(00)00008-X

Abstract

This paper proposes a novel approach to retrieving keywords automatically and then uses genetic algorithms to adapt the keyword weights. One contribution of the paper is to combine the bigram model (Chen, He, Xu, Gey & Meggs, 1997; Yang, Chang & Chen, 1993) with the PAT-tree structure (Chien, Huang & Chien, 1997) to retrieve keywords. The approach extracts bigrams from documents and uses them to construct a PAT tree from which keywords are retrieved. The proposed approach can retrieve keywords of any type, such as technical terms and personal names. Its effectiveness is demonstrated by comparing the keywords found by this approach with those found by the PAT-tree based approach; the comparison reveals that our keyword retrieval approach is as accurate as the PAT-tree based approach, yet faster and less memory-intensive. The study then applies genetic algorithms to tune the weights of the retrieved keywords. Moreover, several documents obtained from web sites are tested, and the experimental results are compared with those of other approaches, indicating that the proposed approach is highly promising for applications.

Introduction

Recently, people have had to deal with an increasing number of electronic documents on information networks, and finding the documents users need among all those available is an important issue. One approach to this problem is to categorize documents, as in a library, where books of the same class are shelved in their own bookcase. Traditionally, document categorization has been done by humans. This is problematic, however: different people may categorize the same documents differently, and even the same people may produce different results from one day to the next. The most natural solution today is to use a computer to help people categorize documents consistently, so that items of interest can be retrieved easily.

Many retrieval methods assume that a query and a document are related only if they share words. A match between documents in the same class that contain the same keywords is called a “surface-based match” (Yang & Chute, 1994). Yang and Chute (1994) presented a document retrieval method based on this idea. Their method learns the keyword-category association from documents that are already categorized, using a Linear Least Squares Fit (LLSF) technique to estimate the association. The algorithm has a time complexity of O(m²n), where m denotes the number of pairs in the training set and n the number of distinct words in the source space (Yang & Chute, 1994). Because the LLSF algorithm relies heavily on matrix operations, its computational cost is large.

The document retrieval approach of Liddy, Paik and Yu (1994) attempts to parse the contents of a document. The method uses the “Subject Field Codes” (SFCs) from Longman’s Dictionary of Contemporary English. Because the SFCs provide many keywords from Longman’s Dictionary that are already classified, the method easily categorizes documents into appropriate classes by parsing their contents. However, this approach is limited in that the keywords must be classified by humans, which is a laborious task.

Most document retrieval systems use keywords to retrieve documents. Such systems first extract keywords from documents and then assign weights to them using various approaches. A system of this kind faces two major problems: how to extract keywords precisely (Baeza-Yates, 1992; Chen, He, Xu, Gey & Meggs, 1997; Chien, Huang & Chien, 1997; He et al., 1996; Kwok, 1997; Nie, Brisebois & Ren, 1996; Zhai, Tong, Milic-Frayling & Evans, 1996), and how to decide the weight of each keyword (Gordon, 1988; Lewis et al., 1996). This paper addresses both problems.

Retrieving keywords from Chinese documents is especially difficult since Chinese sentences lack explicit word boundaries. Chien et al. (1997) used an automatic statistics-based approach that efficiently extracts significant lexical patterns from a set of relevant documents. This approach uses a data structure called the PAT tree (Gonnet, Baeza-Yates & Snider, 1992) to index full-text documents: all possible strings of the documents are inserted into a PAT tree, which makes searching for words fast and easy. Retrieving keywords from the tree, however, is time-consuming.
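The first stage of such a pipeline, collecting overlapping character bigrams as candidate keywords, can be sketched in a few lines. This is a minimal illustration, not the paper's code: the PAT-tree filtering step is omitted, and the toy documents and the recurrence threshold are invented for the example.

```python
from collections import Counter

def extract_bigrams(text):
    # Overlapping character bigrams: the candidate units collected
    # before any tree-based filtering.
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Toy documents (illustrative only).
docs = ["信息检索系统", "检索系统设计"]

counts = Counter()
for d in docs:
    counts.update(extract_bigrams(d))

# Bigrams recurring across the collection are kept as candidate keywords;
# the threshold of 2 is an arbitrary illustrative choice.
candidates = [b for b, c in counts.items() if c >= 2]
```

In a full system the candidate bigrams would then be inserted into a PAT tree and filtered statistically, as in Chien et al. (1997).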

Nie et al. (1996) proposed a hybrid segmentation approach which combines several commonly used approaches, including statistical and dictionary approaches. In the hybrid approach, each word for which statistical information is available is used with priority, while the others are stored in the dictionary and assigned a default probability. The segmentation whose words' probabilities have the highest product is taken as the best solution (Nie et al., 1996). The hybrid approach finds exact keywords with the dictionary approach and extracts new keywords with the statistical approach. To be put into practice, however, the approach needs resources such as a dictionary and a set of heuristic rules.
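The core of such a hybrid segmenter, choosing the segmentation whose word probabilities have the highest product, can be sketched with dynamic programming over log-probabilities. The lexicon, probability values, and default below are invented for illustration and are not from Nie et al.'s system.

```python
import math

# Toy lexicon: known words carry statistical probabilities; out-of-lexicon
# single characters fall back to a small default (all values invented).
probs = {"信息": 0.4, "检索": 0.3, "系统": 0.3, "信": 0.01, "息": 0.01}
DEFAULT = 1e-4  # default probability for unknown single characters

def best_segmentation(text, max_len=2):
    # best[i] = (max log-probability of segmenting text[:i], split point)
    best = [(-math.inf, None)] * (len(text) + 1)
    best[0] = (0.0, None)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            p = probs.get(word, DEFAULT if len(word) == 1 else None)
            if p is None:
                continue  # unknown multi-character string: not a word
            score = best[j][0] + math.log(p)
            if score > best[i][0]:
                best[i] = (score, j)
    # Trace the best split points back into a word list.
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))
```

Maximizing the sum of log-probabilities is equivalent to maximizing the product of the probabilities, but avoids numerical underflow.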

As for deciding the weight of each term, the simplest way is to take as the weight the frequency with which the term occurs (TF) in the documents. Longer documents, however, tend to yield higher term frequencies, so a normalization typically used in weighting algorithms compensates for the number of words an item has. Buckley, Singhal and Mitra (1995) presented the following pivoted term frequency weighting formula:

((1 + log(TF)) / (1 + log(average(TF)))) / ((1 − slope) × pivot + slope × number-of-unique-terms),

where slope was set to 0.2 and pivot was set to the average number of unique terms over the documents. Jones (1972) presented the following Inverse Document Frequency (IDF) measure:

IDF_i = log2(N / n_i) + 1,

where N is the number of documents and n_i is the number of documents containing term i. Several methods combine TF with the IDF measure. The approach proposed by Salton and Buckley (1988) is given below:

w_ij = (0.5 + 0.5 × freq_ij / maxfreq_j) × IDF_i,

where freq_ij is the frequency of term i in document j and maxfreq_j is the maximum frequency of any term in document j. Lochbaum and Streeter (1989) presented the following entropy measure in several experiments:

entropy_i = 1 − [Σ_{k=1}^{N} (Freq_ik / TFreq_i) × log2(TFreq_i / Freq_ik)] / log2(N),

where N is the number of documents, Freq_ik is the frequency of term i in document k, and TFreq_i is the total frequency of term i.
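The IDF and Salton-Buckley weights can be computed directly. The sketch below assumes the formulas as reconstructed above; the toy corpus is an invented illustration.

```python
import math

# Toy corpus: each document as a bag of raw term frequencies (illustrative).
docs = [
    {"query": 3, "optimization": 1},
    {"genetic": 2, "algorithm": 2, "query": 1},
    {"retrieval": 4},
]
N = len(docs)

def idf(term):
    # IDF_i = log2(N / n_i) + 1, with n_i the number of documents
    # containing the term.
    n_i = sum(1 for d in docs if term in d)
    return math.log2(N / n_i) + 1

def weight(term, doc):
    # Salton & Buckley: w_ij = (0.5 + 0.5 * freq_ij / maxfreq_j) * IDF_i.
    maxfreq = max(doc.values())
    tf = 0.5 + 0.5 * doc.get(term, 0) / maxfreq
    return tf * idf(term)
```

Note how the augmented TF component is bounded in [0.5, 1], so a term that dominates its document cannot outweigh its rarity across the collection.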

The measures above are used in traditional IR; approaches such as TF and IDF compute fixed weights for terms. Yang and Korfhage (1993) were the first to use genetic algorithms for query optimization in information retrieval. Their work emphasizes only term weight modification and does not expand queries. This paper presents an alternative approach to finding the keywords of documents and then applies a genetic approach to adapt their weights.

The remainder of the paper is organized as follows. Section 2 presents our system framework. Section 3 combines the bigram model (Chen et al., 1997; Yang, Chang & Chen, 1993) and the PAT-tree based filter algorithm (Chien et al., 1997) to retrieve keywords. Section 4 presents a training algorithm that uses genetic algorithms (Chang & Hsu, 1997, 1999; Goldberg, 1989; Yang & Korfhage, 1993) to adapt the weights of keywords. Section 5 presents a relevance feedback mechanism. Section 6 tests the performance of our approach on several kinds of Chinese documents. Section 7 discusses the results, and conclusions are drawn in Section 8.

Section snippets

System framework

The proposed system is based on a vector space model (Salton & McGill, 1983) in which both documents and queries are represented as vectors. The components of each vector are keywords extracted from documents or queries. The system ranks the documents according to their degree of similarity to the query vector: the higher the similarity value, the closer the document is to the query vector. If the similarity value is sufficiently high, the …
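The ranking described above can be illustrated with a minimal sketch, assuming simple {keyword: weight} vectors and the standard cosine similarity; the threshold parameter is an illustrative stand-in for the paper's cut-off.

```python
import math

def cosine(q, d):
    # Cosine similarity between two {keyword: weight} vectors.
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def rank(query, documents, threshold=0.0):
    # Rank documents by similarity to the query vector, keeping only
    # those whose similarity exceeds the (illustrative) threshold.
    scored = [(cosine(query, d), d) for d in documents]
    scored = [sd for sd in scored if sd[0] > threshold]
    return sorted(scored, key=lambda sd: sd[0], reverse=True)
```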

Keyword extraction

This section presents an approach to retrieving Chinese keywords. Automatic keyword extraction is difficult in Chinese documents because the Chinese language has no explicit word boundary markers. The easiest solution is to look words up in a dictionary: if a string appears in the dictionary, it is taken to be a correct lexicon and confirmed as a keyword. One popular dictionary-based word segmentation approach is the maximum matching method. In such an approach, keyword …
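The maximum matching method mentioned above can be sketched as follows. The toy lexicon is an assumption for illustration; real systems use large dictionaries.

```python
# Toy dictionary (illustrative only).
lexicon = {"信息", "检索", "信息检索", "系统"}

def max_match(text, max_len=4):
    # Forward maximum matching: at each position take the longest
    # dictionary word; fall back to a single character.
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in lexicon or length == 1:
                words.append(text[i:i + length])
                i += length
                break
    return words
```

The greedy longest-first choice is what makes this a "maximum" matching method: shorter dictionary words are tried only after longer ones fail.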

Our genetic approach

Once significant keywords are extracted from training data including relevant and irrelevant documents, weights are assigned to the keywords. The weights of the keywords are formed as a query vector.

This section presents a novel genetic algorithm to tune the weights of keywords, with the aim of producing an optimal or near-optimal query vector. The query vector is encoded as a chromosome, and real numbers (the keyword weights) are used as genes. The first step is to generate a random …
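A hedged sketch of such a genetic loop with real-valued genes is given below. The fitness function, generic two-point crossover, and random-replacement mutation are illustrative stand-ins on invented data; the paper's own weight-selection crossover and natural mutation operators differ.

```python
import random

random.seed(0)  # reproducible illustration

# Illustrative stand-in fitness: how well a candidate weight vector
# separates one relevant from one irrelevant document (invented data).
relevant = [1.0, 0.8, 0.0, 0.2]
irrelevant = [0.1, 0.0, 0.9, 0.7]

def fitness(w):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return dot(w, relevant) - dot(w, irrelevant)

def crossover(p1, p2):
    # Generic two-point crossover over real-valued genes (keyword weights).
    a, b = sorted(random.sample(range(len(p1)), 2))
    return p1[:a] + p2[a:b] + p1[b:]

def mutate(w, rate=0.1):
    # Random-replacement mutation: overwrite a gene with a fresh weight.
    return [random.random() if random.random() < rate else g for g in w]

def evolve(pop_size=20, genes=4, generations=50):
    pop = [[random.random() for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]  # elitist selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)
```

Keeping the best half of each generation (elitism) guarantees the best fitness never decreases across generations.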

Relevance feedback

Relevance feedback is an important mechanism in information retrieval (Allan, 1996; Buckley & Salton, 1995; Harman, 1992; Lundquist et al., 1997). Relevance feedback approaches add a certain value to, or subtract it from, the weight of each retrieved keyword, regulating the weights step by step to improve retrieval performance (Harman, 1992).
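The add/subtract adjustment described above can be illustrated with the standard Rocchio update, one common instance of relevance feedback (not necessarily the paper's exact rule); the default coefficients are conventional values, not the paper's.

```python
def feedback(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    # Rocchio update over {keyword: weight} vectors:
    # q' = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant)
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)

    def mean(docs, t):
        return sum(d.get(t, 0.0) for d in docs) / len(docs) if docs else 0.0

    new = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * mean(relevant, t)
             - gamma * mean(nonrelevant, t))
        new[t] = max(w, 0.0)  # keep weights non-negative
    return new
```

Terms that appear only in relevant documents gain weight, while terms that appear only in non-relevant documents are pushed toward zero.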

Many different techniques (Lundquist et al., 1997) have been used to improve the results obtained from relevance feedback. Some methods are given …

Experimental results

The experiment uses documents from the web site http://vita.fju.edu.tw/, which collects categorized documents including Social Tender Sentiments, The Love of Univ., The Spring of Education, The Contribution of Medical Treatment, and so on. The daily news articles are from http://tw.yahoo.com/headlines and are grouped manually into eight sections, namely business, China, entertainment, international news, life, politics, society, and sport. The documents were obtained …

Discussion

In Experiment 3, our genetic algorithm performs better than the Y.K. algorithm (Yang & Korfhage, 1993). The Y.K. algorithm uses two-point crossover and random mutation operators. Its two-point crossover operator is the same as our weight-selection crossover, and its random mutation operator selects a random real number to replace the value of a gene of a chromosome. However, our proposed natural crossover and mutation operators can guide the chromosome to a better solution, and …

Conclusions

This paper proposes a novel approach to retrieving keywords automatically and then uses genetic algorithms to adapt their weights. The advantage of this approach is that it does not need a dictionary, and it can retrieve keywords of any type, including technical terms and people's names. The precision of document retrieval with this approach equals that of the PAT-tree based approach, while the approach outlined in this paper requires less time and memory than the …

Acknowledgements

A grant from the National Science Council, NSC 89-2213-E-008-006, partially supported this research. We would also like to thank the referees for their helpful comments and suggestions.

References (33)

  • U. Manber et al.

    An algorithm for string matching with a sequence of don’t cares

    Information Processing Letters

    (1991)
  • G. Salton et al.

    Term-weighting approaches in automatic text retrieval

    Information Processing and Management

    (1988)
  • D.M. Tate et al.

A genetic approach to the quadratic assignment problem

    Computers and Operations Research

    (1995)
  • J. Allan

    Incremental relevance feedback for information filtering

  • R.A. Baeza-Yates

    Text retrieval: theory and practice

  • C. Buckley et al.

    Automatic query expansion using SMART: TREC 3

  • C. Buckley et al.

    Optimization of relevance feedback weights

  • C. Buckley et al.

    New retrieval approaches using SMART: TREC 4

  • Chang, C.-H., & Hsu, C.-C. (1997). Information searching and exploring agent applying clustering and genetic algorithm....
  • Chang, C.-H., & Hsu, C.-C. (1999). The design of an information system for hypertext retrieval and automatic discovery...
  • Chen, A., He, J., Xu, L., Gey, F. C., & Meggs, J. (1997). Chinese text retrieval without using a dictionary. ACM...
  • Chien, L.-F., Huang, T.-I., & Chien, M.-C. (1997). Pat-tree-based keyword extraction for Chinese information retrieval....
  • Fung, P., & Wu, D. (1994). Statistical augmentation of a Chinese machine-readable dictionary. In Second Annual Workshop...
  • D.E. Goldberg

    Genetic algorithms in search, optimization and machine learning

    (1989)
  • G.H. Gonnet et al.

    New indices for text PAT trees and PAT arrays

  • M.D. Gordon

    Probabilistic and genetic algorithms in document retrieval

    Communications of the ACM

    (1988)