Applying genetic algorithms to query optimization in document retrieval
Introduction
The number of electronic documents in information networks is growing rapidly, and finding the documents that users need among all those available has become an important problem. One approach to solving this problem is to categorize documents, much as a library shelves books of the same class on the same bookcase. Traditionally, document categorization has been done by humans. However, manual categorization is inconsistent: different people may categorize the same documents differently, and the same person may categorize them differently on different days. The most natural solution today is to use a computer to help people categorize documents consistently, so that items of interest can be retrieved easily.
Many retrieval methods assume that a query and a document are related only if they contain shared words. A query and a document that contain the same keywords are said to form a "surface-based match" (Yang & Chute, 1994). Yang and Chute (1994) presented a document retrieval method based on this idea. Their method learns the keyword-category association from documents that have already been categorized, using a Linear Least Squares Fit (LLSF) technique to estimate the association. The algorithm has a time complexity of O(m^2 n), where m denotes the number of pairs in the training set and n represents the number of distinct words in the source space (Yang & Chute, 1994). Because the LLSF algorithm relies heavily on matrix operations, its computational cost is high.
The document retrieval approach of Liddy, Paik and Yu (1994) parses the contents of a document. This method uses the "Subject Field Codes" (SFCs) from Longman's Dictionary of Contemporary English. Because the SFC resource contains many keywords from the dictionary that are already classified, it can easily categorize documents into appropriate classes by parsing their contents. However, this approach is limited in that the keywords must be classified by humans, which is a difficult task.
Most document retrieval systems use keywords to retrieve documents. The systems first extract keywords from documents and then assign weights to the keywords using different approaches. Such a system has two major problems: how to extract keywords precisely (Baeza-Yates, 1992; Chen, He, Xu, Gey & Meggs, 1997; Chien, Huang & Chien, 1997; He et al., 1996; Kwok, 1997; Nie, Brisebois & Ren, 1996; Zhai, Tong, Milic-Frayling & Evans, 1996) and how to decide the weight of each keyword (Gordon, 1988; Lewis et al., 1996). This paper addresses both problems.
Retrieving keywords from Chinese documents is especially difficult since Chinese sentences lack explicit word boundaries. Chien et al. (1997) used an automatic statistics-based approach that efficiently extracts significant lexical patterns from a set of relevant documents. The approach uses a data structure called a PAT tree (Gonnet, Baeza-Yates & Snider, 1992) to index full-text documents: all possible strings of the documents are inserted into the PAT tree, making it a fast and convenient means of searching for words. However, retrieving keywords in this way requires much time.
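The exhaustive substring indexing that a PAT tree supports can be approximated with a plain substring-frequency counter. The sketch below is illustrative only: a real PAT tree is a compressed binary suffix trie, and the length and frequency thresholds here are arbitrary assumptions.

```python
from collections import Counter

def frequent_substrings(documents, max_len=4, min_freq=2):
    """Count every substring of up to max_len characters, standing in
    for the exhaustive string insertion that a PAT tree indexes.
    Substrings recurring across the collection are keyword candidates."""
    counts = Counter()
    for doc in documents:
        for i in range(len(doc)):
            for j in range(i + 1, min(i + max_len, len(doc)) + 1):
                counts[doc[i:j]] += 1
    return {s: c for s, c in counts.items() if c >= min_freq}
```

On a toy corpus such as `["abcab", "bcd"]`, the recurring substrings "ab" and "bc" survive the frequency threshold while one-off strings are filtered out; the time cost of enumerating all substrings is exactly the inefficiency noted above.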
Nie et al. (1996) proposed a hybrid segmentation approach that combines several commonly used approaches, including statistical and dictionary-based ones. In the hybrid approach, each word for which statistical information is available is used with priority, while the remaining words are stored in the dictionary and assigned a default probability. The segmentation whose word probabilities yield the highest product is taken as the best solution (Nie et al., 1996). The hybrid approach finds exact keywords with the dictionary and extracts new keywords statistically. To be put into practice, however, the approach needs resources such as a dictionary and a set of heuristic rules.
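The "highest product of word probabilities" criterion can be sketched with a simple dynamic program. This is an illustrative reading of the scheme, not Nie et al.'s implementation; the fallback probability for unknown single characters is an assumed stand-in for their dictionary default.

```python
def best_segmentation(sentence, probs, default=1e-6):
    """Return the segmentation whose word-probability product is highest.
    `probs` holds statistical estimates for known words; unknown single
    characters fall back to a small default probability, echoing the
    dictionary-plus-default scheme of the hybrid approach."""
    n = len(sentence)
    # best[i] = (score, segmentation) for the prefix of length i
    best = [(1.0, [])] + [(0.0, None)] * n
    for i in range(1, n + 1):
        for j in range(i):
            word = sentence[j:i]
            p = probs.get(word, default if i - j == 1 else 0.0)
            score = best[j][0] * p
            if p > 0.0 and score > best[i][0]:
                best[i] = (score, best[j][1] + [word])
    return best[n][1]
```

For example, with `probs = {"ab": 0.5, "a": 0.3, "b": 0.4, "c": 0.2}`, the string "abc" segments as `["ab", "c"]` (product 0.1) rather than `["a", "b", "c"]` (product 0.024).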
As for deciding the weight of each term, the simplest way is to use the frequency with which the term occurs (TF) in the documents. In long documents, however, terms naturally occur more frequently, so one normalization typically used in weighting algorithms compensates for the number of words an item contains. Buckley, Singhal and Mitra (1995) presented the following term frequency weighting formula:

w_i = tf_i / ((1 - slope) * pivot + slope * tf_i)

where slope was set to 0.2 and pivot was set to the average number of each term occurring in the documents. Jones (1972) presented the following Inverse Document Frequency (IDF) measure:

idf_i = log(N / n_i)

where N is the number of documents and n_i is the total number of documents containing the term i. Several methods have been presented to combine TF with the IDF measure. The approach proposed by Salton and Buckley (1988) is given below:

w_ij = (freq_ij / maxfreq_j) * log(N / n_i)

where freq_ij is the frequency of term i in the document j and maxfreq_j is the maximum frequency of any term in the document j. Lochbaum and Streeter (1989) presented the following entropy measure in several experiments:

w_i = 1 + (1 / log N) * sum_{k=1..N} (Freq_ik / TFreq_i) * log(Freq_ik / TFreq_i)

where N is the number of documents, Freq_ik is the frequency of term i in the document k, and TFreq_i is the total frequency of term i.
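The Salton and Buckley (1988) TF-IDF combination can be computed directly from tokenized documents. The sketch below is minimal and illustrative; tokenization, stop-word removal, and smoothing are deliberately omitted.

```python
import math

def tfidf_weights(docs):
    """Compute w_ij = (freq_ij / maxfreq_j) * log(N / n_i) for every
    term i in every document j, where docs is a list of token lists."""
    N = len(docs)
    df = {}                                # n_i: documents containing term i
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        freq = {t: doc.count(t) for t in set(doc)}
        maxfreq = max(freq.values())       # most frequent term in this doc
        weights.append({t: (f / maxfreq) * math.log(N / df[t])
                        for t, f in freq.items()})
    return weights
```

Note that a term occurring in every document receives weight zero, since log(N/N) = 0; this is the intended effect of the IDF factor.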
The above measures are used in traditional IR. Approaches such as TF and IDF compute the weights of terms. Yang and Korfhage (1993) were the first to use genetic algorithms for query optimization in information retrieval. Their work emphasizes only term weight modification and does not expand queries. This paper presents an alternative approach to finding the keywords of documents and then applies a genetic approach to adapt their weights.
The remainder of the paper is organized as follows. Section 2 presents our system framework. Section 3 combines the Bigram model (Chen et al., 1997; Yang, Chang & Chen, 1993) and the PAT-tree based filter algorithm (Chien et al., 1997) to retrieve keywords. Section 4 presents a training algorithm using genetic algorithms (Chang & Hsu, 1997; Chang & Hsu, 1999; Goldberg, 1989; Yang & Korfhage, 1993) to adapt the weights of keywords. Section 5 presents a relevance feedback mechanism. Section 6 tests the performance of our approach on several kinds of Chinese documents. Section 7 discusses the results, and conclusions are drawn in Section 8.
System framework
The proposed system is based on a vector space model (Salton & MacGill, 1983) in which both documents and queries are represented as vectors. The components of the vectors are keywords extracted from documents or queries. The system ranks the documents according to the degree of similarity between each document and the query vector: the higher the value of the similarity measure, the closer the document is to the query vector. If the value of the similarity measure is sufficiently high, the document is considered relevant to the query and is retrieved.
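The ranking step described above can be sketched with sparse keyword-weight vectors and the standard cosine measure. This is an illustrative sketch under the usual vector space conventions; the paper's exact similarity function may differ.

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse keyword-weight vectors,
    each represented as a dict mapping keyword -> weight."""
    common = set(q) & set(d)
    dot = sum(q[t] * d[t] for t in common)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def rank(query, docs):
    """Rank documents by descending similarity to the query vector."""
    return sorted(docs, key=lambda d: cosine(query, d), reverse=True)
```

A retrieval threshold would then simply discard documents whose similarity falls below a chosen cutoff.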
Keyword extraction
This section presents an approach to retrieving Chinese keywords. Automatic keyword extraction is a hard problem for Chinese documents because the Chinese language has no explicit word boundary markers. The easiest solution is to look words up in a dictionary: if a string appears in the dictionary, it is taken to be a valid lexical unit and confirmed as a keyword. One popular dictionary-based word segmentation approach is the maximum matching method. In such an approach, keyword candidates are segmented by matching the longest dictionary entry at each position.
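The maximum matching method can be sketched as follows (forward variant, illustrative only; the `max_word_len` bound is an assumption, and real segmenters handle ties and backtracking more carefully):

```python
def maximum_matching(sentence, dictionary, max_word_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character when none matches."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + length]
            if length == 1 or cand in dictionary:
                words.append(cand)
                i += length
                break
    return words
```

Its weakness, which motivates the statistics-based alternatives discussed earlier, is that words absent from the dictionary (new terms, proper names) degenerate into single characters.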
Our genetic approach
Once significant keywords are extracted from training data including relevant and irrelevant documents, weights are assigned to the keywords. The weights of the keywords are formed as a query vector.
This section presents a novel genetic algorithm to tune the weights of keywords, with the aim of producing an optimal or near-optimal query vector. The query vector is encoded as a chromosome, with real numbers (the keyword weights) used as genes. The first step is to generate a random initial population of chromosomes.
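The overall loop of such a real-coded GA can be sketched as follows. This is an illustrative skeleton, not the paper's exact operators: the `fitness` function, which would score a candidate query vector's retrieval quality on the training documents, is assumed to be supplied by the caller, and the selection, crossover, and mutation choices here are generic placeholders.

```python
import random

def evolve(num_keywords, fitness, pop_size=20, generations=50,
           mutation_rate=0.1, seed=0):
    """Minimal real-coded GA: a chromosome is a vector of keyword
    weights in [0, 1]; `fitness` maps a weight vector to a score."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(num_keywords)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]           # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, num_keywords)  # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(num_keywords):         # random mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.random()
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```

Because the top half of each generation is carried over unchanged, the best chromosome found so far is never lost (an elitist scheme).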
Relevance feedback
Relevance feedback is an important mechanism in information retrieval (Allan, 1996; Buckley & Salton, 1995; Harman, 1992; Lundquist et al., 1997). Relevance feedback approaches add or subtract a certain value from the weight of each retrieved keyword, regulating the weights step by step to improve performance (Harman, 1992).
Many different techniques (Lundquist et al., 1997) have been used to improve the results obtained from relevance feedback. Some methods are given
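One common concrete form of such weight adjustment is a Rocchio-style update, shown below as an illustrative stand-in rather than the paper's exact formulas; the parameter values alpha, beta, and gamma are assumptions.

```python
def feedback(query, relevant, irrelevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Rocchio-style relevance feedback: move the query vector toward
    the centroid of relevant documents and away from irrelevant ones.
    All vectors are dicts mapping keyword -> weight."""
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if irrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in irrelevant) / len(irrelevant)
        new_q[t] = max(w, 0.0)   # keep weights non-negative
    return new_q
```

Each feedback round thus both reweights existing query keywords and introduces new keywords that appear only in the judged documents.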
Experimental results
The experiment uses documents from the web site http://vita.fju.edu.tw/, which collects categorized documents including Social Tender Sentiments, The Love of Univ., The Spring of Education, The Contribution of Medical Treatment, and so on. The daily news articles are from http://tw.yahoo.com/headlines and are grouped manually into eight sections, namely business, China, entertainment, international news, life, politics, society, and sport. The documents were obtained
Discussion
In Experiment 3, our genetic algorithm performs better than the Y.K. algorithm (Yang & Korfhage, 1993). The Y.K. algorithm uses two-point crossover and random mutation operators. Its two-point crossover operator is the same as our weight-selection crossover, and its random mutation operator selects a random real number to replace the value of a gene of a chromosome. However, our proposed natural crossover and mutation operators could guide the chromosome to a better solution, and
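The Y.K. operators described above, two-point crossover and random gene replacement, can be sketched as follows (illustrative; the mutation rate is an assumed parameter):

```python
import random

def two_point_crossover(a, b, rng=random):
    """Two-point crossover: copy parent a but take the gene segment
    between two random cut points from parent b."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:]

def random_mutation(chrom, rate=0.1, rng=random):
    """Random mutation: replace each gene with a fresh random weight
    in [0, 1) with probability `rate`."""
    return [rng.random() if rng.random() < rate else g for g in chrom]
```

Because random mutation draws replacement weights uniformly, it explores blindly; operators that bias the new value toward promising regions (as the natural operators proposed here do) can converge faster.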
Conclusions
This paper proposes a novel approach to retrieve keywords automatically and then uses genetic algorithms to adapt their weights. The advantage of this approach is that it does not need a dictionary, and it can retrieve any type of keyword, including technical terms and people's names. The precision of document retrieval with this approach is equal to that of the PAT-tree based approach, but it requires less time and memory.
Acknowledgements
A grant from the National Science Council, NSC 89-2213-E-008-006, partially supported this research. We would also like to thank our referees for their helpful comments and suggestions.
References (33)
- et al. (1991). An algorithm for string matching with a sequence of don't cares. Information Processing Letters.
- et al. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management.
- et al. (1995). A genetic approach to the quadratic assignment problem. Computers and Operations Research.
- Incremental relevance feedback for information filtering.
- Text retrieval: theory and practice.
- et al. Automatic query expansion using SMART: TREC 3.
- et al. Optimization of relevance feedback weights.
- et al. New retrieval approaches using SMART: TREC 4.
- Chang, C.-H., & Hsu, C.-C. (1997). Information searching and exploring agent applying clustering and genetic algorithm....
- Chang, C.-H., & Hsu, C.-C. (1999). The design of an information system for hypertext retrieval and automatic discovery...
- Genetic algorithms in search, optimization and machine learning.
- New indices for text PAT trees and PAT arrays.
- Probabilistic and genetic algorithms in document retrieval. Communications of the ACM.