A rough-fuzzy document grading system for customized text information retrieval

doi:10.1016/j.ipm.2003.09.004

Information Processing & Management

Volume 41, Issue 2, March 2005, Pages 195-216

Abstract

Due to the large repository of documents available on the web, users are usually inundated by a large volume of information, most of which is found to be irrelevant. Since user perspectives vary, a client-side text filtering system that learns the user's perspective can reduce the problem of irrelevant retrieval. In this paper, we have provided the design of a customized text information filtering system which learns user preferences and modifies the initial query to fetch better documents. It uses a rough-fuzzy reasoning scheme. The rough-set based reasoning takes care of natural language nuances, like synonym handling, very elegantly. The fuzzy decider provides qualitative grading to the documents for the user's perusal. We have provided the detailed design of the various modules and some results related to the performance analysis of the system.

Introduction

The World Wide Web, with its large collection of documents, is a storehouse of information for any user. Search engines help users locate information, but they usually return a huge list of URLs ordered according to a general relevance computation function. Most users find a large proportion of these documents to be irrelevant. Since no two users usually have identical perspectives, it is very difficult to find a general relevance computation function that can satisfy all users simultaneously. It is also not feasible to load a server with the profiles of all its clients in order to serve them better. A viable way to provide relevant documents to every user is to use client-side information filtering systems, which can learn a client's perspective and grade documents according to a relevance function specific to that client.

In this paper, we present the design of a client-side text information filtering system based on a rough-fuzzy reasoning paradigm. After learning the user's preferences, the system can proactively filter out irrelevant documents in the user's domain of long-term interest. To begin with, the user rates a set of training documents retrieved by posing a query to a standard search engine. The user's response is then analyzed to formulate a modified query which represents the user's interests in a more focused way. This modified query is again fed to the search engine and has been found to retrieve better documents. However, since these documents are also ordered by the grading scheme of the search engine, their ordering still does not reflect the client's preferences. A rough-fuzzy grading scheme is therefore employed to re-evaluate these documents and order them according to the user's preferences.

The distinctive aspects of this work are:

• The use of discernibility to represent the user's relevance feedback.
• The design of a client-side text retrieval system, based on user relevance feedback, using rough-fuzzy reasoning. Most existing systems in this category use probabilistic reasoning (Intarka, Inc., 1999; Pazzani, Muramatsu, & Billsus, 1996). The rough-fuzzy reasoning paradigm helps in modeling natural language based information more elegantly through the use of equivalence relations.

The rest of the paper is organized as follows. Section 2 presents a brief review of related work on text-filtering systems in general and on the application of rough-set theory to text information retrieval. Sections 3 to 7 present the details of the various modules of our system: Section 3 gives a brief overview of rough-set based reasoning for text information retrieval, Section 4 describes the architecture of the customized text filtering system, Section 5 covers the query modifier, which forms the modified query from the most discerning words, Section 6 covers the user preference analyzer, which learns the user's basis for rating, and Section 7 describes the fuzzy grading of documents. Section 8 provides some results and their analysis.

Review of related work

Significant work has been done towards building client side text retrieval systems based on user ratings. In this section, we first provide a brief overview of these. Later in this section we present some of the recent developments in applying rough sets for text information retrieval.

A brief overview of rough-set based reasoning for text information retrieval

Rough sets were introduced by Pawlak (1982). An information system can be defined as a pair 𝒜 = (U, A), where U is a non-empty finite set of objects called the universe and A is a non-empty finite set of attributes. For every a ∈ A, V_a(x) represents the value of attribute a for object x. An information system is called a decision system if it has an additional decision attribute. At the core of all rough-set based reasoning is an equivalence relation called the indiscernibility relation. For any B⊆
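
As an illustration of the indiscernibility relation mentioned above, the following minimal Python sketch assumes the standard rough-set definition, under which two objects are indiscernible with respect to a set of attributes B when they take the same value on every attribute in B. The function name `indiscernibility_classes` and the toy table are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def indiscernibility_classes(table, attributes):
    """Partition objects into classes that agree on every attribute in `attributes`.

    `table` maps each object to a dict of attribute values (hypothetical layout).
    """
    classes = defaultdict(set)
    for obj, values in table.items():
        # Objects with the same value tuple cannot be told apart using `attributes`.
        key = tuple(values[a] for a in attributes)
        classes[key].add(obj)
    return list(classes.values())


# Hypothetical toy information system: objects are documents, attributes are words.
table = {
    "d1": {"cancer": 1, "blood": 1},
    "d2": {"cancer": 1, "blood": 0},
    "d3": {"cancer": 1, "blood": 1},
}

by_cancer = indiscernibility_classes(table, ["cancer"])
by_both = indiscernibility_classes(table, ["cancer", "blood"])
```

With the single attribute "cancer" all three documents fall into one class, while adding "blood" separates d2 from d1 and d3. Enlarging the attribute set can only refine the partition, never merge classes.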

Architecture of the customized text filtering system

In our system, text documents are represented as weighted vectors of words, like those used by Google (Google Search Engine Optimization, 1999). The information system for classifying documents is constructed by taking words to represent attributes, and their weights in documents to represent the values of these attributes. The information system is converted to a decision system by including the user relevance feedback for each document as the decision attribute.
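
The construction described above can be sketched roughly as follows. This is only an illustrative sketch: the function name `build_decision_table`, the row layout, and the rating labels "good" and "bad" are assumptions made for the example, and a word that is absent from a document is assumed to have weight 0.

```python
def build_decision_table(doc_vectors, ratings):
    """Combine weighted word vectors and user ratings into a decision table.

    Each row is a pair (condition attributes, decision attribute).
    """
    # The condition attributes are all words seen in any training document.
    vocabulary = sorted({word for vec in doc_vectors.values() for word in vec})
    rows = {}
    for doc_id, vec in doc_vectors.items():
        # Assumption: a word missing from a document gets weight 0.
        conditions = {word: vec.get(word, 0) for word in vocabulary}
        rows[doc_id] = (conditions, ratings[doc_id])
    return vocabulary, rows


# Hypothetical training data: word weights per document and the user's rating.
doc_vectors = {
    "d1": {"hiv": 10, "treatment": 3},
    "d2": {"hiv": 1, "news": 6},
}
ratings = {"d1": "good", "d2": "bad"}

vocabulary, rows = build_decision_table(doc_vectors, ratings)
```

Keeping the decision attribute separate from the condition attributes avoids a clash if a document happens to contain a word that matches the name chosen for the rating column.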

This decision table is analyzed using rough-set based

Query modifier: forming modified query with most discerning words

As stated in the previous section, every training document is converted to a weighted vector of the words appearing in the document. To calculate the weights of the words, we use the HTML source code of the pages. Since each tag in an HTML document, such as <TITLE> or <B>, has a special significance, we assign a separate weight to each of them. We have given tag weights in the range of 1–10, with 10 for <TITLE>, then 8 for <META>, 6 for <B> etc. Plain words have a weighting factor of 1. The weight
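
A rough sketch of tag-based word weighting is given below, using Python's standard `html.parser` module. Only the weights for <TITLE>, <META>, <B> and plain words come from the text; the class name `WeightedWordParser`, the tokenization, the use of the `content` attribute of <META>, and the rule that a word's weight is the sum of the tag weights over its occurrences are all assumptions, since the paper's exact weight formula is not shown here.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tag weights stated in the text; weights for other tags are not listed there.
TAG_WEIGHTS = {"title": 10, "meta": 8, "b": 6}
PLAIN_WEIGHT = 1


class WeightedWordParser(HTMLParser):
    """Accumulates a weight per word, depending on the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # currently open weighted tags
        self.weights = Counter()  # word -> accumulated weight

    def _add(self, text, weight):
        # Assumption: words are lower-cased alphanumeric runs.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.weights[word] += weight

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # <meta> has no text body; assume its words live in `content`.
            self._add(dict(attrs).get("content") or "", TAG_WEIGHTS["meta"])
        elif tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the innermost open occurrence of this tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        # Assumption: text takes the highest weight among its enclosing tags.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=PLAIN_WEIGHT)
        self._add(data, weight)


def weighted_words(html):
    parser = WeightedWordParser()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


page = (
    "<html><head><title>Blood cancer</title>"
    '<meta name="keywords" content="leukemia"></head>'
    "<body><b>cancer</b> care</body></html>"
)
vector = weighted_words(page)
```

In this example "cancer" appears once in the title and once in bold, so under the summation assumption it accumulates 10 + 6 = 16, while "care" is plain text and receives weight 1.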

User preference analyzer: learning the user's basis for rating

To help the system rate the newly fetched documents and eliminate irrelevant ones, it is essential to learn the user's rating paradigm. For this we make use of rough similarity measures between the modified query and the original documents that were rated by the user. Let S₁ denote the set of words along with their weights, extracted from a document as explained in Section 5. Let S₂ denote the set of most discerning words along with their discerning cut values. Using Eqs. , of Section 3.1, one

Fuzzy grading of documents

The rules generated by the preference analyzer are used to rate the new set of documents retrieved using the modified query. For this we use a fuzzy reasoning scheme which provides both a crisp document grading and a fuzzy visualizer that gives a qualitative idea of the relevance of a document.

Fuzzy reasoning consists of two core activities: editing the fuzzy input membership functions and editing the fuzzy output membership functions. To design the fuzzy input membership functions we have made use of the rules obtained
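
To make the idea of a crisp grade accompanied by a qualitative fuzzy view concrete, here is a small generic sketch. It is not the paper's grading scheme: the triangular membership shapes, the labels "low", "medium" and "high", the representative grade values 0, 5 and 10, and the weighted-average defuzzification are all assumptions chosen for illustration, and the input is assumed to be a relevance score between 0 and 1.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b (a <= b <= c)."""
    if x <= a or x >= c:
        # Shoulder sets have their peak on an edge (a == b or b == c).
        return 1.0 if x == b else 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical input membership functions over a relevance score in [0, 1].
INPUT_SETS = {
    "low": (0.0, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.0),
}
# Hypothetical representative crisp grade for each output label.
OUTPUT_GRADES = {"low": 0.0, "medium": 5.0, "high": 10.0}


def grade(score):
    """Return (crisp grade, membership degree per label) for a score in [0, 1]."""
    memberships = {
        label: triangular(score, *params) for label, params in INPUT_SETS.items()
    }
    total = sum(memberships.values())
    # Weighted-average defuzzification (an assumed, simple choice).
    crisp = sum(memberships[l] * OUTPUT_GRADES[l] for l in memberships) / total
    return crisp, memberships


crisp, memberships = grade(0.75)
```

The membership dictionary plays the role of the qualitative view (here the document is equally "medium" and "high"), while the single number is the crisp grade that can be used for ordering.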

Results

In this section we present some performance analysis of our system. Some of the queries we worked with, such as “HIV” and “alcohol addiction”, were chosen because they had been mentioned in TREC topics. TREC also mentioned “brain cancer” as a topic, but we worked with “Blood cancer” instead, since we had less expertise in rating documents on the former. Similarly, rather than “Thailand tourism” as mentioned in Chakrabarti et al. (1998), we chose “Indian Tourism” as a domain. We chose a new query “Alternative

Conclusion

In this paper we have presented the design of a complete client-side filtering system for general text documents. The system uses the rough set theoretic concept of discernibility to find words that can discern between good documents and bad ones by analysing a set of training documents rated by the user. This scheme is more powerful than the usual techniques of computing term frequency and inverse document frequency, since it takes into consideration the synonymous words very elegantly. A

References (28)

Srinivasan, P., et al. (2001). Vocabulary mining for information retrieval: Rough sets and fuzzy sets. Information Processing and Management.
Allan, J., Ballesteros, L., Callan, J., & Croft, W. (1995). Recent experiments with INQUERY. In Proceedings of the...
Balabanovic, M. (2000). An adaptive web page recommendation service. In 1st international conference on autonomous...
Bodoff, D., et al. (2001). A unified maximum likelihood approach to document retrieval. Journal of the American Society for Information Science and Technology.
Bao, Y., Aoyama, S., Du, X., Yamada, K., & Ishii, N. (2001). A rough set based hybrid method to text categorization. In...
Crestani, F. (1993). Learning strategies for an adaptive information retrieval system using neural networks. In...
Chakrabarti, S., Dom, B., Gibson, D., Kleinberg, J., Raghavan, P., & Rajagopalan, S. (1998). Automatic resource list...
Chouchoulas, A., et al. (2001). Rough set-aided keyword reduction for text categorisation. Journal of Applied Artificial Intelligence.
Das-Gupta, P. (1988). Rough sets and information retrieval. In Proceedings of the eleventh annual international ACM...
Fuhr, N., & Buckley, C. (1993). Optimizing document indexing and search term weighting based on probabilistic models....

Fuzzy Logic Toolbox. The MathWorks, Inc. Available:...

Google Search Engine Optimization. Available:...

Intarka, Inc. (1999). Intarka announces ProspectMiner 1.2––a powerful web mining solution for business. Sun...

Jochem, H., Ralph, B., & Frank, W. (1999). WebPlan: dynamic planning for domain specific search in the internet....
