A neural network-based intelligent metasearch engine

Communicated by George Georgiou
https://doi.org/10.1016/S0020-0255(99)00062-6

Abstract

Determining the relevancy of Web pages to a query term is basic to the working of any search engine. In this paper we present a neural network-based algorithm to classify the relevancy of search results on a metasearch engine. The fast-learning neural network technology we use enables the metasearch engine to handle a query term in a reasonably short time and return the search results with high accuracy.

Introduction

The last 10 years of this century have witnessed the great success of the Internet and the World Wide Web. It is estimated that there are now at least 800 million pages on the Web [1]. To help users mine the Web efficiently, Web search engines have been developed. These search engines collect and index Web pages; once a search engine accepts a query of keywords from the user, it retrieves from its database the Web pages that match the query according to certain criteria. Web search is one of the biggest new industries on the Internet. Some of the well-known search engines are Yahoo, Northern Light, WebCrawler, HotBot, Alta Vista, Excite and Infoseek.

Web search engines are of great help for Web surfing. However, as the Web is growing rapidly and as it is a dynamic, distributed and autonomous information system, information retrieval from the Web is more difficult than from conventional systems. The performance of major search engines is far from satisfactory. We list some of the major disadvantages of current search engines below:

First, the coverage of any single search engine is severely limited. Research shows that none of the major search engines indexes more than one-sixth of all Web pages, and the coverage of different search engines may vary by an order of magnitude [1].

Second, the search results are not accurate. The accuracy of the search results is measured by the relevancy of the Web pages to the query terms. Often a simple query on a search engine can generate tens of thousands of Web pages. It is obvious that if these pages are not sorted and listed in an appropriate order, almost no meaningful information can be retrieved from such a huge amount of data. All the search engines rank the relevancy of the search results to the query terms and display them in such an order that the more relevant pages come first and the less relevant pages come last. The ranking algorithms are usually based on information retrieval models such as the Vector Space Model, probability models and fuzzy logic models [2]. These models depend on the frequency of query keywords in the document to determine the similarity between the query terms and the document content [3]. However, the frequency of keywords reflects the content of a Web page only very roughly: a high frequency of keywords does not necessarily mean a high relevancy of the Web page. Moreover, the standard search engines are more concerned with handling queries quickly, so they tend to use relatively simple, fast ranking schemes [4]. All of this may cause a search engine to give a poor ranking of the search results. For example, when the sample query “China Sports Express” is submitted to the search engines Yahoo, Excite, Infoseek and WebCrawler, we find that all of these search engines perform poorly. For this search it is assumed that the user aims to find Web pages that provide sports news of China. Any Web page that does not offer China sports news or does not contain a direct hyperlink to a page that provides such a service is considered irrelevant to the query term. The analysis of the search results is shown in Table 1.

Search results in Table 1 show that the search engines may yield highly inaccurate search results. Studies by other researchers also show that up to 75% of the search results could be irrelevant [5].
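To make the frequency-based ranking discussed above concrete, the following is a minimal sketch of Vector Space Model-style scoring in Python. The tokenization, the scoring function and the example pages are illustrative assumptions, not code from the paper; the sketch also exhibits the weakness just noted, since a page that merely repeats the keywords can outscore a genuinely relevant one.

import math
from collections import Counter

def tf_vector(text):
    # Term-frequency vector of a lowercased, whitespace-tokenized text.
    return Counter(text.lower().split())

def cosine_similarity(query, document):
    # Cosine similarity between the term-frequency vectors of query and document.
    q, d = tf_vector(query), tf_vector(document)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values())) *
            math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# A keyword-stuffed page scores at least as high as a genuinely relevant one.
query = "china sports express"
stuffed = "china china sports sports express express china sports express"
relevant = "latest sports news from china with express coverage of national leagues"
print(cosine_similarity(query, stuffed))   # 1.0: perfect score, irrelevant page
print(cosine_similarity(query, relevant))  # about 0.52 for the useful page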

Section snippets

Metasearch engine and neural classification

As discussed in the previous section, each single standard search engine covers only a small fraction of the indexable Web pages. Because different technologies are used to collect and index Web pages, these search engines yield different results for the same query term. It is obvious that if the power of several standard search engines could be combined, the coverage of Web pages would be greatly improved. This is the basic idea of the metasearch engine. The metasearch
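The fan-out-and-merge idea behind a metasearch engine can be sketched as follows. The engine list, the query_engine placeholder and the duplicate-dropping merge policy are hypothetical illustrations under stated assumptions, not the actual Anvish implementation.

from concurrent.futures import ThreadPoolExecutor

ENGINES = ["yahoo", "excite", "infoseek", "webcrawler"]  # illustrative names

def query_engine(engine, query):
    # Placeholder: a real system would issue an HTTP request to the engine
    # and parse its result page into a ranked list of URLs.
    return []

def metasearch(query):
    # Query all engines in parallel and merge their result lists, dropping
    # duplicate URLs while preserving each engine's ranking order.
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda e: query_engine(e, query), ENGINES)
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

A merge such as this only widens coverage; ranking the combined list is where the neural classification described in this paper comes in.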

Experimental results

In this section, we present some search examples to illustrate how the Anvish metasearch engine works and compare its performance to the commercial metasearch engine MetaCrawler. Two query terms are used to test these search engines. The first is “China Sports Express” and the second is “Java Language Tutorial”. These two queries cover totally different topics. The first search aims to find Web sites that provide China sports news services while the second search tries to find tutorial

Conclusion

This paper presents a novel type of metasearch engine – the Anvish search engine. Unlike other standard or metasearch engines, which use conventional statistical technologies to process search results, Anvish has an embedded fast-learning neural network to classify and organize the search results. Experimental results show that the Anvish search engine can return Web pages that are more relevant to a query term. In the current version of Anvish, Web pages are classified into two classes only.

Anvish may be
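The two-class scheme can be sketched as follows, following the summary in the citation excerpts below, where results ranked near the top by the underlying engines serve as positive training examples and results near the bottom as negative ones. The bag-of-words features, the single logistic unit and all training constants here are assumptions standing in for the paper's fast-learning network, not its actual architecture.

import numpy as np

def bag_of_words(texts, vocab):
    # Binary bag-of-words features over a fixed vocabulary.
    return np.array([[1.0 if w in t.lower().split() else 0.0 for w in vocab]
                     for t in texts])

def train_classifier(X, y, lr=0.5, epochs=200):
    # A logistic unit trained by gradient descent: a one-layer stand-in
    # for the paper's fast-learning network.
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid activations
        grad = p - y                            # cross-entropy gradient
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def classify(texts, vocab, w, b):
    # Assign each page to one of the two classes: relevant or irrelevant.
    p = 1.0 / (1.0 + np.exp(-(bag_of_words(texts, vocab) @ w + b)))
    return ["relevant" if pi > 0.5 else "irrelevant" for pi in p]

# Illustrative usage with tiny made-up data: top-ranked results are taken
# as positives, bottom-ranked results as negatives.
vocab = ["china", "sports", "news", "casino"]
top = ["china sports news daily", "sports news from china"]
bottom = ["online casino bonus offers", "casino games casino"]
X = bag_of_words(top + bottom, vocab)
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train_classifier(X, y)
print(classify(["china sports scores today", "best casino offers"], vocab, w, b))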

References (10)


Cited by (35)

  • The Random Neural Network in a neurocomputing application for Web search

    2018, Neurocomputing
    Citation Excerpt:

    Burgues et al. [4] use neural networks to evaluate Web sites by training the neural network based on query-document pairs. Shu and Kak [5] retrieve results from different Web search engines and train the network on the assumption that a result in a top position would be relevant. Boyan et al. [6] use reinforcement learning to rank Web pages using their HTML properties and hyperlink connections.

  • An interactive agent-based system for concept-based web search

    2003, Expert Systems with Applications
    Citation Excerpt:

    Yet in the latter case, a user may have his specific targets in each single search; therefore, previously collected feature words are not necessarily related to a new search. Based on the above method, Shu and Kak have used the retrieved pages with the highest and lowest scores given by the search engine as positive and negative examples to train a neural network as a filter to recognize other pages for a specific query in Web search (Shu & Kak, 1999). However, the task of concept-based semantic search is subjective; whether a certain page is what the user needs for the concept depends on his own judgment.
