A survey of alternative designs for a search engine storage structure

https://doi.org/10.1016/S0950-5849(01)00175-6

Abstract

Three information retrieval storage structures are considered to determine their suitability for a World Wide Web search engine: The Wolverhampton Web Library — The Next Generation. The structures are an inverted file, a signature file and a Pat tree. A number of implementations are considered for each structure. For the index of an inverted file, a sorted array, a B-tree, a B+-tree, a trie and a hash table are considered. For the signature file, vertical and horizontal partitioning schemes are considered, and for the Pat tree, a tree and an array implementation are considered. A theoretical comparison of the structures is made against seven criteria: response time, support for results ranking, search techniques, file maintenance, efficient use of disk space (including the use of compression), scalability and extensibility. The comparison reveals that an inverted file is the most suitable structure; the signature file and the Pat tree both encounter problems with very large corpora.

Introduction

The World Wide Web (Web) can be compared to a library, but “unlike the orderly world of the library collection, this new source of information, is chaotic, often not organised and includes information not of high quality” [32]. Navigating this vast amount of information is difficult, and finding relevant information can seem an impossible task.

In response to this problem, a number of search engines [25] have been developed to help users of the Web locate information of interest to them. However, as the Web is a relatively recent creation (it became popular around 1993), there are as yet no standards for cataloguing Web pages, and search engines have therefore developed on an ad hoc basis. In practice, a search engine can return tens of thousands of hits for a query, the majority of which may be irrelevant. The most relevant hits may not always appear within the first 10 hits, even though the results are theoretically ranked in order of relevance to the query.

The Wolverhampton Web Library — The Next Generation (WWLib-TNG) is an experimental Web search engine currently under development at the University of Wolverhampton [24], [56]. WWLib-TNG attempts to combine the advantages of classified directories with the advantages of automated search engines by providing automatic classification [26], [27] and automatic resource discovery. Classified directories have many advantages over search engines. Because they are manually maintained, they are context sensitive, and their results are therefore of high quality and contain little irrelevant information. Unfortunately, in comparison to a search engine, a classified directory has a small corpus that contains out-of-date information and many dead links. The constructors of classified directories do not have the resources to check whether pages that have already been classified have changed or moved to a new location. Search engines, on the other hand, use a spider (or robot) to gather Web pages automatically and, as a result, have a large corpus. Some spiders make periodic checks on pages that have already been gathered to see if they have changed or moved to a new location, with the frequency of the visits based on how frequently a page changes.

WWLib-TNG gathers and classifies Web pages from sites in the UK. The Dewey Decimal Classification (DDC) [38] system is used to classify Web pages because it has been used by UK libraries for many years and is therefore well understood. Fig. 1 shows the architecture of WWLib-TNG.

WWLib-TNG consists of the following software components: the Dispatcher, Archiver, Analyser, Filter, Classifier, Builder and Searcher. The function of each component is briefly described below, and a sketch of the component interfaces follows the list:

  • The Dispatcher gathers URLs from a variety of sources;

  • The Archiver receives URLs from the Dispatcher. For each Web page it obtains, it assigns a unique accession number, generates a metadata template and extracts and saves metadata about the Web page. The Archiver then passes a local copy of the page to the Analyser, Classifier and Builder. Metadata generated by the Classifier is returned to the Archiver, which writes it to the corresponding metadata template. The Archiver also maintains two databases: one that holds local copies of Web pages and another that holds metadata templates;

  • The Analyser extracts embedded hyperlinks from Web pages it receives and returns the URLs via the Filter to the Dispatcher;

  • The Filter eliminates URLs which do not have a UK address or are links to dynamic Web pages (for example, pages that contain data taken from a database);

  • The Classifier calculates one or more DDC class marks for each Web page it receives and returns the class mark(s) to the Archiver;

  • The Builder analyses Web pages and extracts information from them to build the main database;

  • The Searcher takes a query from a user and returns results ranked in order of relevance to the query. Upon receiving a query, the Searcher interrogates the main database created by the Builder. A list of accession numbers is returned to the Searcher from the main database. The list of accession numbers is used to locate corresponding metadata templates and local copies that are used to generate results.
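
The division of labour above can be expressed as a set of component interfaces. The sketch below is illustrative only: it assumes Java (the language used elsewhere in the WWLib project), and all interface and method names are our own rather than published details of WWLib-TNG.

    // Illustrative Java interfaces for the WWLib-TNG pipeline described
    // above. All names and signatures are assumptions made for this
    // sketch; the paper does not publish the actual API.
    import java.net.URL;
    import java.util.List;

    interface Dispatcher { void dispatch(URL url); }                   // gathers URLs from a variety of sources
    interface Archiver   { long archive(URL url, String page); }       // returns the assigned accession number
    interface Analyser   { List<URL> extractLinks(String page); }      // embedded hyperlinks, passed via the Filter
    interface Filter     { boolean accept(URL url); }                  // rejects non-UK and dynamic-page URLs
    interface Classifier { List<String> classMarks(String page); }     // DDC class marks, returned to the Archiver
    interface Builder    { void index(long accession, String page); }  // builds the main database
    interface Searcher   { List<Long> search(String query); }          // accession numbers ranked by relevance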

This paper investigates and evaluates information retrieval storage techniques with respect to their suitability for the main database constructed by the Builder. Information retrieval storage techniques are surveyed because the problem of finding relevant information in a collection of documents has been studied since the 1960s and has produced a well-established set of techniques in the field of information retrieval. The storage structures of commercial search engines are generally not disclosed by their developers; one major search engine that does have details of its storage structures published is Google [7].

Evaluation criteria for search engine storage structure

This section presents a list of criteria against which to evaluate the information retrieval storage structures discussed in Section 3. We consider seven criteria to be important: response time, support for results ranking, search techniques, file maintenance, efficient use of disk space, scalability and extensibility. Each criterion is discussed below.

Information retrieval storage structures

The information retrieval storage structures described in the literature are the flat file, the inverted file, the signature file and the Pat tree. The inverted file, signature file and Pat tree are discussed below with respect to their suitability for a Web search engine. The flat file approach is not discussed because its slow retrieval speed, compared to the inverted file and signature file methods, makes it unsuitable for a large database [13].
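
To make the leading candidate concrete, the sketch below shows the logical shape of an inverted file: a lexicon of terms, each pointing to a postings list of document identifiers and term frequencies. This is a minimal in-memory Java sketch under our own assumptions; a production index would keep the lexicon in a disk-based structure (such as the B+-tree discussed later) and compress the postings.

    import java.util.*;

    // A minimal in-memory sketch of an inverted file: each term maps to a
    // postings list of (accession number, term frequency) pairs. The term
    // frequencies are what a searcher would use to rank results.
    class InvertedFile {
        private final Map<String, Map<Long, Integer>> postings = new TreeMap<>();

        // Index every whitespace-delimited term of a document.
        void addDocument(long accessionNumber, String text) {
            for (String term : text.toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(term, t -> new TreeMap<>())
                        .merge(accessionNumber, 1, Integer::sum);
            }
        }

        // Return the postings list for a query term (empty if unseen).
        Map<Long, Integer> lookup(String term) {
            return postings.getOrDefault(term.toLowerCase(), Map.of());
        }
    }

In WWLib-TNG terms, the Builder would call addDocument for each page it receives, and the Searcher would merge the postings lists of the query terms to produce a ranked list of accession numbers.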

Comparison of file structures

Section 3 discussed each file structure separately against the evaluation criteria of Section 2, and a suitable implementation was chosen for each structure: a B+-tree for the index of the inverted file, a multi-organisational partitioning scheme for the signature file and a Pat array for the Pat tree. This section compares the three structures on the evaluation criteria to determine the most suitable structure for WWLib-TNG.
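
Whatever the partitioning scheme, a signature file rests on superimposed coding: each word hashes to a few bit positions in a fixed-width bit string, and a document's signature is the bitwise OR of its word signatures. The Java sketch below illustrates this under assumed parameters (a 256-bit signature with three bits set per word); a query can produce false drops, i.e. signatures that match even though the document does not contain the word, which must be eliminated by checking the text itself.

    import java.util.BitSet;

    // A minimal sketch of superimposed coding for a signature file. The
    // signature width and the number of bits set per word are illustrative
    // assumptions; real systems tune them to bound the false-drop rate.
    class SignatureFile {
        static final int WIDTH = 256; // bits per signature (assumed)
        static final int BITS = 3;    // bit positions set per word (assumed)

        // Hash a word to a signature with BITS bits set.
        static BitSet wordSignature(String word) {
            BitSet sig = new BitSet(WIDTH);
            int h = word.toLowerCase().hashCode();
            for (int i = 0; i < BITS; i++) {
                h = h * 31 + i;
                sig.set(Math.floorMod(h, WIDTH));
            }
            return sig;
        }

        // A document's signature is the OR of its word signatures.
        static BitSet documentSignature(String text) {
            BitSet sig = new BitSet(WIDTH);
            for (String w : text.split("\\s+")) sig.or(wordSignature(w));
            return sig;
        }

        // True if the document MAY contain the word: a false result is
        // definite, a true result may be a false drop.
        static boolean maybeContains(BitSet docSig, String word) {
            BitSet w = wordSignature(word);
            BitSet masked = (BitSet) docSig.clone();
            masked.and(w);
            return masked.equals(w);
        }
    }

The Pat array can likewise be sketched as a sorted array of suffix positions queried by binary search, as below. This in-memory Java sketch is illustrative only: a real Pat array indexes semi-infinite strings starting at word boundaries, lives on disk, and is not built by the naive substring sort used here, which is far too slow for a large corpus.

    import java.util.Arrays;

    // A minimal sketch of a Pat (suffix) array: suffix start positions are
    // sorted lexicographically, so a prefix query becomes a binary search.
    class PatArray {
        private final String text;
        private final Integer[] suffixes;

        PatArray(String text) {
            this.text = text;
            suffixes = new Integer[text.length()];
            for (int i = 0; i < suffixes.length; i++) suffixes[i] = i;
            // Naive construction for illustration only.
            Arrays.sort(suffixes, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        }

        // True if the pattern occurs anywhere in the text, i.e. if some
        // suffix starts with it.
        boolean contains(String pattern) {
            int lo = 0, hi = suffixes.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                String suffix = text.substring(suffixes[mid]);
                if (suffix.startsWith(pattern)) return true;
                if (suffix.compareTo(pattern) < 0) lo = mid + 1;
                else hi = mid - 1;
            }
            return false;
        }
    }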

Conclusions

Web search engines have developed on an ad hoc basis because there are as yet no standards for cataloguing Web pages. As a result, a search engine can return tens of thousands of hits for a query, the majority of which may be irrelevant. The most relevant hits may not always appear within the first 10 hits, even though the results are theoretically ranked in order of relevance to the query.

This paper has compared three information retrieval storage structures: an inverted file, a signature file and a Pat tree.

References (59)

  • R. Baeza-Yates et al., Hierarchies of indices for text searching, Information Systems (1996)
  • U. Manber et al., An algorithm for string matching with a sequence of don't cares, Information Processing Letters (1991)
  • W.Y.P. Wong et al., Signature file methods for implementing a ranking strategy, Information Processing and Management (1990)
  • R.A. Baeza-Yates et al., Fast text searching for regular expressions or automaton searching on tries, Journal of the ACM (1996)
  • R. Baeza-Yates et al., Modern Information Retrieval (1999)
  • E.F. Barbosa et al., Optimised binary search and text retrieval, Proceedings of the European Symposium on Algorithms (1995)
  • R. Bayer et al., Organisation and maintenance of large ordered indexes, Acta Informatica (1972)
  • E. Bertino et al., Indexing Techniques for Advanced Database Systems (1997)
  • S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine. Proceedings of the Seventh International...
  • W.B. Brown, J.P. Callan, W.B. Croft, Fast incremental indexing for full-text information retrieval. Proceedings of the...
  • B. Chidlovskii, U.M. Borghoff, Query translation for distributed information gathering on the web. Proceedings of the...
  • W.B. Croft et al., Implementing ranking strategies using text signatures, ACM Transactions on Office Information Systems (1988)
  • E.S. de Moura et al., Indexing compressed text
  • R. Fagin, Extendible hashing — a fast access method for dynamic files, ACM Transactions on Database Systems (1979)
  • C. Faloutsos, Access methods for text, ACM Computing Surveys (1985)
  • C. Faloutsos, Signature files
  • C. Faloutsos et al., Signature files: an access method for documents and its analytical performance evaluation, ACM Transactions on Office Information Systems (1984)
  • G.H. Gonnet et al., Handbook of Algorithms and Data Structures in Pascal and C (1991)
  • G.H. Gonnet et al., New indices for text: PAT trees and PAT arrays
  • D. Harman, Ranking algorithms
  • D. Harman et al., Retrieving records from a gigabyte of text on a minicomputer using statistical ranking, Journal of the American Society for Information Science (1990)
  • R. Haskin, Special purpose processors for text retrieval, Database Engineering (1981)
  • H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects (1978)
  • M.R. Henzinger, Hyperlink analysis for the web, IEEE Internet Computing (2001)
  • E. Horowitz et al., Fundamentals of Data Structures (1976)
  • M.S. Jackson, J.P.H. Burden, WWLib-TNG — new directions in search engine technology. IEE Informatics Colloquium: Lost...
  • C. Jenkins, Searching the World Wide Web: Tools and Resources for Locating Information. University of Wolverhampton,...
  • C. Jenkins, M. Jackson, P. Burden, J. Wallis, The Wolverhampton Web Library (WWLib) and automatic classification,...
  • C. Jenkins, M. Jackson, P. Burden, J. Wallis, Automatic classification of web resources using Java and Dewey Decimal...