A survey of alternative designs for a search engine storage structure
Introduction
The World Wide Web (Web) can be compared to a library, but “unlike the orderly world of the library collection, this new source of information, is chaotic, often not organised and includes information not of high quality” [32]. Navigating this vast amount of information is difficult, and finding relevant information can seem an impossible task.
In response to this problem, a number of search engines [25] have been developed to help users of the Web locate information of interest to them. However, as the Web is a relatively recent creation (it became popular around 1993), there are as yet no accepted standards for cataloguing Web pages, and search engines have therefore developed on an ad hoc basis. In practice, a search engine can return tens of thousands of hits for a query, the majority of which may be irrelevant. The most relevant hits may not always appear within the first 10, even though the results are theoretically ranked in order of relevance to the query.
The Wolverhampton Web Library — The Next Generation (WWLib-TNG) is an experimental Web search engine currently under development at the University of Wolverhampton [24], [56]. WWLib-TNG attempts to combine the advantages of classified directories with those of automated search engines by providing automatic classification [26], [27] and automatic resource discovery. Classified directories have many advantages over search engines. Because they are manually maintained, they are context sensitive, and their results are of high quality and contain little irrelevant information. Unfortunately, compared with search engines, classified directories have a small corpus that contains out-of-date information and many dead links. The constructors of classified directories do not have the resources to check whether pages that have already been classified have changed or been moved to a new location. Search engines, on the other hand, use a spider (or robot) to gather Web pages automatically and, as a result, have a large corpus. Some spiders make periodic checks on pages that have already been gathered to see if they have changed or moved; the frequency of these visits is based on how often a page changes.
WWLib-TNG gathers and classifies Web pages from sites in the UK. The Dewey Decimal Classification (DDC) [38] system is used to classify Web pages because it has been used by UK libraries for many years and is therefore well understood. Fig. 1 shows the architecture of WWLib-TNG.
WWLib-TNG consists of the following software components: the Dispatcher, Archiver, Analyser, Filter, Classifier, Builder and Searcher. The function of each component is briefly described below:
- The Dispatcher gathers URLs from a variety of sources;
- The Archiver receives URLs from the Dispatcher. For each Web page it obtains, it assigns a unique accession number, generates a metadata template, and extracts and saves metadata about the Web page. The Archiver then passes a local copy of the page to the Analyser, Classifier and Builder. Metadata generated by the Classifier is returned to the Archiver, which writes it to the corresponding metadata template. The Archiver also maintains two databases: one that holds local copies of Web pages and another that holds metadata templates;
- The Analyser extracts embedded hyperlinks from Web pages it receives and returns the URLs via the Filter to the Dispatcher;
- The Filter eliminates URLs which do not have a UK address or which link to dynamic Web pages (for example, pages that contain data taken from a database);
- The Classifier calculates one or more DDC class marks for each Web page it receives and returns the class mark(s) to the Archiver;
- The Builder analyses Web pages and extracts information from them to build the main database;
- The Searcher takes a query from a user and returns results ranked in order of relevance to the query. Upon receiving a query, the Searcher interrogates the main database created by the Builder. A list of accession numbers is returned to the Searcher from the main database. The list of accession numbers is used to locate corresponding metadata templates and local copies, which are used to generate results.
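As a rough illustration of the Filter component described above, the UK-address and dynamic-page checks might be sketched as follows. The function name and the specific heuristics (query strings and `cgi-bin` paths as markers of dynamic content) are assumptions for illustration, not WWLib-TNG's actual rules:

```python
from urllib.parse import urlparse

def accept_url(url: str) -> bool:
    """Keep only URLs with a UK host that do not look dynamically generated.

    A simplified sketch: a real filter would use more robust heuristics
    for detecting database-backed pages.
    """
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname or ""
    if not host.endswith(".uk"):      # UK address check
        return False
    if parts.query:                   # e.g. "?id=42" suggests a database-backed page
        return False
    if "cgi-bin" in parts.path:       # common marker of dynamic content
        return False
    return True
```

URLs rejected here never reach the Dispatcher, keeping the crawl confined to static UK pages.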
This paper investigates and evaluates information retrieval storage techniques with respect to their suitability for the main database constructed by the Builder. Information retrieval storage techniques are surveyed because the problem of finding relevant information in a collection of documents has been studied since the 1960s, producing a well-established set of techniques in the field of information retrieval. The storage structures of commercial search engines are generally not disclosed by their developers; one major search engine that has published details of its storage structures is Google [7].
Evaluation criteria for search engine storage structure
This section presents a list of criteria used to evaluate the information retrieval storage structures discussed in Section 3. We consider seven criteria to be important: response time, support for results ranking, search techniques, file maintenance, efficient use of disk space, scalability and extensibility. Each criterion is discussed below.
Information retrieval storage structures
The information retrieval storage structures described in the literature are the flat file, inverted file, signature file and Pat tree. The inverted file, signature file and Pat tree approaches are discussed below with respect to their suitability for a Web search engine. The flat file approach is not discussed because its slow retrieval speed, compared with the inverted file and signature file methods, makes it unsuitable for a large database [13].
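To make the inverted file concrete before the detailed discussion, a minimal sketch follows. The accession numbers and page texts are invented for illustration; a production index would also store term positions and frequencies to support ranking:

```python
from collections import defaultdict

def build_inverted_file(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to a sorted postings list of accession numbers."""
    index = defaultdict(set)
    for accession, text in docs.items():
        for term in text.lower().split():
            index[term].add(accession)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical pages keyed by accession number
docs = {1: "web search engine", 2: "inverted file search", 3: "signature file"}
index = build_inverted_file(docs)
# index["search"] -> [1, 2]; index["file"] -> [2, 3]
```

Query evaluation then reduces to intersecting or merging postings lists, which is what makes the inverted file fast at retrieval time.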
Comparison of file structures
Section 3 discussed each file structure separately against the evaluation criteria in Section 2. For each structure, a suitable implementation was chosen: a B+-tree for the index of the inverted file, a multi-organisational partitioning scheme for the signature file, and a Pat array for the Pat tree. This section compares the three structures on the evaluation criteria to determine the most suitable structure for WWLib-TNG.
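As a rough illustration of the signature-file idea being compared here, the classic superimposed-coding scheme hashes each term to a few bit positions and ORs them into a fixed-width document signature; a query matches a document only if all of the query's bits are set. The signature width, bits per term and hash function below are arbitrary choices for this sketch:

```python
import hashlib

SIG_BITS = 64      # signature width (arbitrary for this sketch)
BITS_PER_TERM = 3  # bit positions set per term

def term_bits(term: str) -> int:
    """Superimpose BITS_PER_TERM hash-selected bits for one term."""
    sig = 0
    for i in range(BITS_PER_TERM):
        h = hashlib.md5(f"{i}:{term}".encode()).digest()
        sig |= 1 << (int.from_bytes(h[:4], "big") % SIG_BITS)
    return sig

def signature(text: str) -> int:
    """OR together the bit patterns of all terms in the document."""
    sig = 0
    for term in text.lower().split():
        sig |= term_bits(term)
    return sig

def may_contain(doc_sig: int, query: str) -> bool:
    """All query bits must be set in the document signature.
    False positives are possible; false negatives are not."""
    q = signature(query)
    return doc_sig & q == q
```

Because signatures can collide, any document whose signature matches must still be checked against its full text, which is the main retrieval-time cost of this structure relative to an inverted file.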
Conclusions
Web search engines have developed on an ad hoc basis because there are as yet no accepted standards for cataloguing Web pages. As a result, a search engine can return tens of thousands of hits for a query, the majority of which may be irrelevant. The most relevant hits may not always appear within the first 10, even though the results are theoretically ranked in order of relevance to the query.
This paper has compared three information retrieval storage structures: an inverted file, a signature file and a Pat tree.
References (59)
- et al., Hierarchies of indices for text searching, Information Systems (1996)
- et al., An algorithm for string matching with a sequence of don't cares, Information Processing Letters (1991)
- et al., Signature file methods for implementing a ranking strategy, Information Processing and Management (1990)
- et al., Fast text searching for regular expressions or automaton searching on tries, Journal of the ACM (1996)
- et al., Modern Information Retrieval (1999)
- et al., Optimised binary search and text retrieval, Proceedings of the European Symposium on Algorithms (1995)
- et al., Organisation and maintenance of large ordered indexes, Acta Informatica (1972)
- et al., Indexing Techniques for Advanced Database Systems (1997)
- S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Proceedings of the Seventh International...
- W.B. Brown, J.P. Callan, W.B. Croft, Fast incremental indexing for full-text information retrieval, Proceedings of the...
- Implementing ranking strategies using text signatures, ACM Transactions on Office Information Systems
- Indexing compressed text
- Extendible hashing — a fast access method for dynamic files, ACM Transactions on Database Systems
- Access methods for text, ACM Computing Surveys
- Signature files
- Signature files: an access method for documents and its analytical performance evaluation, ACM Transactions on Office Information Systems
- Handbook of Algorithms and Data Structures in Pascal and C
- New indices for text: PAT trees and PAT arrays
- Ranking algorithms
- Retrieving records from a gigabyte of text on a minicomputer using statistical ranking, Journal of the American Society for Information Science
- Special purpose processors for text retrieval, Database Engineering
- Information Retrieval: Computation and Theoretical Aspects
- Hyperlink analysis for the web, IEEE Internet Computing
- Fundamentals of Data Structures