Abstract:
In recent years, several information retrieval methods using information about Web-links have been developed, such as HITS and trawling. To analyze Web-links for information retrieval by dividing them into links inside each Web site (local-links) and links between Web sites (global-links), a proper model of the Web site is required. In existing research, a Web server is used as the model of a Web site. This idea works relatively well when a Web site corresponds to one server, as is the case for public Web sites, but works poorly when multiple Web sites share a server, as is the case for private Web sites on rental Web servers. We propose a new model of the Web site, the "directory-based site", to handle typical private sites, and a method to identify such sites using information about URLs and Web-links. Through computational experiments on jp-domain URL and Web-link data containing over 23 million URLs and 100 million Web-links, collected from July to August 2000 by Toyoda and Kitsuregawa, we verify that the method can identify, for 66% of over 110,000 servers, whether each server hosts multiple directory-based sites or not, and that it extracts over 500,000 directory-based sites and 4 million global-links. We also propose a new framework of Web-link-based information retrieval that uses directory-based sites and global-links instead of Web pages and whole Web-links, respectively, and examine the effectiveness of our framework by comparing the result of trawling under our framework with that under the existing framework.
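The abstract does not specify the identification algorithm, but the core idea of a directory-based site can be illustrated with a minimal sketch: approximate each site by the server name plus a top-level directory (as with user directories such as /~alice/ on a shared rental server), then classify each Web-link as local or global depending on whether its endpoints fall in the same site. The `depth=1` heuristic and the example hostnames below are assumptions for illustration only, not the paper's actual method.

```python
from urllib.parse import urlparse

def site_key(url, depth=1):
    # Hypothetical heuristic: a "directory-based site" is approximated by
    # the server name plus the first `depth` path components.
    p = urlparse(url)
    parts = [s for s in p.path.split("/") if s][:depth]
    return (p.netloc, "/".join(parts))

def classify_link(src, dst, depth=1):
    # A link is "local" if both endpoints belong to the same
    # directory-based site, and "global" otherwise.
    return "local" if site_key(src, depth) == site_key(dst, depth) else "global"

# Two pages under /~alice/ on the same server form one site;
# /~alice/ and /~bob/ on the same server are distinct sites.
print(classify_link("http://isp.example/~alice/index.html",
                    "http://isp.example/~alice/links.html"))  # local
print(classify_link("http://isp.example/~alice/index.html",
                    "http://isp.example/~bob/index.html"))    # global
```

Under the server-based model both example links would be local-links, since all four pages sit on one server; the directory-based model separates the second link out as a global-link, which is exactly the distinction the proposed framework exploits.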
Published in: Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002.
Date of Conference: 14 December 2002
Date Added to IEEE Xplore: 25 February 2003
Print ISBN: 0-7695-1766-8