ABSTRACT
The rapid growth of the World Wide Web has made the problem of topic-specific resource discovery an important one in recent years. In this problem, the goal is to find web pages that satisfy a predicate specified by the user. Such a predicate could be a keyword query, a topical query, or some arbitrary constraint. Several techniques, such as focussed crawling and intelligent crawling, have recently been proposed for topic-specific resource discovery. All of these crawlers are linkage-based, since they rely on hyperlink behavior to perform resource discovery. Recent studies have shown that the topical correlations in hyperlinks are quite noisy and may not always show the consistency necessary for a reliable resource discovery process. In this paper, we approach the problem of resource discovery from an entirely different perspective: we mine the significant browsing patterns of World Wide Web users in order to model the likelihood that a web page belongs to a specified predicate. This user behavior can be mined from the freely available traces of large public-domain proxies on the World Wide Web. We refer to this technique as collaborative crawling because it mines the collective user experiences in order to find topical resources. Such a strategy is extremely effective because the topical consistency in World Wide Web browsing patterns turns out to be very reliable. In addition, the user-centered crawling system can be combined with linkage-based systems to create an overall system that works more effectively than a system based purely on either user behavior or hyperlinks.