ABSTRACT
Focused crawling is the process of fetching domain-specific pages from the Web. It is an important method for building domain-specific document collections, but it suffers from low recall because of the local nature of crawling algorithms and the Web's community structure. In this study, we address the limited crawling scope of focused crawling using a result merging approach: the results of crawling processes based on different start URL sets and focused crawling methods were merged. We found that merging considerably improves the effectiveness of focused crawling. The results reported here are based on 10 test topics and 140 crawls in the domains of genomics and genetics.
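The abstract does not say how the crawl results were merged, so the following is only a minimal sketch of one plausible reading: take the union of pages found by several crawls and keep the best relevance score per page. The input format (a dict mapping URL to score), the function names `normalize_url` and `merge_crawls`, and the example URLs and scores are all assumptions made for illustration, not details from the study.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host and drop the fragment so duplicates compare equal."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def merge_crawls(crawls):
    """Union several crawl results, keeping the highest score seen for each page.

    Each crawl is assumed to be a dict mapping URL -> relevance score
    (a hypothetical format chosen for this sketch).
    """
    merged = {}
    for crawl in crawls:
        for url, score in crawl.items():
            key = normalize_url(url)
            if key not in merged or score > merged[key]:
                merged[key] = score
    # Rank merged pages by descending score; ties are broken by URL for determinism.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Two hypothetical crawls started from different seed URL sets.
crawl_a = {"http://Example.org/genes": 0.9, "http://example.org/dna#top": 0.4}
crawl_b = {"http://example.org/dna": 0.7, "http://example.org/rna": 0.6}
merged = merge_crawls([crawl_a, crawl_b])
```

In this sketch, a page reached by both crawls appears once in the merged list, while pages reached by only one crawl are still included, which is how merging can widen the scope covered by any single crawl.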