Addressing the limited scope problem of focused crawling using a result merging approach

Authors:
Ari Pirkola, University of Tampere, Finland
Tuomas Talvensaari, University of Tampere, Finland

SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing, March 2010, Pages 1735–1740. https://doi.org/10.1145/1774088.1774459

Published: 22 March 2010

ABSTRACT

Focused crawling is the process of fetching domain-specific pages from the Web. It is an important method for building domain-specific document collections, but it suffers from low recall due to the local nature of crawling algorithms combined with the Web's community structure. In this study, we address the limited crawling scope of focused crawling using a result merging approach: the results of crawling processes based on different start URL sets and focused crawling methods were merged. We found that merging considerably improves the effectiveness of focused crawling. The results reported here are based on 10 test topics and 140 crawls in the domains of genomics and genetics.
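
The abstract does not say how the crawl results were merged, so the following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes each crawl can be represented as a mapping from URL to a relevance score, that merging is a union over URLs, and that a page found by several crawls keeps its highest score. The names `merge_crawl_results` and `recall`, the score representation, and the example URLs are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical representation of one crawl: URL -> relevance score.
CrawlResult = Dict[str, float]


def merge_crawl_results(crawls: Iterable[CrawlResult]) -> List[Tuple[str, float]]:
    """Union the pages of several crawls, keeping the best score per URL."""
    merged: Dict[str, float] = {}
    for crawl in crawls:
        for url, score in crawl.items():
            # A page found by more than one crawl is kept once, with its highest score.
            if url not in merged or score > merged[url]:
                merged[url] = score
    # Rank by descending score; ties are broken by URL so the order is deterministic.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


def recall(found: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of the known relevant pages that appear in `found`."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(relevant_set & set(found)) / len(relevant_set)


# Two toy crawls started from different (assumed) seed sets.
crawl_a = {"http://example.org/gene1": 0.9, "http://example.org/gene2": 0.4}
crawl_b = {"http://example.org/gene2": 0.7, "http://example.org/gene3": 0.6}
relevant = [
    "http://example.org/gene1",
    "http://example.org/gene2",
    "http://example.org/gene3",
    "http://example.org/gene4",
]

merged = merge_crawl_results([crawl_a, crawl_b])
merged_urls = [url for url, _ in merged]
print(recall(crawl_a, relevant), recall(merged_urls, relevant))
```

In this toy setting a single crawl reaches half of the relevant pages, while the merged result reaches three of the four. This illustrates why combining crawls with different starting points can widen scope; it says nothing about the size of the improvement reported in the paper.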

Index Terms

1. Information systems
  1. Information retrieval

Published in
SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing
March 2010
2712 pages
ISBN: 9781605586397
DOI: 10.1145/1774088
Conference Chairs:
Sung Y. Shin, South Dakota State University
Sascha Ossowski, University Rey Juan Carlos, Spain
Michael Schumacher, University of Applied Sciences Western Switzerland, Switzerland
Program Chairs:
Mathew J. Palakal, Indiana University Purdue University
Chih-Cheng Hung, Southern Polytechnic State University
Copyright © 2010 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
digital libraries
focused crawling
information retrieval
Qualifiers
- research-article
Acceptance Rates
SAC '10 Paper Acceptance Rate: 364 of 1,353 submissions, 27%
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%