DOI: 10.1145/1774088.1774459
research-article

Addressing the limited scope problem of focused crawling using a result merging approach

Published: 22 March 2010

ABSTRACT

Focused crawling is the process of fetching domain-specific pages from the Web. It is an important method for building domain-specific document collections, but it suffers from low recall due to the local nature of crawling algorithms combined with the Web's community structure. In this study, we address the limited crawling scope of focused crawling using a result merging approach: the results of crawling processes based on different start URL sets and different focused crawling methods were merged. We found that merging considerably improves the effectiveness of focused crawling. The results reported here are based on 10 test topics and 140 crawls in the domains of genomics and genetics.
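The abstract does not say how the crawl results were merged or scored, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes each crawl yields a mapping from URL to a relevance score, and that merging keeps the highest score seen for each URL. The names `merge_crawls` and `recall`, the example URLs, and the scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical representation: one crawl = {URL: relevance score}.
CrawlResult = Dict[str, float]


def merge_crawls(crawls: Iterable[CrawlResult]) -> List[Tuple[str, float]]:
    """Merge several crawl results into one ranked list.

    Assumption (not from the paper): a URL found by more than one crawl
    keeps its highest score. Ties are broken by URL for a stable order.
    """
    merged: Dict[str, float] = {}
    for crawl in crawls:
        for url, score in crawl.items():
            if score > merged.get(url, float("-inf")):
                merged[url] = score
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


def recall(found_urls: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of the known relevant pages that were found."""
    if not relevant:
        return 0.0
    return len(set(found_urls) & relevant) / len(relevant)


# Two crawls started from different (made-up) seed URL sets.
crawl_a = {"http://example.org/a": 0.9, "http://example.org/b": 0.4}
crawl_b = {"http://example.org/b": 0.7, "http://example.org/c": 0.8}

# Made-up set of pages judged relevant for the topic.
relevant = {"http://example.org/a", "http://example.org/c", "http://example.org/d"}

merged = merge_crawls([crawl_a, crawl_b])
merged_urls = [url for url, _ in merged]
```

In this toy data, each crawl alone finds one of the three relevant pages, while the merged result finds two. That is the kind of recall gain the merging approach aims for, though the numbers here are illustrative only.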

Published in

SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing
March 2010
2712 pages
ISBN: 9781605586397
DOI: 10.1145/1774088

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

SAC '10 paper acceptance rate: 364 of 1,353 submissions (27%)
Overall acceptance rate: 1,650 of 6,669 submissions (25%)
