Skip to main content

Towards a Keyword-Focused Web Crawler

  • Conference paper
Language Processing and Intelligent Information Systems (IIS 2013)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7912))

Included in the following conference series:

  • 1118 Accesses

Abstract

This paper concerns predicting the content of textual web documents based on features extracted from web pages that link to them. It may be applied in an intelligent, keyword-focused web crawler. The experiments made on publicly available real data obtained from Open Directory Project with the use of several classification models are promising and indicate potential usefulness of the studied approach in automatically obtaining keyword-rich web document collections.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 49.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Aggarwal, C.C., Al-Garawi, F., Yu, P.S.: Intelligent crawling on the world wide web with arbitrary predicates. In: Proceedings of the 10th International Conference on World Wide Web, WWW 2001, pp. 96–105. ACM, New York (2001), http://doi.acm.org/10.1145/371920.371955

    Google Scholar 

  2. Alam, M., Ha, J., Lee, S.: Novel approaches to crawling important pages early. Knowledge and Information Systems 33, 707–734 (2012), http://dx.doi.org/10.1007/s10115-012-0535-4

    Article  Google Scholar 

  3. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth and Brooks, Monterey (1984)

    MATH  Google Scholar 

  4. Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks 31(11-16), 1623–1640 (1999), http://www.sciencedirect.com/science/article/pii/S1389128699000523

    Article  Google Scholar 

  5. Davison, B.D.: Topical locality in the web. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2000, pp. 272–279. ACM, New York (2000), http://doi.acm.org/10.1145/345508.345597

    Chapter  Google Scholar 

  6. Diligenti, M., Coetzee, F., Lawrence, S., Giles, C.L., Gori, M.: Focused crawling using context graphs. In: Proceedings of the 26th International Conference on Very Large Data Bases, VLDB 2000, pp. 527–534. Morgan Kaufmann Publishers Inc., San Francisco (2000), http://dl.acm.org/citation.cfm?id=645926.671854

    Google Scholar 

  7. John, G.H., Langley, P.: Estimating continuous distributions in bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI 1995, pp. 338–345. Morgan Kaufmann Publishers Inc., San Francisco (1995), http://dl.acm.org/citation.cfm?id=2074158.2074196

    Google Scholar 

  8. Pant, G., Srinivasan, P.: Learning to crawl: Comparing classification schemes. ACM Trans. Inf. Syst. 23(4), 430–462 (2005), http://doi.acm.org/10.1145/1095872.1095875

    Article  Google Scholar 

  9. Steinwart, I., Christmann, A.: Support Vector Machines, 1st edn. Springer Publishing Company, Incorporated (2008)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kuśmierczyk, T., Sydow, M. (2013). Towards a Keyword-Focused Web Crawler. In: Kłopotek, M.A., Koronacki, J., Marciniak, M., Mykowiecka, A., Wierzchoń, S.T. (eds) Language Processing and Intelligent Information Systems. IIS 2013. Lecture Notes in Computer Science, vol 7912. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38634-3_21

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-38634-3_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38633-6

  • Online ISBN: 978-3-642-38634-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics