Towards a Keyword-Focused Web Crawler

Kuśmierczyk, Tomasz; Sydow, Marcin

doi:10.1007/978-3-642-38634-3_21

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7912)

Included in the following conference series:

Intelligent Information Systems Symposium

Abstract

This paper concerns predicting the content of textual web documents based on features extracted from the web pages that link to them. Such prediction may be applied in an intelligent, keyword-focused web crawler. Experiments on publicly available real data from the Open Directory Project, using several classification models, are promising and indicate the potential usefulness of the studied approach for automatically obtaining keyword-rich web document collections.
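
As a rough illustration of the idea in the abstract, the sketch below trains a small Bernoulli naive Bayes classifier on the text surrounding links and uses it to guess whether the linked page contains a keyword of interest. Everything in it is an assumption made for illustration: the function names (`tokenize`, `train`, `score`), the use of link-context words as the only features, the choice of naive Bayes, and the toy training data. It is not the feature set, model, or data used in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, whitespace-split, and deduplicate the words of a link context."""
    return set(text.lower().split())


def train(examples):
    """Fit a Bernoulli naive Bayes model.

    examples: list of (link_context_text, target_contains_keyword) pairs,
    where the boolean says whether the linked page contained the keyword.
    """
    doc_counts = Counter()                       # label -> number of examples
    token_counts = {True: Counter(), False: Counter()}
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        doc_counts[label] += 1
        token_counts[label].update(tokens)
        vocab |= tokens
    return doc_counts, token_counts, vocab


def score(model, text):
    """Return the log-odds that the linked page contains the keyword."""
    doc_counts, token_counts, vocab = model
    tokens = tokenize(text) & vocab              # ignore unseen words
    total = sum(doc_counts.values())
    log_prob = {}
    for label in (True, False):
        n = doc_counts[label]
        lp = math.log((n + 1) / (total + 2))     # smoothed class prior
        for tok in vocab:
            p = (token_counts[label][tok] + 1) / (n + 2)   # Laplace smoothing
            lp += math.log(p if tok in tokens else 1.0 - p)
        log_prob[label] = lp
    return log_prob[True] - log_prob[False]


# Toy, made-up training data: link context -> did the target mention the keyword?
examples = [
    ("cheap flights to warsaw", True),
    ("book flights online", True),
    ("flights and hotels deals", True),
    ("football match results", False),
    ("latest football news", False),
    ("weather forecast today", False),
]
model = train(examples)

# A crawler could visit the highest-scoring links in its frontier first.
frontier = ["football scores", "discount flights"]
ranked = sorted(frontier, key=lambda ctx: score(model, ctx), reverse=True)
```

In a keyword-focused crawler, a score of this kind could serve as the priority of an unvisited link, so that pages predicted to contain the keyword are fetched earlier.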

Author information

Authors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Tomasz Kuśmierczyk & Marcin Sydow
Polish-Japanese Institute of Information Technology, Warsaw, Poland
Marcin Sydow

Editor information

Editors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248, Warsaw, Poland
Mieczysław A. Kłopotek, Jacek Koronacki, Małgorzata Marciniak & Agnieszka Mykowiecka
Institute of Computer Science, Polish Academy of Sciences, ul. Brzegi 55, 80-045, Gdańsk, Poland
Sławomir T. Wierzchoń

About this paper

Cite this paper

Kuśmierczyk, T., Sydow, M. (2013). Towards a Keyword-Focused Web Crawler. In: Kłopotek, M.A., Koronacki, J., Marciniak, M., Mykowiecka, A., Wierzchoń, S.T. (eds) Language Processing and Intelligent Information Systems. IIS 2013. Lecture Notes in Computer Science, vol 7912. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38634-3_21

DOI: https://doi.org/10.1007/978-3-642-38634-3_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38633-6
Online ISBN: 978-3-642-38634-3
eBook Packages: Computer Science, Computer Science (R0)
