Abstract
The number of vertical search engines and portals has increased rapidly in recent years, making the importance of topic-driven (focused) crawlers evident. In this paper, we develop a latent semantic indexing classifier that combines link analysis with text content in order to retrieve and index domain-specific web documents. We compare its efficiency with that of other well-known web information retrieval techniques. Our implementation presents a different approach to focused crawling and aims to overcome the limitation of having to provide initial training data while maintaining a high recall/precision ratio.
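As a rough illustration of how text and link evidence might be combined in a latent semantic space, the sketch below stacks a term-document matrix on top of a link adjacency matrix, takes a truncated SVD, and scores documents by cosine similarity to a topic query. It is not the authors' implementation: the function name `lsi_rank`, the `link_weight` parameter, the way the two matrices are stacked, and the toy data are all assumptions made for the example.

```python
import numpy as np

def lsi_rank(term_doc, links, query, k=2, link_weight=1.0):
    """Score documents against a query in a rank-k latent space (illustrative only).

    term_doc: (n_terms, n_docs) term-frequency matrix.
    links:    (n_docs, n_docs) adjacency matrix; links[i, j] = 1 if doc j links to doc i.
    query:    (n_terms,) term-frequency vector describing the topic.
    """
    # Assumption: stack text rows and link rows so each document column
    # carries both kinds of evidence.
    combined = np.vstack([term_doc, link_weight * links]).astype(float)

    # Truncated SVD: keep at most k non-negligible singular triplets.
    u, s, vt = np.linalg.svd(combined, full_matrices=False)
    k = min(k, int(np.sum(s > 1e-12)))
    u_k, s_k, vt_k = u[:, :k], s[:k], vt[:k, :]

    # Fold the query into the latent space; it has no link information,
    # so the link rows are padded with zeros.
    q = np.concatenate([query, np.zeros(links.shape[0])]).astype(float)
    q_hat = (q @ u_k) / s_k

    # Cosine similarity between the query and each document's latent vector.
    docs = vt_k.T
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    return np.divide(docs @ q_hat, denom,
                     out=np.zeros(docs.shape[0]), where=denom > 0)

# Toy data (hypothetical). Terms: crawler, search, recipe, cooking.
term_doc = np.array([[2, 1, 0],
                     [1, 2, 0],
                     [0, 0, 2],
                     [0, 0, 2]])
# Documents 0 and 1 link to each other; document 2 is isolated.
links = np.array([[0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 0]])
query = np.array([1, 1, 0, 0])  # topic: "crawler search"

scores = lsi_rank(term_doc, links, query, k=2)
print(scores)
```

In a focused crawler, scores of this kind could be used to prioritize which fetched pages to expand next; how the paper actually weights and schedules pages is not stated in the abstract.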
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Almpanidis, G., Kotropoulos, C. (2005). Combining Text and Link Analysis for Focused Crawling. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds) Pattern Recognition and Data Mining. ICAPR 2005. Lecture Notes in Computer Science, vol 3686. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551188_30
DOI: https://doi.org/10.1007/11551188_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28757-5
Online ISBN: 978-3-540-28758-2
eBook Packages: Computer Science, Computer Science (R0)