Combining Text and Link Analysis for Focused Crawling

Almpanidis, George; Kotropoulos, Constantine

doi:10.1007/11551188_30

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3686)

Abstract

The number of vertical search engines and portals has increased rapidly in recent years, making the importance of topic-driven (focused) crawlers evident. In this paper, we develop a latent semantic indexing classifier that combines link analysis with text content in order to retrieve and index domain-specific web documents. We compare its efficiency with that of other well-known web information retrieval techniques. Our implementation presents a different approach to focused crawling and aims to remove the need to provide initial training data while maintaining a high recall/precision ratio.
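
The abstract does not spell out the classifier itself, so the following is only a generic sketch of how latent semantic indexing can mix text and link evidence, not the authors' algorithm. It assumes a term-document count matrix stacked on top of a link matrix, a truncated SVD of the result, and cosine scoring of a folded-in term query. The function name `lsi_scores`, the `link_weight` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def lsi_scores(text_matrix, link_matrix, query_terms, k=2, link_weight=1.0):
    """Score documents against a term query in a joint text+link latent space.

    text_matrix: (n_terms, n_docs) term counts per document.
    link_matrix: (n_docs, n_docs), entry [i, j] = 1 if document i links to j.
    query_terms: (n_terms,) term counts of the query.
    Assumes k does not exceed the rank of the stacked matrix.
    """
    n_docs = text_matrix.shape[1]
    # Stack term rows and weighted link rows: column j then describes
    # document j by its words and by the documents that link to it.
    combined = np.vstack([text_matrix, link_weight * link_matrix]).astype(float)
    # Truncated SVD: keep the k largest singular triplets.
    u, s, vt = np.linalg.svd(combined, full_matrices=False)
    u_k, s_k, docs_hat = u[:, :k], s[:k], vt[:k, :].T
    # Fold the query into the latent space; its link part is all zeros.
    query = np.concatenate([query_terms, np.zeros(n_docs)]).astype(float)
    q_hat = (query @ u_k) / s_k
    # Cosine similarity between the query and each document.
    num = docs_hat @ q_hat
    den = np.linalg.norm(docs_hat, axis=1) * np.linalg.norm(q_hat)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Toy collection (made up): documents 0 and 1 share vocabulary and link to
# each other; document 2 uses different terms and has no links.
text = np.array([[2, 1, 0],
                 [1, 2, 0],
                 [0, 0, 2],
                 [0, 0, 1]])
links = np.array([[0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 0]])
query = np.array([1, 0, 0, 0])  # a query containing only term 0
scores = lsi_scores(text, links, query, k=2)
```

In this toy case the two mutually linked, on-topic documents score far higher than the unrelated one. A focused crawler could use such scores to decide which fetched pages to index or expand, although how the paper does this is not stated in the abstract.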

Author information

Authors and Affiliations

Department of Informatics, Aristotle University of Thessaloniki, Box 451, GR-54124, Thessaloniki, Greece
George Almpanidis & Constantine Kotropoulos

Editor information

Editors and Affiliations

Research School of Informatics, Loughborough, UK
Sameer Singh
ATR Lab, Research School of Informatics, University of Loughborough, Loughborough, UK
Maneesha Singh
IBM Corporation, 1133 Westchester Avenue, White Plains, 10604, New York, United States
Chid Apte
Institute of Computer Vision and applied Computer Sciences, IBaI, Germany
Petra Perner

About this paper

Cite this paper

Almpanidis, G., Kotropoulos, C. (2005). Combining Text and Link Analysis for Focused Crawling. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds) Pattern Recognition and Data Mining. ICAPR 2005. Lecture Notes in Computer Science, vol 3686. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551188_30

DOI: https://doi.org/10.1007/11551188_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28757-5
Online ISBN: 978-3-540-28758-2
eBook Packages: Computer Science, Computer Science (R0)
