DOI: 10.1145/1273496.1273504
Article

Focused crawling with scalable ordinal regression solvers

Published: 20 June 2007

Abstract

In this paper we propose a novel, scalable, clustering-based Ordinal Regression formulation, which is an instance of a Second Order Cone Program (SOCP) with one Second Order Cone (SOC) constraint. The main contribution of the paper is a fast algorithm, CB-OR, which solves the proposed formulation more efficiently than general-purpose solvers. Another main contribution is to pose the problem of focused crawling as a large-scale Ordinal Regression problem and solve it using the proposed CB-OR. Focused crawling is an efficient mechanism for discovering resources of interest on the web. Posing focused crawling as an Ordinal Regression problem avoids the need for a negative class and a topic hierarchy, which are the main drawbacks of existing focused crawling methods. Experiments on large synthetic and benchmark datasets show the scalability of CB-OR. Experiments also show that the proposed focused crawler outperforms the state of the art.
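To make the idea of crawling guided by ordinal ranks concrete, here is a minimal sketch. It is not the paper's CB-OR algorithm and does not solve an SOCP. It assumes a linear model has already been trained, predicts ranks with the common threshold rule (score `w·x` compared against sorted thresholds), and expands a toy in-memory link graph in order of predicted rank. The names `ordinal_rank` and `focused_crawl`, the graph, the feature vectors, the weights, and the thresholds are all hypothetical.

```python
import heapq
from bisect import bisect_right


def ordinal_rank(features, weights, thresholds):
    """Return a rank in 1..len(thresholds)+1 for one feature vector.

    Assumption: a linear score compared against ascending thresholds,
    the usual prediction rule in threshold-based ordinal regression.
    """
    score = sum(w * x for w, x in zip(weights, features))
    return bisect_right(thresholds, score) + 1


def focused_crawl(graph, features, seeds, weights, thresholds, budget):
    """Visit up to `budget` pages, always expanding the highest-ranked one.

    `graph` maps a page to its out-links; `features` maps a page to its
    feature vector. Both are toy stand-ins for fetching and parsing pages.
    """
    frontier = []  # min-heap of (-rank, url), so the highest rank pops first
    seen = set()
    for url in seeds:
        seen.add(url)
        rank = ordinal_rank(features[url], weights, thresholds)
        heapq.heappush(frontier, (-rank, url))

    visited = []
    while frontier and len(visited) < budget:
        neg_rank, url = heapq.heappop(frontier)
        visited.append((url, -neg_rank))
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                rank = ordinal_rank(features[link], weights, thresholds)
                heapq.heappush(frontier, (-rank, link))
    return visited


# Hypothetical toy data: five pages, two features each, three ranks.
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
features = {
    "seed": [0.5, 0.5],
    "a": [0.9, 0.8],
    "b": [0.1, 0.2],
    "c": [1.0, 1.0],
    "d": [0.0, 0.1],
}
weights = [1.0, 1.0]
thresholds = [0.5, 1.5]

visited = focused_crawl(graph, features, ["seed"], weights, thresholds, budget=4)
print(visited)
```

In this toy run the crawler follows the high-rank branch (`a`, then `c`) before touching the low-rank page `b`, and the budget is exhausted before `d` is ever fetched. Only an ordering over pages is needed to drive the frontier, so no explicit negative class appears anywhere in the sketch.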

Published In

ICML '07: Proceedings of the 24th international conference on Machine learning
June 2007
1233 pages
ISBN:9781595937933
DOI:10.1145/1273496
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

  • Machine Learning Journal

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Conference

ICML '07 & ILP '07

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%

Cited By

  • (2018) Multiple-Instance Ordinal Regression. IEEE Transactions on Neural Networks and Learning Systems, 29:9 (4398-4413). DOI: 10.1109/TNNLS.2017.2766164. Online publication date: Sep-2018
  • (2018) Focused Web Crawling. Encyclopedia of Database Systems (1493-1501). DOI: 10.1007/978-1-4614-8265-9_165. Online publication date: 7-Dec-2018
  • (2017) Focused Web Crawling. Encyclopedia of Database Systems (1-9). DOI: 10.1007/978-1-4899-7993-3_165-2. Online publication date: 1-Apr-2017
  • (2017) A survey of Web crawlers for information retrieval. WIREs Data Mining and Knowledge Discovery, 7:6. DOI: 10.1002/widm.1218. Online publication date: 7-Aug-2017
  • (2013) A user-oriented web crawler for selectively acquiring online content in e-health research. Bioinformatics, 30:1 (104-114). DOI: 10.1093/bioinformatics/btt571. Online publication date: 29-Sep-2013
  • (2011) Co-citation & co-reference concepts to control focused crawler exploration. Proceedings of the 2011 International Conference on Electrical Engineering and Informatics (1-7). DOI: 10.1109/ICEEI.2011.6021677. Online publication date: Jul-2011
  • (2011) Automatic text classification and focused crawling. 2011 Sixth International Conference on Digital Information Management (143-148). DOI: 10.1109/ICDIM.2011.6093329. Online publication date: Sep-2011
  • (2009) Focused Web Crawling. Encyclopedia of Database Systems (1147-1155). DOI: 10.1007/978-0-387-39940-9_165. Online publication date: 2009
