DOI:10.1145/2567948.2579216
research-article

Seed selection for domain-specific search

Published: 07 April 2014

Abstract

The last two decades have witnessed an exponential rise in web content from a plethora of domains, which has necessitated the use of domain-specific search engines. Diversity of crawled content is one of the crucial aspects of a domain-specific search engine. To a large extent, diversity is governed by the initial set of seed URLs. Most of the existing approaches rely on manual effort for seed selection. In this work we automate this process using URLs posted on Twitter. We propose an algorithm to get a set of diverse seed URLs from a Twitter URL graph. We compare the performance of our approach against the baseline zero similarity seed selection method and find that our approach beats the baseline by a significant margin.
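The abstract does not say how the Twitter URL graph is built or how diversity is scored, so the following is only an illustrative sketch and not the authors' algorithm. It assumes that two URLs are linked when the same user posted both, and it picks seeds greedily by farthest-first traversal over hop distance, so each new seed is as far as possible from the seeds already chosen. The names `build_url_graph`, `hop_distances`, and `select_diverse_seeds`, the edge rule, and the sample data are all hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_url_graph(posts):
    """Link two URLs when the same user posted both (an assumed edge rule)."""
    urls_by_user = defaultdict(set)
    for user, url in posts:
        urls_by_user[user].add(url)
    graph = defaultdict(set)
    for urls in urls_by_user.values():
        for url in urls:
            graph[url]  # touch the key so isolated URLs are still nodes
        for a, b in combinations(sorted(urls), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable URL."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def select_diverse_seeds(graph, k):
    """Greedy farthest-first selection of up to k seed URLs."""
    nodes = sorted(graph)
    if not nodes or k <= 0:
        return []
    seeds = []
    # nearest[n] = hop distance from n to its closest chosen seed so far.
    nearest = {n: float("inf") for n in nodes}
    # Start from the best-connected URL; ties break alphabetically.
    candidate = min(nodes, key=lambda n: (-len(graph[n]), n))
    while len(seeds) < k:
        seeds.append(candidate)
        for n, d in hop_distances(graph, candidate).items():
            nearest[n] = min(nearest[n], d)
        # Next seed: the URL farthest from all chosen seeds
        # (URLs in unreached components count as infinitely far).
        candidate = min(nodes, key=lambda n: (-nearest[n], n))
        if nearest[candidate] == 0:
            break  # every URL is already a seed
    return seeds


# Hypothetical (user, url) pairs standing in for collected tweets.
posts = [
    ("alice", "a.com"), ("alice", "b.com"),
    ("bob", "b.com"), ("bob", "c.com"),
    ("carol", "d.com"),
]
print(select_diverse_seeds(build_url_graph(posts), 3))
```

Treating unreachable URLs as infinitely distant makes the sketch prefer one seed per disconnected component before it adds a second seed inside any component, which is one simple way to favor diversity.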

Published In

WWW '14 Companion: Proceedings of the 23rd International Conference on World Wide Web
April 2014
1396 pages
ISBN:9781450327459
DOI:10.1145/2567948
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. crawling
  2. diversity
  3. seeds

Qualifiers

  • Research-article

Conference

WWW '14
Sponsor:
  • IW3C2

Acceptance Rates

Overall Acceptance Rate 728 of 4,541 submissions, 16%

Cited By

  • (2023) Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması [Effective Seed URL Selection and Scope Extension Algorithm for Web Crawler]. International Journal of Advances in Engineering and Pure Sciences 35:1 (27-38). DOI:10.7240/jeps.1174193. Online publication date: 30-Mar-2023
  • (2023) Web Tarayıcılarında Tohum URL Seçimi ve Performans Analizi: Kapsamlı Bir İnceleme [Seed URL Selection and Performance Analysis in Web Crawlers: A Comprehensive Review]. Düzce Üniversitesi Bilim ve Teknoloji Dergisi 11:3 (1399-1423). DOI:10.29130/dubited.1097123. Online publication date: 31-Jul-2023
  • (2021) SIREN: A Fine Grained Approach to Develop Information Security Search Engine. Advances in Cybersecurity Management (337-367). DOI:10.1007/978-3-030-71381-2_16. Online publication date: 23-Feb-2021
  • (2019) Using micro-collections in social media to generate seeds for web archive collections. Proceedings of the 18th Joint Conference on Digital Libraries (251-260). DOI:10.1109/JCDL.2019.00042. Online publication date: 2-Jun-2019
  • (2017) Piecing together the puzzle: Improving event content coverage for real-time sub-event detection using adaptive microblog crawling. PLOS ONE 12:11 (e0187401). DOI:10.1371/journal.pone.0187401. Online publication date: 6-Nov-2017
  • (2016) User communities and contents co-ranking for user-generated content quality evaluation in social networks. International Journal of Communication Systems 29:14 (2147-2168). DOI:10.1002/dac.2908. Online publication date: 25-Sep-2016
  • (2015) Identifying Health Domain URLs using SVM. Proceedings of the Third International Symposium on Women in Computing and Informatics (203-208). DOI:10.1145/2791405.2791441. Online publication date: 10-Aug-2015
