research-article

Seed selection for domain-specific search

Authors:

Pattisapu Nikhil Priyatam,

Vasudeva Varma

WWW '14 Companion: Proceedings of the 23rd International Conference on World Wide Web

Pages 923 - 928

https://doi.org/10.1145/2567948.2579216

Published: 07 April 2014


Abstract

The last two decades have witnessed an exponential rise in web content from a plethora of domains, which has necessitated the use of domain-specific search engines. Diversity of crawled content is one of the crucial aspects of a domain-specific search engine. To a large extent, diversity is governed by the initial set of seed URLs. Most of the existing approaches rely on manual effort for seed selection. In this work we automate this process using URLs posted on Twitter. We propose an algorithm to get a set of diverse seed URLs from a Twitter URL graph. We compare the performance of our approach against the baseline zero similarity seed selection method and find that our approach beats the baseline by a significant margin.
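The abstract does not spell out the selection algorithm, so the following is only a minimal sketch of how diverse seeds might be drawn from a URL graph built from tweets. It is not the authors' method. It assumes, purely for illustration, that two URLs are linked when the same users tweeted both (weighted by Jaccard similarity of their user sets) and that seeds are picked greedily so each new seed is as dissimilar as possible from those already chosen. The names `build_url_graph`, `select_diverse_seeds`, and the sample data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_url_graph(url_to_users):
    """Assumed graph: an edge between two URLs whenever they share a tweeting
    user, weighted by the Jaccard similarity of their user sets."""
    graph = {}
    for u, v in combinations(sorted(url_to_users), 2):
        sim = jaccard(url_to_users[u], url_to_users[v])
        if sim > 0:
            graph[(u, v)] = sim
    return graph


def similarity(graph, u, v):
    """Look up an undirected edge weight; unconnected URLs have similarity 0."""
    key = (u, v) if u < v else (v, u)
    return graph.get(key, 0.0)


def select_diverse_seeds(url_to_users, k):
    """Greedy max-min diversity selection (an illustrative stand-in, not the
    algorithm from the paper)."""
    urls = sorted(url_to_users)
    if not urls or k <= 0:
        return []
    graph = build_url_graph(url_to_users)
    # Start from the most widely tweeted URL; ties broken alphabetically.
    seeds = [min(urls, key=lambda u: (-len(url_to_users[u]), u))]
    while len(seeds) < min(k, len(urls)):
        candidates = [u for u in urls if u not in seeds]
        # Pick the candidate whose closest already-chosen seed is least similar.
        nxt = min(
            candidates,
            key=lambda u: (max(similarity(graph, u, s) for s in seeds), u),
        )
        seeds.append(nxt)
    return seeds


# Hypothetical sample: URL -> set of users who tweeted it.
tweets = {
    "http://a.example/1": {"alice", "bob", "carol"},
    "http://a.example/2": {"alice", "bob"},
    "http://b.example/1": {"dave"},
    "http://c.example/1": {"carol", "erin"},
}

seeds = select_diverse_seeds(tweets, 3)
```

On this sample the near-duplicate `http://a.example/2` is passed over in favor of URLs tweeted by other users, which is the kind of diversity the abstract asks of a seed set. A real system would also need domain filtering and a similarity signal chosen to match the paper's graph definition.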



Index Terms

Seed selection for domain-specific search
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals



Published In

WWW '14 Companion: Proceedings of the 23rd International Conference on World Wide Web

April 2014

1396 pages

ISBN: 9781450327459

DOI: 10.1145/2567948

General Chair: Chin-Wan Chung (Korea Advanced Institute of Science and Technology, Korea)

Program Chairs: Andrei Broder (Google Inc., USA), Kyuseok Shim (Seoul National University, Korea), Torsten Suel (New York University, USA)


In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

WWW '14

Sponsor:

IW3C2

WWW '14: 23rd International World Wide Web Conference

April 7 - 11, 2014

Seoul, Korea

Acceptance Rates

Overall Acceptance Rate 728 of 4,541 submissions, 16%
