research-article

An analyst-adaptive approach to focused crawlers

Authors:
Rodolfo Zunino, University of Genova, Genova, Italy
Federica Bisio, University of Genova, Genova, Italy
Chiara Peretti, University of Genova, Genova, Italy
Roberto Surlinelli, Polizia Postale e delle Comunicazioni, Genova, Italy
Eugenio Scillia, Polizia Postale e delle Comunicazioni, Genova, Italy
Augusto Ottaviano, Polizia Postale e delle Comunicazioni, Genova, Italy
Fabio Sangiacomo, CyberLabs Srl, Garbagnate Milanese, Italy

ASONAM '13: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, August 2013, Pages 1073–1077, https://doi.org/10.1145/2492517.2500328

Published: 25 August 2013

ABSTRACT

The paper presents a general methodology for implementing a flexible Focused Crawler for investigation, monitoring, and Open Source Intelligence (OSINT). The resulting tool is specifically designed to fit the operational requirements of law-enforcement agencies and intelligence analysts. The architecture of the semantic Focused Crawler offers static flexibility in the definition of the desired concepts, the metrics used, and the crawling strategy; in addition, the method can learn, and adapt to, the analyst's expectations at runtime. The user may give the crawler binary (yes/no) feedback on the current performance of the surfing process, and the crawling engine progressively refines the expected targets accordingly. The implementation is based on an existing text-mining environment, integrated with semantic networks and ontologies. Experimental results confirm the effectiveness of the adaptive mechanism.
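
The abstract does not spell out the adaptation rule, so the following is only a minimal sketch of how binary analyst feedback could steer a best-first focused crawler. Everything in it is an assumption made for illustration: the toy in-memory `PAGES` graph, the term-count features, the additive weight update in `update`, and the function names. It is not the authors' implementation, which relies on a text-mining environment with semantic networks and ontologies.

```python
import heapq
from collections import Counter

# Hypothetical toy "web": url -> (page text, outgoing links). Illustration only.
PAGES = {
    "seed":  ("fraud football portal", ["news", "shop"]),
    "news":  ("football football football league", ["news2"]),
    "news2": ("football cup", []),
    "shop":  ("fraud phishing kit", ["shop2"]),
    "shop2": ("phishing fraud fraud", []),
}


def features(text, concepts):
    """Count how often each concept term occurs in the page text."""
    counts = Counter(text.lower().split())
    return {c: counts[c] for c in concepts}


def score(feats, weights):
    """Relevance of a page: weighted sum of its concept counts."""
    return sum(weights[c] * n for c, n in feats.items())


def update(weights, feats, relevant, rate=0.5):
    """Assumed feedback rule: a 'yes' raises the weight of every concept
    found on the page, a 'no' lowers it (never below zero)."""
    sign = 1.0 if relevant else -1.0
    for c, n in feats.items():
        if n:
            weights[c] = max(0.0, weights[c] + sign * rate)


def crawl(seed, weights, feedback, limit=10):
    """Best-first crawl. Unvisited links inherit the score of the page that
    linked to them; feedback(url) returns True, False, or None (no opinion)."""
    frontier = [(0.0, seed)]          # min-heap keyed on the negated score
    seen = {seed}
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        text, links = PAGES[url]
        feats = features(text, weights)
        visited.append(url)
        verdict = feedback(url)
        if verdict is not None:
            update(weights, feats, verdict)   # adapt before ranking the links
        parent_score = score(feats, weights)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-parent_score, link))
    return visited
```

In this toy setup, calling `crawl("seed", weights, feedback)` with all weights at 1.0 and a `feedback` function that rejects `news` and accepts `shop` moves the remaining football page to the end of the visit order, whereas a crawl with no feedback follows the football branch first.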

Index Terms

1. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Web-based interaction
2. Information systems
  1. Information retrieval
    1. Document representation
  2. Information systems applications
    1. Data mining


Published in
ASONAM '13: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
August 2013
1558 pages
ISBN: 9781450322409
DOI: 10.1145/2492517
General Chairs:
Jon Rokne, University of Calgary, Calgary, AB, Canada
Christos Faloutsos, Carnegie Mellon University, Pittsburgh, PA
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
OSINT
analyst-adaptation
focused crawler
Acceptance Rates
Overall Acceptance Rate: 116 of 549 submissions, 21%