DOI: 10.1145/2492517.2500328
research-article

An analyst-adaptive approach to focused crawlers

Published: 25 August 2013

ABSTRACT

The paper presents a general methodology for implementing a flexible Focused Crawler for investigation, monitoring, and Open Source Intelligence (OSINT). The resulting tool is specifically designed to fit the operational requirements of law-enforcement agencies and intelligence analysts. The architecture of the semantic Focused Crawler offers static flexibility in the definition of the desired concepts, the metrics used, and the crawling strategy; in addition, the method can learn, and adapt to, the analyst's expectations at runtime. The user may give the crawler binary (yes/no) feedback on the current performance of the surfing process, and the crawling engine progressively refines the expected targets accordingly. The implementation is based on an existing text-mining environment, integrated with semantic networks and ontologies. Experimental results confirm the effectiveness of the adaptive mechanism.
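The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the bag-of-words vectors, the cosine score, the additive weight update, and all names (`AdaptiveTarget`, `crawl_order`, `rate`) are assumptions chosen for illustration, standing in for the semantic networks and ontologies the authors use.

```python
from collections import Counter
import heapq
import math


def term_vector(text):
    """Bag-of-words term counts (a stand-in for semantic features)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class AdaptiveTarget:
    """Hypothetical target profile refined by yes/no analyst feedback."""

    def __init__(self, seed_text, rate=0.5):
        self.weights = {t: float(c) for t, c in term_vector(seed_text).items()}
        self.rate = rate  # assumed step size for each feedback signal

    def score(self, text):
        return cosine(self.weights, term_vector(text))

    def feedback(self, text, relevant):
        # "yes" pulls the profile toward the page, "no" pushes it away;
        # weights are clipped at zero so terms never count negatively.
        sign = 1.0 if relevant else -1.0
        for term, count in term_vector(text).items():
            updated = self.weights.get(term, 0.0) + sign * self.rate * count
            self.weights[term] = max(updated, 0.0)


def crawl_order(pages, target):
    """Rank page ids best-first by the current target score."""
    heap = [(-target.score(text), pid) for pid, text in pages.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

In this sketch, a page that scores zero against the seed profile can move up the crawl order once the analyst marks it relevant, because its terms enter the profile with positive weight. A real crawler would score outgoing links rather than a fixed page set; that part is omitted here.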


        • Published in

          ASONAM '13: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
          August 2013
          1558 pages
          ISBN:9781450322409
          DOI:10.1145/2492517

          Copyright © 2013 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 116 of 549 submissions, 21%
