Focused crawler for events

Farag, Mohamed M. G.; Lee, Sunshin; Fox, Edward A.

doi:10.1007/s00799-016-0207-1

Abstract

There is a need for an integrated event focused crawling system to collect Web data about key events. When a disaster or other significant event occurs, many users try to locate the most up-to-date information about that event. Yet there is little systematic collecting and archiving of event information anywhere. We propose intelligent event focused crawling for automatic event tracking and archiving, ultimately leading to effective access. We developed an event model that can capture key event information, and incorporated that model into a focused crawling algorithm. For the focused crawler to leverage the event model in predicting webpage relevance, we developed a function that measures the similarity between two event representations. We then conducted two series of experiments to evaluate our system on two recent events: the California shooting and the Brussels attack. The first experiment series evaluated the effectiveness of our proposed event model representation when assessing the relevance of webpages. Our event model-based representation outperformed the baseline method (topic-only) in precision, recall, and F1-score, with an improvement of 20% in F1-score. The second experiment series evaluated the effectiveness of the event model-based focused crawler for collecting relevant webpages from the WWW. Our event model-based focused crawler outperformed the state-of-the-art baseline focused crawler (best-first) in harvest ratio, with an average improvement of 40%.
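
The ideas named in the abstract (an event representation, a similarity function between two such representations, a best-first crawl frontier, and harvest ratio) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not list the components of the event model, so the `Event` fields (topic terms, location, date), the Jaccard and exact-match comparisons, the weights, and the names `event_similarity`, `Frontier`, and `harvest_ratio` are all assumptions made only for illustration. Harvest ratio is computed in its usual sense, as the fraction of fetched pages judged relevant.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical fields: the abstract does not specify the model's components.
    topic_terms: frozenset
    location: str
    date: str  # ISO date string, e.g. "2016-03-22"


def jaccard(a, b):
    """Set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def event_similarity(target, candidate, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of per-component similarities (weights are illustrative)."""
    w_topic, w_loc, w_date = weights
    topic = jaccard(target.topic_terms, candidate.topic_terms)
    loc = 1.0 if target.location.lower() == candidate.location.lower() else 0.0
    date = 1.0 if target.date == candidate.date else 0.0
    return w_topic * topic + w_loc * loc + w_date * date


class Frontier:
    """Best-first queue of URLs: highest estimated relevance is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Each URL is queued at most once; later pushes of it are ignored.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def harvest_ratio(relevance_flags):
    """Fraction of fetched pages judged relevant (0.0 if nothing was fetched)."""
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

In such a sketch, a crawler would score each outgoing link with `event_similarity` against the target event, push it onto the `Frontier`, and report `harvest_ratio` over the pages it fetched. A topic-only baseline corresponds to weights of `(1.0, 0.0, 0.0)`.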

Acknowledgements

Thanks go to NSF for support, especially through Grants IIS-1619028, CMMI-1638207, DUE-1141209, IIS-1319578, IIS-0916733, and IIS-0736055. Thanks also go to Virginia Tech’s Digital Library Research Laboratory and Department of Computer Science. The statements made herein are solely the responsibility of the authors.

Author information

Authors and Affiliations

Virginia Tech, Blacksburg, VA, 24061, USA
Mohamed M. G. Farag, Sunshin Lee & Edward A. Fox

Corresponding author

Correspondence to Mohamed M. G. Farag.

About this article

Cite this article

Farag, M.M.G., Lee, S. & Fox, E.A. Focused crawler for events. Int J Digit Libr 19, 3–19 (2018). https://doi.org/10.1007/s00799-016-0207-1

Received: 01 April 2016
Revised: 27 December 2016
Accepted: 29 December 2016
Published: 07 January 2017
Issue Date: March 2018
DOI: https://doi.org/10.1007/s00799-016-0207-1