An ontology-driven multimedia focused crawler based on linked open data and deep learning techniques

Multimedia Tools and Applications

Abstract

Web-page indexing and classification have been studied extensively since the early years of the WWW. A focused crawler is an intelligent web agent that seeks web pages relevant to a particular topic domain. In this article we propose a novel approach to focused crawling that uses both the textual and the multimedia content of web pages, and we define a novel strategy for deciding whether a web page should be explored further. We implement our framework in a system that aims to improve the crawling task by using semantic-based techniques and combining their results with technologies such as convolutional neural networks and linked open data. Our framework uses ontologies to correlate different topics and understand their relationships. The correlation among topics is used to improve a textual topic detection step. These results are combined with multimedia analysis and classification based on convolutional neural networks, which are used to extract image features. Experimental results are presented and discussed to measure the effectiveness of our framework compared with other approaches, using a ground truth composed of web pages about a specific domain.
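To make the idea of deciding whether a page should be explored further more concrete, the following is a minimal sketch of a focused-crawl loop. It is not the authors' method: the abstract does not say how the textual and image results are combined, so the weighted average, the threshold, the `Page` fields, and the in-memory `TOY_WEB` graph are all assumptions made for illustration. The text and image scores stand in for the outputs of a topic detector and a CNN-based image classifier, which are not implemented here.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    text_score: float   # assumed output of a textual topic detector, in [0, 1]
    image_score: float  # assumed output of an image classifier, in [0, 1]
    links: list = field(default_factory=list)


def combined_relevance(page, text_weight=0.5):
    """Hypothetical combination rule: weighted average of the two scores."""
    return text_weight * page.text_score + (1.0 - text_weight) * page.image_score


def focused_crawl(web, seeds, threshold=0.5, max_pages=100):
    """Visit pages best-first; follow links only from pages judged relevant."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        page = web.get(url)  # stands in for downloading and analysing the page
        if page is None:
            continue
        fetched += 1
        score = combined_relevance(page)
        if score < threshold:
            continue  # off-topic page: its outgoing links are not explored
        relevant.append(url)
        for link in page.links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page that points to it.
                heapq.heappush(frontier, (-score, link))
    return relevant


# Hypothetical four-page web used only to exercise the loop.
TOY_WEB = {
    "a": Page("a", 0.9, 0.7, ["b", "c"]),
    "b": Page("b", 0.2, 0.1, ["d"]),
    "c": Page("c", 0.8, 0.6, ["d"]),
    "d": Page("d", 0.6, 0.9, []),
}

if __name__ == "__main__":
    print(focused_crawl(TOY_WEB, ["a"]))
```

In this toy graph, page "b" scores below the threshold, so its links are never followed, and page "d" is reached only through the relevant page "c". A real crawler would replace the dictionary lookup with page fetching and the stored scores with actual text and image analysis.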

Author information

Corresponding author

Correspondence to Antonio M. Rinaldi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Capuano, A., Rinaldi, A.M. & Russo, C. An ontology-driven multimedia focused crawler based on linked open data and deep learning techniques. Multimed Tools Appl 79, 7577–7598 (2020). https://doi.org/10.1007/s11042-019-08252-2

  • DOI: https://doi.org/10.1007/s11042-019-08252-2
