A Multi-Threaded Semantic Focused Crawler

  • Regular Paper
Journal of Computer Science and Technology

Abstract

The Web comprises voluminous, rich learning content. The ever-growing volume of learning resources, however, leads to the problem of information overload. The large number of irrelevant results returned by search engines that rely on keyword matching further aggravates the problem. A learner in such a scenario needs semantically matched learning resources as search results. Keeping in view the volume of content and the significance of semantic knowledge, this paper proposes a multi-threaded semantic focused crawler (SFC) designed and implemented to crawl the WWW for educational learning content. The proposed SFC uses a domain ontology to expand a topic term and a set of seed URLs to initiate the crawl. The results obtained from multiple iterations of the crawl on various topics are shown and compared with the results obtained by running an open source crawler on a similar dataset. The results are evaluated using semantic similarity, a vector space model based metric, and the harvest ratio.
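As a rough illustration of the two evaluation measures named in the abstract, the following sketch computes a vector space (cosine) similarity between an expanded topic and each crawled page, and then a harvest ratio as the fraction of crawled pages judged relevant. It uses the textbook definitions only; the abstract does not state the paper's term weighting, tokenization, or relevance threshold, so the raw term frequencies, the whitespace tokenizer, the 0.3 threshold, and the sample topic and pages are all assumptions made for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Map lowercased whitespace tokens to raw term frequencies (assumed weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def harvest_ratio(scores, threshold):
    """Fraction of crawled pages whose similarity reaches the threshold."""
    if not scores:
        return 0.0
    relevant = sum(1 for s in scores if s >= threshold)
    return relevant / len(scores)


# Hypothetical topic term after ontology-based expansion (illustrative only).
expanded_topic = term_vector("sorting algorithm quicksort mergesort heapsort")

# Hypothetical text of two crawled pages.
pages = [
    "quicksort is a sorting algorithm",
    "cheap flights and hotel deals",
]

scores = [cosine_similarity(expanded_topic, term_vector(p)) for p in pages]
ratio = harvest_ratio(scores, threshold=0.3)  # 0.3 is an arbitrary cut-off
print(scores, ratio)
```

In this toy run, the first page shares three of its five terms with the expanded topic and the second shares none, so one of the two pages counts as relevant. A real evaluation would more likely use TF-IDF or ontology-aware weights and a threshold tuned on labelled data.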

Author information

Corresponding author

Correspondence to Anjali Thukral.

Additional information

**MSc (Comp. Sc.) student at University of Delhi, India, at the time of the research of the paper

About this article

Cite this article

Bedi, P., Thukral, A., Banati, H. et al. A Multi-Threaded Semantic Focused Crawler. J. Comput. Sci. Technol. 27, 1233–1242 (2012). https://doi.org/10.1007/s11390-012-1299-8

  • DOI: https://doi.org/10.1007/s11390-012-1299-8
