An adaptive focused Web crawling algorithm based on learning automata

Applied Intelligence

Abstract

Recent years have witnessed the birth and explosive growth of the Web. This growth has turned the Web into a huge source of information in which finding a document without an efficient search engine is unimaginable. Web crawling has become an important part of Web search, and the performance of search engines depends strongly on it. Focused Web crawlers try to restrict the crawling process to topic-relevant Web documents. Topic-oriented crawlers are widely used in domain-specific Web search portals and personalized search tools. This paper designs a decentralized focused Web crawler based on learning automata. Taking advantage of learning automata, the proposed crawler learns the most relevant URLs and the promising paths leading to the target on-topic documents. It can effectively adapt its configuration to Web dynamics. The crawler is expected to achieve a higher precision rate because it constructs a small Web graph consisting only of on-topic documents. The convergence of the proposed algorithm is proved using the martingale theorem. To evaluate the proposed crawler, extensive simulation experiments are conducted. The results show that the proposed crawler outperforms several existing methods in terms of precision, recall, and running time. A t-test is used to verify the statistical significance of the precision results.
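To make the idea of a learning automaton choosing among outgoing links more concrete, the following is a minimal sketch only. It assumes the linear reward-inaction update, a standard learning-automata scheme, which is not necessarily the rule used in the paper. The class `LearningAutomaton`, the function `simulate`, and the per-link relevance probabilities are hypothetical names and values introduced purely for illustration.

```python
import random


class LearningAutomaton:
    """Linear reward-inaction automaton over a fixed action set (illustrative only)."""

    def __init__(self, actions, learning_rate=0.1, seed=None):
        if not actions:
            raise ValueError("at least one action is required")
        self.actions = list(actions)
        self.a = learning_rate
        # Start with a uniform action probability vector.
        self.p = [1.0 / len(self.actions)] * len(self.actions)
        self._rng = random.Random(seed)

    def choose(self):
        """Sample an action index according to the current probabilities."""
        return self._rng.choices(range(len(self.actions)), weights=self.p, k=1)[0]

    def reward(self, i):
        """Reinforce action i: p_i += a(1 - p_i), p_j *= (1 - a) for j != i."""
        for j in range(len(self.p)):
            if j == i:
                self.p[j] += self.a * (1.0 - self.p[j])
            else:
                self.p[j] *= 1.0 - self.a

    # Under reward-inaction, an unfavorable response leaves p unchanged,
    # so no penalty method is needed.


def simulate(relevance, steps=2000, learning_rate=0.05, seed=0):
    """Toy environment: following a link is rewarded with an assumed probability.

    `relevance` maps a hypothetical link name to the chance that the page
    behind it turns out to be on-topic.
    """
    rng = random.Random(seed)
    la = LearningAutomaton(list(relevance), learning_rate, seed=seed)
    for _ in range(steps):
        i = la.choose()
        if rng.random() < relevance[la.actions[i]]:
            la.reward(i)
    return dict(zip(la.actions, la.p))
```

Calling `simulate({"link_a": 0.9, "link_b": 0.4, "link_c": 0.1})` returns the learned probability of following each hypothetical link. In this scheme the probability mass tends to shift toward links that are rewarded more often, which mirrors the abstract's description of learning the most relevant URLs.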

Author information

Correspondence to Javad Akbari Torkestani.

About this article

Cite this article

Akbari Torkestani, J. An adaptive focused Web crawling algorithm based on learning automata. Appl Intell 37, 586–601 (2012). https://doi.org/10.1007/s10489-012-0351-2

  • DOI: https://doi.org/10.1007/s10489-012-0351-2
