Abstract
Recent years have witnessed the birth and explosive growth of the Web. This growth has turned the Web into a huge source of information in which finding a document without an efficient search engine is unimaginable. Web crawling has become an important aspect of Web search, and the performance of search engines depends strongly on it. Focused Web crawlers try to restrict the crawling process to topic-relevant Web documents. Such topic-oriented crawlers are widely used in domain-specific Web search portals and personalized search tools. This paper designs a decentralized learning automata-based focused Web crawler. Taking advantage of learning automata, the proposed crawler learns the most relevant URLs and the promising paths leading to the target on-topic documents, and it can effectively adapt its configuration to the Web dynamics. The crawler is expected to achieve a higher precision rate because it constructs a small Web graph of only on-topic documents. The convergence of the proposed algorithm is proved based on the Martingale theorem. To evaluate the proposed crawler, extensive simulation experiments are conducted. The results show the superiority of the proposed crawler over several existing methods in terms of precision, recall, and running time. A t-test is used to verify the statistical significance of the precision results.
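The abstract does not spell out the crawler's update rule, so the following is only a minimal sketch of how a learning automaton could learn which outgoing links of a page are promising. It assumes the standard linear reward-inaction (L_R-I) scheme, which may differ from the scheme used in the paper. The class name `LinkAutomaton`, the `reward_rate` parameter, the `demo` helper, and the per-link relevance probabilities are all hypothetical and introduced purely for illustration.

```python
import random


class LinkAutomaton:
    """Generic linear reward-inaction (L_R-I) automaton over a page's links.

    Illustrative only: not the algorithm from the paper.
    """

    def __init__(self, links, reward_rate=0.1, rng=None):
        if not links:
            raise ValueError("at least one link is required")
        self.links = list(links)
        self.reward_rate = reward_rate
        self.rng = rng or random.Random()
        # Start with a uniform action probability vector.
        self.probs = [1.0 / len(self.links)] * len(self.links)

    def choose(self):
        """Sample the index of a link according to the current probabilities."""
        return self.rng.choices(range(len(self.links)), weights=self.probs, k=1)[0]

    def update(self, chosen, on_topic):
        """Reward the chosen link if it led to an on-topic document.

        Under reward-inaction, a penalty leaves the probabilities unchanged.
        """
        if not on_topic:
            return
        a = self.reward_rate
        for i in range(len(self.probs)):
            if i == chosen:
                self.probs[i] += a * (1.0 - self.probs[i])
            else:
                self.probs[i] *= 1.0 - a


def demo(steps=500, seed=0):
    """Simulate crawling from one page with made-up link relevance values."""
    rng = random.Random(seed)
    # Hypothetical probability that following each link yields an on-topic page.
    relevance = {"a": 0.9, "b": 0.3, "c": 0.1}
    automaton = LinkAutomaton(list(relevance), reward_rate=0.05, rng=rng)
    for _ in range(steps):
        i = automaton.choose()
        on_topic = rng.random() < relevance[automaton.links[i]]
        automaton.update(i, on_topic)
    return dict(zip(automaton.links, automaton.probs))
```

Calling `demo()` returns the learned choice probability for each link. The update keeps the probabilities summing to one, and links that are rewarded more often gain probability mass at the expense of the others.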
Cite this article
Akbari Torkestani, J. An adaptive focused Web crawling algorithm based on learning automata. Appl Intell 37, 586–601 (2012). https://doi.org/10.1007/s10489-012-0351-2
DOI: https://doi.org/10.1007/s10489-012-0351-2