Recombination Operators in Genetic Algorithm – Based Crawler: Study and Experimental Appraisal

  • Conference paper
Advanced Methods for Computational Collective Intelligence

Part of the book series: Studies in Computational Intelligence (SCI, volume 457)

Abstract

A focused crawler traverses the web and selects pages relevant to a predefined topic. While crawling, it is difficult to identify relevant pages and to predict which links lead to high-quality pages. This paper proposes a topical crawler for Vietnamese web pages that uses a greedy heuristic and genetic algorithms. The genetic-algorithm-based crawler applies different recombination operators to improve crawling performance. We tested our algorithms on websites of the Vietnamese newspaper VnExpress. Experimental results show the efficiency and viability of our approach.
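The abstract does not say which recombination operators the paper compares or how a chromosome is encoded. As a rough illustration only, the sketch below shows three textbook crossover operators (one-point, two-point, and uniform) applied to a chromosome that is assumed here to be a fixed-length list of topic keywords. The encoding, the function names, and the `swap_prob` parameter are hypothetical choices made for this example and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

# Assumption: a chromosome is a fixed-length list of topic keywords.
Chromosome = List[str]


def _check_same_length(a: Sequence[str], b: Sequence[str]) -> None:
    if len(a) != len(b):
        raise ValueError("parents must have equal length")


def one_point_crossover(
    a: Sequence[str], b: Sequence[str], point: int
) -> Tuple[Chromosome, Chromosome]:
    """Exchange the tails of the two parents starting at index `point`."""
    _check_same_length(a, b)
    child1 = list(a[:point]) + list(b[point:])
    child2 = list(b[:point]) + list(a[point:])
    return child1, child2


def two_point_crossover(
    a: Sequence[str], b: Sequence[str], lo: int, hi: int
) -> Tuple[Chromosome, Chromosome]:
    """Exchange the middle segment [lo, hi) between the two parents."""
    _check_same_length(a, b)
    child1 = list(a[:lo]) + list(b[lo:hi]) + list(a[hi:])
    child2 = list(b[:lo]) + list(a[lo:hi]) + list(b[hi:])
    return child1, child2


def uniform_crossover(
    a: Sequence[str],
    b: Sequence[str],
    rng: random.Random,
    swap_prob: float = 0.5,
) -> Tuple[Chromosome, Chromosome]:
    """Swap each gene independently with probability `swap_prob`."""
    _check_same_length(a, b)
    child1, child2 = list(a), list(b)
    for i in range(len(child1)):
        if rng.random() < swap_prob:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2


if __name__ == "__main__":
    parent_a = ["a", "b", "c", "d"]
    parent_b = ["w", "x", "y", "z"]
    print(one_point_crossover(parent_a, parent_b, 2))
    print(two_point_crossover(parent_a, parent_b, 1, 3))
    print(uniform_crossover(parent_a, parent_b, random.Random(0)))
```

Passing the cut points and the random generator explicitly keeps each operator deterministic for a given input, which makes it easier to compare operators under the same conditions. A real crawler would draw the cut points at random and score the resulting children with a relevance-based fitness function.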


Author information

Corresponding author

Correspondence to Huynh Thi Thanh Binh.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Binh, H.T.T., Long, H.M., Khanh, T.D. (2013). Recombination Operators in Genetic Algorithm – Based Crawler: Study and Experimental Appraisal. In: Nguyen, N., Trawiński, B., Katarzyniak, R., Jo, GS. (eds) Advanced Methods for Computational Collective Intelligence. Studies in Computational Intelligence, vol 457. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34300-1_23

  • DOI: https://doi.org/10.1007/978-3-642-34300-1_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34299-8

  • Online ISBN: 978-3-642-34300-1

  • eBook Packages: Engineering, Engineering (R0)
