Genetic Mining of HTML Structures for Effective Web-Document Retrieval

Applied Intelligence

Abstract

Web documents contain many tags that indicate the structure of their text. Text segments marked by HTML tags carry specific meaning that can be exploited to improve the performance of document retrieval systems. In this paper, we present a machine learning approach that mines the structure of HTML documents for effective Web-document retrieval. We describe a genetic algorithm that learns importance factors for HTML tags; these factors are then used to re-rank the documents retrieved by standard weighting schemes. The proposed method has been evaluated on artificial text sets and a large-scale TREC document collection. The experimental results indicate that the algorithm trains the tag weights in accordance with their importance for retrieval, and that the approach significantly improves retrieval accuracy. In particular, document-structure mining tends to move relevant documents to higher ranks, which is especially important in interactive Web-information retrieval environments.
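The idea of learning tag weights with a genetic algorithm and using them to re-rank documents can be illustrated with a minimal sketch. The abstract does not specify the encoding, fitness function, or genetic operators, so everything below is an assumption made for illustration: the tag set `TAGS`, the weighted-sum `score`, average precision as the fitness, binary tournament selection, uniform crossover, Gaussian mutation, elitism, and the toy `DOCS` collection are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical tag set; the tags actually used in the paper are not given here.
TAGS = ["title", "h1", "b", "a", "body"]

# Toy collection (assumed): query-term hits per tag plus a relevance label.
DOCS = [
    {"tag_hits": {"title": 2, "body": 1}, "relevant": True},
    {"tag_hits": {"body": 9}, "relevant": False},
    {"tag_hits": {"h1": 1, "body": 2}, "relevant": True},
    {"tag_hits": {"body": 6, "b": 1}, "relevant": False},
]


def score(doc, weights):
    """Weighted sum of per-tag query-term hits (an assumed scoring rule)."""
    return sum(w * doc["tag_hits"].get(tag, 0) for tag, w in zip(TAGS, weights))


def average_precision(docs, weights):
    """Average precision of the ranking induced by the tag weights."""
    ranked = sorted(docs, key=lambda d: score(d, weights), reverse=True)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc["relevant"]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def evolve(docs, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Evolve a tag-weight vector in [0, 1]^len(TAGS) that maximizes average precision."""
    rng = random.Random(seed)

    def fitness(weights):
        return average_precision(docs, weights)

    # Seed the population with the uniform baseline so that elitism
    # guarantees the result is never worse than unweighted ranking.
    population = [[1.0] * len(TAGS)]
    population += [[rng.random() for _ in TAGS] for _ in range(pop_size - 1)]

    for _ in range(generations):
        next_pop = [max(population, key=fitness)]  # elitism: keep the best
        while len(next_pop) < pop_size:
            # Binary tournament selection of two parents.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            # Uniform crossover: each gene comes from either parent.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            # Gaussian mutation, clamped to [0, 1].
            child = [
                min(1.0, max(0.0, g + rng.gauss(0, 0.1)))
                if rng.random() < mutation_rate else g
                for g in child
            ]
            next_pop.append(child)
        population = next_pop
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve(DOCS)
    print(dict(zip(TAGS, (round(w, 2) for w in best))))
    print(average_precision(DOCS, best))
```

In this toy setting the relevant documents match the query in `title` and `h1` while the non-relevant ones match only in `body` and `b`, so a weight vector that favors the former ranks the relevant documents first. Because the uniform baseline is placed in the initial population and the best individual is always carried over, the evolved weights cannot score below the unweighted ranking on the training data.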

Cite this article

Kim, S., Zhang, BT. Genetic Mining of HTML Structures for Effective Web-Document Retrieval. Applied Intelligence 18, 243–256 (2003). https://doi.org/10.1023/A:1023293820057

  • DOI: https://doi.org/10.1023/A:1023293820057
