DOI: 10.1145/1458502.1458520
research-article

Granular modeling of web documents: impact on information retrieval systems

Published: 30 October 2008

ABSTRACT

One of the most important tasks in Information Retrieval (IR) is extracting and processing information from web pages. A common approach treats a web page as an atomic unit and models its textual content as a "bag-of-words". However, this representation does not reflect how people perceive a web page. A granular document representation, in terms of semantic objects, can help identify the semantic areas of a web page and use them for different IR goals. In this paper we use a granular representation to define a new metric for evaluating semantic object importance and to enhance the performance of IR systems. In particular, we show that this new metric can be used not only for classification, in which instances are assumed to be independent and identically distributed, but also to gauge the strength of the relationship between hypertextual documents and to exploit this information to improve page ranking performance.


Published in

WIDM '08: Proceedings of the 10th ACM workshop on Web information and data management
October 2008
164 pages
ISBN: 9781605582603
DOI: 10.1145/1458502

Copyright © 2008 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
