ABSTRACT
Extracting and processing information from web pages is one of the most important tasks in Information Retrieval (IR). A common approach is to treat a web page as an atomic unit and to model its textual content as a "bag-of-words". However, this representation does not reflect how people perceive a web page. A granular document representation, expressed in terms of semantic objects, can help identify the semantic areas of a web page and use them for different IR goals. In this paper we use a granular representation to define a new metric for evaluating the importance of semantic objects and to enhance the performance of IR systems. In particular, we show that this metric can be used not only for classification, in which instances are assumed to be independent and identically distributed, but also to gauge the strength of the relationship between hypertextual documents and to exploit this information to improve page ranking performance.
Index Terms
- Granular modeling of web documents: impact on information retrieval systems