ABSTRACT
In this paper, we study the problem of learning classification models that estimate the functions of web page blocks. We distinguish general models, which are learned across multiple sites, from site-specific models, which are learned within individual sites. We further consider several factors that affect the learning process and model effectiveness: the layout features, the content features, the classifiers, and the term selection methods. We empirically evaluate how model performance changes as these factors are varied. Our main result is that layout features outperform content features for learning both general and site-specific models.
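To make the layout-versus-content comparison concrete, the following is a minimal sketch, not the models evaluated in the paper. It assumes a hypothetical toy set of blocks, hypothetical function labels ("navigation", "content"), a hypothetical layout encoding (relative position and size), raw term counts as content features, and a simple nearest-centroid classifier swapped in for illustration.

```python
from collections import Counter
import math

# Hypothetical toy blocks: (layout features, block text, function label).
# Layout features are assumed to be (relative x, relative y, relative width, relative height).
BLOCKS = [
    ((0.0, 0.0, 1.0, 0.1), "home products about contact", "navigation"),
    ((0.0, 0.1, 0.2, 0.8), "home news sports contact", "navigation"),
    ((0.2, 0.1, 0.6, 0.8), "the study reports results on block classification", "content"),
    ((0.2, 0.15, 0.6, 0.7), "the article describes results and methods", "content"),
]


def layout_vector(block):
    """Return the layout features of a block as a list of floats."""
    return list(block[0])


def content_vector(text, vocabulary):
    """Return raw term counts of the text over a fixed vocabulary."""
    counts = Counter(text.split())
    return [float(counts[term]) for term in vocabulary]


def train_centroids(vectors, labels):
    """Average the feature vectors of each label (nearest-centroid training)."""
    sums = {}
    counts = Counter(labels)
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


labels = [block[2] for block in BLOCKS]
vocabulary = sorted({term for block in BLOCKS for term in block[1].split()})

# One model per feature family, trained on the same blocks.
layout_model = train_centroids([layout_vector(b) for b in BLOCKS], labels)
content_model = train_centroids([content_vector(b[1], vocabulary) for b in BLOCKS], labels)

# Classify an unseen, hypothetical block with each model.
print(predict(layout_model, [0.0, 0.0, 1.0, 0.12]))
print(predict(content_model, content_vector("home contact news", vocabulary)))
```

Under these assumptions, a general model would correspond to building `BLOCKS` from pages of several sites, and a site-specific model to building it from pages of a single site; the toy data says nothing about which feature family performs better.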
Index Terms
- A comparative study on classifying the functions of web page blocks