Abstract
Web mining has been applied to improve web-based learning. Content-based web mining usually focuses on the main content of a web page. This paper proposes a novel approach to automatically extracting main content from web pages. Compared with existing studies, the method can determine whether a web page contains main content at all, and then extracts that content without using a DOM tree or templates. The main contributions are: (1) Introducing a new concept, the Block, and proposing a method to partition a web page into blocks, so that main content and noise content are separated into different blocks. (2) Introducing the concept of Web Page Block Distribution and studying its features. Based on the Block Distribution, we can effectively determine whether a web page contains main content, and then extract the main content via outlier analysis. Experiments demonstrate the utility and feasibility of the method.
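The abstract does not define the Block or the Block Distribution in detail, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes a block is a run of text between block-level tags (found with regular expressions rather than a DOM tree), that the distribution is the list of block text lengths, and that "outlier" means a z-score above a threshold. The names `split_into_blocks`, `extract_main_content`, and `z_threshold` are hypothetical.

```python
import re
from statistics import mean, pstdev

# Assumption: these tags mark block boundaries; the paper's rule may differ.
BLOCK_BREAK = re.compile(
    r"</?(?:div|p|td|tr|table|li|ul|ol|h[1-6]|br)\b[^>]*>", re.I
)
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
ANY_TAG = re.compile(r"<[^>]+>")


def split_into_blocks(html):
    """Split raw HTML into text blocks without building a DOM tree."""
    html = SCRIPT_STYLE.sub(" ", html)  # drop script and style bodies
    blocks = []
    for piece in BLOCK_BREAK.split(html):
        # Strip remaining inline tags and collapse whitespace.
        text = " ".join(ANY_TAG.sub(" ", piece).split())
        if text:
            blocks.append(text)
    return blocks


def extract_main_content(html, z_threshold=2.0):
    """Return blocks whose length is an upper outlier, or [] if none stand out.

    Assumption: block length is the feature of interest and a z-score test
    stands in for the paper's outlier analysis.
    """
    blocks = split_into_blocks(html)
    lengths = [len(b) for b in blocks]
    if len(lengths) < 2:
        return []
    mu, sigma = mean(lengths), pstdev(lengths)
    if sigma == 0:
        # All blocks look alike: treat the page as having no main content.
        return []
    return [b for b, n in zip(blocks, lengths) if (n - mu) / sigma > z_threshold]
```

In this sketch, an empty result plays the role of "the page contains no main content", for example a navigation page whose blocks are all similarly short. A page with one long article paragraph among many short link blocks would return that paragraph.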
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Qiu, J., Tang, C., Xu, K., Luo, Q. (2008). Web Contents Extracting for Web-Based Learning. In: Li, F., Zhao, J., Shih, T.K., Lau, R., Li, Q., McLeod, D. (eds) Advances in Web Based Learning - ICWL 2008. ICWL 2008. Lecture Notes in Computer Science, vol 5145. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85033-5_7
DOI: https://doi.org/10.1007/978-3-540-85033-5_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85032-8
Online ISBN: 978-3-540-85033-5
eBook Packages: Computer Science, Computer Science (R0)