Web Contents Extracting for Web-Based Learning

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5145)

Abstract

Web mining has been applied to improve web-based learning. Content-based web mining usually focuses on the main content of a web page. This paper proposes a novel approach to automatically extract the main content from web pages. Compared with existing studies, the method can determine whether a web page contains main content and then extract it without using a DOM tree or templates. The main contributions are: (1) Introducing a new concept, the Block, and proposing a method to partition a web page into blocks, so that main content and noise content are well separated into different blocks. (2) Introducing the concept of Web Page Block Distribution and studying its features. Based on the Block Distribution, we can effectively determine whether a web page contains main content and then extract it via outlier analysis. Experiments demonstrate the utility and feasibility of the method.
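
The abstract does not define Block or Block Distribution precisely, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes that blocks are obtained by splitting the raw HTML at a fixed set of layout tags (no DOM tree is built), that the "distribution" is simply the text length of each block, and that outliers are blocks whose length exceeds the median by `k` median absolute deviations. The tag list, the function names, and the parameters `k` and `min_chars` are all hypothetical choices made for this example.

```python
import re
from statistics import median

# Assumption: these layout tags mark block boundaries (not taken from the paper).
_BOUNDARY = re.compile(r"</?(?:div|td|tr|table|p|ul|ol|li|h[1-6]|br)\b[^>]*>", re.I)
_SCRIPT = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
_TAG = re.compile(r"<[^>]+>")


def split_into_blocks(html):
    """Split raw HTML at boundary tags and return the non-empty text blocks."""
    html = _SCRIPT.sub(" ", html)  # drop script and style bodies
    blocks = []
    for chunk in _BOUNDARY.split(html):
        # Strip remaining inline tags and collapse whitespace.
        text = " ".join(_TAG.sub(" ", chunk).split())
        if text:
            blocks.append(text)
    return blocks


def extract_main_content(html, k=3.0, min_chars=80):
    """Return blocks whose text length is an upper outlier, or [] if none is."""
    blocks = split_into_blocks(html)
    if not blocks:
        return []
    lengths = [len(b) for b in blocks]
    med = median(lengths)
    # Median absolute deviation: a robust measure of spread.
    mad = median(abs(n - med) for n in lengths)
    # Assumption: a block must also exceed min_chars to count as main content.
    threshold = max(med + k * mad, min_chars)
    return [b for b, n in zip(blocks, lengths) if n > threshold]
```

Under these assumptions, an empty result is read as "this page has no main content", which mirrors the two-step decision described in the abstract: first decide whether main content exists, then extract it. A page made up only of short navigation links yields no outlier blocks, while a page with one long article paragraph among short links yields that paragraph.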

Editor information

Frederick Li, Jianmin Zhao, Timothy K. Shih, Rynson Lau, Qing Li, Dennis McLeod

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Qiu, J., Tang, C., Xu, K., Luo, Q. (2008). Web Contents Extracting for Web-Based Learning. In: Li, F., Zhao, J., Shih, T.K., Lau, R., Li, Q., McLeod, D. (eds) Advances in Web Based Learning - ICWL 2008. ICWL 2008. Lecture Notes in Computer Science, vol 5145. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85033-5_7

  • DOI: https://doi.org/10.1007/978-3-540-85033-5_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85032-8

  • Online ISBN: 978-3-540-85033-5

  • eBook Packages: Computer Science, Computer Science (R0)
