
Study on Method of Web Content Mining for Non-XML Documents

  • Conference paper
Information Computing and Applications (ICICA 2010)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 106)


Abstract

Web content mining is an important means of collecting and analyzing Internet information. Because most web pages are non-XML documents, extracting useful information efficiently from massive numbers of web pages is an interesting research topic. Based on an analysis of the features of web content mining, an XML-based web content mining method is proposed. First, it identifies authoritative web pages using the HITS algorithm; then, after data cleaning and extraction with HTML Tidy, it transforms the non-XML documents into structured XML documents; finally, it mines the XML documents using text clustering techniques. A scientific paper web site is chosen as a case study for web content extraction. Experimental results show that the proposed method works well and can extract web content efficiently and automatically.
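
The abstract names the HITS algorithm as the first step but gives no implementation details. As a rough illustration only, the following is a minimal sketch of the standard HITS hub/authority iteration, not the authors' implementation. The function name `hits`, the `iterations` parameter, and the toy link graph are assumptions made for this example.

```python
from math import sqrt


def hits(links, iterations=50):
    """Standard HITS iteration on a link graph (illustrative sketch).

    links: dict mapping each page to the list of pages it links to.
    Returns (hubs, auths), two dicts of L2-normalized scores.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it.
        new_auths = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auths[dst] += hubs[src]
        norm = sqrt(sum(v * v for v in new_auths.values())) or 1.0
        auths = {p: v / norm for p, v in new_auths.items()}

        # Hub score: sum of authority scores of the pages it links to.
        new_hubs = {p: sum(auths[d] for d in links.get(p, ())) for p in pages}
        norm = sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        hubs = {p: v / norm for p, v in new_hubs.items()}

    return hubs, auths


# Hypothetical toy graph: three index pages pointing at candidate content pages.
example_links = {
    "hub1": ["paper", "other"],
    "hub2": ["paper"],
    "hub3": ["paper"],
}
example_hubs, example_auths = hits(example_links)
top_authority = max(example_auths, key=example_auths.get)
```

In this toy graph, `top_authority` is the page with the most incoming links from hub pages. In a pipeline like the one the abstract describes, pages ranked this way would presumably be the ones passed on to cleaning and XML conversion, though the paper's actual selection criteria are not stated here.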




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chen, J., Chen, H., Guo, J. (2010). Study on Method of Web Content Mining for Non-XML Documents. In: Zhu, R., Zhang, Y., Liu, B., Liu, C. (eds) Information Computing and Applications. ICICA 2010. Communications in Computer and Information Science, vol 106. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16339-5_31


  • DOI: https://doi.org/10.1007/978-3-642-16339-5_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-16338-8

  • Online ISBN: 978-3-642-16339-5

  • eBook Packages: Computer Science, Computer Science (R0)
