Web Content Extraction Using Clustering with Web Structure

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10261)

Abstract

Web content extraction is an essential part of data preprocessing in web information systems. This paper proposes an algorithm for web content extraction based on clustering by web page structure. The process is divided into two steps. In the first step, web pages collected from different websites are clustered, using a page similarity measure based on weighted dynamic programming. The measure is computed in three stages: first, the web page is parsed into a DOM tree; second, a weight is assigned to every node according to the position of the node, the number of nodes at the same depth, and the depth of the DOM tree; third, the similarity of two pages is calculated according to the given formula. When the first step is finished, web pages with similar structure have been grouped into the same set. In the second step, pages in the same set are compared and the parts they share are removed, so that what remains is the web content. Experiments show that the proposed algorithm is effective and accurate.
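The two steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the weight or similarity formula, so everything specific here is an assumption made for illustration: the per-level weight `(D - d) / D`, the use of a longest-common-subsequence (LCS) dynamic program over the tag names at each depth, the greedy threshold clustering, the `threshold` value, and all function names (`parse`, `similarity`, `extract_content`). It should be read as one plausible instance of the idea, not as the authors' method.

```python
from collections import Counter
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag

class LevelParser(HTMLParser):
    """Records tag names per DOM depth (left to right) and text blocks."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.levels = {}  # depth -> list of tag names
        self.texts = []   # non-empty text blocks in document order

    def handle_starttag(self, tag, attrs):
        self.levels.setdefault(self.depth, []).append(tag)
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())

def parse(html):
    parser = LevelParser()
    parser.feed(html)
    parser.close()
    return parser

def lcs_len(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Weighted per-level structural similarity in [0, 1] (assumed formula)."""
    depths = set(a.levels) | set(b.levels)
    if not depths:
        return 1.0
    tree_depth = max(depths) + 1
    total = score = 0.0
    for d in depths:
        la, lb = a.levels.get(d, []), b.levels.get(d, [])
        w = (tree_depth - d) / tree_depth  # assumption: shallow levels weigh more
        total += w
        score += w * lcs_len(la, lb) / max(len(la), len(lb))
    return score / total

def extract_content(pages, threshold=0.8):
    """Step 1: cluster pages by structure. Step 2: drop text shared in a cluster."""
    parsed = {name: parse(html) for name, html in pages.items()}
    clusters = []  # each cluster is a list of page names
    for name in parsed:
        for c in clusters:
            if similarity(parsed[name], parsed[c[0]]) >= threshold:
                c.append(name)
                break
        else:
            clusters.append([name])
    result = {}
    for c in clusters:
        counts = Counter(t for n in c for t in set(parsed[n].texts))
        for n in c:
            # keep a block only if no other page in the cluster contains it
            result[n] = [t for t in parsed[n].texts if counts[t] == 1]
    return result

# Hypothetical demo pages: two share a template, one has a different layout.
TEMPLATE = ("<html><body><div><a>Home</a><a>About</a></div>"
            "<div><h1>{}</h1><p>{}</p></div><div>Copyright</div></body></html>")
demo_pages = {
    "a": TEMPLATE.format("Title A", "Body A"),
    "b": TEMPLATE.format("Title B", "Body B"),
    "c": "<html><body><table><tr><td>Other</td></tr></table></body></html>",
}
```

Calling `extract_content(demo_pages)` groups pages `a` and `b` together and strips their shared navigation and footer text, while page `c` forms its own set and is left unchanged because there is nothing to compare it against. A page that is alone in its set therefore needs more samples from the same site before any template text can be removed.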

Acknowledgements

This work is supported by the National Natural Science Foundation of China under grant 61572221.

Author information

Corresponding authors

Correspondence to Xiaotao Huang, Yan Gao or Liqun Huang.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Huang, X. et al. (2017). Web Content Extraction Using Clustering with Web Structure. In: Cong, F., Leung, A., Wei, Q. (eds) Advances in Neural Networks - ISNN 2017. ISNN 2017. Lecture Notes in Computer Science, vol 10261. Springer, Cham. https://doi.org/10.1007/978-3-319-59072-1_12

  • DOI: https://doi.org/10.1007/978-3-319-59072-1_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59071-4

  • Online ISBN: 978-3-319-59072-1

  • eBook Packages: Computer Science, Computer Science (R0)
