Web Content Extraction Using Clustering with Web Structure

Huang, Xiaotao; Gao, Yan; Huang, Liqun; Zhang, Zhizhao; Li, Yuhua; Wang, Fen; Kang, Ling

doi:10.1007/978-3-319-59072-1_12

Conference paper
First Online: 31 May 2017

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10261)

Abstract

Web content extraction is an essential part of data preprocessing in web information systems. This paper proposes an algorithm for web content extraction based on clustering by web structure. The process has two steps. In the first step, web pages collected from different websites are clustered, using a page similarity measure based on dynamic programming over node weights: first, each web page is parsed into a DOM tree; second, a weight is assigned to every node according to the node's position, the number of nodes at the same depth, and the depth of the DOM tree; third, the similarity of two pages is calculated with the given formula. When the first step finishes, web pages with similar structure have been grouped into the same set. In the second step, pages in the same set are compared and the parts they share are removed; what remains is the web content. Experiments show that the proposed algorithm is effective and accurate.
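
The two-step pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weight formula or the dynamic-programming similarity, so the sketch substitutes a simple depth-weighted overlap of tag paths (weight 1/depth, an assumption) and a greedy threshold clustering (also an assumption). The names `PathCollector`, `parse`, `similarity`, `cluster`, `extract`, and the threshold 0.8 are hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Records the tag path of every element and of every text block."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, root first
        self.paths = []    # one tag path per element, e.g. "html/body/div"
        self.blocks = []   # (tag path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(("/".join(self.stack), text))


def parse(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector


def similarity(a, b):
    """Depth-weighted overlap of tag paths (stand-in for the paper's formula)."""
    ca, cb = Counter(a.paths), Counter(b.paths)

    def weight(path):
        return 1.0 / (path.count("/") + 1)   # assumed: shallow nodes weigh more

    shared = sum(min(ca[k], cb[k]) * weight(k) for k in ca.keys() & cb.keys())
    total = sum(max(ca[k], cb[k]) * weight(k) for k in ca.keys() | cb.keys())
    return shared / total if total else 1.0


def cluster(pages, threshold=0.8):
    """Step 1: greedily group pages whose structure is similar enough."""
    groups = []
    for page in pages:
        for group in groups:
            if similarity(page, group[0]) >= threshold:
                group.append(page)
                break
        else:
            groups.append([page])
    return groups


def extract(group):
    """Step 2: drop text blocks shared by every page in the group."""
    if len(group) < 2:
        # Nothing to compare against, so nothing can be identified as shared.
        return [[text for _, text in page.blocks] for page in group]
    common = set(group[0].blocks)
    for page in group[1:]:
        common &= set(page.blocks)
    return [[text for path, text in page.blocks if (path, text) not in common]
            for page in group]


# Two pages built from the same hypothetical template, plus one unrelated page.
TEMPLATE = ("<html><body><div><a>Home</a></div>"
            "<div><p>{}</p></div><span>Copyright</span></body></html>")
OTHER = "<html><body><table><tr><td>x</td></tr></table></body></html>"

pages = [parse(TEMPLATE.format(t)) for t in ("First story", "Second story")]
pages.append(parse(OTHER))
groups = cluster(pages)
print(extract(groups[0]))
```

In this toy input the two templated pages fall into one group, and the navigation link and footer they share are removed, leaving only the story text. A real system would need the paper's actual weighting and similarity formula, as well as a more robust comparison than exact text equality.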

Acknowledgements

This work is supported by the National Natural Science Foundation of China under grant 61572221.

Author information

Authors and Affiliations

Huazhong University of Science and Technology, Wuhan City, Hubei Province, China
Xiaotao Huang, Yan Gao, Liqun Huang, Zhizhao Zhang, Yuhua Li, Fen Wang & Ling Kang

Corresponding authors

Correspondence to Xiaotao Huang, Yan Gao or Liqun Huang.

Editor information

Editors and Affiliations

Dalian University of Technology, Dalian, China
Fengyu Cong
City University of Hong Kong, Kowloon Tong, Hong Kong
Andrew Leung
Chinese Academy of Sciences, Beijing, China
Qinglai Wei

About this paper

Cite this paper

Huang, X. et al. (2017). Web Content Extraction Using Clustering with Web Structure. In: Cong, F., Leung, A., Wei, Q. (eds) Advances in Neural Networks - ISNN 2017. ISNN 2017. Lecture Notes in Computer Science, vol 10261. Springer, Cham. https://doi.org/10.1007/978-3-319-59072-1_12

DOI: https://doi.org/10.1007/978-3-319-59072-1_12
Published: 31 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59071-4
Online ISBN: 978-3-319-59072-1
eBook Packages: Computer Science, Computer Science (R0)
