Abstract
Web content extraction is an essential part of data preprocessing in web information systems. This paper proposes an algorithm for web content extraction based on clustering pages by their structure. The process is divided into two steps. In the first step, web pages collected from different websites are clustered, using a web page similarity measure based on weighted dynamic programming. First, each web page is parsed into a DOM tree; second, a weight is assigned to every node according to the position of the node, the number of nodes at the same depth, and the depth of the DOM tree; third, the similarity of two pages is calculated according to the given formula. When the first step is finished, web pages with similar structure are grouped into the same set. In the second step, pages in the same set are compared and the parts they share are removed, so that what remains is the web content. Experiments show that the proposed algorithm extracts content effectively and accurately.
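The two steps described above can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not give the weight or similarity formula, so the sketch substitutes a depth-weighted Jaccard overlap of tag paths (weight 1/depth) for the dynamic-programming measure, and a greedy threshold grouping for the clustering. The names `PageParser`, `parse_page`, `similarity`, `cluster_pages`, `extract_content` and the threshold of 0.8 are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PageParser(HTMLParser):
    """Collects the tag paths of a page and its (path, text) blocks."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, root first
        self.paths = set()  # every root-to-element tag path seen
        self.blocks = []    # (tag path, text) for each non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(("/".join(self.stack), text))


def parse_page(html):
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser


def path_weight(path):
    # Assumed weighting: shallower nodes count more (1 / depth).
    return 1.0 / (path.count("/") + 1)


def similarity(a, b):
    """Depth-weighted Jaccard overlap of tag paths (stand-in measure)."""
    shared = sum(path_weight(p) for p in a.paths & b.paths)
    total = sum(path_weight(p) for p in a.paths | b.paths)
    return shared / total if total else 0.0


def cluster_pages(pages, threshold=0.8):
    """Greedy grouping: join the first cluster whose first page is similar."""
    clusters = []
    for page in pages:
        for group in clusters:
            if similarity(page, group[0]) >= threshold:
                group.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def extract_content(group):
    """Per page, keep only the text blocks that no other page in the group has."""
    counts = Counter()
    for page in group:
        counts.update(set(page.blocks))
    return [[text for path, text in page.blocks if counts[(path, text)] == 1]
            for page in group]
```

For two pages built from the same template, the shared navigation and footer text occur in both pages and are dropped, leaving only the text that differs. A cluster containing a single page has nothing to compare against, so all of its text is kept; a real system would need enough pages per site template for the comparison to be meaningful.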
Acknowledgements
This work is supported by the National Natural Science Foundation of China under grant 61572221.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Huang, X. et al. (2017). Web Content Extraction Using Clustering with Web Structure. In: Cong, F., Leung, A., Wei, Q. (eds) Advances in Neural Networks - ISNN 2017. ISNN 2017. Lecture Notes in Computer Science, vol 10261. Springer, Cham. https://doi.org/10.1007/978-3-319-59072-1_12
DOI: https://doi.org/10.1007/978-3-319-59072-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59071-4
Online ISBN: 978-3-319-59072-1
eBook Packages: Computer Science, Computer Science (R0)