Abstract
This paper presents and implements a new method for extracting news information, based on webpage segmentation and reverse parsing of the DOM tree, with the aim of extracting news information effectively for data mining. The method first segments a webpage to obtain its main DOM structure, then parses that structure in reverse, and finally extracts the news content, headlines, news agents and publication time. Experimental results show that the proposed method achieves good accuracy and meets the project's requirements.
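The abstract does not spell out the algorithm, so the following is only a minimal sketch of one possible reading of "segment the page, then parse the DOM tree reversely". It assumes that the text-richest leaf node is a reasonable seed and that climbing from it toward the root until a node holds most of the page text approximates the main block. The names `Node`, `TreeBuilder`, `extract_main_block`, `find_first`, the sample page and the `ratio` threshold are all hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag
SKIP_TAGS = {"script", "style"}                            # text here is not page content


class Node:
    """A very small DOM node: tag, attributes, parent link, children, own text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this node (not in children)

    def text_length(self):
        return len(self.text.strip()) + sum(c.text_length() for c in self.children)

    def full_text(self):
        parts = [self.text.strip()] + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest open ancestor with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.text += data


def leaves(node):
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def extract_main_block(html, ratio=0.6):
    """Assumed heuristic: start at the text-richest leaf and walk upward
    (leaf to root) until the node holds at least `ratio` of the page text."""
    builder = TreeBuilder()
    builder.feed(html)
    root = builder.root
    total = root.text_length()
    node = max(leaves(root), key=lambda n: len(n.text.strip()))
    while node.parent is not None and node.parent is not root \
            and node.text_length() < ratio * total:
        node = node.parent
    return node


def find_first(node, tag):
    """Depth-first search for the first descendant with the given tag."""
    for child in node.children:
        if child.tag == tag:
            return child
        found = find_first(child, tag)
        if found is not None:
            return found
    return None


SAMPLE_HTML = """<html><body>
<div id="nav"><a>Home</a><a>World</a></div>
<div id="article"><h1>Headline here</h1>
<p>First paragraph of the story with a fair amount of text.</p>
<p>Second paragraph with more text in it.</p></div>
<div id="footer">Copyright</div>
</body></html>"""

block = extract_main_block(SAMPLE_HTML)
headline = find_first(block, "h1")
```

On this sample page the upward walk stops at the article container, so navigation and footer text are left out. A real news page would need more robust segmentation and separate rules for the news agent and publication time, which this sketch does not attempt.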
Acknowledgement
This work was supported by the Major Research Plan of the National Natural Science Foundation of China [91124002] and the Fundamental Research Funds for the Central Universities [2013RC0301].
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, J., Lu, Y., Zhang, X. (2015). Extracting News Information Based on Webpage Segmentation and Parsing DOM Tree Reversely. In: Yueming, L., Xu, W., Xi, Z. (eds) Trustworthy Computing and Services. ISCTCS 2014. Communications in Computer and Information Science, vol 520. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47401-3_7
DOI: https://doi.org/10.1007/978-3-662-47401-3_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-47400-6
Online ISBN: 978-3-662-47401-3
eBook Packages: Computer Science, Computer Science (R0)