Abstract
Web news content extraction is vital for improving news indexing and searching in today's search engines, especially for news search services. In this paper we study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid that takes advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of the tag sequence representation that is suitable for both sequence matching and tree matching, along with an associated algorithm for automated Web news content extraction. We implemented a prototype system for Web news content extraction and conducted an empirical evaluation; the results show that our method is highly effective and efficient.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, Y., Meng, X., Li, Q., Wang, L. (2006). Hybrid Method for Automated News Content Extraction from the Web. In: Aberer, K., Peng, Z., Rundensteiner, E.A., Zhang, Y., Li, X. (eds) Web Information Systems – WISE 2006. WISE 2006. Lecture Notes in Computer Science, vol 4255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11912873_34
DOI: https://doi.org/10.1007/11912873_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-48105-8
Online ISBN: 978-3-540-48107-2
eBook Packages: Computer Science, Computer Science (R0)