Abstract
This paper studies the problem of automatically extracting multiple attributes from news pages. Most previous work on web news article extraction focuses only on the article content. To meet the growing demand from web data integration applications, further attributes, such as title, publication date, and author, need to be extracted from news pages and stored in a structured form for further processing. An automatic, unified approach is proposed that extracts these attributes from their visual features, both independent and dependent. Unlike conventional methods, which extract attributes separately or generate template-dependent wrappers, the approach works in two steps. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified among the candidates based on dependent visual features, such as the layout relationships among the attributes. Extensive experiments on a large number of news pages show that the proposed approach is highly effective and efficient.
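The two-step idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Block` type, the three candidate filters, their thresholds, and the rules in `layout_score` are all hypothetical and were chosen only to show how per-attribute candidates (independent features) might be narrowed down by scoring layout relationships among them (dependent features). Coordinates are assumed to have their origin at the top-left corner of the page, with (x, y) the top-left corner of a text block.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Block:
    """A rendered text block; (x, y) is its top-left corner in page coordinates."""
    text: str
    x: int
    y: int
    font_size: int


# Hypothetical pattern for an ISO-style date; real pages use many more formats.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


# Step 1: candidates from independent visual features (illustrative thresholds).
def title_candidates(blocks):
    return [b for b in blocks if b.font_size >= 20 and len(b.text) < 150]


def date_candidates(blocks):
    return [b for b in blocks if DATE_RE.search(b.text)]


def content_candidates(blocks):
    return [b for b in blocks if len(b.text) >= 200]


# Step 2: score a (title, date, content) combination by layout relationships.
def layout_score(title, date, content):
    score = 0
    if title.y < date.y:                # title is above the date
        score += 1
    if date.y < content.y:              # date is above the content
        score += 1
    if abs(title.x - content.x) <= 20:  # title and content are left-aligned
        score += 1
    return score


def extract(blocks):
    """Return the best-scoring attribute combination, or None if none exists."""
    combos = product(
        title_candidates(blocks),
        date_candidates(blocks),
        content_candidates(blocks),
    )
    best = max(combos, key=lambda c: layout_score(*c), default=None)
    if best is None:
        return None
    title, date, content = best
    return {
        "title": title.text,
        "date": DATE_RE.search(date.text).group(),
        "content": content.text,
    }
```

Scoring whole combinations rather than each attribute in isolation is what lets a sidebar headline or a footer date be rejected: each passes its own candidate filter, but neither fits the layout of the other attributes. The exhaustive `product` is only practical for small candidate sets.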
Notes
The origin is the top-left corner of a web page, and (x, y) is the top-left corner of a text block.
Acknowledgements
This research is supported by the Advanced Research Fund of the Institute of Scientific and Technical Information of China under grant YY-201005. The authors are also grateful to the anonymous reviewers for their constructive comments, which helped improve the quality of the paper.
Cite this article
Liu, W., Yan, H. & Xiao, J. Extracting multiple news attributes based on visual features. J Intell Inf Syst 38, 465–486 (2012). https://doi.org/10.1007/s10844-011-0163-6
DOI: https://doi.org/10.1007/s10844-011-0163-6