News Information Extraction Based on Adaptive Weighting Using Unsupervised Bayesian Algorithm

Huang, Shilin; Zheng, Xiaolin; Wang, Xiaowei; Chen, Deren

doi:10.1007/978-3-642-23982-3_32


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6988)

Included in the following conference series:

International Conference on Web Information Systems and Mining


Abstract

Information extraction is an important part of web information retrieval. News pages are a hard case: because they contain no representative keywords that mark where an article begins and ends, it is difficult to identify the news title and body automatically. To address this problem, we propose an approach based on a Bayesian algorithm with an adaptive weighting factor. We divide a news page into text fragments and represent each fragment with a set of content features and layout features. The adaptive weighting factor adjusts the features to fit different pages. Experiments show that, on the task of news information extraction, our method achieves higher precision than the original algorithm without a weighting factor.
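
The abstract does not give the scoring formula, so the following is only a minimal sketch of what a per-page weighted naive Bayes score over text fragments might look like. It is not the authors' implementation. Everything named here is an assumption made for illustration: the `Fragment` class, the binary features `long_text` and `inside_div`, the entropy-based weighting rule in `adaptive_weights`, and the likelihood values, which are fixed placeholders here rather than estimated without supervision as the paper's title suggests.

```python
import math
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    features: dict  # hypothetical binary features: name -> bool


def adaptive_weights(fragments):
    """Assumed weighting rule: weight each feature by the binary entropy of
    its on-rate within this page, so a feature that is identical on every
    fragment (and therefore cannot separate them) gets weight 0.
    Weights are scaled to sum to the number of features, so that equal
    weights of 1.0 correspond to plain, unweighted naive Bayes."""
    names = list(fragments[0].features)
    raw = {}
    for name in names:
        p = sum(1 for f in fragments if f.features[name]) / len(fragments)
        if p == 0.0 or p == 1.0:
            raw[name] = 0.0
        else:
            raw[name] = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    total = sum(raw.values())
    if total == 0.0:
        # No feature varies on this page: fall back to unweighted scoring.
        return {name: 1.0 for name in names}
    scale = len(names) / total
    return {name: raw[name] * scale for name in names}


def weighted_log_score(fragment, likelihoods, prior, weights):
    """log P(c) + sum_i w_i * log P(f_i | c) for one class c.
    likelihoods[name] is P(feature is True | c) and must lie strictly
    between 0 and 1 so that both logarithms are defined."""
    score = math.log(prior)
    for name, value in fragment.features.items():
        p_true = likelihoods[name]
        p = p_true if value else 1.0 - p_true
        score += weights[name] * math.log(p)
    return score


# Toy page: "inside_div" is true everywhere, so it carries no signal here.
page = [
    Fragment("Site navigation", {"long_text": False, "inside_div": True}),
    Fragment("Copyright notice", {"long_text": False, "inside_div": True}),
    Fragment("First paragraph of the story", {"long_text": True, "inside_div": True}),
    Fragment("Second paragraph of the story", {"long_text": True, "inside_div": True}),
]
body_likelihoods = {"long_text": 0.9, "inside_div": 0.5}  # placeholder values
weights = adaptive_weights(page)
scores = [weighted_log_score(f, body_likelihoods, 0.5, weights) for f in page]
for fragment, score in zip(page, scores):
    print(f"{score:8.3f}  {fragment.text}")
```

In this toy page the uninformative layout feature is switched off and the content feature is given more influence, which is one plausible reading of "making features fit in different pages". The rule actually used in the paper may differ.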





Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Shilin Huang, Xiaolin Zheng, Xiaowei Wang & Deren Chen


Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
Zhiguo Gong
School of Computer, Shanghai University, 200444, Shanghai, China
Xiangfeng Luo
College of Computer and Software, Taiyuan University of Technology, 030024, Taiyuan, China
Junjie Chen
School of Computer and Information Engineering, Shanghai University of Electric Power, 200090, Shanghai, China
Jingsheng Lei
Department of Business Administration, Caritas Institute of Higher Education, 18 Chui Ling Road, Tseung Kwan O, Hong Kong, China
Fu Lee Wang


Cite this paper

Huang, S., Zheng, X., Wang, X., Chen, D. (2011). News Information Extraction Based on Adaptive Weighting Using Unsupervised Bayesian Algorithm. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_32


DOI: https://doi.org/10.1007/978-3-642-23982-3_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23981-6
Online ISBN: 978-3-642-23982-3
eBook Packages: Computer Science, Computer Science (R0)
