Abstract
Information extraction is an important part of web information retrieval. News pages lack representative keywords that mark where an article begins and ends, so it is difficult to identify the news title and body automatically. We address this problem with a Bayesian algorithm that uses an adaptive weighting factor. We divide a news page into text fragments and represent each fragment with a set of content features and layout features. The adaptive weighting factor adjusts the features to fit different pages. Experiments show that our method achieves higher precision on news information extraction than the original algorithm without a weighting factor.
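The abstract does not give the weighting formula, so the following is only a minimal sketch of one common way to combine feature weights with a Bayesian classifier: a weighted naive Bayes score, log P(c) + Σ w_i · log P(f_i | c). The feature names (`long_text`, `centered`), the labels, the probabilities, and the weights are all hypothetical values chosen for illustration. They are not taken from the paper.

```python
import math

def weighted_log_score(features, prior, likelihoods, weights):
    """Weighted naive Bayes log-score: log P(c) + sum_i w_i * log P(f_i | c).

    A weight of 1.0 gives plain naive Bayes; a weight of 0.0 ignores the feature.
    """
    score = math.log(prior)
    for name, value in features.items():
        weight = weights.get(name, 1.0)  # features without a weight count fully
        score += weight * math.log(likelihoods[name][value])
    return score

def classify(features, model, weights):
    """Return the label with the highest weighted log-score."""
    scores = {
        label: weighted_log_score(features, params["prior"],
                                  params["likelihoods"], weights)
        for label, params in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical model: one content feature (long_text), one layout feature (centered).
model = {
    "body": {
        "prior": 0.3,
        "likelihoods": {
            "long_text": {True: 0.8, False: 0.2},
            "centered": {True: 0.6, False: 0.4},
        },
    },
    "other": {
        "prior": 0.7,
        "likelihoods": {
            "long_text": {True: 0.2, False: 0.8},
            "centered": {True: 0.5, False: 0.5},
        },
    },
}

# A text fragment described by its (assumed) feature values.
fragment = {"long_text": True, "centered": False}

label_plain = classify(fragment, model, {})                     # all weights 1.0
label_weighted = classify(fragment, model, {"long_text": 0.0})  # content feature ignored
```

With these toy numbers, the label assigned to the fragment changes when the content feature is down-weighted. That is the kind of per-page adjustment an adaptive weighting factor is meant to provide, although how the paper derives its weights is not stated here.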
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huang, S., Zheng, X., Wang, X., Chen, D. (2011). News Information Extraction Based on Adaptive Weighting Using Unsupervised Bayesian Algorithm. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_32
DOI: https://doi.org/10.1007/978-3-642-23982-3_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23981-6
Online ISBN: 978-3-642-23982-3
eBook Packages: Computer Science, Computer Science (R0)