News Information Extraction Based on Adaptive Weighting Using Unsupervised Bayesian Algorithm

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6988)

Abstract

Information extraction is an important part of web information retrieval. News pages are a difficult case: news content has no representative keywords that mark where it begins and ends, so it is hard to identify the news title and body automatically. To solve this problem, our approach applies an adaptive weighting factor within a Bayesian algorithm. We divide a news page into text fragments and represent each fragment with a set of content features and layout features. We use the adaptive weighting factor so that the features fit different pages. Experiments show that, on the task of news information extraction, our method achieves higher precision than the original algorithm without a weighting factor.
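
The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It scores each text fragment with a naive Bayes log-odds and scales each feature's contribution by a per-page weight. Everything named here is an assumption made for illustration: the three binary features, the likelihood values, the prior, and the variance-based rule in `page_weights`.

```python
import math

# Hypothetical likelihoods P(feature = 1 | class). Illustrative values only,
# not taken from the paper.
P_GIVEN_BODY = {"long_text": 0.9, "has_punctuation": 0.85, "many_links": 0.1}
P_GIVEN_OTHER = {"long_text": 0.2, "has_punctuation": 0.4, "many_links": 0.7}
PRIOR_BODY = 0.3  # assumed prior probability that a fragment is body text


def page_weights(fragments):
    """Assumed adaptation rule: weight each feature by its spread on this page.

    A feature with the same value in every fragment of a page cannot separate
    body text from the rest on that page, so its weight becomes 0.
    """
    n = len(fragments)
    weights = {}
    for name in P_GIVEN_BODY:
        mean = sum(f[name] for f in fragments) / n
        weights[name] = 4.0 * mean * (1.0 - mean)  # in [0, 1], largest at mean 0.5
    return weights


def body_log_odds(fragment, weights):
    """Weighted naive Bayes log-odds that a fragment belongs to the news body."""
    score = math.log(PRIOR_BODY / (1.0 - PRIOR_BODY))
    for name, value in fragment.items():
        p_body = P_GIVEN_BODY[name] if value else 1.0 - P_GIVEN_BODY[name]
        p_other = P_GIVEN_OTHER[name] if value else 1.0 - P_GIVEN_OTHER[name]
        score += weights[name] * math.log(p_body / p_other)
    return score


def extract_body(fragments):
    """Return the indices of fragments whose weighted log-odds favor 'body'."""
    weights = page_weights(fragments)
    return [i for i, f in enumerate(fragments) if body_log_odds(f, weights) > 0.0]


# Toy page: a navigation bar, two body paragraphs, and a footer.
page = [
    {"long_text": 0, "has_punctuation": 0, "many_links": 1},
    {"long_text": 1, "has_punctuation": 1, "many_links": 0},
    {"long_text": 1, "has_punctuation": 1, "many_links": 0},
    {"long_text": 0, "has_punctuation": 1, "many_links": 1},
]
body_indices = extract_body(page)
```

Setting every weight to 1 reduces `body_log_odds` to plain naive Bayes, which is one way to picture the unweighted baseline the abstract compares against.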




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Huang, S., Zheng, X., Wang, X., Chen, D. (2011). News Information Extraction Based on Adaptive Weighting Using Unsupervised Bayesian Algorithm. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_32

  • DOI: https://doi.org/10.1007/978-3-642-23982-3_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23981-6

  • Online ISBN: 978-3-642-23982-3

  • eBook Packages: Computer Science, Computer Science (R0)
