A Cross-Lingual Framework for Web News Taxonomy Integration

  • Conference paper
Information Retrieval Technology (AIRS 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4182)

Abstract

Many news sites now publish articles online, and many Web news portals have emerged that cluster news into categories so that users can browse related reports and understand news events in depth. However, to the best of our knowledge, most Web news portals provide only monolingual news clustering services. In this paper, we study the cross-lingual Web news taxonomy integration problem, in which news articles about the same news event reported in different languages are to be integrated into one category. Our study builds on cross-lingual classification research results and the cross-training concept to construct SVM-based classifiers for cross-lingual Web news taxonomy integration. We have conducted several experiments using news articles from Google News as the experimental data sets. The experimental results show that the proposed cross-training classifiers outperform the traditional SVM classifiers across the board. We believe that the proposed framework can be applied to different bilingual environments.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yang, CZ., Chen, CM., Chen, IX. (2006). A Cross-Lingual Framework for Web News Taxonomy Integration. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_21

  • DOI: https://doi.org/10.1007/11880592_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45780-0

  • Online ISBN: 978-3-540-46237-8

  • eBook Packages: Computer Science (R0)
