Improving Vietnamese Web Page Classification by Combining Hybrid Feature Selection and Label Propagation with Link Information

  • Conference paper
Context-Aware Systems and Applications (ICCASA 2012)

Abstract

Classification of web pages is essential to many information management and retrieval tasks, such as maintaining web directories and focused crawling. One problem in web page classification is that labeled training examples are often costly to obtain, while unlabeled ones are readily available. Furthermore, the uncontrolled nature of web content presents additional challenges, whereas the interconnected nature of hypertext can provide useful information for the process. To address these problems, we propose a graph-based semi-supervised classification framework that iteratively combines hybrid semi-supervised feature selection with Label Propagation learning, using link information, to improve Vietnamese web page classification. Experimental results show that the proposed method outperforms state-of-the-art methods applied to Vietnamese web page classification.
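To make the graph-based idea concrete, the following is a minimal sketch of label propagation over a page graph whose edge weights mix content similarity with hyperlink adjacency. It is not the authors' implementation: the function names `build_affinity` and `label_propagation`, the mixing weight `alpha`, the use of cosine similarity, and the toy data are all assumptions made for illustration, and the hybrid feature selection step described in the abstract is omitted entirely.

```python
import numpy as np

def build_affinity(X, links, alpha=0.5):
    """Mix content similarity and link adjacency into one affinity matrix.

    X      : (n, d) non-negative term-weight matrix, one row per page.
    links  : iterable of (i, j) page-index pairs, treated as undirected.
    alpha  : hypothetical mixing weight (assumption, not from the paper).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Cosine similarity between pages; guard against all-zero rows.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    S = Xn @ Xn.T
    np.fill_diagonal(S, 0.0)  # no self-loops
    # Symmetric 0/1 adjacency built from the hyperlinks.
    A = np.zeros((n, n))
    for i, j in links:
        A[i, j] = A[j, i] = 1.0
    return alpha * S + (1.0 - alpha) * A

def label_propagation(W, labels, n_classes, n_iter=200, tol=1e-6):
    """Propagate labels over W; labels uses -1 for unlabeled pages."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    n = W.shape[0]
    labeled = labels >= 0
    # Row-normalise affinities into transition probabilities.
    row_sums = W.sum(axis=1, keepdims=True)
    T = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    # One-hot distributions for labeled pages, uniform for the rest.
    Y0 = np.zeros((n, n_classes))
    Y0[labeled, labels[labeled]] = 1.0
    Y = np.full((n, n_classes), 1.0 / n_classes)
    Y[labeled] = Y0[labeled]
    for _ in range(n_iter):
        Y_new = T @ Y
        Y_new[labeled] = Y0[labeled]  # clamp the known labels
        converged = np.abs(Y_new - Y).max() < tol
        Y = Y_new
        if converged:
            break
    return Y.argmax(axis=1)

# Toy data: pages 0-2 share one vocabulary, pages 3-5 another.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.8, 1.0],
])
links = [(0, 1), (1, 2), (3, 4), (4, 5)]
labels = np.array([0, -1, -1, 1, -1, -1])  # only pages 0 and 3 are labeled

W = build_affinity(X, links, alpha=0.5)
pred = label_propagation(W, labels, n_classes=2)
print(pred.tolist())
```

With only one labeled page per class, the unlabeled pages take the label of the cluster they are most strongly connected to through content and links. In a fuller pipeline, the feature matrix `X` would be recomputed after each round of feature selection, which this sketch does not attempt.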

Copyright information

© 2013 ICST Institute for Computer Science, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Van Linh, N., Thi Kim Anh, N., Dat, C.M. (2013). Improving Vietnamese Web Page Classification by Combining Hybrid Feature Selection and Label Propagation with Link Information. In: Vinh, P.C., Hung, N.M., Tung, N.T., Suzuki, J. (eds) Context-Aware Systems and Applications. ICCASA 2012. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36642-0_32

  • DOI: https://doi.org/10.1007/978-3-642-36642-0_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36641-3

  • Online ISBN: 978-3-642-36642-0

  • eBook Packages: Computer Science, Computer Science (R0)
