Innovating Web page classification through reducing noise

  • Regular Papers
  • Journal of Computer Science and Technology

Abstract

This paper presents a new method for reducing noise in Web page classification. It first describes a representation of a Web page based on HTML tags. A novel distance formula then removes noise from the similarity measure. Based on a careful analysis of Web pages, we design an algorithm that distinguishes related hyperlinks from noisy ones, and we use the non-noisy hyperlinks to improve classification performance (the CAWN algorithm). Any page can then be classified using the text and categories of the neighbor pages related to it. Experimental results show that the approach improves classification accuracy.
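
The general idea can be illustrated with a minimal Python sketch. It is not the paper's CAWN algorithm or its distance formula, neither of which is reproduced on this page. The tag weights, the cosine similarity, the `sim_threshold` cutoff for treating a hyperlink as noisy, and the `alpha` mixing factor are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights (not taken from the paper): terms inside these
# tags count more than plain body text.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}


class TagWeightedText(HTMLParser):
    """Builds a term-weight vector, boosting terms found inside selected tags."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags
        self.weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        boost = max((TAG_WEIGHTS.get(t, 1.0) for t in self.stack), default=1.0)
        for term in data.lower().split():
            self.weights[term] += boost


def vectorize(html):
    """Returns a dict mapping each term to its tag-weighted frequency."""
    parser = TagWeightedText()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


def cosine(a, b):
    """Cosine similarity of two sparse vectors (a stand-in similarity measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(page_vec, neighbors, centroids, sim_threshold=0.2, alpha=0.5):
    """Combines the page's own text score with votes from related neighbors.

    neighbors: list of (vector, category) pairs for hyperlinked pages.
    A neighbor whose similarity to the page falls below sim_threshold is
    treated as a noisy hyperlink and ignored (an assumed heuristic).
    """
    votes = Counter()
    for vec, category in neighbors:
        sim = cosine(page_vec, vec)
        if sim >= sim_threshold:
            votes[category] += sim
    total = sum(votes.values())

    combined = {}
    for category, centroid in centroids.items():
        text_score = cosine(page_vec, centroid)
        link_score = votes[category] / total if total else 0.0
        combined[category] = alpha * text_score + (1 - alpha) * link_score
    return max(combined, key=combined.get)


page = vectorize("<title>python tutorial</title><p>learn python code</p>")
centroids = {
    "programming": {"python": 1.0, "code": 1.0},
    "sports": {"football": 1.0, "match": 1.0},
}
neighbors = [
    (vectorize("<p>python code examples</p>"), "programming"),
    (vectorize("<p>football match tickets</p>"), "sports"),  # unrelated link
]
label = classify(page, neighbors, centroids)
```

In this toy example the second neighbor shares no terms with the page, so its link falls below the threshold and contributes no vote. Only the related neighbor influences the final label.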

Author information

Corresponding author

Correspondence to Xiaoli Li.

Additional information

This work is supported by the National Natural Science Foundation of China (No. 60075019 and No. 9010402) and the National Science Foundation of Beijing (No. 4011003).

LI Xiaoli received his Ph.D. degree from the Institute of Computing Technology, The Chinese Academy of Sciences in 2001. He taught artificial intelligence in the Graduate School of the University of Science and Technology of China in 1999. His research interests include Web mining, information retrieval and natural language processing. He has published more than 20 papers in international conferences and journals. Since 2000, he has been working as a research staff member at the National University of Singapore.

SHI Zhongzhi received his B.E. and M.E. degrees from the University of Science and Technology of China in 1964 and 1968, respectively. He is currently the Executive Director of the Department of Intelligent Computer Science, Institute of Computing Technology. His research interests include artificial intelligence, neural computing, cognitive science, advanced database technology, and new-generation computers. He has published 10 books and more than 300 technical papers. He is a member of the Standing Steering Committee of PRICAI, Vice President of Chinese Artificial Intelligence Society, and Secretary-General of China Computer Federation. He is also the Vice President of the Chinese Society of Machine Learning and Vice President of the Chinese Society of Knowledge Engineering.

About this article

Cite this article

Li, X., Shi, Z. Innovating Web page classification through reducing noise. J. Comput. Sci. & Technol. 17, 9–17 (2002). https://doi.org/10.1007/BF02949820

  • DOI: https://doi.org/10.1007/BF02949820
