Innovating Web page classification through reducing noise

Li, Xiaoli; Shi, Zhongzhi

doi:10.1007/BF02949820

Regular Papers
Published: January 2002

Volume 17, pages 9–17, (2002)

Journal of Computer Science and Technology

Xiaoli Li^1,2 & Zhongzhi Shi^1


Abstract

This paper presents a new method for eliminating noise in Web page classification. It first describes a representation of a Web page based on HTML tags. Then, through a novel distance formula, it eliminates noise in the similarity measure. After carefully analyzing Web pages, we design an algorithm that distinguishes related hyperlinks from noisy ones. The non-noisy hyperlinks are then used to improve the performance of Web page classification (the CAWN algorithm): any page can be classified through the text and categories of the neighbor pages related to it. The experimental results show that our approach improves classification accuracy.
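
The idea of classifying a page from its related neighbors can be illustrated with a minimal sketch. This is not the CAWN algorithm, whose details are not given in this abstract. It assumes plain bag-of-words vectors, uses cosine similarity as a stand-in for the paper's distance formula, and treats a hyperlink as noisy when similarity falls below a hypothetical `threshold`. The function names and the threshold value are illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_with_neighbors(page_text, neighbors, threshold=0.2):
    """Vote over the categories of neighbors whose text resembles the page.

    neighbors: list of (text, category) pairs for hyperlinked pages.
    threshold: assumed cutoff; links scoring below it are treated as noisy.
    Returns the winning category, or None if every link is noisy.
    """
    page_vec = Counter(page_text.lower().split())
    votes = Counter()
    for text, category in neighbors:
        sim = cosine(page_vec, Counter(text.lower().split()))
        if sim >= threshold:        # keep only related (non-noisy) links
            votes[category] += sim  # similarity-weighted vote
    return votes.most_common(1)[0][0] if votes else None


page = "football league results and match report"
neighbors = [
    ("football match score league", "sports"),
    ("buy cheap watches online", "ads"),
    ("league football team transfer", "sports"),
]
label = classify_with_neighbors(page, neighbors)
```

In this toy example the advertising link shares no terms with the page, so it is filtered out before the vote. A real system would also weight terms by HTML tag, as the abstract suggests, which this sketch omits.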

Author information

Authors and Affiliations

Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, The Chinese Academy of Sciences, 100080, Beijing, P.R. China
Xiaoli Li & Zhongzhi Shi
School of Computing, National University of Singapore, 117543, Singapore
Xiaoli Li

Corresponding author

Correspondence to Xiaoli Li.

Additional information

This work is supported by the National Natural Science Foundation of China (Nos. 60075019 and 9010402) and the National Science Foundation of Beijing (No. 4011003).

LI Xiaoli received his Ph.D. degree from the Institute of Computing Technology, The Chinese Academy of Sciences in 2001. He taught artificial intelligence in the Graduate School of the University of Science and Technology of China in 1999. His research interests include Web mining, information retrieval and natural language processing. He has published more than 20 papers in international conferences and journals. Since 2000, he has been working as a research staff member at the National University of Singapore.

SHI Zhongzhi received his B.E. and M.E. degrees from the University of Science and Technology of China in 1964 and 1968, respectively. He is currently the Executive Director of the Department of Intelligent Computer Science, Institute of Computing Technology. His research interests include artificial intelligence, neural computing, cognitive science, advanced database technology, and new-generation computers. He has published 10 books and more than 300 technical papers. He is a member of the Standing Steering Committee of PRICAI, Vice President of the Chinese Artificial Intelligence Society, and Secretary-General of the China Computer Federation. He is also the Vice President of the Chinese Society of Machine Learning and Vice President of the Chinese Society of Knowledge Engineering.

About this article

Cite this article

Li, X., Shi, Z. Innovating Web page classification through reducing noise. J. Comput. Sci. & Technol. 17, 9–17 (2002). https://doi.org/10.1007/BF02949820

Received: 14 November 2000
Revised: 22 June 2001
Issue Date: January 2002
DOI: https://doi.org/10.1007/BF02949820
