A feature selection method for document clustering based on part-of-speech and word co-occurrence

Abstract:

Feature selection is the process of choosing a subset of the original feature set according to some rules. The selected features retain their original physical meaning and make the data and the learning process easier to interpret. However, few modern feature selection approaches take advantage of the context information of features. Based on this analysis, we propose a novel feature selection method based on part-of-speech and word co-occurrence. According to the composition of Chinese document text, we use the part-of-speech attributes of words to filter out many meaningless terms. We then define co-occurrence words by their part-of-speech and use them to select features. For evaluation, we run experiments on the text corpus from Sogou Lab and use Entropy and Precision as criteria to give an objective assessment of document clustering performance. The results show that our method selects better features and achieves better clustering performance.
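The abstract does not give the filtering rules, the co-occurrence definition, or the exact evaluation formulas, so the following is only a minimal Python sketch of the general idea under stated assumptions. It assumes documents are already segmented and tagged as (word, tag) pairs, that nouns, verbs, and adjectives (tags "n", "v", "a") are the content-bearing classes, and that a word is scored by how many document-level pairs it forms with other retained words. The function names and the scoring rule are hypothetical and are not taken from the paper. Entropy and Precision are computed with their common size-weighted definitions, which may differ from the ones the authors used.

```python
import math
from collections import Counter
from itertools import combinations

# Assumption: POS tags treated as content-bearing (noun, verb, adjective).
KEEP_TAGS = {"n", "v", "a"}


def filter_by_pos(doc):
    """Keep only (word, tag) pairs whose tag is in KEEP_TAGS."""
    return [(word, tag) for word, tag in doc if tag in KEEP_TAGS]


def select_features(docs, top_k):
    """Rank words by document-level co-occurrence with other retained words.

    Hypothetical scoring rule: each distinct pair of retained words in a
    document adds one point to both words. Returns the top_k words.
    """
    scores = Counter()
    for doc in docs:
        words = sorted({word for word, _ in filter_by_pos(doc)})
        for a, b in combinations(words, 2):
            scores[a] += 1
            scores[b] += 1
    return [word for word, _ in scores.most_common(top_k)]


def cluster_entropy_precision(labels, clusters):
    """Size-weighted entropy (base 2) and precision of a clustering.

    labels:   true class of each document
    clusters: cluster id assigned to each document
    Lower entropy and higher precision indicate purer clusters.
    """
    n = len(labels)
    by_cluster = {}
    for label, cluster in zip(labels, clusters):
        by_cluster.setdefault(cluster, Counter())[label] += 1

    entropy = 0.0
    precision = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        probs = [count / size for count in counts.values()]
        entropy += (size / n) * -sum(p * math.log2(p) for p in probs)
        precision += (size / n) * max(probs)
    return entropy, precision
```

In this sketch, function words such as particles and adverbs never reach the co-occurrence step, so the candidate vocabulary shrinks before any scoring takes place. The selected words would then serve as the feature set for whatever clustering algorithm is applied afterwards.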

Published in: 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery

Date of Conference: 10-12 August 2010

Date Added to IEEE Xplore: 09 September 2010

DOI: 10.1109/FSKD.2010.5569827

Conference Location: Yantai, China