A feature selection method for document clustering based on part-of-speech and word co-occurrence | IEEE Conference Publication | IEEE Xplore


Abstract:

Feature selection is a process that chooses a subset of the original feature set according to some rules. The selected features retain their original physical meaning and provide a better understanding of the data and the learning process. However, few modern feature selection approaches take advantage of features' context information. Based on this analysis, we propose a novel feature selection method based on part-of-speech and word co-occurrence. According to the components of Chinese document text, we use the words' part-of-speech attributes to filter out many meaningless terms. Then we define co-occurrence words by their part-of-speech and use them to select features. For evaluation, we run experiments on the text corpus from Sogou Lab and use Entropy and Precision as criteria to give an objective evaluation of document clustering performance. The results show that our method selects better features and achieves better clustering performance.
Date of Conference: 10-12 August 2010
Date Added to IEEE Xplore: 09 September 2010
Conference Location: Yantai, China
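
The abstract names Entropy and Precision as the clustering criteria but does not give their formulas. The sketch below is only an illustration under commonly used definitions, which are an assumption here and may differ from the paper's: the entropy of a cluster is the entropy of the true-class distribution inside it, the precision of a cluster is the share of its dominant class, and both are averaged over clusters weighted by cluster size. The function name and its inputs are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy_and_precision(labels, clusters):
    """Return (weighted entropy, weighted precision) of a clustering.

    labels   -- true class of each document (hypothetical input)
    clusters -- cluster id assigned to each document (hypothetical input)

    Assumed definitions, not taken from the paper.
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")
    if not labels:
        raise ValueError("at least one document is required")

    # Group the true labels by assigned cluster.
    members = defaultdict(list)
    for label, cluster in zip(labels, clusters):
        members[cluster].append(label)

    total = len(labels)
    entropy = 0.0
    precision = 0.0
    for docs in members.values():
        counts = Counter(docs)
        size = len(docs)
        # Entropy (base 2) of the class distribution inside this cluster.
        e = -sum((c / size) * math.log2(c / size) for c in counts.values())
        # Precision of this cluster: share of its dominant class.
        p = max(counts.values()) / size
        # Weight each cluster by its share of all documents.
        entropy += (size / total) * e
        precision += (size / total) * p
    return entropy, precision
```

Under these assumed definitions, lower entropy and higher precision both indicate clusters that align more closely with the true classes.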
