Abstract:
Document categorization is central to many applications that demand reasoning about and organisation of text documents, web pages, and so forth. Document classification is commonly achieved by choosing appropriate features (terms) and building a term frequency-inverse document frequency (TFIDF) feature vector. In this process, feature selection is a key factor in the accuracy and effectiveness of the resulting classifications: for a given task, the right choice of features yields accurate classification at a suitable level of computational efficiency. Meanwhile, most document classification work is based on English-language documents. In this paper we make three main contributions: (i) we demonstrate successful document classification in the context of Arabic documents (although previous work has demonstrated text classification in Arabic, the datasets used and the experimental setup have not been revealed); (ii) we offer our datasets to enable other researchers to compare directly with our results; (iii) we demonstrate that a combination of binary particle swarm optimisation (Binary PSO) and k-nearest neighbour performs well in selecting good sets of features for this task.
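The feature-selection idea in contribution (iii) can be illustrated with a minimal sketch. It assumes the standard sigmoid-based binary PSO update and uses leave-one-out k-nearest-neighbour accuracy as the fitness of a feature subset. The abstract does not give the paper's parameters, fitness function, or data, so the function names (`knn_accuracy`, `binary_pso`), the parameter values, and the toy feature matrix below are all hypothetical.

```python
import math
import random


def knn_accuracy(X, y, mask, k=1):
    """Leave-one-out k-NN accuracy using only the features where mask is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        # Squared Euclidean distance to every other sample on the selected columns.
        dists = sorted(
            (sum((xi[j] - xo[j]) ** 2 for j in cols), y[o])
            for o, xo in enumerate(X) if o != i
        )
        votes = [label for _, label in dists[:k]]
        if max(set(votes), key=votes.count) == y[i]:
            correct += 1
    return correct / len(X)


def binary_pso(X, y, n_particles=8, n_iters=20, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a 0/1 feature mask that maximises knn_accuracy (assumed setup)."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    pos = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(n_particles)]
    vel = [[0.0] * n_feat for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [knn_accuracy(X, y, p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for j in range(n_feat):
                # Velocity is pulled towards the personal and global best bits.
                vel[i][j] = (w * vel[i][j]
                             + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                             + c2 * rng.random() * (gbest[j] - pos[i][j]))
                # Sigmoid of the velocity gives the probability that the bit is 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))
                pos[i][j] = 1 if rng.random() < prob else 0
            fit = knn_accuracy(X, y, pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Toy stand-in for TFIDF vectors: 4 documents, 3 terms, 2 classes (invented values).
X = [[0.9, 0.1, 0.5], [0.8, 0.7, 0.4], [0.1, 0.2, 0.6], [0.2, 0.8, 0.5]]
y = [1, 1, 0, 0]
best_mask, best_fit = binary_pso(X, y)
```

Here `best_mask` marks which terms are kept and `best_fit` is the leave-one-out accuracy of that subset. A real experiment would replace the toy matrix with TFIDF vectors of the corpus and evaluate on held-out documents.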
Date of Conference: 19-21 October 2011
Date Added to IEEE Xplore: 01 December 2011