Generating Category Hierarchy for Classifying Large Corpora

Fumiyo FUKUMOTO
Yoshimi SUZUKI

Publication
IEICE TRANSACTIONS on Information and Systems, Vol.E89-D, No.4, pp.1543-1554
Publication Date: 2006/04/01
Online ISSN: 1745-1361
DOI: 10.1093/ietisy/e89-d.4.1543
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
Keyword: category hierarchies, k-means, log loss function




Summary: 
We address the problem of handling large collections of data and investigate whether automatically constructed, domain-specific category hierarchies improve text classification. We create the category hierarchy with two well-known techniques: the partitioning clustering method k-means and a loss function. The k-means method iterates over the data that the system is permitted to classify at each iteration and constructs a hierarchical structure. In general, the number of clusters k is not given beforehand. We therefore use a loss function, which measures the degree of disappointment in any differences between the true distribution over inputs and the learner's prediction, to select an appropriate number of clusters k. Once k is selected, the procedure is repeated for each cluster. Our evaluation on the 1996 Reuters corpus, which consists of 806,791 documents, showed that automatically constructed hierarchies improve classification accuracy.
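
The procedure summarized above (cluster with k-means, pick k by a loss function, then repeat inside each cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary does not give the exact loss formulation or the classifier, so the sketch assumes that each category is represented by a single vector, that class probabilities come from a softmax over negative squared distances, and that the loss is the mean negative log-probability of the true category on held-out documents. The names `kmeans`, `log_loss`, `build_hierarchy`, and `max_k` are hypothetical.

```python
import math
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means. Returns the non-empty clusters as lists of indices."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            nearest = min(range(k), key=lambda c: sqdist(v, centroids[c]))
            clusters[nearest].append(i)
        # An empty cluster keeps its previous centroid.
        centroids = [mean([vectors[i] for i in m]) if m else centroids[c]
                     for c, m in enumerate(clusters)]
    return [m for m in clusters if m]

def softmax_neg_dist(doc, centers):
    """Assumed scoring rule: softmax over negative squared distances."""
    scores = [-sqdist(doc, c) for c in centers]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    return [e / sum(exps) for e in exps]

def log_loss(clusters, names, cat_vecs, docs):
    """Mean negative log-probability of the true category when a document
    is first assigned to a cluster and then to a category inside it."""
    centers = [mean([cat_vecs[names[i]] for i in m]) for m in clusters]
    total = 0.0
    for doc, true_cat in docs:
        p_true = 0.0
        for members, p_cluster in zip(clusters, softmax_neg_dist(doc, centers)):
            cats = [names[i] for i in members]
            if true_cat in cats:
                p_cat = softmax_neg_dist(doc, [cat_vecs[c] for c in cats])
                p_true = p_cluster * p_cat[cats.index(true_cat)]
        total -= math.log(max(p_true, 1e-12))
    return total / len(docs)

def build_hierarchy(names, cat_vecs, held_out, max_k=4):
    """Pick the k with the lowest held-out loss, then recurse per cluster."""
    docs = [(d, c) for d, c in held_out if c in names]
    if len(names) <= 2 or not docs:
        return sorted(names)
    vectors = [cat_vecs[n] for n in names]
    best = None
    for k in range(2, min(max_k, len(names) - 1) + 1):
        clusters = kmeans(vectors, k)
        if len(clusters) < 2:
            continue
        loss = log_loss(clusters, names, cat_vecs, docs)
        if best is None or loss < best[0]:
            best = (loss, clusters)
    if best is None:
        return sorted(names)
    return [build_hierarchy([names[i] for i in m], cat_vecs, held_out, max_k)
            for m in best[1]]
```

Calling `build_hierarchy(list_of_category_names, category_vectors, held_out_pairs)` returns nested lists whose leaves are category names. The recursion stops at groups of two or fewer categories, which is a simplification chosen for this sketch rather than a stopping rule stated in the summary.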

