Abstract:
Research in science and engineering has resulted in the generation of voluminous datasets. For instance, biological databases such as PubMed now have millions of articles. Given this growth in data, retrieving information relevant to a specific topic has become a major challenge. In this paper, we focus on the problem of retrieving articles pertaining to a given topic from a huge collection of articles. In particular, we investigate the problem of classifying articles. Though numerous techniques and tools are available for document classification, a common shortcoming is that they take too much time. In this paper, we present generic computational techniques that can classify articles efficiently. Our algorithms are based on algorithms that have been proposed for a related problem called gene selection. Gene selection is the problem of identifying a minimum set of genes that are responsible for certain events (for example, the presence of cancer). Even though gene selection was originally proposed for biological data analysis, the technique itself is generic. For example, 'genes' can be thought of as generic variables. A typical tool that we envision will take as input a set of keywords (that characterize the information of interest) and will develop a learner that identifies a small subset of the keywords capable of classifying papers into two types. A paper is of the first type if it has information of interest, and of the second type if it does not. Experiments show that the new algorithm obtains a higher classification accuracy using a smaller number of selected keywords when compared to one of the best algorithms reported in the literature.
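To make the keyword-selection idea concrete, the following is a minimal Python sketch, not the algorithm evaluated in the paper. It assumes a greedy forward selection wrapped around a nearest-centroid classifier on binary keyword-presence features, and it scores candidates by accuracy on the training set itself. All names (`select_keywords`, `DOCS`, `LABELS`, `CANDIDATES`) and the toy documents are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def centroids(docs: Sequence[Set[str]], labels: Sequence[int],
              keywords: Sequence[str]) -> Dict[int, List[float]]:
    """Per-class fraction of documents containing each keyword."""
    result = {}
    for cls in (0, 1):
        members = [d for d, y in zip(docs, labels) if y == cls]
        result[cls] = [sum(k in d for d in members) / len(members)
                       for k in keywords]
    return result


def predict(doc: Set[str], keywords: Sequence[str],
            cents: Dict[int, List[float]]) -> int:
    """Assign the class whose centroid is closer (squared Euclidean distance)."""
    x = [1.0 if k in doc else 0.0 for k in keywords]
    dist = {c: sum((a - b) ** 2 for a, b in zip(x, cents[c])) for c in (0, 1)}
    return 1 if dist[1] < dist[0] else 0


def accuracy(docs: Sequence[Set[str]], labels: Sequence[int],
             keywords: Sequence[str]) -> float:
    """Training-set accuracy; with no keywords, fall back to the majority class."""
    labels = list(labels)
    if not keywords:
        return max(labels.count(0), labels.count(1)) / len(labels)
    cents = centroids(docs, labels, keywords)
    hits = sum(predict(d, keywords, cents) == y for d, y in zip(docs, labels))
    return hits / len(labels)


def select_keywords(docs: Sequence[Set[str]], labels: Sequence[int],
                    candidates: Sequence[str]) -> Tuple[List[str], float]:
    """Greedily add the keyword that most improves accuracy; stop when none does."""
    selected: List[str] = []
    best = accuracy(docs, labels, selected)
    while True:
        scored = [(accuracy(docs, labels, selected + [k]), k)
                  for k in candidates if k not in selected]
        if not scored:
            break
        top_acc, top_kw = max(scored, key=lambda s: s[0])
        if top_acc <= best:
            break
        selected.append(top_kw)
        best = top_acc
    return selected, best


# Hypothetical toy data: label 1 = has information of interest, 0 = does not.
DOCS = [
    {"tumor", "gene", "expression"},
    {"tumor", "therapy", "patient"},
    {"gene", "tumor", "mutation"},
    {"protein", "folding", "gene"},
    {"patient", "survey", "diet"},
    {"climate", "model", "patient"},
]
LABELS = [1, 1, 1, 0, 0, 0]
CANDIDATES = ["gene", "patient", "tumor", "diet"]

selected, acc = select_keywords(DOCS, LABELS, CANDIDATES)
print(selected, acc)
```

The sketch assumes both classes are non-empty. In practice, accuracy would be estimated on held-out data rather than on the training set, since selecting keywords against training accuracy tends to overfit.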
Date of Conference: 01-04 July 2012
Date Added to IEEE Xplore: 26 July 2012