Abstract:
Next-generation sequencing (NGS) technology has increasingly become the backbone of transcriptomics analysis, but sequencer errors bias the read counts. In this paper we establish a framework for predicting true sequences from NGS data. We formulate this task as a classification problem and define several features, such as the log-likelihood ratio of estimated true counts, the error probability, and the observed count of the reads. Using a Support Vector Machine (SVM) classifier, we show that on simulated reads these features achieve 96.35% classification accuracy in discriminating true sequences. Using this framework, we provide a way for users to select sequences with a desired precision and recall for their analysis. The feature generation software and the simulated data set are available online (http://seq.cbrc.jp/NGSFeatGen).
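As a rough illustration of selecting sequences at a desired precision, the sketch below picks a cutoff on classifier scores. It is not the authors' software: the function name `threshold_for_precision` and the toy scores and labels are hypothetical, and it assumes that each candidate sequence already has a real-valued score (for example an SVM decision value, higher meaning more likely a true sequence) and a known label from simulated data.

```python
def threshold_for_precision(scores, labels, target_precision):
    """Find the lowest score cutoff whose selected set meets a target precision.

    scores: classifier scores, higher = more likely a true sequence (assumed)
    labels: 1 for a true sequence, 0 for an error-derived one (assumed known,
            e.g. from simulated reads)
    Returns (threshold, precision, recall), or None if no cutoff qualifies.
    """
    total_true = sum(labels)
    # Rank candidates from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best = None
    true_positives = 0
    for i, (score, label) in enumerate(ranked):
        true_positives += label
        # Only evaluate a cutoff once all candidates tied at this score are in.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        selected = i + 1
        precision = true_positives / selected
        recall = true_positives / total_true if total_true else 0.0
        # Later (lower) cutoffs select more sequences, so keep the last one
        # that still satisfies the precision target.
        if precision >= target_precision:
            best = (score, precision, recall)
    return best


# Toy example with made-up scores and labels.
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
demo_labels = [1, 1, 0, 1, 0, 0]
print(threshold_for_precision(demo_scores, demo_labels, 0.75))
```

Keeping every sequence whose score is at or above the returned threshold gives the largest selection that still meets the requested precision on the labelled data. Raising the target trades recall for precision.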
Published in: 2010 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW)
Date of Conference: 18 December 2010
Date Added to IEEE Xplore: 28 January 2011