Abstract
Principal among knowledge discovery tasks is the recognition of insightful patterns or features in data that can inform otherwise challenging decisions. For costly decisions, there is little room for error: features must provide substantial evidence to be robust for classification and dependable for important decisions. Here we seek statistical evidence for feature selection, namely that feature signals are of sufficient magnitude and frequency to generalize for classification. The Bayesian false discovery rate (bFDR) error control procedure is well suited to this task. In realistic situations often encountered in practice, however, the bFDR procedure is biased, yielding a greater-than-desired FDR; in other, less typical cases, the FDR is less than desired. We investigate the sources of bias in the bFDR procedure and predict the direction of bias. A new algorithm is developed to correct the bias in the bFDR control procedure. In simulation and real data mining examples, the new bFDR control algorithm shows promise. The strengths and limitations of the new approach are presented with examples and discussed.
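The bFDR procedure rests on the posterior probability that the null hypothesis holds for each test. As background, in the standard two-groups formulation (a common sketch of the setup, not necessarily the paper's exact prior specification, which involves μ, ω, and ν²), this posterior follows from Bayes' rule:

$$
U_0 = \Pr(H_0 \mid y) \;=\; \frac{(1-\pi)\, f_0(y)}{(1-\pi)\, f_0(y) + \pi\, f_1(y)},
$$

where $f_0$ and $f_1$ denote the densities of $y$ under the null and alternative hypotheses, and $\pi$ is the probability among the $M$ tests that the alternative is true. Bias in bFDR control arises when the plug-in estimates of these quantities are misspecified.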
Abbreviations
- *y*: Observed data, continuous response
- *θ*: Mean of *y*
- *σ*²: Variance of *y*
- *μ*: Assumed prior mean of *θ*
- *ω*, *ν*²: Additional parameters of the prior distributions of (*θ*, *σ*²)
- *H*₀/*H*₁: Null/alternative hypothesis
- *p*ⱼ: *p* value for the *j*th test
- *t*: *p* value threshold for rejecting or failing to reject *H*₀
- *U*₀: Posterior probability of *H*₀ given data *y*
- *α*: Rate at which error is controlled, or desired, i.e., FDR
- *f*(): Probability density function
- *F*(): Cumulative distribution function
- *M*: Number of attributes (features), i.e., tests
- *π*: Probability among *M* tests that the alternative hypothesis is true
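In terms of the quantities above, one common form of Bayesian FDR control rejects the hypotheses with the smallest posterior null probabilities *U*₀ such that their average (the estimated Bayesian FDR of the rejected set) stays at or below *α*. The sketch below illustrates that rule; the function name and array interface are illustrative, not taken from the paper:

```python
import numpy as np

def bayesian_fdr_select(u0, alpha):
    """Select tests by Bayesian FDR control.

    u0    : posterior null probabilities U0, one per test
    alpha : desired FDR level

    Rejects the hypotheses with the smallest U0 such that the average
    posterior null probability among the rejected set stays <= alpha.
    Returns a boolean mask over the original test order.
    """
    u0 = np.asarray(u0, dtype=float)
    order = np.argsort(u0)  # most significant tests first
    # Running mean of U0 over the k most significant tests, k = 1..M
    running_mean = np.cumsum(u0[order]) / np.arange(1, len(u0) + 1)
    # Largest rejection set whose estimated Bayesian FDR meets alpha
    k = int(np.sum(running_mean <= alpha))
    mask = np.zeros(len(u0), dtype=bool)
    mask[order[:k]] = True
    return mask
```

For example, with `u0 = [0.01, 0.02, 0.9, 0.05, 0.8]` and `alpha = 0.05`, the three tests with the smallest posterior null probabilities are selected. As the paper argues, the actual FDR achieved by this rule depends on how well the posterior probabilities are estimated, which is the source of the bias studied here.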
Electronic Supplementary Material
Below is the Electronic Supplementary Material.
About this article
Cite this article
Gold, D.L. Restoring coverage to the Bayesian false discovery rate control procedure. Knowl Inf Syst 33, 401–417 (2012). https://doi.org/10.1007/s10115-012-0503-z