Abstract:
One of the most important problems in large-scale inference is the identification of variables that are highly dependent on several other variables. When dependence is measured by partial correlations, these variables correspond to those rows of the partial correlation matrix that have several entries with large magnitudes, i.e., hubs in the associated partial correlation graph. This paper develops theory and algorithms for discovering such hubs from a few observations of these variables. We introduce a hub screening framework in which the user specifies both a minimum (partial) correlation \rho and a minimum degree \delta to screen the vertices. The choice of \rho and \delta can be guided by our mathematical expressions for the phase transition correlation threshold \rho_{c} governing the average number of discoveries. It can also be guided by our asymptotic expressions for familywise discovery rates under the assumption of a large number p of variables, a fixed number n of multivariate samples, and weak dependence. Under the null hypothesis that the dispersion (covariance) matrix is sparse, these limiting expressions can be used to enforce familywise error constraints and to rank the discoveries in order of increasing statistical significance. For n \ll p, the computational complexity of the proposed partial correlation screening method is low, so the method is highly scalable and can be applied to significantly larger problems than previous approaches. The theory is applied to discovering hubs in a high-dimensional gene microarray dataset.
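As a rough illustration of the screening step described in the abstract, the following minimal Python sketch flags vertices that have at least \delta off-diagonal (partial) correlations of magnitude at least \rho. The function names `partial_correlation` and `screen_hubs`, the use of the Moore-Penrose pseudo-inverse to form sample partial correlations when n < p, and the non-strict inequalities are all assumptions of this sketch, not details confirmed by the abstract.

```python
import numpy as np


def partial_correlation(X):
    """Sample partial correlations from an n-by-p data matrix X (rows are samples).

    Assumption of this sketch: the pseudo-inverse of the sample correlation
    matrix stands in for its inverse, so the result is defined even if n < p.
    """
    R = np.corrcoef(X, rowvar=False)      # p-by-p sample correlation matrix
    R_pinv = np.linalg.pinv(R)            # Moore-Penrose pseudo-inverse
    d = np.sqrt(np.diag(R_pinv))
    P = -R_pinv / np.outer(d, d)          # normalize, usual sign convention
    np.fill_diagonal(P, 1.0)
    return P


def screen_hubs(C, rho, delta):
    """Return (hub indices, vertex degrees) for a (partial) correlation matrix C.

    A vertex's degree counts its off-diagonal entries with |C_ij| >= rho;
    a vertex is reported as a hub when that degree is >= delta.
    """
    A = np.abs(C) >= rho                  # boolean adjacency after thresholding
    np.fill_diagonal(A, False)            # ignore self-correlations
    degree = A.sum(axis=1)
    return np.flatnonzero(degree >= delta), degree


# Hand-built example: variable 0 is strongly tied to variables 1 and 2.
C_example = np.array([
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.20, 0.10],
    [0.85, 0.20, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
])
hubs, degree = screen_hubs(C_example, rho=0.8, delta=2)
# hubs -> [0], degree -> [2, 1, 1, 0]
```

In this sketch the thresholds `rho` and `delta` are supplied by hand; the abstract indicates that in practice they would be guided by the phase transition threshold \rho_{c} and the asymptotic familywise discovery rates.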
Published in: IEEE Transactions on Information Theory (Volume: 58, Issue: 9, September 2012)