Abstract
Reshef et al. (Science 334:1518–1523, 2011) introduce the maximal information coefficient, or MIC, which captures a wide range of relationships between pairs of variables. We derive a useful property that can be employed either to substantially reduce the computer time needed to determine MIC, or to obtain a series of MIC values at different resolutions. By studying how the MIC scores depend on the maximal resolution employed to partition the data, we show that relationships of different natures can be discerned more clearly. We also provide an iterative greedy algorithm, as an alternative to the ApproxMaxMI algorithm proposed by Reshef et al., which determines the value of MIC through iterative optimization and can be run in parallel.
Notes
In our work, we adopt st ≤ B instead of st < B for convenience in producing our figures.
For any st < B not on the shell, one has I*(D,s,t) ≤ I*(D,s,t′) and I*(D,s,t) ≤ I*(D,s′,t) for both st′ ≈ B and s′t ≈ B on the shell. It follows that either m_{s,t} ≤ m_{s,t′} or m_{s,t} ≤ m_{s′,t}.
As the partition line μ_{2i−1} is moved by one lattice point (say, from j to j+1), only a small number of data pairs (only one in most of the cases studied in this paper) are moved across this partition line. We calculate the difference in mutual information, I(μ_{2i−1} = j+1) − I(μ_{2i−1} = j), rather than I(μ_{2i−1} = j+1) and I(μ_{2i−1} = j) separately, to reduce the computation time.
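The bookkeeping behind this incremental update can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the counts[col, row] table layout are our own, and the key observation is that moving a vertical gridline changes only the two adjacent columns while leaving the row marginals untouched.

```python
import numpy as np

def mi(counts):
    """Mutual information (in nats) of a joint count table counts[col, row]."""
    n = counts.sum()
    p = counts / n
    pc = p.sum(axis=1)[:, None]   # column marginals
    pr = p.sum(axis=0)[None, :]   # row marginals
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (pc * pr)[mask])).sum())

def col_term(counts, c, pr, n):
    """Contribution of column c to the MI; row marginals pr are fixed."""
    p = counts[c] / n
    pc = p.sum()
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (pc * pr[mask]))).sum())

def delta_mi(counts, c, row, n, pr):
    """Change in MI when one point in `row` crosses the gridline from
    column c+1 into column c (counts is updated in place).  Only the
    terms of the two affected columns need to be recomputed."""
    before = col_term(counts, c, pr, n) + col_term(counts, c + 1, pr, n)
    counts[c, row] += 1
    counts[c + 1, row] -= 1
    after = col_term(counts, c, pr, n) + col_term(counts, c + 1, pr, n)
    return after - before
```

Recomputing only two column terms instead of the full double sum is what makes sweeping a gridline across all lattice positions cheap.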
In practice, deadlocked loops may be encountered in simulation due to the accumulation of truncation errors. We employ a parameter tolerance = 10^-35 (or one may adopt an even smaller dynamic value, say 2×10^-36, and gradually raise it as deadlocked loops are encountered), within which differences between mutual-information values (computed in quad precision) are regarded as insignificant. For two different positions of a gridline with equal mutual information, the one closer to the choice of the last iteration is favored.
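The tie-breaking rule can be sketched as below. This is a hypothetical helper of our own, using a double-precision tolerance as a stand-in for the quad-precision 10^-35 of the note:

```python
def choose_gridline(positions, mi_values, last_pos, tol=1e-12):
    """Take the candidate gridline position with the largest mutual
    information; positions whose MI lies within `tol` of the maximum are
    treated as ties, resolved in favor of the position closest to the
    previous iteration's choice (`last_pos`), which prevents the greedy
    sweep from oscillating between numerically equal optima."""
    best = max(mi_values)
    tied = [p for p, v in zip(positions, mi_values) if best - v <= tol]
    return min(tied, key=lambda p: abs(p - last_pos))
```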
By equipartition on the x- (y-) axis or the columns (rows), we mean that approximately the same number of data points are assigned to each column (row).
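Such an equipartition on one axis can be illustrated as follows (a hypothetical helper of our own, assuming distinct data values; with ties the counts are only approximately equal):

```python
import numpy as np

def equipartition_edges(x, k):
    """Column boundaries that assign roughly len(x)/k points per column."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    cuts = [round(i * n / k) for i in range(1, k)]      # order-statistic positions
    return [(xs[j - 1] + xs[j]) / 2.0 for j in cuts]    # cut midway between neighbours
```

For example, 12 equally spaced points split into k = 3 columns of 4 points each.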
An “OutOfMemoryError” is raised when the algorithm is run on a data set of 2^18 pairs with the default value c=15, but no error occurs with the lower value c=10.
The average of ζ is about 20 and 30 for relationships of class (ii) and class (iii), respectively.
Strictly speaking, this instance is not of class (ii) because it contains a flat segment. Hence, our results here apply to a larger group than class (ii), one that allows flat or vertical segments.
References
Gray, R.: Entropy and Information Theory, 2nd edn. Springer, Boston (2011)
Mackay, D.: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge (2003)
Reshef, D., Reshef, Y., Finucane, H., Grossman, R., McVean, G., Turnbaugh, P., Lander, E., Mitzenmacher, M., Sabeti, P.: Detecting novel associations in large data sets. Science 334, 1518–1523 (2011)
Speed, T.: A correlation for the 21st Century. Science 334, 1502–1503 (2011)
Acknowledgements
This work was supported in part by the National Science Council of the Republic of China under Grants No. NSC-100-2112-M002-007 and NSC-100-2112-M032-002-MY3.
Cite this article
Lee, SC., Pang, NN. & Tzeng, WJ. Resolution dependence of the maximal information coefficient for noiseless relationship. Stat Comput 24, 845–852 (2014). https://doi.org/10.1007/s11222-013-9405-5