Abstract
In the recent years, the application of mutual information based measures has received broad popularity. The mutual information MINE measure is asserted to be the best strategy for identification of relationships in challenging data sets. A major weakness of the MINE similarity metric concerns its high execution time. To address the performance issue numerous approaches are suggested both with respect to improvement of software implementations and with respect to the application of simplified heuristics. However, none of the approaches manage to address the high execution-time of MINE computation.
In this work, we address the latter issue. This paper presents a novel MINE implementation which manages a 530x+ performance increase when compared to established approaches. The novel high-performance approach is the result of a structural evaluation of 30+ different MINE software implementations, implementations which do not make use of simplified heuristics. Hence, the proposed strategy for computation of MINE mutual information is both accurate and fast. The novel mutual information MINE software is available at https://bitbucket.org/oekseth/mine-data-analysis/downloads/. To broaden the applicability the high-performance MINE metric is integrated into the hpLysis machine learning library (https://bitbucket.org/oekseth/hplysis-cluster-analysis-software).
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsNotes
- 1.
To use low-level assembly instructions for hardware parallel computations (SSE) to reduce execution time.
References
Ehsani, R., Drabløs, F.: TopoICSim: a new semantic similarity measure based on gene ontology. BMC Bioinform. 17(1), 296 (2016)
Faith, J.J., Hayete, B., Thaden, J.T., Mogno, I., Wierzbowski, J., Cottarel, G., Kasif, S., Collins, J.J., Gardner, T.S.: Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 5(1), 8 (2007)
Leach, S.M., Tipney, H., Feng, W., Baumgartner Jr., W.A., Kasliwal, P., Schuyler, R.P., Williams, T., Spritz, R.A., Hunter, L.: Biomedical discovery acceleration, with applications to craniofacial development. PLoS Comput. Biol. 5(3), 1000215 (2009)
Fraser, A.M., Swinney, H.L.: Independent coordinates for strange attractors from mutual information. Phys. Rev. A 33(2), 1134 (1986)
Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G., Turnbaugh, P.J., Lander, E.S., Mitzenmacher, M., Sabeti, P.C.: Detecting novel associations in large data sets. Science 334(6062), 1518–1524 (2011)
Liepe, J., Filippi, S., Komorowski, M., Stumpf, M.P.: Maximizing the information content of experiments in systems biology. PLoS Comput. Biol. 9(1), 1002888 (2013)
Villaverde, A.F., Ross, J., Morán, F., Banga, J.R.: MIDER: network inference with mutual information distance and entropy reduction. PLoS ONE 9(5), 96732 (2014)
Tang, D., Wang, M., Zheng, W., Wang, H.: RapidMic: rapid computation of the maximal information coefficient. Evol. Bioinform. 10, 11 (2014)
Albanese, D., Filosi, M., Visintainer, R., Riccadonna, S., Jurman, G., Furlanello, C.: Minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, 707 (2012)
Chen, Y., Zeng, Y., Luo, F., Yuan, Z.: A new algorithm to optimize maximal information coefficient. PLoS ONE 11(6), 0157567 (2016)
Wang, K., Phillips, C.A., Saxton, A.M., Langston, M.A.: EntropyExplorer: an R package for computing and comparing differential Shannon entropy, differential coefficient of variation and differential expression. BMC Res. Notes 8(1), 832 (2015)
Hausser, J., Strimmer, K.: Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. J. Mach. Learn. Res. 10(July), 1469–1484 (2009)
Marcon, E., Hérault, B.: Entropart: an R package to measure and partition diversity. J. Stat. Softw. 67(8), 1–26 (2015)
Guevara, M.R., Hartmann, D., Mendoza, M.: diverse: an R package to analyze diversity in complex systems. R J. 8(2), 60–78 (2016)
Ince, R.A., Mazzoni, A., Petersen, R.S., Panzeri, S.: Open source tools for the information theoretic analysis of neural data. Front. Neurosci. 3, 11 (2010)
Mazandu, G.K., Mulder, N.J.: Information content-based gene ontology functional similarity measures: which one to use for a given biological data type? PLoS ONE 9(12), 113859 (2014)
Morgan, H.D., Sutherland, H.G., Martin, D.I., Whitelaw, E.: Epigenetic inheritance at the agouti locus in the mouse. Nat. Genet. 23(3), 314–318 (1999)
Lee, H.-S., Chen, Z.J.: Protein-coding genes are epigenetically regulated in Arabidopsis polyploids. Proc. Nat. Acad. Sci. 98(12), 6753–6758 (2001)
Carro, M., Lim, W., Alvarez, M., Bollo, R., Zhao, X., Snyder, E., Sulman, E., Anne, S., Doetsch, F., Colman, H., et al.: The transcriptional network for mesenchymal transformation of brain tumours. Nature 463(7279), 318 (2010)
Yeger-Lotem, E., Sattath, S., Kashtan, N., Itzkovitz, S., Milo, R., Pinter, R.Y., Alon, U., Margalit, H.: Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction. Proc. Nat. Acad. Sci. U.S.A. 101(16), 5934–5939 (2004)
Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11), 1746–1758 (2004)
Sommerfelt, R.M., Feuerherm, A.J., Jones, K., Johansen, B.: Cytosolic phospholipase A2 regulates TNF-induced production of joint destructive effectors in synoviocytes. PLoS ONE 8(12), 83555 (2013)
Lee, W.-P., Tzou, W.-S.: Computational methods for discovering gene networks from expression data. Brief. Bioinform. 10(4), 408–423 (2009)
Riccadonna, S., Jurman, G., Visintainer, R., Filosi, M., Furlanello, C.: DTW-MIC coexpression networks from time-course data. PLoS ONE 11(3), 0152648 (2016)
Ekseth, K., Hvasshovd, S.: hpLysis similarity: a high-performance software-approach for computation of 320+ simliarty-metrics (2017)
Cha, S.-H.: Comprehensive survey on distance/similarity measures between probability density functions. City 1(2), 1 (2007)
Lord, E., Diallo, A.B., Makarenkov, V.: Classification of bioinformatics workflows using weighted versions of partitioning and hierarchical clustering algorithms. BMC Bioinform. 16(1), 1 (2015)
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 881–892 (2002)
Ekseth, O.K., Hvasshovd, S.-O.: How an optimized DB-SCAN implementation reduce execution-time and memory-requirements for large data-sets (2017)
Intel: SSE computer-hardware-low-level parallelism. https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Accessed 06 June 2017
Chao, A., Shen, T.-J.: Nonparametric estimation of Shannons index of diversity when there are unseen species in sample. Environ. Ecol. Stat. 10(4), 429–443 (2003)
Frery, A.C., Cintra, R.J., Nascimento, A.D.: Entropy-based statistical analysis of PolSAR data. IEEE Trans. Geosci. Remote Sens. 51(6), 3733–3743 (2013)
Moon, Y.-I., Rajagopalan, B., Lall, U.: Estimation of mutual information using kernel density estimators. Phys. Rev. E 52(3), 2318 (1995)
Jiao, J., Venkat, K., Han, Y., Weissman, T.: Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 61(5), 2835–2885 (2015)
Jourdan, J.-H.: Vectorizable, approximated, portable implementations of some mathematical functions. https://github.com/jhjourdan/SIMD-math-prims. Accessed 06 June 2017
Open-MP: Open-MP: a parallel software-wrapper. http://www.openmp.org/. Accessed 17 Nov 2017
Acknowledgements
The authors would like to thank MD K.I. Ekseth at UIO, Dr. O.V. Solberg at SINTEF, Dr. S.A. Aase at GE Healthcare, MD B.H. Helleberg at NTNU–medical, Dr. Y. Dahl, Dr. T. Aalberg, and K.T. Dragland at NTNU, and Professor P. Sætrom and the High Performance Computing Group at NTNU for their support.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Ekseth, O.K., Hvasshovd, SO. (2018). An Empirical Study of Strategies Boosts Performance of Mutual Information Similarity. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2018. Lecture Notes in Computer Science(), vol 10842. Springer, Cham. https://doi.org/10.1007/978-3-319-91262-2_29
Download citation
DOI: https://doi.org/10.1007/978-3-319-91262-2_29
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91261-5
Online ISBN: 978-3-319-91262-2
eBook Packages: Computer ScienceComputer Science (R0)