Massive datasets and machine learning for computational biomedicine: trends and challenges

  • S.I.: Computational Biomedicine
  • Annals of Operations Research

Abstract

This survey covers a broad range of topics in computational biomedicine. The field has attracted considerable attention because of the benefits it can bring to society, and recent technological and theoretical advances have enabled substantial progress. Problems arising in this field are traditionally challenging from many perspectives. In this paper, we consider the influence of big data on the field, the problems associated with massive datasets in biomedicine, and ways to address them. We analyze the most commonly used machine learning and feature mining tools, as well as several emerging trends such as deep learning and biological networks for computational biomedicine.

Notes

  1. See Table 3 in the appendix for the full list of Web of Knowledge research areas that we related to biomedical research.


    Google Scholar 

  • Scholz, M., Kaplan, F., Guy, C. L., Kopka, J., & Selbig, J. (2005). Non-linear PCA: A missing data approach. Bioinformatics, 21(20), 3887–3895.

    Google Scholar 

  • Shaw, L. J., Raggi, P., Berman, D. S., & Callister, T. Q. (2006). Coronary artery calcium as a measure of biologic age. Atherosclerosis, 188(1), 112–119.

    Google Scholar 

  • Shin, H.-C., Roth, H. R., Gao, M., Lu, L., Xu, Z., Nogues, I., et al. (2016). Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5), 1285–1298.

    Google Scholar 

  • Shivaswamy, P. K., Bhattacharyya, C., & Smola, A. J. (2006). Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7, 1283–1314.

    Google Scholar 

  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.

    Google Scholar 

  • Sinha, S. R., Sullivan, L. R., Sabau, D., Orta, D. S. J., Dombrowski, K. E., Halford, J. J., et al. (2016). American clinical neurophysiology society guideline 1: Minimum technical requirements for performing clinical electroencephalography. The Neurodiagnostic Journal, 56(4), 235–244.

    Google Scholar 

  • Skidmore, F., Korenkevych, D., Liu, Y., He, G., Bullmore, E., & Pardalos, P. M. (2011). Connectivity brain networks based on wavelet correlation analysis in parkinson fmri data. Neuroscience Letters, 499(1), 47–51.

    Google Scholar 

  • Sosenko, J. M., Mahon, J., Rafkin, L., Lachin, J. M., Krause-Steinrauf, H., Krischer, J. P., et al. (2011). A comparison of the baseline metabolic profiles between diabetes prevention trial-type 1 and trialnet natural history study participants. Pediatric Diabetes, 12(2), 85–90.

    Google Scholar 

  • Sporns, O., Honey, C. J., & Kötter, R. (2007). Identification and classification of hubs in brain networks. PloS ONE, 2(10), e1049.

    Google Scholar 

  • Sporns, O., Tononi, G., & Edelman, G. M. (2000). Theoretical neuroanatomy: Relating anatomical and functional connectivity in graphs and cortical connection matrices. Cerebral Cortex, 10(2), 127–141.

    Google Scholar 

  • Statnikov, A. (2011). A gentle introduction to support vector machines in biomedicine: Theory and methods (Vol. 1). World Scientific.

  • Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., et al. (2014). String v10: Protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1), D447–D452.

    Google Scholar 

  • Tan, M., Wang, L., & Tsang, I. W. (2010). Learning sparse svm for feature selection on very high dimensional datasets. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 1047–1054).

  • Tang, G., & Qin, A. (2008). ECG de-noising based on empirical mode decomposition. In The 9th international conference for young computer scientists, 2008. ICYCS 2008 (pp. 903–906). IEEE.

  • Targ, S., Almeida, D., & Lyman, K. (2016). Resnet in resnet: Generalizing residual architectures. arXiv preprintarXiv:1603.08029.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.

    Google Scholar 

  • Tsirka, V., Simos, P. G., Vakis, A., Kanatsouli, K., Vourkas, M., Erimaki, S., et al. (2011). Mild traumatic brain injury: Graph-model characterization of brain networks for episodic memory. International Journal of Psychophysiology, 79(2), 89–96.

    Google Scholar 

  • van Buuren, S., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1–68.

    Google Scholar 

  • van Grinsven, M. J. J. P., van Ginneken, B., Hoyng, C. B., Theelen, T., & Sánchez, C. I. (2016). Fast convolutional neural network training using selective data sampling: Application to hemorrhage detection in color fundus images. IEEE Transactions on Medical Imaging, 35(5), 1273–1284.

    Google Scholar 

  • Vapnik, V. N., & Lerner, A. Y. (1963). Recognition of patterns with help of generalized portraits. Avtomat. i Telemekh, 24(6), 774–780.

    Google Scholar 

  • Vasconcelos, C. N., & Vasconcelos, B. N. (2017). Increasing deep learning melanoma classification by classical and expert knowledge based image transforms. CoRR, arXiv:abs/1702.07025.

  • Waldrop, M. M. (2016). More than moore. Nature, 530(7589), 144–148.

    Google Scholar 

  • Wang, W., Liu, Q.-H., Cai, S.-M., Tang, M., Braunstein, L. A., & Stanley, H. E. (2016). Suppressing disease spreading by using information diffusion on multiplex networks. Scientific Reports, 6, 29259.

    Google Scholar 

  • Wang, X., Fan, N., & Pardalos, P. M. (2018). Robust chance-constrained support vector machines with second-order moment information. Annals of Operations Research, 263(1–2), 45–68.

    Google Scholar 

  • Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of small-worldnetworks. Nature, 393(6684), 440.

    Google Scholar 

  • Webb, A., & Kagadis, G. C. (2003). Introduction to biomedical imaging. Medical Physics, 30(8), 2267–2267.

    Google Scholar 

  • White, J. G., Southgate, E., Thomson, J. N., & Brenner, S. (1986). The structure of the nervous system of the nematode caenorhabditis elegans. Philosophical Transaction of the Royal Society of London B Biology Science, 314(1165), 1–340.

    Google Scholar 

  • Wong, H. R., Lindsell, C. J., Pettilä, V., Meyer, N. J., Thair, S. A., Karlsson, S., et al. (2014). A multibiomarker-based outcome risk stratification model for adult septic shock. Critical Care Medicine, 42(4), 781.

    Google Scholar 

  • Wong, S. C., Gatt, A., Stamatescu, V., & McDonnell, M. D. (2016). Understanding data augmentation for classification: When to warp? In 2016 international conference on digital image computing: techniques and applications (DICTA) (pp. 1–6). IEEE.

  • Xu, Y., Jia, R., Mou, L., Li, G., Chen, Y., Lu, Y., & Jin, Z. (2016). Improved relation classification by deep recurrent neural networks with data augmentation. In COLING.

  • Yao, D. (2001). A method to standardize a reference of scalp EEG recordings to a point at infinity. Physiological Measurement, 22(4), 693.

    Google Scholar 

  • Yu, Y., Su, R., Wang, L., Qi, W., & He, Z. (2010). Comparative QSAR modeling of antitumor activity of ARC-111 analogues using stepwise MLR, PLS, and ANN techniques. Medicinal Chemistry Research, 19(9), 1233–1244.

    Google Scholar 

  • Zhang, D., Wang, Y., Zhou, L., Yuan, H., Shen, D., Initiative, A. D. N., et al. (2011). Multimodal classification of alzheimer’s disease and mild cognitive impairment. Neuroimage, 55(3), 856–867.

    Google Scholar 

  • Zhao, X.-M., Li, X., Chen, L., & Aihara, K. (2008). Protein classification with imbalanced data. Proteins: Structure, Function, and Bioinformatics, 70(4), 1125–1132.

    Google Scholar 

  • Zhou, J., Greicius, M. D., Gennatas, E. D., Growdon, M. E., Jang, J. Y., Rabinovici, G. D., et al. (2010). Divergent network connectivity changes in behavioural variant frontotemporal dementia and alzheimers disease. Brain, 133(5), 1352–1367.

    Google Scholar 

Download references

Acknowledgements

Panos Pardalos was partially supported by Laboratory of Algorithm and Technologies for Network Analysis, Nizhny Novgorod, Russia.

Author information

Corresponding author

Correspondence to Anton Kocheturov.

Appendix

See Tables 2 and 3.

Table 2 Key words used for the searches
Table 3 Research areas defined in Web of Science that we marked as related to Biomedicine

About this article

Cite this article

Kocheturov, A., Pardalos, P.M. & Karakitsiou, A. Massive datasets and machine learning for computational biomedicine: trends and challenges. Ann Oper Res 276, 5–34 (2019). https://doi.org/10.1007/s10479-018-2891-2

  • DOI: https://doi.org/10.1007/s10479-018-2891-2
