Abstract
In this work we present SPICY (SPecialized Classification sYstem), an application for supervised data analysis (feature selection, classification, model validation and model selection) whose structure protects the data-processing workflow from so-called information leak. Information leak may result in an optimistically biased assessment of classification quality, especially for large-scale, small-sample data sets. The application runs in the Galaxy Server environment, which by itself lets the user process data manually and offers no protection against information leak. In SPICY, the way the user builds the classification model, together with the specific structure of all implemented methods, makes information leak impossible. The absence of information leak in the presented tool is demonstrated on numerical examples using synthetic and real data sets.
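The effect described in the abstract can be illustrated with a minimal sketch. It is not the SPICY implementation: the function names, the mean-difference feature ranking and the nearest-centroid classifier are all assumptions chosen for brevity. The data are pure noise, so an unbiased accuracy estimate should sit near chance level. Selecting features once on all samples before cross-validation lets the test samples influence the model, whereas repeating the selection inside each training fold does not.

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Pure-noise data: features are independent of the labels."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def class_means(X, y, idx, features):
    """Per-class mean of the chosen features over the samples in idx."""
    means = {}
    for label in (0, 1):
        rows = [X[i] for i in idx if y[i] == label]
        means[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    return means

def select_features(X, y, idx, k):
    """Rank features by absolute difference of class means; keep the top k."""
    all_features = range(len(X[0]))
    m = class_means(X, y, idx, all_features)
    scores = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(all_features, key=lambda f: scores[f], reverse=True)[:k]

def predict(X, means, features, i):
    """Nearest-centroid prediction for sample i."""
    def dist(label):
        return sum((X[i][f] - c) ** 2 for f, c in zip(features, means[label]))
    return 0 if dist(0) <= dist(1) else 1

def cv_accuracy(X, y, k, n_folds, leaky):
    """Cross-validated accuracy with feature selection outside or inside the folds."""
    everything = list(range(len(X)))
    # Leaky variant: features are chosen once, using the test samples too.
    global_features = select_features(X, y, everything, k) if leaky else None
    correct = 0
    for fold in range(n_folds):
        test = [i for i in everything if i % n_folds == fold]
        train = [i for i in everything if i % n_folds != fold]
        # Leak-free variant: selection sees only the training fold.
        features = global_features if leaky else select_features(X, y, train, k)
        means = class_means(X, y, train, features)
        correct += sum(predict(X, means, features, i) == y[i] for i in test)
    return correct / len(X)

rng = random.Random(0)
X, y = make_noise_data(40, 1000, rng)  # small-sample, many-feature setting
leaky = cv_accuracy(X, y, k=10, n_folds=5, leaky=True)
honest = cv_accuracy(X, y, k=10, n_folds=5, leaky=False)
print(f"selection outside CV (leaky): {leaky:.2f}")
print(f"selection inside CV (honest): {honest:.2f}")
```

Because the labels carry no signal, any accuracy well above 0.5 in the first line of output is an artifact of the leak rather than evidence of a useful classifier.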
Acknowledgment
This work was supported by the NCBiR under Grants DZP/PBS3/2441/2014 (KF, SS, TZ, MJ, AS) and Strategmed2/267398/4/NCBR/2015 (JP, KP). Calculations were performed using the infrastructure of the computer cluster Ziemowit (www.ziemowit.hpc.polsl.pl), funded by the Silesian BIO-FARMA project No. POIG.02.01.00-00-166/08 and expanded under project POIG.02.03.01-00-040/13, in the Computational Biology and Bioinformatics Laboratory of the Biotechnology Centre at the Silesian University of Technology.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Fujarewicz, K. et al. (2017). Large-Scale Data Classification System Based on Galaxy Server and Protected from Information Leak. In: Nguyen, N., Tojo, S., Nguyen, L., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2017. Lecture Notes in Computer Science, vol 10192. Springer, Cham. https://doi.org/10.1007/978-3-319-54430-4_73
DOI: https://doi.org/10.1007/978-3-319-54430-4_73
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54429-8
Online ISBN: 978-3-319-54430-4
eBook Packages: Computer Science (R0)