Large-Scale Data Classification System Based on Galaxy Server and Protected from Information Leak

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10192)

Abstract

In this work we present SPICY (SPecialized Classification sYstem), an application for supervised data analysis (feature selection, classification, model validation and model selection) whose structure protects the data-processing workflow from so-called information leak. Information leak may result in an optimistically biased assessment of classification quality, especially for large-scale, small-sample data sets. The application runs in the Galaxy Server environment, which by itself lets the user process data manually and offers no protection against information leak. In SPICY, the way the user builds the classification model and the specific structure of all implemented methods make information leak impossible. The absence of information leak in the presented supervised data analysis tool is demonstrated on numerical examples using synthetic and real data sets.
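
To make the notion of information leak concrete, here is a minimal sketch in Python. It is not SPICY's implementation and does not use Galaxy; the function names (make_noise_data, select_features, nearest_centroid_predict, cv_accuracy), the nearest-centroid classifier, the mean-difference feature score and all parameter values are assumptions chosen for illustration. The sketch builds a pure-noise data set with many features and few samples, then estimates accuracy by cross-validation twice: once with features selected on the full data set before splitting (the leaky workflow), and once with selection repeated inside each training fold.

```python
import random


def make_noise_data(n_samples, n_features, seed):
    """Pure-noise data: the labels are independent of every feature."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # balanced, alternating labels
    return X, y


def select_features(X, y, rows, k):
    """Return the k features with the largest absolute difference of
    class means, computed only from the samples listed in `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_predict(X, y, train, test_row, feats):
    """Assign the test sample to the class with the closer centroid."""
    dist = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        dist[label] = 0.0
        for j in feats:
            centroid_j = sum(X[i][j] for i in members) / len(members)
            dist[label] += (X[test_row][j] - centroid_j) ** 2
    return 0 if dist[0] <= dist[1] else 1


def cv_accuracy(X, y, k_feats, n_folds, leak):
    """Cross-validated accuracy. With leak=True, features are selected
    once on ALL samples, so the test folds influence the selection."""
    rows = list(range(len(X)))
    leaked = select_features(X, y, rows, k_feats) if leak else None
    correct = 0
    for fold in range(n_folds):
        test = [i for i in rows if i % n_folds == fold]
        train = [i for i in rows if i % n_folds != fold]
        # Leak-free variant: selection sees the training fold only.
        feats = leaked if leak else select_features(X, y, train, k_feats)
        correct += sum(
            nearest_centroid_predict(X, y, train, i, feats) == y[i]
            for i in test
        )
    return correct / len(X)


if __name__ == "__main__":
    X, y = make_noise_data(n_samples=40, n_features=1000, seed=0)
    print("leaky estimate :", cv_accuracy(X, y, 10, 5, leak=True))
    print("honest estimate:", cv_accuracy(X, y, 10, 5, leak=False))
```

Because the data contain no real signal, an unbiased estimate should stay near chance level. The leaky variant typically reports a much higher accuracy, which is the optimistic bias the abstract describes for large-scale, small-sample data.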

Acknowledgment

This work was supported by the NCBiR under Grants DZP/PBS3/2441/2014 (KF, SS, TZ, MJ, AS) and Strategmed2/267398/4/NCBR/2015 (JP, KP). Calculations were performed using the infrastructure of the Ziemowit computer cluster (www.ziemowit.hpc.polsl.pl), funded by the Silesian BIO-FARMA project No. POIG.02.01.00-00-166/08 and expanded under project POIG.02.03.01-00-040/13, in the Computational Biology and Bioinformatics Laboratory of the Biotechnology Centre at the Silesian University of Technology.

Author information

Corresponding author

Correspondence to Krzysztof Fujarewicz.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Fujarewicz, K. et al. (2017). Large-Scale Data Classification System Based on Galaxy Server and Protected from Information Leak. In: Nguyen, N., Tojo, S., Nguyen, L., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2017. Lecture Notes in Computer Science, vol 10192. Springer, Cham. https://doi.org/10.1007/978-3-319-54430-4_73

  • DOI: https://doi.org/10.1007/978-3-319-54430-4_73

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54429-8

  • Online ISBN: 978-3-319-54430-4

  • eBook Packages: Computer Science, Computer Science (R0)
