DOI: 10.1145/3309129.3309133
research-article

Breast Cancer Prediction Using Spark MLlib and ML Packages

Published: 27 December 2018

ABSTRACT

Machine learning is now applied in many aspects of life, especially in health care. Classification using machine learning has greatly improved, helping to make predictions and to support doctors in making diagnoses. Furthermore, human lives are changing with Big Data, which covers a wide array of scientific knowledge, and with Data Mining, which solves problems by analyzing data and discovering patterns in existing databases. The prediction process is heavily data driven, and therefore advanced machine learning techniques are often used. In this paper, we look at what types of experimental data are typically used, do preliminary analysis on them, and generate breast cancer prediction models, all with PySpark and its machine learning frameworks. Using a database with more than a hundred sets of data gathered in routine blood analysis, the models reach accuracy rates of about 72% for detection and 83% for classification.
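The abstract does not give the paper's code, so the following is only a minimal sketch of how such a workflow might look with PySpark's `pyspark.ml` package. The column names, the 70/30 split, the choice of a decision tree, and the helper names `accuracy` and `train_and_score` are all assumptions made for illustration; they are not taken from the paper. The plain-Python `accuracy` function shows what the reported accuracy rates measure.

```python
# Hypothetical column names for routine blood-analysis features; adjust to the real data.
FEATURE_COLS = ["Age", "BMI", "Glucose", "Insulin", "Resistin"]
LABEL_COL = "label"  # assumed numeric binary column: 0 = healthy control, 1 = patient


def accuracy(labels, predictions):
    """Share of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def train_and_score(df, seed=42):
    """Fit a decision tree on a Spark DataFrame and return accuracy on a held-out split.

    PySpark is imported here so this module still loads where Spark is not installed.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler

    # Assumed 70/30 train/test split; the abstract does not state the split used.
    train, test = df.randomSplit([0.7, 0.3], seed=seed)

    # Combine the numeric feature columns into the single vector column Spark ML expects.
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    tree = DecisionTreeClassifier(labelCol=LABEL_COL, featuresCol="features")
    model = Pipeline(stages=[assembler, tree]).fit(train)

    evaluator = MulticlassClassificationEvaluator(
        labelCol=LABEL_COL, predictionCol="prediction", metricName="accuracy"
    )
    return evaluator.evaluate(model.transform(test))
```

Given a Spark session, a DataFrame could be loaded with `spark.read.csv(path, header=True, inferSchema=True)` and passed to `train_and_score`. With only about a hundred rows, the score will vary noticeably with the random split, so a single run should be read with caution.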

      • Published in

        ICBRA '18: Proceedings of the 5th International Conference on Bioinformatics Research and Applications
        December 2018
        111 pages
        ISBN:9781450366113
        DOI:10.1145/3309129

        Copyright © 2018 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • research-article
        • Research
        • Refereed limited
