research-article

Breast Cancer Prediction Using Spark MLlib and ML Packages

Authors:
Phan Duy Hung, FPT University, Hanoi, Vietnam
Tran Duc Hanh, FPT University, Hanoi, Vietnam
Vu Thu Diep, Hanoi University of Science and Technology, Hanoi, Vietnam

ICBRA '18: Proceedings of the 5th International Conference on Bioinformatics Research and Applications, December 2018, Pages 52–59. https://doi.org/10.1145/3309129.3309133

Published: 27 December 2018

ABSTRACT

Machine learning is now applied in many aspects of life, especially in health care. Classification using machine learning has improved greatly, making it possible to generate predictions and to support doctors in making diagnoses. Furthermore, human lives are changing with Big Data, which covers a wide array of scientific knowledge, and with data mining, which solves problems by analyzing data and discovering patterns in existing databases. The prediction process is heavily data-driven, and therefore advanced machine learning techniques are often utilized. In this paper, we look at what types of experimental data are typically used, perform a preliminary analysis on them, and generate breast cancer prediction models, all with PySpark and its machine learning frameworks. Using a database of more than a hundred sets of data gathered in routine blood analysis, the accuracy rates of detection and classification are about 72% and 83%, respectively.

Index Terms

1. Applied computing
  1. Life and medical sciences
    1. Bioinformatics
2. Computing methodologies
  1. Machine learning
    1. Machine learning approaches

Published in

ICBRA '18: Proceedings of the 5th International Conference on Bioinformatics Research and Applications
December 2018
111 pages
ISBN:9781450366113
DOI:10.1145/3309129

Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Apache spark
Breast cancer
ML packages
MLlib
Qualifiers
- research-article
- Research
- Refereed limited