ABSTRACT
Machine learning is now applied in many aspects of life, especially in health care. Classification using machine learning has improved greatly, helping to make predictions and to support doctors in making diagnoses. Furthermore, human lives are changing with Big Data, which covers a wide array of scientific knowledge, and with data mining, which solves problems by analyzing data and discovering patterns in existing databases. The prediction process is heavily data-driven, so advanced machine learning techniques are often used. In this paper, we look at the types of experimental data that are typically used, perform a preliminary analysis on them, and generate breast cancer prediction models, all with PySpark and its machine learning frameworks. Using a database with more than a hundred sets of data gathered in routine blood analysis, the accuracy rates of detection and classification are about 72% and 83%, respectively.
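As a rough illustration of the workflow the abstract describes, the sketch below loads a CSV of blood-analysis measurements, trains a classifier with the PySpark ML pipeline API, and reports accuracy on a held-out split. It is not the authors' code. The decision tree, the 70/30 split, the seed, and the function and parameter names (`train_and_score`, `csv_path`, `feature_cols`, `label_col`) are all assumptions made for the example; the abstract does not name the algorithm or the evaluation protocol.

```python
from typing import List, Sequence


def accuracy(labels: Sequence[float], predictions: Sequence[float]) -> float:
    """Fraction of predictions that equal the true label (plain-Python helper)."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def train_and_score(csv_path: str, feature_cols: List[str], label_col: str,
                    seed: int = 42) -> float:
    """Train a decision tree on a CSV file and return test-set accuracy.

    Hypothetical helper: column names and the numeric label column are
    supplied by the caller; nothing here is taken from the paper itself.
    """
    # Imported lazily so the accuracy() helper works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("breast-cancer-sketch").getOrCreate()
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Assumed 70/30 split; the source does not state how data were partitioned.
    train, test = df.randomSplit([0.7, 0.3], seed=seed)

    # Pack the numeric feature columns into the single vector column Spark ML expects.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    tree = DecisionTreeClassifier(featuresCol="features", labelCol=label_col)
    model = Pipeline(stages=[assembler, tree]).fit(train)

    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy")
    return evaluator.evaluate(model.transform(test))
```

A call might look like `train_and_score("data.csv", ["Age", "BMI", "Glucose"], "Classification")`, where the file name and column names are placeholders. With only about a hundred rows, a single random split gives a noisy estimate, so the reported accuracy will vary with the seed.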
Index Terms
- Breast Cancer Prediction Using Spark MLlib and ML Packages