1 Introduction

Along with the explosion of big data on the Internet, interest in the field of Data Science is rapidly increasing. Data Science uses artificial intelligence (AI) and machine learning (ML) as effective tools to efficiently and accurately discover meaningful patterns hidden in large volumes of data. In other words, Data Science uses AI and ML to find the best solutions to real-world problems in novel ways [3].

ML, a subfield of AI, teaches a machine how to learn. More specifically, drawing on methodologies from disciplines such as statistics, operations research, and mathematics, it automatically builds analytical models that discover hidden patterns and relations in data without being explicitly programmed [24].

Nowadays, ML and statistical analysis are increasingly converging. Both methodologies rely heavily on pattern recognition and data mining. In fact, over the past decades, these two data-driven disciplines have complemented each other, and they will most likely continue to do so for years to come [1].

In addition to predictive analytics, data visualization tools are important to help data scientists better understand the significance of data. The underlying patterns and correlations buried deep underneath numbers and words can be easily revealed by the power of data visualization software like Tableau, Qlik and D3.js [12].

Cloud computing is a form of service-oriented computation, and it has made an ever-increasing number of applications, services, and platforms available to the general public [5]. ML benefits from cloud computing because of its low operating cost, scalability, and the processing power needed to analyze large volumes of data. Databricks is an example of a just-in-time cloud-based platform. It was created to help users carry out the full process from data preparation to experimentation, and to deploy ML applications quickly. Databricks is reported to be up to a hundred times faster than open-source Apache Spark [13].

Databricks provides access to a rich set of ML algorithms, ranging from clustering to SVM and various forms of linear and non-linear regression. As an example, K-means is a common unsupervised learning algorithm that tends to get stuck in local optima, whereas hierarchical clustering is often described as producing higher-quality clusters. Bisecting K-means combines the two approaches: it pairs the run-time efficiency of “regular” K-means with the higher quality of hierarchical clustering [6]. In ML, a kernelized SVM (Support Vector Machine) can perform inefficiently when processing large data sets (over 100K observations). The main cause is that a kernelized SVM requires the computation of a distance function between each pair of points in the dataset, which could require \( O(n_{\text{features}} \times n_{\text{observations}}^{2}) \) operations. To mitigate this, techniques such as data normalization, Stochastic Gradient Descent (SGD), and kernel approximation are often applied [2].
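To give a rough sense of scale, the operation count above can be instantiated with illustrative numbers. The values \( n_{\text{observations}} = 10^{5} \) and \( n_{\text{features}} = 10 \) are assumptions chosen for the example, not figures taken from the experiments:

```latex
\[
n_{\text{features}} \times n_{\text{observations}}^{2}
  = 10 \times \left(10^{5}\right)^{2}
  = 10^{11}
\]
```

Because the count is quadratic in the number of observations, doubling the dataset size roughly quadruples the work, which is why the cost becomes prohibitive for very large data sets.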

One of the most challenging problems for novice data scientists is determining which algorithms are best suited to which data sets. Many aspects of the task, such as the size of the data, the required accuracy of the results, and the available computational time and resources, must be taken into consideration. The main contribution of this paper is to offer guidelines and concrete case studies to data scientists who are interested in working with Databricks. Specifically, the SAS Algorithm Flowchart, which provides useful tips for solving specific problems, is discussed. Its main goal is to allow the user to easily find the appropriate algorithm based on the required speed and accuracy, and to evaluate the significance of the obtained results [10]. Further, guidelines for using the scikit-learn suite are included. These guidelines offer a clear view of how to select an algorithm based on the size of the dataset.

The remainder of the paper is organized as follows. Section 2 gives a brief introduction to ML model building. Section 3 discusses the machine learning experiments conducted on two big data sets in Databricks. Specifically, the experiments show how to deal with a dataset that has no labels by using an unsupervised learning algorithm. They also compare the performance of a standard (unoptimized) kernelized SVM with that of optimized variants. Finally, Sect. 4 presents the conclusions and future directions.

2 Building ML Models

Figure 1 depicts a simple Data Science workflow, from the original dataset to the analysis of the obtained classification accuracy. First, Data Cleaning is performed, which is an essential process before training. During this phase, missing or corrupt values are handled and outliers are detected in order to improve overall model accuracy. Second, the dataset is split for model training and accuracy testing. Depending on the specific dataset, operations such as StringIndexer, VectorAssembler, and OneHotEncoder can be applied to the dataset. StringIndexer encodes a string column to a column of label indices. VectorAssembler is a transformer that combines a given list of columns into a single vector column. OneHotEncoder maps a column of label indices to a column of binary vectors, allowing algorithms that expect continuous features, such as Logistic Regression, to use categorical features [9]. The next step is to feed the training set to the ML model and compute the confusion matrix, from which the obtained classification accuracy is derived. If the accuracy does not meet the requirements of the problem and the domain, the corresponding parameters of the ML model are adjusted.
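The three feature-preparation operations described above can be illustrated with a small library-free sketch. It is not the Spark implementation: the function names `string_index`, `one_hot`, and `assemble` and the toy columns are hypothetical, and the Spark transformers offer additional options that are not reproduced here.

```python
from collections import Counter


def string_index(values):
    """Map each distinct string to an index; the most frequent label gets 0."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: i for i, label in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


def one_hot(indices, num_categories):
    """Turn each label index into a binary vector of length num_categories."""
    return [[1.0 if i == idx else 0.0 for i in range(num_categories)]
            for idx in indices]


def assemble(*columns):
    """Concatenate per-row values from several columns into one feature vector."""
    rows = []
    for parts in zip(*columns):
        vector = []
        for part in parts:
            vector.extend(part if isinstance(part, list) else [part])
        rows.append(vector)
    return rows


# Hypothetical toy columns, not taken from the paper's datasets.
colors = ["red", "blue", "red", "green"]
sizes = [1.5, 2.0, 0.5, 3.0]

color_idx, color_map = string_index(colors)
features = assemble(one_hot(color_idx, len(color_map)), sizes)
```

Each row of `features` now holds the binary encoding of the categorical column followed by the numeric column, which is the single-vector form that a learner such as Logistic Regression expects.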

Fig. 1. Workflow of conducting an ML model training

For machine learning beginners, it is often difficult to find the right estimator for a specific classification problem. Different estimators have their own strengths and weaknesses when dealing with different types of data. Figure 2 offers a rough guide to which estimators to use for classification.

Similarly, Fig. 3 shows which algorithms work best when performing regression. Both Figs. 2 and 3 are based on the scikit-learn algorithm cheat-sheet, which includes various ML techniques such as regression, clustering, and dimensionality reduction [4, 20].

Fig. 2. Classification map guide

Fig. 3. Regression map guide

3 The Experiments

The experiments discussed in this section demonstrate the typical process of training a big data ML model and evaluating the obtained accuracy in Databricks. Specifically, the experiments were designed using two very large datasets and three machine learning algorithms. The following sections provide the details of the utilized algorithms, the necessary configurations, and the analysis of the obtained results.

3.1 Machine Learning Algorithms

Bisecting K-means is an unsupervised learning algorithm that combines traditional K-means with hierarchical clustering. At each step, a cluster is split in two, and the split that yields the lowest aggregate Sum of Squared Errors (SSE) is kept. The key point is that Bisecting K-means is less likely than K-means to get stuck in a poor local optimum. Additionally, Bisecting K-means is reported to be much more efficient than basic K-means [14, 19].
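The splitting procedure can be sketched in plain Python as follows. This is a minimal illustration rather than the Databricks or Spark implementation: choosing the cluster with the largest SSE as the one to split, the number of trial splits, and the toy points are all assumptions made for the example.

```python
import random


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sse(points):
    """Sum of squared Euclidean distances from the points to their mean."""
    center = mean(points)
    return sum(sum((p - c) ** 2 for p, c in zip(pt, center)) for pt in points)


def two_means(points, rng, iters=20):
    """Plain 2-means on one cluster; returns the two resulting sub-clusters."""
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for pt in points:
            d = [sum((p - c) ** 2 for p, c in zip(pt, ctr)) for ctr in centers]
            groups[0 if d[0] <= d[1] else 1].append(pt)
        if not groups[0] or not groups[1]:
            break
        centers = [mean(groups[0]), mean(groups[1])]
    return groups


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) >= 2]
        if not candidates:
            break
        worst = max(candidates, key=sse)
        best = None
        # Try several random bisections and keep the one with the lowest SSE.
        for _ in range(trials):
            a, b = two_means(worst, rng)
            if not a or not b:
                continue
            cost = sse(a) + sse(b)
            if best is None or cost < best[0]:
                best = (cost, a, b)
        if best is None:
            break  # the cluster could not be split (e.g. identical points)
        clusters.remove(worst)
        clusters.extend([best[1], best[2]])
    return clusters


# Hypothetical toy data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (21, 0), (20, 1)]
clusters = bisecting_kmeans(points, k=3)
```

Because every step only runs 2-means on a single cluster, the work per split shrinks as the clusters get smaller, which is one intuition behind the efficiency claim.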

Back Propagation Neural Networks (BPNN) can learn complex nonlinear function mappings even with a large number of features. Decision Trees (DT), for their part, can provide a clear explanation of the learned decision process (the tree). Neural Networks, however, cannot provide any explanation for the extracted classification or regression model. They are in a sense “black boxes” because of the lack of transparency in the nature of the obtained results [16].

A Support Vector Machine (SVM) is a supervised machine learning model that separates categories with a line or, more generally, a hyperplane; with a kernel, it can also handle non-linear classification problems. It differs from other classification algorithms in that it chooses the decision boundary that maximizes the distance to the nearest data points of all the classes. Despite its advantages in classification accuracy, a kernelized SVM can perform inefficiently on large data sets. To improve the learner’s efficiency, techniques such as data normalization, Stochastic Gradient Descent (SGD), and kernel approximation can be applied.

When using the kernel trick, the size of the kernel matrix grows with the number of input data points. By contrast, with kernel approximation, the data points are projected into an approximate lower-dimensional feature space, which saves a lot of time since the matrix is now much smaller [2].
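One common form of kernel approximation is random Fourier features, the technique that scikit-learn's RBFSampler implements. The sketch below is a library-free illustration under assumed settings (the vectors, `gamma`, and the number of components are hypothetical): it maps each point to an explicit feature vector whose dot products approximate the RBF kernel, so no pairwise kernel matrix is needed.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def make_random_features(n_features, n_components, gamma, seed=0):
    """Draw the random projection that defines the approximate feature map."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * gamma)  # standard deviation of the random weights
    weights = [[rng.gauss(0.0, scale) for _ in range(n_features)]
               for _ in range(n_components)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    return weights, offsets


def transform(x, weights, offsets):
    """Map x to the explicit space whose dot product mimics the kernel."""
    norm = math.sqrt(2.0 / len(weights))
    return [norm * math.cos(sum(w * a for w, a in zip(row, x)) + b)
            for row, b in zip(weights, offsets)]


# Hypothetical toy vectors and settings.
x, y, gamma = [0.2, -0.4, 0.1], [0.0, 0.3, -0.5], 0.5
weights, offsets = make_random_features(n_features=3, n_components=2000,
                                        gamma=gamma)
zx, zy = transform(x, weights, offsets), transform(y, weights, offsets)

approx = sum(a * b for a, b in zip(zx, zy))  # dot product of mapped points
exact = rbf_kernel(x, y, gamma)              # value from the exact kernel
```

Once the data are mapped this way, a linear learner can be trained on the transformed features, and the cost grows with the number of components rather than with the square of the number of observations.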

In many ML implementations, Gradient Descent is applied to minimize the cost function of a learner. Stochastic Gradient Descent (SGD) is a stochastic approximation of gradient descent optimization. Compared with standard (batch) gradient descent, it randomly shuffles the data and then, instead of processing every training example before each update, adjusts the model in every iteration to fit just one example a little better [21].
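As a minimal sketch of this idea, the following plain-Python code trains a linear SVM by SGD on the regularized hinge loss, updating the model after every single example. The learning rate, regularization strength, number of epochs, and toy data are assumptions for illustration; this is not the scikit-learn SGDClassifier implementation.

```python
import random


def sgd_linear_svm(X, y, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Fit weights w and bias b by SGD on the regularized hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # The regularization term shrinks w slightly on every step.
            w = [wj * (1.0 - lr * reg) for wj in w]
            if margin < 1.0:
                # The example violates the margin: nudge the boundary for it.
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data with labels in {-1, +1}.
X = [[2.0, 2.0], [1.5, 2.5], [2.5, 1.0],
     [-2.0, -1.5], [-1.0, -2.5], [-2.5, -2.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = sgd_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

Each update touches only one example, so the cost of a pass over the data is linear in the number of examples, which is what makes the approach attractive for large datasets.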

In ML, DTs are primarily used for classification and regression. A DT is a predictive model that classifies the dataset by iteratively learning decision rules inferred from the input features. Each internal node represents a test on some input feature, and each leaf represents a classification result. DTs perform well with large data sets, as the cost of using the tree to make predictions is logarithmic in the number of data points used to train it [7].

Scikit-learn offers various classification metrics for model evaluation. In this work, both the confusion matrix and the classification report are used to compare the actual and the predicted outcomes. The classification report is a built-in tool that includes the main classification metrics: precision, recall, f1-score, and support (the number of actual instances of each class).
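To make these metrics concrete, the sketch below computes a confusion matrix and the per-class precision, recall, f1-score, and support in plain Python. It follows the convention that rows are actual classes and columns are predicted classes; the function names mirror the scikit-learn tools for readability, but the code is a simplified stand-in, and the labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def classification_report(matrix, labels):
    """Per-class precision, recall, f1-score, and support from the matrix."""
    report = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        support = sum(matrix[i])                         # row sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support}
    return report


# Hypothetical actual and predicted labels.
labels = ["a", "b", "c"]
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

matrix = confusion_matrix(y_true, y_pred, labels)
report = classification_report(matrix, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
```

The overall accuracy is the sum of the diagonal divided by the number of samples, while the off-diagonal entries show which classes are confused with one another.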

Both sets of experiments were conducted on Databricks clusters configured as shown below (Table 1):

Table 1. Cluster configurations

3.2 Experiment 1

The dataset used in the first experiment relates to individual household electric power consumption. It includes over two million measurements gathered in a house located in Sceaux (near Paris, France) between December 2006 and November 2010 (47 months), with a one-minute sampling rate [11]. The date and time columns were removed, since they had no relevance to the performance of the learning model; the remaining 7 features are shown in Table 2.

Table 2. Individual household power consumption features

The Bisecting K-means algorithm requires that the dataset has no missing values. Thus, data cleaning was performed to remove all instances containing nulls. When deploying the Bisecting K-means algorithm, one primary parameter needs to be adjusted: K, the desired number of clusters. The clustering results obtained for different values of K are shown in Table 3.

Table 3. Sum of squared errors of different values of K

The elbow method is the oldest visual method for determining the appropriate number of clusters in a data set. However, it sometimes produces a continuously descending curve, which makes the elbow point ambiguous [22]. Figure 4 is a line chart of the SSE for different values of K. Since the point at K = 7 most resembles an elbow, 7 was selected as the number of clusters.
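The elbow is usually read off the chart by eye, but a simple numerical heuristic can support the choice. The sketch below picks the K at which the SSE curve bends most sharply, measured by the largest second difference. Both the heuristic and the SSE values are assumptions for illustration; the values are made up and are not those of Table 3.

```python
def find_elbow(sse_by_k):
    """Return the K whose second difference of the SSE curve is largest."""
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    # The first and last K have no neighbor on one side, so they are skipped.
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = sse_by_k[prev_k] - 2 * sse_by_k[k] + sse_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical SSE values for K = 1..6 (not the paper's measurements).
sse_by_k = {1: 100, 2: 60, 3: 30, 4: 25, 5: 22, 6: 20}
elbow = find_elbow(sse_by_k)
```

When the curve descends smoothly, the second differences are all similar and the heuristic becomes as ambiguous as the visual inspection, so it is best treated as a sanity check rather than a definitive rule.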

Fig. 4. Elbow method

After clustering, classification was performed on the resulting dataset (7 classification labels) using the modified kernel SVM (approximated with RBFSampler and fed into SGDClassifier). SGDClassifier, which is part of the scikit-learn library, implements linear classifiers (SVM and logistic regression, among others) with SGD training [21].

In detail, the training data (80% of the total) was scaled to the range [−1, 1] in order to improve the SVM’s performance. RBFSampler was used to approximate the feature map of an RBF kernel, since the dataset was too large for traditional kernel SVM learning. Finally, the prepared data was fed to a linear SVM (with SGD), and learning was performed [23]. The classification accuracy obtained on a testing set of size 200,000, evaluated by a confusion matrix, is shown in Table 4.
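The scaling step can be sketched as follows. The sketch assumes a min-max rescaling whose parameters are learned on the training data only and then reused for the test data; the paper does not state which scaler was used in this experiment, and the function names and toy values are hypothetical.

```python
def fit_min_max(rows):
    """Learn the per-feature minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(rows, mins, maxs, low=-1.0, high=1.0):
    """Linearly map each feature from [min, max] to [low, high]."""
    scaled = []
    for row in rows:
        out = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # A constant feature has no range; map it to the lower bound.
            out.append(low if span == 0 else
                       low + (value - lo) * (high - low) / span)
        scaled.append(out)
    return scaled


# Hypothetical toy rows with two features on very different scales.
train = [[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]]
test = [[2.5, 25.0]]

mins, maxs = fit_min_max(train)
train_scaled = scale(train, mins, maxs)
test_scaled = scale(test, mins, maxs)  # reuse the training statistics
```

Reusing the training minimum and maximum for the test set keeps the two sets on the same scale and avoids leaking information from the test data into the model.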

Table 4. Confusion matrix

In comparison to the standard kernel SVM, the enhanced method took only 15.68 s on average for training and testing, whereas the former needed more than 3 h of training. All 5 runs, along with the time consumed, are recorded in Table 5.

Table 5. Enhanced SVM run-times

For visualization purposes, dimensionality reduction (from 7 features to 2) is essential; therefore, a 2-dimensional graph was drawn using PCA (Principal Component Analysis) [18]. As shown in Fig. 5, the resulting classification plot looks somewhat different from the original one because only 60% of the original information is retained after PCA [8].
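The share of information retained by PCA is commonly quantified as the fraction of total variance explained by the kept components. Assuming that this is what the retained percentage refers to (the text does not say so explicitly), the sketch below computes that fraction for the first two principal components using power iteration with deflation. The data are hypothetical, and a production implementation would normally rely on a linear algebra library instead.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the features (columns) of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(matrix, iters=500):
    """Power iteration for the dominant eigenvalue and eigenvector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                for i in range(d))
    return value, v


def explained_variance_ratio(rows, n_components=2):
    """Fraction of the total variance captured by each leading component."""
    cov = covariance_matrix(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    ratios = []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        ratios.append(value / total)
        # Deflate: remove the component just found before the next search.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return ratios


# Hypothetical toy data: two strongly correlated features plus a noisy one.
rows = [[1.0, 2.0, 0.1], [2.0, 4.1, 0.0], [3.0, 5.9, 0.2],
        [4.0, 8.2, 0.1], [5.0, 9.9, 0.0]]
ratios = explained_variance_ratio(rows, n_components=2)
retained = sum(ratios)
```

Under this interpretation, a retained share of 60% would mean that the two plotted components account for 60% of the total variance, with the remainder lost in the projection.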

Fig. 5. Scatter plot before (left) and after (right) classification

3.3 Experiment 2

The dataset used in the second experiment is the Physical Activity Monitoring dataset, which contains 3.8 million rows of data collected from 9 subjects wearing 3 inertial measurement units and a heart rate monitor [17]. The dataset’s null records were removed, and a total of 53 features were selected along with the classification outcome ActivityID, as shown in Table 6.

Table 6. Physical activity monitoring features

The unnecessary columns that had no relevance to the performance of the model (columns 14–17 in the IMU sensory data) were removed, and the remaining columns are shown in Table 7.

Table 7. IMU sensory data

Three different machine learning algorithms were trained on 80% of the overall dataset: SVM with SGDClassifier, Decision Tree and Neural Network. Again, data cleaning was conducted by dropping rows which contained null cells.

As in Experiment 1, SVM with SGDClassifier was applied instead of kernel SVM, and the confusion matrix computed on the 20% testing dataset is shown in Table 8. Because of limited space, only 8 of the 25 classes are shown here. The average training time of the enhanced SVM over 5 statistically independent runs was 6.60 min (Table 9).

Table 8. Confusion matrix of SVM with SGDClassifier
Table 9. SVM with SGDClassifier run-times

Since the obtained overall accuracy was only 63%, K-fold cross-validation was also used, but it resulted in 41% accuracy. The two other ML algorithms were applied to the same dataset as follows.
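For readers unfamiliar with K-fold validation, the sketch below shows how the sample indices are partitioned so that every sample is used for testing exactly once. It uses contiguous, unshuffled folds as a simplifying assumption; the value of K and whether shuffling was applied in the experiment are not stated, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Hypothetical setting: 10 samples split into 3 folds.
folds = list(k_fold_indices(n_samples=10, k=3))
```

The model is trained and evaluated once per fold, and the fold accuracies are averaged. With unshuffled folds, data that are ordered (for example by subject or by time) can produce folds that differ substantially from one another.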

Multi-layer Perceptron (MLP) is a built-in Neural Network model in the scikit-learn library, which trains using back propagation. The model optimizes the log-loss function using LBFGS or stochastic gradient descent (SGD) [15]. Before training the model, MinMaxScaler was used to rescale the features of the original dataset to a common range in order to speed up training. The prepared data was then fed into an MLP with two hidden layers, one with 30 neurons and the other with 6 neurons. The classification accuracy obtained on the 20% testing dataset, evaluated by a confusion matrix, is shown in Table 10. The average training time of the MLP over 5 statistically independent runs was 5.40 min (Table 11).

Table 10. Confusion matrix of MLP
Table 11. MLP run-times

The scikit-learn library also has a model for Decision Trees, called DecisionTreeClassifier, which is capable of performing both binary and multi-class classification. DecisionTreeClassifier takes two arrays as input: the features and the class labels of the training samples. Before fitting the training samples, the parameters “max_depth = 20, min_samples_split = 2, criterion = ‘entropy’” were set for high classification accuracy. The average training time of the DT over 5 statistically independent runs was 1.89 s (Table 13), which was the fastest among all three classification algorithms, and the resulting confusion matrix is shown in Table 12.
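With the entropy criterion, candidate splits are compared by how much they reduce the entropy of the class labels (the information gain). The sketch below illustrates that calculation in plain Python for a single binary split; the labels are hypothetical, and the code shows only the splitting criterion, not the full tree-building procedure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical activity labels at one node of the tree.
parent = ["walk", "walk", "run", "run"]

# A split that separates the classes perfectly versus one that does not help.
gain_perfect = information_gain(parent, ["walk", "walk"], ["run", "run"])
gain_useless = information_gain(parent, ["walk", "run"], ["walk", "run"])
```

At each node, the split with the highest gain is preferred, and the process repeats on the child nodes until a stopping rule such as the maximum depth or the minimum number of samples per split is reached.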

Table 12. Confusion matrix of decision tree
Table 13. Decision tree run-times

To compare, the Neural Network model was the slowest because the back propagation requires lengthy computations. Decision Tree performed well on the highly non-linear data, and it was the fastest to build and test. Table 14 summarizes the accuracy measures and time requirements for the three classifiers.

Table 14. Positive predictive value (PPV)

As shown in Table 14, the best accuracy was obtained by the decision tree learner, while the SVM with SGDClassifier produced the least favorable result. In terms of future investigations, another study will have to be conducted to determine the exact cause of the performance variation among the different ML techniques.

Finally, to provide a visualization aid to designers, PCA was used to reduce the dimensionality from 41 features to 2. This resulted in a 38% loss of the original information; the results are shown in Fig. 6.

Fig. 6. Scatter plot of decision tree prediction

4 Conclusions

This paper provided general guidelines for utilizing a variety of machine learning algorithms on the cloud computing platform Databricks. Visualization is an important means for users to understand the significance of the underlying data. Therefore, it was also demonstrated how graphical tools such as Tableau can be used to efficiently examine the results of classification or clustering. Dimensionality reduction techniques such as Principal Component Analysis (PCA), which help reduce the number of features in a learning experiment, were also discussed.

To demonstrate the utility of Databricks tools, two big data sets were used for performing clustering and classification. The first experiment compared regular kernel SVM with optimized kernel SVM, where there was a significant difference in classification performance. To highlight differences in obtained accuracies and time requirements of various algorithms, another very large dataset and three different supervised classification algorithms were used. The obtained results confirmed that the Decision Tree algorithm was significantly more accurate on highly non-linear data.

In order to better visualize high-dimensional data sets, the PCA technique was used to reduce the number of dimensions from 7 to 2. However, this dimension reduction resulted in a 40% loss of the original information. In terms of future improvements, the field of Descriptive Analytics offers other dimension reduction solutions, which are not necessarily limited to the 2 dimensions used in this work.

In terms of future directions, the field of descriptive analytics offers techniques that can help designers visualize high-dimensional data sets without resorting to dimensionality reduction methods such as PCA. Further, another possible investigation could examine the low classification accuracy obtained by the SVM on the Physical Activity Monitoring data set reported in the second experiment.