
1 Introduction

Lack of trust in big data (analytics) is an important issue in industry according to studies from companies such as KPMG [20]. In fact, a study [21] found that just one-third of CEOs trust data analytics. Trust has also been recognized by academia as an important concern for the success of analytics [1].

Trust is the willingness to rely on another [8]. Multiple factors, ranging from more subjective reasons such as an individual's predisposition to more objective reasons, underlie the formation of trust [34]. Here we focus on providing objective reasons for increasing trust, namely knowledge-based reasons and reasons based on interaction, e.g. the possibility of verification.

Trust in the field of big data is often related to data quality [13], which can be increased by means of data quality assurance processes. There has been less emphasis on the relation of trust and models or computation [29], e.g. the impact of data changes on deployed models. In this work, we elaborate on trust in the outcomes of big data analytics. Outcomes are of course impacted by the nature of the data but also by the model chosen during the analysis process. Big data is often characterized using 4Vs [7]: (i) volume relating to the large amount of data; (ii) velocity expressing the speed of data processing (often in the form of streams); (iii) variety highlighting the diversity of data types covering both unstructured and structured data; and (iv) veracity emphasizing data uncertainty and imprecision. In particular, veracity is associated with credibility, truthfulness and objectivity as well as trust [25]. Uncertainty can stem from data inconsistencies and incompleteness, ambiguities, latency, deception and model approximations [7]. Uncertainty also has a time dimension, meaning that data distributions are not necessarily stable over time. Concepts found in big data might change over time. For example, text from social media might show a strong variation of topics across time. Even the meaning of individual words might change over time or by region. Data might also be impacted by various sources of noise. For example, sensor data might vary due to environmental factors such as weather that are not recorded in the data. Furthermore, data might be created intentionally with the goal of deception. For instance, credit loan applications might contain forged data to obtain a loan. Veracity embraces the idea that despite poor data quality one might obtain valuable insights.

Therefore, to account for variation of data (and its quality), model assessment should include an analysis of the sensitivity of the proposed models with respect to veracity. How much will the model deteriorate if the data distribution changes? Is the model robust to random noise or carefully crafted deception attempts? To this end, we propose to compute metrics covering various aspects of veracity, such as the expected deviation of performance metrics of models due to noise. We also propose to adapt methods intended to handle deception and concept drift to compute validation metrics that allow one to assess a model's robustness to attacks and changes in concepts. An empirical evaluation using seven classifiers and four datasets helps deepen the understanding of the metrics and provides detailed guidelines for practitioners. Whereas our noise metrics enhance those by Garcia et al. [14], our other metrics are novel to the best of our knowledge.

To counteract mistrust, opening up the black box of analytics such as machine learning techniques is one way [20]. Many models such as deep learning have a reputation of being uninterpretable. These models are often very complex, encoding knowledge using millions of parameters, and are only poorly understood conceptually [15]. Despite being seemingly uninterpretable, deep learning, for instance, has been regarded as a breakthrough technology [18]. Deep neural nets do not just outperform other, more interpretable techniques by large margins but sometimes even humans [15]. Therefore, relying on simpler but more intuitive models might come with a significant degradation in desired performance. Fortunately, the boom of big data and analytics and the surge in performance have also led to an increased effort in providing better intuition on how such models work [19, 22, 24, 28, 36]. Though some ideas for classifier explanation date back almost 20 years, the field is currently undergoing rapid evolution with many potential approaches for interpretation emerging. In this paper, we provide a brief overview of existing work on model interpretation. Model interpretation might be a suitable means to assess whether decision making complies with policies or laws, e.g. whether it is fair and ethical [16]. For example, European legislation has granted individuals a right to explanation with respect to algorithmic decisions. Thus, interpretability might not just increase trust but might even be a legal requirement.

Trust in a model might also be improved if the model can be used to solve other, related tasks. For example, one might expect that a tennis player also plays table tennis better than the average person, since both sports require hitting balls with a racket. In a machine learning context, an image classifier that is able to distinguish between bicycles and cats should also be able to distinguish between motorbikes and dogs. The idea of transferability of outcomes is well known for establishing trust in qualitative research [31]. It might refer to the idea that the findings from a small population under study can be transferred to a wider population. The idea of transferability of results encoded in models (and their parameters) is captured by transfer learning [27]. Roughly, transfer learning for a classifier might also relate to the scientific concept of external validity, i.e. content validity, which assesses to what extent a measure represents all meanings of a given construct [3]. Some classifiers might be seen as learning features (measures). Transfer learning assesses the usefulness of these features (measures) in other contexts. Though the idea of transfer learning has been around for a long time, to the best of our knowledge using it as an operationalization for assessing validity in the context of analytics is novel.

The paper is structured as follows: First, we discuss general aspects of input sensitivity in Sect. 2, outlining key ideas for metrics for different changes of inputs (and outputs). The state of the art in black-box model interpretability and transfer learning as well as their relevance to model validation is discussed in Sects. 3 and 4. Section 5 presents a detailed implementation and evaluation of some of the measures.

2 Validation Measure: Input Sensitivity

Robustness refers to the ability to tolerate perturbations that might affect a system's performance. Sensitivity analysis is a common tool to assess robustness with respect to input changes. The main goal is often to assess how the change of certain (input) parameters impacts the stability of outcomes. In control theory, if minor changes in the input lead to large changes in the output, a feedback system is deemed to be unstable [2]. In the context of machine learning, robustness is often related to over- and underfitting. Overfitting refers to the situation where a machine learning algorithm mimics the input data too closely and fails to abstract from unnecessary details in the data. Underfitting covers the opposite effect, where a model is too general and does not cover important characteristics of the given input data [10]. Measuring robustness is typically done by partitioning the entire data set into a training and a validation set using techniques such as cross-validation. The former data set is used to learn a model, i.e. infer model parameters. The latter is used to assess model performance on novel, unseen data, e.g. by computing metrics such as classification accuracy. This split of the dataset supports the choice of a model that balances between over- and underfitting. However, model assessments based on the standard splitting procedure into training and validation sets do not answer the question of how the model behaves given changes to the input data. In the context of big data analytics and supervised learning, one can distinguish between robustness with respect to label perturbations and attribute perturbations (a minimal sketch of the cross-validated baseline that the perturbation metrics below compare against is given after the following list). The source of the perturbation can be classified as:

  • Malicious, deceptive: Data might be crafted by an attacker as an attempt to trick the system into making wrong decisions. The current data set used to create the model might not contain any form of such deceptive data. Therefore, the exact nature of the deceptive data might be unknown. This is analogous to conventional cyber security risks on computer systems, e.g. zero-day attacks [6], where a system is confronted with an unexpected attack that exploits a previously unknown security hole, i.e. vulnerability. However, even if the general nature of attacks or data manipulation is known, preventing such attacks might be difficult. Deep learning applied to image recognition has been shown to be susceptible to such attacks, as illustrated in Fig. 1. Such data might sometimes appear as outliers, since they are often distinct from the majority of the input data.

    Fig. 1.

    The left shows the original, correctly classified image. The center shows the perturbation that is added to the original image to yield the image on the right, which is visually identical for a human but is nevertheless classified as an ostrich [32].

  • Concept drift: This refers to a change (over time) of the relationship between input and output values in unforeseen ways. This might lead to a deterioration of predictive performance. Reasons for concept drift might be unknown hidden contexts or unpredicted events. For example, a user's interest in online news might change, while the distribution of incoming news documents remains the same [12].

  • Noise: There are a variety of noise sources that might impact data samples. Noise might also change over time. For example, classification labels might be noisy, e.g. due to human errors. Sensor values could exhibit random fluctuations visible as white noise in the data. None of these sources necessarily alters the underlying relationship between input and output.
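As a point of reference for the perturbation-based metrics in the following subsections, a cross-validated performance estimate on the unmodified data can be computed. The following is a minimal sketch using scikit-learn; the dataset and classifier are illustrative stand-ins.

```python
# Minimal sketch of the standard robustness baseline the perturbation metrics
# build on: a cross-validated accuracy estimate on the unmodified data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
print(f"accuracy: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The perturbation-specific metrics in Sects. 2.1 to 2.3 compare such a baseline against the performance obtained on perturbed variants of the data.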

2.1 Measuring Robustness Against Concept Drift

Concept drift refers to the phenomenon that the relation between input and output changes over time. Changes can occur suddenly or gradually. A simple strategy to cope with concept drift is to periodically retrain the model on new training data. This corresponds to forgetting all prior data. However, it could be that forgetting old data is either not feasible or not desirable, e.g. since novel data might only be unlabeled and relatively sparse.

While some models can adapt to drastic concept drifts, others might only handle minor changes well, and some models might deteriorate considerably even for small changes in the relationship between input and output. Therefore, it is important to assess the robustness with respect to such changes. One can distinguish three types of drift [12]. Two of them are illustrated in Fig. 2. (Real) concept drift refers to changes of the conditional distribution of the output given the input, while the input distribution might remain the same. Population drift refers to a change of the population from which samples are drawn. Virtual drift corresponds to a change of the distribution from which samples are drawn but not of the conditional distribution of the output given the input.

Fig. 2.

Illustration of concept drift: (a) shows the original data, (b) real concept drift due to changes of the conditional distribution, and (c) virtual drift where the input distribution changes but not the conditional distribution of the output given the input

Our goal is to assess the robustness of a model against concept drift. To this end, we propose to derive multiple data sets from the original data set based on the assumed changes over time, e.g. sudden, incremental and gradual. These data sets can be used to measure the difference in performance between the original data and data where concepts have changed. A minimal sketch of this procedure is given below; a detailed case study will be given in Sect. 5.
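The sketch below simulates a sudden (real) concept drift by inverting the labels in one region of the input space and reports the relative deterioration in accuracy. The drift scenario, dataset and classifier are illustrative assumptions; in practice the simulated drift should reflect domain knowledge about plausible changes.

```python
# Minimal sketch: estimating sensitivity to a sudden (real) concept drift.
# Assumption: the drift is simulated by inverting labels in one region of
# the input space, so P(y|x) changes while P(x) stays the same.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def apply_sudden_drift(X, y, feature=0):
    """Invert the labels of all samples where the chosen feature exceeds its median."""
    y_drift = y.copy()
    region = X[:, feature] > np.median(X[:, feature])
    y_drift[region] = 1 - y_drift[region]
    return y_drift

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

acc_orig = accuracy_score(y_te, clf.predict(X_te))
acc_drift = accuracy_score(apply_sudden_drift(X_te, y_te), clf.predict(X_te))

# Relative deterioration serves as a simple drift-sensitivity metric.
drift_sensitivity = (acc_orig - acc_drift) / acc_drift
print(f"accuracy original={acc_orig:.3f}, drifted={acc_drift:.3f}, "
      f"sensitivity={drift_sensitivity:.3f}")
```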

2.2 Measuring Robustness Against Noise

Noise in the data can also cause a deterioration of model performance. Noise can be handled during the preprocessing phase, e.g. using filtering algorithms. Noisy data can also directly be used for training [14]. Noise might impact labels and attributes. It has been shown [37] that (in general) attribute noise is more harmful than label noise and that noise on attributes strongly correlating with labels has more impact than noise on weakly correlating attributes.

Frénay and Verleysen [11] provide the following taxonomy of (label) noise:

  • Noisy completely at random model (NCAR): In a setting with binary labels, an error occurs uniformly at random, independent of all data, i.e. class labels and/or attributes are altered with a constant error probability.

  • Noisy at random model (NAR): This model includes error dependencies on the class label but does not account for dependencies on the input data. In the case of binary labels, NAR equals NCAR.

  • Noisy not at random model (NNAR): This is the most general noise model. It captures the dependency of the error on the inputs. For example, one can account for the frequently observed phenomenon that label error rates are larger near the decision boundary.

To assess the robustness against noise, the first step is to create multiple data sets from the original data through perturbation according to the above noise models. As suggested in [14], one can compute a measure for the expected deterioration of the classifier. We also suggest computing the standard deviation. This provides a coarse estimate of the risk that, for a particular noisy dataset, the deterioration is significantly higher than its expectation. A minimal sketch of the noise models is given below; a detailed case study will be given in Sect. 5.
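The following sketch illustrates how the three label-noise models could be instantiated for binary labels. The function names and the flip probabilities are our own illustrative choices, not prescribed by the taxonomy.

```python
# Minimal sketch of the three label-noise models (binary labels assumed).
# Function names and flip probabilities are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def ncar_noise(y, p=0.25):
    """NCAR: flip any label with the same constant probability."""
    flip = rng.random(len(y)) < p
    return np.where(flip, 1 - y, y)

def nar_noise(y, p_per_class={0: 0.1, 1: 0.4}):
    """NAR: the flip probability depends on the class label only."""
    p = np.vectorize(p_per_class.get)(y)
    flip = rng.random(len(y)) < p
    return np.where(flip, 1 - y, y)

def nnar_noise(X, y, decision_function, p_near=0.4, p_far=0.05):
    """NNAR: the flip probability depends on the inputs, here higher
    near the decision boundary (small absolute decision score)."""
    score = np.abs(decision_function(X))
    p = np.where(score < np.median(score), p_near, p_far)
    flip = rng.random(len(y)) < p
    return np.where(flip, 1 - y, y)
```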

2.3 Measuring Robustness Against Malicious Data

Big data is often associated with large economic value. Systems based on big data might make decisions that can have a profound impact on individuals. For that reason, these individuals have an interest in manipulating the big data system. For example, systems evaluating credit loan applications decide whether a customer gets a loan. Face recognition such as Facebook's DeepFace using deep learning [33] might not only tag users but also be used to identify criminals. Thus, a party might conduct attacks by injecting tampered data. Such attacks can be categorized along three dimensions [4]:

  • The influence dimension characterizes the degree of control of the malicious party. Causative attacks assume some control of the attacker over the training data. An attacker attempts to manipulate learning itself. Exploratory attacks only exploit misclassifications but do not alter training.

  • Security violations can occur in the form of reduced availability of the system. For example, denial of service attacks often rely on forging false positive data. Integrity attacks target assets using false negatives, e.g. by obfuscating spam messages.

  • Specificity describes the scope of the attack ranging from a single instance to a wide class of instances.

For example, an exploratory attack might seek to trick a spam filter by crafting a message that appears benign to a static spam filter that does not evolve over time. If an attacker successfully poisons a mail inbox with millions of messages, he might render the system temporarily unusable (denial of service). An attacker might also send a single message that is classified as a false positive multiple times.

Perfect security might be impossible to achieve, e.g. due to the possibility of unforeseen attacks. Thus, quantifying the exact level of security on a finite scale might be very difficult. Still, metrics can support the assessment of the resilience towards (some) attacks. Our proposed metrics are rooted in two existing strategies to handle and identify malicious data [4]:

i) General robustness measures from statistics might limit the impact that a small fraction of training data can have. The idea rests on the assumption that the majority of the training data is correct and only a small amount of training data stems from adversaries. Furthermore, these malicious data are distinguishable from benign data. In scenarios where most labels are incorrect, training is generally very difficult. Two common concepts used in safeguarding algorithms against attacks are the breakdown point and the influence function [26]. They could both be used to assess the robustness of the trained model to attacks. Loosely speaking, the breakdown point (BP) measures the amount of contamination (proportion of atypical points) that the data may contain up to which the learned parameters still contain information about the distribution of benign data. The influence function describes the asymptotic behavior of the sensitivity curve when the data contains a small fraction of outliers. The idea is to create (artificial) malicious data and compute the breakdown point and the influence function. A large breakdown point and a small sensitivity to outliers expressed by the influence function indicate robustness.

ii) The Reject on Negative Impact (RONI) defense disregards single training examples if they profoundly increase the number of misclassifications. To this end, two classifiers are trained. One classifier is trained on a base training set and another one on the same base training set enhanced with the training example under consideration. If the accuracy of the classifier including the examined sample is lower, the sample is removed from the training data.

To compute a metric, the idea is to first identify the samples that would be rejected. Then the performance metric can be computed twice: once on training data that contains all samples that would be rejected and once on training data without those samples. The deterioration of the classification accuracy serves as a measure of the resilience to malicious data. A minimal sketch is given below; a detailed case study will be given in Sect. 5.
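The following sketch illustrates one way such a RONI-based metric could be computed with scikit-learn. The helper names (roni_suspects, roni_resilience), the tolerance parameter and the use of accuracy are our own illustrative assumptions.

```python
# Minimal sketch of a RONI-style resilience metric. Helper names and the
# tolerance are illustrative; any performance metric could replace accuracy.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def roni_suspects(clf, X_tr, y_tr, X_val, y_val, base_idx, cand_idx, tol=0.0):
    """Flag candidate training samples (cand_idx) whose addition to the base
    set (base_idx) lowers validation accuracy - the RONI criterion."""
    base_acc = accuracy_score(
        y_val, clone(clf).fit(X_tr[base_idx], y_tr[base_idx]).predict(X_val))
    suspects = []
    for i in cand_idx:
        idx = np.append(base_idx, i)
        acc = accuracy_score(
            y_val, clone(clf).fit(X_tr[idx], y_tr[idx]).predict(X_val))
        if acc < base_acc - tol:
            suspects.append(i)
    return suspects

def roni_resilience(clf, X_tr, y_tr, X_te, y_te, suspects):
    """Relative accuracy deterioration when suspect samples remain in training."""
    keep = np.setdiff1d(np.arange(len(X_tr)), suspects)
    acc_with = accuracy_score(y_te, clone(clf).fit(X_tr, y_tr).predict(X_te))
    acc_without = accuracy_score(
        y_te, clone(clf).fit(X_tr[keep], y_tr[keep]).predict(X_te))
    return (acc_without - acc_with) / acc_with
```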

3 Validation: Model Interpretability

Interpretability is seen as a prerequisite for trust [28]. In research, internal validity serves as a source of trust in a scientific study. Internal validation is a vital part of verifying that a research study was done right. It highlights the extent to which the ideas about cause and effect are supported by the study [3]. However, supervised learning is optimized towards finding associations between the target and the input variables, not towards causality. For instance, the common fallacy of neglecting a hidden factor that explains both input and output can also occur in supervised learning. Still, associations might at least provide some indication of whether certain assumptions about causality hold. Associations could also be deemed unintuitive or unlikely, discrediting the learnt model. Models might not only be used to make decisions (e.g. decide to which class a data sample belongs), but they might also provide information to humans for their own decision making. This information might be based on an interpretation of a model's decision. If this interpretation matches the intuition of a human decision maker, it might increase trust.

Interpretable models can be characterized by the following properties [23]:

  • Post-hoc interpretability refers to interpreting decisions based on natural language explanations, visualizations or examples. Using separate processes for decision making and explaining is similar to humans, where the two processes might also be distinct. Post-hoc interpretations are also most suitable to provide intuition on complex models such as deep learning that (at the current stage of research) are deemed non-transparent.

  • Transparency: A transparent model allows one to understand its inner workings, in contrast to a black-box model. Transparency might be looked at in three different ways: (i) in terms of simulatability, suggesting that the model should be simple enough so that a person can contemplate the entire model at once; (ii) in terms of decomposability, referring to the idea that each aspect of the model has an intuitive explanation; (iii) with respect to algorithmic transparency, relating to the learning mechanism and covering aspects such as being able to prove convergence of the algorithm.

There exist multiple strategies for explaining models [17]. Examples of transparent models that are deemed relatively easy to interpret include decision trees, rules and linear models. One strategy to explain black-box models involves approximating the complex model by one of these three simpler and more transparent models. For example, neural networks can be explained using decision trees [19, 22]. Rules might also be extracted from neural networks [36]. There are also methods that are agnostic to the classification method, such as Generalized Additive Models (GAMs), which estimate the contribution of individual features to a decision [24]. A minimal sketch of such a global surrogate is given below.
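The sketch below approximates a black-box classifier (an MLP) by a shallow decision tree trained on the black box's predictions and reports the surrogate's fidelity. The dataset, architecture and tree depth are illustrative assumptions.

```python
# Minimal sketch of a global surrogate: approximate a black-box model
# (here an MLP) with a shallow decision tree trained on its predictions.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

black_box = MLPClassifier(hidden_layer_sizes=(30, 30), max_iter=500,
                          random_state=0).fit(X_tr, y_tr)

# The surrogate learns to mimic the black box, not the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_tr, black_box.predict(X_tr))

# Fidelity: how often the surrogate agrees with the black box on unseen data.
fidelity = accuracy_score(black_box.predict(X_te), surrogate.predict(X_te))
print(f"surrogate fidelity: {fidelity:.3f}")
print(export_text(surrogate))
```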

Rather than explaining the entire model, the focus can be on explaining individual outcomes, i.e. predictions. For example, for deep learning algorithms, saliency masks show areas of attention. Figure 3 shows a saliency mask for an image [9].

Fig. 3.

Learnt saliency mask (left) for an image of the class “flute” shown on the right [9]

Multiple agnostic methods exist to explain predictions. An example is LIME [28], which uses local approximations based on randomly generated samples near the sample whose outcome is to be explained. A simplified sketch of this idea is given below.
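The following is a minimal, self-contained illustration of the local-approximation idea for a binary classifier, not the LIME library itself: the instance is perturbed, the perturbed samples are weighted by proximity, and a weighted linear model is fitted whose coefficients serve as the explanation. All parameter choices are illustrative.

```python
# Minimal LIME-style local explanation: perturb the instance, weight the
# perturbed samples by proximity, and fit a weighted linear model.
# This is a simplified illustration of the idea, not the LIME library.
import numpy as np
from sklearn.linear_model import Ridge

def explain_locally(predict_proba, x, n_samples=1000, scale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Perturb the instance with Gaussian noise around x.
    Z = x + rng.normal(0.0, scale, size=(n_samples, len(x)))
    # Query the black box for the probability of the positive class (binary case).
    target = predict_proba(Z)[:, 1]
    # Weight samples by an exponential kernel on the distance to x.
    weights = np.exp(-np.linalg.norm(Z - x, axis=1) ** 2 / (2 * scale ** 2))
    # The coefficients of the weighted linear model explain the local behaviour.
    local_model = Ridge(alpha=1.0).fit(Z, target, sample_weight=weights)
    return local_model.coef_
```

For example, explain_locally(black_box.predict_proba, X_te[0]) would return one coefficient per feature for the surrogate example above.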

Black-box inspection deals with the goal of understanding how the black box works or why it prefers certain decisions over others. For instance, one might measure the impact of features on the outcomes using sensitivity analysis, as sketched below.
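A simple, model-agnostic instance of such a sensitivity analysis is permutation-based: each feature is shuffled in turn and the resulting drop in accuracy indicates its influence. The sketch below is an illustrative implementation; it can be applied to any fitted classifier, e.g. the black_box model from the surrogate sketch above.

```python
# Illustrative sketch of black-box inspection via permutation-based
# sensitivity analysis: shuffle one feature at a time and measure the
# drop in accuracy. Works with any fitted classifier.
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_sensitivity(model, X, y, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    base_acc = accuracy_score(y, model.predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        accs = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])        # destroy the feature-target link
            accs.append(accuracy_score(y, model.predict(X_perm)))
        drops[j] = base_acc - np.mean(accs)  # larger drop = more influential
    return drops

# e.g. permutation_sensitivity(black_box, X_te, y_te) for the surrogate example above
```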

All these approaches might be used to improve model interpretability and, therefore, trust. Since the field witnesses very rapid progress, better tool support can be expected in the near future.

4 Validation: Model Transferability

The idea of model transferability is to assess a model designed for one problem based on its ability to solve other, related tasks. This relates to the idea of external validity in research, e.g. in the social sciences. External validity describes whether the results of a study can be used for related populations. One would expect this to be the case; thus, while transferability is not a requirement for trust, demonstrating transferability can increase trust.

Conventional machine learning is characterized by solving a single task where training and test data have the same distribution. In transfer learning, one considers a source and a target domain (distribution) which might differ. In homogeneous transfer learning, the two domains have the same input space but different labels. In heterogeneous transfer learning, the input spaces of the source and target domain are different. For illustration, one strategy for transfer learning with deep networks is to train a network on the source dataset (potentially even in an unsupervised manner). Then one might keep only the lower layers that capture general features and remove the higher layers that might be too specific to the source dataset. Finally, the model is trained (additionally) using the target dataset [5]. This is referred to as parameter sharing. However, there is a multitude of transfer learning strategies [27, 35]. These are often specific to certain machine learning techniques.

To assess transferability quantitatively, a first step is to identify suitable tasks and datasets to which a model can be transferred. Then the method of transfer has to be defined. For many classifiers there exist variants that are tailored towards transfer learning [27, 35]. The final step is to compute the gain in performance of a model using transfer learning compared to a model that is only trained on the target domain data without any form of transfer. A minimal sketch of such a transferability metric is given below.
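The sketch below compares a classifier trained only on scarce target data with one that reuses a representation learnt on a larger source domain; the difference in accuracy serves as the transferability gain. Reusing the hidden activations of a scikit-learn MLP is a simple stand-in for the layer-freezing strategy described above, and the two synthetic tasks are illustrative, so the gain may well be negative when the tasks are in fact unrelated.

```python
# Minimal sketch of a transferability metric: accuracy gain of reusing a
# source-domain representation over a target-only baseline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Source domain: plenty of labelled data; target domain: few labelled samples.
X_src, y_src = make_classification(n_samples=10000, n_features=20, random_state=0)
X_tgt, y_tgt = make_classification(n_samples=400, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_tgt, y_tgt, test_size=0.5, random_state=0)

# Baseline: train on the target data alone.
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc_base = accuracy_score(y_te, baseline.predict(X_te))

# Transfer: learn a representation on the source and reuse its hidden
# activations as features for the target task (a simple form of parameter sharing).
source_net = MLPClassifier(hidden_layer_sizes=(30, 30), max_iter=500,
                           random_state=0).fit(X_src, y_src)

def hidden_features(net, X):
    """Forward pass through the hidden (ReLU) layers of a fitted sklearn MLP."""
    H = X
    for W, b in zip(net.coefs_[:-1], net.intercepts_[:-1]):
        H = np.maximum(H @ W + b, 0)
    return H

transfer = LogisticRegression(max_iter=1000).fit(hidden_features(source_net, X_tr), y_tr)
acc_transfer = accuracy_score(y_te, transfer.predict(hidden_features(source_net, X_te)))

# A positive gain indicates that knowledge from the source domain transfers.
print(f"target-only={acc_base:.3f}, with transfer={acc_transfer:.3f}, "
      f"gain={acc_transfer - acc_base:.3f}")
```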

5 Implementation and Evaluation of Measures

In this section we show how robustness against noise (Sect. 2.2) can be specified in detail, implemented and interpreted. Detailed assessments of the other measures are deferred to future work. We use several datasets and classifiers. Our performance metric, which should be maximized, is accuracy. Other metrics such as the F1-score might be used in the same manner.

5.1 Datasets and Classifiers

We conducted the experimental evaluation involving four datasets (MNIST, Artificial, Spambase, Phoneme) and seven classifiers (KNN, SVM, logistic regression, random forests, neural networks, decision trees, Naïve Bayes). The four considered datasets are frequently used within the data mining community. We used 5000 samples of the MNIST database of handwritten digits, each consisting of 16 × 16 pixels. The Spambase collection comprises about 4000 e-mails (spam and non-spam) with 57 attributes. The artificial dataset of 20000 samples was generated using the make_classification function from the scikit-learn package. The phoneme dataset consists of 5 real-valued features and 5404 samples. SVM was implemented based on [30]. All other algorithms stem from the scikit-learn package version 0.19.1 implemented in Python 2.7.12. We relied largely on default parameters and did not perform any parameter tuning. Thus, there was no need for a validation set. We used 5-KNN, a neural network with two hidden layers of 30 neurons each, a random forest with 50 trees and multinomial Naïve Bayes using 100 bins for discretizing continuous values. A sketch of this setup is given below.
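A minimal sketch of this setup in current scikit-learn is given below. Parameters not stated in the text (e.g. the discretization strategy for Naïve Bayes, or replacing the SVM implementation of [30] by scikit-learn's SVC) are our own assumptions.

```python
# Sketch of the classifier setup described above (scikit-learn; parameters
# beyond those stated in the text are assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Artificial dataset with 20000 samples, as stated in the text.
X, y = make_classification(n_samples=20000, random_state=0)

classifiers = {
    "5-KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),  # stand-in for the implementation of [30]
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(30, 30)),
    "Decision Tree": DecisionTreeClassifier(),
    # Multinomial Naive Bayes requires non-negative counts, hence 100 bins.
    "Multinomial NB": make_pipeline(
        KBinsDiscretizer(n_bins=100, encode="ordinal", strategy="uniform"),
        MultinomialNB()),
}
```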

5.2 Implementation of Robustness Against Noise

The detailed steps to compute the metrics capturing robustness against noise, which were briefly described in Sect. 2.2, are as follows:

  (1)

    Choose a noise model: An analyst should choose an adequate noise model with the right amount of detail, i.e. an analyst might incorporate detailed assumptions on how noise impacts individual attributes or rely on general models. The latter requires substantially less effort but probably leads to less accurate estimates.

    We use the NCAR noise model described in Sect. 2.2. We consider two scenarios, one for label noise and one for attribute noise. To cover label noise, we exchange the label of a data point with the label of another, randomly chosen sample with a probability of 0.25. Thus, an exchange does not necessarily imply that the label is altered, i.e. the chosen sample for exchange might have the same label. Therefore, it is unclear how many samples will obtain incorrect labels. This is a disadvantage of the method that could of course be addressed. However, requiring that every sample obtains a different label than it originally had might lead to pathologies: for a binary classification task, samples from class A would become class B and samples from class B would become class A, i.e. the dataset would not change from the point of view of the classification algorithm. Compared to choosing a class uniformly at random from all classes, our permutation approach has the advantage that the overall number of labels of each class stays the same. In turn, imbalanced datasets remain imbalanced and balanced datasets remain balanced. This allows for an easier comparison of certain performance metrics such as accuracy. The choice of the probability is arbitrary to a certain degree. The probability should be larger than the standard deviation of the performance metric (using cross-validation) on the unmodified data. Otherwise it might be difficult to reliably judge the sensitivity to noise for individual classifiers as well as to conduct a meaningful comparison among different classifiers. A very large probability, indicating dramatic changes and essentially saying that (new) data will have almost no relation to the past, might be unrealistic. Attribute noise for a continuous attribute was computed by adding Gaussian noise with half the standard deviation of all instances of the attribute. Again, the choice of half the standard deviation is somewhat arbitrary. The same reasoning as for label noise applies with respect to avoiding extreme values, i.e. close to 0 or 1. Gaussian noise is a common model occurring frequently in engineering, but depending on the scenario other models might be more appropriate.

  (2)

    Create multiple datasets with noise:

    Based on the original dataset, multiple noisy datasets are computed; in our case we computed 10. The more datasets, the more reliable the estimates.

  (3)

    The performance metric is computed on all datasets:

    We computed the accuracy on seven different classifiers and four datasets using all 10 noisy datasets and the original datasets. We employed cross-validation with 10 folds.

  (4)

    Computation of the robustness metrics:

    We computed the mean and standard deviation of the accuracy over all folds of all datasets, i.e. the original dataset (DO), the datasets with label noise (DL) and those with attribute noise (DA). Denote the average values of the accuracy computed using cross-validation for each dataset and each classifier separately as E[DO], E[DL] and E[DA] and the standard deviations as std[DO], std[DL] and std[DA]. We computed the sensitivity SE as the relative change of each noise model compared to the original dataset, SE(X) = (O − X)/X, where O denotes the corresponding value on the original dataset. For example, the sensitivity with respect to attribute noise becomes (a sketch of the full computation is given after this list):

    • SE(E[DA]) = (E[DO] − E[DA])/E[DA]

    • SE(std[DA]) = (std[DO] − std[DA])/std[DA]

    Garcia et al. [14] also computed the sensitivity SE of the averages but did not include a study of the standard deviations, and their model for creating noisy data was different. Their work also did not discuss other aspects such as concept drift or transferability.

    The sensitivity of the average values indicates how much the accuracy is impacted by the noise on average. One expects this value to be in [0,1]. Zero indicates absolute resilience against noise. One indicates a very strong impact of noise. Generally, the value will be less than one, since random guessing achieves a non-zero accuracy, e.g. E[DA] will generally not be zero. Negative values are possible. They indicate that noise improved the classifier. An improvement could stem from a very unlikely coincidence (the noise simplified the problem by chance) or it might hint that the classifier is far from optimal, so that noise improves its performance. The sensitivity of the standard deviation indicates how much the accuracy fluctuates for noisy data relative to the original data. A value close to zero means that we can expect the same absolute deviation of the classification accuracy as for the original dataset. A positive value means that the deviation is smaller for the noisy dataset than for the original dataset and a negative value indicates that it is larger.
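The following sketch implements steps (1) to (4) for a single classifier and dataset, following the choices stated above (label exchange with probability 0.25, Gaussian attribute noise with half the attribute's standard deviation, 10 noisy datasets, 10-fold cross-validation). The function names are ours, and minor details, such as applying the attribute noise to all attributes, are simplifying assumptions.

```python
# Sketch of the full noise-sensitivity pipeline (steps 1-4) for one
# classifier and one dataset, following the description above.
import numpy as np
from sklearn.model_selection import cross_val_score

def label_exchange_noise(y, p=0.25, rng=None):
    """Exchange each label with that of a randomly chosen sample with prob. p."""
    rng = rng or np.random.default_rng()
    y_noisy = y.copy()
    for i in np.where(rng.random(len(y)) < p)[0]:
        j = rng.integers(len(y))
        y_noisy[i], y_noisy[j] = y_noisy[j], y_noisy[i]
    return y_noisy

def attribute_gauss_noise(X, factor=0.5, rng=None):
    """Add Gaussian noise with half the standard deviation of each attribute."""
    rng = rng or np.random.default_rng()
    return X + rng.normal(0.0, factor * X.std(axis=0), size=X.shape)

def noise_sensitivity(clf, X, y, n_noisy=10, folds=10, seed=0):
    rng = np.random.default_rng(seed)
    acc_o = cross_val_score(clf, X, y, cv=folds)           # original data
    acc_l, acc_a = [], []
    for _ in range(n_noisy):
        acc_l.extend(cross_val_score(clf, X, label_exchange_noise(y, rng=rng), cv=folds))
        acc_a.extend(cross_val_score(clf, attribute_gauss_noise(X, rng=rng), y, cv=folds))
    se = lambda o, x: (o - x) / x                           # SE(X) = (O - X) / X
    return {
        "E[DO]": np.mean(acc_o), "std[DO]": np.std(acc_o),
        "SE(E[DL])": se(np.mean(acc_o), np.mean(acc_l)),
        "SE(std[DL])": se(np.std(acc_o), np.std(acc_l)),
        "SE(E[DA])": se(np.mean(acc_o), np.mean(acc_a)),
        "SE(std[DA])": se(np.std(acc_o), np.std(acc_a)),
    }
```

Applied to one classifier and one dataset from Sect. 5.1, this yields the quantities of the corresponding entry in Table 1.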

5.3 Results

Overall our findings suggest that no algorithm outperforms all others on all sensitivity metrics. Given the no-free-lunch theorem, stating that no classifier dominates all others, this is expected. Good performance on traditional measures such as accuracy often does not imply good performance with respect to sensitivity. For instance, we have identified scenarios where classifiers show similar accuracy but quite different behaviour with respect to noise sensitivity. Thus, sensitivity metrics should be taken into account in addition to other metrics when choosing a classifier.

Table 1 shows all metrics using the setup discussed in Sect. 5.2. The first column, E[DO], shows that classifiers perform differently with respect to the accuracy metric on the original datasets. Some models such as neural networks also show extremely poor behaviour in some cases (e.g. on the MNIST dataset). The suggested architecture of the neural network might not contain sufficient parameters to learn the task and thus underfit, or the learning rates might not be set optimally. It is known that more complex neural network architectures behave better on this dataset. No single classifier dominates on all benchmarks, though overall random forests perform best. The standard deviation of the accuracy, std[DO], in the second column is rather large for some classifiers on some datasets (e.g. 10% for SVMs on the artificial dataset) but generally below 3%.

Table 1. Validation metrics for attribute and label noise

Label Noise Sensitivity: The label noise sensitivity SE(E[DL]) also differs considerably among classifiers and datasets. For example, for the first two datasets Logistic Regression, SVMs and Multinomial Naïve Bayes are quite robust. On the third dataset, SVMs do not show good robustness against noise. Decision trees are most impacted by label noise on all datasets. Label noise on the MNIST dataset has a drastic impact on model performance, in particular on Logistic Regression and SVMs. A practical finding is that while Random Forests perform best on the Spambase data, they deteriorate more quickly with label noise than neural networks, which achieve comparable performance on the unmodified data. Thus, neural networks might be preferable under the assumption that label noise is likely to be a concern. The same reasoning applies even more strongly to logistic regression and SVMs on the MNIST dataset compared to random forests. Although all classifiers are heavily impacted by label noise on this dataset, the extreme degradation of the former two classifiers is surprising. It might hint at convergence problems or unsuitable parameter settings of the algorithms (in the presence of noise). The variation in outcomes in the presence of label noise (SE(std[DL])) is, for example, much larger for neural networks than for random forests on the artificial dataset, although otherwise the two methods behave almost identically (E[DO], std[DO] and SE(E[DL])). This suggests that random forests are a safer bet to ensure a certain level of model performance.

Attribute Noise Sensitivity: For attribute noise there seems to be a correlation with label noise, but there are several cases where attribute noise and label noise show different behaviour for the same classifier. For example, for Random Forests on the Spambase dataset the attribute noise sensitivity is among the highest, whereas the label noise sensitivity is comparable to that of most other methods. For the Phoneme dataset, Decision Trees and Random Forests behave comparably with respect to attribute noise, except for the variation in the attribute noise sensitivity SE(std[DA]): it decreases for random forests but increases for decision trees. Though the relative difference in SE(std[DA]) is considerable, overall it is not a key concern since the standard deviation of the accuracy on the original dataset, std[DO], is low.

6 Discussion and Conclusions

We have elaborated on the issue of trust in big data analytics. Based on various existing techniques from machine learning, we have derived a set of validation metrics that might give more transparency into how models react to changes in the data. They also show how models could be validated using transferability and explanation. Some of our proposals require significant effort to use, since they might differ for each classifier and dataset, e.g. transferability. Our current evaluation is also limited to noise robustness on a limited number of datasets. To provide more general conclusions on the behavior of classifiers across different datasets, more empirical evidence is needed. Furthermore, in general, it is very difficult to anticipate the exact behavior of concept drift or noise. Still, it is possible to use general assumptions on various future scenarios as an approximation. Those general assumptions lead to instantiations of the metrics that are easy to use, as our evaluation of noise sensitivity has shown. Thus, we believe that the validation metrics are not only of conceptual interest but also of practical value to increase trust in big data analytics.