Full length article
Simulation and sensor data fusion for machine learning application

https://doi.org/10.1016/j.aei.2022.101600

Abstract

The performance of machine learning algorithms depends to a large extent on the amount and the quality of data available for training. Simulations are most often used as test-beds for assessing the performance of trained models in a simulated environment before deployment in the real world. They can also be used for data annotation, i.e., assigning labels to observed data, thus providing background knowledge for domain experts. We want to integrate this knowledge into the machine learning process and, at the same time, use the simulation as an additional data source. Therefore, we present a framework that allows for the combination of real-world observations and simulation data at two levels, namely the data level or the model level. At the data level, observations and simulation data are integrated to form an enriched data set for learning. At the model level, the models learned separately from observed and simulated data are combined using an ensemble technique. Based on the trade-off between model bias and variance, an automatic selection of the appropriate fusion level is proposed. Our framework is validated using two case studies of very different types. The first is an Industry 4.0 use case consisting of monitoring a milling process in real time. The second is an application in astroparticle physics for background suppression.

Introduction

The performance of machine learning depends to a great extent on the quality and the quantity of data available for training [1]. Since large data sets are most often required for training, the fusion of data sets from many sources can be helpful, but also challenging [2]. The problems of duplicate detection [3], schema matching [4], and conflict resolution [3], [5] have to be solved when integrating several databases, especially when they contain heterogeneous types of data. Integrating data from many sensors in order to obtain a comprehensive data set is widely studied in multi-sensor fusion, see, e.g., [6], [7]. With the Internet of Things, a multitude of distributed data sets are available for learning [8]. There, the different sensors measure different features of the same event, so that a new combined feature set can be created for training a model [8].
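The combination of per-sensor features into one feature set can be sketched as a join on a shared event key. This is a minimal stand-in, not the paper's implementation; the sensor names, feature names, and values are illustrative assumptions.

```python
# Hypothetical sensor records keyed by a shared event id; the sensor and
# feature names (vib_rms, ac_energy) are illustrative, not from the paper.
vibration = {1: {"vib_rms": 0.12}, 2: {"vib_rms": 0.34}, 3: {"vib_rms": 0.29}}
acoustic = {1: {"ac_energy": 5.1}, 2: {"ac_energy": 7.8}, 3: {"ac_energy": 6.4}}

# Each sensor measures different features of the same event, so joining on
# the shared key yields one combined feature vector per event for training.
fused = {eid: {**vibration[eid], **acoustic[eid]}
         for eid in vibration.keys() & acoustic.keys()}
print(sorted(fused[1].keys()))  # ['ac_energy', 'vib_rms']
```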

More recently, generative neural network models have been used to construct synthetic data sets [9]. In this paper, however, we focus on synthetic data from simulations. Simulations are most often considered the ground truth, because they are based on theoretical knowledge of the particular domain. Hence, they are used for testing learned models or for annotating data. Simulations serve as experiments that offer scientists the chance to study a range of phenomena in a structured way. This is standard procedure in computer science (e.g., [10]) as well as in engineering [11] and physics [12], to name but a few. Many investigations of multi-sensor fusion have used simulation environments only for testing their frameworks [13], [14]. Process simulations are established instruments for investigating processes in a virtual environment prior to deployment [15], [16]. More recently, simulations have been conceived as powerful data generators, providing promising opportunities for simulation data mining [17], [18]. However, simulations have many limitations in practice. They cannot provide a completely accurate representation of reality in real time [19] and usually have only limited prediction accuracy when modeling complex relationships [20].

In contrast, machine learning models can be applied in real time and offer the opportunity to predict events based on the analysis of a set of explanatory variables. Therefore, new trends aim at replacing simulation models with surrogate machine learning models that have been trained on simulation data [21]. Some recent work focuses on learning from simulation data to monitor the real-world process and predict upcoming unknown events with reasonable accuracy [18], [22], [16]. However, none of these approaches used simulation data to enrich the sensor data for a given learning task. In this paper, we present a framework that formally validates the use of simulation as a synthetic data generator by checking certain conditions, namely completeness, conciseness, and correctness. It also suggests solving possible data mismatches between sensors and simulation using machine learning-based methods. Additionally, our framework challenges the typical data and model handling for machine learning in that it allows for more flexibility in the combination of sensor data and simulation. In particular, it automatically selects between two different integration levels.

    Data-level fusion

    The integration of observations and simulation is not restricted to the raw data of observations. Instead, a common representation for both is created and then enhanced by feature extraction, generation and selection. The performance of the model which is trained on these enhanced data is the criterion that guides the feature engineering.
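Data-level fusion reduces, in its simplest form, to appending both sources in a common representation before any model is trained. The following stdlib-only sketch is illustrative; the sample counts, feature dimension, and random features are assumptions, not the paper's data.

```python
import random

random.seed(0)

# Illustrative stand-ins: a few labelled sensor observations and a larger
# batch of simulated samples in the same (already aligned) feature space.
X_obs = [[random.gauss(0, 1) for _ in range(4)] for _ in range(50)]
y_obs = [random.randint(0, 1) for _ in range(50)]
X_sim = [[random.gauss(0, 1) for _ in range(4)] for _ in range(500)]
y_sim = [random.randint(0, 1) for _ in range(500)]

# Data-level fusion: append both sources into one enriched training set,
# on which a single model can then be trained; feature extraction,
# generation and selection would operate on this combined representation.
X_train = X_obs + X_sim
y_train = y_obs + y_sim
print(len(X_train), len(X_train[0]), len(y_train))  # 550 4 550
```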

    Model-level fusion

    The simulation and the sensor data can be used independently for training single models, one for each data set. The resulting models may then be combined to output the prediction. Weighting the two models adapts the framework to the application at hand.
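A weighted combination of the two single models' outputs can be sketched as follows. The probability values and the weight are illustrative assumptions; the paper's concrete ensemble technique may differ.

```python
def fuse_predictions(p_obs, p_sim, w=0.5):
    """Model-level fusion: weighted average of the outputs of a model
    trained on observations and one trained on simulation data. The
    weight w adapts the ensemble to the application at hand."""
    return [w * a + (1.0 - w) * b for a, b in zip(p_obs, p_sim)]

# Hypothetical class-probability outputs of the two single models.
p_obs = [0.9, 0.2, 0.6]
p_sim = [0.7, 0.4, 0.4]
print([round(p, 2) for p in fuse_predictions(p_obs, p_sim, w=0.75)])
# [0.85, 0.25, 0.55]
```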

    Level choice decision

    Moreover, the framework automatically decides which level to use, based on a derived threshold for the covariance of the single models. The criterion states that the variance-based error is reduced by moving from the data level to the model level if the covariance is lower than the threshold.
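The paper derives the concrete threshold; as a rough numeric illustration one can use the elementary identity Var((f1 + f2)/2) = (Var f1 + Var f2 + 2 Cov)/4 and read off a covariance bound below which the equally weighted two-model ensemble has lower variance than the better single model. All numbers below are made up for illustration.

```python
# Illustrative per-model prediction variances and covariance, e.g. as
# estimated on a validation set; not values from the paper.
var_obs, var_sim = 0.20, 0.30
cov = 0.04

# Variance of the equally weighted two-model ensemble:
# Var((f1 + f2)/2) = (var_obs + var_sim + 2*cov) / 4
var_ensemble = (var_obs + var_sim + 2 * cov) / 4

# One plausible reading of the criterion: choose model-level fusion when
# the ensemble variance undercuts the better single model, i.e. when
# cov < (4*min_var - var_obs - var_sim) / 2.
threshold = (4 * min(var_obs, var_sim) - var_obs - var_sim) / 2
use_model_level = cov < threshold
print(round(var_ensemble, 3), round(threshold, 3), use_model_level)
# 0.145 0.15 True
```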

The paper first gives an overview of related work in data fusion in Section 2. Conditions for the use of simulation as a valid data source are detailed in Section 3. The novel framework, indicating the steps and methods of simulation-sensor fusion, including data mismatch evaluation and resolution as well as integration level selection, is presented in Section 4. The real-world case studies are then described as instances of the framework in Section 5. The learning performance with and without the integration of simulation and sensor data is shown in the experiments, and the automatic fusion level choice is evaluated.

Section snippets

Related work

Data fusion is an important phase in the KDD (Knowledge Discovery in Databases) process: it creates enriched data from a multitude of sources that can be queried, searched, mined and analyzed to discover new, interesting and useful patterns. Data collected from multiple sources can be fused at different levels [6]. The most basic fusion level is to append data at the raw level by fusing the raw data acquired from the different sources directly [23], [24]. For data which are used …

Simulation as data generator

In addition to their major use as virtual testbeds [46], [13], [47], simulations are considered as data sources [18], [22], [16]. However, generating data with simulations is a challenging process [18], [22], [16]. First, simulations are models of reality and do not perfectly mirror the exact real-world situation. Even though it is possible to rely on simulation models in many real-world settings, the mismatch between simulation and the real-world situation has to be solved in …

Simulation-sensor data fusion framework

The proposed fusion framework is shown in Fig. 1.

The data acquisition process takes raw sensor data and acquires from the simulation the adequate data form, depending on the usage scenario of the simulation as data generator. More details are provided in Section 3.1. Then, the quality of all sensor and simulation data is checked according to three criteria detailed in Section 3.2. The simulation-sensor data mismatch is solved first. Feature engineering is used for enhancing the data quality by …

Industry 4.0 use case: Milling process

In mechanical engineering, milling is one of the most important machining operations, with a wide variety of application use cases, e.g., the machining of structural components for the aerospace industry. Milling belongs to the cutting operations with geometrically defined cutting edges. The material removal is performed by superimposing a rotational movement about the tool rotation axis and a translational movement of the tool relative to the workpiece. Fig. 2 depicts an example of a pocket milling …

Conclusion

The increasing availability of a wide range of data generators, using measurement technologies such as sensors, physical models such as simulations and generative machine learning models such as GANs, among others, offers ever-growing amounts of data. Hence, a real-world entity can be described by more than one data source and with a wide variety of data types. Since both data quantity and quality matter for the success of machine learning algorithms, efficiently integrating data …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 and the Federal Ministry of Education and Research of Germany as part of the competence center for machine learning ML2R (01IS18038A).

References (90)

  • Shao, Y., et al., A machine learning based global simulation data mining approach for efficient design changes, Adv. Eng. Softw. (2018)
  • Govekar, E., et al., On stability and dynamics of milling at small radial immersion, CIRP Ann. (2005)
  • Kuljanic, E., et al., TWEM, a method based on cutting forces—Monitoring tool wear in face milling, Int. J. Mach. Tools Manuf. (2005)
  • de Aguiar, M.M., et al., Correlating surface roughness, tool wear and tool vibration in the milling process of hardened steel using long slender tools, Int. J. Mach. Tools Manuf. (2013)
  • Cus, F., et al., An intelligent system for monitoring and optimization of ball-end milling process, J. Mater Process. Technol. (2006)
  • Krause, M., et al., Improved γ/hadron separation for the detection of faint γ-ray sources using boosted decision trees, Astropart. Phys. (2017)
  • Albert, J., Implementation of the random forest method for the imaging atmospheric Cherenkov telescope MAGIC, Nucl. Instrum. Methods Phys. Res. A (2008)
  • Deng, J., et al., Imagenet: A large-scale hierarchical image database
  • Lahat, D., et al., Multimodal data fusion: An overview of methods, challenges, and prospects, Proc. IEEE (2015)
  • Dong, X.L., et al., Data fusion: Resolving data conflicts for integration, Proc. VLDB Endow. (2009)
  • Bernstein, P.A., et al., Generic schema matching, ten years later, Proc. VLDB Endow. (2011)
  • Naumann, F., et al., Data fusion in three steps: Resolving inconsistencies at schema-, tuple-, and value-level, IEEE Data Eng. Bull. (2006)
  • Pires, I., et al., From data acquisition to data fusion: A comprehensive review and a roadmap for the identification of activities of daily living using mobile devices, Sensors (2016)
  • Stolpe, M., et al., Distributed support vector machines: An overview
  • Goodfellow, I., et al., Generative adversarial nets
  • Grulich, P.M., et al., Generating reproducible out-of-order data streams
  • Biland, A., Calibration and performance of the photon sensor response of FACT — The first G-APD Cherenkov telescope, J. Instrum. (2014)
  • Fischer, Y., et al., Object-oriented sensor data fusion for wide maritime surveillance
  • Herpel, T., et al., Multi-sensor data fusion in automotive applications
  • Cao, B.-T., et al., A hybrid RNN-GPOD surrogate model for real-time settlement predictions in mechanised tunnelling, Adv. Model. Simul. Eng. Sci. (2016)
  • Bunse, M., et al., Towards active simulation data mining
  • Fischer, C., Runtime and accuracy issues in three-dimensional finite element simulation of machining, Int. J. Mach. Mach. Mater. (2009)
  • Meschke, G., et al., Big data and simulation: A new approach for real-time TBM steering
  • Varshney, P.K., Multisensor data fusion, Electron. Commun. Eng. J. (1997)
  • Hall, D.L., et al., An introduction to multisensor data fusion, Proc. IEEE (1997)
  • Esteban, J., et al., A review of data fusion models and architectures: Towards engineering guidelines, Neural Comput. Appl. (2005)
  • Liu, C., et al., A sensor fusion and support vector machine based approach for recognition of complex machining conditions, J. Intell. Manuf. (2018)
  • Waske, B., et al., Fusion of support vector machines for classification of multisensor data, IEEE Trans. Geosci. Remote Sens. (2007)
  • Paradis, S., Roy, J., Treurniet, W., Integration of all data fusion levels using a blackboard architecture, in: ...
  • Stolpe, M., The Internet of Things: Opportunities and challenges for distributed data analysis, SIGKDD Explor. (2016)
  • Wang, F., et al., Research on software architecture of prognostics and health management system for civil aircraft
  • Fang, X., et al., Scalable prognostic models for large-scale condition monitoring applications, IISE Trans. (2017)
  • Lahat, D., et al., An alternative proof for the identifiability of independent vector analysis using second order statistics