Full length article
Simulation and sensor data fusion for machine learning application

https://doi.org/10.1016/j.aei.2022.101600

Abstract

The performance of machine learning algorithms depends to a large extent on the amount and the quality of data available for training. Simulations are most often used as test-beds for assessing the performance of trained models in a simulated environment before deployment in the real world. They can also be used for data annotation, i.e., assigning labels to observed data, thus providing background knowledge for domain experts. We want to integrate this knowledge into the machine learning process and, at the same time, use the simulation as an additional data source. Therefore, we present a framework that allows for the combination of real-world observations and simulation data at two levels, namely the data level or the model level. At the data level, observations and simulation data are integrated to form an enriched data set for learning. At the model level, the models learned separately from observed and simulated data are combined using an ensemble technique. Based on the trade-off between model bias and variance, an automatic selection of the appropriate fusion level is proposed. Our framework is validated using two case studies of very different types. The first is an Industry 4.0 use case consisting of monitoring a milling process in real time. The second is an application in astroparticle physics for background suppression.

Introduction

The performance of machine learning depends to a great extent on the quality and the quantity of data available for training [1]. Since large data sets are most often required for training, the fusion of data sets from many sources can be helpful, but also challenging [2]. The problems of duplicate detection [3], schema matching [4], and conflict resolution [3], [5] have to be solved when integrating several databases, especially when they contain heterogeneous types of data. Integrating data from many sensors in order to obtain a comprehensive data set is widely studied in multi-sensor fusion, see, e.g., [6], [7]. With the Internet of Things, a multitude of distributed data sets are available for learning [8]. There, the different sensors measure different features of the same event, so that a new combined feature set can be created for training a model [8].
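The combination of per-sensor features into one feature set can be sketched as a join on a shared event key. This is a minimal stand-in, not the paper's implementation; the sensor names, feature names, and values are illustrative assumptions.

```python
# Hypothetical sensor records keyed by a shared event id; the sensor and
# feature names (vib_rms, ac_energy) are illustrative, not from the paper.
vibration = {1: {"vib_rms": 0.12}, 2: {"vib_rms": 0.34}, 3: {"vib_rms": 0.29}}
acoustic = {1: {"ac_energy": 5.1}, 2: {"ac_energy": 7.8}, 3: {"ac_energy": 6.4}}

# Each sensor measures different features of the same event, so joining on
# the shared key yields one combined feature vector per event for training.
fused = {eid: {**vibration[eid], **acoustic[eid]}
         for eid in vibration.keys() & acoustic.keys()}
print(sorted(fused[1].keys()))  # ['ac_energy', 'vib_rms']
```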

More recently, generative neural network models have been used to construct synthetic data sets [9]. In this paper, however, we focus on synthetic data from simulations. Simulations are most often considered the ground truth, because they are based on theoretical knowledge of the particular domain. Hence, they are used for testing learned models or for annotating data. Simulations serve as experiments that offer scientists the chance to study a range of phenomena in a structured way. This is standard procedure in computer science (e.g., [10]) as well as in engineering [11] and physics [12], to name but a few. Many investigations of multi-sensor fusion have used simulation environments only for testing their frameworks [13], [14]. Process simulations are established instruments for investigating processes in a virtual environment prior to deployment [15], [16]. More recently, simulations have been conceived as powerful data generators, providing promising opportunities for simulation data mining [17], [18]. However, simulations have many limitations in practice. They cannot provide a completely accurate representation of reality in real time [19] and usually have only limited prediction accuracy when modeling complex relationships [20].

In contrast, machine learning models can be applied in real time and offer the opportunity to predict events based on the analysis of a set of explanatory variables. Therefore, new trends aim at replacing simulation models with surrogate machine learning models that have been trained on simulation data [21]. Some recent work focuses on learning from simulation data to monitor the real-world process and predict upcoming unknown events with reasonable accuracy [18], [22], [16]. However, none of these approaches used simulation data to enrich the sensor data for a given learning task. In this paper, we present a framework that formally validates the use of simulation as a synthetic data generator by checking certain conditions, namely completeness, conciseness, and correctness. It also suggests solving possible data mismatches between sensors and simulation using machine learning-based methods. Additionally, our framework challenges the typical data and model handling for machine learning in that it allows for more flexibility in the combination of sensor data and simulation. In particular, it automatically selects between two different integration levels.

    Data-level fusion

    The integration of observations and simulation is not restricted to the raw data of observations. Instead, a common representation for both is created and then enhanced by feature extraction, generation and selection. The performance of the model which is trained on these enhanced data is the criterion that guides the feature engineering.
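Data-level fusion reduces, in its simplest form, to appending both sources in a common representation before any model is trained. The following stdlib-only sketch is illustrative; the sample counts, feature dimension, and random features are assumptions, not the paper's data.

```python
import random

random.seed(0)

# Illustrative stand-ins: a few labelled sensor observations and a larger
# batch of simulated samples in the same (already aligned) feature space.
X_obs = [[random.gauss(0, 1) for _ in range(4)] for _ in range(50)]
y_obs = [random.randint(0, 1) for _ in range(50)]
X_sim = [[random.gauss(0, 1) for _ in range(4)] for _ in range(500)]
y_sim = [random.randint(0, 1) for _ in range(500)]

# Data-level fusion: append both sources into one enriched training set,
# on which a single model can then be trained; feature extraction,
# generation and selection would operate on this combined representation.
X_train = X_obs + X_sim
y_train = y_obs + y_sim
print(len(X_train), len(X_train[0]), len(y_train))  # 550 4 550
```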

    Model-level fusion

    The simulation and the sensor data can be used independently for training single models, one for each data set. The resulting models may then be combined to output the prediction. Weighting the two models adapts the framework to the application at hand.
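A weighted combination of the two single models' outputs can be sketched as follows. The probability values and the weight are illustrative assumptions; the paper's concrete ensemble technique may differ.

```python
def fuse_predictions(p_obs, p_sim, w=0.5):
    """Model-level fusion: weighted average of the outputs of a model
    trained on observations and one trained on simulation data. The
    weight w adapts the ensemble to the application at hand."""
    return [w * a + (1.0 - w) * b for a, b in zip(p_obs, p_sim)]

# Hypothetical class-probability outputs of the two single models.
p_obs = [0.9, 0.2, 0.6]
p_sim = [0.7, 0.4, 0.4]
print([round(p, 2) for p in fuse_predictions(p_obs, p_sim, w=0.75)])
# [0.85, 0.25, 0.55]
```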

    Level choice decision

    Moreover, the framework automatically decides which level to use, based on a derived threshold for the covariance of the single models. The criterion states that the variance-based error is reduced by moving from the data level to the model level if the covariance is lower than the threshold.
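The paper derives the concrete threshold; as a rough numeric illustration one can use the elementary identity Var((f1 + f2)/2) = (Var f1 + Var f2 + 2 Cov)/4 and read off a covariance bound below which the equally weighted two-model ensemble has lower variance than the better single model. All numbers below are made up for illustration.

```python
# Illustrative per-model prediction variances and covariance, e.g. as
# estimated on a validation set; not values from the paper.
var_obs, var_sim = 0.20, 0.30
cov = 0.04

# Variance of the equally weighted two-model ensemble:
# Var((f1 + f2)/2) = (var_obs + var_sim + 2*cov) / 4
var_ensemble = (var_obs + var_sim + 2 * cov) / 4

# One plausible reading of the criterion: choose model-level fusion when
# the ensemble variance undercuts the better single model, i.e. when
# cov < (4*min_var - var_obs - var_sim) / 2.
threshold = (4 * min(var_obs, var_sim) - var_obs - var_sim) / 2
use_model_level = cov < threshold
print(round(var_ensemble, 3), round(threshold, 3), use_model_level)
# 0.145 0.15 True
```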

The paper first gives an overview of related work in data fusion in Section 2. Conditions for the use of simulation as a valid data source are detailed in Section 3. The novel framework, indicating the steps and methods of simulation-sensor fusion, including data mismatch evaluation and resolution as well as integration level selection, is presented in Section 4. The real-world case studies are then described as instances of the framework in Section 5. The learning performance with and without the integration of simulation and sensor data is shown in the experiments, and the automatic fusion level choice is evaluated.

Section snippets

Related work

Data fusion is an important phase in the KDD (Knowledge Discovery in Databases) process: it creates enriched data from a multitude of sources that can be queried, searched, mined and analyzed to discover new, interesting and useful patterns. Data collected from multiple sources can be fused at different levels [6]. The most basic fusion level is to append data at the raw level by fusing the raw data acquired from the different sources directly [23], [24]. For data which are used …

Simulation as data generator

In addition to their major use as virtual testbeds [46], [13], [47], simulations are considered as data sources [18], [22], [16]. However, generating data with simulations is a challenging process [18], [22], [16]. First, simulations are models of reality and do not perfectly mirror the exact real-world situation. Even though it is possible to rely on simulation models in many real-world settings, the mismatch between simulation and the real-world situation has to be solved in …

Simulation-sensor data fusion framework

The proposed fusion framework is shown in Fig. 1.

The data acquisition process takes raw sensor data and acquires from the simulation the adequate data form, depending on the usage scenario of the simulation as data generator. More details are provided in Section 3.1. Then, the quality of all sensor and simulation data is checked according to three criteria detailed in Section 3.2. The simulation-sensor data mismatch is solved first. Feature engineering is used for enhancing the data quality by …

Industry 4.0 use case: Milling process

In mechanical engineering, milling is one of the most important machining operations, with a wide variety of application use cases, e.g., the machining of structural components for the aerospace industry. Milling belongs to the cutting operations with geometrically defined cutting edges. The material removal is performed by superimposing a rotational movement about the tool rotation axis and a translational movement of the tool relative to the workpiece. Fig. 2 depicts an example of a pocket milling …

Conclusion

The increasing availability of a wide range of data generators, using measurement technologies such as sensors, physical models such as simulations and generative machine learning models such as GANs, among others, offers ever-growing amounts of data. Hence, a real-world entity can be described by more than one data source and with a wide variety of data types. Since both data quantity and quality matter for the success of machine learning algorithms, efficiently integrating data …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 and the Federal Ministry of Education and Research of Germany as part of the competence center for machine learning ML2R (01IS18038A).

References (90)

  • Shao, Y., et al., A machine learning based global simulation data mining approach for efficient design changes, Adv. Eng. Softw. (2018)
  • Govekar, E., et al., On stability and dynamics of milling at small radial immersion, CIRP Ann. (2005)
  • Kuljanic, E., et al., TWEM, a method based on cutting forces—Monitoring tool wear in face milling, Int. J. Mach. Tools Manuf. (2005)
  • de Aguiar, M.M., et al., Correlating surface roughness, tool wear and tool vibration in the milling process of hardened steel using long slender tools, Int. J. Mach. Tools Manuf. (2013)
  • Cus, F., et al., An intelligent system for monitoring and optimization of ball-end milling process, J. Mater Process. Technol. (2006)
  • Krause, M., et al., Improved γ/hadron separation for the detection of faint γ-ray sources using boosted decision trees, Astropart. Phys. (2017)
  • Albert, J., Implementation of the random forest method for the imaging atmospheric Cherenkov telescope MAGIC, Nucl. Instrum. Methods Phys. Res. A (2008)
  • Deng, J., et al., Imagenet: A large-scale hierarchical image database
  • Lahat, D., et al., Multimodal data fusion: An overview of methods, challenges, and prospects, Proc. IEEE (2015)
  • Dong, X.L., et al., Data fusion: Resolving data conflicts for integration, Proc. VLDB Endow. (2009)
  • Bernstein, P.A., et al., Generic schema matching, ten years later, Proc. VLDB Endow. (2011)
  • Naumann, F., et al., Data fusion in three steps: Resolving inconsistencies at schema-, tuple-, and value-level, IEEE Data Eng. Bull. (2006)
  • Pires, I., et al., From data acquisition to data fusion: A comprehensive review and a roadmap for the identification of activities of daily living using mobile devices, Sensors (2016)
  • Stolpe, M., et al., Distributed support vector machines: An overview
  • Goodfellow, I., et al., Generative adversarial nets
  • Grulich, P.M., et al., Generating reproducible out-of-order data streams
  • Biland, A., Calibration and performance of the photon sensor response of FACT — The first G-APD Cherenkov telescope, J. Instrum. (2014)
  • Fischer, Y., et al., Object-oriented sensor data fusion for wide maritime surveillance
  • Herpel, T., et al., Multi-sensor data fusion in automotive applications
  • Cao, B.-T., et al., A hybrid RNN-GPOD surrogate model for real-time settlement predictions in mechanised tunnelling, Adv. Model. Simul. Eng. Sci. (2016)
  • Bunse, M., et al., Towards active simulation data mining
  • Fischer, C., Runtime and accuracy issues in three-dimensional finite element simulation of machining, Int. J. Mach. Mach. Mater. (2009)
  • Meschke, G., et al., Big data and simulation: A new approach for real-time TBM steering
  • Varshney, P.K., Multisensor data fusion, Electron. Commun. Eng. J. (1997)
  • Hall, D.L., et al., An introduction to multisensor data fusion, Proc. IEEE (1997)
  • Esteban, J., et al., A review of data fusion models and architectures: Towards engineering guidelines, Neural Comput. Appl. (2005)
  • Liu, C., et al., A sensor fusion and support vector machine based approach for recognition of complex machining conditions, J. Intell. Manuf. (2018)
  • Waske, B., et al., Fusion of support vector machines for classification of multisensor data, IEEE Trans. Geosci. Remote Sens. (2007)
  • Paradis, S., Roy, J., Treurniet, W., Integration of all data fusion levels using a blackboard architecture, in: ...
  • Stolpe, M., The Internet of Things: Opportunities and challenges for distributed data analysis, SIGKDD Explor. (2016)
  • Wang, F., et al., Research on software architecture of prognostics and health management system for civil aircraft
  • Fang, X., et al., Scalable prognostic models for large-scale condition monitoring applications, IISE Trans. (2017)
  • Lahat, D., et al., An alternative proof for the identifiability of independent vector analysis using second order statistics