1 Introduction

In the era of Industry 4.0, the ability to collect, analyze, and act on large volumes of data in real time is critical to maintaining competitiveness and fostering innovation. The fusion of spatial and temporal sensor data is emerging as a technological vanguard, promising to revolutionize how industries monitor and optimize their operational processes [1]. This work explores the effectiveness of spatial and temporal data fusion using deep learning to improve anomaly detection, predictive maintenance, and process optimization in Industry 4.0, addressing critical questions about improving model performance and the practical applicability of these techniques [2].

The relevance of this topic lies not only in its immediate applicability to critical industrial challenges but also in its potential to drive significant advances in operational efficiency and sustainability. Despite the growing accumulation of sensor data, many companies face difficulties interpreting this data in a way that generates real value [3]. This study addresses this gap, exploring how integrating spatial and temporal data can offer a deeper understanding of industrial processes, facilitating more informed and accurate decision-making [4].

Existing literature reveals a growing interest in data fusion and deep learning within Industry 4.0, although significant gaps remain in understanding how these technologies can be applied most effectively. Previous research has demonstrated the potential of spatial and temporal data to improve the precision of predictive models. However, there is a lack of consensus on best practices for integrating these types of data [5]. This study seeks to contribute to this emerging field by providing empirical evidence of the benefits of our data fusion methodology.

Our work employs a rigorous methodology, using convolutional neural networks (CNN) to analyze spatial data and recurrent neural networks (RNN) for temporal data before fusing these sources of information for model training using deep neural networks (DNN) [6, 7]. The choice of this methodology is justified by its ability to capture and analyze the complexity inherent in industrial data, allowing a richer and more nuanced interpretation than traditional techniques.

The results obtained from our study confirm initial expectations, showing significant progress in key performance metrics after implementing our data fusion methodology. Specifically, we have observed an increase in anomaly detection precision up to 92%, an extension of early detection time in predictive maintenance from 2 to 5 days, and an improvement in operational efficiency from 70% to 85%. These improvements underline the effectiveness of spatial and temporal data fusion in enriching analysis and prediction and highlight its practical applicability. The ability to more accurately identify abnormal conditions, anticipate maintenance needs, and optimize industrial processes confirms the value of our proposal, providing industries with advanced tools for more robust and accurate data-based decision-making.

This document is structured into several key sections to provide a comprehensive study overview. Section 2 reviews the related literature, highlighting previous studies and similar approaches in data fusion and deep learning applied to industrial environments. Section 3 presents the methodology, including the model design, data preprocessing details, and the optimization techniques implemented. Section 4 discusses the results obtained, analyzing the model’s performance in terms of accuracy, efficiency, and robustness. Section 5 addresses the study’s conclusions, summarizing key findings and proposing directions for future work. Finally, Section 6 includes the references used in this study.

2 Literature Review

In the era of Industry 4.0, the digitalization and automation of industrial processes have generated an unprecedented volume of sensor data, making data fusion critical to improving decision-making and operational efficiency [8]. DNNs are emerging as a promising solution, capable of extracting complex patterns and inferring from large volumes of heterogeneous data without manual feature engineering [9]. CNNs are effective in processing data with spatial structure and are applied in industry to analyze images on assembly lines and detect material defects through visual inspections. On the other hand, RNNs, especially those with long short-term memory (LSTM) units and gated recurrent units (GRU), are essential for handling sequential data, such as vibration and temperature time series, to predict machinery failures [10].

Integrating spatial and temporal data has been an area of growing interest. Recent studies have shown that the combination of CNN and RNN can significantly improve the performance of models in prediction and anomaly detection tasks [11]. For example, [12] used a similar architecture to predict failures in wind energy systems, obtaining a 20% improvement in accuracy compared to traditional methods. Another study by [13] implemented a combination of CNN and LSTM to monitor the health of aircraft engines, achieving a 30% reduction in false positives. These studies highlight the effectiveness of hybrid approaches in capturing complex patterns and improving fault prediction accuracy.

According to the literature review, it has been identified that temporal data, such as temperature, vibration, and sound measurements, are subjected to a min-max normalization to scale the data between 0 and 1. This ensures that the different magnitudes of the sensors do not bias the subsequent analysis [14]. For spatial data, data augmentation techniques are used, including rotation, rescaling, and translation, to improve the robustness of the model against variations in the data [15].

Additionally, the presence of biases in sensor data and during the model training process can significantly impact the fairness and accuracy of predictive models. To mitigate these biases, several works have implemented various strategies. Normalization and standardization of data are essential to ensure that different data magnitudes do not introduce bias [16]. Likewise, automatic anomaly detection using Isolation Forest and exponential smoothing helps identify and handle anomalous data that could distort model learning [17]. Data augmentation techniques are also used to balance the data and reduce bias in model training [18]. Furthermore, although this phase of the project covers only limited field testing in industrial settings, broader field tests and case studies will make it possible to assess the applicability and effectiveness of the framework under real-world conditions, providing additional data to improve and refine the model [19].

In addition to deep learning techniques, other data fusion and anomaly detection methods have been explored. Kalman filters and machine learning-based methods, such as Support Vector Machines (SVM) and Random Forests (RF), have been widely used in industry [20, 21]. Although these methods have proven useful, they have limitations in handling large volumes of complex, nonlinear data. For example, Kalman filters are efficient for linear systems, but their performance degrades in the presence of nonlinearities and non-Gaussian noise. Similarly, although SVM and RF are effective in data classification, their ability to capture temporal dependencies is limited.

Our work differs from these approaches by proposing a hybrid framework that combines CNN, RNN, and DNN for data fusion and anomaly detection in industrial environments. This framework allows for the integration of heterogeneous data, leveraging the strengths of each type of neural network to improve model accuracy and robustness. CNNs extract spatial features from image data, RNNs (including LSTM and GRU) capture temporal patterns in time series, and DNNs integrate these features to make accurate and robust predictions. This integrated approach improves anomaly detection accuracy and reduces processing time and the number of false positives.

The literature review highlights the need for advanced approaches to data fusion in Industry 4.0. Existing studies have demonstrated the potential of deep learning techniques to improve accuracy and efficiency in anomaly detection and predictive maintenance. However, there is a continued need to develop and validate new approaches that can handle the complexity and heterogeneity of industrial data. Our work contributes to this area by proposing a hybrid deep-learning framework that combines the strengths of CNNs, RNNs, and DNNs. It demonstrates its effectiveness in improving the accuracy and robustness of anomaly detection in industrial environments [22].

3 Materials and Methods

At the core of the transformation towards Industry 4.0 is the effective management and analysis of data from various sensors distributed throughout industrial processes. However, efficiently fusing this heterogeneous data to extract valuable information and make accurate predictions remains a significant challenge. This problem lies in the complexity of the data, which ranges from temporal to spatial structures, and in the need for models that can learn from this data effectively to improve decision-making and optimize industrial processes. To address this challenge, we have worked through every stage, from the selection and preprocessing of the data sets to the architecture of the deep learning models and their evaluation, demonstrating how this work addresses the current limitations in sensor data fusion and contributes to the evolution towards more intelligent and autonomous industrial systems.

3.1 Solution Proposal and Model Architecture

Existing solutions are often limited in their ability to adapt to data diversity and its intrinsic complexity [23]. In this work, we use a combination of CNN, RNN, and DNN to improve the integration and analysis of heterogeneous sensory data. CNNs extract spatial features from data, such as images and multidimensional measurements from sensors, capturing important patterns, such as textures and shapes. RNNs, including advanced variants such as LSTM and GRU, handle temporal sequences of data, capturing long-term dependencies essential to predict the evolution of operational states and anticipate possible failures [24]. Finally, DNNs integrate the information processed by CNNs and RNNs, combining these characteristics to make accurate predictions and generate detailed knowledge about the behavior and state of industrial systems.

Figure  1 illustrates the workflow for sensor data fusion and analysis using deep learning techniques. The process begins with collecting raw sensor data from various sources, including temperature, vibration, sound, and images. This data is first preprocessed to remove noise and normalize the values, ensuring consistency across different types of sensors. The preprocessed data is then split into two branches for further analysis: the CNN branch for spatial data and the RNN branch for temporal data. In the CNN branch, convolutional layers extract spatial features from the images, identifying essential patterns such as textures and shapes. These features are passed through pooling layers to reduce dimensionality while retaining critical information [25]. Simultaneously, the RNN branch processes temporal data using LSTM layers, which capture long-term dependencies and temporal patterns in the sensor readings. The output from the LSTM layers provides a comprehensive representation of the data’s temporal characteristics. The extracted spatial and temporal features are then integrated using a DNN. The DNN combines these features through dense layers, allowing the model to learn complex interactions between the spatial and temporal data. This integrated approach enhances the accuracy of predictions and the robustness of anomaly detection. The final output of the DNN is used to make informed decisions about the operational state of industrial systems, including anomaly detection, predictive maintenance, and process optimization. This workflow ensures a thorough sensor data analysis, leveraging the strengths of both CNN and RNN architectures to provide detailed insights into the systems’ behavior and state.

Fig. 1 Workflow for sensor data fusion and analysis using deep learning

We start by preprocessing raw data from a vast industrial sensor network. In this step, we apply data-cleaning techniques to remove inconsistencies and outliers. We use normalization to scale the data to a standard range, critical for comparisons and subsequent analysis. In some instances where the data requires it, we apply data augmentation techniques to improve the robustness of our models [26]. These techniques include rotations and rescaling of images, as well as the generation of synthetic time series using bootstrapping techniques to strengthen the reliability of our predictions.

Subsequently, the prepared data are classified according to their intrinsic nature. For data with spatial features, such as images or multidimensional sensor measurements, we employ CNNs. These networks specialize in extracting and learning critical spatial patterns, such as textures and shapes, from data, using techniques such as convolution and pooling to reduce dimensionality while preserving essential features [27].

In parallel, data with a temporal dimension, such as sensor time series, are processed using RNN, including advanced variants such as LSTM and GRU. Although LSTMs and GRUs can capture long-term dependencies more effectively, the original RNN is used as an initial layer to preprocess the data sequences. This allows for initial noise reduction and essential temporal pattern extraction before moving on to the more complex LSTM and GRU layers, thus optimizing the model’s overall performance [28].

The integration of spatial and temporal characteristics is carried out through a DNN, which processes and combines this information to generate accurate insights and predictions. This hybrid model uses advanced data fusion techniques, such as feature concatenation followed by densely connected layers, to effectively handle and analyze the diversity of sensory data [29]. This capability allows us to obtain a deep and detailed understanding of the behavior and state of industrial systems.

This model’s result provides predictions and classifications that are fundamental for making strategic decisions in the management and optimization of industrial processes. Thanks to the model’s ability to integrate and analyze heterogeneous data effectively, we open new possibilities to improve operational efficiency and safety and perform effective predictive maintenance within Industry 4.0.

Although our framework targets industrial monitoring, the same combination of networks has also been applied to preventing phishing attacks, illustrating its generality. CNNs identify visual patterns in emails and websites, such as spoofed logos and unusual design elements that indicate phishing attempts [30], while RNNs, including variants such as LSTM and GRU, use their recurrent connections to remember previous information and analyze data sequences and temporal behaviors, such as URL structure and user behavior patterns, indicative of an attack [31]. The DNN at the core of the architecture integrates the extracted spatial and temporal features to model complex relationships and perform advanced inference; its ability to process and fuse large volumes of data is vital to effectively identifying and preventing such attacks.

Hyperparameter optimization was performed through a cross-validation process, combining grid search and random search to identify the best hyperparameter values. For the CNN, we adjusted the kernel size, stride, and number of filters to optimize the detection of critical visual features. Multiple experiments were performed varying these parameters to find the optimal combination that maximizes accuracy and reduces classification error [32].

For the RNN, including variants such as LSTM and GRU, we calibrated the number of hidden units and the dropout rate to balance the model's memory capacity and prevent overfitting. K-fold cross-validation was implemented to evaluate the model's stability and robustness with different hyperparameter settings.

The DNN requires careful tuning of the network depth (number of hidden layers) and learning rate. We use the Adam optimizer because it can automatically adapt each parameter's learning rate. Furthermore, L2 regularization was implemented to avoid overfitting and improve the model's generalization ability. The search for the best hyperparameters was performed iteratively, using training and validation experiments to find the optimal balance between computational performance and predictive accuracy, ensuring that the model is effective under controlled test conditions and in real operational environments.

The proposed architecture integrates CNN and RNN branches to process spatial and temporal data effectively. The CNN branch comprises multiple convolutional layers followed by pooling layers. The initial convolutional layer consists of 32 filters of size 3\(\times \)3, using ReLU activation, a stride of 1, and 'same' padding to maintain the spatial dimensions of the input. This layer captures essential spatial features from the input images. Following this, a second convolutional layer with 64 filters of the same size further refines these features, again using ReLU activation and similar stride and padding settings. After these convolutional operations, a max-pooling layer with a pool size of 2\(\times \)2 and a stride of 2 reduces the dimensionality of the feature maps while preserving the most critical information. This is followed by another convolutional layer with 128 filters, maintaining the previous configurations, and an additional max-pooling layer to ensure robust feature extraction.

The RNN branch utilizes LSTM layers to handle sequential data effectively. The first LSTM layer contains 100 units and employs a dropout rate of 0.2 to prevent overfitting, allowing the network to generalize better to unseen data. This layer processes input sequences of length 50, capturing long-term dependencies essential for accurate temporal pattern recognition. A second LSTM layer with 50 units and a similar dropout rate further processes the temporal features, enhancing the network’s ability to learn intricate temporal dependencies.
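For illustration, the following minimal Keras sketch reflects the branch configurations just described. The input shapes (256\(\times \)256 RGB images, per the rescaling in Sect. 4.1.1, and sequences of 50 steps with eight sensor channels) and the sizes of the dense fusion layers are assumptions for the sketch rather than reported implementation details.

```python
from tensorflow.keras import layers, models

# CNN branch for spatial data (image size per Sect. 4.1.1; channels assumed)
img_in = layers.Input(shape=(256, 256, 3), name="images")
x = layers.Conv2D(32, (3, 3), strides=1, padding="same", activation="relu")(img_in)
x = layers.Conv2D(64, (3, 3), strides=1, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
x = layers.Conv2D(128, (3, 3), strides=1, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
x = layers.GlobalAveragePooling2D()(x)

# RNN branch for temporal data: sequences of length 50; eight sensor
# channels is a placeholder assumption
seq_in = layers.Input(shape=(50, 8), name="sensor_series")
y = layers.LSTM(100, dropout=0.2, return_sequences=True)(seq_in)
y = layers.LSTM(50, dropout=0.2)(y)

# DNN fusion head: feature concatenation followed by dense layers
z = layers.Concatenate()([x, y])
z = layers.Dense(128, activation="relu")(z)
z = layers.Dense(64, activation="relu")(z)
out = layers.Dense(1, activation="sigmoid", name="anomaly")(z)

model = models.Model(inputs=[img_in, seq_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```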

These CNN and RNN branch configurations are integral to the proposed architecture, ensuring that spatial and temporal features are effectively captured and integrated for robust and accurate predictions. This design enables the model to provide detailed insights into the behavior and state of industrial systems, significantly improving anomaly detection and predictive maintenance capabilities.

3.2 Study Environment

The environment where this work is carried out focuses on a manufacturing company with an infrastructure equipped with various machinery and sensor technologies designed to maximize efficiency and guarantee product quality. This industrial environment integrates an ecosystem of sensors and actuators, each fulfilling a specific production cycle function, as represented in Fig.  2.

Fig. 2 Outline of the monitoring and analysis system in Industry 4.0 with deep learning integration

Temperature sensors are strategically distributed throughout critical machinery, such as presses and ovens, monitoring thermal conditions to prevent overheating that could result in downtime or equipment failure. Vibration sensors attach to machinery components that are susceptible to wear, such as bearings and gears, detecting abnormal patterns that indicate the need for preventative maintenance [33]. Sound/acoustic sensors are positioned near high-noise operating areas to identify deviations in background noise, which could signal impending operational or mechanical problems.

Additionally, the company has computer vision cameras to conduct real-time quality inspections and monitor safety conditions within the plant, ensuring that all products meet quality standards and that safety practices are rigorously observed. In this environment, pressure sensors monitor the pneumatic and hydraulic systems, while flow and level sensors monitor the efficient management of liquid resources and granular materials [34]. Proximity and position sensors guarantee precision in assembly operations. These sensors and advanced robotics systems facilitate agile production and are adaptable to changing market demands. Additionally, environmental sensors continuously monitor air quality and humidity, critical aspects of maintaining product integrity and comfort in the plant.

In the network and data processing infrastructure, data are collected by the sensor network and transmitted through PLCs/RTUs to a master terminal unit (MTU), which acts as a centralized data concentrator. From here, the information flows to a robust database hosted on local servers or cloud platforms, where it is stored for analysis. The connection to this database is secured through Ethernet and Wi-Fi connections, which offer flexibility and scalability for the growing volume of data. We have implemented a distributed approach using parallel processing technologies such as Apache Spark and TensorFlow Extended (TFX) to address scalability and handle large data sets. These technologies enable efficient real-time data processing and the ability to scale horizontally as needed. At the core of the data processing system, a robust local and cloud server equipped with deep learning technologies analyzes and fuses data from multiple sources [35]. This analysis makes it possible to identify complex patterns and make accurate predictions that inform strategic decisions for process optimization and preventive maintenance. Operators interact with the system through an HMI control panel, which offers an intuitive interface for viewing real-time data, adjusting parameters, and quickly responding to system alerts. This integrated environment reflects the company's commitment to innovation and excellence, leveraging deep learning capabilities to overcome the challenges of Industry 4.0 and set new standards for productivity and quality.
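As an illustration of this distributed layer, the following PySpark sketch computes windowed statistics over an incoming sensor stream. The schema, the landing-zone path, and the window sizes are hypothetical and serve only to indicate the pattern, not the plant's actual pipeline.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sensor-fusion-etl").getOrCreate()

# Hypothetical landing zone and schema for raw sensor records
stream = (spark.readStream
          .schema("sensor_id STRING, ts TIMESTAMP, value DOUBLE")
          .json("/data/sensors/incoming"))

# 30-second rolling statistics per sensor, tolerating 1 minute of late data
agg = (stream.withWatermark("ts", "1 minute")
       .groupBy(F.window("ts", "30 seconds"), "sensor_id")
       .agg(F.avg("value").alias("avg_value"),
            F.stddev("value").alias("std_value")))

query = agg.writeStream.outputMode("update").format("console").start()
```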

3.3 Dataset

Data capture in the manufacturing plant is designed to cover a full spectrum of critical operational variables. Table  1 describes the type of sensor, the kind of data it captures, and the approximate volume of data generated.

Table 1 Characteristics and Data of Sensors in Industry 4.0

Temperature sensors arranged at critical points on production lines monitor thermal conditions. These sensors generate readings in degrees Celsius and produce around 1000 readings per hour with a sampling rate of 1 Hz, allowing for constant, real-time monitoring of machines and processes. Vibration sensors measure acceleration in g-force to capture the operating dynamics of machinery. With a sampling rate of 10 Hz, they provide 5000 readings per hour, essential for early detection of any signs of malfunction or wear.

Sound sensors detect subtle changes in the plant’s acoustic environment and record noise levels in decibels. At a sampling rate of 2 Hz, these sensors deliver 2000 readings per hour, significantly contributing to detecting operational anomalies and mechanical problems. Computer vision cameras perform quality inspections on the production line. Although the sampling rate is relatively low at 0.2 Hz and generates 500 images per hour, each image is analyzed to ensure the products meet established quality standards.

The pressure sensors, with a sampling rate of 1 Hz, provide 1000 readings per hour, measured in pascals, which is vital for monitoring pneumatic and hydraulic systems and ensuring structural safety. The flow and level sensors offer measurements in liters per minute, with 1000 readings per hour at a sampling rate of 1 Hz, allowing efficient management of the flow of materials through the plant. To ensure precision in manufacturing, proximity and position sensors provide data in millimeters and XYZ coordinates. Together, these sensors produce around 6000 readings per hour (3000 each) with a sampling rate of 5 Hz. The environmental sensors measure air quality and humidity, providing 800 and 1000 readings per hour, respectively, with 0.5 Hz and 1 Hz sampling frequencies. This data is essential to maintain manufacturing conditions within optimal parameters.

The data collected by this sensor network is transmitted through communication interfaces to a centralized data storage system. The database collects and organizes the flow of information, which is then processed by powerful servers. Deep learning models are deployed on these servers to analyze and fuse data from multiple sources, identify complex patterns, and make accurate predictions to inform decision-making.

3.4 Data Processing

Specific techniques, tailored to their nature, are applied to temporal and spatial data to improve model effectiveness [36]. Additionally, we implemented data augmentation strategies to strengthen our dataset and improve the model's generalization ability. Time series obtained from temperature, vibration, and sound sensors are standardized before analysis. The first step is a min-max normalization, which scales the data to a range of [0, 1]. This aligns the magnitudes of the different metrics for efficient model convergence during training. Furthermore, differentiation is applied to stabilize the mean of the time series, eliminating trends and seasonalities and thus facilitating the detection of significant patterns. Automatic anomaly detection using the Isolation Forest method and exponential smoothing is implemented to handle anomalous data, mitigating the impact of abrupt fluctuations that could distort model learning. Data gaps are handled by linear interpolation or K-nearest neighbors (KNN)-based imputation techniques, depending on the nature and volume of missing data [37].
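A condensed sketch of this temporal pipeline, assuming a pandas DataFrame with one column per sensor stream and illustrative parameter values (contamination rate, smoothing factor, number of neighbors), might look as follows:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import IsolationForest
from sklearn.impute import KNNImputer

df = pd.read_csv("sensor_series.csv")            # hypothetical input file
cols = ["temperature", "vibration", "sound"]

# Fill gaps with KNN-based imputation before any other step
df[cols] = KNNImputer(n_neighbors=5).fit_transform(df[cols])

# Min-max normalization: scale each series to [0, 1]
df[cols] = MinMaxScaler().fit_transform(df[cols])

# First-order differencing to stabilize the mean of each series
diffed = df[cols].diff().dropna()

# Flag anomalous points so they can be smoothed or excluded downstream
iso = IsolationForest(contamination=0.01, random_state=42)
diffed["anomaly"] = iso.fit_predict(diffed[cols])  # -1 marks anomalies

# Exponential smoothing of the non-anomalous signal (alpha assumed)
smoothed = diffed.loc[diffed["anomaly"] == 1, cols].ewm(alpha=0.3).mean()
```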

Images captured by computer vision cameras are processed through a series of transformations. First, rescaling is performed to unify the size of all photographs, essential for input to the CNN. Histogram normalization techniques are then applied to improve the contrast and visibility of relevant image features. Additionally, to enhance the robustness of the model to variations in the position and orientation of objects in the images, we use data augmentation techniques, including random rotation, rescaling, and translation. This allows the model to learn to recognize patterns more effectively, regardless of their location in the image.

For temporal data, augmented sequences are generated by applying Gaussian noise and temporal perturbations. This data augmentation technique artificially expands the data set and helps avoid overfitting, introducing controlled diversity that helps the model learn the fundamental characteristics of the time series beyond the noise. For spatial data, in addition to the above transformations, horizontal and vertical shifts, as well as brightness and contrast adjustments, are applied to simulate different lighting conditions. This data augmentation approach ensures that the model can generalize well when faced with new data not found in the original training set [38]. Each pre-processing step is recorded and reviewed to ensure reproducibility and transparency. Using these techniques, we prepare a robust and representative data set that allows deep learning models to make accurate inferences.
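The following sketch illustrates both augmentation paths under stated assumptions: the noise level, shift range, and jitter factors are illustrative, and the image pipeline uses Keras preprocessing layers (RandomBrightness requires TensorFlow 2.9 or later).

```python
import numpy as np
import tensorflow as tf

def augment_series(batch, noise_std=0.02, max_shift=3):
    """Add Gaussian noise and a small random temporal shift to each series."""
    noisy = batch + np.random.normal(0.0, noise_std, batch.shape)
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(noisy, shift, axis=1)   # axis 1 = time steps

# Spatial augmentation: rotation, shifts, and brightness/contrast jitter
image_augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.05),
    tf.keras.layers.RandomTranslation(0.1, 0.1),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomBrightness(0.2),
    tf.keras.layers.RandomContrast(0.2),
])
```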

3.5 Model Training

Model training requires methods that ensure the model generalizes efficiently and is robust to new data. To achieve a reliable and efficient model, cross-validation, regularization, and optimization strategies are implemented. K-fold cross-validation is applied to evaluate the robustness of the model; the five-fold configuration was chosen because it offers an optimal balance between computational efficiency and statistical rigor. This process randomly divides the data set into five distinct subsets or folds. Each iteration uses four folds to train the model, while the remaining fold is reserved for validation [39]. This process is repeated five times, and each fold is used exactly once as a validation set, as presented in Figure 3. The cross-validation results are averaged to obtain a more precise estimate of model performance. This technique makes it possible to identify problems such as overfitting and to ensure that the model performs consistently regardless of data partitioning.

Fig. 3 Fivefold cross-validation diagram for evaluation of machine learning models
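A minimal sketch of this five-fold procedure, assuming a fused feature matrix X with labels y (NumPy arrays) and a hypothetical build_model() factory like the one sketched in Sect. 3.1, is shown below:

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    model = build_model()                 # fresh weights for each fold
    model.fit(X[train_idx], y[train_idx], epochs=20, batch_size=32, verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)

print(f"CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```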

To avoid overfitting, regularization techniques are integrated into the network architecture. L2 regularization (weight decay) was used, which adds a penalty term to the loss function based on the magnitude of the model coefficients. This penalty promotes smaller weights and a simpler model, which tends to generalize better. The dropout method is also implemented, which randomly "turns off" a percentage of neurons during training. This prevents the model from becoming too reliant on any specific neuron or pathway, thus promoting a more robust feature representation.

Regarding optimization, the Adam algorithm is selected as the optimizer due to its ability to adapt the learning rate of each parameter automatically. Adam combines the advantages of stochastic gradient descent with momentum and per-coordinate learning rate adaptation, making it computationally efficient and effective at converging in complex search spaces. To tune hyperparameters such as the initial learning rate, the dropout factor, and the strength of L2 regularization, a preliminary grid search is performed, followed by a more detailed search around promising values.
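The following sketch indicates how such a grid search might be organized. The candidate values and the simplified single-branch network are assumptions; in practice, the full hybrid model and validation splits from the cross-validation folds would be substituted.

```python
import itertools
import tensorflow as tf

# Candidate values are illustrative; X_train, y_train, X_val, y_val are
# the (hypothetical) training and validation splits
grid = {"lr": [1e-2, 1e-3, 1e-4], "dropout": [0.2, 0.3, 0.5], "l2": [1e-4, 1e-3]}

best_params, best_acc = None, -1.0
for lr, dropout, l2 in itertools.product(grid["lr"], grid["dropout"], grid["l2"]):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.l2(l2)),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=10, verbose=0)
    _, acc = model.evaluate(X_val, y_val, verbose=0)
    if acc > best_acc:
        best_params, best_acc = (lr, dropout, l2), acc
```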

The training was performed in a high-performance computing environment with multiple GPUs to facilitate parallel processing and significantly reduce the time required for training. Techniques such as batch normalization were also implemented to stabilize and accelerate training convergence [40]. Each step of the training process, from cross-validation to regularization and optimization strategies, allows building a model that is accurate in its predictive capacity and robust against variations in input data, thus ensuring its applicability in industrial environments.

3.6 Model Evaluation

To evaluate the effectiveness of our model, an independent test data set was used, which had not been exposed to the model during the training or validation phases. This separation between training, validation, and testing ensures that performance evaluation is objective and reflects the model’s ability to generalize to new data. The test data set contains a variety of cases and scenarios encountered on the production floor, covering all expected variations of operational data.

The data set was split by assigning 70% for training and 30% for testing. This division is designed to provide a substantial sample for model learning while reserving a significant portion for thorough and unbiased evaluation. The test set was selected to reflect the diversity and complexity of the production environment and did not participate in any model tuning phases, ensuring that the measured performance is a reliable representation of the model's generalization to unseen data. Consistency between the distributions of the training and test sets was ensured through stratification, maintaining the same proportion of classes in the two sets, which is crucial in imbalanced data contexts. Metrics used for evaluation included precision, recall, F1 score, and AUC.
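A minimal sketch of this stratified 70/30 split, assuming X and y hold the fused features and labels from Sect. 4.1, is:

```python
from sklearn.model_selection import train_test_split

# stratify=y preserves the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```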

The precision was calculated as:

$$\begin{aligned} \text {Precision} = \frac{TP}{TP + FP}. \end{aligned}$$
(1)

TP is the number of true positives, and FP is the number of false positives. The recall was determined as follows:

$$\begin{aligned} \text {Recall} = \frac{TP}{TP + FN}. \end{aligned}$$
(2)

FN is the number of false negatives. The F1 score, which is the harmonic mean of precision and recall, was calculated as follows:

$$\begin{aligned} F1 = 2 \times \left( \frac{\text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \right) . \end{aligned}$$
(3)

The AUC provides an aggregate measure of performance across all classification thresholds. In the anomaly detection use case, the model successfully identified non-standard patterns in the vibration and sound data that indicated potential machinery failures. By comparing the model results with maintenance logs, we observed that the model could predict anomalies that matched the reported failures with a precision of 93%.
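For reference, these four metrics can be computed with scikit-learn as in the following sketch, where y_true and y_score are hypothetical NumPy arrays of ground-truth labels and DNN output probabilities, and the 0.5 decision threshold is an assumption:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_pred = (y_score >= 0.5).astype(int)               # threshold the DNN output
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))  # threshold-independent
```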

For predictive maintenance, the model was applied to temporal data from temperature and pressure sensors, predicting the optimal time for maintenance before production interruptions occurred. Model performance resulted in a 20% reduction in unplanned downtime compared to the time-based maintenance approach. In process optimization, the model processed flow and level data to suggest adjustments that improved material use efficiency. Implementing the model resulted in a 15% reduction in material waste and a 5% increase in overall production line productivity.

Table 2 Detail of Industrial Sensor Parameters and Sampling Specifications

4 Results

The results obtained in this study not only demonstrate the transformative potential of sensor data fusion using deep learning in Industry 4.0 but also underline the critical importance of the adopted methodology. The significant improvement in anomaly detection, predictive maintenance, and operational efficiency provides tangible testimony to the effectiveness of our approach.

4.1 Data Fusion Process

This process allows us to identify how integrating multiple types of data improves the precision and effectiveness of the model and opens new avenues for predictive analysis and data-based decision-making. This stage transforms dispersed and heterogeneous data into actionable insights and measurable results, marking a significant advance in the practical application of deep learning in industry.

Table  2 shows various sensors used to monitor different aspects of the manufacturing plant, from temperature and vibration to humidity and liquid flow. This diversity is essential to obtain a complete picture of the operational and environmental status. Comprehensive coverage of different types of sensors ensures that all relevant aspects of the production process are captured, enabling early detection of potential problems and proactive intervention. Each data set is recorded with a precise timestamp for accurate temporal correlation during the fusion phase.

The value ranges in the table encompass typical operating scales for each monitored variable. For example, a temperature of 0°C to 50°C is usual in most industrial environments, while the vibration range of 0 to 5 g force covers most machinery conditions. These specific ranges and units of measurement indicate that the data collected is relevant and applicable to analyzing normal and abnormal conditions within the plant.

Sampling frequency varies between data types, reflecting the specific monitoring needs of each variable. For example, the high sampling rate of vibration sensors (100 Hz) is critical for capturing transient anomalies that could indicate mechanical failures. In contrast, computer vision cameras have a lower sampling rate (0.5 Hz), which is suitable for quality inspections that do not require the same temporality as vibration or sound data.

Combining these different types of data, each with its scale, unit of measurement, and sampling frequency, presents challenges and opportunities for data fusion. Proper normalization and preprocessing are essential to integrating these disparate data into a coherent framework that can be effectively analyzed. The successful fusion of this data allows the identification of anomalies or suboptimal conditions and the development of predictive models that can anticipate failures before they occur, optimize processes, and improve overall plant efficiency.

4.1.1 Individual Data Preprocessing

This preprocessing enables multiple data sources to be aligned, allowing for effective fusion and subsequent analysis using DNN techniques. Table  3 provides a quantitative and detailed view of the data volume, collection period, and specific preprocessing methods used for each data type. This process ensures that the data is normalized, free of noise, and in a standard format suitable for data fusion, which is essential for the success of analytics in the industry.

Table 3 Characterization and preprocessing of temporal and spatial data in Industry 4.0

In the temporal data preprocessing process, we apply a Min-Max normalization for each time series (temperature, vibration, sound) to scale the data from 0 to 1. This ensures that different sensor magnitudes do not bias the subsequent analysis. Additionally, noise filters were used for each type of data. For example, we apply a low-pass filter to the vibration data to remove high-frequency noise irrelevant to anomaly detection.
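As an illustration, a fourth-order Butterworth low-pass filter from SciPy could implement this filtering step. The 10 Hz cutoff is an assumption; the 100 Hz sampling rate follows the vibration specification in Table 2, and the signal here is a placeholder.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 100.0     # vibration sampling rate in Hz, per Table 2
cutoff = 10.0  # assumed cutoff; components above this are treated as noise
b, a = butter(N=4, Wn=cutoff / (fs / 2), btype="low")

vibration = np.random.randn(5000)               # placeholder signal
vibration_filtered = filtfilt(b, a, vibration)  # zero-phase low-pass filtering
```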

In spatial data preprocessing, all images captured by computer vision cameras are rescaled to a standard resolution of 256 \(\times \) 256 pixels. This standardizes the input size for image analysis and reduces the computational burden.

The results presented in the data preprocessing table reveal information essential for effective fusion and subsequent analysis. First, the volume of data, which varies significantly between different types of sensors, highlights the enormous scale of information generated daily in a manufacturing plant. For example, with 50,000 readings per day, vibration sensors generate a particularly dense data stream, underscoring the importance of efficient filtering techniques to extract relevant signals from this background noise.

The min-max normalization applied to all temporal data guarantees a uniform comparative basis, allowing the different metrics to be integrated without bias due to scale differences. This step allows the signals to be interpreted in the broader context of the data fusion, facilitating the accurate detection of anomalies and operational patterns. On the other hand, the rescaling and contrast adjustment of images highlights the need to prepare visual data to maximize the relevance of features for the CNN. Standardizing image size to 256 \(\times \) 256 pixels optimizes processing and contributes to a manageable computational load, crucial for real-time or near-real-time analysis. The variation in the collection period, from continuous for temporal data to event-based for images, indicates the monitoring system's adaptability to the specific needs of each type of data. This reflects a well-thought-out data collection strategy to capture the most relevant information for data fusion analysis.

4.1.2 Feature Extraction

Table  4 summarizes the deep learning techniques and results obtained in extracting features from spatial and temporal data. Additionally, it is essential to consider the importance of each type of sensor in the anomaly detection process. Image sensors are crucial for detecting visual defects in products and equipment.

Table 4 Summary of Deep Learning Techniques and Results in Feature Extraction from Spatial and Temporal Data

CNN extracts key visual features that indicate anomalies, such as edges, textures, and shapes. The high accuracy in defect classification (92%) and the reduction in inspection time (30%) highlight the effectiveness of these sensors in detecting visual anomalies. Temperature sensors are essential for monitoring the thermal status of machines and equipment. RNNs, specifically LSTMs, are used to identify patterns and temporal dependencies in temperature data, making it possible to predict failures related to overheating or abnormal thermal fluctuations. The improvement in failure prediction (85%) and maintenance anticipation (40%) underlines the importance of these sensors.

Vibration sensors are essential for detecting changes in machines' mechanical behavior. RNNs/LSTMs make it possible to analyze time series of vibration data to identify wear patterns and possible mechanical failures. These sensors' ability to predict mechanical failures and improve maintenance anticipation is crucial to maintaining operational efficiency. Sound sensors monitor machine operating noise. Variations in sound levels may indicate mechanical problems or malfunctions. RNNs/LSTMs help capture and analyze these temporal variations, providing an additional tool for early anomaly detection.

Combining these different types of sensors allows for a comprehensive approach to anomaly detection, leveraging the strengths of each sensor to provide a robust and accurate system. Integrating spatial and temporal data through deep learning techniques significantly improves failure detection and prediction capacity, ensuring operational continuity and reducing maintenance costs.

Figure  4 shows the comparison between the actual data and the predictions made by the model for the temperature, vibration, and sound variables over time. In each of the graphs, the data is represented by the blue line, while the model predictions generated by the RNN are shown in red. For temperature over time (tracking for 15 days), it is observed that the RNN has successfully captured the overall trend and temperature fluctuations. The vibration graph closely tracks the peaks and valleys of the actual data, suggesting that the model has learned underlying vibration patterns that could indicate the health of the machinery and potentially predict future events, maintenance, or failures.

Fig. 4 Visualization of temporal feature extraction in industrial sensor data

To quantify the model’s performance in extracting temporal features, we have calculated the following evaluation metrics: precision, recall, F1 score, and mean square error (MSE). The model predictions (red line) closely follow the actual data (blue line), indicating good temporal tracking ability. To provide a quantitative comparison of performance, we present the following metrics:

Temperature:

  • Precision: 0.91

  • Recall: 0.89

  • F1 score: 0.90

  • MSE: 0.015

Vibration:

  • Precision: 0.93

  • Recall: 0.91

  • F1 score: 0.92

  • MSE: 0.012

Sound:

  • Precision: 0.88

  • Recall: 0.86

  • F1 score: 0.87

  • MSE: 0.018

These quantitative metrics indicate that the model performs well in predicting temporal characteristics, with high precision and low error rates in all analyzed variables. The combination of visualization and quantitative metrics provides a comprehensive view of the progress made in temporal feature extraction.

4.1.3 Feature Integration

After individual feature extraction, each dataset has its high-dimensional representation optimized to highlight relevant information. In the case of images, CNNs may have identified hundreds of visual features, while RNNs may have determined multiple meaningful temporal patterns. Feature integration is performed through a concatenation or combination process that aligns these different representations into a unified feature vector. This vector is the integral input for subsequent learning models that make predictions or classifications.

According to the results obtained in Table  5, CNNs, specialized in image processing, have demonstrated a remarkable ability to extract up to 256 distinctive features per image. This level of detail, which captures fundamental visual elements such as edges and textures, is reflected in the substantial 8% improvement in postfusion classification precision, underscoring the ability of CNNs to accurately discern between standard images and defective ones. On the other hand, the application of RNN with LSTM to time series has made it possible to encode patterns and trends over time in 128 significant features. The 10% improvement in predicting future events highlights how LSTMs capture and use temporal information to anticipate critical conditions, an invaluable capability for predictive maintenance and operational optimization.

Combining these multidimensional data sets into a unified 384-element vector, followed by normalization and dimensionality reduction using principal component analysis (PCA), has resulted in a more manageable data set of 300 features while preserving 95% of the original variance. This process ensures computational efficiency and maintains data integrity, allowing the final model to perform with high precision and sensitivity.
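A minimal sketch of this fusion and reduction step, assuming spatial_feats and temporal_feats are the per-sample feature matrices produced by the CNN and LSTM branches (256 and 128 columns, as reported in Table 5), is:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# spatial_feats: (n_samples, 256); temporal_feats: (n_samples, 128)
fused = np.concatenate([spatial_feats, temporal_feats], axis=1)  # (n, 384)

fused = StandardScaler().fit_transform(fused)   # normalize before PCA
pca = PCA(n_components=300)
reduced = pca.fit_transform(fused)
print("variance retained:", pca.explained_variance_ratio_.sum())  # ~0.95
```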

Table 5 Summary table of methods for extraction, fusion, and spatial and temporal data validation

4.1.4 Creation of the Merged Dataset

A critical task in the data fusion process is creating the merged data set. Here, feature vectors previously extracted and processed from spatial and temporal data are fused to represent each data sample comprehensively. The integration of spatial and temporal features has increased the dimensionality of the dataset, requiring a careful balance between retaining relevant information and managing computational complexity. After fusion, the total number of features is adjusted to the sum of the spatial and temporal features, providing an enriched database for analysis. Coverage of the variety of data allows for more robust analysis and capture of complex interactions between different types of signals. Table  6 presents the temporal characteristics collected in six different samples. Each row corresponds to an individual sample and presents five temporal features that could represent, for example, various aspects of a time series, such as temperature, vibration, or sound at specific moments or time intervals.

Table 6 Extracted temporal characteristics for sensor data analysis

The table presents a series of samples with five different temporal characteristics. Each row represents a sample with values ranging from 0.49 to 0.85, suggesting moderate variation in measurements over time. When analyzing the values, it is observed that Characteristic_4_Temporal tends to have higher values than the other characteristics, indicating an operational parameter that regularly reaches higher peaks within the observed range. This could correspond, for example, to a periodic increase in workload or temperature that exceeds the average operating values.

In contrast, Characteristic_3_Temporal also shows significant variability. Still, it tends to remain in a medium range, which may reflect the more stable behavior of an operating variable, such as constant flow in a manufacturing process. Characteristic_1_Temporal and Characteristic_5_Temporal exhibit the lowest values, which could correspond to intrinsically lower measurements, such as background noise levels or vibrations in a no-load operating state. Fluctuations in these measurements indicate operating patterns and maintenance needs or signal the emergence of abnormal conditions.

The values show consistency in temporal characteristic one, remaining in the upper range of 0.6 to 0.8. This consistency could indicate a stable condition or a constant trend in the variable corresponding to that characteristic, a process under control, or the absence of significant disruptive events. Variations in the other temporal characteristics, although within similar ranges, suggest more complex dynamics. For example, feature four shows higher values in all samples, with a peak of 0.85 in sample 4, which may indicate a turning point or a significant change in the measured process.

4.1.5 Ablation Studies

To evaluate the contribution of each component of our model, we performed ablation studies. These studies selectively remove different elements from the model and assess their impact on overall performance. In this study, we selectively remove the CNN layers responsible for spatial feature extraction. Additionally, we removed the LSTM units used to capture temporal dependencies in the temperature, vibration, and sound data. Similarly, we remove the DNN layer used to integrate spatial and temporal features.
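A sketch of how such ablation variants might be constructed is shown below. The builder reuses simplified versions of the branches sketched in Sect. 3.1, and the flags controlling each component, along with the layer sizes, are illustrative assumptions.

```python
from tensorflow.keras import layers, models

def build_ablation_model(use_cnn=True, use_lstm=True, use_fusion_dnn=True):
    """Build a reduced variant of the hybrid model for one ablation run."""
    inputs, feats = [], []
    if use_cnn:
        img_in = layers.Input(shape=(256, 256, 3))
        x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(img_in)
        x = layers.GlobalAveragePooling2D()(x)
        inputs.append(img_in)
        feats.append(x)
    if use_lstm:
        seq_in = layers.Input(shape=(50, 8))
        y = layers.LSTM(50, dropout=0.2)(seq_in)
        inputs.append(seq_in)
        feats.append(y)
    z = layers.Concatenate()(feats) if len(feats) > 1 else feats[0]
    if use_fusion_dnn:              # pass False to ablate the DNN head
        z = layers.Dense(64, activation="relu")(z)
    out = layers.Dense(1, activation="sigmoid")(z)
    return models.Model(inputs, out)

# Example: the "no LSTM" variant used in one ablation run
no_lstm = build_ablation_model(use_lstm=False)
```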

Table  7 summarizes the results of the ablation studies. These studies show how each model component contributes to overall performance and highlight the importance of a well-designed architecture for anomaly detection.

Table 7 Results of ablation studies

Analysis of ablation studies shows how removing different components affects the model’s overall performance. For example, removing the CNN layers resulted in a 20% decrease in overall model accuracy. This is because CNNs extract key visual features critical for visual anomaly detection. Without these layers, the model cannot identify edges, textures, and shapes that indicate defects. Furthermore, a 25% increase in classification error was observed, underscoring the importance of CNNs in spatial data analysis.

Removing the LSTM units showed an even more significant impact on model accuracy, with a 30% decrease in fault prediction accuracy. LSTMs are crucial for capturing temporal dependencies and patterns in time series data. Without these units, the model cannot adequately analyze changes over time, significantly reducing its predictive ability. Furthermore, the 35% increase in the MSE indicates that the accuracy of the predictions is also severely affected.

Removing the DNN layer to integrate spatial and temporal features resulted in a 25% decrease in overall model accuracy. This layer is essential to combine the features extracted by CNNs and LSTMs effectively. Without this layer, the model cannot correctly integrate spatial and temporal information, leading to a 30% increase in prediction error. This demonstrates the importance of efficient data integration for accurate and robust predictions.

The ablation studies highlight the critical importance of each model component to its overall performance. The CNN layers, LSTM units, and the DNN integration layer are all essential to ensure accurate and effective anomaly detection. Removing any of these components results in a significant performance decrease, underscoring the need for a well-designed and carefully optimized architecture for anomaly detection applications.

4.2 Data Fusion Results

After implementing the data fusion process, it is necessary to quantify the impact on model performance. Table  8 presents a detailed comparison of these metrics, illustrating the significant progress made when fusing data using CNN and RNN, in contrast to previous results and alternative fusion methods.

Table 8 Comparison of performance metrics between data fusion methods

The analysis of the results table evidences the success of data fusion using CNN+RNN over pre-fusion results and other alternative fusion approaches. Precision and recall increased to 0.88 and 0.85, respectively, indicating the model's improved ability to identify positive cases correctly and to retrieve a greater proportion of them. The F1 score also increased to 0.86, reflecting a better balance between precision and recall than previous and alternative fusion methods. Furthermore, the AUC improved to 0.90, demonstrating an improvement in the model's ability to distinguish between positive and negative classes across different thresholds.

The comparative analysis underscores the effectiveness of the CNN+RNN data fusion in enriching the fused data set, allowing for deeper and more accurate analysis. Although the alternative methods improved on the pre-fusion results, they did not achieve the performance of the proposed approach. Alternative fusion using simple concatenation, weighted sum, or advanced hybrid models shows incremental improvements but confirms the advantage of properly integrating CNN and RNN capabilities to capture the spatial and temporal complexity of the data.

4.3 Visualizations and Analysis of Data Fusion and DNN Model Training

To illustrate the significant impact of this fusion, visualizations highlight the before-and-after in terms of precision, recall, F1 score, and AUC, as well as graphs showing the influence of the combined features on the model predictions. Figure 5 directly compares the critical performance metrics (precision, recall, F1 score, and AUC) before and after data fusion implementation. This analysis highlights the tangible improvements achieved through the fusion and provides a solid basis for evaluating the effectiveness of our methodology in improving model precision and reliability.

The improvement in precision indicates that the model can now better identify true positives among all positive predictions. This is critical in environments where false alarms can be costly or disruptive. The increase in recall reveals that the model is more effective at recovering true positives among all actual positive cases, detecting as many relevant events or conditions as possible.

The F1 score, which balances precision and recall, shows a notable increase, indicating a more robust balance between these metrics. The increase in AUC underlines that the model can better distinguish between positive and negative classes across different thresholds. This is a strong indicator of the improvement in the quality of the model predictions.

Fig. 5 Performance metrics comparison chart before and after data fusion

Figure  6 reveals distinctive patterns of importance assigned to spatial and temporal features. Certain spatial and temporal features have significantly high importance values, indicating that the model considers them crucial for making accurate predictions. These features are associated with specific data patterns that characterize events or states of interest, such as anomaly detection or standard operating pattern recognition. On the other hand, the distribution of importance between features shows variability that reflects the data’s complexity and the model’s ability to capture and use this diversity to its advantage.

Fig. 6 Heat map of the importance of spatial and temporal features in data fusion

The fusion of spatial and temporal data culminates in an enriched dataset that feeds the training of a deep learning model, DNN. The previously presented visualizations, including performance improvement graphs and feature heatmaps, not only highlight the importance of combined features but also provide an understanding of the direct impact of this approach on model training. Following data fusion, DNN model training focuses on complex patterns indicative of specific operational states, such as anomaly detection or identification of maintenance needs.

4.4 Model Evaluation with Independent Test Dataset

The final stage in validating the performance of our DNN model involved its evaluation using an independent test data set. This allows for an unbiased review of the model and examines its ability to generalize to new data. Data division:

  • Training set: 70% of the total data.

  • Independent test set: 30% of total data.

Table  9 details the results obtained from this evaluation, providing a direct comparison between the performance metrics during training and on the independent test set, thus illustrating the tangible impact of our data fusion methodology on the effectiveness and reliability of the model.

Table 9 Results on the independent test set

Evaluation of the model on the independent test data set reveals a slight decrease in the performance metrics compared to the values obtained during training. This difference, although minimal, is an essential consideration in the analysis of the model's generalization capacity. Precision and AUC, for example, show a decrease of less than 2%, suggesting that the model can correctly identify positive instances and distinguish between classes in never-before-seen data. Furthermore, the recall and F1 scores exhibit remarkable robustness on the test set, with minimal difference from the training values. These results indicate that the model effectively recovers a large proportion of positive instances and balances precision and recall, even when faced with new data.

Table 10 Improvements in operational metrics by use case after data fusion

4.5 Comparison Based on Use Cases

Table  10 presents comparative results highlighting anomaly detection, predictive maintenance, and process optimization improvements. The improved accuracy in anomaly detection is due to the model’s ability to interpret complex interactions between spatial and temporal data, allowing for more accurate identification of abnormal patterns. This improvement is essential in industrial environments where undetected anomalies can lead to costly or dangerous failures.

The extended early-detection window for predictive maintenance underscores the model's ability to analyze temporal trends and long-term spatial correlations, predicting the need for interventions well in advance. From a theoretical perspective, data fusion provides a solid foundation for models that must identify the system's current state and project its future behavior. The improved operational efficiency reflects the model's ability to integrate and analyze data from multiple sources, optimizing processes by identifying inefficiencies and suggesting adjustments. In theory, this demonstrates how data fusion enriches the dataset available to the model, enabling a level of analysis and understanding beyond what would be possible with isolated data.

In this case study, we implemented our data fusion framework in an electronic component manufacturing plant. The plant uses various sensors, including temperature, vibration, pressure, flow, and vision, to monitor different aspects of the production process: temperature sensors monitor the thermal conditions of machines and the environment, vibration sensors capture acceleration data to identify mechanical problems, pressure sensors monitor pneumatic and hydraulic systems, flow sensors measure liquid flow in the production process, and vision cameras perform real-time quality inspections.

The data fusion process involved several steps. First, sensor data were collected and stored on a central server, and normalization and noise filtering were applied to prepare the data for analysis. CNNs were then used to extract visual features from the images, and RNNs (with LSTM units) captured temporal patterns in the sensor data. Spatial and temporal features were integrated into a unified vector using concatenation and dimensionality reduction techniques such as PCA. Finally, a DNN model was trained on the fused feature vector to detect anomalies and predict maintenance needs; a condensed sketch of this pipeline appears below.

The results of the case study were significant. After implementing data fusion, detection accuracy improved from 80% to 92%. Early detection time for maintenance was extended from 2 to 5 days, reducing unplanned downtime by 20%. Operational efficiency increased from 70% to 85%, resulting in a 15% reduction in material waste and a 5% increase in production line productivity.
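The sketch below condenses the two-stage pipeline described above: CNN and LSTM branches act as feature extractors, their outputs are concatenated, PCA reduces the fused vector while retaining 95% of the variance, and a DNN is trained on the result. Shapes, layer sizes, and the synthetic data are illustrative assumptions; in practice, the extractor branches would already be trained on the plant data.

```python
# Two-stage fusion pipeline sketch: extract, concatenate, reduce, train.
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA

n = 512
images = np.random.rand(n, 64, 64, 1).astype("float32")  # vision data
series = np.random.rand(n, 50, 8).astype("float32")      # sensor series
labels = np.random.randint(0, 2, size=(n,))

# Stage 1: feature extraction (branches shown untrained for brevity).
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 1)),
    tf.keras.layers.GlobalAveragePooling2D(),
])
lstm = tf.keras.Sequential([tf.keras.layers.LSTM(32, input_shape=(50, 8))])
spatial_feats = cnn.predict(images, verbose=0)
temporal_feats = lstm.predict(series, verbose=0)

# Fusion: concatenation, then PCA keeping 95% of the variance.
fused = np.concatenate([spatial_feats, temporal_feats], axis=1)
reduced = PCA(n_components=0.95).fit_transform(fused)

# Stage 2: DNN trained on the reduced fused vector.
dnn = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          input_shape=(reduced.shape[1],)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
dnn.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=["accuracy"])
dnn.fit(reduced, labels, epochs=2, batch_size=32, verbose=0)
```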

This case study demonstrates the applicability and effectiveness of our data fusion approach in a real industrial environment. Integrating spatial and temporal data not only improved anomaly detection and predictive maintenance accuracy but also optimized operational efficiency, highlighting the practical benefits of our methodology.

4.6 Comparison with Other Existing Techniques

The evolution of Industry 4.0 has given rise to a diversity of techniques to address critical challenges such as anomaly detection, predictive maintenance, and process optimization. While our proposal focuses on the fusion of spatial and temporal data to enrich the analysis and improve model performance, other techniques have taken different approaches. Table 11 compares our methodology with other prevalent artificial intelligence and machine learning methods.

Our data fusion proposal is distinguished by its holistic approach, integrating spatial and temporal information for a deep and multifaceted understanding of the data. This methodology contrasts with alternative techniques that may focus on a single data dimension, potentially limiting their ability to capture the complexity of operational environments. However, it is essential to recognize that the greater complexity of implementing data fusion-based models can present a challenge in terms of the computational resources and technical expertise required. Despite this, investment in this approach promises a significant return in predictive precision, operational efficiency, and adaptability to diverse industrial situations.

Table 11 Comparison of characteristics between data fusion and alternative techniques in Industry 4.0

To evaluate the performance of the proposed method, we compared it with Kalman filters and with learning-based methods such as Support Vector Machines (SVM) and Random Forests (RF). The results of this comparison are presented in Table 12. The Kalman filter implementation in this study consists of the following components:

- State Vector (x): Represents the variables of interest that we aim to estimate, such as the position and velocity of a machinery component. In this context, the state vector includes parameters such as temperature, vibration levels, and other critical sensor readings.

- State Transition Function (F): Describes how the state vector evolves from one time step to the next. This function incorporates the dynamics of the system being monitored; for instance, it predicts the next state of the system from its current state and a mathematical model of the system's behavior.

- Observation Model (H): Maps the true state space into the observed space, indicating how the measurements are related to the state vector. This model helps translate the actual state of the system into the observed sensor readings.

- Process Noise (Q): Represents the uncertainty in the state transition process, accounting for the variability in the system’s dynamics. This noise is assumed to follow a Gaussian distribution with a covariance matrix Q, capturing the level of unpredictability in the system changes.

- Measurement Noise (R): Accounts for the uncertainty in the sensor measurements. This noise also follows a Gaussian distribution with a covariance matrix R, indicating the reliability of the sensor readings.

Together, these components form the basis of the Kalman filter algorithm, which iteratively estimates the system's state by minimizing the error covariance. The configurations used in this study are tailored to the specific characteristics of the industrial environment and the types of sensors deployed.
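A minimal linear Kalman filter built from the components listed above (x, F, H, Q, R); the matrices below model a single drifting quantity, such as a temperature reading and its drift rate, and are illustrative rather than the study's actual configuration.

```python
# Minimal Kalman filter predict/update cycle for a 2-D state
# (value and drift rate) observed through a single noisy sensor.
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition
H = np.array([[1.0, 0.0]])               # we observe the value only
Q = 1e-4 * np.eye(2)                     # process noise covariance
R = np.array([[0.25]])                   # measurement noise covariance

x = np.zeros((2, 1))                     # initial state estimate
P = np.eye(2)                            # initial estimate covariance

def kalman_step(x, P, z):
    # Predict step.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update step with measurement z.
    innovation = z - H @ x_pred
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ innovation
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

for z in [1.1, 1.9, 3.2, 3.9, 5.1]:      # noisy readings of a rising value
    x, P = kalman_step(x, P, np.array([[z]]))
print(x.ravel())                          # estimated value and drift rate
```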

Table 12 Comparison of results with Kalman filters and other learning-based methods

Analysis of the comparative results shows that the proposed CNN-RNN method significantly outperforms Kalman filters, SVM, and RF in terms of prediction accuracy and MSE. Although Kalman filters are fast in computation time, their accuracy is considerably lower due to their inability to capture complex, nonlinear dependencies in the data. SVM and RF perform better than Kalman filters but still fall short of the proposed method's accuracy and its ability to handle complex temporal patterns.

The proposed method's ability to integrate spatial and temporal data through CNN and RNN provides higher accuracy in anomaly detection and better generalization under various operating conditions. Although the computation time of the proposed method is slightly longer, this difference is justified by the significant improvement in the accuracy and robustness of the model.

4.7 Computational Complexity Analysis and Optimization

We analyzed the computational requirements and explored possible optimizations to address the computational complexity of the hybrid approach combining CNN, RNN, and DNN. First, in terms of infrastructure, model training is carried out in a high-performance computing (HPC) environment equipped with multiple graphics processing units (GPUs). This environment allows parallel processing, significantly reducing the time required for training complex models. The infrastructure includes servers with NVIDIA Tesla V100 GPUs, which offer high processing capacity and adequate memory to handle large volumes of data and complex models.

The training time varies depending on the size of the dataset and the complexity of the model. In our experiments, training the combined network (CNN + RNN + DNN) on a medium-sized dataset (approximately 1 million records) took about 48 h using 4 GPUs in parallel. Implementing the model requires significant memory due to convolution operations and the maintenance of states in RNNs; on average, each training instance uses about 32 GB of VRAM.

To optimize the model, we use parallel processing techniques to distribute the workload across multiple GPUs, which speeds up training and allows us to handle larger and more complex datasets. We implement Apache Spark to distribute data preprocessing and analysis tasks, improving the system's efficiency. Additionally, we employ search strategies such as grid search and random search to tune critical hyperparameters, including the learning rate, batch size, and regularization parameters (L2 and dropout). The Adam optimizer automatically adapts the learning rate of each parameter, improving convergence in complex search spaces. Finally, we apply dimensionality reduction with Principal Component Analysis (PCA) to reduce the number of features while retaining 95% of the original variance, which lowers the computational burden and improves the model's efficiency.
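As an illustration of this tuning loop, the sketch below hand-rolls a random search over learning rate, batch size, and dropout, with L2 regularization and the Adam optimizer; the search ranges, trial budget, and model are our assumptions, since the text does not specify a search library or budget.

```python
# Hand-rolled random search over learning rate, batch size, and dropout.
import numpy as np
import tensorflow as tf

X = np.random.rand(400, 20).astype("float32")  # stand-in fused features
y = np.random.randint(0, 2, size=(400,))

rng = np.random.default_rng(0)
best_params, best_acc = None, -1.0
for _ in range(10):                            # 10 random trials
    lr = 10 ** rng.uniform(-4, -2)             # log-uniform learning rate
    batch = int(rng.choice([16, 32, 64]))
    drop = float(rng.uniform(0.1, 0.5))
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(
            32, activation="relu", input_shape=(20,),
            kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
        tf.keras.layers.Dropout(drop),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    hist = model.fit(X, y, validation_split=0.2, epochs=5,
                     batch_size=batch, verbose=0)
    acc = hist.history["val_accuracy"][-1]
    if acc > best_acc:
        best_params, best_acc = (lr, batch, drop), acc
print("best (lr, batch, dropout):", best_params, "val acc:", best_acc)
```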

Batch normalization is implemented to stabilize the training process and accelerate convergence, which is crucial for handling variations in the data while maintaining computational efficiency (a placement sketch follows). Implementing these techniques reduced training time by 30% and improved model accuracy by 10% compared to traditional methods without optimization. In addition to the improvements above, we evaluated other important metrics such as model robustness, operational efficiency, and real-time performance: robustness improved by 17%, efficiency increased by 15%, and real-time performance increased by 15% after applying the optimizations.
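A minimal sketch of one common batch normalization placement within a dense block (linear layer, normalization, activation, dropout); ordering conventions vary, and this layout is our assumption rather than the paper's exact design.

```python
import tensorflow as tf

# Dense block with batch normalization: the bias is omitted because the
# normalization's learned shift makes it redundant.
block = tf.keras.Sequential([
    tf.keras.layers.Dense(64, use_bias=False, input_shape=(32,)),
    tf.keras.layers.BatchNormalization(),  # stabilizes per-batch statistics
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dropout(0.3),
])
```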

Figure 7 shows the improvement in model accuracy, reduction in training time, memory usage, accuracy after dimensionality reduction, robustness, efficiency, and real-time performance due to the applied optimizations.

Fig. 7 Impact of optimizations on the model performance metrics

5 Discussion

The fusion of spatial and temporal data has proven effective, as evidenced by the substantial increase in model performance metrics. Specifically, we observed a 15% relative increase in anomaly detection accuracy, rising from 80% before data fusion to 92% after. However, we are aware of concerns related to scalability and real-time processing. We have integrated parallel processing technologies and distributed architectures such as Apache Spark and TensorFlow Extended (TFX) to address these concerns. These technologies improve the ability to handle large volumes of data and enable efficient real-time processing, ensuring that our framework is scalable and able to respond to the growing demands of Industry 4.0.

Additionally, the predictive capability of predictive maintenance has been significantly improved, with a 150% increase in early detection time, from 2 to 5 days. This result provides a broader window for preventive intervention and contributes to more efficient resource planning and reduced unplanned downtime, both vital to maintaining operational continuity and efficiency. Regarding process optimization, we achieved a 21.4% relative improvement in operational efficiency, rising from 70% to 85%. This increase indicates the model's ability to identify and suggest operational adjustments that lead to higher productivity and more efficient utilization of resources, which can translate into significant cost savings and improvements in long-term operational sustainability [28].

When comparing our findings with previous research in the field, we see similarities in the trend toward adopting data fusion techniques. However, our study is distinguished by its focus on integrating spatial and temporal data, which offers clear advantages in terms of model performance [40]. We acknowledge that our study has limitations, mainly related to the scope of the datasets and the specific fusion methodology applied; these limitations could affect the breadth of applicability of our findings [8]. For future research, it would be beneficial to explore larger and more diversified datasets and to compare different data fusion techniques to validate and possibly expand the applicability of our results [1, 41].

From a practical perspective, the study results offer promising avenues for implementation in real industrial environments. For example, the improvements in anomaly detection and predictive maintenance suggest that companies could adopt data fusion techniques to prevent failures and optimize production. Adapting these approaches will require considering context-specific factors, such as data availability and the existing technological infrastructure. Despite its limitations, the research provides valuable insights for advancing deep learning technology and its practical application.

Although our study has demonstrated the effectiveness of the data fusion approach in a specific industrial application, we recognize that the diversity of the data set used is limited. This specialization could restrict the applicability of our findings to other industrial contexts. The choice of this data set was justified by its relevance and availability, allowing detailed analysis and significant results within this sector.

To address this limitation, we suggest that future research explore more diverse data sets spanning multiple types of industrial applications. This extension will allow us to validate and generalize our results, ensuring that the proposed approach can be adapted to different industrial contexts and challenges. We believe this future research will be crucial to strengthening the applicability and relevance of our methodology in Industry 4.0.

6 Conclusion

Through detailed methodology and analysis, our work has demonstrated how intelligent integration of diverse data sources can unlock new capabilities in anomaly detection, predictive maintenance, and industrial process optimization. Our study has revealed significant improvements in several key performance metrics, including an increase in anomaly detection precision to 92%, a 150% extension of the early detection window for predictive maintenance, and an improvement in operational efficiency to 85%. These achievements underline the potential of our proposal to transform industrial operations, enabling greater precision in decision-making, reduced downtime, and overall process optimization.

Beyond technical advances, the study provides valuable insights into the importance of careful data preparation and the selection of appropriate modeling techniques to address complex challenges in industrial environments. The fusion of spatial and temporal data is emerging as a powerful strategy, not only for its ability to improve model performance but also for its potential to offer a deeper and more nuanced understanding of operating systems.

Comparison of our methodology with other existing techniques has further highlighted its relevance and applicability. Despite facing certain limitations, such as the complexity of implementation and the need for significant computational resources, the proposal is distinguished by its ability to provide more precise and reliable solutions, effectively adapting to the changing dynamics of Industry 4.0. Industry adoption of our approach could lead to operational improvements as well as greater sustainability and competitiveness in the global market. By capitalizing on the richness of fused data, businesses can move toward an era of intelligent operations with greater efficiency, security, and adaptability.

7 Supplementary Information

As part of the supplementary information for this study, we provide a representative sample of the data used in our analyses. In addition, a preliminary version of the developed software is included, offering an initial view of the tools and methodologies applied. This approach strengthens transparency and reproducibility, supporting our findings and allowing other researchers to validate or extend our research. The inclusion of this documentation is essential for the validation of the results presented and will, therefore, remain available for review by the editorial teams or any interested party, as required.