Introduction

Over the past few decades, the use of remote sensing techniques has surged, resulting in remarkable advancements in both the quality and quantity of remote sensing images. With images of different characteristics now freely available, remote sensing has become a common tool for studying the dynamics of Earth systems (Zhu et al. 2018). Images with high temporal and spatial resolution are highly desirable for better monitoring and detection of temporal changes. For example, dense time-series data are crucial for monitoring crop yields (Bolton and Friedl 2013), natural hazards and disasters (Han et al. 2010; Van Leeuwen et al. 2010), or changes in land use (Yin et al. 2014). However, due to the trade-off between the pixel size and the scanning swath of a sensor, it is almost impossible to find a satellite product with simultaneously high spatial and temporal resolution (Tan et al. 2018). For instance, MODIS provides daily reflectance images with a spatial resolution ranging from 250 m to 1 km, AVHRR offers a resolution of 1100 m, and Sentinel-3A/B captures images at a resolution of 300 m. Unfortunately, these spatial resolutions are generally too low for most applications related to spatio-temporal monitoring of land use and natural hazards. On the other hand, sensors such as Landsat ETM+/OLI, SPOT-5, and Sentinel-2A/B offer finer resolutions between 10 m and 60 m, but have longer revisit cycles: Sentinel-2 revisits every 5 days, Landsat every 16 days, and SPOT-5 every 26 days (Fensholt 2004; Wang et al. 2017). The abundance of available satellite images allows data from different sensors to be integrated, enabling researchers to address this trade-off and generate dense time series.

In recent years, significant advances have been made in data fusion within the remote sensing community. However, a thorough examination of the available methods using benchmark datasets is still lacking: evaluations and comparisons are carried out on disparate datasets built from specific sensors. One primary challenge in achieving a consistent evaluation is the scarcity of suitable datasets that encompass a variety of locations, scenarios, and landscapes and, more importantly, incorporate data from new sensors. Few remote sensing datasets suitable for data fusion can be found in the literature. One dataset was released for public use with a study comparing the performance of the well-known spatiotemporal fusion (STF) methods STARFM and ESTARFM, using MODIS and Landsat images from two sites with contrasting spatial and temporal variability (Emelyanova et al. 2013); it has since been used to test the performance of newly developed STF methods (Liu et al. 2019; Zhao et al. 2018; Song et al. 2018; Cheng et al. 2017). Li et al. (2020) published another public dataset for spatiotemporal models, consisting of MODIS and Landsat pairs.

The primary concern with these datasets is their exclusive reliance on MODIS and Landsat data, despite the availability of other satellite images with varying spatial and temporal resolutions and longer remaining operational lifetimes; this makes it difficult to understand how well fusion models perform when integrating imagery from other sensors. Furthermore, available benchmark datasets have other shortcomings, such as a limited number of paired images and a lack of image diversity, which make fair comparisons difficult.

Keeping benchmark datasets updated with new satellite missions is essential for the development and assessment of fusion methods. As satellite technology advances, new sensors introduce higher spectral, spatial, and radiometric capabilities, offering richer data that can significantly improve fusion model performance. Failing to adapt to these advancements risks rendering fusion techniques obsolete and limits their applicability in real-world scenarios. Moreover, the meticulous preparation of benchmark datasets is a critical factor influencing the quality and reliability of fusion results. Ensuring consistency between sensors by minimizing spectral, spatial, and temporal discrepancies allows fusion models to produce more precise and meaningful outputs. Proper dataset curation also enhances model generalizability, ensuring that results are not biased toward a specific sensor combination but are applicable across different satellite missions and environmental conditions. Testing spatiotemporal fusion methods with European sensors is of paramount importance for advancing remote sensing and geospatial analysis, ensuring the continuity of data and products as older satellites approach the end of their operational lives. By systematically testing these methods on different datasets, researchers can assess their accuracy, efficiency, and robustness in capturing temporal and spatial details in images captured by different sensors.

STF methods aim to merge two or more data sources with similar spectral ranges to provide time series with both high spatial and high temporal resolution. In most cases, STF methods integrate data from one sensor with high spatial but low temporal resolution (HSLT) and data from another sensor with low spatial but high temporal resolution (LSHT) (Tan et al. 2018). Despite the diversity of STF methods proposed in the literature, each with its own advantages in prediction accuracy, computing efficiency, and input data requirements, few studies have compared these methods (Han and Zhao 2019; Chen et al. 2015; Liu et al. 2019; Zhou et al. 2021). Moreover, to our knowledge, there is no comparative study of STF methods using the new European satellites (Sentinel-2 and Sentinel-3 OLCI). In fact, most existing methods are designed mainly to fuse MODIS and Landsat images, even though those missions will end in a few years. It is therefore paramount to test fusion methods not only in different scenarios, but also with different sensors and satellites.

In this context, we contribute to the scientific community by publishing a new benchmark remote sensing dataset based on Sentinel-2 and Sentinel-3 images. The proposed dataset has four key characteristics that differentiate it from existing datasets. First, it is the first in the literature to bring the new European sensors to STF algorithms, offering a different temporal resolution as well as higher spatial and spectral resolution. Second, the dataset is large-scale, not only in spatial coverage but also in the total number of paired images. Third, it provides a high spatial resolution ratio between the coarse and fine images. Fourth, it includes several ecosystems, landscapes, weather conditions, and degrees of temporal and spatial variance. Additionally, we present experimental results comparing five of the most commonly used STF methods in the literature, selected on the basis of source code availability. These STF methods were evaluated both quantitatively and qualitatively using the new benchmark dataset.

New benchmark dataset description

Data sources

The Copernicus Program has emerged as one of the leading space data providers worldwide, offering free and unrestricted access to satellite data acquired by the Sentinel constellation. The primary objective of the Sentinel program is to guarantee seamless data continuity by replacing Earth observation missions that have been retired or are approaching the end of their operational lifespan, thereby preventing disruptions in ongoing studies. Specifically, users can freely download Sentinel-2 Level-1C and Sentinel-3 OLCI Level-1B images from the Copernicus Open Access Hub website https://scihub.copernicus.eu/.

Sentinel-2 Level-1C

Sentinel-2 is a satellite imaging mission of the European Space Agency (ESA) designed to monitor changes in land surface conditions. The first satellite, Sentinel-2A, was launched on 23 June 2015, and was joined by its counterpart, Sentinel-2B, in March 2017 to increase the temporal frequency (Hagolle et al. 2015; Drusch et al. 2012). These twin satellites operate simultaneously, flying in the same orbit but phased at \(180^{\circ }\), and now routinely deliver data. The Sentinel-2 mission systematically acquires data over land and coastal areas within a latitude band spanning from \(56^{\circ }\) south (Isla Hornos, Cape Horn, South America) to \(82.8^{\circ }\) north (above Greenland). The primary goal of the mission is to ensure the continuity and improvement of multispectral imagery previously provided by missions such as Landsat and SPOT. Sentinel-2 data can be used in various applications, such as climate change analysis, emergency management, security, and land monitoring, including assessments of vegetation, water cover, soil, and coastal areas. Compared to the Landsat sensor, the Sentinel-2 sensor covers a wider field of view, spanning 290 km as opposed to 185 km, and has been widely employed for global monitoring (Wang et al. 2016; Drusch et al. 2012). Sentinel-2 images offer spectral information across 13 bands spanning the visible, near-infrared (NIR), and shortwave infrared (SWIR) wavelengths, with four bands at 10 m, six bands at 20 m, and three bands at 60 m spatial resolution (Du et al. 2016).

Sentinel-3 OLCI Level-1B

Sentinel-3 is another European Space Agency (ESA) mission within the Copernicus program, operated jointly with EUMETSAT. Its purpose is to provide ocean, land, and atmospheric services. The current mission consists of two satellites: Sentinel-3A, launched on 16 February 2016, and Sentinel-3B, launched on 25 April 2018. Two additional satellites, Sentinel-3C and Sentinel-3D, are planned for future launches (Verhoef and Bach 2012; Cazzaniga et al. 2019). These twin satellites are in a near-polar, sun-synchronous orbit, following the same orbital path but with a phasing difference of \(140^\circ \). Sentinel-3 carries a dual-view conical scanning instrument called the Sea and Land Surface Temperature Radiometer (SLSTR), a dual-frequency SAR altimeter (SRAL), a Microwave Radiometer (MWR), and the Ocean and Land Color Instrument (OLCI), a push-broom imaging spectrometer (Prikaziuk et al. 2021). The main objective of the Sentinel-3 mission is to ensure the continuity of medium-resolution sensors such as MERIS and VGT flown on ENVISAT and SPOT. The mission provides sea surface temperature measurements of comparable quality to those obtained by the Advanced Along Track Scanning Radiometer (AATSR) on ENVISAT, and delivers global coverage of ocean color, water quality, ocean surface, and land ice/snow surface temperature products. The Sentinel-3 OLCI sensor captures data in 21 spectral bands ranging from the visible to the near-infrared (400 to 1020 nm). It has a spatial resolution of 300 m and a temporal resolution of less than 1.4 days (Donlon et al. 2012). Each of the 21 bands can be programmed in terms of position and width (Donlon et al. 2012).

Fig. 1 The geographic locations of the sites included in the dataset

Table 1 Description of Sentinel-2 and Sentinel-3 bands

Sites description

We generated a benchmark dataset that includes 15 sites spanning diverse ecosystems and geographic locations, with a total of 689 Sentinel-2/Sentinel-3 image pairs. These sites were carefully selected to introduce a variety of challenges for fusion algorithms, ranging from rapid environmental changes to urban expansion and coastal ecosystem dynamics. Figure 1 shows the locations of the dataset sites on the world map. Each site consists of Sentinel-2 and Sentinel-3 image pairs, with Sentinel-3 providing 16 bands at a spatial resolution of 300 m, while Sentinel-2 offers 10 bands at spatial resolutions of 10 m, 20 m, and 60 m. The details of the bands are listed in Table 1. By including multiple bands and diverse geographic conditions, this dataset allows rigorous testing of STF methods, and the inclusion of sites with different environmental conditions makes it easier to identify limitations and failure modes in STF techniques. Understanding these constraints is crucial for guiding the development of more adaptive and generalizable fusion methods.

The site selection was based on several key factors. First, the sites are distributed across multiple continents, covering North America, South America, Africa, Europe, Asia, and Australia, ensuring global representation. The dataset also incorporates a wide range of climate zones and ecosystems, including arid deserts, temperate forests, tropical rainforests, coastal zones, and high-altitude regions, allowing STF models to be evaluated under different environmental conditions. More details about the sites are presented in Annex A. Another important consideration was the dynamic nature of land cover changes at each site. Some locations were chosen due to their susceptibility to rapid transformations, such as wildfire-prone landscapes, expanding urban areas, and vulnerable coastal ecosystems. Additionally, cloud-free data availability was a critical factor in site selection, ensuring that Sentinel-2 and Sentinel-3 observations were consistently available for each location. Lastly, each site presents challenges for STF algorithms, such as seasonal vegetation changes, urban development, deforestation, and atmospheric disturbances like sandstorms. The dataset is designed to serve as a comprehensive benchmark for evaluating STF methods across diverse landscapes. Table 2 summarizes the locations and key characteristics of each site.

Fig. 2 Example pairs S2/S3 from the Waterbank dataset (NIR-Red-Blue color composite)

For clarity, we provide a detailed description of three representative sites that illustrate different STF challenges: Waterbank in Australia, Neom in Saudi Arabia, and Maspalomas in Spain. Details about other sites are presented in Annex A.

Waterbank site

The reason for selecting this site is to evaluate the performance of various STF methods in capturing rapid changes caused by natural processes such as forest fires. Waterbank (17.706\(^{\circ }\) S, 122.7259\(^{\circ }\) E) is a rural locality in northern Western Australia. The region has a hot semi-arid climate with two distinct seasons. The wet season typically begins around December and ends around June, during which Waterbank receives an average of approximately 580.2 mm of rainfall. In contrast, during the dry season from July to November, total rainfall is less than 29.1 mm. In February, the peak of the wet season, it rains on an average of 11.5 days. Waterbank has even experienced record-breaking rainfall of 476.6 mm in a single day during January, and January also holds the record for the wettest month, with 910.8 mm of rain. Regarding temperatures, during the wet season the average daytime temperature ranges between 29.2 and 34.3 \(^{\circ }\)C, while the overnight temperature ranges from 15.3 to 26.4 \(^{\circ }\)C. In the dry season, the average daytime temperature varies from 28.8 to 33.5 \(^{\circ }\)C, while the nighttime temperature ranges from 13.6 to 25.0 \(^{\circ }\)C. During this period, vegetation becomes dry and highly susceptible to ignition, resulting in a significant fuel load.

Northern Australia records one of the highest frequencies of early-season fires. Many land managers carry out prescribed burns to remove grasses that could otherwise fuel more destructive wildfires later in the dry season. As a result, the area experiences frequent fires throughout the year, leading to rapid changes in vegetation patterns. This site therefore serves as an ideal testing ground for evaluating the accuracy of STF algorithms in a heterogeneous landscape characterized by sudden changes. We collected a total of 38 cloud-free image pairs from Sentinel-2/Sentinel-3 OLCI, spanning from 09 January 2019 to 14 November 2020 and covering an area of 324 km\(^{2}\). Figure 2 displays four pairs of Sentinel-2/Sentinel-3 images, illustrating the pronounced and abrupt changes in the area resulting from fire events, accompanied by gradual changes in vegetation.

Fig. 3 Example pairs S2/S3 from the Neom City dataset (NIR-Red-Blue color composite)

Fig. 4 Example pairs S2/S3 from the Gran Canaria dataset (NIR-Red-Blue color composite)

Table 2 Description of the sites of the dataset

Neom City site

The objective of this site is to evaluate the precision of STF algorithms in capturing urban changes within a desert region. Neom City, located in the northwest of the Kingdom of Saudi Arabia on the Red Sea coast in Tabouk province (28.102\(^{\circ }\) N, 35.207\(^{\circ }\) E), is one of the world's newest smart cities, constructed from scratch as part of Saudi Arabia's Vision 2030 initiative. The climate in this location is relatively cooler than typical temperatures across the Gulf Cooperation Council, owing to the surrounding mountains and cool winds from the Red Sea. January is the coldest month, with temperatures ranging from 10 to 21 \(^{\circ }\)C, while August is the hottest, with temperatures between 17 and 40 \(^{\circ }\)C. The area experiences high sun exposure throughout the year, averaging 9 to 11 hours per day, and has only a few cloudy days, resulting in a large number of cloud-free images available for analysis. The site exhibits a gradual transformation in land cover due to extensive urban development, including the construction of buildings and infrastructure, in addition to phenological changes in both the land and coastal areas. This dataset comprises 65 cloud-free image pairs from Sentinel-2/Sentinel-3 OLCI, captured between 14 March 2018 and 18 December 2020, and covers 144 km\(^{2}\). These images document the construction progress of various projects, such as a helipad and a golf course, among the initial development endeavors. Figure 3 displays four pairs of images from this site, providing visual evidence of the construction progress within Neom City.

Maspalomas site

The primary objective of this site is to evaluate the precision of STF methods in coastal and vulnerable ecosystems. The Canary Islands, known for their exceptional biodiversity (Ibarrola-Ulzurrun et al. 2017), include Maspalomas (27.760\(^{\circ }\) N, 15.587\(^{\circ }\) W), a natural reserve located in the southern part of Gran Canaria Island. Maspalomas has a subtropical desert climate characterized by dry, hot summers and mild winters. The average annual temperature ranges from approximately 20 to 25 \(^{\circ }\)C. Winter sees an average rainfall of 35 mm, and the dry period extends for 10 months, with average highs varying between 21 and 23 \(^{\circ }\)C; warm summers are marked by average highs ranging from 26 to 29 \(^{\circ }\)C. Maspalomas hosts several ecosystems, including a coastal dune system with significant tourist activity, a terrestrial ecosystem, and an inner-water ecosystem known as the Maspalomas lagoon. The region is also prone to sandstorms at various times of the year. This presents a challenging scenario for STF algorithms due to the complex interplay between different ecosystems, vulnerable areas, and human activities. The dataset for this site comprises 36 cloud-free image pairs from Sentinel-2/Sentinel-3 OLCI, covering the period from 06 January 2019 to 11 December 2020 and an area of 81 km\(^{2}\). Figure 4 presents four pairs of images, highlighting the striking heterogeneity of the landscapes surrounding this site. Table 2 presents the specifications of each site.

Fig. 5 Flowchart of the dataset preprocessing

Image preprocessing

To create the dataset, the first step in the preprocessing workflow is atmospheric correction, which plays a critical role in ensuring the accuracy and consistency of surface reflectance values. Atmospheric correction removes effects such as scattering and absorption caused by gases, aerosols, and water vapor, thereby standardizing the data across acquisition times and sensor platforms. While several atmospheric correction algorithms are available in the literature, only a few are designed to work with Sentinel-2 Level-1C and Sentinel-3 OLCI Level-1B data. For this study, atmospheric correction was performed using the Image Correction for Atmospheric Effects (iCOR) algorithm (Renosh et al. 2020). Unlike many existing datasets that either lack atmospheric correction or apply different algorithms to the fine and coarse images, leading to inconsistencies, we ensured a uniform correction process across both sensors. This approach minimizes potential biases and enhances the reliability of the dataset for fusion applications.

iCOR is capable of processing satellite images collected over various types of water and land surfaces, and it involves multiple processing steps. First, it identifies water and land pixels by applying a band threshold. The aerosol optical thickness (AOT) is then estimated from the land pixels and extrapolated over the water. Next, an adjacency correction is applied using the SIMilarity Environmental Correction (SIMEC) approach (Sterckx et al. 2015). Finally, iCOR uses the solar and view angles (sun zenith angle (SZA), view zenith angle (VZA), and relative azimuth angle (RAA)), along with a Digital Elevation Model (DEM), to perform the atmospheric correction by referencing MODTRAN5 Look-Up-Tables (LUTs) (Berk et al. 2006). The iCOR algorithm is available as a plugin in the SNAP software.

Subsequently, the Sentinel-3 data was reprojected to the UTM/WGS84 projection to match the projection of the Sentinel-2 data. The co-registration process involved two steps: co-registration of the Sentinel-2 and Sentinel-3 time series, and co-registration of the image pairs. While some datasets omit co-registration for simplicity, we implemented a robust two-step process to ensure precise spatial alignment, a critical factor in reducing fusion errors. In the first step, ENVI 5.6.1 software was used to co-register a single pair of Sentinel-2/Sentinel-3 images. Each of these images was then used to co-register its entire time series using the AROSICS software (Scheffler et al. 2017), which employs cross-correlation in Fourier space for accurate co-registration. This step is particularly crucial because geolocation misalignments can occur even within a single-sensor time series (Rufin et al. 2020); without proper time series co-registration, these errors would significantly degrade fusion quality, especially when generating temporally consistent image series. Finally, the Sentinel-3 OLCI images were resampled using the nearest neighbor method to match the spatial resolution of the Sentinel-2 data. Additionally, we preserved all spectral bands, unlike some datasets that reduce bands to optimize storage or computational efficiency; retaining the full spectral information enhances the dataset's versatility for future applications, including spectral-based analyses and machine learning approaches in remote sensing.

The preprocessing workflow is illustrated in Fig. 5. The generated dataset is organized by site and date and is freely available at Boumahdi et al. (2023). Each folder contains a preprocessed coarse/fine image pair. The dataset is provided in the WGS 84 / UTM zone 48N projection, in Tag Image File Format (TIFF).
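As an illustration, the final resampling step can be reproduced with standard open-source tooling. The following is a minimal sketch, assuming iCOR-corrected GeoTIFFs on disk; the file names are placeholders, not part of the released dataset:

```python
# Reproject a Sentinel-3 OLCI raster onto the Sentinel-2 grid with
# nearest-neighbor resampling, mirroring the last preprocessing step.
import numpy as np
import rasterio
from rasterio.warp import reproject, Resampling

with rasterio.open("S3_coarse.tif") as coarse, rasterio.open("S2_fine.tif") as fine:
    out = np.zeros((coarse.count, fine.height, fine.width), dtype=np.float32)
    reproject(
        source=coarse.read().astype(np.float32),
        destination=out,
        src_transform=coarse.transform, src_crs=coarse.crs,
        dst_transform=fine.transform, dst_crs=fine.crs,  # UTM/WGS84 grid of Sentinel-2
        resampling=Resampling.nearest,                   # preserves original radiometry
    )
```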

Comparison of spatiotemporal fusion methods using Sentinel-2 and Sentinel-3 OLCI

We performed a benchmark analysis on our dataset, assessing several representative STF methods. The objective was to provide a thorough examination of current state-of-the-art performance, specifically in the context of Sentinel imagery. The insights obtained from this benchmarking exercise not only improve our current understanding, but also lay the foundation for future work in this field.

STARFM

The Spatial and Temporal Adaptive Reflectance Fusion Method (STARFM) is one of the first STF methods in the literature (Gao et al. 2006). It is widely used and has become the basis for the development of many other methods. STARFM is the first weighted function-based method; it requires one fine-resolution image and a coarse-resolution image at the base date \(t_{1}\), as well as a coarse-resolution image at the prediction date \(t_{2}\). The algorithm assumes a uniform temporal change across all classes in the coarse and fine images, biased by a constant error. A moving window is used to identify spectrally similar neighbor pixels in the fine image at \(t_{1}\). When the coarse pixel is homogeneous, the spectral changes between \(t_{1}\) and \(t_{2}\) are consistent, so they are added directly to the fine image at \(t_{1}\). When the coarse pixel is mixed, however, a weight function is computed from the neighboring similar pixels within the moving window. The weight function determines the contribution of each neighboring pixel to the reflectance estimate of the central pixel: the spectral difference, the temporal difference, and the spatial Euclidean distance between the neighbor and the central pixel are combined into a single distance that sets the neighbor's weight. Finally, the predicted image at \(t_{2}\) is generated using the calculated weights.
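The core of this weighting can be sketched for a single central pixel as follows. This is a minimal illustration, assuming the coarse image has already been resampled to the fine grid; the function and variable names are ours, not from the original implementation:

```python
import numpy as np

def starfm_pixel(fine_t1, coarse_t1, coarse_t2, center, neighbors, half_win=25):
    """Weighted STARFM-style prediction of one fine pixel at t2 from
    spectrally similar neighbors (list of (row, col) indices in the window)."""
    ci, cj = center
    weights, preds = [], []
    for i, j in neighbors:
        S = abs(fine_t1[i, j] - coarse_t1[i, j]) + 1e-8    # spectral difference
        T = abs(coarse_t1[i, j] - coarse_t2[i, j]) + 1e-8  # temporal difference
        d = 1.0 + np.hypot(i - ci, j - cj) / half_win      # relative spatial distance
        weights.append(1.0 / (S * T * d))                  # combined inverse distance
        # each neighbor votes for its fine value plus the coarse temporal change
        preds.append(fine_t1[i, j] + coarse_t2[i, j] - coarse_t1[i, j])
    w = np.asarray(weights)
    return float(np.dot(w / w.sum(), preds))
```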

ESTARFM

ESTARFM was proposed mainly to enhance the performance of STARFM in heterogeneous landscapes (Zhu et al. 2010). It requires two pairs of coarse-fine images from prior and posterior dates, as well as a coarse image at the prediction date. The two fine images are used to search for similar pixels within a moving window centered on the target pixel, and the spectral similarity between the target pixel and the similar pixels determines their weights. To increase accuracy in heterogeneous areas, conversion coefficients are calculated by linear regression for the similar pixels in the search window. The weight function is then combined with the conversion coefficients to predict the fused image at date \(t_{2}\).
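In essence, the conversion coefficient is the slope of a fine-versus-coarse regression, and the prediction adds the weighted, converted coarse changes to the central fine pixel. A minimal sketch with illustrative names, operating on 1-D arrays over the similar pixels (see Zhu et al. 2010 for the complete formulation):

```python
import numpy as np

def conversion_coefficient(coarse_vals, fine_vals):
    # Slope of the linear regression of fine reflectance on coarse reflectance
    # over the similar pixels of the two base dates.
    slope, _intercept = np.polyfit(coarse_vals, fine_vals, 1)
    return slope

def estarfm_pixel(fine_center_t0, coarse_t0, coarse_tp, v, weights):
    # Central fine pixel at the base date plus the weighted sum of the
    # converted coarse temporal changes of the similar pixels.
    w = weights / weights.sum()
    return float(fine_center_t0 + np.dot(w, v * (coarse_tp - coarse_t0)))
```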

FSDAF

Flexible Spatiotemporal Data Fusion (FSDAF) is a hybrid method presented in Zhu et al. (2016). It requires at least one coarse-fine image pair at the base date \(t_{1}\) and a coarse image at the prediction date \(t_{2}\). FSDAF combines the ideas of weighted function-based and unmixing-based methods with spatial interpolation. First, the fine-resolution image at the base date \(t_{1}\) is used to generate a classification map. The coarse images are then used to estimate the temporal change of each class between \(t_{1}\) and \(t_{2}\). A temporal prediction of the fine-resolution image at \(t_{2}\) is made from these class-level changes, and the residual at each coarse pixel is calculated. A second prediction, called the spatial prediction, is made with a Thin Plate Spline (TPS) interpolator by downscaling the coarse-resolution image at \(t_{2}\) to the size of the fine-resolution image. The residuals of the temporal prediction are then distributed using a weight function based on the TPS prediction and geographical homogeneity. The final step estimates the prediction at \(t_{2}\) by introducing information from the neighborhood.
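The class-level temporal change can be viewed as a linear unmixing problem: each coarse pixel's change is a fraction-weighted mixture of per-class changes. A minimal sketch under that assumption, with illustrative names; the full method adds TPS interpolation and residual distribution:

```python
import numpy as np

def class_level_change(fractions, delta_coarse):
    """Solve fractions @ delta_class ~= delta_coarse in the least-squares sense.

    fractions:    (n_coarse_pixels, n_classes) class fractions per coarse pixel.
    delta_coarse: (n_coarse_pixels,) coarse change C(t2) - C(t1) for one band.
    Returns the estimated temporal change of each class between t1 and t2.
    """
    delta_class, *_ = np.linalg.lstsq(fractions, delta_coarse, rcond=None)
    return delta_class
```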

Fit-FC

Fit-FC is another spatiotemporal fusion method, presented in Wang and Atkinson (2018) with the main objective of fusing Sentinel-2 MSI and Sentinel-3 OLCI images. Fit-FC is a linear regression-based method that requires one pair of coarse-fine images at the base date \(t_{1}\) and a coarse image at the prediction date \(t_{2}\). The first step fits local linear regression models between the two coarse images at \(t_{1}\) and \(t_{2}\) within a moving window; the regression coefficients are assigned to the center pixel of the window and then applied to the fine image at \(t_{1}\). The second step is a spatial filtering (SF) technique that addresses the blocky artifacts that may arise from the previous step: spectrally similar pixels within the moving window define a weight function for each center pixel. The final step is residual compensation, in which the coarse-resolution residuals of the regression fit are downscaled to the size of the fine image using cubic interpolation to avoid the smoothing effect caused by spectrally similar pixels. To preserve spectral information, the downscaled residual is added back to the image generated in the SF prediction step.
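The first (regression fitting) step can be sketched for a single window as follows; a minimal illustration assuming the coarse images have been resampled to the fine grid, with names of our choosing:

```python
import numpy as np

def fit_fc_window(coarse_t1_win, coarse_t2_win, fine_t1_center):
    # Fit C(t2) ~= a * C(t1) + b over the window, then transfer the local
    # coefficients to the fine pixel at the window center.
    a, b = np.polyfit(coarse_t1_win.ravel(), coarse_t2_win.ravel(), 1)
    return a * fine_t1_center + b
```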

Table 3 Pairs of images from the dataset used in the experiments and the time lapse between the prior and prediction dates

SFSDAF

SFSDAF is an enhanced FSDAF method that incorporates sub-pixel class fraction change information (Li et al. 2020). SFSDAF requires one pair of coarse-fine images at a prior date. In the first step, the ISODATA algorithm is used to generate a land cover map from the fine image at the base date. The endmembers are calculated as the average of the fine-resolution image pixels of each class, and a soft classification is then applied to the fine image at date \(t_{1}\). In the second step, a coarse-resolution class fraction change image is calculated as the difference between the coarse-resolution class fractions at \(t_{1}\) and \(t_{2}\). This change image is then downscaled to the fine resolution using bicubic and TPS interpolation, and the downscaled image is added to the fine-resolution class fraction at \(t_{1}\) to obtain the fine-resolution class fraction at \(t_{2}\). The final prediction uses a spatial interpolation, similar to that used in FSDAF, to combine the temporal and spatial predictions obtained in the previous steps.
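For instance, the endmember step reduces to a per-class average of fine-resolution spectra. A minimal sketch, assuming a hard classification `labels` of the base-date fine image (illustrative only; SFSDAF itself derives the labels with ISODATA before softening them):

```python
import numpy as np

def class_endmembers(fine_t1, labels, n_classes):
    # fine_t1: (bands, rows, cols) reflectance; labels: (rows, cols) class ids.
    # Endmember of each class = mean fine-resolution spectrum of its pixels.
    spectra = fine_t1.reshape(fine_t1.shape[0], -1)
    flat = labels.ravel()
    return np.stack([spectra[:, flat == c].mean(axis=1) for c in range(n_classes)])
```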

Performance metrics

Several indices were calculated to enable a comprehensive comparison of the performance of the STF methods, covering different aspects of accuracy. All predicted images were compared with the Sentinel-2 image on the prediction date \(t_{2}\) using four indices. The root mean square error (RMSE) (Zhou et al. 2016) provides an overall measure of the difference between the predicted and original reflectance; a lower RMSE indicates higher accuracy. Cross-correlation (CC) (Loncan et al. 2015), with an ideal value of 1, evaluates the geometric distortion present in the predicted image. The structural similarity index (SSIM) (Wang et al. 2004) quantifies the spatial similarity of the overall structure between the predicted and original images; a higher SSIM indicates better results. Finally, the peak signal-to-noise ratio (PSNR) (Hore and Ziou 2010) evaluates the spatial reconstruction quality and provides an overall measure of the bias between the predicted and original images; a higher PSNR indicates a higher-quality spatial reconstruction.
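These four metrics are standard and straightforward to compute per band; a minimal sketch, assuming `pred` and `ref` are 2-D numpy reflectance arrays (the helper names are ours):

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def rmse(pred, ref):
    # Overall reflectance difference; lower is better.
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def cc(pred, ref):
    # Pearson cross-correlation; the ideal value is 1.
    return float(np.corrcoef(pred.ravel(), ref.ravel())[0, 1])

def evaluate(pred, ref):
    data_range = float(ref.max() - ref.min())  # range used by SSIM and PSNR
    return {
        "RMSE": rmse(pred, ref),
        "CC": cc(pred, ref),
        "SSIM": structural_similarity(pred, ref, data_range=data_range),
        "PSNR": peak_signal_noise_ratio(ref, pred, data_range=data_range),
    }
```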

Results and discussion

As described in the previous section, we selected five of the most commonly used STF methods in the literature, each with publicly available code: STARFM (Gao et al. 2006), ESTARFM (Zhu et al. 2010), FSDAF (Zhu et al. 2016), Fit-FC (Wang and Atkinson 2018), and SFSDAF (Li et al. 2020). We compared their performance when fusing Sentinel-2 and Sentinel-3 images from our dataset. The results of this comparison provide a baseline for future studies.

For comparison, six different bands of the Sentinel-2 and Sentinel-3 OLCI sensors were utilized. These bands are the blue band (Band-2/Oa4), the green band (Band-3/Oa6), the red band (Band-4/Oa8), the red-edge 1 band (Band-5/Oa11), the red-edge 3 band (Band-7/Oa16), and the near-infrared (NIR) band (Band 8A/Oa17). Table 3 presents the details of the image pairs used in this work, including the dates and the time difference in days between the previous and prediction dates.

Visual evaluations

A coarse/fine (Sentinel-3/Sentinel-2) pair of images from the base date \(t_{1}\) is used as input for the fusion methods, along with a coarse image at the prediction date \(t_{2}\). The fine image at \(t_{2}\) is used as the ground truth to compare the prediction results.

Figure 6 displays the predicted images and the original Sentinel-2 images (HR0 in the figure) over the Waterbank site on 08 May 2020, the Neom City site on 12 February 2019, and the Maspalomas site on 28 September 2019. The Sentinel-3 OLCI/Sentinel-2 pairs on the base date (Prior LR0 & HR0) were acquired on 05 May 2020, 03 January 2019, and 29 August 2019 for Waterbank, Neom City, and Maspalomas, respectively. These images were chosen for visual evaluation because of their notable spatial and temporal changes between the base and prediction dates. The Waterbank site exhibits large and rapid changes, manifested as fire and agricultural activity in the form of center-pivot irrigation circles. Neom City features significant urban construction and a change in land cover type, transitioning from desert to green areas or water bodies; the color of the buildings in this area has changed from grey to white, indicating ongoing urban development. The Maspalomas site displays changes in vegetation intensity, high heterogeneity, and meteorological events such as sandstorms.

Fig. 6 Sentinel-3 images on the base date (Prior LR0) and prediction date (LR0), Sentinel-2 images observed on the base date (Prior HR0) and prediction date (HR0), and the Sentinel-2-like images. First row: Waterbank site with (NIR, Red, Green) composition; Prior HR0 was acquired on 05 May 2020 and HR0 on 08 May 2020. Second row: Neom City site with (RGB) composition; Prior HR0 was acquired on 03 January 2019 and HR0 on 12 February 2019. Third row: Maspalomas site with (NIR, Red, Green) composition; Prior HR0 was acquired on 29 August 2019 and HR0 on 28 September 2019

For visual examination, we used different color compositions for each site to highlight the features considered in this study, such as fire, burned areas, agriculture fields, and construction. Specifically, we used the NIR-Red-Green color composite for both the Waterbank and Maspalomas sites, while the RGB color composite was used for the Neom city site.

Figure 6 shows that all methods were generally successful in capturing spatial details, despite the large scale factor (300 m to 10 m).

The Waterbank site presents the greatest challenge for all methods. First, there is high heterogeneity due to the center-pivot irrigation: in the Sentinel-3 OLCI images each circle occupies only one or two pixels, while in the Sentinel-2 images more than 30 pixels cover the same area. Second, the methods retain the smoke caused by the fire from the prior HR0, even though it has disappeared by the prediction date. Fit-FC fails to detect the edge of the burned area, and its blurring effect is greater than that of FSDAF. In Neom City, most of the methods are unable to detect spatial changes such as green areas and constructions. At the Maspalomas site, the methods struggle to delineate the coast, especially near the Maspalomas Dunes at the bottom of the images. As a result, the methods fail to detect spatial changes within a single coarse pixel, and the predicted images have spatial details more similar to the base image at \(t_{1}\) than to the actual Sentinel-2 image at \(t_{2}\).

The general visual evaluation of the five methods in Fig. 6 does not show a significant difference in their performance. Figure 7 presents a zoomed-in view of the areas within the boxes shown in Fig. 6. At the Waterbank site, the details and color of the center-pivot irrigation circles changed from the base to the prediction date: in the false color composite, the color changed from green to white, while the size and boundaries of the circles remained the same. This site exhibits high heterogeneity, with each circle occupying one to two pixels in the coarse image and 30 to 60 pixels in the fine image. None of the methods was able to detect the temporal changes accurately; instead, they preserved the details and intensity of the base date, and some methods maintained only the spatial boundaries. Certain methods produced worse results, characterized by a blurring effect around the edges: FSDAF and Fit-FC showed this effect, and FSDAF failed to define all the boundaries of the center-pivot circles. Regarding the burned area at the same location, the fire expanded between the base and prediction dates, increasing the surface of the burned region, visible in green in the false color composite, while the smoke present on the base date had disappeared by the prediction date. In general, the methods generated tones similar to those of the actual image on the prediction date, but all of them retained the smoke in the predicted image. STARFM and SFSDAF predicted the new burned areas with imprecise edges, while ESTARFM performed better in this regard. Fit-FC and FSDAF exhibited lower performance due to their blurring effects, making them less suitable for capturing sudden changes.

Fig. 7 Zoomed-in areas (rectangles in Prior HR0 in Fig. 6) of the Sentinel-2 images at the prior and prediction dates and the images generated by the methods used in the comparison

The zoomed-in view of the Neom City site reveals a change in land cover between the base date (Prior HR0) and the prediction date (HR0), with the emergence of green areas and water basins. Among the methods, ESTARFM demonstrated the best performance, detecting most of these changes, particularly the water basins, which no other method predicted. In contrast, Fit-FC and FSDAF exhibited the worst performance: neither was able to detect the spatial details or temporal changes, and the blurring effect affected all heterogeneous pixels. Moreover, Fit-FC generated an image that more closely resembled the base date image than the actual image.

The zoomed-in view of the Maspalomas site reveals a change in image color due to a sandstorm that occurred on the base date. Additionally, many areas transitioned from light to intense red in the false color composition. STARFM detected some of these changes, although the spectral characteristics of the generated image differed from the actual image. ESTARFM also captured some of the changes and produced images with spectral characteristics closer to the actual image. However, as at the previous site, FSDAF generated blurred images without clear boundaries between the land and water areas; many urban areas near the coast are not discernible due to the blurring effect. Fit-FC, like FSDAF, struggled to capture small details. It is important to note that none of these methods successfully predicted the small spatial details or temporal changes seen in the actual image. This highlights the challenges faced by STF methods in predicting changes in small objects within highly heterogeneous sites, as well as sudden and continuous temporal changes in other cases.

Table 4 Quantitative assessment of experimental results on the Waterbank site on 08 June 2019
Table 5 Quantitative assessment of experimental results on the Neom City site on 12 July 2018
Table 6 Quantitative assessment of experimental results on the Maspalomas site on 28 September 2019

Quantitative evaluations

To better assess the accuracy of the predictions, the metrics CC, RMSE, SSIM, and PSNR were calculated for the three sites, and the results are reported in Tables 4, 5 and 6, with the best results shown in bold. Each table details the quantitative assessment for every band used in the comparison of the aforementioned methods. All the methods were able to fuse the different bands within the bandwidths of the Sentinel-2 and Sentinel-3 OLCI sensors, despite the high spatial resolution ratio between the coarse and fine images. ESTARFM obtained competitive results, with metrics closer to the optimal values than the other methods in all six bands; for example, it significantly improved the SSIM for the Neom City site from 0.4 to 0.88 for band 6, and it consistently achieved higher PSNR values than the other methods across all sites and bands. SFSDAF also showed good results at all three sites, with metrics comparable to those of ESTARFM. The quantitative metrics presented in Tables 4, 5 and 6 align with the visual evaluation discussed in the previous section. In contrast, FSDAF generally exhibited the highest RMSE and the lowest CC.

The variation in performance, in both the visual and quantitative assessments, can be attributed to the number of pairs used in the fusion process. ESTARFM, for instance, requires two pairs of data, from the prior and posterior dates; this additional information, particularly about temporal changes, enables more accurate predictions. However, it is important to note that even with multiple pairs, all methods experienced difficulties at certain pixels, resulting in inaccurate predictions in some cases.

Different pairs were utilized in this comparison, with five dates tested for each site; Table 3 provides the specific dates for each pair. These dates were selected to ensure a diverse range of temporal variations in the input data, with the time difference between the base and prediction dates ranging from 5 to 40 days. Figure 8 shows the quantitative evaluations of the results of these experiments. Note that the CC, RMSE, SSIM, and PSNR values in Fig. 8 correspond to the averages over the six bands listed in Table 1.

Fig. 8 Average RMSE, CC, SSIM and PSNR over the six bands for each method and each of the three sites, computed between the actual and predicted images

The results indicate that the time gap between the base and prediction dates does not have as significant an impact as the image heterogeneity and the temporal changes between the two images. Notably, the results for the Neom City site are consistently similar, which can be attributed to the continuous changes occurring at the site over time. ESTARFM demonstrated promising results compared to the other algorithms. Fit-FC, for example, exhibited a high RMSE for the generated image, indicating lower consistency between the predicted and actual images. In all experiments, ESTARFM outperformed the other methods in terms of PSNR, indicating a lower overall bias between the actual and predicted images. STARFM showed a stronger correlation between the actual and predicted images in most experiments, with correlations exceeding 0.9, whereas Fit-FC exhibited lower correlations, such as 0.65 for the first pair of the Maspalomas site. RMSE values were generally low in the majority of cases, with some outliers, particularly in the second and fourth pairs of the Neom City site, which experienced greater temporal changes during that period (0.11 for Fit-FC and 0.1 for FSDAF).

Moreover, the results show no significant differences among the five STF methods in landscapes with low heterogeneity and low temporal variance. Their performance varied, however, in landscapes with higher heterogeneity and greater temporal variance, highlighting their sensitivity to such conditions. Through visual assessment and quantitative metrics, ESTARFM was found to be more suitable for heterogeneous landscapes and sudden events, accurately capturing the structural details between the generated image and the ground truth, even though it does not always yield the lowest errors. Since the availability of two coarse/fine image pairs for fusion is not always guaranteed, it is worth noting that SFSDAF and STARFM demonstrated competitive results while working with the minimum number of image pairs.

Discussion

STF techniques have been widely explored in the literature, with most benchmark datasets focusing on the fusion of MODIS and Landsat due to their complementary temporal and spatial resolutions. Studies such as Zhu et al. (2010) and subsequent research have extensively evaluated the performance of STF methods using these datasets, where the typical spatial resolution ratio between the coarse and fine resolution images is 8:1. However, this study introduces a benchmark dataset with a significantly different configuration, where the Sentinel-3 to Sentinel-2 resolution ratio is approximately 30:1. This larger ratio presents both a challenge and an opportunity for the development of more advanced STF methods. Previous STF evaluations have primarily relied on MODIS-Landsat datasets, where MODIS provides high temporal frequency but at a coarser 250 m to 1 km resolution, and Landsat offers finer 30 m resolution but with longer revisit intervals of 16 days. While these datasets have been instrumental in advancing STF methods, the Sentinel-2/Sentinel-3 fusion scenario presents a fundamentally different problem. The greater disparity in spatial resolution between Sentinel-3 (300 m) and Sentinel-2 (10-60 m) introduces new interpolation challenges, requiring more sophisticated spatial downscaling strategies to preserve spectral and structural integrity.

The 30:1 resolution disparity between Sentinel-3 and Sentinel-2 presents a greater spatial extrapolation gap than MODIS-Landsat fusion. This larger scale difference increases uncertainty in reconstructed images, particularly in areas with high spatial heterogeneity, such as urban environments, agricultural fields, and regions undergoing rapid land cover changes. While ESTARFM and FSDAF demonstrated robust performance in our dataset, the results indicate that current STF algorithms may not be fully optimized for handling such extreme resolution differences. The larger Sentinel-2/Sentinel-3 gap necessitates the development of enhanced downscaling strategies, possibly incorporating machine learning or physics-informed models to improve spatial coherence. Conversely, this challenge also represents an opportunity to advance STF research by pushing the limits of existing methods. Given the need to reconstruct high-resolution observations from extremely coarse inputs, the Sentinel-2/Sentinel-3 dataset could drive new methodological developments.

Beyond technical challenges, there is a growing need for new benchmark datasets tailored to modern Earth observation missions. MODIS and Landsat, the most widely used sensors in STF research, are expected to be retired in the coming years, necessitating the transition to alternative sensor constellations. The Sentinel series, with its open-access policy and systematic acquisitions, provides an ideal successor for developing next-generation STF methods. However, Sentinel-2 and Sentinel-3 were not originally designed with STF applications in mind, making it essential to establish a robust benchmark dataset to support research in this area. This dataset provides a new standard for evaluating STF methods by incorporating a diverse range of geographic locations, including snow, forests, agricultural lands, deserts, mountainous regions, and coastal environments. The heterogeneity of the selected sites ensures that STF models can be tested under varied environmental conditions, facilitating a broader assessment of their performance.

Previous studies have highlighted that the performance of STF methods is highly dependent on the spatial and temporal variance of the study area. With their datasets and comparisons, Li et al. (2020) and Emelyanova et al. (2013) demonstrated that ESTARFM generally outperformed STARFM in heterogeneous landscapes but was not always the best performer; in regions where spatial variance dominated temporal variance, STARFM exhibited lower errors. Similarly, studies have shown that the choice of fusion algorithm should consider the underlying spatial and temporal characteristics of the dataset rather than assuming a universally optimal method (Zhou et al. 2021). Moving forward, STF research should focus on developing adaptive models capable of handling extreme resolution disparities, integrating multi-source data such as SAR and topographic information, and improving computational efficiency. By establishing a new benchmark dataset, this work paves the way for next-generation fusion techniques that will support high-resolution, temporally continuous Earth observation applications.

Conclusions

In this paper, we have generated a new benchmark dataset, Sentinel-2/Sentinel-3, with the aim of contributing to the scientific community. For the first time in the literature, we present a dataset consisting of pairs from the new European satellites, Sentinel-2/Sentinel-3 OLCI, exhibiting varied temporal, spatial, and spectral characteristics. The dataset encompasses distinct sites from different regions of the world with diverse spatial features, and it offers a wide range of spectral bands, enriching the data options available for various applications. One application of this dataset is the evaluation of spatiotemporal fusion methods: its diverse sites pose a rigorous challenge for testing STF techniques across various land cover types and assessing their ability to handle spatial and temporal changes. In addition, we conducted a comparative analysis of STF methods using European sensors, demonstrating the performance of five of the most widely used STF methods in the literature. The experiments reveal that all STF methods struggle to adequately capture spatial details in highly heterogeneous landscapes, as well as sudden changes such as forest fires. The results obtained from these experiments serve as valuable baselines for future studies and applications involving this dataset.