1 Introduction

The available modern digitization infrastructure, such as smart storage clusters, allows archives, libraries, and film museums to digitize their analog historical film collections. The digitization of analog films involves various stages [7, 18]. One fundamental stage is scanning the original film strips with modern scanners such as the Scanity HDR or the ScanStation 5K [6]. During this process the frame content projected on different film reel types (e.g. 35 mm, 16 mm or 9.5 mm) is scanned and converted into a specific standard video format. However, the scan window used can include additional information such as black or white borders, Sprocket-Holes (SH) or parts of the next and previous frames. This effect is called overscanning [6] (see Fig. 1a, b). Film scanners can be configured to include or exclude the overscan areas. In order to preserve all information projected on the original reels, film historians are interested in digitized versions that also include the overscan areas. However, for automatic film analysis tools such as the classification of cinematographic settings [9], object detection [8] or scene understanding [15], it is important to obtain input frames that contain as much information as possible without additional overscan artifacts. The geometric layout of the SHs is a significant indicator of each individual film reel type (see Fig. 1c, d). Furthermore, it defines the geometric borders of the core frame window, which contains the most significant information of an analog film. The scope of this work is the detection of SHs in digitized analog films [4] that include overscan areas, in order to crop only the original frame window. Historical films pose further challenges which make an automatic detection process non-trivial. They can include various types of damage, such as cracks, scratches, dust or over- and underexposure [10, 25]. Some examples are shown in Fig. 2. These characteristics also affect the appearance of the SHs and make detection challenging for traditional computer vision approaches [11].

Fig. 1.

Demonstration of the overscan effect during the scan process of analog films. a) 16 mm and b) 9.5 mm film reel with scan window (red) and final frame content (crop - yellow). c) 16 mm and d) 9.5 mm film reel geometry and sprocket hole layout. (Color figure online)

Fig. 2.

Demonstration of original historical frames of digitized 9.5 mm and 16 mm film reel types. The reels show different kinds of damage such as scratches, cracks or dust (Efilms project [4]).

One similar exploration [11] focuses on detecting SHs in historical footage using a fully unsupervised approach based on traditional image thresholding in combination with Connected Component Labeling and reports first baseline results. However, to the best of our knowledge there are no further comparable investigations on automatically detecting overscan information in already scanned analog films by using segmentation algorithms.

This paper proposes a method to detect overscan information in digitized analog films by finding SHs. The approach is able to classify film reel types by analyzing the geometry and layout of the detected SHs (see Fig. 1c, d). The introduced approach is based on an adapted and optimized segmentation network in combination with an unsupervised Gaussian Mixture Model (GMM) for fine-grained segmentation results, which are used to calculate the exact frame crop window. To this end, a dataset of 15000 images extracted from the benchmark database MS COCO [16] is created and used as the basis for our exploration. All image samples are enriched with synthetically generated SHs and finally deformed to obtain photo-realistic samples representing the film reel types 16 mm and 9.5 mm. The final tests are carried out on a separate dataset of real-world original 16 mm and 9.5 mm historical film frames from the National Socialist era [4]. Finally, this exploration points out the effectiveness of using low-level features in terms of inference runtime and segmentation performance. The investigation is evaluated using state-of-the-art metrics such as mean Intersection over Union (mIoU) and the Dice coefficient. The contributions of this paper are summarized as follows:

  • We provide a novel Overscan Detection Pipeline to precisely detect and remove overscan areas in digitized analog films.

  • We create and provide a self-generated dataset based on MS COCO including synthetically generated sprocket hole types of two different film reel types (16 mm and 9.5 mm).

  • We provide a fundamental basis for further research on innovative digitization and fine-grained segmentation methods. To this end, we give full access to our source code and self-generated dataset [GitHub-Repository: https://github.com/dahe-cvl/isvc2020_paper].

In Sect. 1, the motivation, the problem description and the main contributions of this work are introduced. Similar and alternative approaches are discussed in Sect. 2. A detailed insight into the methodological background of the explored pipeline is given in Sect. 3. Section 4 describes the results and points out the benefits of our proposed approach. We summarize our investigation with the conclusion in Sect. 5.

2 State-of-the-Art

Digitization and Video Restoration: Exploring automatic digitization mechanisms for historical analog film strips has attracted significant attention in the last decade [1, 25]. The fundamental step is to scan these film reels with smart techniques based on sensitive light sources in combination with an optical system, as demonstrated by Flueckiger et al. [6]. A few scanner systems such as the Scanity HDR are able to recognize overscan areas semi-automatically by registering SHs with dedicated camera systems and the corresponding frame lines (which indicate the split between two consecutive frames). Furthermore, multi-spectral characteristics are analyzed to achieve high-quality scans (e.g. color correction) [6]. However, there is no universal scanner technique which is able to handle every film reel type and detect the exact overscan areas automatically. Deeper automatic video analysis is based on detecting filmographic techniques such as shot detection [25] or shot type interpretation [20, 21] and can be significantly influenced by these areas. Different explorations focus on the restoration of historical films [24, 25]. Yeh et al. [23] and Iizuka and Simo-Serra [12] have published approaches to remove and clean up small cracks or damage artifacts in film frames. The results are obtained by using Generative Adversarial Networks (GAN) or cycle GANs to generate synthetic yet realistic frame datasets for training their models.

Semantic Image Segmentation: There are several traditional computer vision techniques for segmentation, such as active contours, watershed, or mean shift and mode finding, which can be used to find geometries such as sprocket holes in images [22]. Semantic image segmentation is used in different computer vision domains such as object detection, scene understanding or medical imaging [26]. Since 2015 different technologies and mechanisms have been explored in order to detect correlated areas in an image and classify these areas into various categories such as cars, persons or cats [2, 3, 8]. The Fully Convolutional Network (FCN) forms the major basis and still plays a significant role in this research area, with over 14000 citations [17]. Further investigations such as DeepLabV3 [2], DeepLabV3+ [3] or Mask-RCNN [8] have been published and provide novel standard techniques in this domain. Moreover, significant performance improvements with respect to runtime [19] and pixel-based segmentation [14, 26] have been achieved over the last five years. Benchmark datasets such as PASCAL VOC [5] or MS COCO [16] are used to evaluate segmentation algorithms. Traditional segmentation techniques such as image thresholding demonstrate promising results when using Gaussian Mixture Models (GMM), as introduced by Zhao et al. [27] and Karim and Mohamed [13]. However, there are only a few investigations on detecting SHs to remove overscan areas by using unsupervised image histogram-based solutions [11]. To the best of our knowledge there is no investigation on detecting SHs in digitized analog films by using semantic image segmentation approaches.

3 Methodology

Dataset Preparation: Since there is no public dataset available which includes different types of SHs, we created a dataset based on the benchmark dataset MS COCO [16]. To this end, images from the categories person, cat and car are downloaded for the training set. However, the category does not have any significant effect on the further process. In order to obtain meaningful image samples including SHs and the corresponding ground truth masks, a two-step process is defined (see Fig. 3a). In the first step the ground truth masks (binary maps) for the reel types 16 mm and 9.5 mm are generated. The exact geometry and position layout of the SHs are used to create and position the holes (see Fig. 1c, d). Moreover, the position and scaling of the holes vary randomly in order to obtain diverse ground truth masks (see Fig. 3b). In the second step the masks are merged and overlapped with the image samples gathered from the MS COCO dataset. Finally, our self-generated dataset consists of 15000 samples including the ground truth masks and the images with synthetically generated SHs, as demonstrated in Fig. 3b. Compared with real original scanned films, the first version of our dataset displays significant differences in quality. Historical films include scratches and cracks as well as enormous variance in exposure. In a second version, deformation strategies, e.g. image blurring and changes to global contrast and brightness, are used to obtain more realistic and challenging training samples that better match the final test set (see Fig. 3c). The final test set includes 200 extracted frames with manually annotated pixel-based masks showing 16 mm as well as 9.5 mm reel types. The frames are randomly extracted from 10 different digitized analog films of the project Ephemeral Films [4] (see Fig. 3c).
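To make the two-step generation process concrete, the following sketch outlines how such samples could be produced. The hole sizes, positions, the grayscale rendering with bright holes, and the deformation ranges are illustrative assumptions and not the exact parameters of the released dataset:

```python
import cv2
import numpy as np

def make_sh_mask(h, w, reel_type="16mm", rng=np.random):
    """Step 1 sketch: binary ground-truth mask with synthetic sprocket holes.
    Hole geometry is illustrative and randomly jittered for diversity."""
    mask = np.zeros((h, w), dtype=np.uint8)
    scale = rng.uniform(0.9, 1.1)                      # random scaling
    dx, dy = rng.randint(-10, 11, size=2)              # random position jitter
    if reel_type == "16mm":
        # 16 mm layout: one hole near each of the four frame corners
        hw, hh = int(0.06 * w * scale), int(0.08 * h * scale)
        centers = [(int(0.08 * w), int(0.10 * h)), (int(0.92 * w), int(0.10 * h)),
                   (int(0.08 * w), int(0.90 * h)), (int(0.92 * w), int(0.90 * h))]
    else:
        # 9.5 mm layout: one central hole at the top and bottom frame border
        hw, hh = int(0.10 * w * scale), int(0.06 * h * scale)
        centers = [(w // 2, int(0.05 * h)), (w // 2, int(0.95 * h))]
    for cx, cy in centers:
        cx, cy = int(cx + dx), int(cy + dy)
        cv2.rectangle(mask, (cx - hw // 2, cy - hh // 2),
                      (cx + hw // 2, cy + hh // 2), 255, -1)
    return mask

def synthesize_sample(coco_img_bgr, reel_type):
    """Step 2 sketch: merge the mask with an MS COCO image and deform it."""
    gray = cv2.cvtColor(coco_img_bgr, cv2.COLOR_BGR2GRAY)
    mask = make_sh_mask(*gray.shape, reel_type=reel_type)
    img = gray.copy()
    img[mask > 0] = 255                                # holes rendered as bright scanner light
    img = cv2.GaussianBlur(img, (5, 5), 0)             # deformation: blur
    alpha = np.random.uniform(0.7, 1.3)                # global contrast change
    beta = np.random.uniform(-30, 30)                  # global brightness change
    img = cv2.convertScaleAbs(img, alpha=alpha, beta=beta)
    return img, mask
```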

Fig. 3.

Demonstration of the dataset used in this exploration. a) Schematic pipeline for generating synthetic samples b) Randomly selected examples of our self-generated dataset c) Comparison of synthetically generated samples (training/validation set) with real original film frames (test set).

Overscan Segmentation Network: The proposed pipeline consists of three steps: Pre-Processing Module (PrePM), Segmentation Net (SegNet) and Post-Processing Module (PostPM). In the PrePM the input frame is pre-processed by applying standard functions such as grayscale conversion, resizing, zero centering and standardization. The first core part of the pipeline is SegNet, which includes a pre-trained backbone CNN model in combination with a segmentation network head. The modules Connected Component Labeling Analysis (CCL), Gaussian Mixture Model (GMM), Reel Type Classifier (RTC) and Crop Window Calculator (CWC) together form the PostPM, which is the second core part of our pipeline. Figure 4 illustrates a schematic overview of the pipeline used in this investigation during the inference phase.
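As an illustration, the PrePM steps listed above could be realized roughly as in the following sketch; the input size, the per-frame normalization statistics and the three-channel replication for ImageNet-pretrained backbones are assumptions, not a description of the exact implementation:

```python
import cv2
import numpy as np
import torch

def pre_process(frame_bgr, size=(512, 512)):
    """PrePM sketch: grayscale conversion, resizing, zero centering, standardization."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, size, interpolation=cv2.INTER_AREA)
    x = resized.astype(np.float32) / 255.0
    x = (x - x.mean()) / (x.std() + 1e-8)            # per-frame zero centering / standardization
    # replicate to three channels so an ImageNet-pretrained backbone can be reused
    tensor = torch.from_numpy(np.stack([x] * 3, axis=0)).unsqueeze(0)
    return tensor                                     # shape (1, 3, H, W), input to SegNet
```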

Fig. 4.

Schematic overview of the Overscan Detection and Segmentation Pipeline used in this exploration.

The SegNet module generates a binary mask which represents the predicted SH areas. These masks are post-processed in the next stage. The CCL process creates labeled connected components. Furthermore, a filtering process is applied in order to remove outliers such as small falsely predicted areas, which depend on the frame composition. Since precise segmentation masks are needed to calculate the final crop window (CWC), the GMM module is introduced in this pipeline. This module is based on the corresponding histogram features of each predicted hole sub-area. These features are used to predict the optimal threshold Th between background and hole pixels. The output is a precisely thresholded SH in the input frame. Figure 5a illustrates the GMM thresholding process and the effect of combining SegNet and GMM. In order to calculate the final crop window (CWC), the inner points of the holes are extracted (see Fig. 5b). By using these points (e.g. SH-16: 4 inner points, SH-9.5: 2 inner points), the center point related to the positions of the SHs can be calculated. Finally, the frame crop window is generated by specifying a scale factor (e.g. SH-16 mm: 1.37 and SH-9.5 mm: 1.3) and maximizing it towards the borders given by the inner points (see Fig. 5b). The RTC is based on the fine-grained results of the GMM module as well as the calculated hole positions in the frame, using a 3x3 helper grid (see Fig. 5c). This information is used to obtain a precise final classification result. For example, the fields 1, 3, 7 and 9 are related to SH-16, whereas fields 2 and 8 correspond to SH-9.5.
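The GMM refinement can be sketched as follows: for each predicted hole sub-area, a two-component Gaussian mixture is fitted to the pixel intensities and the boundary between the two components serves as threshold Th. The sketch below uses scikit-learn and assumes that hole pixels are brighter than the surrounding background; it is a simplified illustration, not the exact implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_threshold(hole_patch):
    """Fit a 2-component GMM to the intensities of a predicted hole sub-area
    and threshold it into hole (bright) and background pixels."""
    pixels = hole_patch.reshape(-1, 1).astype(np.float64)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(pixels)
    # simple choice for Th: midpoint between the two component means
    th = gmm.means_.ravel().mean()
    refined = (hole_patch > th).astype(np.uint8) * 255
    return refined, th
```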

Fig. 5.

Demonstration of a) the effect of using the unsupervised GMM process in the PostPM, b) inner point P1(x1, y1) creation for SH-16 (top-left) and SH-9.5 (bottom-left) and center point C(x, y) calculation (right), as well as c) the 3x3 helper grid for hole position detection and RTC.
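A compact sketch of the RTC and CWC logic outlined in Fig. 5 is given below; the grid-to-reel-type mapping and the scale factors follow the description above, while the interpretation of the scale factor as a width-to-height ratio and all helper names are illustrative assumptions:

```python
import numpy as np

def classify_reel_type(hole_centers, frame_w, frame_h):
    """RTC sketch: map hole centers onto a 3x3 helper grid (fields 1-9).
    Fields 1, 3, 7, 9 indicate SH-16; fields 2 and 8 indicate SH-9.5."""
    fields = set()
    for cx, cy in hole_centers:
        col = min(int(3 * cx / frame_w), 2)
        row = min(int(3 * cy / frame_h), 2)
        fields.add(row * 3 + col + 1)
    if fields & {1, 3, 7, 9}:
        return "SH-16"
    if fields & {2, 8}:
        return "SH-9.5"
    return "unknown"

def crop_window(inner_points, reel_type):
    """CWC sketch: center point C(x, y) from the inner hole points, then a crop
    window maximized towards the inner points using the reel-specific scale factor."""
    pts = np.asarray(inner_points, dtype=np.float32)
    cx, cy = pts.mean(axis=0)                        # center point C(x, y)
    scale = 1.37 if reel_type == "SH-16" else 1.3    # assumed width/height ratio
    half_h = float(np.min(np.abs(pts[:, 1] - cy)))   # limited by the nearest inner point
    half_w = half_h * scale
    return int(cx - half_w), int(cy - half_h), int(2 * half_w), int(2 * half_h)
```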

4 Experimental Results

The evaluation focus of this investigation is on precise segmentation of the detected holes as well as on runtime performance during the inference phase. Therefore, two different experiments are defined, and the evaluation metrics used are described as follows:

Metrics: The final segmentation results are evaluated by calculating metrics such as mean Intersection over Union (mIoU - Jaccard Index) and the Dice coefficient (F1 score). The model performance is assessed by interpreting training and validation loss curves over a training period of at most 50 epochs. The runtime performance is explored by analyzing the inference time (seconds per processed frame) on both GPU and CPU. Finally, all generated results are compared with state-of-the-art solutions. Experiment 1: The first experiment focuses on evaluating the segmentation performance of the backbone networks MobileNetV2, SqueezeNet, VGG and Resnet101 in combination with the heads FCN and DeepLab. Since the objective is to mask SHs, which are represented by significant edges and distinctive shapes in the given frames, we train the SegNet models in two different variants. In the first variant, low-level features (LF) of the networks are used as input for the segmentation heads, whereas the second variant uses the last feature layers (high-level features, HF) of each individual backbone network. The expectation is that the outputs of low-level layers include more significant features for detecting the holes than the feature vectors of deeper layers, which represent more abstract characteristics of an image. Furthermore, this exploration involves different training strategies such as learning rate reduction, early stopping and custom data augmentation (e.g. random horizontal and vertical flips).
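For reference, the two segmentation metrics used throughout the evaluation can be computed from a predicted probability map and a ground truth mask as in the following sketch; the variable names are illustrative, and the @0.5/@0.8 notation is interpreted here as the binarization threshold applied to the predicted probability map:

```python
import numpy as np

def iou_and_dice(prob_map, gt_mask, threshold=0.5):
    """Binarize the prediction at `threshold` and compute Intersection over Union
    (Jaccard) and Dice coefficient (F1) for the sprocket hole class."""
    pred = prob_map >= threshold
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union > 0 else 1.0
    denom = pred.sum() + gt.sum()
    dice = 2 * inter / denom if denom > 0 else 1.0
    return iou, dice

# mIoU@0.5: average IoU over all test frames at a binarization threshold of 0.5, e.g.
# miou_at_05 = float(np.mean([iou_and_dice(p, g, 0.5)[0] for p, g in zip(preds, gts)]))
```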

Fig. 6.

Demonstration of the segmentation model performances using different backbone networks over 50 training epochs. Validation loss with low-level features (left), validation loss with high-level features (middle) and validation Dice coefficient (right).

The HF features represent the last layer of the feature extractor part of all backbone networks used in this evaluation, for example the features of \(layer30-MaxPool2d\) in VGG16 or \(layer7-bottleneck2\) in Resnet101. The outputs of \(layer4-InvertedResidual\) in MobileNetV2, \(layer5-MaxPool2d\) in SqueezeNet, \(layer4-bottleneck2\) in Resnet101 and \(layer4-MaxPool2d\) in VGG16 are used as LF (referring to the PyTorch implementations). The results of this experiment demonstrate that using LF yields significantly better segmentation results in all model variants than using the last feature layers of the evaluated architectures. Resnet101 (LF) in combination with the FCN segmentation head achieves the best results with an mIoU of 0.9407 at a threshold of 0.50 and a Dice score of 0.9684 (@0.5). Furthermore, VGG16 (LF) in combination with the DeepLab head achieves an mIoU of 0.9371 (@0.5) and a Dice score of 0.9660 (@0.5). The runtime-optimized and compressed model MobileNetV2+DeepLab reaches scores of \(mIoU@0.5=0.9096\) and \(Dice@0.5=0.9504\). Averaged over all explored combinations, using LF yields a significant performance increase of \(\varDelta mIoU@0.8=0.23\) (23%). A further evaluation compares the state-of-the-art (SOTA) solutions DeepLabV3 and FCN with the SegNet models trained on our self-generated synthetic dataset. The SOTA models are pre-trained on Pascal VOC (21 classes) and adapted to distinguish only between SH and background pixels by retraining the last convolutional layer of the networks. The model variants newly trained on our introduced dataset outperform the SOTA solutions. The effectiveness of the networks trained with our dataset is also demonstrated by the comparison with the histogram-based fully unsupervised solution [11]: the combination Resnet101+FCN+LF displays a performance increase of \(\varDelta mIoU@0.5=0.12\) (12%). An overview of all segmentation results is summarized in Table 1. The validation loss histories of the experiments using LF as well as HF are visualized in Fig. 6 (left and middle). The right plot in Fig. 6 shows the validation Dice coefficients over the full training time of 50 epochs. Experiment 2: Since digitized analog films vary significantly in terms of resolution and film length, this experiment focuses on the runtime performance of our pipeline by comparing compressed architectures such as MobileNetV2 or SqueezeNet with very deep and large networks like VGG16 and Resnet101. For this experiment the model variants Resnet101+FCN+LF, VGG16+DL+LF, SqueezeNet+FCN+LF and MobileNetV2+DeepLab+LF are selected, since they show the best mIoU and Dice coefficient scores with respect to the segmentation head used (see Table 1). In order to extract the exact frame window and remove only overscan information, precise segmentation of the holes is needed. Therefore, this experiment also explores the effect of using the additionally introduced GMM module in the PostPM stage. The performance evaluation demonstrates that MobileNetV2+DeepLab+LF and SqueezeNet+FCN+LF outperform the other model combinations with respect to CPU runtime during inference. One frame requires a processing time of about 100 ms (CPU), so both models are able to process about 10 frames per second. However, the Resnet101+FCN+GMM+LF combination reaches similar scores as well as the best segmentation results. Table 2 summarizes the results of the performance experiments.
The second part of experiment 2 explores the segmentation results obtained by using the additional GMM module in the PostPM stage. Table 3 gives an overview of the achieved results and demonstrates the effect of using the GMM module. The variant MobileNetV2+DeepLab+LF displays an increase of the mIoU score (@0.5) of 2.5%. Table 3 points out that the holes are segmented significantly better when using the additional GMM module. Moreover, in terms of runtime performance as well as segmentation results, the combination Resnet101+FCN+GMM+LF demonstrates the overall best results in this exploration and outperforms state-of-the-art solutions [11] by \(\varDelta mIoU@0.5=12.76\%\) and \(\varDelta Dice@0.5=8\%\). Figure 7 shows the influence of this module on two example frames. The first image of each example shows two SH crops of the original frame. In the next column, the predicted masks without the GMM module are visualized, showing that the holes are not segmented precisely. The final post-processed masks can be seen in the last column. Since the GMM is based on histogram features, the pipeline is able to generate an accurate mask which is used for calculating the final crop window in the last stage of our approach.
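To illustrate how the LF model variants of experiment 1 are constructed, a minimal PyTorch sketch is given below; the tapped layer (\(layer1\) of the torchvision ResNet101), the channel count and the helper names are illustrative assumptions and do not necessarily correspond to the exact layer indices reported above:

```python
import torch
from torchvision.models import resnet101
from torchvision.models._utils import IntermediateLayerGetter
from torchvision.models.segmentation.fcn import FCNHead

# Tap an early (low-level) backbone layer and attach an FCN head with two output
# classes (background vs. sprocket hole). The choice of "layer1" is illustrative.
backbone = IntermediateLayerGetter(
    resnet101(weights="IMAGENET1K_V1"),        # ImageNet-pretrained backbone
    return_layers={"layer1": "out"})
head = FCNHead(in_channels=256, channels=2)    # layer1 of ResNet101 yields 256 channels

def segment(x):
    feats = backbone(x)["out"]                 # low-level feature map
    logits = head(feats)                       # per-pixel class scores
    # upsample to the input resolution to obtain the full-size SH mask
    logits = torch.nn.functional.interpolate(
        logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)                # 0 = background, 1 = sprocket hole
```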

Table 1. This table illustrates the segmentation results of experiment 1 on the final test set including real-world historical film frames. Reported are the mean Intersection over Union (mIoU@0.5 and @0.8) as well as the Dice coefficient (Dice@0.5).
Table 2. Comparison of runtime performance metrics during inference. The values display the execution time in seconds per frame as well as the number of corresponding model parameters when using low-level features (LF) and high-level features (HF).
Table 3. Comparison of the final test results with and without using the additional GMM module.
Fig. 7.

This figure demonstrates two examples of fine-grained segmentation in (a) and (b) and compares the results with and without using the additional GMM module.

5 Conclusion

This investigation proposes a novel Overscan Detection pipeline to obtain fine-grained masks of SHs in real-world historical footage. The approach is based on a supervised segmentation network for pre-masking the holes, which is trained on a self-generated dataset including synthetically created SHs, in combination with an unsupervised GMM module based on histogram features. We conclude that using low-level features of the backbone networks improves segmentation performance across the evaluated combinations by an average of \(\varDelta mIoU@0.8=23\%\). Furthermore, this investigation shows the effectiveness of combining a GMM with a segmentation network and demonstrates a performance increase of \(\varDelta mIoU@0.5=1.72\%\) and \(\varDelta Dice@0.5=1\%\). Moreover, the best model combination Resnet101+FCN+LF+GMM indicates a segmentation performance increase of \(\varDelta mIoU@0.5=12.7\%\) and \(\varDelta Dice@0.5=8\%\) compared to the state-of-the-art solution [11]. Finally, we provide a fundamental basis for further research by giving full access to our source code as well as the generated dataset.