Abstract
Automatic video analysis is explored in order to understand and interpret real-world scenes without manual effort. For digitized historical analog films, this process is affected by the video quality, the video composition and scan artifacts known as overscan. The main aim of this paper is to find the Sprocket Holes (SH) in digitized analog film frames in order to drop unwanted overscan areas and extract the correctly scaled final frame content, which contains the most significant frame information. The outcome of this investigation is a precise overscan detection pipeline which combines the advantages of supervised segmentation networks such as DeepLabV3 with an unsupervised Gaussian Mixture Model for fine-grained segmentation based on histogram features. Furthermore, this exploration demonstrates the strength of using low-level backbone features in combination with low-cost CNN architectures such as SqueezeNet in terms of inference runtime and segmentation performance. Moreover, a pipeline for creating photo-realistic frame samples to build a self-generated dataset is introduced and used in the training and validation phases. This dataset consists of 15000 image-mask pairs including synthetically created and deformed SHs that respect the exact film reel layout geometry. The approach is evaluated on real-world historical film frames including original SHs and deformations such as scratches, cracks or wet splices. It reaches a Mean Intersection over Union (mIoU) score of 0.9509 (@threshold: 0.5) as well as a Dice coefficient of 0.974 (@threshold: 0.5) and outperforms state-of-the-art solutions. Finally, we provide full access to our source code as well as the self-generated dataset in order to promote further research on digitized analog film analysis and fine-grained object segmentation.
1 Introduction
The available modern digitization infrastructure, such as smart storage clusters, allows archives, libraries and film museums to digitize their historical analog film collections. The digitization process of analog films involves various stages [7, 18]. One fundamental stage is to scan the original film strips by using modern scanners such as the Scanity HDR or the ScanStation 5K [6]. During this process the frame content projected on different film reel types (e.g. 35 mm, 16 mm or 9.5 mm) is scanned and converted into a specific standard video format. However, the scan window used can include additional information such as black or white borders, Sprocket Holes (SH) or parts of the next and previous frames. This effect is called overscanning [6] (see Fig. 1a, b). Film scanners can be configured to scan the strips with or without the overscan areas. In order to preserve all information projected on the original reels, film historians are interested in digitized versions that also include the overscan areas. However, for automatic film analysis tools such as classification of cinematographic settings [9], object detection [8] or scene understanding [15], it is important to get input frames that include as much information as possible without showing additional overscan artifacts. The geometric layout of the SHs is a significant indicator for each individual film reel type (see Fig. 1c, d). Furthermore, it defines the geometric borders of the core frame window, which includes the most significant information of an analog film. The scope of this work is to detect SHs in digitized analog films [4] including overscan areas in order to crop only the original frame window. Historical films raise further challenges which make an automatic detection process non-trivial. They can include different damages, such as cracks, scratches, dust or over- and underexposure [10, 25]. Some examples are demonstrated in Fig. 2.
These characteristics also affect the appearance of the SHs and make them challenging for traditional computer vision approaches [11].
One similar exploration focuses on detecting SHs in historical footage by using a fully unsupervised approach based on traditional image thresholding in combination with Connected Component Labeling and illustrates first baseline results [11]. However, to the best of our knowledge there are no further comparable investigations on automatically detecting overscan information in already scanned analog films by using segmentation algorithms.
This paper proposes a method to detect overscan information in digitized analog films by finding SHs. The approach is able to classify film reel types by exploring the geometry and layout of the detected SHs (see Fig. 1c, d). It is based on an adapted and optimized segmentation network in combination with an unsupervised Gaussian Mixture Model (GMM) for fine-grained segmentation results in order to calculate the exact frame crop window. For this purpose, a dataset including 15000 images extracted from the benchmark database MS COCO [16] is created and used as the basis for our exploration. All image samples are enriched with synthetically generated SHs and are finally deformed to obtain photo-realistic samples representing the film reel types 16 mm and 9.5 mm. The final tests are done with a separate dataset including real-world and original 16 mm and 9.5 mm historical film frames related to the time of National Socialism [4]. Finally, this exploration points out the effectiveness of using low-level features in terms of inference runtime and segmentation performance. This investigation is evaluated by using state-of-the-art metrics such as mean Intersection over Union (mIoU) and Dice coefficient. The contribution of this paper is summarized as follows:
- We provide a novel Overscan Detection Pipeline in order to precisely detect and remove overscan areas in digitized analog films.
- We create and provide a self-generated dataset based on MS COCO including synthetically generated sprocket-hole types of two different film reels (16 mm and 9.5 mm).
- We provide a fundamental base for further research on innovative digitization and fine-grained segmentation methods. Therefore we give full access to our source code and self-generated dataset (GitHub repository: https://github.com/dahe-cvl/isvc2020_paper).
In Sect. 1, the motivation, the problem description as well as the main contributions of this work are introduced. Similar and alternative approaches are discussed in Sect. 2. A detailed insight into the methodological background of the explored pipeline is given in Sect. 3. Section 4 describes the results and points out the benefits of our proposed approach. We summarize our investigation with the conclusion in Sect. 5.
2 State-of-the-Art
Digitization and Video Restoration: Exploring automatic digitization mechanisms for historical analog film strips has raised significant attention in the last decade [1, 25]. The fundamental step is to scan these film reels with smart techniques based on sensitive light sources in combination with an optical system, as demonstrated by Flückiger et al. [6]. A few scanner systems such as the Scanity HDR are able to recognize overscan areas semi-automatically by registering SHs with dedicated camera systems and the corresponding frame lines (which indicate the split between two consecutive frames). Furthermore, the multi-spectral characteristics are analyzed to achieve high-quality scans (e.g. color correction) [6]. However, there is no general scanner technique which is able to handle every film reel type and detect the exact overscan areas automatically. Deeper automatic video analysis is based on detecting filmographic techniques such as shot detection [25] or shot type interpretation [20, 21] and can be significantly influenced by these areas. Different explorations focus on the restoration of historical films [24, 25]. Yeh et al. [23] and Iizuka and Simo-Serra [12] have published approaches to remove and clean up small cracks or damage artifacts in film frames. The results are obtained by using Generative Adversarial Networks (GANs) or cycle GANs in order to generate synthetic photo-realistic frame datasets for training their models.
Semantic Image Segmentation: There are several traditional computer vision techniques for segmentation, such as active contours, watershed or mean shift and mode finding, which can be used for finding geometries such as sprocket holes in images [22]. Semantic image segmentation is used in different computer vision domains such as object detection, scene understanding or medical imaging [26]. Since 2015 different technologies and mechanisms have been explored in order to detect correlated areas in an image and classify these areas into various categories such as cars, persons or cats [2, 3, 8]. The Fully Convolutional Network (FCN) forms the major base and plays a significant role in this research area up to now, with over 14000 citations [17]. Further investigations such as DeepLabV3 [2], DeepLabV3+ [3] or Mask-RCNN [8] have been published and provide novel standard techniques in this domain. Moreover, significant performance improvements related to run-time [19] and pixel-based segmentation [14, 26] have been achieved over the last five years. Benchmark datasets such as PASCAL VOC [5] or MS COCO [16] are used to evaluate segmentation algorithms. Traditional segmentation techniques such as image thresholding demonstrate promising results by using Gaussian Mixture Models (GMM), as introduced by Zhao et al. [27] and Kalti and Mahjoub [13]. However, there are only few investigations on detecting SHs to remove overscan areas by using unsupervised image histogram-based solutions [11]. To the best of our knowledge there is no investigation on detecting SHs in digitized analog films by using semantic image segmentation approaches.
3 Methodology
Dataset Preparation: Since there is no public dataset available which includes different types of SHs, we created a dataset based on the benchmark dataset MS COCO [16]. Images from the categories person, cat and car are downloaded for the training set; however, the category does not have any significant effect on the further process. In order to get meaningful image samples including SHs and the corresponding ground truth masks, a two-step process is defined (see Fig. 3a). In the first step the ground truth masks (binary maps) for the reel types 16 mm and 9.5 mm are generated. The exact geometry and position layout of the SHs are used to create and position the holes (see Fig. 1c, d). Moreover, the position and scaling of the holes vary randomly in order to get diverse ground truth masks (see Fig. 3b). In the second step the masks are merged and overlapped with the image samples gathered from the MS COCO dataset. Finally, our self-generated dataset consists of 15000 samples including the ground truth masks and the images with synthetically generated SHs, as demonstrated in Fig. 3b. Compared with real original scanned films, the first version of our dataset displays significant differences in quality. Historical films include scratches and cracks and show enormous variances in exposure. In a second version, deformation strategies, e.g. image blurring and changing the global contrast and brightness, are used to get more reliable and challenging training samples compared to the final test set (see Fig. 3c). The final test set includes 200 extracted frames with manually annotated pixel-based masks showing 16 mm as well as 9.5 mm reel types. The frames are randomly extracted from 10 different digitized analog films from the project Ephemeral Films [4] (see Fig. 3c).
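The two-step mask-generation and merge process described above can be sketched as follows. The hole sizes, anchor positions and jitter range below are illustrative assumptions rather than the exact reel-layout geometry used in the paper, and the function names are ours:

```python
import numpy as np

def generate_sh_mask(h=512, w=512, reel_type="16mm", jitter=4, rng=None):
    """Step 1: create a binary ground-truth mask with synthetic sprocket
    holes. Hole geometry here is a placeholder, not the exact reel layout."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), dtype=np.uint8)
    hole_h, hole_w = h // 10, w // 14              # assumed hole size
    if reel_type == "16mm":
        # four holes near the corners (two per vertical edge)
        anchors = [(h // 6, 0), (h // 6, w - hole_w),
                   (5 * h // 6, 0), (5 * h // 6, w - hole_w)]
    else:  # "9.5mm": one centered hole at the top and bottom edge
        anchors = [(0, w // 2 - hole_w // 2),
                   (h - hole_h, w // 2 - hole_w // 2)]
    for (y, x) in anchors:
        # random positional jitter for mask diversity
        dy, dx = rng.integers(-jitter, jitter + 1, size=2)
        y0 = int(np.clip(y + dy, 0, h - hole_h))
        x0 = int(np.clip(x + dx, 0, w - hole_w))
        mask[y0:y0 + hole_h, x0:x0 + hole_w] = 255
    return mask

def merge_sample(image, mask):
    """Step 2: overlay the white sprocket holes onto a COCO image sample."""
    out = image.copy()
    out[mask > 0] = 255
    return out
```

A real pipeline would additionally apply the deformation strategies (blur, contrast and brightness changes) on top of the merged sample.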
Overscan Segmentation Network: The proposed pipeline consists of three steps: Pre-Processing Module (PrePM), Segmentation Net (SegNet) and Post-Processing Module (PostPM). In the PrePM the input frame is pre-processed by applying standard functions such as grayscale conversion, resizing, zero centering and standardization. The first core part of the pipeline is SegNet, which includes a pre-trained backbone CNN model in combination with a segmentation network head. The modules Connected Component Labeling Analysis (CCL), Gaussian Mixture Model (GMM), Reel Type Classifier (RTC) and Crop Window Calculator (CWC) together form the PostPM, which is the second core part of our pipeline. Figure 4 illustrates a schematic overview of the pipeline used in this investigation during the inference phase.
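A minimal sketch of the PrePM stage, assuming a luma-weighted grayscale conversion, nearest-neighbour resizing and placeholder normalization statistics (the paper does not specify the exact values used):

```python
import numpy as np

def preprocess(frame, size=(512, 512), mean=0.5, std=0.25):
    """PrePM sketch: grayscale conversion, resize, zero-centering and
    standardization. `mean`/`std` are placeholder statistics."""
    if frame.ndim == 3:
        # grayscale via ITU-R BT.601 luma weights
        frame = frame @ np.array([0.299, 0.587, 0.114])
    # nearest-neighbour resize (a real pipeline would interpolate)
    ys = np.linspace(0, frame.shape[0] - 1, size[0]).astype(int)
    xs = np.linspace(0, frame.shape[1] - 1, size[1]).astype(int)
    frame = frame[np.ix_(ys, xs)].astype(np.float32) / 255.0
    return (frame - mean) / std   # zero-center and standardize
```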
The SegNet module generates a binary mask which represents the predicted SH areas. These masks are post-processed in the next stage. The CCL process creates labeled connected components. Furthermore, a filtering process is applied in order to remove outliers such as small falsely predicted areas, which depend on the frame composition. Since precise segmentation masks are needed to calculate the final crop window (CWC), the GMM module is introduced in this pipeline. This module is based on the corresponding histogram features of each predicted hole sub-area. These features are used to predict the optimal threshold Th between background and hole pixels. The output is a precisely thresholded SH in the input frame. Figure 5a illustrates the GMM thresholding process and the effect of combining SegNet and GMM. In order to calculate the final crop window (CWC), the inner points of the holes are extracted (see Fig. 5b). By using these points (e.g. SH-16: 4 inner points, SH-9.5: 2 inner points), the center point related to the positions of the SHs can be calculated. Finally, the frame crop window is generated by specifying a scale factor (e.g. SH-16 mm: 1.37 and SH-9.5 mm: 1.3) and maximizing it to the borders given through the inner points (see Fig. 5b). The RTC is based on the fine-grained results of the GMM module as well as the calculated hole positions in the frame by using a 3x3 helper grid (see Fig. 5c). This information is used to get a precise final classification result. For example, the fields 1, 3, 7 and 9 are related to SH-16 whereas fields 2 and 8 correspond to SH-9.5.
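The GMM thresholding step can be illustrated with a hand-rolled two-component EM fit on the intensity values of a predicted hole sub-area. Taking the midpoint of the two component means is a simplification of the exact decision boundary, and the function name and quartile initialization are our assumptions:

```python
import numpy as np

def gmm_threshold(pixels, iters=50):
    """Fit a 1-D two-component Gaussian mixture (via EM) to the intensities
    of a predicted hole sub-area and return a background/hole threshold Th."""
    x = np.asarray(pixels, dtype=np.float64).ravel()
    # initialize the two components at the lower/upper intensity quartiles
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: per-pixel responsibilities under each Gaussian
        d = (x[:, None] - mu) ** 2
        p = pi * np.exp(-d / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = p / (p.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: update mixture weights, means and variances
        n = r.sum(axis=0) + 1e-12
        pi = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    # simplified threshold: midpoint between the two component means
    return float(mu.mean())
```

Pixels below the returned threshold are treated as background, pixels above as hole (or vice versa, depending on scanner polarity).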
4 Experimental Results
The evaluation focus of this investigation is on precise segmentation of the detected holes as well as on runtime performance during the inference phase. Therefore, two different experiments are defined; the evaluation metrics used are described as follows:
Metrics: The final segmentation results are evaluated by calculating metrics such as Mean Intersection over Union (mIoU - Jaccard index) and Dice coefficient (F1 score). The model performance is assessed by interpreting training and validation loss curves over a training period of at most 50 epochs. The exploration of the runtime performance is done by analyzing the inference time (seconds per processed frame) on GPU as well as CPU. Finally, all generated results are compared with state-of-the-art solutions. Experiment 1: The first experiment focuses on evaluating the segmentation performance of the backbone networks MobileNetV2, SqueezeNet, VGG and Resnet101 in combination with the heads FCN and DeepLab. Since the objective is to mask SHs, which are represented by significant edges and perceptual shapes in the given frames, we train the SegNet models in two different variants. In the first variant, low-level features (LF) of the networks are used as input for the segmentation heads, whereas the second variant uses the last feature layers (high-level features, HF) of each individual backbone network. The expectation is that the outputs of low-level layers include more significant features for detecting the holes than the feature vectors of deeper layers, which represent more abstract characteristics of an image. Furthermore, this exploration involves different training strategies such as learning rate reduction, early stopping and custom data augmentation (e.g. random horizontal and vertical flips).
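The two segmentation metrics can be computed per mask pair as follows; this is a straightforward sketch in which the binarization threshold and epsilon handling are our choices:

```python
import numpy as np

def iou_and_dice(pred, target, threshold=0.5, eps=1e-7):
    """Compute IoU (Jaccard index) and Dice coefficient (F1 score) for one
    mask pair. `pred` holds per-pixel probabilities binarized at `threshold`."""
    p = np.asarray(pred) >= threshold
    t = np.asarray(target) >= threshold
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    iou = (inter + eps) / (union + eps)
    dice = (2 * inter + eps) / (p.sum() + t.sum() + eps)
    return float(iou), float(dice)
```

mIoU is then the mean of the per-frame IoU values over the test set; note the identity Dice = 2·IoU/(1+IoU) for a single mask pair.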
The HF features represent the last layer of the feature extractor part of all backbone networks used in this evaluation, for example the features of \(layer30-MaxPool2d\) in VGG16 or \(layer7-bottleneck2\) in Resnet101. The outputs of \(layer4-InvertedResidual\) in MobileNetV2, \(layer5-MaxPool2d\) in SqueezeNet, \(layer4-bottleneck2\) in ResNet and \(layer4-MaxPool2d\) in VGG16 are used as LF (referring to the PyTorch implementations). The results of this experiment demonstrate that using LF yields significantly better segmentation results in all model variants than using the last feature layers of the evaluated architectures. Resnet101 (LF) in combination with the FCN segmentation head achieves the best results with a mIoU of 0.9407 at a threshold of 0.50 and a Dice score of 0.9684 (@0.5). Furthermore, VGG16 (LF) in combination with the DeepLab head reaches a mIoU of 0.9371 (@0.5) and a Dice score of 0.9660 (@0.5). The runtime-optimized and compressed model MobileNetV2+DeepLab reaches scores of \(mIoU@0.5=0.9096\) and \(Dice@0.5=0.9504\). All explored combinations demonstrate an averaged significant performance increase of \(\varDelta mIoU@0.8=0.23\) (23%). A further evaluation compares the state-of-the-art (SOTA) solutions DeepLabV3 and FCN with the SegNet models trained on our self-generated synthetic dataset. The SOTA models are pre-trained on Pascal VOC (21 classes) and adapted to distinguish only between SH and background pixels by retraining the last convolutional layer of the networks. The model variants newly trained with our introduced dataset outperform the SOTA solutions. The effectiveness of the networks trained with our dataset is also demonstrated by the comparison with the histogram-based fully unsupervised solution [11]. The combination Resnet101+FCN+LF displays a performance increase of \(\varDelta mIoU@0.5=0.12\) (12%). An overview of all segmentation results is summarized in Table 1.
The validation loss histories of the experiments using LF as well as HF are visualized in Fig. 6 (left and middle). The right plot in Fig. 6 demonstrates the validation Dice coefficients over the full training time of 50 epochs. Experiment 2: Since digitized analog films vary significantly in terms of resolution and film length, this experiment focuses on the runtime performance of our pipeline by analyzing compressed architectures such as MobileNetV2 or SqueezeNet compared to very deep and large networks like VGG16 and Resnet101. For this experiment the model variants ResNet101+FCN+LF, VGG16+DL+LF, SqueezeNet+FCN+LF and MobileNetV2+DeepLab+LF are selected, since they show the best mIoU and Dice coefficient scores with respect to the segmentation head used (see Table 1). In order to extract the exact frame window and remove only overscan information, precise segmentation of the holes is needed. Therefore, this experiment also explores the effect of the additionally introduced GMM module in the PostPM stage. The performance evaluation demonstrates that MobileNetV2+DeepLab+LF and SqueezeNet+FCN+LF outperform the other model combinations with respect to CPU runtime during inference. One frame needs a processing time of about 100 ms (CPU), and both models are able to process about 10 frames per second. However, the Resnet101+FCN+GMM+LF combination reaches similar runtime scores as well as the best segmentation results. Table 2 summarizes the results of the performance experiments. The second part of Experiment 2 explores the segmentation results obtained by using the additional GMM module in the PostPM stage. Table 3 gives an overview of the reached results and demonstrates the effect of using the GMM module. The variant MobileNetV2+DeepLab+LF displays an increase of the mIoU score (@0.5) of 2.5%. Table 3 points out that the holes are segmented significantly better by using the additional GMM module.
However, in terms of runtime performance as well as segmentation results, the combination Resnet101+FCN+GMM+LF demonstrates the overall best results in this exploration and outperforms state-of-the-art solutions [11] by \(\varDelta mIoU@0.5=12.76\%\) and \(\varDelta Dice@0.5=8\%\). Figure 7 shows the influence of this module on two example frames. The first image of both examples shows two SH crops of the original image. In the next column the predicted masks without the GMM module are visualized and show that the holes are not segmented precisely. The final post-processed masks can be seen in the last column. Since the GMM is based on histogram features, the pipeline is able to generate an accurate mask which is used for calculating the final crop window in the last stage of our approach.
5 Conclusion
This investigation proposes a novel Overscan Detection pipeline in order to get fine-grained masks showing SHs in real-world historical footage. The approach is based on a supervised segmentation network for pre-masking the holes, which is trained on a self-generated dataset including synthetically created SHs, as well as an unsupervised GMM module which is based on histogram features. We conclude that using low-level features of the backbone networks outperforms using high-level features by an averaged \(\varDelta mIoU@0.8=23\%\). Furthermore, this investigation visualizes the effectiveness of using a GMM in combination with a segmentation network and demonstrates a performance increase of \(\varDelta mIoU@0.5=1.72\%\) and \(\varDelta Dice@0.5=1\%\). Moreover, the best model combination Resnet101+FCN+LF+GMM indicates a segmentation performance increase of \(\varDelta mIoU@0.5=12.7\%\) and \(\varDelta Dice@0.5=8\%\) compared to the state of the art. Finally, we propose a fundamental base for further research by providing full access to our sources as well as the generated dataset.
References
[1] Bhargav, S., Van Noord, N., Kamps, J.: Deep learning as a tool for early cinema analysis. In: SUMAC 2019 - Proceedings of the 1st Workshop on Structuring and Understanding of Multimedia heritAge Contents, co-located with MM 2019, pp. 61–68 (2019). https://doi.org/10.1145/3347317.3357240
[2] Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv e-prints arXiv:1706.05587, June 2017
[3] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 833–851. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_49
[4] Ephemeral films project (2015). http://efilms.ushmm.org. Accessed 20 Apr 2020
[5] Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal visual object classes challenge: a retrospective. Int. J. Comput. Vision 111(1), 98–136 (2014). https://doi.org/10.1007/s11263-014-0733-5
[6] Flückiger, B., Pfluger, D., Trumpy, G., Aydin, T., Smolic, A.: Film material-scanner interaction. Technical report, University of Zurich, Zurich, February 2018. https://doi.org/10.5167/uzh-151114
[7] Fossati, G., van den Oever, A.: Exposing the Film Apparatus. Amsterdam University Press (2016)
[8] He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision 2017-October, pp. 2980–2988 (2017). https://doi.org/10.1109/ICCV.2017.322
[9] Helm, D., Kampel, M.: Shot boundary detection for automatic video analysis of historical films. In: Cristani, M., Prati, A., Lanz, O., Messelodi, S., Sebe, N. (eds.) ICIAP 2019. LNCS, vol. 11808, pp. 137–147. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30754-7_14
[10] Helm, D., Kampel, M.: Video shot analysis for digital curation and preservation of historical films. In: Rizvic, S., Rodriguez Echavarria, K. (eds.) Eurographics Workshop on Graphics and Cultural Heritage. The Eurographics Association (2019). https://doi.org/10.2312/gch.20191344
[11] Helm, D., Pointner, B., Kampel, M.: Frame border detection for digitized historical footage. In: Roth, P.M., Steinbauer, G., Fraundorfer, F., Brandstötter, M., Perko, R. (eds.) Proceedings of the Joint Austrian Computer Vision and Robotics Workshop 2020, pp. 114–115. Verlag der Technischen Universität Graz (2020). https://doi.org/10.3217/978-3-85125-752-6-26
[12] Iizuka, S., Simo-Serra, E.: DeepRemaster: temporal source-reference attention networks for comprehensive video enhancement. ACM Trans. Graph. (Proc. SIGGRAPH Asia 2019) 38(6), 1–13 (2019)
[13] Kalti, K., Mahjoub, M.: Image segmentation by gaussian mixture models and modified FCM algorithm. Int. Arab J. Inf. Technol. 11(1), 11–18 (2014)
[14] Laradji, I.H., Vazquez, D., Schmidt, M.: Where are the masks: instance segmentation with image-level supervision. arXiv preprint arXiv:1907.01430 (2019)
[15] Liang, Z., Guan, Y.S., Rojas, J.: Visual-semantic graph attention network for human-object interaction detection. arXiv abs/2001.02302 (2020)
[16] Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
[17] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440 (2015). https://doi.org/10.1109/CVPR.2015.7298965
[18] Pisters, P.: Filming for the Future: The Work of Louis van Gasteren. Amsterdam University Press (2017)
[19] Poudel, R.P.K., Liwicki, S., Cipolla, R.: Fast-SCNN: fast semantic segmentation network. arXiv e-prints arXiv:1902.04502 (2019)
[20] Savardi, M., Signoroni, A., Migliorati, P., Benini, S.: Shot scale analysis in movies by convolutional neural networks. In: Proceedings - International Conference on Image Processing, ICIP, pp. 2620–2624 (2018). https://doi.org/10.1109/ICIP.2018.8451474
[21] Svanera, M., Savardi, M., Signoroni, A., Kovács, A.B., Benini, S.: Who is the director of this movie? Automatic style recognition based on shot features. CoRR abs/1807.09560, pp. 1–13 (2018). http://arxiv.org/abs/1807.09560
[22] Szeliski, R.: Segmentation. In: Szeliski, R. (ed.) Computer Vision, pp. 235–271. Springer, London (2011). https://doi.org/10.1007/978-1-84882-935-0_5
[23] Yeh, R.A., Lim, T.Y., Chen, C., Schwing, A.G., Hasegawa-Johnson, M., Do, M.N.: Image restoration with deep generative models. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6772–6776 (2018)
[24] Zaharieva, M., Mitrović, D., Zeppelzauer, M., Breiteneder, C.: Film analysis of archived documentaries. IEEE Multimedia 18(2), 38–47 (2011). https://doi.org/10.1109/MMUL.2010.67
[25] Zeppelzauer, M., Mitrović, D., Breiteneder, C.: Archive film material - a novel challenge for automated film analysis. Frames Cinema J. 1(1) (2012). https://www.ims.tuwien.ac.at/publications/tuw-216640
[26] Zhang, L., Li, X., Arnab, A., Yang, K., Tong, Y., Torr, P.H.S.: Dual graph convolutional network for semantic segmentation. In: BMVC (2019)
[27] Zhao, L., Zheng, S., Yang, W., Wei, H., Huang, X.: An image thresholding approach based on gaussian mixture model. Pattern Anal. Appl. 22(1), 75–88 (2019). https://doi.org/10.1007/s10044-018-00769-w
Acknowledgment
Visual History of the Holocaust: Rethinking Curation in the Digital Age (https://www.vhh-project.eu - last visited: 2020/09/30). This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Grant Agreement 822670.
© 2020 Springer Nature Switzerland AG

Helm, D., Kampel, M. (2020). Overscan Detection in Digitized Analog Films by Precise Sprocket Hole Segmentation. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2020. Lecture Notes in Computer Science, vol 12509. Springer, Cham. https://doi.org/10.1007/978-3-030-64556-4_12

Print ISBN: 978-3-030-64555-7. Online ISBN: 978-3-030-64556-4.