Abstract
Image stitching is one key enabling component of recent immersive VR technology, and the quality of the stitched images greatly affects VR experiences. Evaluation of stitched panoramic images with existing assessment tools is insufficient for two reasons. First, conventional image quality assessment (IQA) metrics are mostly full-reference, while a panorama reference is hard to obtain. Second, existing IQA metrics are not designed to detect and evaluate the errors typical of stitched images. In this paper, we design an IQA metric for stitched images, where ghosting and shape inconsistency are the most common visual distortions. Specifically, we first locate the error with a fine-tuned convolutional neural network (CNN), and then refine the locations using an error-activation mapping generated from the network. Each located error is characterized by both its size and distortion level. Extensive experiments and comparisons confirm the effectiveness of our metric, and indicate the network's remarkable ability to detect error patterns.
1 Introduction
The recent rapid development of virtual reality (VR) technologies has led to a new 360-degree look-around visual experience. By displaying stereoscopic 360° scenes in head-mounted rigs such as the Oculus Rift, users can perceive an immersive sensation of reality. Image stitching is typically used to construct a seamless 360° view from multiple captured viewpoint images, and thus the quality of the stitched scene is crucial in determining the level of immersive experience provided. The widely adopted stitching process [5, 22, 23] can be broadly divided into the following steps: (i) register the capturing cameras and project each captured scene accordingly, (ii) merge overlapping spatial regions based on the corresponding camera parameters, and (iii) smooth/blend over the merged scene. Although errors may be introduced at each step, noticeable distortions usually arise at the misaligned overlapping regions around scene objects, which we call shape breakage. The subsequent blending step alleviates the breakage by imposing a consistency constraint over the entire scene [7, 16]. The misaligned scene objects, after being blended, are exposed as ghosting or object fragments [15, 19]. Because of the uniqueness of the two most common error types in image stitching, shape breakage (including object fragments) and ghosting, stitched image quality assessment (SIQA) is fundamentally different from conventional IQA for images distorted by compression artifacts or network packet losses [17, 18, 24]. Specifically, conventional IQA methods focus on the global influence of various noise types on visual comfort or information integrity, whereas SIQA targets local distortions that damage object or scene integrity. Further, the most common distortions in conventional IQA come from compression losses, which do not apply to SIQA tasks.
Fig. 1. Comparison of example datasets used for conventional IQA and SIQA experiments. Conventional IQA samples are distorted evenly over the image, while SIQA distortions appear in local patches. Patches b and c contain ghosting errors; patches a and d are undistorted patches of high image quality.
Hence, assessing stitched image quality can be understood as searching for stitching errors over the composed scene, i.e., locating and assessing particular error types rather than evaluating every local spatial region overall. It is therefore necessary to study SIQA as a new problem apart from conventional IQA. Figure 1 compares typical samples used for IQA and SIQA. Most parts of the stitched image have approximately reference quality, as in IQA tasks, while the noisy IQA samples do not exhibit the prominent local shape distortions found in SIQA samples; the two groups of samples are hardly comparable.
In this paper, we propose to assess stitched image quality with an error-localization-quantification algorithm. First, we detect potential error regions by searching the entire stitched image in local patches of a unified size; each patch is classified as "intact" or "distorted" by a classifier trained with a convolutional neural network (CNN) [21]. Then the detected regions are refined according to the extent of error: within each potential region, finer pixel patches are retained or removed depending on their contribution towards the region being tagged as distorted. Finally, after obtaining refined regions that tightly bound the distortions, a quantified metric is computed over the refined patches, assessing both the error range and its extent.
Contributions: Our contributions are twofold. First, we propose a new algorithm for the SIQA task. The proposed error-localization-quantification metric is simple, straightforward and requires no reference images. Further, our method outputs the explicit locations of error, which is far more useful for optimizing stitching algorithms than an evaluation score alone. Second, the successful localization of multiple error types in our pipeline demonstrates that a CNN has a remarkable ability to detect spatial patterns beyond scene object detection. This observation suggests the possibility of generic classification, localization and concept discovery.
The paper is organized as follows. Section 2 discusses previous related works in SIQA. Section 3 introduces our proposed method. Experimentation is presented in Sect. 4, and Sect. 5 draws the conclusion.
2 Related Work
This paper has two lines of related work: previous SIQA methods, and deep features for discriminative localization.
Previous SIQA methods: In contrast with the rapid emergence of panoramic techniques, work on evaluating stitched panoramic image quality remains insufficient and slow in development. Most previous SIQA metrics pay more attention to photometric error assessment [12, 13, 20] than to errors caused by misalignment. In [12] and [20], misalignment error types are omitted and the metrics focus on color correction and intensity consistency, which are low-level representations of the overall distortion level. [13] tries to quantify the error by computing the structural similarity index (SSIM) of the high-frequency information of the stitched and unstitched image difference in the overlapping region; however, since the unstitched images used for testing are directly cropped from the reference, the effectiveness of the method is not validated. The work in [10] focuses on assessing video consistency among subsequent frames and only adopts a luminance-based metric around the seam. In [14], the gradient of the intensity difference between the stitched and reference image is adopted to assess the geometric error; however, the experiments are conducted on merely 6 stitched scenes and references, which is insufficient to validate a designed metric. We observe that most previous SIQA metrics require a full reference [6], which is difficult to obtain in panorama-related applications. Moreover, hardly any SIQA method directly indicates where the distortion is, which limits the metric's guidance for stitching algorithms.
In contrast, our assessment is built on an error detection algorithm, which directly indicates the location of error and naturally requires no reference; the method is described in the next section.
Deep feature-based discriminative localization: Convolutional neural networks (CNNs) have achieved impressive performance on a variety of visual recognition tasks [8, 9, 26]. Much recent work shows their remarkable ability to localize objects, and their potential to be transferred to other generic classification, localization and concept discovery tasks [2, 25]. Most of the related work is based on weakly supervised object localization. In [3], the regions that cause the maximal activations are masked out with a self-taught object localization technique. In [11], a method is proposed for transferring mid-level image representations, and object localization is achieved by evaluating the CNN output on overlapping patches. [25] uses the class activation map, a weighted activation map generated for each image. In [2], a method for hierarchical object detection is proposed under the guidance of a deep reinforcement learning agent.
While global average pooling is not a novel technique that we propose here, the observation that it can be applied to localizing a non-physical spatial pattern, namely stitching error, and its use for image-quality-related problems are, to the best of our knowledge, unique to our work. We believe the effectiveness and simplicity of the proposed method will make it generic for other IQA tasks.
Fig. 2. The coarse error localization pipeline. A ResNet model is truncated and followed by a flatten layer and a softmax layer of 2 classes. The trained classifier is applied in a top-down search that categorizes local patches as distorted or intact. The detected patches are labeled with red-shadowed bounding boxes.
3 Proposed Method
The proposed method assesses the quality of any stitched image and locates its distorted regions. It consists of three steps: coarse error localization, error-activation-guided refinement, and error quantification.
3.1 Coarse Error Localization
There are two common error types in a stitched scene: ghosting and shape breakage. We employ a ResNet model, a state-of-the-art architecture, to obtain a two-class classifier between "intact" and "distorted". The fine-tuned model is later utilized for error localization refinement. Even though a single patch may hold both error types at the same time, the detection of each error type is done separately for later assessment. As shown in Fig. 2, we feed the model with labeled bounding boxes containing errors as "distorted" examples, and perfectly aligned areas as "intact" examples. With ResNet, we achieve a remarkable classification accuracy. With the classifier, we coarsely localize the errors throughout the stitched image.
To preserve potentially continuous distortion regions while maintaining the fineness of the search, we make a trade-off between the window size and the sliding step size. In a complex scene composed of multiple objects, object volume has a prominent effect on visual saliency [1, 4]. We assume this also applies to texture patterns like shape breakage or ghosting, so the integrity of a distorted region must be preserved. To this end, we merge adjacent patches with the same tag, as illustrated in Fig. 2. The merged patches form the coarse error localization.
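As a concrete illustration of this search-and-merge step, the following Python sketch slides a window over the image and unions all patches tagged as distorted into a single mask; the function name `coarse_localize` and the `classify` callable are our own placeholders for the fine-tuned classifier, not code from the original implementation.

```python
import numpy as np

def coarse_localize(image, classify, win=(400, 400), step=100):
    """Slide a win-sized window over `image` with the given step, tag each
    patch with the two-class CNN, and return a boolean mask that is the
    union of all patches tagged "distorted". Connected components of this
    mask correspond to the merged coarse error regions."""
    h, w = image.shape[:2]
    wh, ww = win
    mask = np.zeros((h, w), dtype=bool)
    for y in range(0, h - wh + 1, step):
        for x in range(0, w - ww + 1, step):
            patch = image[y:y + wh, x:x + ww]
            if classify(patch) == "distorted":
                mask[y:y + wh, x:x + ww] = True  # adjacent hits merge automatically
    return mask
```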
3.2 Error-Activation Guided Refinement
After obtaining the coarse error regions, a refined localization that more precisely describes the range of error is required for accurate error description. We find the class activation mapping considerably discriminative in describing image regions with errors; as a result, we trim the coarse regions with error-activation-guided refinement. The network fine-tuned for coarse error detection, the ResNet architecture, largely consists of convolutional layers. Similar to [25], we project the weights of the output layer back onto the convolutional feature maps, thereby obtaining the importance of each pixel patch in activating a region to be categorized as containing error or no error. We call this process error-activation mapping.
The error-activation mapping is obtained by computing the weighted sum of the feature maps of the last convolutional layer. For a stitched image with error type T, the error-activation mapping E at spatial location (x, y) is computed by Eq. 1:

\(E_{T}(x,y)=\sum _{i}\omega ^{T}_{i}f_{i}(x,y)\)   (1)
where \(f_{i}(x,y)\) is the activation of unit i in the last convolutional layer at (x, y), and \(\omega ^{T}_{i}\) indicates the importance of the global average pooling result of unit i for error type T. The score of an image being diagnosed with error type T can then be presented as Eq. 2:

\(S_{T}=\sum _{x,y}\sum _{i}\omega ^{T}_{i}f_{i}(x,y)=\sum _{x,y}E_{T}(x,y)\)   (2)
Hence the error-activation mapping \(E_{T}(x,y)\) directly represents the importance of the activation at (x, y) in leading to the image being diagnosed with error type T. The obtained error-activation mapping serves as guidance for error localization refinement.
For each coarsely localized region, we apply the error-activation mapping as a filter. The threshold is adaptive according to how rigid the filter should be; here we adopt the global average of the mapping. Despite its simplicity, the refinement process integrates global activation information into the locally categorized patches, which naturally protects the overall integrity of distorted regions. The entire refinement process is demonstrated in Fig. 3.
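A minimal sketch of the error-activation mapping and the global-average filter, assuming the last-layer feature maps (H' x W' x C) and the output-layer weights for error type T (length C) have already been extracted from the fine-tuned network; the nearest-neighbour upsampling to image resolution is our assumption, not a detail given in the paper:

```python
import numpy as np

def error_activation_map(feature_maps, weights, out_hw):
    """E_T(x, y) = sum_i w_i^T * f_i(x, y), computed on the last
    convolutional feature maps and upsampled to the image resolution."""
    cam = np.tensordot(feature_maps, weights, axes=([2], [0]))  # H' x W'
    ys = np.arange(out_hw[0]) * cam.shape[0] // out_hw[0]       # nearest-neighbour
    xs = np.arange(out_hw[1]) * cam.shape[1] // out_hw[1]       # index upsampling
    return cam[np.ix_(ys, xs)]

def refine(coarse_mask, cam):
    """Keep only the pixels of a coarse region whose error activation
    exceeds the adaptive threshold (the global average of the map)."""
    return coarse_mask & (cam > cam.mean())
```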
3.3 Error Quantification
To quantify the error and form a unified metric, we combine a twofold evaluation: the error range and the distortion level. The range is represented by the area of the refined location, while the distortion level is represented by the error-activation-mapping weights.
The range index \(M_r^j\) of a refined error location j is formulated as:

\(M_{r}^{j}=\dfrac{A^{j}}{A}\)   (3)
where \(A^j\) is the area of the refined error location j and A is the total area of the image. The distortion level \(M_d^j\) of a refined error location j is represented by the sum of the error-activation mapping within it:

\(M_{d}^{j}=\sum _{(x,y)\in j}E_{T}(x,y)\)   (4)
The quantification of error for location j is represented as:

\(M^{j}=\left( M_{r}^{j}\right) ^{\alpha _{j}}\left( M_{d}^{j}\right) ^{\beta _{j}}\)   (5)
where the exponents \(\alpha _j\) and \(\beta _j\) adjust the relative importance of the range and the distortion level. Finally, the quantification of error for an entire stitched image, M, is formulated as Eq. 6:

\(M=\sum _{j}M^{j}\)   (6)
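Under our reading of Eqs. 3-6 above (a ratio for the range index, a product of powers for the per-location score, and a sum over locations for the image score), the quantification step can be sketched as follows; the exact combination used by the authors may differ:

```python
import numpy as np

def quantify(cam, refined_masks, alpha=1.0, beta=10.0):
    """Score each refined error location j by combining its range index
    M_r^j = A^j / A with its distortion level M_d^j (the sum of the
    error-activation map inside j), then sum the per-location scores."""
    total_area = cam.size
    score = 0.0
    for mask in refined_masks:
        m_r = mask.sum() / total_area             # range index, Eq. 3
        m_d = cam[mask].sum()                     # distortion level, Eq. 4
        score += (m_r ** alpha) * (m_d ** beta)   # per-location error, Eq. 5
    return score                                  # image-level score, Eq. 6
```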
4 Experimentation
Experiment data: All experiments are conducted on our stitched image quality assessment benchmark, the SIQA dataset, which is based on synthetic virtual scenes, since we aim to evaluate the proposed metric over various stitching algorithms under ideal photometric conditions. The images are obtained by establishing virtual scenes with the powerful 3D modeling tool Unreal Engine, as illustrated in Fig. 4. A synthesized 12-head panoramic camera is placed at multiple locations in each scene, covering the \(360^\circ \) surrounding view, and each camera has a field of view (FOV) of \(90^\circ \). Exactly one image is taken by each of the 12 cameras at a location, simultaneously. The SIQA dataset utilizes twelve different 3D scenes, varying from wild landscapes to structured scenes; stitched images are obtained with the popular off-the-shelf stitching tool Nuke, yielding 408 stitched scenes in total. The original images are high-definition, 3k-by-2k in size.
We label the two error types manually in each scene; a scene might contain multiple regions with one or both error types, or no distortion at all. In total, 297 bounding boxes are labeled for ghosting and 220 for shape breakage.
Coarse error localization: The model fine-tuned to categorize "intact" and "distorted" patches is the ResNet-50 architecture with a TensorFlow backend. We truncate the network after the bn5c_branch2c layer and append a flatten layer and a softmax layer of 2 classes. We train with \(epoch=50\) and \(batchsize=16\); the model is fine-tuned separately for the two error types, and the classifier achieves a remarkable accuracy of \(95.5\%\) for shape breakage and \(96.5\%\) for ghosting, as illustrated in Table 1.
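A sketch of how such a classifier could be assembled in Keras with a TensorFlow backend; the layer name "bn5c_branch2c" follows the paper and matches the pre-TF2 Keras ResNet50 naming (newer versions name the same layer differently), and the input size and optimizer are our assumptions:

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3))          # input size assumed
trunk = base.get_layer("bn5c_branch2c").output      # truncate after bn5c_branch2c
head = Dense(2, activation="softmax")(Flatten()(trunk))
model = Model(inputs=base.input, outputs=head)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# One model per error type, fine-tuned on the labeled patches:
# model.fit(patches, labels, epochs=50, batch_size=16)
```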
The test results of the fine-tuned classifiers are given in Table 1. With this ability to distinguish distorted from undistorted regions, we impose a top-down search for distorted patches through the entire image. Considering object size relative to image size, we choose a \(400 \times 400\) window for ghosting and two window shapes, \(200 \times 800\) and \(800 \times 200\), for shape breakage. The differentiated window shapes are chosen according to the following analysis. To tell whether a region is ghosted, one must refer to the nearest object to decide where the duplicated artifact comes from, and such context mostly fits in square patches. To see whether shape breakage exists, one must refer to the adjacent edge or silhouette to examine whether the shape integrity is damaged; in this case we design both vertical and horizontal window shapes to allow breakage detection. As mentioned earlier, we choose a small sliding step to protect region continuity; here we use \(stepsize = 100\) for both error types. We then merge adjacent patches with the same error type, obtaining the coarse localization of errors for the entire scene. As Fig. 5 shows, the integrity of continuous distorted regions is essentially preserved.
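For illustration, the coarse search sketched in Sect. 3.1 would be invoked with these settings roughly as follows, where `img`, `classify_ghosting` and `classify_breakage` are hypothetical inputs (the image array and one wrapped classifier per error type):

```python
# One pass per error type, reusing the coarse_localize sketch from Sect. 3.1.
ghost_mask = coarse_localize(img, classify_ghosting, win=(400, 400), step=100)
break_mask = (coarse_localize(img, classify_breakage, win=(200, 800), step=100) |
              coarse_localize(img, classify_breakage, win=(800, 200), step=100))
```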
Error-activation guided refinement: By projecting the weights of the output layer back onto the convolutional feature maps, we obtain the error-activation mapping for the image, as demonstrated in Fig. 6. The discriminative regions of the image for each error type are highlighted. We also observe that the discriminative regions for different error types differ for a given image, which suggests that the error-activation guidance works as expected.
We apply the error-activation guidance as a filter whose input is the coarsely localized regions. The results are quite impressive, as Fig. 7 shows. The regions containing errors are prominently refined, which explicitly describes the distorted regions. Based on the properly refined regions, the subsequent quantification can reliably characterize each error type.
Error quantification: We compute the quantified error for each error location and then for the entire image, according to Eqs. 5 and 6 introduced in the last section. The relative-importance parameters are chosen as \(\alpha =1\) and \(\beta =10\). To illustrate the objectiveness of the metric, we make extensive comparisons among the error patches. As demonstrated in Fig. 7, we compare local errors of similar size with different levels of distortion, and errors with similar scores. Locations a and g are both from structured scenes and of similar size; however, the shape of the television in g is much more polluted than in a, so it obtains a relatively higher error score. Similarly, location f has an extensive error range but relatively slight distortion, so its quantified score is reduced by the distortion level. We also compare the results of various stitched scenes. An interesting observation is that in natural scenes with less structured context, the metric is still capable of locating distortions that are much less noticeable to human vision. This phenomenon reveals the error-localization ability of our method.
5 Conclusion
In this paper we propose an error-activation-guided metric for stitched panoramic image quality assessment that requires no reference at all. Our method not only provides a proper evaluation of stitched image quality, but also directly indicates the explicit locations of error. The method consists of three main steps: coarse error localization, error-activation-guided refinement and error quantification. Results reveal the error-localization ability of the proposed method, and the extensive comparisons also suggest the effectiveness of our metric and its ability to distinguish minor distortion levels in detail.
References
Achanta, R., Hemami, S., Estrada, F., Susstrunk, S.: Frequency-tuned salient region detection. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 1597–1604. IEEE (2009)
Bellver, M., Giró-i Nieto, X., Marqués, F., Torres, J.: Hierarchical object detection with deep reinforcement learning. arXiv preprint arXiv:1611.03718 (2016)
Bergamo, A., Bazzani, L., Anguelov, D., Torresani, L.: Self-taught object localization with deep networks. arXiv preprint arXiv:1409.3964 (2014)
Borji, A., Cheng, M.M., Jiang, H., Li, J.: Salient object detection: a benchmark. IEEE Trans. Image Process. 24(12), 5706–5722 (2015)
Brown, M., Lowe, D.G.: Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis. 74(1), 59–73 (2007)
Chen, M.J., Su, C.C., Kwon, D.K., Cormack, L.K., Bovik, A.C.: Full-reference quality assessment of stereopairs accounting for rivalry. Sig. Process. Image Commun. 28(9), 1143–1155 (2013)
Dessein, A., Smith, W.A., Wilson, R.C., Hancock, E.R.: Seamless texture stitching on a 3D mesh by Poisson blending in patches. In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 2031–2035. IEEE (2014)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Leorin, S., Lucchese, L., Cutler, R.G.: Quality assessment of panorama video for videoconferencing applications. In: 2005 IEEE 7th Workshop on Multimedia Signal Processing, pp. 1–4. IEEE (2005)
Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724 (2014)
Paalanen, P., Kämäräinen, J.-K., Kälviäinen, H.: Image based quantitative mosaic evaluation with artificial video. In: Salberg, A.-B., Hardeberg, J.Y., Jenssen, R. (eds.) SCIA 2009. LNCS, vol. 5575, pp. 470–479. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02230-2_48
Qureshi, H., Khan, M., Hafiz, R., Cho, Y., Cha, J.: Quantitative quality assessment of stitched panoramic images. IET Image Process. 6(9), 1348–1358 (2012)
Solh, M., AlRegib, G.: MIQM: a novel multi-view images quality measure. In: 2009 International Workshop on Quality of Multimedia Experience, QoMEx 2009, pp. 186–191. IEEE (2009)
Szeliski, R.: Image alignment and stitching: a tutorial. Found. Trends® Comput. Graph. Vis. 2(1), 1–104 (2006)
Szeliski, R., Uyttendaele, M., Steedly, D.: Fast Poisson blending using multi-splines. In: 2011 IEEE International Conference on Computational Photography (ICCP), pp. 1–8. IEEE (2011)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2004, vol. 2, pp. 1398–1402. IEEE (2003)
Xiong, Y., Pulli, K.: Fast panorama stitching for high-quality panoramic images on mobile phones. IEEE Trans. Consum. Electron. 56(2), 298–306 (2010)
Xu, W., Mulligan, J.: Performance evaluation of color correction approaches for automatic multi-view image and video stitching. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 263–270. IEEE (2010)
Xu, Z., Yang, Y., Hauptmann, A.G.: A discriminative CNN video representation for event detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1798–1807 (2015)
Zaragoza, J., Chin, T.J., Brown, M.S., Suter, D.: As-projective-as-possible image stitching with moving DLT. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2339–2346 (2013)
Zhang, F., Liu, F.: Parallax-tolerant image stitching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3262–3269 (2014)
Zhang, L., Shen, Y., Li, H.: VSI: a visual saliency-induced index for perceptual image quality assessment. IEEE Trans. Image Process. 23(10), 4270–4281 (2014)
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929 (2016)
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems, pp. 487–495 (2014)
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 61571071), Wenfeng innovation and start-up project of Chongqing University of Posts and Telecommunications (No. WF201404).