Localizing web videos using social images
Introduction
With the explosive growth of social networks, the massive, heterogeneous multimedia data available on the Web provide a unique opportunity to bridge digital corpora and the physical world, which has become a mainstream direction of ongoing multimedia research. To facilitate accurate recommendations by exploiting social media knowledge, web media tagging has also emerged as an important research direction [27], [26], [25], [28], [9].
Beyond the widely studied semantic tagging scenario, annotating the geographical locations of social media has recently emerged as a major trend, spanning novel research topics such as landmark retrieval and recognition, visual tour guides, geography-aware image search, mobile device localization, and virtual reality.
Geographically tagging user-contributed images has been investigated in recent years [16], [13]. However, tagging web videos remains largely unexplored, even though inferring the geo-locations of web videos is vital to location-based video services. The key difficulty of video geo-location inference is twofold. On the device level, current mobile phones and digital cameras typically do not record geo-information when shooting videos; on the data level, few geographic tags are available for web videos, so it may be infeasible to train accurate classifiers or search engines to geographically tag them.
Most previous attempts at web video geo-location inference relied on social network metadata, such as titles, descriptions, and comments [17]. To improve the inference accuracy, extra evidence, e.g., visual and acoustic features, has been incorporated [15]. However, little work has successfully exploited visual content to assist the geo-location inference of web videos, mainly because insufficient training examples make it hard to model landmark appearances effectively. This challenge originates from two aspects: (1) the low quality of web videos, which limits the number of SIFT features that can be extracted and therefore compromises the accuracy of near-duplicate visual content matching; and (2) the difficulty of separating different scenes within the same video, which prevents assigning the correct geographic tags to the corresponding locations or physical scenes. We illustrate the challenge in Fig. 1.
In this paper, we propose to tackle this challenge from a novel transfer learning perspective, i.e., transferring an accurate and easy-to-learn video geo-tagging model from the image domain. Nowadays, there is an increasing amount of geo-tagged images available on the Web. Such massive image data prompts us to “transfer” the geo-tags from web images to web videos.
To transfer knowledge across the image and video domains, we need to address two major issues: first, the tags of web images are usually noisy; second, the visual features of images and videos differ considerably due to their large variations. We address the first issue by proposing a trustworthiness measurement that removes “untrustworthy” web images. Afterwards, we perform view-specific spectral clustering over the images of a given landmark to separate the different “views” of a single location. To build an effective video location inference model, we propose a novel search-based transfer learning algorithm that constructs an AdaBoost [6] classifier for each view; the outputs of the multi-view AdaBoost classifiers are then combined into an overall landmark recognition model. In addition, we incorporate temporal consistency to further improve the inference accuracy, leveraging the fact that temporally adjacent video frames within the same shot are likely to be captured from the same view of a landmark or location.
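The view-diversification and multi-view combination steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the features are synthetic stand-ins for visual descriptors, the two-view setup and the max-over-views combination rule are simplifying assumptions, and scikit-learn's off-the-shelf components are used in place of the paper's own clustering and boosting variants.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for visual features of one landmark's images:
# two "views" (e.g., front vs. side of a building).
view_a = rng.normal(loc=0.0, scale=0.5, size=(40, 8))
view_b = rng.normal(loc=3.0, scale=0.5, size=(40, 8))
images = np.vstack([view_a, view_b])

# Step 1: spectral clustering separates the landmark images into views.
views = SpectralClustering(n_clusters=2, random_state=0).fit_predict(images)

# Step 2: one AdaBoost classifier per view (positives: that view's
# images; negatives: random background samples).
background = rng.normal(loc=-3.0, scale=0.5, size=(40, 8))
classifiers = []
for v in np.unique(views):
    pos = images[views == v]
    X = np.vstack([pos, background])
    y = np.r_[np.ones(len(pos)), np.zeros(len(background))]
    classifiers.append(AdaBoostClassifier(n_estimators=20,
                                          random_state=0).fit(X, y))

def landmark_score(frame_feat):
    """Combine the multi-view classifiers: here, simply the maximum
    per-view confidence that the frame shows this landmark."""
    return max(c.predict_proba(frame_feat[None, :])[0, 1]
               for c in classifiers)

query = rng.normal(loc=3.0, scale=0.5, size=8)  # resembles view_b
print(round(landmark_score(query), 2))
```

Temporal consistency can then be layered on top, e.g., by averaging or majority-voting `landmark_score` over the frames of one shot.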
To verify the proposed video geo-location inference model, we collect more than 50,000 geo-tagged images from Flickr and 2,000 video clips from YouTube. Experimental comparisons on the two datasets show that our method achieves remarkable improvements over various competing methods. Moreover, the proposed method can easily be integrated into applications involving geo-location-based web video browsing.
The rest of this paper is organized as follows: Section 2 briefly introduces the related work. The overall framework of the proposed method is presented in Section 3. Section 4 builds the effective landmark model using social images, which is further transferred to the video domain as described in Section 5. We conduct the experimental validations in Section 6. Finally, we conclude the paper and discuss the future work in Section 7.
Section snippets
Related work
Geography-related image analysis has attracted intensive research attention from both academia and industry. One of the pioneering works is that of Kennedy et al. [16], who analyzed geographic tags (e.g., landmarks) of millions of Flickr images and built an appearance model for each landmark by incorporating both metadata and visual features. On mobile devices, Microsoft Photo2Search [13] implemented a geo-location-sensitive image search system to facilitate visual information driven
Framework
The framework overview of the proposed method is displayed in Fig. 2. In this paper, our main efforts lie in the following two aspects: mining a visual model from Flickr images (the offline image processing component), and effective transfer learning from the Flickr image domain to the YouTube video domain (the transfer learning component).
To learn a reliable geo-location inference model using the Flickr images and their metadata, we first propose an image trustworthy measurement regarding the
Building the geo-model from Flickr images
Given a set of Flickr images associated with a specific geographic tag such as Statue of Liberty or Times Square, in this section we first address the problem of noisy image filtering by proposing an image trustworthiness measurement. Then, the corresponding geo-model is built with a data-driven approach. We evaluate the trustworthiness of every image in the dataset according to its attached metadata in Section 4.2. In Section 4.3, the images are first clustered into landmark-level
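The noisy-image filtering step can be illustrated with a toy trust score. The paper's exact measurement is not reproduced here; the score below is a hypothetical combination of metadata cues (tag/landmark agreement, presence of a geotag, view count), and the `FlickrImage` fields and the threshold are assumptions made for the sketch.

```python
from dataclasses import dataclass
from math import log1p

@dataclass
class FlickrImage:
    tags: list          # user-supplied tags
    n_views: int        # view count from metadata
    has_geotag: bool    # whether the uploader attached coordinates

def trust_score(img: FlickrImage, landmark: str) -> float:
    """Hypothetical trustworthiness: reward images whose tags mention
    the queried landmark, that carry a geotag, and that have been
    viewed by the community (log-damped)."""
    tag_hit = 1.0 if landmark.lower() in (t.lower() for t in img.tags) else 0.0
    return tag_hit + (1.0 if img.has_geotag else 0.0) + 0.1 * log1p(img.n_views)

def filter_untrustworthy(images, landmark, threshold=1.5):
    # Keep only images whose score clears the (assumed) threshold.
    return [i for i in images if trust_score(i, landmark) >= threshold]

imgs = [
    FlickrImage(["statue of liberty", "nyc"], 500, True),
    FlickrImage(["party", "friends"], 3, False),  # noisy image
]
kept = filter_untrustworthy(imgs, "Statue of Liberty")
print(len(kept))  # → 1
```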
Search-based transfer learning
Due to the gap between the cross-domain feature spaces, a model learned from Flickr images cannot be directly applied to YouTube videos: Flickr images and YouTube videos typically differ in resolution and photographic quality. To solve this problem, we propose a search-based transfer learning algorithm in this section.
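The retrieval side of search-based transfer can be sketched as a nearest-neighbor search of a video frame against the geo-tagged image collection, with a majority vote over the retrieved geo-tags. This is a minimal illustration under assumed inputs: the descriptors are random stand-ins, Euclidean distance replaces whatever matching the paper uses, and `k` is an arbitrary choice.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# Geo-tagged Flickr images: synthetic visual descriptors, each
# labelled with a location id (0 or 1 here).
image_feats = np.vstack([rng.normal(0, 0.3, (30, 16)),
                         rng.normal(2, 0.3, (30, 16))])
image_geo = np.r_[np.zeros(30, dtype=int), np.ones(30, dtype=int)]

def infer_location(frame_feat, k=5):
    """Search-based transfer: retrieve the k visually nearest
    geo-tagged images and majority-vote their geo-tags."""
    dists = np.linalg.norm(image_feats - frame_feat, axis=1)
    nearest = image_geo[np.argsort(dists)[:k]]
    return Counter(nearest.tolist()).most_common(1)[0][0]

# A video frame whose features resemble the second location's images.
frame = rng.normal(2, 0.3, 16)
print(infer_location(frame))  # → 1
```

The same vote can be pooled over all frames of a shot, which is one simple way to realize the temporal-consistency idea from the introduction.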
Experimental results
In this section, we first introduce the datasets used and the experimental settings. We then present in detail a group of experiments conducted on the collected datasets. Finally, we demonstrate an application of our approach that aims at placing web videos on a geographic map.
Conclusion and future work
This paper presents the first work endeavoring to infer the geographical locations of web videos. The core technique of our approach is to transfer geo-tagging knowledge directly from web images to web videos, avoiding the training of a video-domain geographical inference model that would suffer from inadequate labeled data. The proposed geo-location inference framework has several key components, including a spectral clustering algorithm to diversify multiple views of web images for
Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grant 61422210 and Grant 61373076, the Fundamental Research Funds for the Central Universities under Grant 2013121026, and Xiamen University 985 Project.
References (33)
- et al., City-scale landmark identification on mobile devices
- et al., Mapping the world’s photos
- L. Duan, I.W.-H. Tsang, D. Xu, S.J. Maybank, Domain transfer SVM for video concept detection, in: IEEE International...
- L. Duan, D. Xu, I.W.-H. Tsang, J. Luo, Visual event recognition in videos by learning from web data, in: IEEE Computer...
- et al., LIBLINEAR: a library for large linear classification, J. Mach. Learn. Res. (2008)
- et al., A decision-theoretic generalization of on-line learning and an application to boosting
- et al., Camera constraint-free view-based 3-D object retrieval, IEEE Trans. Image Process. (2012)
- et al., 3-D object retrieval and recognition with hypergraph analysis, IEEE Trans. Image Process. (2012)
- et al., Visual–textual joint relevance learning for tag-based social image search, IEEE Trans. Image Process. (2013)
- et al., On feature combination for multiclass object classification
- im2gps: estimating geographic information from a single image
- Multi-modal, multi-resource methods for placing Flickr videos on the map