Localizing web videos using social images
Introduction
With the explosive growth of social networks, the massive, heterogeneous multimedia data available on the Web provide a unique opportunity to bridge digital corpora and the physical world, which has become a mainstream direction of ongoing multimedia research. To facilitate accurate recommendations by exploiting social media knowledge, web media tagging has also emerged as an important research direction [27], [26], [25], [28], [9].
Beyond the widely studied semantic tagging scenario, annotating the geographical locations of social media has recently emerged as a major trend, spanning novel research topics such as landmark retrieval and recognition, visual tour guides, geography-aware image search, mobile device localization, and virtual reality.
Geographically tagging user-contributed images has been investigated in recent years [16], [13]. However, tagging web videos remains largely unexplored, even though inferring the geo-locations of web videos is vital to location-based video services. The key difficulty of video geo-location inference is twofold. On the device level, current mobile phones and digital cameras typically do not record geo-information when shooting videos; on the data level, few geographic tags are available for web videos, so it may be infeasible to train accurate classifiers or search engines to geographically tag them.
Most previous attempts at web video geo-location inference relied on social network metadata, such as titles, descriptions, and comments [17]. To improve the inference accuracy, extra evidence, e.g., visual and acoustic features, has been incorporated [15]. However, little work has successfully exploited visual content to assist the geo-location inference of web videos, mainly because insufficient training examples make it hard to model landmark appearances effectively. This challenge originates from two aspects: (1) the low quality of web videos, which limits the number of SIFT features that can be extracted and therefore compromises the accuracy of near-duplicate visual content matching; and (2) the difficulty of separating different scenes within the same video, which prevents assigning the correct geographic tags to the corresponding locations or physical scenes. We illustrate the challenge in Fig. 1.
In this paper, we propose to tackle this challenge from a novel transfer learning perspective, i.e., transferring an accurate and easy-to-learn video geo-tagging model from the image domain. Nowadays, there is an increasing amount of geo-tagged images available on the Web. Such massive image data prompts us to “transfer” the geo-tags from web images to web videos.
To transfer knowledge across the image and video domains, we need to address two major issues: first, the tags of web images are usually noisy; second, the visual features of images and videos differ considerably due to their large variations. We address the first issue by proposing a trustworthiness measurement that removes “untrustworthy” web images. Afterwards, we perform view-specific spectral clustering over the images of a given landmark to separate the different “views” of a single location. To build an effective video location inference model, we propose a novel search-based transfer learning algorithm that constructs an AdaBoost [6] classifier for each view; the outputs of the multi-view AdaBoost classifiers are then combined into an overall landmark recognition model. In addition, we incorporate temporal consistency to further improve the inference accuracy, leveraging the fact that temporally adjacent video frames within the same shot are likely to be captured from the same view of a landmark or location.
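The view-diversification and multi-view combination steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the features are synthetic stand-ins for visual descriptors, the two-view setup and the max-over-views combination rule are simplifying assumptions, and scikit-learn's off-the-shelf components are used in place of the paper's own clustering and boosting variants.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for visual features of one landmark's images:
# two "views" (e.g., front vs. side of a building).
view_a = rng.normal(loc=0.0, scale=0.5, size=(40, 8))
view_b = rng.normal(loc=3.0, scale=0.5, size=(40, 8))
images = np.vstack([view_a, view_b])

# Step 1: spectral clustering separates the landmark images into views.
views = SpectralClustering(n_clusters=2, random_state=0).fit_predict(images)

# Step 2: one AdaBoost classifier per view (positives: that view's
# images; negatives: random background samples).
background = rng.normal(loc=-3.0, scale=0.5, size=(40, 8))
classifiers = []
for v in np.unique(views):
    pos = images[views == v]
    X = np.vstack([pos, background])
    y = np.r_[np.ones(len(pos)), np.zeros(len(background))]
    classifiers.append(AdaBoostClassifier(n_estimators=20,
                                          random_state=0).fit(X, y))

def landmark_score(frame_feat):
    """Combine the multi-view classifiers: here, simply the maximum
    per-view confidence that the frame shows this landmark."""
    return max(c.predict_proba(frame_feat[None, :])[0, 1]
               for c in classifiers)

query = rng.normal(loc=3.0, scale=0.5, size=8)  # resembles view_b
print(round(landmark_score(query), 2))
```

Temporal consistency can then be layered on top, e.g., by averaging or majority-voting `landmark_score` over the frames of one shot.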
To verify the proposed video geo-location inference model, we collect more than 50,000 geo-tagged images from Flickr and 2,000 video clips from YouTube. Experimental comparisons on the two datasets show that our method achieves remarkable improvements over various competing methods. Moreover, the proposed method can easily be integrated into applications involving geo-location-based web video browsing.
The rest of this paper is organized as follows: Section 2 briefly introduces the related work. The overall framework of the proposed method is presented in Section 3. Section 4 builds the effective landmark model using social images, which is further transferred to the video domain as described in Section 5. We conduct the experimental validations in Section 6. Finally, we conclude the paper and discuss the future work in Section 7.
Section snippets
Related work
Geography-related image analysis has attracted intensive research attention from both academia and industry. One of the pioneering works is that of Kennedy et al. [16], who analyzed geographic tags (e.g., landmarks) of millions of Flickr images and built an appearance model for each landmark by incorporating both metadata and visual features. On mobile devices, Microsoft Photo2Search [13] implemented a geo-location-sensitive image search system to facilitate visual information driven
Framework
The framework overview of the proposed method is displayed in Fig. 2. In this paper, our main efforts lie in the following two aspects: mining a visual model from Flickr images (the offline image processing component), and effective transfer learning from the Flickr image domain to the YouTube video domain (the transfer learning component).
To learn a reliable geo-location inference model using the Flickr images and their metadata, we first propose an image trustworthy measurement regarding the
Building the geo-model from Flickr images
Given a set of Flickr images associated with a specific geographic tag such as Statue of Liberty or Times Square, in this section we first address the problem of noisy image filtering by proposing an image trustworthiness measurement. Then, the corresponding geo-model is built with a data-driven approach. We evaluate the trustworthiness of every image in the dataset according to its attached metadata in Section 4.2. In Section 4.3, the images are first clustered into landmark-level
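The noisy-image filtering step can be illustrated with a toy trust score. The paper's exact measurement is not reproduced here; the score below is a hypothetical combination of metadata cues (tag/landmark agreement, presence of a geotag, view count), and the `FlickrImage` fields and the threshold are assumptions made for the sketch.

```python
from dataclasses import dataclass
from math import log1p

@dataclass
class FlickrImage:
    tags: list          # user-supplied tags
    n_views: int        # view count from metadata
    has_geotag: bool    # whether the uploader attached coordinates

def trust_score(img: FlickrImage, landmark: str) -> float:
    """Hypothetical trustworthiness: reward images whose tags mention
    the queried landmark, that carry a geotag, and that have been
    viewed by the community (log-damped)."""
    tag_hit = 1.0 if landmark.lower() in (t.lower() for t in img.tags) else 0.0
    return tag_hit + (1.0 if img.has_geotag else 0.0) + 0.1 * log1p(img.n_views)

def filter_untrustworthy(images, landmark, threshold=1.5):
    # Keep only images whose score clears the (assumed) threshold.
    return [i for i in images if trust_score(i, landmark) >= threshold]

imgs = [
    FlickrImage(["statue of liberty", "nyc"], 500, True),
    FlickrImage(["party", "friends"], 3, False),  # noisy image
]
kept = filter_untrustworthy(imgs, "Statue of Liberty")
print(len(kept))  # → 1
```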
Search-based transfer learning
Due to the gap between the cross-domain feature spaces, a model learned from Flickr images cannot be directly applied to YouTube videos: Flickr images and YouTube videos typically differ in resolution and photographic quality. To solve this problem, we propose a search-based transfer learning algorithm in this section.
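The retrieval side of search-based transfer can be sketched as a nearest-neighbor search of a video frame against the geo-tagged image collection, with a majority vote over the retrieved geo-tags. This is a minimal illustration under assumed inputs: the descriptors are random stand-ins, Euclidean distance replaces whatever matching the paper uses, and `k` is an arbitrary choice.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# Geo-tagged Flickr images: synthetic visual descriptors, each
# labelled with a location id (0 or 1 here).
image_feats = np.vstack([rng.normal(0, 0.3, (30, 16)),
                         rng.normal(2, 0.3, (30, 16))])
image_geo = np.r_[np.zeros(30, dtype=int), np.ones(30, dtype=int)]

def infer_location(frame_feat, k=5):
    """Search-based transfer: retrieve the k visually nearest
    geo-tagged images and majority-vote their geo-tags."""
    dists = np.linalg.norm(image_feats - frame_feat, axis=1)
    nearest = image_geo[np.argsort(dists)[:k]]
    return Counter(nearest.tolist()).most_common(1)[0][0]

# A video frame whose features resemble the second location's images.
frame = rng.normal(2, 0.3, 16)
print(infer_location(frame))  # → 1
```

The same vote can be pooled over all frames of a shot, which is one simple way to realize the temporal-consistency idea from the introduction.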
Experimental results
In this section, we first introduce the datasets used and the experimental settings. We then present in detail a group of experiments conducted on the collected datasets. Finally, we demonstrate an application of our approach that aims at placing web videos on a geographic map.
Conclusion and future work
This paper presents the first work endeavoring to infer the geographical locations of web videos. The core technique of our approach is to transfer geo-tagging knowledge directly from web images to web videos, avoiding the training of a video-domain geographical inference model that would suffer from inadequate labeled data. The proposed geo-location inference framework has several key components, including a spectral clustering algorithm to diversify multiple views of web images for
Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grant 61422210 and Grant 61373076, the Fundamental Research Funds for the Central Universities under Grant 2013121026, and Xiamen University 985 Project.
References (33)
- et al., City-scale landmark identification on mobile devices
- et al., Mapping the world’s photos
- L. Duan, I.W.-H. Tsang, D. Xu, S.J. Maybank, Domain transfer SVM for video concept detection, in: IEEE International...
- L. Duan, D. Xu, I.W.-H. Tsang, J. Luo, Visual event recognition in videos by learning from web data, in: IEEE Computer...
- et al., LIBLINEAR: a library for large linear classification, J. Mach. Learn. Res. (2008)
- et al., A decision-theoretic generalization of on-line learning and an application to boosting
- et al., Camera constraint-free view-based 3-D object retrieval, IEEE Trans. Image Process. (2012)
- et al., 3-D object retrieval and recognition with hypergraph analysis, IEEE Trans. Image Process. (2012)
- et al., Visual–textual joint relevance learning for tag-based social image search, IEEE Trans. Image Process. (2013)
- et al., On feature combination for multiclass object classification
- im2gps: estimating geographic information from a single image
- Multi-modal, multi-resource methods for placing Flickr videos on the map