Abstract
The Bag-of-Visual-Words (BovW) model is one of the most popular visual content representations for large-scale content-based video retrieval. Local visual features are quantized against a visual vocabulary, which is generated by clustering the features (e.g., with K-means or a Gaussian mixture model). In principle, two types of error can occur during quantization, referred to as the UnderQuantize and OverQuantize problems. The former introduces ambiguity and often leads to false matches of visual content, while the latter generates synonyms and may cause true matches to be missed. Unlike most state-of-the-art research, which concentrates on enhancing the BovW model by disambiguating the visual words, in this paper we address the OverQuantize problem by incorporating the similarity of the spatial-temporal contexts associated with pairs of visual words. Visual words with similar appearance and context are assumed to be synonyms, and such synonyms in the initial visual vocabulary are merged to rebuild a more compact and descriptive vocabulary. Our approach was evaluated on the TRECVID2002 and CC_WEB_VIDEO datasets for two typical Query-By-Example (QBE) video retrieval applications. Experimental results demonstrate substantial improvements in retrieval performance over the initial visual vocabulary generated by the BovW model. We also show that our approach can be combined with a state-of-the-art disambiguation method to further improve QBE video retrieval performance.
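The pipeline the abstract describes (quantize local descriptors against a clustered vocabulary, then merge visual-word synonyms that are similar in both appearance and context) can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the co-occurrence-based context vectors, the two thresholds, and all function names are assumptions introduced here for clarity.

```python
import numpy as np

def quantize(descriptors, vocabulary):
    # Assign each local descriptor to its nearest visual word (centroid).
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def context_vectors(word_ids, frame_ids, k):
    # Illustrative "context" of a word: its L2-normalized co-occurrence
    # counts with all words over frames (a crude stand-in for the
    # spatial-temporal context used in the paper).
    ctx = np.zeros((k, k))
    for f in np.unique(frame_ids):
        present = np.unique(word_ids[frame_ids == f])
        for w in present:
            ctx[w, present] += 1
    norms = np.linalg.norm(ctx, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return ctx / norms

def merge_synonyms(vocabulary, ctx, appear_thresh, ctx_thresh):
    # Greedily merge (union-find) pairs of words whose centroids are close
    # in appearance AND whose context vectors have high cosine similarity.
    k = len(vocabulary)
    parent = list(range(k))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(k):
        for j in range(i + 1, k):
            appearance_dist = np.linalg.norm(vocabulary[i] - vocabulary[j])
            context_sim = float(ctx[i] @ ctx[j])  # rows are unit-normalized
            if appearance_dist < appear_thresh and context_sim > ctx_thresh:
                parent[find(j)] = find(i)
    # Relabel roots into a compact id space for the rebuilt vocabulary.
    roots = sorted({find(i) for i in range(k)})
    remap = {r: n for n, r in enumerate(roots)}
    return {i: remap[find(i)] for i in range(k)}
```

A real system would build the vocabulary with large-scale K-means over local descriptors (e.g. SIFT) and use a much richer context model; the merge map returned here simply relabels synonymous words to a shared id, yielding the smaller rebuilt vocabulary.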
© 2014 Springer International Publishing Switzerland
Cite this paper
Wang, L., Elyan, E., Song, D. (2014). Rebuilding Visual Vocabulary via Spatial-temporal Context Similarity for Video Retrieval. In: Gurrin, C., Hopfgartner, F., Hurst, W., Johansen, H., Lee, H., O’Connor, N. (eds) MultiMedia Modeling. MMM 2014. Lecture Notes in Computer Science, vol 8325. Springer, Cham. https://doi.org/10.1007/978-3-319-04114-8_7
Print ISBN: 978-3-319-04113-1
Online ISBN: 978-3-319-04114-8
eBook Packages: Computer Science, Computer Science (R0)