
Discovering joint audio–visual codewords for video event detection

  • Special Issue Paper
  • Published in: Machine Vision and Applications

Abstract

Detecting complex events in videos is intrinsically a multimodal problem, since both the audio and visual channels provide important clues. While conventional methods fuse the two modalities at a superficial level, in this paper we propose a new representation, called bi-modal words, to explore representative joint audio–visual patterns. We first build a bipartite graph to model the relations between the quantized words extracted from the visual and audio modalities. Partitioning this bipartite graph then produces the bi-modal words, which reveal joint patterns across the two modalities. Different pooling strategies are employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations. Since it is difficult to predict a suitable number of bi-modal words, we generate bi-modal words at different levels (i.e., codebooks of different sizes) and use multiple kernel learning to combine the resulting representations during event classifier learning. Experimental results on three popular datasets show that the proposed method achieves statistically significant performance gains over methods using individual visual or audio features alone, as well as over existing popular multi-modal fusion methods. We also find that average pooling is particularly suitable for the bi-modal representation, and that using multiple kernel learning to combine multi-modal representations at various granularities is helpful.
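
To make the pipeline concrete, the sketch below illustrates the two core steps under stated assumptions: it uses scikit-learn's SpectralCoclustering as the bipartite spectral partitioner and a simple per-video re-quantization with average (or max) pooling. The co-occurrence construction, the smoothing constant, and all function and parameter names are our assumptions, not the authors' implementation.

```python
# A minimal sketch of the bi-modal codeword pipeline described above.
# Assumptions (not from the paper's code): the co-occurrence construction,
# SpectralCoclustering as the bipartite partitioner, and all names.
import numpy as np
from sklearn.cluster import SpectralCoclustering


def build_bimodal_words(cooc, n_bimodal, seed=0):
    """Partition the visual-audio bipartite graph into bi-modal words.

    cooc: (n_visual_words, n_audio_words) array whose (i, j) entry counts
    how often visual word i and audio word j co-occur in the same video
    (the edge weights of the bipartite graph). Returns one cluster label
    per visual word and per audio word; each cluster is a bi-modal word.
    """
    model = SpectralCoclustering(n_clusters=n_bimodal, random_state=seed)
    model.fit(cooc + 1e-8)  # tiny constant avoids empty rows/columns
    return model.row_labels_, model.column_labels_


def bimodal_bow(visual_hist, audio_hist, vis_labels, aud_labels,
                n_bimodal, pooling="average"):
    """Re-quantize a video's visual and audio BoW histograms into one
    bi-modal Bag-of-Words vector using average (or max) pooling."""
    counts = np.concatenate([visual_hist, audio_hist]).astype(float)
    labels = np.concatenate([vis_labels, aud_labels])
    rep = np.zeros(n_bimodal)
    for k in range(n_bimodal):
        members = counts[labels == k]
        if members.size:
            rep[k] = members.mean() if pooling == "average" else members.max()
    return rep / max(rep.sum(), 1e-12)  # L1 normalization
```

In the paper, this construction is repeated for several codebook sizes, one kernel is computed per resulting representation, and multiple kernel learning combines the kernels when training the event classifier; that final step is omitted from the sketch.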

Notes

  1. Normally, event detection is performed at the video level, i.e., the task is to detect whether a video contains an event of interest. Therefore, we represent each video by a single feature vector, as sketched below.
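
As a toy illustration of this footnote (the quantization step and all names are hypothetical), the sketch below pools the codeword assignments of every local descriptor in a video into one normalized Bag-of-Words vector:

```python
import numpy as np


def video_bow(word_ids, vocab_size):
    """word_ids: codeword index of each local descriptor quantized
    anywhere in the video. Returns one L1-normalized Bag-of-Words
    vector representing the whole video."""
    hist = np.bincount(np.asarray(word_ids), minlength=vocab_size)
    return hist / max(hist.sum(), 1)
```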


Acknowledgments

This work is supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20071. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Author information


Corresponding author

Correspondence to D. T. Lee.

Additional information

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government. Y.-G. Jiang is supported by two grants from NSF China (#61201387 and #61228205), two grants from STCSM (#12XD1400900 and #12511501602), and a New Teachers Fund for Doctoral Stations, Ministry of Education (#20120071120026), China.


About this article

Cite this article

Jhuo, IH., Ye, G., Gao, S. et al. Discovering joint audio–visual codewords for video event detection. Machine Vision and Applications 25, 33–47 (2014). https://doi.org/10.1007/s00138-013-0567-0
