DOI: 10.1145/2911996.2912048

Recurrent Support Vector Machines for Audio-Based Multimedia Event Detection

Published: 06 June 2016

Abstract

Multimedia event detection (MED) is the task of detecting given events (e.g. parade, birthday party) in a large collection of video clips. While the most useful information comes from visual features and speech recognition, a lot can also be inferred from the non-speech audio content, either alone or in conjunction with visual and speech cues. This paper studies MED with non-speech audio information only. MED is usually performed in two stages. The first stage generates a representation for each clip in the form of either a single vector or a sequence of vectors, often by aggregating frame-level features; the second stage performs binary or multi-class classification to decide whether each target event occurs in each clip. Common classifiers used for the second stage include support vector machines (SVMs), feed-forward deep neural networks (DNNs), and recurrent neural networks (RNNs).
In this paper, we propose to classify clips for events using "recurrent SVMs". These models combine the kernel mapping and large-margin optimization criterion of SVMs with the ability of RNNs to process variable-length sequences. Reinforced with data augmentation, recurrent SVMs achieve higher mean average precision (MAP) on the TRECVID 2011 MED task than both SVMs and RNNs.
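
The combination described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the kernel map, the recurrent architecture, or the training procedure, so everything here is an assumption made for illustration. In particular, the random Fourier feature map (one common way to approximate an RBF kernel), the plain tanh recurrence, and all names (`init_params`, `clip_score`, `hinge_objective`) and dimensions are hypothetical.

```python
import numpy as np

def init_params(in_dim, rff_dim, hid_dim, gamma=1.0, seed=0):
    """Random, untrained parameters; all sizes are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    return {
        # Random Fourier features approximating exp(-gamma * ||x - y||^2).
        "W_rff": rng.normal(0.0, np.sqrt(2.0 * gamma), (in_dim, rff_dim)),
        "b_rff": rng.uniform(0.0, 2.0 * np.pi, rff_dim),
        # Simple tanh recurrence (assumed; the paper may use another cell).
        "Wx": rng.normal(0.0, 0.1, (rff_dim, hid_dim)),
        "Wh": rng.normal(0.0, 0.1, (hid_dim, hid_dim)),
        "bh": np.zeros(hid_dim),
        # Linear large-margin output layer.
        "w": rng.normal(0.0, 0.1, hid_dim),
        "b": 0.0,
    }

def kernel_map(frames, p):
    """Map each frame (row) to an explicit approximate kernel feature space."""
    rff_dim = p["W_rff"].shape[1]
    return np.sqrt(2.0 / rff_dim) * np.cos(frames @ p["W_rff"] + p["b_rff"])

def final_state(seq, p):
    """Run the recurrence over a sequence of any length; return the last state."""
    h = np.zeros(p["Wh"].shape[0])
    for x_t in seq:
        h = np.tanh(x_t @ p["Wx"] + h @ p["Wh"] + p["bh"])
    return h

def clip_score(frames, p):
    """Score one clip: kernel map, then recurrence, then linear output."""
    h = final_state(kernel_map(frames, p), p)
    return float(h @ p["w"] + p["b"])

def hinge_objective(scores, labels, w, C=1.0):
    """SVM-style objective: 0.5 * ||w||^2 + C * sum of hinge losses.

    labels are in {-1, +1}; a clip incurs no loss once its margin reaches 1.
    """
    margins = np.maximum(0.0, 1.0 - labels * scores)
    return 0.5 * float(w @ w) + C * float(margins.sum())
```

Because the recurrence consumes one frame at a time, clips of different lengths yield a single score each, for example `clip_score(np.random.default_rng(1).normal(size=(50, 4)), init_params(4, 16, 8))`. Training would minimise `hinge_objective` over all parameters by gradient descent, which this sketch leaves out.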

Published In

ICMR '16: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval
June 2016
452 pages
ISBN:9781450343596
DOI:10.1145/2911996

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data augmentation
  2. hinge loss
  3. kernel mapping
  4. large margin
  5. multimedia event detection (MED)
  6. noisemes
  7. recurrent neural networks (RNNs)
  8. support vector machines (SVMs)

Qualifiers

  • Short-paper

Conference

ICMR'16: International Conference on Multimedia Retrieval
June 6 - 9, 2016
New York, New York, USA

Acceptance Rates

ICMR '16 Paper Acceptance Rate 20 of 120 submissions, 17%;
Overall Acceptance Rate 254 of 830 submissions, 31%

Cited By

  • (2021) Multimodal person detection system. Multimedia Tools and Applications 80:9 (13389-13406). DOI: 10.1007/s11042-020-10307-8. Online publication date: 1-Apr-2021
  • (2019) How Deep Features Have Improved Event Recognition in Multimedia. ACM Transactions on Multimedia Computing, Communications, and Applications 15:2 (1-27). DOI: 10.1145/3306240. Online publication date: 5-Jun-2019
  • (2018) Class-aware Self-Attention for Audio Event Recognition. Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval (28-36). DOI: 10.1145/3206025.3206067. Online publication date: 5-Jun-2018
  • (2017) Research Survey on Support Vector Machine. Proceedings of the 10th EAI International Conference on Mobile Multimedia Communications (95-103). DOI: 10.4108/eai.13-7-2017.2270596. Online publication date: 8-Dec-2017
