Recurrent Support Vector Machines for Audio-Based Multimedia Event Detection

Authors:

Florian Metze

ICMR '16: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval

Pages 265 - 269

https://doi.org/10.1145/2911996.2912048

Published: 06 June 2016

Abstract

Multimedia event detection (MED) is the task of detecting given events (e.g., a parade or a birthday party) in a large collection of video clips. While the most useful information comes from visual features and speech recognition, much can also be inferred from the non-speech audio content, either alone or in conjunction with visual and speech cues. This paper studies MED using non-speech audio information only. MED is usually performed in two stages. The first stage generates a representation for each clip, either a single vector or a sequence of vectors, often by aggregating frame-level features; the second stage performs binary or multi-class classification to decide whether each target event occurs in each clip. Common classifiers for the second stage include support vector machines (SVMs), feed-forward deep neural networks (DNNs), and recurrent neural networks (RNNs).
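
The two-stage structure described above can be illustrated with a minimal sketch. It is not the paper's pipeline: the mean-pooling aggregator, the toy two-dimensional "frame features", and every function name and hyperparameter here are assumptions chosen for illustration, and the classifier is a plain linear SVM trained by subgradient descent on the hinge loss.

```python
import random

def mean_pool(frames):
    """Stage 1 (assumed aggregator): average frame-level vectors into one clip vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def score(w, b, x):
    """Signed decision value of a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Stage 2: binary linear SVM via subgradient descent on the
    L2-regularized hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            margin = t * score(w, b, x)
            # Weight decay from the L2 regularizer, applied on every example.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: move the boundary away from this example.
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

# Toy data (hypothetical): clips of varying length with 2-D frame features.
rng = random.Random(0)

def make_clip(center, n_frames):
    return [[rng.gauss(center, 0.5), rng.gauss(-center, 0.5)] for _ in range(n_frames)]

clips = [make_clip(+1.0, rng.randint(5, 20)) for _ in range(20)]
clips += [make_clip(-1.0, rng.randint(5, 20)) for _ in range(20)]
labels = [+1] * 20 + [-1] * 20

X = [mean_pool(c) for c in clips]          # stage 1: one vector per clip
w, b = train_linear_svm(X, labels)          # stage 2: event vs. non-event
predictions = [1 if score(w, b, x) > 0 else -1 for x in X]
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
```

Mean pooling discards the order of frames, which is one reason the abstract also mentions representing a clip as a sequence of vectors for models that can consume sequences.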

In this paper, we propose to classify clips for events using "recurrent SVMs". These models combine the kernel mapping and large-margin optimization criterion of SVMs with the ability of RNNs to process variable-length sequences. Reinforced with data augmentation, recurrent SVMs achieved higher mean average precision (MAP) on the TRECVID 2011 MED task than both SVMs and RNNs.
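
One way to picture the combination named above is the following forward-pass sketch. The abstract does not specify the architecture, so everything here is an assumption: random Fourier features stand in for the kernel mapping, a plain tanh recurrence stands in for the RNN, the hinge loss stands in for the large-margin criterion, and the class name `RecurrentSVMSketch` and all sizes are hypothetical. Weights are random and no training is shown.

```python
import math
import random

rng = random.Random(0)

def rand_matrix(rows, cols, scale):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class RecurrentSVMSketch:
    """Illustrative only: kernel feature map -> simple RNN -> linear margin score."""

    def __init__(self, in_dim, n_features, hidden, gamma=1.0):
        # Random Fourier features approximating the RBF kernel
        # exp(-gamma * ||x - y||^2): frequencies ~ N(0, 2*gamma), phases ~ U[0, 2*pi].
        self.W = rand_matrix(n_features, in_dim, math.sqrt(2.0 * gamma))
        self.phase = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
        self.U = rand_matrix(hidden, n_features, 0.1)   # input-to-hidden weights
        self.V = rand_matrix(hidden, hidden, 0.1)       # hidden-to-hidden weights
        self.w_out = [rng.gauss(0.0, 0.1) for _ in range(hidden)]
        self.n_features = n_features
        self.hidden = hidden

    def kernel_map(self, x):
        """Explicit (approximate) kernel feature map applied to one frame."""
        s = math.sqrt(2.0 / self.n_features)
        return [s * math.cos(p + b) for p, b in zip(matvec(self.W, x), self.phase)]

    def score(self, frames):
        """Run the recurrence over a clip of any length; return a signed score."""
        h = [0.0] * self.hidden
        for x in frames:
            z = self.kernel_map(x)
            pre = [a + c for a, c in zip(matvec(self.U, z), matvec(self.V, h))]
            h = [math.tanh(p) for p in pre]
        return sum(w * hi for w, hi in zip(self.w_out, h))

def hinge_loss(score, label):
    """Large-margin criterion: zero once the example is on the correct
    side of the boundary with margin at least 1. Label is +1 or -1."""
    return max(0.0, 1.0 - label * score)

# Clips of different lengths pass through the same model.
model = RecurrentSVMSketch(in_dim=3, n_features=64, hidden=8)
short_clip = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]
long_clip = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
s_short = model.score(short_clip)
s_long = model.score(long_clip)
loss_short = hinge_loss(s_short, +1)
```

In a trained version of such a model the hinge loss would replace the cross-entropy objective usually placed on top of an RNN, and its subgradient would be backpropagated through the recurrence; that training loop is omitted here.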

Published In

ICMR '16: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval

June 2016

452 pages

ISBN: 9781450343596

DOI: 10.1145/2911996

General Chairs: John R. Kender (Columbia University, USA), John R. Smith (IBM Research, USA)

Program Chairs: Jiebo Luo (University of Rochester, USA), Susanne Boll (University of Oldenburg, Germany), Winston Hsu (National Taiwan University, Taiwan)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Conference

ICMR'16

Sponsor:

SIGMM

ICMR'16: International Conference on Multimedia Retrieval

June 6 - 9, 2016

New York, New York, USA

Acceptance Rates

ICMR '16 Paper Acceptance Rate: 20 of 120 submissions, 17%

Overall Acceptance Rate: 254 of 830 submissions, 31%
