research-article

Cross-modal categorisation of user-generated video sequences

Authors:

Sebastian Schmiedeke,

Thomas Sikora

ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval

Article No.: 25, Pages 1 - 8

https://doi.org/10.1145/2324796.2324828

Published: 05 June 2012

Abstract

This paper explores cross-modal classification of multimedia documents on social media platforms. Our framework predicts the user-chosen category of consumer-produced video sequences from their textual and visual features. The textual resources, which include metadata and automatic speech recognition transcripts, are represented as bags of words, and the video content is represented as a bag of clustered local visual features. We investigate the contribution of the different modalities and how they should be combined when sequences lack certain resources. To this end, several classification methods are evaluated on varying combinations of resources. The paper presents an approach that achieves a mean average precision of 0.3977 using user-contributed metadata in combination with clustered SURF features.
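The representations and the evaluation measure named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy vocabulary and codebook, the concatenation-based fusion, and the zero-vector fallback for a missing modality are all assumptions made for illustration. Average precision is computed here under the assumption that every relevant item appears somewhere in the ranked list.

```python
from collections import Counter
import math


def bag_of_words(tokens, vocabulary):
    """Term-frequency vector over a fixed vocabulary; unknown tokens are ignored."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def bag_of_visual_words(descriptors, codebook):
    """Histogram of nearest-codeword assignments for local descriptors.

    `codebook` stands in for cluster centres (e.g. obtained by clustering
    SURF descriptors); here it is simply given as a list of points.
    """
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda k: math.dist(descriptor, codebook[k]))
        histogram[nearest] += 1
    return histogram


def fuse(tokens, descriptors, vocabulary, codebook):
    """Hypothetical early fusion: concatenate both vectors.

    A missing modality (None) is replaced by a zero vector. This is an
    illustrative assumption, not the combination strategy of the paper.
    """
    text = bag_of_words(tokens, vocabulary) if tokens is not None else [0] * len(vocabulary)
    visual = (bag_of_visual_words(descriptors, codebook)
              if descriptors is not None else [0] * len(codebook))
    return text + visual


def average_precision(ranked_relevance):
    """Average precision of one ranked list of booleans (True = relevant)."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-category average precision values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


vocabulary = ["cat", "dog"]
codebook = [(0.0, 0.0), (1.0, 1.0)]
features = fuse(["cat", "dog", "cat", "bird"],
                [(0.1, 0.0), (0.9, 1.0), (1.0, 1.1)],
                vocabulary, codebook)
text_only = fuse(["dog"], None, vocabulary, codebook)
```

In this toy setup the fused vector holds the two term counts followed by the two codeword counts, and a sequence without visual descriptors simply gets zeros in the visual part. A classifier would then be trained on such vectors, and its per-category rankings scored with `mean_average_precision`.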

Cited By

C. Kofler, S. Bhattacharya, M. Larson, Tao Chen, A. Hanjalic, and Shih-Fu Chang (2015). Uploader Intent for Online Video: Typology, Inference, and Applications. IEEE Transactions on Multimedia, 17(8):1200-1212. Online publication date: 1-Aug-2015.
https://dl.acm.org/doi/10.1109/TMM.2015.2445573

P. Kelm, S. Schmiedeke, S. Schockaert, T. Sikora, M. Trevisiol, and O. Van Laere (2014). Georeferencing Flickr Resources Based on Multimodal Features. In Multimodal Location Estimation of Videos and Images, pages 127-152. Online publication date: 5-Oct-2014.
https://doi.org/10.1007/978-3-319-09861-6_8

S. Schmiedeke, P. Kelm, and T. Sikora (2013). DCT-based features for categorisation of social media in compressed domain. In 2013 IEEE 15th International Workshop on Multimedia Signal Processing (MMSP), pages 295-300. Online publication date: Sep-2013.
https://doi.org/10.1109/MMSP.2013.6659304

Index Terms

Cross-modal categorisation of user-generated video sequences
1. Information systems
  1. Information retrieval


Published In

ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval

June 2012

489 pages

ISBN:9781450313292

DOI:10.1145/2324796

Conference Chairs:
Horace H. S. Ip (City University of Hong Kong),
Yong Rui (Microsoft, China)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Seventh Framework Programme

Conference

ICMR '12

Sponsor:

SIGMM

ICMR '12: International Conference on Multimedia Retrieval

June 5 - 8, 2012

Hong Kong, China

Acceptance Rates

ICMR '12 Paper Acceptance Rate 50 of 145 submissions, 34%;

Overall Acceptance Rate 254 of 830 submissions, 31%
