Deep learning for content-based video retrieval in film and television production

Mühling, Markus; Korfhage, Nikolaus; Müller, Eric; Otto, Christian; Springstein, Matthias; Langelage, Thomas; Veith, Uli; Ewerth, Ralph; Freisleben, Bernd

doi:10.1007/s11042-017-4962-9

Published: 05 July 2017

Multimedia Tools and Applications, volume 76, pages 22169–22194 (2017)

Markus Mühling¹ (ORCID: orcid.org/0000-0001-7391-264X),
Nikolaus Korfhage¹,
Eric Müller³,⁴,
Christian Otto³,⁴,
Matthias Springstein³,
Thomas Langelage²,
Uli Veith²,
Ralph Ewerth³,⁴ &
Bernd Freisleben¹

Abstract

While digitization has changed the workflow of professional media production, the content-based labeling of image sequences and video footage, which is necessary for all subsequent stages of film and television production, archival or marketing, is typically still performed manually and is thus quite time-consuming. In this paper, we present deep learning approaches to support professional media production. In particular, novel algorithms for visual concept detection, similarity search, face detection, face recognition, and face clustering are combined in a multimedia tool for effective video inspection and retrieval. The analysis algorithms for concept detection and similarity search are combined in a multi-task learning approach that shares network weights, saving almost half of the computation time. Furthermore, a new visual concept lexicon tailored to fast video retrieval for media production and novel visualization components are introduced. Experimental results show the quality of the proposed approaches. For example, concept detection achieves a mean average precision of approximately 90% on the top-100 video shots, and face recognition clearly outperforms the baseline on the public Movie Trailers Face Dataset.
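
The mean average precision on the top-100 video shots mentioned above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names are hypothetical, and the normalization (dividing by the number of relevant shots found within the top k) is an assumption, since the abstract does not state which variant of the measure was used.

```python
from typing import Iterable, Sequence


def average_precision_at_k(relevance: Sequence[bool], k: int = 100) -> float:
    """Average precision over the first k entries of a ranked shot list.

    relevance[i] is True if the shot at rank i + 1 shows the queried concept.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank: relevant shots so far / shots inspected.
            precision_sum += hits / rank
    # Assumption: normalize by the relevant shots found within the top k.
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(
    rankings: Iterable[Sequence[bool]], k: int = 100
) -> float:
    """Mean of the per-concept average precision values."""
    scores = [average_precision_at_k(ranking, k) for ranking in rankings]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Two toy concepts with hand-labeled ranked lists (illustrative data only).
    toy_rankings = [[True, False, True], [False, True]]
    print(mean_average_precision_at_k(toy_rankings, k=100))
```

Each inner list stands for one concept's ranked result list, so the mean is taken over concepts rather than over individual shots.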

Acknowledgements

This work is financially supported by the German Federal Ministry for Economic Affairs and Energy (BMWi) in the ZIM Programme.

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032, Marburg, Germany
Markus Mühling, Nikolaus Korfhage & Bernd Freisleben
taglicht media Film- & Fernsehproduktion GmbH, Cäsarstraße 58, 50968, Köln, Germany
Thomas Langelage & Uli Veith
German National Library of Science and Technology (TIB), Welfengarten 1B, D-30167, Hannover, Germany
Eric Müller, Christian Otto, Matthias Springstein & Ralph Ewerth
L3S Research Center, Leibniz Universität Hannover, Appelstr. 4, D-30167, Hannover, Germany
Eric Müller, Christian Otto & Ralph Ewerth

Corresponding author

Correspondence to Markus Mühling.

About this article

Cite this article

Mühling, M., Korfhage, N., Müller, E. et al. Deep learning for content-based video retrieval in film and television production. Multimed Tools Appl 76, 22169–22194 (2017). https://doi.org/10.1007/s11042-017-4962-9

Received: 01 December 2016
Revised: 02 June 2017
Accepted: 18 June 2017
Published: 05 July 2017
Issue Date: November 2017
DOI: https://doi.org/10.1007/s11042-017-4962-9
