Aligning plot synopses to videos for story-based retrieval

Tapaswi, Makarand; Bäuml, Martin; Stiefelhagen, Rainer

doi:10.1007/s13735-014-0065-9

Regular Paper
Published: 11 September 2014

International Journal of Multimedia Information Retrieval, Volume 4, pages 3–16 (2015)

Abstract

We propose a method to facilitate search through the storyline of TV series episodes. To this end, we use human-written, crowdsourced descriptions (plot synopses) of the story conveyed in the video. We obtain such synopses from websites such as Wikipedia and propose various methods to align each sentence of the plot to shots in the video. Thus, the semantic story-based video retrieval problem is transformed into a much simpler text-based search. Finally, we return the set of shots aligned to the sentences that match the query as the corresponding video snippet. The alignment is performed by first computing a similarity score between every shot and sentence through cues such as character identities and keyword matches between plot synopses and subtitles. We then formulate the alignment as an optimization problem and solve it efficiently using dynamic programming. We evaluate our methods on the fifth season of the TV series Buffy the Vampire Slayer and show encouraging results for both the alignment and the retrieval of story events.
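The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' formulation: it assumes a precomputed sentence-by-shot similarity matrix (toy values here, standing in for the character-identity and keyword cues), requires that every sentence covers at least one shot and that sentences follow the video in order, and maximizes the summed similarity with a simple dynamic program. The function names `align_sentences_to_shots` and `retrieve_shots`, the example sentences, and the word-overlap search are all hypothetical.

```python
import numpy as np


def align_sentences_to_shots(similarity):
    """Assign each shot to one sentence, monotonically in time.

    similarity: array of shape (n_sentences, n_shots).
    Returns an int array of length n_shots holding a sentence index per shot.
    Assumption (not from the paper): every sentence gets at least one shot.
    """
    n_sent, n_shot = similarity.shape
    if n_shot < n_sent:
        raise ValueError("need at least one shot per sentence")

    score = np.full((n_sent, n_shot), -np.inf)
    advanced = np.zeros((n_sent, n_shot), dtype=bool)
    score[0, 0] = similarity[0, 0]

    for t in range(1, n_shot):
        for s in range(min(t + 1, n_sent)):
            stay = score[s, t - 1]  # shot t-1 belonged to the same sentence
            step = score[s - 1, t - 1] if s > 0 else -np.inf  # previous sentence
            if step > stay:
                score[s, t] = step + similarity[s, t]
                advanced[s, t] = True
            else:
                score[s, t] = stay + similarity[s, t]

    # Backtrack from the last sentence at the last shot.
    assignment = np.empty(n_shot, dtype=int)
    s = n_sent - 1
    for t in range(n_shot - 1, -1, -1):
        assignment[t] = s
        if advanced[s, t]:
            s -= 1
    return assignment


def retrieve_shots(query, sentences, assignment):
    """Return the shots aligned to the sentence sharing most words with the query."""
    query_words = set(query.lower().split())
    overlaps = [len(query_words & set(sent.lower().split())) for sent in sentences]
    best = int(np.argmax(overlaps))
    return [t for t, s in enumerate(assignment) if s == best]


# Toy data: 3 synopsis sentences, 5 shots (values are illustrative only).
toy_similarity = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.1],
    [0.1, 0.2, 0.9, 0.7, 0.2],
    [0.0, 0.1, 0.1, 0.3, 0.8],
])
toy_sentences = [
    "Alice meets Bob at the library",
    "Bob finds an old book",
    "Alice and Bob leave town",
]

toy_assignment = align_sentences_to_shots(toy_similarity)
print(toy_assignment.tolist())  # [0, 0, 1, 1, 2]
print(retrieve_shots("old book", toy_sentences, toy_assignment))  # [2, 3]
```

Once the alignment is fixed, retrieval reduces to finding the best-matching sentence and returning its shots. A full-text index could replace the word-overlap heuristic without changing the rest of the sketch.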

Notes

buffyworld.com/buffy/transcripts/079_tran.html.
en.wikipedia.org/wiki/Buffy_vs._Dracula#Plot.
For \(z \sim 100\), \(N_S \sim 40\), and \(N_T \sim 700\), DTW3 takes a couple of minutes to solve with our unoptimized MATLAB implementation.

Acknowledgments

This work was funded by the Deutsche Forschungsgemeinschaft (DFG — German Research Foundation) under contract no. STI-598/2-1. The views expressed herein are the authors’ responsibility and do not necessarily reflect those of DFG.

Author information

Authors and Affiliations

Computer Vision for Human Computer Interaction Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany
Makarand Tapaswi, Martin Bäuml & Rainer Stiefelhagen

Corresponding author

Correspondence to Makarand Tapaswi.

About this article

Cite this article

Tapaswi, M., Bäuml, M. & Stiefelhagen, R. Aligning plot synopses to videos for story-based retrieval. Int J Multimed Info Retr 4, 3–16 (2015). https://doi.org/10.1007/s13735-014-0065-9

Received: 04 July 2014
Revised: 04 August 2014
Accepted: 19 August 2014
Published: 11 September 2014
Issue Date: March 2015
DOI: https://doi.org/10.1007/s13735-014-0065-9
