Aligning plot synopses to videos for story-based retrieval

  • Regular Paper
International Journal of Multimedia Information Retrieval

Abstract

We propose a method to facilitate search through the storyline of TV series episodes. To this end, we use human-written, crowdsourced descriptions of the story conveyed in the video, known as plot synopses. We obtain such synopses from websites such as Wikipedia and propose several methods to align each sentence of the plot synopsis to shots in the video. This transforms the semantic, story-based video retrieval problem into a much simpler text-based search: given a query, we search the synopsis text and return the set of shots aligned to the matching sentences as the corresponding video snippet. The alignment is performed by first computing a similarity score between every shot and every sentence using cues such as character identities and keyword matches between plot synopses and subtitles. We then formulate the alignment as an optimization problem and solve it efficiently using dynamic programming. We evaluate our methods on the fifth season of the TV series Buffy the Vampire Slayer and show encouraging results for both the alignment and the retrieval of story events.
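The following is a minimal sketch of the general idea, not the formulation used in the paper. It assumes a precomputed sentence-by-shot similarity matrix and a simplified constraint that every shot is assigned to exactly one sentence, in story order, with every sentence receiving at least one shot. The function names `align_shots_to_sentences` and `retrieve_shots`, the toy similarity values, and the word-overlap query matching are all illustrative assumptions.

```python
def align_shots_to_sentences(sim):
    """Monotonic alignment by dynamic programming.

    sim[s][t] is the (assumed, precomputed) similarity between sentence s
    and shot t. Returns a list giving the sentence index assigned to each
    shot, non-decreasing over time, maximizing the summed similarity.
    """
    n_s, n_t = len(sim), len(sim[0])
    if n_t < n_s:
        raise ValueError("need at least one shot per sentence")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_t for _ in range(n_s)]
    # back[s][t] == 1 means shot t starts sentence s (previous shot was in s-1).
    back = [[0] * n_t for _ in range(n_s)]
    score[0][0] = sim[0][0]
    for t in range(1, n_t):
        for s in range(min(t + 1, n_s)):
            stay = score[s][t - 1]
            advance = score[s - 1][t - 1] if s > 0 else neg_inf
            if advance > stay:
                score[s][t] = advance + sim[s][t]
                back[s][t] = 1
            else:
                score[s][t] = stay + sim[s][t]
    # Backtrack from the last sentence at the last shot.
    assignment = [0] * n_t
    s = n_s - 1
    for t in range(n_t - 1, -1, -1):
        assignment[t] = s
        s -= back[s][t]
    return assignment


def retrieve_shots(query, sentences, assignment):
    """Text search stand-in: pick the sentence sharing the most words with
    the query and return the indices of the shots aligned to it."""
    q = set(query.lower().split())
    overlaps = [len(q & set(sent.lower().split())) for sent in sentences]
    best = overlaps.index(max(overlaps))
    return [t for t, s in enumerate(assignment) if s == best]


# Toy example: 3 synopsis sentences, 5 shots (similarity values are made up).
toy_sim = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
]
toy_sentences = [
    "Buffy fights a vampire",
    "Dracula arrives in Sunnydale",
    "Xander becomes a servant",
]
toy_assignment = align_shots_to_sentences(toy_sim)
toy_result = retrieve_shots("dracula arrives", toy_sentences, toy_assignment)
```

The table has one cell per sentence-shot pair and each cell is filled in constant time, so the cost grows with the number of sentences times the number of shots. A practical system would replace the word-overlap matching with a proper text index and would likely relax the one-sentence-per-shot constraint.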

Notes

  1. buffyworld.com/buffy/transcripts/079_tran.html.

  2. en.wikipedia.org/wiki/Buffy_vs._Dracula#Plot.

  3. For \(z \sim 100\), \(N_S \sim 40\) and \(N_T \sim 700\), DTW3 takes a couple of minutes to solve with our unoptimized MATLAB implementation.

Acknowledgments

This work was funded by the Deutsche Forschungsgemeinschaft (DFG — German Research Foundation) under contract no. STI-598/2-1. The views expressed herein are the authors’ responsibility and do not necessarily reflect those of DFG.

Author information

Corresponding author

Correspondence to Makarand Tapaswi.

About this article

Cite this article

Tapaswi, M., Bäuml, M. & Stiefelhagen, R. Aligning plot synopses to videos for story-based retrieval. Int J Multimed Info Retr 4, 3–16 (2015). https://doi.org/10.1007/s13735-014-0065-9

  • DOI: https://doi.org/10.1007/s13735-014-0065-9
