Abstract
With the ever-increasing growth of video content, automatic video summarization has become an important task that has attracted considerable interest in the research community. One of the challenges that makes it a hard problem to solve is the presence of multiple ‘correct answers’: because of the highly subjective nature of the task, there can be different “ideal” summaries of a video. Modelling user intent in the form of queries has been proposed in the literature as a way to alleviate this problem. A query-focused summary is expected to contain shots that are relevant to the query in conjunction with other important shots. For practical deployments in which very long videos need to be summarized, the need to capture the user’s intent becomes all the more pronounced. In this work, we propose a simple two-stage method that takes a user query and a video as input and generates a query-focused summary. Specifically, in the first stage, we employ attention within a segment and across all segments, combined with the query, to learn the feature representation of each shot. In the second stage, the learned features are again fused with the query to predict the score of each shot by regression through fully connected layers. We then assemble the summary by arranging the top-scoring shots in chronological order. Extensive experiments on a benchmark query-focused video summarization dataset for long videos show better results than the current state of the art, demonstrating the effectiveness of our method even without computationally expensive architectures such as LSTMs, variational autoencoders, GANs, or reinforcement learning, as used by most past works.
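The two-stage pipeline described above can be illustrated with a minimal, dependency-free sketch: query-conditioned attention re-weights shot features, a linear layer regresses a relevance score from the query-fused representation, and the summary keeps the top-scoring shots in chronological order. All function names, the single-head dot-product attention form, and the concatenation-based fusion here are illustrative assumptions, not the paper's exact architecture (which also attends across segments and uses deeper fully connected layers).

```python
import math
from typing import List

Vec = List[float]

def softmax(xs: Vec) -> Vec:
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))

def attend(query: Vec, shots: List[Vec]) -> List[Vec]:
    # Stage 1 (sketch): query-conditioned attention over shot features.
    # Each shot representation is re-weighted by its scaled dot-product
    # similarity to the query embedding.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, s) / scale for s in shots])
    return [[w * f for f in s] for w, s in zip(weights, shots)]

def score_shots(query: Vec, shots: List[Vec], w: Vec, b: float) -> List[float]:
    # Stage 2 (sketch): fuse the attended features with the query
    # (concatenation here) and regress a relevance score per shot
    # with a single linear layer standing in for the FC stack.
    attended = attend(query, shots)
    return [dot(w, a + query) + b for a in attended]

def summarize(scores: List[float], k: int) -> List[int]:
    # Keep the k top-scoring shots, then restore chronological order.
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sorted(top)
```

A usage example: with a 2-d query `[1, 0]` and three toy shots, the shot most aligned with the query receives the largest attention weight and score, and `summarize` returns the selected shot indices in their original temporal order rather than by score.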
S. Nalla and M. Agrawal—Equal contribution.
Acknowledgements
This work is supported in part by the Ekal Fellowship (www.ekal.org) and National Center of Excellence in Technology for Internal Security, IIT Bombay (NCETIS, https://rnd.iitb.ac.in/node/101506).
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Nalla, S., Agrawal, M., Kaushal, V., Ramakrishnan, G., Iyer, R. (2020). Watch Hours in Minutes: Summarizing Videos with User Intent. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science(), vol 12539. Springer, Cham. https://doi.org/10.1007/978-3-030-68238-5_47
Print ISBN: 978-3-030-68237-8
Online ISBN: 978-3-030-68238-5