Short video fingerprint extraction: from audio–visual fingerprint fusion to multi-index hashing

Zhang, Shuying; Zhang, Jing; Wang, Yizhou; Zhuo, Li

doi:10.1007/s00530-022-01031-4

Regular Paper
Published: 04 December 2022

Multimedia Systems, Volume 29, pages 981–1000 (2023)

Shuying Zhang¹, Jing Zhang¹,², Yizhou Wang¹ & Li Zhuo¹,²

306 Accesses
1 Citation
Abstract

Short video, one of the most prevalent forms of we-media, has grown exponentially and has become a hotspot for copyright infringement. Video fingerprint extraction supports the automatic identification of short videos. To withstand various tampering attacks, a short video fingerprint extraction method is proposed that proceeds from audio–visual fingerprint fusion to multi-index hashing. The method has three parts: (1) the shot-level fingerprint of a short video is extracted by fusing the audio and visual fingerprints after a consistency analysis that eliminates uncertainty at the decision-making layer; the visual fingerprint is generated by an R(2+1)D network, and the audio fingerprint combines audio features extracted with masked audio spectral keypoints (MASK) and a convolutional recurrent neural network (CRNN); (2) the shot-level fingerprints are assembled into the data-level fingerprint of the short video by constructing a data-shot-key frame relationship model of the data structure; (3) short video fingerprints are matched by building a multi-index hash of the data-level fingerprint and measuring the weighted Hamming distance. Five experiments on the CC_Web_Video and Moments_in_Time_Raw_v2 datasets show that the method effectively improves the overall performance of short video fingerprinting.
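The matching step in part (3) can be illustrated with a minimal sketch of generic multi-index hashing followed by weighted Hamming re-ranking. This is not the authors' implementation: the abstract does not specify the code length, the number of substrings, or how the weights are chosen, so `n_bits`, `m`, `radius`, `weights`, and all function names below are hypothetical. The sketch relies only on the standard pigeonhole argument that two codes differing in at most `radius` bits must differ in at most `radius // m` bits in at least one of their `m` substrings.

```python
from collections import defaultdict
from itertools import combinations


def split_code(code, n_bits, m):
    """Split an n_bits-wide integer code into m equal substrings, most significant first."""
    assert n_bits % m == 0, "this sketch assumes n_bits is divisible by m"
    width = n_bits // m
    mask = (1 << width) - 1
    return [(code >> (width * (m - 1 - i))) & mask for i in range(m)]


def build_index(codes, n_bits, m):
    """Build one hash table per substring position, mapping substring value to code indices."""
    tables = [defaultdict(list) for _ in range(m)]
    for idx, code in enumerate(codes):
        for t, sub in enumerate(split_code(code, n_bits, m)):
            tables[t][sub].append(idx)
    return tables


def neighbours(sub, width, radius):
    """Yield every width-bit value within Hamming distance `radius` of `sub`."""
    yield sub
    for r in range(1, radius + 1):
        for positions in combinations(range(width), r):
            flipped = sub
            for p in positions:
                flipped ^= 1 << p
            yield flipped


def weighted_hamming(a, b, weights):
    """Sum weights[i] over the bit positions i (0 = least significant) where a and b differ."""
    diff = a ^ b
    return sum(w for i, w in enumerate(weights) if (diff >> i) & 1)


def search(query, codes, tables, n_bits, m, radius, weights):
    """Return (weighted distance, index) pairs for codes within `radius` bits of `query`."""
    width = n_bits // m
    sub_radius = radius // m  # pigeonhole bound on the per-substring distance
    candidates = set()
    for t, sub in enumerate(split_code(query, n_bits, m)):
        for probe in neighbours(sub, width, sub_radius):
            candidates.update(tables[t].get(probe, ()))
    # Verify candidates with the plain Hamming distance, then rank by the weighted one.
    hits = [
        (weighted_hamming(query, codes[i], weights), i)
        for i in candidates
        if bin(query ^ codes[i]).count("1") <= radius
    ]
    return sorted(hits)
```

With 8-bit toy codes one might call `tables = build_index(codes, 8, 2)` and then `search(query, codes, tables, 8, 2, radius=2, weights=w)`. The per-bit weights here only affect the final ranking, not which candidates are retrieved; whether the paper applies its weights at the same stage is not stated in the abstract.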

Data availability statement

Data is openly available in a public repository that issues datasets. The datasets generated during and/or analyzed during the current study are available in the CC_Web_Video repository at http://vireo.cs.cityu.edu.hk/webvideo/ and the Moments_in_Time_Raw_v2 repository at http://moments.csail.mit.edu/.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 61971016 and 61531006 and in part by Beijing Municipal Education Commission Cooperation Beijing Natural Science Foundation under Grant KZ201910005007.

Author information

Authors and Affiliations

Faculty of Information Technology, Beijing University of Technology, Beijing, China
Shuying Zhang, Jing Zhang, Yizhou Wang & Li Zhuo
Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing, China
Jing Zhang & Li Zhuo

Contributions

All the authors made significant contributions to the work. SZ, JZ, and YW wrote the main manuscript text and prepared the figures. SZ, JZ, YW, and LZ proposed the conception of this work and devised the algorithm. SZ and YW performed the formal analysis and carried out the experiments. SZ and JZ checked the experiments and revised the paper. JZ and LZ provided instrumentation and computing resources for this study. All authors reviewed the manuscript.

Corresponding author

Correspondence to Jing Zhang.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, S., Zhang, J., Wang, Y. et al. Short video fingerprint extraction: from audio–visual fingerprint fusion to multi-index hashing. Multimedia Systems 29, 981–1000 (2023). https://doi.org/10.1007/s00530-022-01031-4

Received: 30 August 2022
Accepted: 23 November 2022
Published: 04 December 2022
Issue Date: June 2023
DOI: https://doi.org/10.1007/s00530-022-01031-4
