Abstract
As one of the most prevalent forms of we-media, short video has grown exponentially and has increasingly become a hotspot for infringement. Video fingerprint extraction supports the intelligent identification of short videos. To withstand various tampering attacks, a short video fingerprint extraction method that proceeds from audio–visual fingerprint fusion to multi-index hashing is proposed. It consists of three stages: (1) the shot-level fingerprint of a short video is extracted by audio–visual fingerprint fusion, performed after a consistency analysis that eliminates uncertainty at the decision-making layer; the visual fingerprint is generated by an R(2+1)D network, and the audio fingerprint combines audio features extracted with masked audio spectral keypoints (MASK) and a convolutional recurrent neural network (CRNN); (2) the shot-level fingerprints are assembled into the data-level fingerprint of the short video by constructing a data-shot-key frame relationship model; (3) short video fingerprints are matched by building a multi-index hash of the data-level fingerprint and measuring the weighted Hamming distance. Five experiments on the CC_Web_Video and Moments_in_Time_Raw_v2 datasets show that the method effectively improves the overall performance of short video fingerprinting.
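Stage (3) can be illustrated with a minimal sketch of multi-index hashing followed by weighted Hamming ranking. This is not the implementation from the paper: the class name `MultiIndexHash`, the helper names, the use of bit strings, the exact-substring candidate rule, and the example weights are all assumptions chosen for illustration. The sketch relies on the pigeonhole property that two codes differing in fewer than `m` bits must agree exactly on at least one of `m` disjoint substrings, and it assumes the code length is divisible by `m`.

```python
from collections import defaultdict


def split_code(bits, m):
    """Split a bit string such as '10110010' into m equal-length substrings."""
    step = len(bits) // m  # assumes len(bits) is divisible by m
    return [bits[i * step:(i + 1) * step] for i in range(m)]


def weighted_hamming(a, b, weights):
    """Sum the weights of the positions where the two bit strings differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


class MultiIndexHash:
    """Hypothetical index: one hash table per substring of the fingerprint."""

    def __init__(self, m):
        self.m = m
        self.tables = [defaultdict(set) for _ in range(m)]
        self.codes = {}

    def add(self, video_id, bits):
        """Store the full code and register each substring in its own table."""
        self.codes[video_id] = bits
        for table, chunk in zip(self.tables, split_code(bits, self.m)):
            table[chunk].add(video_id)

    def query(self, bits, weights, top_k=1):
        """Collect videos sharing at least one substring, then rank them."""
        candidates = set()
        for table, chunk in zip(self.tables, split_code(bits, self.m)):
            candidates |= table.get(chunk, set())
        ranked = sorted(
            candidates,
            key=lambda vid: (weighted_hamming(bits, self.codes[vid], weights), vid),
        )
        return ranked[:top_k]


# Usage with made-up 8-bit fingerprints and uniform weights.
index = MultiIndexHash(m=2)
index.add("a", "10110010")
index.add("b", "10111111")
index.add("c", "01001101")
matches = index.query("10110011", weights=[1] * 8, top_k=2)
```

Only candidates that share a substring with the query are scored, which is what keeps the search sublinear in practice; non-uniform weights let more reliable bits count for more in the final ranking.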
Data availability statement
Data is openly available in a public repository that issues datasets. The datasets generated during and/or analyzed during the current study are available in the CC_Web_Video repository at http://vireo.cs.cityu.edu.hk/webvideo/ and the Moments_in_Time_Raw_v2 repository at http://moments.csail.mit.edu/.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grants 61971016 and 61531006 and in part by Beijing Municipal Education Commission Cooperation Beijing Natural Science Foundation under Grant KZ201910005007.
Author information
Contributions
All authors made significant contributions to the work. SZ, JZ, and YW wrote the main manuscript text and prepared the figures. SZ, JZ, YW, and LZ conceived the work and devised the algorithm. SZ and YW performed the formal analysis and conducted the experiments. SZ and JZ checked the experiments and revised the paper. JZ and LZ provided instrumentation and computing resources for this study. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
About this article
Cite this article
Zhang, S., Zhang, J., Wang, Y. et al. Short video fingerprint extraction: from audio–visual fingerprint fusion to multi-index hashing. Multimedia Systems 29, 981–1000 (2023). https://doi.org/10.1007/s00530-022-01031-4
DOI: https://doi.org/10.1007/s00530-022-01031-4