Abstract
In this paper, we propose a retrieval algorithm for encrypted speech based on a convolutional neural network (CNN) and deep hashing. It addresses the feature extraction shortcomings of existing content-based encrypted speech retrieval methods and the low retrieval accuracy caused by the high dimensionality and temporal nature of audio data. First, the original speech is encrypted with a three-dimensional chaotic encryption algorithm and uploaded to the encrypted speech library in the cloud. Because a CNN can capture the basic semantic structure of speech data well, we use a CNN as a feature extractor to extract deep features from the Log-Mel spectrogram/MFCC. Batch normalization is introduced in the training process, which speeds up network fitting, reduces training time, and improves the retrieval efficiency of the system. Second, the deep features extracted by the CNN are combined with a hash function to construct the system's hashing index table. Finally, retrieval is performed with the normalized Hamming distance algorithm. The experimental results show that the proposed algorithm offers better discrimination and robustness to amplitude change than existing methods. The proposed algorithm also achieves high recall, precision, and retrieval efficiency after various content-preserving operations.
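The last two steps of the pipeline (building a hash index from deep features, then matching by normalized Hamming distance) can be illustrated with a minimal sketch. The abstract does not specify the hash function or the matching threshold, so the fixed-threshold binarization, the `threshold` and `max_distance` values, and all function names below are assumptions made for illustration only, not the authors' implementation. The feature vectors are toy values standing in for CNN outputs.

```python
import numpy as np


def binarize(features, threshold=0.5):
    """Turn real-valued features into a 0/1 hash code.

    Assumption: features lie in [0, 1] (e.g. sigmoid outputs) and a fixed
    threshold is used; the paper's actual hash function may differ.
    """
    return (np.asarray(features, dtype=float) >= threshold).astype(np.uint8)


def normalized_hamming(h1, h2):
    """Fraction of bit positions at which two equal-length codes differ."""
    h1 = np.asarray(h1)
    h2 = np.asarray(h2)
    if h1.shape != h2.shape:
        raise ValueError("hash codes must have the same length")
    return float(np.count_nonzero(h1 != h2)) / h1.size


def retrieve(query_hash, index_table, max_distance=0.25):
    """Return (key, distance) pairs within max_distance, closest first.

    max_distance is a hypothetical matching threshold.
    """
    hits = []
    for key, code in index_table.items():
        distance = normalized_hamming(query_hash, code)
        if distance <= max_distance:
            hits.append((key, distance))
    return sorted(hits, key=lambda item: item[1])


# Toy index table: keys stand in for entries of the encrypted speech library.
index_table = {
    "clip_a": binarize([0.9, 0.1, 0.8, 0.2]),
    "clip_b": binarize([0.1, 0.9, 0.2, 0.7]),
}

# A query whose features resemble clip_a after a mild perturbation.
query = binarize([0.8, 0.3, 0.6, 0.4])
matches = retrieve(query, index_table)
```

In this toy case the query code equals the code stored for `clip_a` and differs from `clip_b` in every bit, so only `clip_a` falls within the threshold. A linear scan is used for clarity; a large library would more likely pack the bits and compare them with XOR and a population count.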
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 61862041, 61363078). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.
About this article
Cite this article
Zhang, Qy., Li, Yz. & Hu, Yj. A retrieval algorithm for encrypted speech based on convolutional neural network and deep hashing. Multimed Tools Appl 80, 1201–1221 (2021). https://doi.org/10.1007/s11042-020-09748-y
DOI: https://doi.org/10.1007/s11042-020-09748-y