Elsevier

Digital Signal Processing

Volume 87, April 2019, Pages 75-85

Frame-synchronized blind speech watermarking via improved adaptive mean modulation and perceptual-based additive modulation in DWT domain

https://doi.org/10.1016/j.dsp.2019.01.006

Highlights

  • A DWT-based algorithm is introduced to perform efficient blind speech watermarking.

  • Schemes for embedding synchronization codes and information bits are different.

  • Perceptual additive modulation yields imperceptible embedding and reliable detection.

  • Improved adaptive mean modulation secures an accurate retrieval of watermark bits.

  • The proposed algorithm outperforms three others with capacity set at 200 bps.

Abstract

This paper presents a blind speech watermarking algorithm that adopts different strategies to embed synchronization codes and information bits into separate DWT subbands. Speech frames for data hiding are chosen based on intensity thresholding. Guided by the source-filter theory, a bipolar synchronization code sequence is substituted for the noisy part of the inverse-filtered excitation in the 2nd level detail subband. This design allows the embedding strength to be increased without adversely disturbing the pitch harmonics of voiced speech. Experimental results show that a synchronization code of size 640 is sufficient for reliable detection. The PESQ metrics indicate that the quality degradation due to synchronization code embedding is almost negligible. As for the binary embedding in the 2nd level approximation subband, we improve the adaptive mean modulation scheme to secure the retrieval of watermark bits on a frame basis. Experimental results confirm that, with the payload capacity set at 200 bits per second (bps), the proposed scheme demonstrates better robustness than three other DWT-based methods in the presence of commonly encountered signal-processing attacks. Furthermore, reducing the capacity to 100 bps alters the low-frequency spectral distribution, which yields noteworthy improvements in both robustness and imperceptibility.

Introduction

With the rapid development of network and information technologies, people can now reproduce and disseminate digital multimedia data throughout the world far more easily than ever before. However, illegal use of multimedia data is also rampant, and copyright infringement has gradually become a serious issue to be solved [22]. Digital watermarking is a promising technique widely used for purposes including content authentication, ownership verification, covert communication, broadcast monitoring, and fingerprinting. It has become a hot topic in the field of communication and information security in recent years [6], [12], [22], [32].

Watermarking technology generally takes four factors into consideration: imperceptibility, security, robustness, and capacity [6], [12]. An ideal watermarking algorithm is expected to hide a sufficient amount of information in the host signal in an imperceptible manner while still being capable of resisting malicious attacks. There are several ways to classify watermarking techniques. One such classification is “robust” versus “fragile”. Robust watermarks are strongly resistant to attacks, whereas fragile watermarks are supposed to crumble under any attempt at tampering. Furthermore, depending on whether the original host signal is needed in the extraction process, watermarking schemes can be categorized as blind, semi-blind, and non-blind. Blind watermarking is designed to recover an embedded watermark without the participation of the original signal, while the non-blind approach can only be carried out using the original source. Semi-blind watermarking covers situations where information other than the source itself is needed for watermark extraction.

Over the past two decades, numerous watermarking methods have been developed for multimedia data such as images, audio, and video. Nonetheless, far less attention has been paid to the watermarking of speech signals. Speech is a specific type of audio signal; therefore, the techniques developed for audio watermarking are potentially applicable to speech watermarking. However, speech differs from typical audio signals in aspects including temporal continuity, spectral intensity distribution, production modeling, and processing scenarios [24], [30], [33]. Consequently, techniques developed for audio watermarking may not be suitable for speech watermarking [32].

Among the watermarking methods dedicated to speech signals, the one developed by Hofbauer et al. [14] exploited the fact that human ears are insensitive to the phase of non-voiced speech. Their method focused on replacing the excitation signal of an autoregressive representation in non-voiced segments. On the other hand, Coumou and Sharma [5] embedded data via pitch modification in voiced segments. To avoid insertion, deletion, and substitution errors in the estimates of embedded data, they also resorted to a concatenated coding scheme to safeguard synchronization and error recovery. In another watermarking scheme, developed on the basis of the code-excited linear prediction (CELP) codec, Chen and Liu [3] modified the position indices of selected excitation pulses.

The exploitation of the spectral envelope of the speech signal has also been attempted. Several methods emerged from the linear prediction (LP) analysis of speech. In [30], Faundez-Zanuy et al. hid the watermark information below the formant peaks of the LP spectrum. Chen and Zhu [2] achieved robust watermarking by inserting watermark bits inside codebook indices, while applying multistage vector quantization (MSVQ) to the derived LP coefficients. Yan and Guo [43] converted the LP coefficients to inverse sine (IS) parameters. Watermark embedding was achieved by manipulating the IS parameters using odd–even modulation [23].

The implementation of speech watermarking can certainly draw on existing audio watermarking methods. To take advantage of signal characteristics and/or auditory properties [22], many researchers conduct the watermarking process in transform domains such as the discrete Fourier transform (DFT) [7], [27], [31], [36], discrete cosine transform (DCT) [16], [25], [26], [39], [44], discrete wavelet transform (DWT) [1], [4], [18], [38], [39], [40], [41], and cepstrum [15], [28], [29]. Among these, DWT is currently the most popular owing to its perfect reconstruction and good multi-resolution properties. The effectiveness of the DWT-based approach in audio watermarking suggests that the same technique may work for speech watermarking as well, provided that speech characteristics are adequately taken into account.
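To make the perfect-reconstruction property concrete, the sketch below performs a two-level DWT analysis and synthesis in plain NumPy. The Haar wavelet is used purely for brevity; the paper does not prescribe a particular wavelet, so treat the choice as an illustrative assumption.

```python
import numpy as np

def haar_analysis(x):
    """One DWT level with the Haar wavelet: split into approximation/detail."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # low-pass: approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # high-pass: detail coefficients
    return a, d

def haar_synthesis(a, d):
    """Invert one Haar level; analysis followed by synthesis is lossless."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

# Two-level decomposition: keep cD1, then split cA1 into cA2 and cD2.
rng = np.random.default_rng(0)
speech = rng.standard_normal(1024)     # stand-in for a host speech signal
cA1, cD1 = haar_analysis(speech)
cA2, cD2 = haar_analysis(cA1)          # the 2nd level subbands used for embedding

# Perfect reconstruction holds to machine precision.
restored = haar_synthesis(haar_synthesis(cA2, cD2), cD1)
```

Any coefficient modified in cA2 or cD2 is carried back into the time-domain signal by the same synthesis chain, which is what makes the subbands suitable embedding carriers.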

Lei et al. [24] proposed a sophisticated wavelet-based scheme especially for breath sounds. Their scheme required the use of lifting wavelet transform (LWT), discrete cosine transform (DCT), singular value decomposition (SVD) along with the use of particle swarm optimization (PSO) technique to optimize the quantization steps for dither modulation (DM). Nematollahi et al. [33] embedded watermark bits by quantizing the eigenvalue derived from the singular value decomposition of the approximation coefficients in the DWT domain. Hu et al. [19] took advantage of the multi-resolution analysis properties of DWT, which eventually led to the development of a synchronous package scheme. In their design, watermark bits and synchronization codes were embedded inside selected frames in low-frequency DWT subbands using a watermarking scheme referred to as adaptive mean modulation (AMM).

In this study, to enhance robustness and processing efficiency, we follow the DWT-based framework developed in [19] and introduce two efficient schemes for embedding synchronization codes and binary information in tandem. The remainder of this paper is organized as follows. Section 2 discusses the watermarking framework, whereby information bits and synchronization codes are embedded in selected DWT subbands; the implementation of frame synchronization is also addressed there. Section 3 describes how a spectrally shaped filter is applied to achieve perceptual-based additive modulation for synchronization code embedding. Section 4 presents an improved AMM scheme for robust watermarking in the 2nd level approximation subband. Section 5 gives experimental results regarding the quality of the watermarked speech and the watermark robustness against commonly encountered attacks. Conclusions are drawn in Section 6.

Section snippets

Packaged watermark information

In [19], Hu et al. demonstrated how to perform watermarking using packaged frame synchronization. The required procedure is as follows. A speech signal is first partitioned into frames of size l_f. The frames suitable for watermarking are then identified according to the intensity levels of the selected DWT subbands. The frame selection can be expressed as

    Λ(k) = 1 (“embeddable”),     if σ_a(k) ≥ ψ_a and σ_d(k) ≥ ψ_d;
    Λ(k) = 0 (“non-embeddable”), otherwise,

with

    σ_a(k) = (1/l_f) Σ_{i=0}^{l_f−1} (c_a^{(2)}(q_k + i))²;  ψ_a = 0.035 · max_k {σ_a(k)};  σ_d(k) = …
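The frame-selection rule can be sketched as follows. This is a minimal illustration only: the snippet above is truncated before σ_d(k) and ψ_d are fully defined, so the detail-subband statistics are assumed here to mirror the approximation-subband ones (same mean-square intensity and the same 0.035 threshold factor).

```python
import numpy as np

def select_frames(ca2, cd2, lf, rho=0.035):
    """Mark frame k as embeddable when both 2nd-level subbands carry enough
    intensity.  ca2/cd2: approximation/detail coefficients, lf: frame size,
    rho: threshold factor (0.035 per the paper; applying it to the detail
    subband as well is an assumption)."""
    ca2, cd2 = np.asarray(ca2, float), np.asarray(cd2, float)
    n_frames = min(len(ca2), len(cd2)) // lf
    sigma_a = np.array([np.mean(ca2[k*lf:(k+1)*lf] ** 2) for k in range(n_frames)])
    sigma_d = np.array([np.mean(cd2[k*lf:(k+1)*lf] ** 2) for k in range(n_frames)])
    psi_a, psi_d = rho * sigma_a.max(), rho * sigma_d.max()
    return (sigma_a >= psi_a) & (sigma_d >= psi_d)   # the mask Λ(k)
```

Silent or near-silent frames fall below both thresholds and are skipped, so the watermark only lands where the host signal can mask it.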

Synchronization code embedding via spectrally shaped additive modulation

Hu et al. [19] introduced a method referred to as adaptive mean modulation (AMM) to embed synchronization codes into the 2nd level detail subband. Although AMM performs well in the detection of synchronization codes, the determination of each bit requires gathering multiple coefficients in advance. The matched filter responsible for the synchronization code detection must be applied to every possible combination of the involved coefficients. This imposes a computational burden for seeking the
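A matched filter for a bipolar code amounts to a sliding normalized cross-correlation. The sketch below illustrates the detection step in its simplest form, an O(N·L) loop over candidate offsets; the names and normalization details are illustrative, not the paper's exact formulation.

```python
import numpy as np

def detect_sync(signal, code):
    """Matched-filter search for a bipolar sync code: slide a window over the
    signal and score each offset by normalized cross-correlation."""
    n = len(code)
    c = (code - np.mean(code)) / (np.std(code) + 1e-12)
    best_offset, best_score = -1, -np.inf
    for offset in range(len(signal) - n + 1):
        seg = signal[offset:offset + n]
        s = (seg - np.mean(seg)) / (np.std(seg) + 1e-12)
        score = float(s @ c) / n          # correlation coefficient in [-1, 1]
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score
```

The quadratic cost of evaluating every offset is precisely the computational burden noted above; it grows further if multiple coefficient combinations must be tested per offset.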

Binary embedding via improved adaptive mean modulation

After embedding the synchronization codes, we then shift the focus to hiding binary bits in the 2nd level approximation coefficients. The AMM in [19] adopted a first-order recursive low-pass filter to obtain a local energy estimate, which was later used to derive an adaptive quantization step for quantization index modulation (QIM). The performance of AMM thus depends on the accuracy of the energy estimate. Any excessive perturbation of the local energy may end up with failure in watermark
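As background for the discussion above, the following sketch shows plain mean-based QIM with a fixed quantization step Δ; the paper's improved AMM additionally adapts Δ to a local energy estimate, which is not reproduced here.

```python
import numpy as np

def qim_embed(mean_val, bit, delta):
    """Quantization index modulation: snap the frame mean onto a lattice of
    step delta, shifted by +delta/4 for bit 1 and -delta/4 for bit 0."""
    offset = delta / 4.0 if bit else -delta / 4.0
    return np.round((mean_val - offset) / delta) * delta + offset

def qim_extract(mean_val, delta):
    """Decode by checking which of the two lattices the received mean is
    closer to; perturbations below delta/4 leave the decision intact."""
    d1 = abs(mean_val - qim_embed(mean_val, 1, delta))
    d0 = abs(mean_val - qim_embed(mean_val, 0, delta))
    return 1 if d1 < d0 else 0
```

This makes the trade-off explicit: a larger Δ widens the decision margin (robustness) but moves the coefficients further from their original values (distortion), which is why adapting Δ to local energy matters.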

Performance evaluation

The test materials consisted of 192 sentences uttered by 24 speakers (16 males and 8 females) drawn from the core set of the TIMIT database [9]. Speech files were recorded at 16 kHz with 16-bit resolution. For the convenience of computer simulation, utterances from the same dialect region were concatenated to form a long file; a total of eight speech files was thus tested in this study. The watermark bits for the test were a series of alternating 1s and 0s of sufficient length to cover the entire host
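Robustness in such experiments is typically scored by the bit error rate between the embedded and extracted payloads; a trivial helper (illustrative, not taken from the paper) is:

```python
import numpy as np

def bit_error_rate(sent, received):
    """Robustness metric: fraction of watermark bits flipped by an attack."""
    sent, received = np.asarray(sent), np.asarray(received)
    return float(np.mean(sent != received))

# The alternating test payload described above.
payload = np.tile([1, 0], 800)
```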

Conclusions

In this study, we have exploited the DWT properties to achieve efficient blind speech watermarking. After taking a two-level DWT of the host speech, the 2nd level approximation and detail coefficients with sufficient intensity were chosen for embedding information bits and synchronization codes, respectively. Two different schemes were developed to carry out the watermarking process. The synchronization code aiming at frame alignment was inserted at the leading position of each embeddable

Conflict of interest statement

The authors declare that they have no conflict of interest regarding this work.

Acknowledgements

This research work was supported by the Ministry of Science and Technology, Taiwan, ROC under grant MOST 106-2221-E-197-025.

Hwai-Tsu Hu received his B.S. degree from National Cheng Kung University, Taiwan, in 1985, and both M.S. and Ph.D. degrees from the University of Florida, USA, in 1990 and 1993, respectively, all in Electrical Engineering. Since 1998, he has been a Professor in the Department of Electronic Engineering at National I-Lan University, Taiwan. His research interests include speech, audio and image signal processing.

References (44)

  • B.Y. Lei et al., Blind and robust audio watermarking scheme based on SVD–DCT, Signal Process. (2011)
  • D. Megías et al., Efficient self-synchronised blind audio watermarking system based on time domain and FFT amplitude modification, Signal Process. (2010)
  • M.A. Nematollahi et al., Blind digital speech watermarking based on eigen-value quantization in DWT, J. King Saud Univ., Comput. Inf. Sci. (2015)
  • R. Tachibana et al., An audio watermarking method using a two-dimensional pseudo-random array, Signal Process. (2002)
  • X.-Y. Wang et al., A robust digital audio watermarking based on statistics characteristics, Pattern Recognit. (2009)
  • X. Wang et al., A norm-space, adaptive, and blind audio watermarking algorithm by discrete wavelet transform, Signal Process. (2013)
  • A. Al-Haj, An imperceptible and robust audio watermarking algorithm, EURASIP J. Audio Speech Music Process. (2014)
  • O.T.C. Chen et al., Content-dependent watermarking scheme in compressed speech with identifying manner and location of attacks, IEEE Trans. Audio Speech Lang. Process. (2007)
  • D.J. Coumou et al., Insertion, deletion codes with feature-based embedding: a new paradigm for watermark synchronization with applications to speech watermarking, IEEE Trans. Inf. Forensics Secur. (2008)
  • N. Cvejic et al., Digital Audio Watermarking Techniques and Technologies: Applications and Benchmarks (2008)
  • G. Fant, Acoustic Theory of Speech Production: With Calculations Based on X-Ray Studies of Russian Articulations (1970)
  • W. Fisher et al., The DARPA speech recognition research database: specifications and status



Tung-Tsun Lee received his B.S. and M.S. degrees in Computer Science and Engineering from National Chiao Tung University, Taiwan, in 1983 and 1985, respectively. Since 1992, he has been a lecturer in the Department of Electronic Engineering at National I-Lan University, Taiwan. His research interests include software engineering and computer networks.
