Elsevier

Digital Signal Processing

Volume 87, April 2019, Pages 75-85

Frame-synchronized blind speech watermarking via improved adaptive mean modulation and perceptual-based additive modulation in DWT domain

https://doi.org/10.1016/j.dsp.2019.01.006

Highlights

  • A DWT-based algorithm is introduced to perform efficient blind speech watermarking.

  • Schemes for embedding synchronization codes and information bits are different.

  • Perceptual additive modulation yields imperceptible embedding and reliable detection.

  • Improved adaptive mean modulation secures an accurate retrieval of watermark bits.

  • The proposed algorithm outperforms three others with capacity set at 200 bps.

Abstract

This paper presents a blind speech watermarking algorithm that adopts different strategies to embed synchronization codes and information bits into separate DWT subbands. Speech frames for data hiding are chosen based on intensity thresholding. Guided by the source-filter theory, a bipolar synchronization code sequence is substituted for the noisy part of the inverse-filtered excitation in the 2nd level detail subband. This design allows the embedding strength to be increased without adversely disturbing the pitch harmonics of voiced speech. Experimental results show that a synchronization code of size 640 is sufficient for reliable detection. The PESQ metrics indicate that the quality degradation due to synchronization code embedding is almost negligible. As for the binary embedding in the 2nd level approximation subband, we improve the adaptive mean modulation scheme to secure the retrieval of watermark bits on a frame basis. Experimental results confirm that, with the payload capacity set at 200 bits per second (bps), the proposed scheme demonstrates better robustness than three other DWT-based methods in the presence of commonly encountered signal-processing attacks. Furthermore, reducing the capacity to 100 bps alters the low-frequency spectral distribution, which yields noteworthy improvements in both robustness and imperceptibility.

Introduction

With the rapid development of network and information technologies, people can now reproduce and disseminate digital multimedia data throughout the world far more easily than ever before. However, illegal use of multimedia data is also rampant, and copyright infringement has gradually become a serious issue to be solved [22]. Digital watermarking is a promising technique widely used for purposes including content authentication, ownership verification, covert communication, broadcast monitoring, and fingerprinting. It has become a hot topic in the field of communication and information security in recent years [6], [12], [22], [32].

Watermarking technology generally takes four factors into consideration: imperceptibility, security, robustness, and capacity [6], [12]. An ideal watermarking algorithm is expected to hide a sufficient amount of information in the host signal in an imperceptible manner while still being capable of resisting malicious attacks. There are several ways to classify watermarking techniques. One such classification is “robust” versus “fragile”. Robust watermarks are strongly resistant to attacks, whereas fragile watermarks are supposed to crumble under any attempt at tampering. Furthermore, depending on whether the original host signal is needed in the extraction process, watermarking schemes can be categorized as blind, semi-blind, and non-blind. Blind watermarking is designed to recover an embedded watermark without the participation of the original signal, while the non-blind approach can only be carried out using the original source. Semi-blind watermarking covers situations where information other than the source itself is needed for watermark extraction.

Over the past two decades, numerous watermarking methods have been developed for multimedia data such as images, audio, and video. Nonetheless, far less attention has been paid to the watermarking of speech signals. Speech is a specific type of audio signal; therefore, the techniques developed for audio watermarking are potentially applicable to speech watermarking. However, speech differs from typical audio signals in aspects including temporal continuity, spectral intensity distribution, production modeling, and processing scenarios [24], [30], [33]. Consequently, techniques developed for audio watermarking may not be suitable for speech watermarking [32].

Among the watermarking methods dedicated to speech signals, the one developed by Hofbauer et al. [14] exploited the fact that human ears are insensitive to the phase of non-voiced speech. Their method focused on replacing the excitation signal of an autoregressive representation in non-voiced segments. On the other hand, Coumou and Sharma [5] embedded data via pitch modification in voiced segments. To avoid insertion, deletion, and substitution errors in the estimates of embedded data, they also resorted to a concatenated coding scheme to safeguard synchronization and error recovery. In another watermarking scheme, developed on the basis of the code-excited linear prediction (CELP) codec, Chen and Liu [3] modified the position indices of selected excitation pulses.

The exploitation of the spectral envelope of the speech signal has also been attempted. Several methods emerged from the linear prediction (LP) analysis of speech. In [30], Faundez-Zanuy et al. hid the watermark information below the formant peaks of the LP spectrum. Chen and Zhu [2] achieved robust watermarking by inserting watermark bits inside codebook indices, while applying multistage vector quantization (MSVQ) to the derived LP coefficients. Yan and Guo [43] converted the LP coefficients to inverse sine (IS) parameters. Watermark embedding was achieved by manipulating the IS parameters using odd–even modulation [23].

The implementation of speech watermarking can certainly draw on existing audio watermarking methods. To take advantage of signal characteristics and/or auditory properties [22], many researchers conduct the watermarking process in transform domains such as the discrete Fourier transform (DFT) [7], [27], [31], [36], discrete cosine transform (DCT) [16], [25], [26], [39], [44], discrete wavelet transform (DWT) [1], [4], [18], [38], [39], [40], [41], and cepstrum [15], [28], [29]. Among these, DWT is currently the most popular owing to its perfect reconstruction and good multi-resolution properties. The effectiveness of the DWT-based approach in audio watermarking suggests that the same technique may work for speech watermarking as well, provided that speech characteristics are adequately taken into account.
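To make the perfect-reconstruction property concrete, the sketch below performs a two-level DWT analysis and synthesis in plain NumPy. The Haar wavelet is used purely for brevity; the paper does not prescribe a particular wavelet, so treat the choice as an illustrative assumption.

```python
import numpy as np

def haar_analysis(x):
    """One DWT level with the Haar wavelet: split into approximation/detail."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # low-pass: approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # high-pass: detail coefficients
    return a, d

def haar_synthesis(a, d):
    """Invert one Haar level; analysis followed by synthesis is lossless."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

# Two-level decomposition: keep cD1, then split cA1 into cA2 and cD2.
rng = np.random.default_rng(0)
speech = rng.standard_normal(1024)     # stand-in for a host speech signal
cA1, cD1 = haar_analysis(speech)
cA2, cD2 = haar_analysis(cA1)          # the 2nd level subbands used for embedding

# Perfect reconstruction holds to machine precision.
restored = haar_synthesis(haar_synthesis(cA2, cD2), cD1)
```

Any coefficient modified in cA2 or cD2 is carried back into the time-domain signal by the same synthesis chain, which is what makes the subbands suitable embedding carriers.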

Lei et al. [24] proposed a sophisticated wavelet-based scheme especially for breath sounds. Their scheme required the use of lifting wavelet transform (LWT), discrete cosine transform (DCT), singular value decomposition (SVD) along with the use of particle swarm optimization (PSO) technique to optimize the quantization steps for dither modulation (DM). Nematollahi et al. [33] embedded watermark bits by quantizing the eigenvalue derived from the singular value decomposition of the approximation coefficients in the DWT domain. Hu et al. [19] took advantage of the multi-resolution analysis properties of DWT, which eventually led to the development of a synchronous package scheme. In their design, watermark bits and synchronization codes were embedded inside selected frames in low-frequency DWT subbands using a watermarking scheme referred to as adaptive mean modulation (AMM).

In this study, to enhance robustness and processing efficiency, we follow the DWT-based framework developed in [19] and introduce two efficient schemes for embedding synchronization codes and binary information in tandem. The remainder of this paper is organized as follows. Section 2 discusses the watermarking framework, whereby information bits and synchronization codes are embedded in selected DWT subbands; the implementation of frame synchronization is also addressed there. Section 3 describes how a spectrally shaped filter is applied to achieve perceptual-based additive modulation for synchronization code embedding. Section 4 presents an improved AMM scheme for robust watermarking in the 2nd level approximation subband. Section 5 gives experimental results regarding the quality of the watermarked speech and the watermark robustness against commonly encountered attacks. Conclusions are drawn in Section 6.

Section snippets

Packaged watermark information

In [19], Hu et al. demonstrated how to perform watermarking using packaged frame synchronization. The required procedure is as follows. A speech signal is first partitioned into frames of size l_f. The frames suitable for watermarking are then identified according to the intensity levels of the selected DWT subbands. The frame selection can be expressed as

    Λ(k) = 1 (“embeddable”),     if σ_a(k) ≥ ψ_a and σ_d(k) ≥ ψ_d;
    Λ(k) = 0 (“non-embeddable”), otherwise,

with

    σ_a(k) = (1/l_f) Σ_{i=0}^{l_f−1} (c_a^{(2)}(q_k + i))²;  ψ_a = 0.035 · max_k {σ_a(k)};  σ_d(k) = …
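The frame-selection rule can be sketched as follows. This is a minimal illustration only: the snippet above is truncated before σ_d(k) and ψ_d are fully defined, so the detail-subband statistics are assumed here to mirror the approximation-subband ones (same mean-square intensity and the same 0.035 threshold factor).

```python
import numpy as np

def select_frames(ca2, cd2, lf, rho=0.035):
    """Mark frame k as embeddable when both 2nd-level subbands carry enough
    intensity.  ca2/cd2: approximation/detail coefficients, lf: frame size,
    rho: threshold factor (0.035 per the paper; applying it to the detail
    subband as well is an assumption)."""
    ca2, cd2 = np.asarray(ca2, float), np.asarray(cd2, float)
    n_frames = min(len(ca2), len(cd2)) // lf
    sigma_a = np.array([np.mean(ca2[k*lf:(k+1)*lf] ** 2) for k in range(n_frames)])
    sigma_d = np.array([np.mean(cd2[k*lf:(k+1)*lf] ** 2) for k in range(n_frames)])
    psi_a, psi_d = rho * sigma_a.max(), rho * sigma_d.max()
    return (sigma_a >= psi_a) & (sigma_d >= psi_d)   # the mask Λ(k)
```

Silent or near-silent frames fall below both thresholds and are skipped, so the watermark only lands where the host signal can mask it.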

Synchronization code embedding via spectrally shaped additive modulation

Hu et al. [19] introduced a method referred to as adaptive mean modulation (AMM) to embed synchronization codes into the 2nd level detail subband. Although AMM performs well in the detection of synchronization codes, the determination of each bit requires gathering multiple coefficients in advance. The matched filter responsible for the synchronization code detection must be applied to every possible combination of the involved coefficients. This imposes a computational burden for seeking the
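A matched filter for a bipolar code amounts to a sliding normalized cross-correlation. The sketch below illustrates the detection step in its simplest form, an O(N·L) loop over candidate offsets; the names and normalization details are illustrative, not the paper's exact formulation.

```python
import numpy as np

def detect_sync(signal, code):
    """Matched-filter search for a bipolar sync code: slide a window over the
    signal and score each offset by normalized cross-correlation."""
    n = len(code)
    c = (code - np.mean(code)) / (np.std(code) + 1e-12)
    best_offset, best_score = -1, -np.inf
    for offset in range(len(signal) - n + 1):
        seg = signal[offset:offset + n]
        s = (seg - np.mean(seg)) / (np.std(seg) + 1e-12)
        score = float(s @ c) / n          # correlation coefficient in [-1, 1]
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score
```

The quadratic cost of evaluating every offset is precisely the computational burden noted above; it grows further if multiple coefficient combinations must be tested per offset.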

Binary embedding via improved adaptive mean modulation

After embedding the synchronization codes, we then shift the focus to hiding binary bits in the 2nd level approximation coefficients. The AMM in [19] adopted a first-order recursive low-pass filter to obtain a local energy estimate, which was later used to derive an adaptive quantization step for quantization index modulation (QIM). The performance of AMM thus depends on the accuracy of the energy estimate. Any excessive perturbation of the local energy may end up with failure in watermark
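As background for the discussion above, the following sketch shows plain mean-based QIM with a fixed quantization step Δ; the paper's improved AMM additionally adapts Δ to a local energy estimate, which is not reproduced here.

```python
import numpy as np

def qim_embed(mean_val, bit, delta):
    """Quantization index modulation: snap the frame mean onto a lattice of
    step delta, shifted by +delta/4 for bit 1 and -delta/4 for bit 0."""
    offset = delta / 4.0 if bit else -delta / 4.0
    return np.round((mean_val - offset) / delta) * delta + offset

def qim_extract(mean_val, delta):
    """Decode by checking which of the two lattices the received mean is
    closer to; perturbations below delta/4 leave the decision intact."""
    d1 = abs(mean_val - qim_embed(mean_val, 1, delta))
    d0 = abs(mean_val - qim_embed(mean_val, 0, delta))
    return 1 if d1 < d0 else 0
```

This makes the trade-off explicit: a larger Δ widens the decision margin (robustness) but moves the coefficients further from their original values (distortion), which is why adapting Δ to local energy matters.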

Performance evaluation

The test materials consisted of 192 sentences uttered by 24 speakers (16 males and 8 females) drawn from the core set of the TIMIT database [9]. Speech files were recorded at 16 kHz with 16-bit resolution. For the convenience of computer simulation, utterances from the same dialect region were concatenated to form a long file; a total of eight speech files was thus tested in this study. The watermark bits for the test were a series of alternating 1s and 0s of sufficient length to cover the entire host
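Robustness in such experiments is typically scored by the bit error rate between the embedded and extracted payloads; a trivial helper (illustrative, not taken from the paper) is:

```python
import numpy as np

def bit_error_rate(sent, received):
    """Robustness metric: fraction of watermark bits flipped by an attack."""
    sent, received = np.asarray(sent), np.asarray(received)
    return float(np.mean(sent != received))

# The alternating test payload described above.
payload = np.tile([1, 0], 800)
```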

Conclusions

In this study, we have exploited the DWT properties to achieve efficient blind speech watermarking. After taking a two-level DWT of the host speech, the 2nd level approximation and detail coefficients with sufficient intensity were chosen for embedding information bits and synchronization codes, respectively. Two different schemes were developed to carry out the watermarking process. The synchronization code aiming at frame alignment was inserted at the leading position of each embeddable

Conflict of interest statement

The authors declare that they have no conflict of interest regarding this work.

Acknowledgements

This research work was supported by the Ministry of Science and Technology, Taiwan, ROC under grant MOST 106-2221-E-197-025.

Hwai-Tsu Hu received his B.S. degree from National Cheng Kung University, Taiwan, in 1985, and both M.S. and Ph.D. degrees from the University of Florida, USA, in 1990 and 1993, respectively, all in Electrical Engineering. Since 1998, he has been a Professor in the Department of Electronic Engineering at National I-Lan University, Taiwan. His research interests include speech, audio and image signal processing.

References (44)

  • B.Y. Lei et al., Blind and robust audio watermarking scheme based on SVD–DCT, Signal Process. (2011)
  • D. Megías et al., Efficient self-synchronised blind audio watermarking system based on time domain and FFT amplitude modification, Signal Process. (2010)
  • M.A. Nematollahi et al., Blind digital speech watermarking based on eigen-value quantization in DWT, J. King Saud Univ., Comput. Inf. Sci. (2015)
  • R. Tachibana et al., An audio watermarking method using a two-dimensional pseudo-random array, Signal Process. (2002)
  • X.-Y. Wang et al., A robust digital audio watermarking based on statistics characteristics, Pattern Recognit. (2009)
  • X. Wang et al., A norm-space, adaptive, and blind audio watermarking algorithm by discrete wavelet transform, Signal Process. (2013)
  • A. Al-Haj, An imperceptible and robust audio watermarking algorithm, EURASIP J. Audio Speech Music Process. (2014)
  • O.T.C. Chen et al., Content-dependent watermarking scheme in compressed speech with identifying manner and location of attacks, IEEE Trans. Audio Speech Lang. Process. (2007)
  • D.J. Coumou et al., Insertion, deletion codes with feature-based embedding: a new paradigm for watermark synchronization with applications to speech watermarking, IEEE Trans. Inf. Forensics Secur. (2008)
  • N. Cvejic et al., Digital Audio Watermarking Techniques and Technologies: Applications and Benchmarks (2008)
  • G. Fant, Acoustic Theory of Speech Production: With Calculations Based on X-Ray Studies of Russian Articulations (1970)
  • W. Fisher et al., The DARPA speech recognition research database: specifications and status



Tung-Tsun Lee received his B.S. and M.S. degrees in Computer Science and Engineering from National Chiao Tung University, Taiwan, in 1983 and 1985, respectively. Since 1992, he has been a lecturer in the Department of Electronic Engineering at National I-Lan University, Taiwan. His research interests include software engineering and computer networks.
