
Treating Speech as Personally Identifiable Information and Its Impact in Machine Translation

Chapter in: Towards Responsible Machine Translation

Part of the book series: Machine Translation: Technologies and Applications (MATRA, volume 4)

Abstract

Speech is the most natural and immediate form of communication. It is ubiquitous. The tremendous progress in language technologies that we have witnessed in the past few years has led to the use of speech as an input/output modality in a panoply of applications which, until recently, were mostly reserved for text. Machine translation is one of the technologies that has traditionally dealt with text input and output. However, speech-to-speech translation is no longer a research-only topic, and one can only anticipate its growing use in our multilingual world. Many of these applications run on cloud-based platforms that provide remote access to powerful models, enabling the automation of time-consuming tasks such as document translation or speech transcription, and helping users perform everyday tasks (e.g. voice-based virtual assistants). When a biometric signal such as speech is sent to a remote server for processing, however, this input signal can be used to infer information about the user, including their preferences, personality traits, mood, health, and political opinions, as well as other data such as gender, age range, height, and accent. Information can also be extracted about the recording environment. Although there is growing societal awareness of user data protection (the GDPR in Europe is an example), most users of such remote services are unaware of the amount of information that can be extracted from a handful of their sentences. In fact, most users are unaware of the potential for misuse allowed by this new generation of speech technology systems. For instance, most users do not know how many sentences in their own voice suffice to clone it, nor have they heard about spoofing of speaker recognition systems, nor do they understand to what extent their recordings can be anonymised. Moreover, most users do not realise that adversarial techniques now enable the injection of hidden commands into spoken messages without those commands being audible. The recent progress in speech and language technologies is also reflected in speech-to-speech translation systems, where the traditional cascade of “speech recognition—machine translation—speech synthesis” is being replaced by end-to-end systems that allow the sentences in the target language to sound like the voice of the speaker in the source language, opening up a world of possibilities. All these privacy and security issues are becoming ever more pressing in an era where speech must be legally regarded as PII (Personally Identifiable Information).
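To make the architectural shift mentioned above concrete, the sketch below contrasts the traditional cascade (speech recognition, then machine translation, then speech synthesis) with a direct end-to-end speech-to-speech model. It is a minimal, purely illustrative Python sketch: every name in it (recognize, translate, synthesize, end_to_end_model) is a hypothetical placeholder, not an API or implementation from this chapter.

    # Illustrative sketch (not from the chapter): cascaded "ASR -> MT -> TTS"
    # versus a direct end-to-end speech-to-speech translation model.
    # All functions are hypothetical stand-ins for real components.

    def recognize(audio: bytes, lang: str) -> str:
        # ASR stage: only a transcript survives; voice, mood and other
        # paralinguistic cues present in the signal are discarded here.
        return f"transcript ({lang})"

    def translate(text: str, src: str, tgt: str) -> str:
        # MT stage: text-to-text translation.
        return f"[{src}->{tgt}] {text}"

    def synthesize(text: str, lang: str) -> bytes:
        # TTS stage: renders the translation, typically in a generic synthetic voice.
        return text.encode("utf-8")

    def end_to_end_model(audio: bytes, src: str, tgt: str) -> bytes:
        # Placeholder for a single sequence-to-sequence model that maps
        # source-language speech directly to target-language speech.
        return audio

    def cascaded_s2st(audio: bytes, src: str, tgt: str) -> bytes:
        # Cascade: each stage sees only the previous stage's output, so the
        # speaker's voice cannot be reproduced in the target language.
        return synthesize(translate(recognize(audio, src), src, tgt), tgt)

    def direct_s2st(audio: bytes, src: str, tgt: str) -> bytes:
        # End-to-end: the raw waveform reaches the model, so the output can
        # sound like the original speaker; the same raw signal also exposes
        # the biometric and paralinguistic information discussed above.
        return end_to_end_model(audio, src, tgt)

Either variant may run on a remote server, but only the end-to-end one needs to model, and can therefore reproduce, the speaker's voice in the output, which is one reason the chapter argues that the speech signal itself must be treated as personally identifiable information.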


Notes

  1. http://www.compare.openaudio.eu/.

  2. https://www.robots.ox.ac.uk/~vgg/data/voxceleb/.

  3. https://kaldi-asr.org/.

  4. https://www.finextra.com/newsarticle/37989/hsbcs-voice-id-prevents-249-million-of-attempted-fraud.

  5. A vocoder (short for voice encoder) is a synthesis system which was initially developed to reproduce human speech.

  6. https://www.synsig.org/index.php/Blizzard_Challenge.

  7. https://www.bbc.com/news/technology-49263260.

  8. https://www.theguardian.com/technology/2019/aug/02/apple-halts-practice-of-contractors-listening-in-to-users-on-siri.

  9. https://techcrunch.com/2019/08/02/google-ordered-to-halt-human-review-of-voice-ai-recordings-over-privacy-risks/.

  10. https://www.theguardian.com/technology/2018/may/24/amazon-alexa-recorded-conversation.

  11. https://www.bbc.com/news/business-43044693.

  12. https://techcrunch.com/2019/06/12/laliga-fined-280k-for-soccer-apps-privacy-violating-spy-mode.

  13. https://techcrunch.com/2016/12/27/an-amazon-echo-may-be-the-key-to-solving-a-murder-case/.

  14. https://sdlgbtn.com/news/2016/12/29/siri-and-alexa-could-become-witnesses-against-you-court-some-day.

  15. https://fortune.com/2021/05/04/voice-cloning-fraud-ai-deepfakes-phone-scams/.

  16. https://www.washingtonpost.com/technology/2019/09/04/an-artificial-intelligence-first-voice-mimicking-software-reportedly-used-major-theft/.

  17. https://www.scientificamerican.com/article/new-ai-tech-can-mimic-any-voice/.

  18. https://www.newyorker.com/culture/annals-of-gastronomy/the-haunting-afterlife-of-anthony-bourdain.

  19. https://blog.sdl.com/blog/The-Issue-of-Data-Security-and-Machine%20Translation.html.

  20. https://edpb.europa.eu/our-work-tools/documents/public-consultations/2021/guidelines-022021-virtual-voice-assistants_en.


Acknowledgements

This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references UIDB/50021/2020 and CMU/TIC/0069/2019, and by the P2020 project MAIA (contract 045909). We would like to thank several colleagues for many interesting discussions on this topic, namely Bhiksha Raj, Helena Moniz, Filipa Calvão, and Andreas Nautsch.

Author information


Corresponding author

Correspondence to Isabel Trancoso.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Trancoso, I., Teixeira, F., Botelho, C., Abad, A. (2023). Treating Speech as Personally Identifiable Information and Its Impact in Machine Translation. In: Moniz, H., Parra Escartín, C. (eds) Towards Responsible Machine Translation. Machine Translation: Technologies and Applications, vol 4. Springer, Cham. https://doi.org/10.1007/978-3-031-14689-3_11


  • DOI: https://doi.org/10.1007/978-3-031-14689-3_11


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-14688-6

  • Online ISBN: 978-3-031-14689-3

  • eBook Packages: Computer Science, Computer Science (R0)
