Abstract
In this work, a scheme based on a compressive sampling technique and a fast dictionary learning approach for reconstructing audio content in multimedia streaming is introduced. Audio streaming data are encapsulated in different packets by means of an interleaving technique. The compressive sampling technique is used to reconstruct audio information in case of lost packets, with a sparsifying basis provided by a greedy adaptive dictionary learning algorithm. In order to assess the performance of the methodology, several experiments on speech and musical audio signals are presented.
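The recovery idea summarized above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the article: a fixed DCT basis stands in for the dictionary learned by the greedy adaptive algorithm, a seeded pseudo-random permutation stands in for the interleaver, Orthogonal Matching Pursuit stands in for the recovery solver, and the function names `dct_basis`, `interleave`, and `omp` are hypothetical.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT synthesis matrix; column k is the k-th cosine atom."""
    t = np.arange(n)[:, None] + 0.5
    k = np.arange(n)[None, :]
    psi = np.sqrt(2.0 / n) * np.cos(np.pi * t * k / n)
    psi[:, 0] /= np.sqrt(2.0)
    return psi

def interleave(num_samples, num_packets, seed):
    """Spread sample indices over packets with a seeded permutation (assumed scheme)."""
    perm = np.random.default_rng(seed).permutation(num_samples)
    return np.array_split(perm, num_packets)

def omp(a, y, max_atoms, tol=1e-10):
    """Orthogonal Matching Pursuit: pick atoms greedily, refit by least squares."""
    norms = np.linalg.norm(a, axis=0)
    support, sol = [], np.zeros(0)
    residual = y.copy()
    while len(support) < max_atoms and np.linalg.norm(residual) > tol:
        corr = np.abs(a.T @ residual) / norms
        if support:
            corr[support] = 0.0  # never select the same atom twice
        support.append(int(np.argmax(corr)))
        sol, *_ = np.linalg.lstsq(a[:, support], y, rcond=None)
        residual = y - a[:, support] @ sol
    coef = np.zeros(a.shape[1])
    if support:
        coef[support] = sol
    return coef

# Synthetic frame that is 3-sparse in the stand-in basis.
n, num_packets, lost = 256, 8, 3
psi = dct_basis(n)
coeffs = np.zeros(n)
coeffs[[5, 40, 97]] = [1.0, -0.8, 0.6]
x = psi @ coeffs

# Sender: interleave samples into packets. Receiver: one packet is missing.
packets = interleave(n, num_packets, seed=0)
kept = np.sort(np.concatenate([p for i, p in enumerate(packets) if i != lost]))
y = x[kept]

# Receiver: recover the sparse coefficients from the surviving samples only.
est = omp(psi[kept, :], y, max_atoms=32)
x_hat = psi @ est
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print("kept samples:", kept.size, "relative error:", rel_err)
```

Because interleaving scatters the samples of the lost packet across the frame, the surviving samples act as an incomplete set of measurements from which a sparse representation can still be estimated. A real audio frame is only approximately sparse, so the reconstruction would then be approximate as well.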
Notes
The software is available on request. The source and reconstructed recordings of the experiments are available at the public websites http://goo.gl/wVZBMz and http://goo.gl/o4UPIh.
See Appendix A and the websites http://goo.gl/wVZBMz and http://goo.gl/o4UPIh for further details on the content of the audio signals.
The recordings of the results can be downloaded and listened to at the websites http://goo.gl/wVZBMz and http://goo.gl/o4UPIh.
Acknowledgments
This work was partially funded by the “Sostegno alla ricerca individuale per il triennio 2015-2017” project of the University of Naples “Parthenope”.
Appendix A
In the following, we report the basic features of the audio signals used in the experiments.
A.1 Data set D1
In this experiment, we consider a male voice (wav format, Mono, 44100 Hz, 32-bit float, 32 seconds). The sentence for constructing the dictionary is:
“Dopo ebbi questa visione, una porta era aperta nel cielo e la voce che prima avevo udita come di tromba parlare con me mi dice: sali quassù e ti mostrerò ciò che deve accadere dopo queste cose, e all’istante fui rapito in spirito ed ecco in cielo c’era un trono e sul trono uno seduto e il seduto era simile nell’aspetto a gemma di diaspro e cornalina e intorno al trono c’era l’arcobaleno simile nell’aspetto...”.
The testing audio signal, a male voice (wav format, Mono, 44100 Hz, 32-bit float, 4.5 seconds), is:
“Temere Dio è sapienza, astenersi dal male è intelligenza”.
A.2 Data set D2
In this experiment, we consider a female voice (wav format, Mono, 44100 Hz, 32-bit float, 35.29 seconds).
The sentence for constructing the dictionary is:
“Ungete uno stampo rotondo con una noce di burro e cospargetelo bene con un poco di pan grattato, tagliate a pezzi i fegatini, l’animella scottata (risata) e pelata in precedenza e i granelli di pollo lasciando poi rosolare il tutto in un poco di burro infine cospargete con pepe e sale nero, no pepe nero e sale. Per la pasta disponete la farina a fontana sulla spianatoia, unite 4 uova e lavorate il composto con le mani versando anche un poco d’acqua.”.
The testing audio signal, with a female voice (wav format, Mono, 44100 Hz, 32-bit float, 4.1 seconds), is:
“C’era una volta un re seduto sul sofa”.
A.3 Data set D3
In this experiment, we consider a male voice (wav format, Mono, 44100 Hz, 32-bit float, 33.4 seconds).
The sentence for constructing the dictionary is:
“Mi hanno detto di parlare un pò più piano perchè non si capisce niente. Mi hanno detto di vestire un poco meglio perchè sembro un deficiente, e allora mi son detto parlerò più piano e vestirò un pò più elegante. Sono andato in un negozio ed ho comprato un capo molto appariscente. Questa qua e la volta buona che riesco ad integrarmi in società, quante volte lo diceva mammà”.
The testing audio signal, a male voice (wav format, Mono, 44100 Hz, 32-bit float, 10.5 seconds), is:
“Trentatre trentini entrarono in treno tutti e trentatre trotterellando. Mi mi (risate di fondo)”.
A.4 Data set D4
In this experiment, we consider a female voice (wav format, Mono, 44100 Hz, 32-bit float, 6.5 seconds).
The sentence for constructing the dictionary is:
“Ciao questa sera stiamo facendo questa cosa, ciao.”.
The testing audio signal, a female voice (wav format, Mono, 44100 Hz, 32-bit float, 5.7 seconds), is
“Mi hanno detto che per far passare il mal di testa devo mettere...”.
A.5 Data set D5
In this experiment we consider a female voice (wav format, Mono, 44100 Hz, 32-bit float, 5.7 seconds).
The sentence for constructing the dictionary is:
“Mi hanno detto che per far passare il mal di testa devo mettere...”.
The testing audio signal, a female voice (wav format, Mono, 44100 Hz, 32-bit float, 6.5 seconds), is
“Ciao questa sera stiamo facendo questa cosa, ciao”.
A.6 Data set D6
In this experiment, we consider a male voice (wav format, Mono, 44100 Hz, 32-bit float, 32 seconds). The sentence for constructing the dictionary is the same as that of data set D1.
The testing audio signal, a female voice (wav format, Mono, 44100 Hz, 32-bit float, 4.1 seconds), is
“C’era una volta un re seduto sul sofa”.
Cite this article
Ciaramella, A., Gianfico, M. & Giunta, G. Compressive sampling and adaptive dictionary learning for the packet loss recovery in audio multimedia streaming. Multimed Tools Appl 75, 17375–17392 (2016). https://doi.org/10.1007/s11042-015-3002-x