Adaptive music resizing with stretching, cropping and insertion

Liu, Zhang; Wang, Chaokun; Wang, Jianmin; Wang, Hao; Bai, Yiyuan

doi:10.1007/s00530-012-0289-6

A generic content-aware music resizing framework

Regular Paper
Published: 26 July 2012

Multimedia Systems, Volume 19, pages 359–380 (2013)

Abstract

Content-aware music adaptation under temporal constraints, i.e. music resizing, has started to draw attention from the multimedia community because it arises in many real-world scenarios, e.g. animation production and radio advertisement production. The goal of music resizing is to change the length of a music track to a user-preferred length using a series of basic operations, e.g. compression, prolonging, cropping and insertion. The only existing music resizing approach so far, called LyDAR, is based on lyrics analysis and uses only the compression operation to resize a music piece. As a result, LyDAR suffers from some limitations, e.g. it can neither prolong a music track nor compress music pieces at very small stretch rates. In this paper, we propose a content-aware music resizing framework named MUSIZ. In general, MUSIZ outperforms LyDAR in three aspects: (a) In addition to the compression operation, MUSIZ takes advantage of prolonging, cropping and insertion operations to handle both compression and prolonging requests. (b) Observing that quality degradation differs across segments, we propose the concept of stretch-resistance to measure the degree of quality degradation after a segment is stretched. Stretch-resistance is modeled on both acoustic and lyrics features. (c) Cropping and insertion operations are applied before stretching. We develop contiguity-preservative cropping and insertion algorithms that remove and insert music segments while smoothing the abrupt change at the joint between the manipulated segments. Comprehensive user studies show that music tracks resized by MUSIZ achieve better quality than those produced by existing approaches.

Notes

The stretch rate, denoted as \(\alpha\), is defined as the ratio of the user-preferred length \(T_u\) to the length of the original music piece \(T_{\mathcal{M}}\), i.e. \(\alpha = T_u / T_{\mathcal{M}}\).
http://en.wikipedia.org/wiki/LRC_file_format.
http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
The ARPAbet symbol set consists of 39 phonemes. It was developed by the Department of Defense's Advanced Research Projects Agency (ARPA) to represent the International Phonetic Alphabet (IPA) with ASCII characters.
When α ≥ 200 %, the repeating approach can convert α to the range (100 %, 200 %), as discussed in Sect. 3.
Google Music (http://google.cn/music/homepage?sourceid=cnhp) offers information on singers, albums, lyrics and music tracks free of charge. All the music tracks and lyrics it provides are copyrighted. Please note that free music track downloading is provided only in mainland China due to copyright restrictions.
http://marsyas.info/.
http://www.surina.net/soundtouch/soundstretch.html.
http://www.surina.net/soundtouch/index.html.
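
The stretch rate defined in the first note, and the idea of repeating a track when α ≥ 200 %, can be illustrated with a small sketch. The function names `stretch_rate` and `reduce_by_repeating` are hypothetical, and the repetition rule used here (play the track ⌊α⌋ times and stretch each copy by the remaining factor) is only an assumption; the article's own procedure is described in its Sect. 3, which is not reproduced on this page.

```python
import math
from typing import Tuple


def stretch_rate(target_len_s: float, original_len_s: float) -> float:
    """Return alpha = T_u / T_M, the ratio of target length to original length."""
    if target_len_s <= 0 or original_len_s <= 0:
        raise ValueError("lengths must be positive")
    return target_len_s / original_len_s


def reduce_by_repeating(alpha: float) -> Tuple[int, float]:
    """Split a large stretch rate into (repeat count, residual stretch rate).

    Assumed rule, not taken from the article: repeat the whole track
    k = floor(alpha) times and stretch each copy by alpha / k, which lies
    in [1, 2). When alpha is an exact integer the residual rate is 1.0,
    i.e. no stretching is needed after repeating.
    """
    if alpha < 2.0:
        return 1, alpha  # already below 200 %, nothing to repeat
    k = math.floor(alpha)
    return k, alpha / k


# A 180 s track requested at 450 s: alpha = 2.5, so under the assumed rule
# the track is played twice and each copy is prolonged by a factor of 1.25.
alpha = stretch_rate(450.0, 180.0)
repeats, residual = reduce_by_repeating(alpha)
print(alpha, repeats, residual)
```

Keeping the residual rate below 200 % means the stretching step only ever has to handle moderate prolonging, which is the range the note refers to.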

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 60803016, No. 61170064 and No. 61073005), the National Basic Research Program of China (No. 2012AA011002) and the National HeGaoJi Key Project (No. 2010ZX01042-002-002-01). We would like to thank the volunteers for participating in the user study. We also thank the anonymous reviewers and the editors for their insightful comments.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, People’s Republic of China
Zhang Liu, Hao Wang & Yiyuan Bai
School of Software, Tsinghua University, Beijing, 100084, People’s Republic of China
Chaokun Wang & Jianmin Wang

Corresponding author

Correspondence to Zhang Liu.

Additional information

Communicated by T. Plagemann.

About this article

Cite this article

Liu, Z., Wang, C., Wang, J. et al. Adaptive music resizing with stretching, cropping and insertion. Multimedia Systems 19, 359–380 (2013). https://doi.org/10.1007/s00530-012-0289-6

Received: 23 December 2011
Accepted: 20 June 2012
Published: 26 July 2012
Issue Date: July 2013
DOI: https://doi.org/10.1007/s00530-012-0289-6