Adaptive music resizing with stretching, cropping and insertion

A generic content-aware music resizing framework

  • Regular Paper
Multimedia Systems

Abstract

Content-aware music adaptation under temporal constraints, i.e. music resizing, has started to draw attention from the multimedia community because it arises in many real-world scenarios, e.g. animation production and radio advertisement production. The goal of music resizing is to change the length of a music track to a user-preferred length using a series of basic operations, e.g. compression, prolonging, cropping and insertion. The only existing music resizing approach so far, called LyDAR, is based on lyrics analysis and uses only the compression operation to resize a music piece. As a result, LyDAR suffers from some limitations, e.g. it can neither prolong a music track nor compress music pieces at very small stretch rates. In this paper, we propose a content-aware music resizing framework named MUSIZ. MUSIZ improves on LyDAR in three aspects: (a) In addition to the compression operation, MUSIZ takes advantage of prolonging, cropping and insertion operations to handle both compression and prolonging requests. (b) Observing that quality degradation differs across segments, we propose the concept of stretch-resistance to measure the degree of quality degradation after a segment is stretched. The stretch-resistance is modeled on both acoustic and lyrics features. (c) Cropping and insertion operations are applied before stretching. We develop contiguity-preservative cropping and insertion algorithms to remove and insert music segments while smoothing the abrupt changes at the joints between the manipulated segments. Comprehensive user studies show that the music tracks resized by MUSIZ achieve better quality than those produced by existing approaches.

Notes

  1. The stretch rate, denoted as α, is defined as the ratio of the user-preferred length \({T_{u}}\) to the length of the original music piece \({T_{\mathcal{M}}}\), i.e. \({\alpha = T_{u} / T_{\mathcal{M}}}\).

  2. http://en.wikipedia.org/wiki/LRC_file_format.

  3. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

  4. The ARPAbet symbol set, which consists of 39 phonemes, was developed by the Department of Defense's Advanced Research Projects Agency (ARPA) to represent the International Phonetic Alphabet (IPA) with ASCII characters.

  5. In cases where α ≥ 200 %, the repeating approach can convert α to the range (100 %, 200 %), as discussed in Sect. 3.

  6. Google Music (http://google.cn/music/homepage?sourceid=cnhp) offers information on singers, albums, lyrics and music tracks free of charge. All the music tracks and lyrics it provides are copyrighted. Please note that free music track downloading is available only in mainland China because of copyright restrictions.

  7. http://marsyas.info/.

  8. http://www.surina.net/soundtouch/soundstretch.html.

  9. http://www.surina.net/soundtouch/index.html.
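
As a small worked illustration of notes 1 and 5, the sketch below computes the stretch rate α = T_u / T_M and then reduces a rate of 200 % or more by repetition. The reduction rule is only an assumed reading of note 5, since Sect. 3 is not reproduced here: the piece is played k = ⌊α⌋ times, which leaves a residual rate α / k in [1, 2). The function names `stretch_rate` and `reduce_by_repeating` are hypothetical and do not come from the paper.

```python
import math
from typing import Tuple


def stretch_rate(target_len_s: float, original_len_s: float) -> float:
    """Stretch rate alpha = T_u / T_M, as defined in note 1."""
    if target_len_s <= 0 or original_len_s <= 0:
        raise ValueError("lengths must be positive")
    return target_len_s / original_len_s


def reduce_by_repeating(alpha: float) -> Tuple[int, float]:
    """Return (repeats, residual_alpha).

    Assumption (not taken from the paper): for alpha >= 2 the piece is
    repeated floor(alpha) times, so the residual rate alpha / repeats
    falls in [1, 2). Rates below 2 are returned unchanged.
    """
    if alpha < 2.0:
        return 1, alpha
    repeats = math.floor(alpha)
    return repeats, alpha / repeats


# A 180 s track resized to 450 s: alpha is 2.5, i.e. 250 %.
alpha = stretch_rate(target_len_s=450.0, original_len_s=180.0)
repeats, residual = reduce_by_repeating(alpha)
print(alpha, repeats, residual)
```

Under this assumption, the 250 % request becomes two plays of the track followed by a 125 % prolonging of the repeated material, which stays inside the range mentioned in note 5.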

Acknowledgments

The work is supported by the National Natural Science Foundation of China (No. 60803016, No. 61170064 and No. 61073005), the National Basic Research Program of China (No. 2012AA011002) and the National HeGaoJi Key Project (No. 2010ZX01042-002-002-01). We would like to thank the volunteers for participating in the user study. We also thank the anonymous reviewers and the editors for their insightful comments.

Author information

Corresponding author

Correspondence to Zhang Liu.

Additional information

Communicated by T. Plagemann.

About this article

Cite this article

Liu, Z., Wang, C., Wang, J. et al. Adaptive music resizing with stretching, cropping and insertion. Multimedia Systems 19, 359–380 (2013). https://doi.org/10.1007/s00530-012-0289-6

  • DOI: https://doi.org/10.1007/s00530-012-0289-6
