Abstract
Content-aware music adaptation under temporal constraints, i.e. music resizing, has started to draw attention from the multimedia community because it arises in many real-world scenarios, e.g. animation production and radio advertisement production. The goal of music resizing is to change the length of a music track to a user-preferred length using a series of basic operations, e.g. compression, prolonging, cropping and insertion. The only existing music resizing approach, LyDAR, is based on lyrics analysis and uses only the compression operation to resize a music piece. As a result, LyDAR has several limitations: it can neither prolong a music track nor compress a music piece at very small stretch rates. In this paper, we propose a content-aware music resizing framework named MUSIZ. MUSIZ improves on LyDAR in three respects: (a) In addition to compression, MUSIZ takes advantage of prolonging, cropping and insertion operations, so it can handle both compression and prolonging requests. (b) Observing that quality degradation varies across segments, we propose the concept of stretch-resistance to measure the degree of quality degradation after a segment is stretched. Stretch-resistance is modeled on both acoustic and lyrics features. (c) Cropping and insertion operations are applied before stretching. We develop contiguity-preservative cropping and insertion algorithms that remove and insert music segments while smoothing the abrupt changes at the joints between the manipulated segments. Comprehensive user studies show that music tracks resized by MUSIZ achieve better quality than those produced by existing approaches.
Notes
The stretch rate, denoted as α, is defined as the ratio of the user-preferred length \({T_{u}}\) to the length of the original music piece \({T_{\mathcal{M}}}\), i.e. \({\alpha = T_{u} / T_{\mathcal{M}}}\).
The ARPAbet symbol set consists of 39 phonemes. It was developed by the Department of Defense's Advanced Research Projects Agency (ARPA) to represent the International Phonetic Alphabet (IPA) with ASCII characters.
When α ≥ 200 %, the repeating approach can convert α to the range (100 %, 200 %), as discussed in Sect. 3.
Google Music (http://google.cn/music/homepage?sourceid=cnhp) offers information on singers, albums, lyrics and music tracks free of charge. All the music tracks and lyrics it provides are copyrighted. Please note that free music track downloading is provided only in mainland China because of copyright restrictions.
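The stretch rate defined in the first note, and the reduction of large rates mentioned in the third note, can be illustrated with a minimal Python sketch. The function names `stretch_rate` and `split_by_repetition` are hypothetical and do not come from the paper. The splitting rule (repeat the track ⌊α⌋ times, then stretch each copy by the residual rate) is only an assumption about how a repeating approach might work; the actual procedure is described in Sect. 3 of the paper and may differ.

```python
import math
from typing import Tuple


def stretch_rate(target_len: float, original_len: float) -> float:
    """Return alpha = T_u / T_M for two positive lengths in the same unit."""
    if target_len <= 0 or original_len <= 0:
        raise ValueError("lengths must be positive")
    return target_len / original_len


def split_by_repetition(alpha: float) -> Tuple[int, float]:
    """Split alpha into (number of copies, residual stretch rate).

    Assumed rule (not taken from the paper): for alpha >= 2 the track is
    repeated floor(alpha) times and each copy is stretched by
    alpha / floor(alpha), which lies in [1, 2). Rates below 2 are
    returned unchanged with a single copy.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if alpha < 2.0:
        return 1, alpha
    copies = math.floor(alpha)
    return copies, alpha / copies


if __name__ == "__main__":
    # Resize a 180 s track to 450 s.
    alpha = stretch_rate(450.0, 180.0)
    copies, residual = split_by_repetition(alpha)
    print(alpha, copies, residual)
```

Under this assumed rule, the product of the number of copies and the residual rate always equals the requested α, so the final length still matches the user-preferred length. A rate below 100 % corresponds to compression and a rate above 100 % to prolonging.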
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 60803016, No. 61170064 and No. 61073005), the National Basic Research Program of China (No. 2012AA011002) and the National HeGaoJi Key Project (No. 2010ZX01042-002-002-01). We would like to thank the volunteers for participating in the user study. We also thank the anonymous reviewers and the editors for their insightful comments.
Additional information
Communicated by T. Plagemann.
Cite this article
Liu, Z., Wang, C., Wang, J. et al. Adaptive music resizing with stretching, cropping and insertion. Multimedia Systems 19, 359–380 (2013). https://doi.org/10.1007/s00530-012-0289-6
DOI: https://doi.org/10.1007/s00530-012-0289-6