Automatic identification of audio recordings based on statistical modeling

doi:10.1016/j.sigpro.2009.10.025

Signal Processing

Volume 90, Issue 4, April 2010, Pages 1064-1076

https://doi.org/10.1016/j.sigpro.2009.10.025

Abstract

This paper describes a methodology for the automatic identification of audio recordings of ethnic music. The identification is based on an application of hidden Markov models (HMMs), which are automatically built from a representation of the music pieces to be identified. States of the HMMs are labeled with music events, and the transition and observation probabilities are directly computed from the information on the music piece. The recordings are modeled by a set of acoustic features that are computed according to the characteristics of the music events. Three alternative approaches, based on typical applications of HMMs, are proposed to perform the identification. Tests carried out on collections of recordings showed that the methodology can achieve good results, and the identification rate is high enough to suggest applications for automatic retrieval of metadata and for the identification of alternative recordings of the same piece.

Introduction

Digital collections, usually stored and organized in digital libraries, are becoming an increasingly important tool for the preservation and dissemination of music cultural heritage. Music can benefit from the application of digital technologies even more than other art forms, because it can be shared by users with different backgrounds and education, easily crossing language barriers. This is particularly true for ethnic music, which can be enjoyed by users with different cultures and traditions, living in different geographical areas. For a digital library of ethnic music, the goal of “anytime–anywhere” access is meant in its widest sense. Yet the interest in music digital collections should be matched by the availability of tools for effective access to digital content. Textual descriptors such as title, author, and performers are all valuable metadata that allow a user to retrieve potentially relevant music items. Moreover, additional information about time and key signatures, instrumentation, and geographical location may provide alternative ways to access music content.

In the case of ethnic music, access to music documents becomes more challenging, because typical metadata may not be good descriptors of music content. The most effective fields in a music search (author and title, according to [1]) may not be good descriptors for this repertoire. For this reason, tools for content-based search should be provided to the user, aimed at the exploration of the digital collection, for instance to retrieve alternative recordings of a given piece even when textual metadata are different or not available. Another important issue in the development of a music digital collection is how music content is cataloged during the acquisition of the material. Classification and labeling would have to be carried out manually by trained personnel, with a considerable increase in time and costs. In this case too, automatic tools that identify a recording as an instance of a given song can be used to automatically add pertinent metadata.

This paper reports a methodology for the automatic identification of unknown recordings. The approach is based on hidden Markov models (HMMs), which model the unobservable process underlying the production of a music performance given a statistical representation of a music piece. The presented approach builds on earlier work on audio-to-score alignment [2], [3], which has been extended to the particular task of exploring ethnic music digital collections. One of the goals of this paper is to explore whether HMM-based identification can be carried out even when only partial information (e.g., melodic content) is available to describe a song. In this respect, some characteristics common to many repertoires of ethnic music make the proposed approach particularly suitable.

First of all, melody plays a fundamental role in most ethnic repertoires, as opposed to the functional harmony used in tonal Western music. Yet the same song can be performed in countless variants, which depend on the oral tradition by which it is transmitted, on the available instrumentation, and on the geographical areas and thus on the cultures that have been influenced by it. Given the prominent role of oral tradition, for many songs the most common metadata such as author and title may not be significant, the author being unknown and the title being present in dozens of variants, which makes music retrieval particularly difficult. The fact that a given piece can be sung a cappella, played by a single instrument, or played or sung with accompaniment makes its identification challenging, because only partial information about the acoustic features is available. All these considerations suggest the use of a statistical framework to model ethnic music recordings. The framework can be applied to identify unknown recordings and to mine a music collection to highlight the presence of different recordings of the same song, for instance due to geographical variants.

The paper is organized as follows. Section 2 introduces a number of approaches that are related to the presented methodology. The methodology for music modeling is described in Section 3, while three approaches to identification are presented in Section 4. The results of the experimental evaluation are given in Section 5, followed by some concluding considerations in Section 6.

Section snippets

Related work

This section describes some approaches reported in the literature that are related to the present work. The discussion is limited to research focusing on music identification, and thus does not include the vast literature in music information retrieval that focuses on the computation of music similarity aimed at content-based retrieval. A complete overview of these approaches can be found in [4].

Statistical modeling of music performances

The idea underlying the proposed approach is that a performance can be considered as the realization of a process that converts the representation a performer has about a music piece into a sequence of acoustic features. The representation can be read from a music score, recalled from memory or improvised. The process is non-deterministic because different performances correspond to the same music piece, depending on a number of parameters which are only partially known. For instance, features
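As a rough illustration of how a model could be derived directly from a symbolic representation of a piece, the following Python sketch builds a left-to-right chain with one state per note. The names `ScoreHMM` and `build_score_hmm`, the `frames_per_beat` parameter, and the rule that derives each self-loop probability from the notated duration are assumptions made for this example only; they are not taken from the paper, whose actual model is described in Section 3.

```python
from dataclasses import dataclass


@dataclass
class ScoreHMM:
    """Hypothetical container for a left-to-right model built from a melody."""
    labels: list      # music event attached to each state (here: a MIDI pitch)
    self_loop: list   # probability of staying in the state for one more frame
    advance: list     # probability of moving on to the next state


def build_score_hmm(notes, frames_per_beat=10.0):
    """Build one state per note from a list of (midi_pitch, duration_in_beats).

    Assumption: a state with self-loop probability p is occupied for a
    geometrically distributed number of frames with mean 1 / (1 - p), so p is
    chosen to make that mean equal to the notated duration in frames.
    """
    labels, self_loop, advance = [], [], []
    for pitch, beats in notes:
        expected_frames = max(1.0, beats * frames_per_beat)
        stay = 1.0 - 1.0 / expected_frames
        labels.append(pitch)
        self_loop.append(stay)
        advance.append(1.0 - stay)
    return ScoreHMM(labels, self_loop, advance)


# Toy melody: C4 for one beat, D4 for half a beat, E4 for two beats.
melody = [(60, 1.0), (62, 0.5), (64, 2.0)]
hmm = build_score_hmm(melody)
```

The geometric duration model is the simplest option and is used here only to show that transition probabilities can be computed from the score alone, without training data.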

Music identification

Identification, or recognition, is probably the application of HMMs that is most often described in the literature. The classical identification problem may be stated as follows:

Case A:
Given an unknown audio recording, described by a sequence of audio features $\bar{E} = \{\bar{e}(1), \dots, \bar{e}(T)\}$, where $\bar{e}(t) = \{e_{0}(t), e_{1}(t), \dots\}$ are the acoustic events described in Section 3.2, and a set of competing models $\lambda_{i}$, which have been generated from the representation of music pieces as described in Section 3.1, find the model that
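The statement of Case A is cut off in this snippet, so the selection criterion is not reproduced here. In the textbook formulation of HMM recognition, the competing models are scored by the likelihood of the observation sequence and the highest-scoring one is chosen. The Python sketch below illustrates that standard approach with the forward algorithm in the log domain. It is an assumption-laden toy: discrete symbols stand in for the acoustic features, and the function names, model names, and probability values are all hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else -math.inf


def log_likelihood(obs, start, trans, emit):
    """Forward algorithm: log P(obs | model).

    start[i] and trans[i][j] are probabilities; emit[i] maps an observed
    symbol to its probability in state i (missing symbols count as zero).
    """
    n = len(start)
    alpha = [safe_log(start[i]) + safe_log(emit[i].get(obs[0], 0.0))
             for i in range(n)]
    for symbol in obs[1:]:
        alpha = [
            log_sum_exp([alpha[i] + safe_log(trans[i][j]) for i in range(n)])
            + safe_log(emit[j].get(symbol, 0.0))
            for j in range(n)
        ]
    return log_sum_exp(alpha)


def identify(obs, models):
    """Score every competing model and return the best name with all scores."""
    scores = {name: log_likelihood(obs, *params)
              for name, params in models.items()}
    return max(scores, key=scores.get), scores


# Two hypothetical two-state left-to-right models (start, trans, emit).
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "piece_a": ([1.0, 0.0], left_to_right,
                [{"C": 0.9, "D": 0.1}, {"E": 0.9, "D": 0.1}]),
    "piece_b": ([1.0, 0.0], left_to_right,
                [{"G": 0.9, "D": 0.1}, {"A": 0.9, "D": 0.1}]),
}
best, scores = identify(["C", "C", "E"], models)
```

Working in the log domain avoids numerical underflow on long sequences, which matters because a recording typically yields many more frames than this three-symbol example.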

Experimental evaluation

A number of experiments have been carried out to evaluate the methodology with real acoustic data from original recordings, taken from the personal collection of the author. The test collection was partitioned into two subsets, related to Cases A and B described in Section 4:

Set A.
A collection of 139 transcriptions of Balkan music (mainly Romanian and Bulgarian songs), where only melodic information was represented, as is often the case for this kind of transcription. The first bars of the songs,

Conclusions

An HMM-based methodology for statistical modeling of audio recordings is presented and applied to the automatic identification of ethnic music. Three approaches to compute the conditional probability of observing an audio recording given a statistical model of the corresponding music piece have been proposed and tested on two collections: a collection of transcriptions of Balkan music and a collection of recordings of Italian songs. In both cases identification has been carried out by modeling

References (40)

G. Leazer, The effectiveness of keyword searching in the retrieval of musical works on sound recordings, Cataloging and Classification Quarterly (1992)
N. Orio et al., Score following using spectral analysis and hidden Markov models
N. Orio et al., Alignment of monophonic and polyphonic music to a score
J. Downie, Music information retrieval, Annual Review of Information Science and Technology (2003)
J. Haitsma et al., A highly robust audio fingerprinting system with an efficient search strategy, Journal of New Music Research (2003)
H. Özer et al., Perceptual audio hashing functions, EURASIP Journal on Applied Signal Processing (2005)
P. Cano et al., A review of audio fingerprinting, Journal of VLSI Signal Processing (2005)
Gracenote, Music search 〈http://www.gracenote.com/〉 (September...
Tunatic, Free music identification software 〈http://www.wildbits.com/tunatic/〉 (September...
L. Boney, A. Tewfik, K. Hamdy, Digital watermarks for audio signals, IEEE Proceedings Multimedia (1996)...
J.-S. Pan et al., Intelligent Watermarking Techniques (2004)
T. Fujishima, Realtime chord recognition of musical sound: a system using common lisp music
M. Goto, A chorus section detection method for musical audio signals and its application to a music listening station, IEEE Transactions on Audio, Speech, and Language Processing (2006)
C. Harte et al., Detecting harmonic changes in musical audio
P. Herrera et al., Chroma binary similarity and local alignment applied to cover song identification, IEEE Transactions on Audio, Speech, and Language Processing (2008)
N. Hu et al., Polyphonic audio matching and alignment for music retrieval
D. Ellis et al., Identifying cover songs with chroma features and dynamic programming beat tracking
R. Miotto et al., Automatic identification of music works through audio matching
M. Müller et al., Efficient index-based audio matching, IEEE Transactions on Audio, Speech, and Language Processing (2008)
R. Miotto et al., A music identification system based on chroma indexing and statistical modeling

