
Speech Communication

Volume 54, Issue 6, July 2012, Pages 743-762

Investigation of broadcast-audio semantic analysis scenarios employing radio-programme-adaptive pattern classification

https://doi.org/10.1016/j.specom.2012.01.004

Abstract

The present paper focuses on the investigation of various audio pattern classifiers in broadcast-audio semantic analysis, using radio-programme-adaptive classification strategies with supervised training. Multiple neural network topologies and training configurations are evaluated and compared in combination with feature-extraction, ranking and feature-selection procedures. Different pattern classification taxonomies are implemented, using programme-adapted multi-class definitions and hierarchical schemes. Hierarchical and hybrid classification taxonomies are deployed in speech analysis tasks, facilitating efficient speaker recognition/identification, speech/music discrimination, and generally speech/non-speech detection-segmentation. Exhaustive qualitative and quantitative evaluation is conducted, including indicative comparison with non-neural approaches. Hierarchical approaches exploit classification similarities, easing adaptation to generic radio-broadcast semantic analysis tasks. The proposed strategy exhibits increased efficiency in radio-programme content segmentation and classification, which is one of the most demanding audio semantics tasks. This strategy can be easily adapted to broader audio detection and classification problems, including other demanding real-world speech-communication scenarios.

Highlights

► Use of audio pattern classification in radio broadcast semantic analysis concepts.
► Investigation of programme-adaptive modules in real-world demanding scenarios.
► Formulation of direct, hierarchical and hybrid pattern classification schemes.
► Implementation of speech/non-speech segmentation and speaker recognition models.
► Training/evaluation of multiple pattern classifiers utilizing feature ranking.

Introduction

The rapid evolution of contemporary information and communication technologies (ICT) has undoubtedly influenced various aspects of human communication and behavior, including the field of journalism and mass communication. This impact is evident in most traditional mass media, both print and electronic, where important changes have occurred in production and distribution chains. Owing to technology, media coverage has acquired a global character, while novel services, on-demand access and user-friendly environments have turned one-to-many passive journalism models into user-interactive mass communication experiences (Spyridou et al., in press, Matsiola, 2009, Spyridou, 2009, Spyridou and Veglis, 2008a, Spyridou and Veglis, 2008b, Kalliris and Dimoulas, 2009). At the same time, the rapid growth of digital storage capacity, in combination with increased telecommunication bandwidth, the efficiency of compression algorithms and the continuous decrease of the corresponding costs, has allowed the deployment of high-quality audiovisual content in entertainment, journalism and mass communication applications (Burnett et al., 2006, Kakumanu et al., 2006, Kalliris and Dimoulas, 2009, Koenen, 2000). Particular interest has been devoted to the participation of citizens in news gathering and the creation of user-generated content, including audiovisual material, known as public or citizens’ journalism (Spyridou et al., in press, Matsiola, 2009, Kalliris and Dimoulas, 2009). Blog posts, podcasting, i-reporting and other infotainment services are popular examples of this trend (Spyridou et al., in press, Matsiola, 2009, Celma and Raimond, 2008, Nguyen et al., 2010).

Extending the above analysis, traditional radio and television broadcasting products are more easily created and distributed, while supplementary digital publishing media (i.e. Web-Radio/Web-TV) also contribute to the creation of massive audiovisual content. The audiovisual material derives from multiple sources with different coding formats and objects of interest. This content can be exploited in common real-time online access scenarios, but also in more sophisticated user-preferences-based content access, on-demand services, or even producer-site content archiving, summarization, highlighting and reuse scenarios. However, the nature of the audiovisual material, in combination with user-related production preferences and differences, results not only in content massiveness but also in content heterogeneity. This brings forward two major requirements: efficient content management and interoperability (Kalliris and Dimoulas, 2009, Kotsakis and Gioltzidou, 2011, Burnett et al., 2006). 
There are many technologies that have been proposed to resolve these issues, including MPEG-4, which introduced the extraction of audiovisual-objects (Kakumanu et al., 2006, Koenen, 2000), the MPEG-7 standard that proposes formalized audiovisual content description and management mechanisms (Kim et al., 2006, Burred and Lerch, 2004, Kalliris and Dimoulas, 2009), the MPEG-21 standard that targets transparent multimedia with standardized content-metadata-linking mechanisms (Burnett et al., 2006, Kalliris and Dimoulas, 2009), as well as many machine learning, data mining algorithms and artificial neural systems that are implemented for pattern classification/recognition and semantic analysis purposes (Burred and Lerch, 2004, Celma and Raimond, 2008, Dhanalakshmi et al., 2009, Dhanalakshmi et al., 2010, Dhanalakshmi et al., 2011, Dimoulas et al., 2007a, Dimoulas and Kalliris, 2010, Hall et al., 2009, Loviscach, 2010, Rongqing and Hansen, 2006, Vegiris et al., 2009, Wu and Hsieh, 2009).

The paper focuses on the implementation of radio-broadcast semantic analysis using programme-adaptive classification scenarios, in combination with “trial and error”-based comparisons of different machine learning approaches and various hierarchical pattern taxonomies. The basic idea is to apply automated audio content analysis to a specific radio-broadcast programme, implementing expert systems that have been trained using samples from only one or two shows of that programme. Hence, motivated by the results of a feasibility study regarding broadcast-audio content semantic analysis automation demands, the paper mostly focuses on tasks related to speech communication, such as speech/non-speech segmentation, speech/music discrimination, and speaker and phone-line identification in demanding real-world scenarios. These kinds of speech/audio semantic analysis concepts serve both user-side and radio-producer-side content management automation requirements (e.g. user-preferences-based content access, radio-programme post-analysis and scheduling, on-demand services, or even producer-site content archiving, summarization, highlighting and reuse scenarios).

The current work addresses the problem of general audio segmentation and classification, specializing in recordings of prolonged radio broadcasts, and aims to facilitate semantic analysis automation for the corresponding speech/audio content. As already stated, incorporating audio semantic analysis is very important for efficient content management and browsing, both at the producer site and the audience site. It has to be pointed out that radio-programme segmentation and classification is very demanding compared to other audio recordings, because there is a variety of different audio patterns that depend on the specific characteristics of each radio broadcasting show (i.e. various speakers, music, commercials, etc.). In addition, radio productions usually feature a continuous data flow without easily distinguished pauses, where temporally overlapping audio or voice signals are common. For instance, background music often co-exists with speech segments, while fade-in/fade-out operations and additional background noises deteriorate audio detection and classification performance. Hence, the formulation of the pattern-definition taxonomy is quite difficult, and pattern classification cannot be handled easily because of the many audio pattern similarities and frequent class overlaps.

Based on the above, the problem under study has similarities to the general model of music, speech, phoneme and general audio segmentation and classification approaches, with the additional difficulties of acoustic background noise and speech variability (i.e. foreign and regional accents, speaker physiology, speaking style, rate of speech, spontaneous speech disfluencies, etc.), the continuous flow, the prolonged duration of the recordings, and the unrestricted and unlimited nature of the pattern-classes dictionary (Benzeghiba et al., 2007, Stouten et al., 2006, Huijbregts and de Jong, 2011, Bach et al., 2011, Beyerlein et al., 2002). Despite the progress made in audio classification and speech recognition techniques, speech/non-speech detection-segmentation and speech/music discrimination still remain very demanding and critical in most realistic tasks related to the automatic transcription of broadcast news (Ajmera et al., 2003, Bach et al., 2011, Huijbregts and de Jong, 2011, Markaki and Stylianou, 2011, Taniguchi et al., 2008). The same applies to the speaker recognition/identification problem in natural (/common) speech-communication cases, where speaker-specific characteristics are usually involved (i.e. fluency, accent, speech rate, gender, age, emotional status, etc.) and text information is not provided in most cases (Kinnunen and Li, 2010, Lee et al., 2011, Palanivel, 2009, Wu and Lin, 2009). Thus, although classical strategies can be adopted to deal with such problems, careful treatment is necessary from the very beginning of the project, during the adoption of specific classification schemes and the formulation of the necessary ground-truth labeling that is associated with the corresponding audio pattern dictionaries. Hence, if x(i) is the long-duration audio-broadcasting signal, the following steps are necessary in order to implement an automated audio semantic analysis system (Fig. 1): (a) adoption of the appropriate pattern taxonomy, (b) formulation of the audio pattern samples Xs(λ) via a segmentation process, (c) audio-sample pattern labeling (ground-truth formulation), (d) feature extraction and selection, (e) artificial neural system training, (f) system evaluation.
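To make the pipeline concrete, the following is a minimal Python sketch (assuming NumPy) of steps (b) and (d) above: slicing the long-duration signal x(i) into windowed pattern samples and computing a toy per-frame feature vector. All function names, window sizes and feature choices here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# (a) Pattern taxonomy: hypothetical top-level classes for a radio programme.
TAXONOMY = ["speech", "music", "other"]

def segment(x, win, hop):
    """(b) Segment the long-duration signal x(i) into overlapping
    pattern samples Xs via a fixed-window, fixed-hop scan."""
    starts = range(0, len(x) - win + 1, hop)
    return np.stack([x[i:i + win] for i in starts])

def extract_features(frames):
    """(d) Toy per-frame feature vector: log short-term energy and
    zero-crossing rate (two of the classic features named in the text)."""
    energy = np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([energy, zcr])
```

Steps (c), (e) and (f) would then attach ground-truth labels to each frame, train a neural classifier on the resulting feature/label pairs, and evaluate it on held-out data.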

In the current work we attempt to apply and evaluate the above model in various programme-adaptive strategies, using different classification schemes, both direct and hierarchical/hybrid, in combination with various configurations in terms of time segmentation, feature vectors, neural system topologies and training algorithms. The classification process is quite difficult, as completely different characteristics and properties within the same audio content have to be dealt with. Exhaustive quantitative and qualitative evaluation is applied in trial-and-error scenarios in order to investigate the potential of radio broadcasting semantics in generic and adaptive approaches.

As already stated, the discussed problem combines tasks and difficulties that are met in general audio segmentation and classification research (Bach et al., 2011, Burred and Lerch, 2004, Dhanalakshmi et al., 2009, Dhanalakshmi et al., 2010, Dhanalakshmi et al., 2011, Dimoulas et al., 2007a, Loviscach, 2010, Rongqing and Hansen, 2006, Vegiris et al., 2009), as well as in more specific sub-topics, such as speech recognition and phoneme matching (Avdelidis et al., 2010a, Avdelidis et al., 2010b, Avdelidis et al., 2010c, Benzeghiba et al., 2007, Beyerlein et al., 2002, Kalliris et al., 2002, Stouten et al., 2006), speaker identification and verification (Kinnunen and Li, 2010, Palanivel, 2009, Wu and Lin, 2009), voice/music detection-discrimination (Ajmera et al., 2003, Taniguchi et al., 2008, Markaki and Stylianou, 2011), or even bioacoustics and other non-speech audio segmentation (Dimoulas et al., 2007b, Dimoulas et al., 2008, Dimoulas et al., 2011, Dimoulas and Kalliris, 2010, Nguyen et al., 2010). Pattern-class formulation and the definition of taxonomy schemes depend on the targets and capabilities of each specific method, which are connected with user-related functionalities and automation demands. Next, there is a variety of audio features (i.e. linear predictive coefficients, linear predictive cepstral coefficients, mel-frequency cepstral coefficients, short-term magnitude average and zero crossing rates, MPEG-7 low-level audio descriptors and others) and feature-ranking screening methods (i.e. linear correlation and inter-/intra-distance measures, expectation maximization and more sophisticated algorithms) that are employed for audio detection, segmentation and classification tasks (Ajmera et al., 2003, Bach et al., 2011, Burred and Lerch, 2004, Dhanalakshmi et al., 2010, Dimoulas et al., 2008, Kinnunen and Li, 2010, Lartillot and Toiviainen, 2007, Loviscach, 2010, Markaki and Stylianou, 2011, Palanivel, 2009, Taniguchi et al., 2008, Vegiris et al., 2009, Wu and Lin, 2009). Similarly, there is a plurality of available supervised and non-supervised machine learning approaches (i.e. neural and fuzzy systems, Hidden Markov Chains, Rough Sets, statistical and tree-based reasoning, nearest neighbor and support vector machines, syntactic and hierarchical classification approaches, etc.) that can be employed with various training strategies, whereas system inputs and sizing should be carefully treated according to the population of the training samples and the complexity of the problem (Ajmera et al., 2003, Avdelidis et al., 2010a, Avdelidis et al., 2010b, Dhanalakshmi et al., 2009, Dimoulas et al., 2008, Dimoulas et al., 2011, Hall et al., 2009, Jain et al., 2000, Lee et al., 2011, Moody et al., 1992, Rongqing and Hansen, 2006, Wu and Hsieh, 2009). In most cases, energy-detection rules, but also more sophisticated feature-vector comparison operands, are used for audio event detection and segmentation prior to the classification tasks. Short-term windowed signal processing techniques are mostly employed, while multi-resolution window-based scanning and point-to-point adaptive treatment have also been reported (Dimoulas and Kalliris, 2010, Dimoulas et al., 2007a, Dimoulas et al., 2007b, Dimoulas et al., 2011, Loviscach, 2010, Vegiris et al., 2009).
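As an illustration of the simplest feature-ranking screening method named above (linear correlation), the following Python sketch ranks feature columns by their absolute Pearson correlation with numeric class labels. The function name and interface are hypothetical, chosen only for this example.

```python
import numpy as np

def rank_features(X, y):
    """Rank feature columns by absolute Pearson correlation with the labels.

    X: (n_samples, n_features) feature matrix.
    y: (n_samples,) numeric class labels.
    Returns feature indices sorted from most to least correlated.
    """
    Xc = X - X.mean(axis=0)          # center each feature column
    yc = y - y.mean()                # center the labels
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    r = np.abs(Xc.T @ yc) / denom    # |Pearson r| per feature
    return np.argsort(r)[::-1]
```

Keeping only the top-ranked columns then yields the reduced feature vectors that feed the classifier, which is the role feature selection plays in the pipeline described above.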

Various audio broadcast pattern classification and web-semantic approaches have appeared in the literature for automatic or semi-automatic annotation of radio and television news (Beyerlein et al., 2002, Celma and Raimond, 2008, Dhanalakshmi et al., 2010, Dimoulas and Kalliris, 2010, Gauvain et al., 2002, Markaki and Stylianou, 2011, Nguyen et al., 2010, Rongqing and Hansen, 2006, Sankar et al., 2002, Woodland, 2002, Wu and Hsieh, 2009). Markaki and Stylianou (2011) focus on speech/non-speech discrimination in broadcast news using modulation spectral features with singular value decomposition. Broadcast news segmentation and classification have been implemented via tree modeling, entropy information and temporal boundary identification of topic stories, whereas a genetic algorithm is employed for the classification task (Wu and Hsieh, 2009). The same segmentation/classification goal is pursued by Dhanalakshmi et al. (2010) using auto-associative neural networks with linear prediction and cepstral analysis. Rongqing and Hansen (2006) implement unsupervised broadcast-news audio classification and segmentation using weighted GMM networks with combined time–frequency features and feature-vector evaluation procedures. Non-supervised clustering algorithms with various metrics attempt to identify groups of data in cases where pattern-class definition is not trivial (Palanivel, 2009). An interesting approach arises in podcasting audio analysis, where a single file represents the podcast session and it is therefore very difficult to seek into the music tracks; the suggested approach decomposes the audio content into smaller meaningful units, facilitating the information retrieval and filtering process (Celma and Raimond, 2008). Equally challenging is the task of detecting and isolating the audio advertisements that appear massively in podcast sessions, where a candidate segmentation approach proved quick and accurate (Nguyen et al., 2010).


Implementation

The current work proposes radio-programme-adaptive strategies for audio-broadcast semantic analysis, content description and management automation. Although the main idea is to provide the maximum possible adaptation to a specific radio programme, extension and adaptation to general audio classification and, if possible, to most radio programmes, was considered very important from the very beginning of the project. Hence, a radio-broadcasting show featuring multiple different voices and

Experimental results and discussion

As stated, 5986 training pairs {vW(f), pWL} were used, as shown in Fig. 2 and Table 1. Standard k-fold cross-validation procedures were employed, where the training samples were divided into k = 10 subsets iteratively used for training and generalization performance evaluation (Bishop, 1995, Dimoulas et al., 2008, Kotsakis and Gioltzidou, 2011, Moody et al., 1992). Aiming at better training/generalization, samples of all sub-classes were, as proportionally as possible, divided to
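The proportional division of sub-class samples across the k = 10 folds described above amounts to stratified fold assignment. The sketch below is an illustrative NumPy implementation of per-class round-robin assignment, assuming integer class labels; it is not the paper's actual code.

```python
import numpy as np

def stratified_folds(labels, k=10, seed=0):
    """Assign each sample to one of k folds so that every class is
    spread as proportionally as possible across the folds."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # samples of this class
        rng.shuffle(idx)                      # randomize within the class
        fold[idx] = np.arange(len(idx)) % k   # round-robin over folds
    return fold
```

Each cross-validation iteration then trains on nine folds and evaluates generalization on the held-out one, cycling through all ten.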

Summary and further work

The current work focused on audio-semantic analysis of radio broadcasts using programme-adaptive strategies. A variety of different classification schemes, such as direct, hierarchical and hybrid, have been implemented using ANN training in combination with feature extraction and selection procedures. The tests were conducted using a demanding radio programme as a base study for ground-truth acquisition. Comparisons with other ANS methods showed improved performance, succeeding positive recognition

Acknowledgement

The authors would like to acknowledge the valuable contribution of Dr. Lia-Paschalia Spyridou, including important comments and careful proofreading and correction of the English language and style of this paper.

References (57)

  • J.-L. Gauvain et al., The LIMSI broadcast news transcription system, Speech Commun. (2002)
  • M. Huijbregts et al., Robust speech/non-speech classification in heterogeneous multimedia content, Speech Commun. (2011)
  • P. Kakumanu et al., A comparison of acoustic coding models for speech-driven facial animation, Speech Commun. (2006)
  • T. Kinnunen et al., An overview of text-independent speaker recognition: from features to supervectors, Speech Commun. (2010)
  • R. Koenen, Profiles and levels in MPEG-4: approach and overview, Signal Process. Image Commun. (2000)
  • C. Lee et al., Emotion recognition using a hierarchical binary decision tree approach, Speech Commun. (2011)
  • M. Markaki et al., Discrimination of speech from non-speech in broadcast news based on modulation frequency features, Speech Commun. (2011)
  • A. Sankar et al., Improved modelling and efficiency for automatic transcription of broadcast news, Speech Commun. (2002)
  • F. Stouten et al., Coping with disfluencies in spontaneous speech recognition: acoustic detection and linguistic context manipulation, Speech Commun. (2006)
  • T. Taniguchi et al., Detection of speech and music based on spectral tracking, Speech Commun. (2008)
  • P.C. Woodland, The development of the HTK broadcast news transcription system: an overview, Speech Commun. (2002)
  • J.-D. Wu et al., Speaker identification based on the frame linear predictive coding spectrum technique, Expert Syst. Appl. (2009)
  • Avdelidis, K., Dimoulas, C., Kalliris, G., Papanikolaou, G., 2010a. Adaptive phoneme alignment based on rough set...
  • Avdelidis, K., Dimoulas, C., Kalliris, G., Papanikolaou, G., 2010b. Designing optimal phoneme-wise fuzzy cluster...
  • Avdelidis, K., Dimoulas, C., Kalliris, G., Papanikolaou, G., 2010c. A heuristic text-driven approach for applied...
  • C.M. Bishop, Neural Networks for Pattern Recognition (1995)
  • I. Burnett et al., The MPEG-21 Book (2006)
  • J.J. Burred et al., Hierarchical automatic audio signal classification, J. Audio Eng. Soc. (2004)