Elsevier

Information Sciences

Volume 181, Issue 13, 1 July 2011, Pages 2873-2891

On the effectiveness of subwords for lexical cohesion based story segmentation of Chinese broadcast news

https://doi.org/10.1016/j.ins.2011.02.013

Abstract

Story segmentation divides a multimedia stream into homogeneous regions, each addressing a central topic. Lexical cohesion is a reasonable indicator of story boundaries. However, for story segmentation of Chinese broadcast news, directly measuring word-level lexical cohesion is not applicable, because texts transcribed from audio are highly unreliable and the inevitable speech recognition errors may significantly break word cohesion, heavily degrading segmentation performance. To address this problem, we propose to use subword-level cohesion in story segmentation of Chinese broadcast news, because Chinese subwords play important semantic roles and are robust to speech recognition errors. We provide a comprehensive study of the effectiveness of subword units in story segmentation of Chinese speech recognition transcripts, and analyze the influence of recognition errors on segmentation performance. Specifically, we study subword-based TextTiling and lexical chaining approaches to story segmentation, in which lexical cohesion is measured using either character or syllable n-grams (n = 1, 2, 3, 4). Our extensive experiments demonstrate that subword unigrams and bigrams improve performance over word-based methods. For instance, tested on the CCTV corpus, character unigram lexical chaining obtains a relative F1-measure gain of 12% over words on erroneous brief news transcripts (with a word error rate of 40.9%). Generally, we find that subword-based methods often obtain better segmentation than word-based ones for both error-free and erroneous transcripts.

Introduction

With the exponential growth of multimedia data containing speech, such as TV and radio broadcast news, meetings, lectures, voice mails and web-shared videos, the development of automatic methods to semantically access and efficiently manage spoken content has become increasingly important. Speech signals are semantically rich, typically covering subjects, concepts, topics, identities and emotions. Long streams, such as a one-hour broadcast news episode, should ideally be segmented into shorter clips that each represent a specific topic or story. This would allow users to jump swiftly to the start of relevant segments rather than having to search through the whole episode. Story segmentation, which partitions a text, audio or video stream into a sequence of topically coherent segments known as stories [1], aims to fulfill this task. It is an important precursor to various tasks, e.g., topic categorization and tracking, summarization, information extraction, indexing and retrieval, which usually assume the presence of individual topical documents [17], [25]. Manual segmentation requires annotators to work through the entire audio/video stream, which is tedious and costly. The need for automated segmentation approaches has recently become pressing as a result of the huge volume of multimedia data being produced.

Recently, lexical cohesion-based methods have drawn much interest for story segmentation [12], [7], [29], [27], [2], [4]. Lexical cohesion [11] indicates that words in a story (or topic) hang together through semantic relations, and that different stories tend to employ different sets of words. The TextTiling method [12] computes a word similarity measure across the text and places story boundaries at minima of lexical similarity. The lexical chaining method [29] chains up related words in a text and declares a story boundary where there is a high concentration of chain starts and ends.
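The TextTiling idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the configuration used in [12] or in this paper: the stream is assumed to be already tokenized (into words or, as proposed here, subword units), fixed-size token blocks stand in for Hearst's pseudo-sentences, and `block_size`, `threshold` and the function names are hypothetical choices.

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two bag-of-token count vectors."""
    dot = sum(v * b[t] for t, v in a.items())  # missing keys in a Counter count as 0
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def texttiling_boundaries(tokens, block_size=4, threshold=0.5):
    """Return token indices where a new story is hypothesized to start.

    block_size and threshold are illustrative, untuned settings.
    """
    gaps = list(range(block_size, len(tokens) - block_size + 1))
    # Similarity between the block just before and the block just after each gap.
    sims = [cosine(Counter(tokens[g - block_size:g]), Counter(tokens[g:g + block_size]))
            for g in gaps]
    boundaries = []
    for i, s in enumerate(sims):
        # Keep only local minima of the similarity curve.
        if (i > 0 and sims[i - 1] < s) or (i + 1 < len(sims) and sims[i + 1] < s):
            continue
        # Depth score: climb to the nearest peak on each side of the minimum.
        j = i
        while j > 0 and sims[j - 1] >= sims[j]:
            j -= 1
        k = i
        while k + 1 < len(sims) and sims[k + 1] >= sims[k]:
            k += 1
        depth = (sims[j] - s) + (sims[k] - s)
        if depth >= threshold:
            boundaries.append(gaps[i])
    return boundaries


# Invented toy stream: a sports story followed by a stock-market story,
# represented as character unigrams.
story1 = ["球", "赛", "球", "队", "球", "赛", "球", "队"]
story2 = ["股", "市", "股", "涨", "股", "市", "股", "涨"]
print(texttiling_boundaries(story1 + story2))  # [8]
```

Because the similarity is computed on whatever units are passed in, the same function works unchanged for word, character or syllable tokens; only the tokenization step differs.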

Traditionally, lexical cohesion-based story segmentation has been studied at word level. In this paper, we perform a comprehensive study on subword-based approaches to story segmentation of Chinese broadcast news. Our motivations are twofold.

  1. First, unlike Western languages, Chinese is character-based and monosyllabic [45]. Chinese subwords, e.g., characters and syllables, play important semantic roles. This latent effectiveness of subwords warrants an investigation of subword-level lexical cohesion for story segmentation of Chinese broadcast news.

  2. Second, story segmentation of broadcast news is performed largely on erroneous textual transcripts. Previous approaches measure word relations on inaccurate texts transcribed from audio by a speech recognizer, without any error compensation. However, speech recognition errors are known to break lexical cohesion among words, leading to performance degradation. Our previous preliminary study showed that measuring lexical cohesion with subword units holds much promise for solving this problem [42]. Subword units may be robust to speech recognition errors because they allow partial matching: a misrecognized word may still contain some correctly recognized subword units, so matching at the subword level can recover word relations in noisy transcripts. However, the effectiveness of subwords for Chinese story segmentation calls for a comprehensive study using different lexical methods, data sets from different sources and different speech recognition error rates.
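The partial-matching argument can be illustrated with a toy example. The sketch below is a hypothetical illustration, not the paper's implementation: it assumes an invented recognition error in which the topic word 经济 (economy) is output as the homophone 经季 (both pronounced jingji), uses a tiny hand-written toneless pinyin table (`PINYIN`) as a stand-in for a real pronunciation lexicon, and forms character and syllable n-grams across word boundaries, which is only one of several possible choices.

```python
from collections import Counter

# Hypothetical toneless pinyin table covering only the characters used below;
# a real system would use a full lexicon and handle polyphonic characters.
PINYIN = {"中": "zhong", "国": "guo", "经": "jing", "济": "ji", "增": "zeng",
          "长": "zhang", "季": "ji", "形": "xing", "势": "shi"}


def ngrams(units, n):
    """Counter of n-grams (as tuples) over a sequence of units."""
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


def overlap(a, b):
    """Size of the multiset intersection of two n-gram counters."""
    return sum((a & b).values())


def units(tokens, level):
    """Turn a word-segmented sentence into word, character or syllable units."""
    if level == "word":
        return list(tokens)
    chars = list("".join(tokens))  # n-grams may span word boundaries in this sketch
    return chars if level == "char" else [PINYIN[c] for c in chars]


# Invented example: the second sentence repeats 经济, misrecognized as 经季.
sent1 = ["中国", "经济", "增长"]
sent2 = ["经季", "形势"]

for level in ("word", "char", "syllable"):
    for n in (1, 2):
        print(level, n, overlap(ngrams(units(sent1, level), n),
                                ngrams(units(sent2, level), n)))
```

On this example the word-level overlap is 0 for both unigrams and bigrams, character unigrams recover one shared unit (经), and syllable unigrams and bigrams recover 2 and 1 shared units, since the homophone error leaves the pronunciation intact.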

Therefore, in this paper, we present an extensive study on the effectiveness of various subword representations in Chinese story segmentation. Our experiments with two popular methods, two Mandarin corpora (TDT2 and CCTV) and transcripts with different speech recognition error rates demonstrate that Chinese subwords can achieve considerable performance gains over words in lexical cohesion-based story segmentation of Chinese broadcast news, on both error-free and erroneous transcripts.
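Speech recognition error rates such as the word error rate quoted in the abstract are conventionally computed as the Levenshtein (edit) distance between the recognizer output and the manual reference, divided by the reference length; the same formula applies to characters or syllables. The sketch below shows that standard computation on an invented example, with the hypothetical function name `error_rate`. It also shows why subwords can help: a single wrong character costs a whole word at the word level.

```python
def error_rate(ref, hyp):
    """Edit distance between two unit sequences divided by len(ref).

    Assumes a non-empty reference sequence.
    """
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,              # deletion of r
                         cur[j - 1] + 1,           # insertion of h
                         prev[j - 1] + (r != h))   # substitution, or match at no cost
        prev = cur
    return prev[-1] / len(ref)


ref = ["中国", "经济", "增长"]  # invented reference transcript (words)
hyp = ["中国", "经季", "增长"]  # invented ASR output with one wrong character
print(error_rate(ref, hyp))                    # word error rate: 1/3
print(error_rate("".join(ref), "".join(hyp)))  # character error rate: 1/6
```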

The rest of this paper is organized as follows. Section 2 briefly surveys related work. Section 3 describes the TextTiling and lexical chaining methods for story segmentation. Section 4 studies the robustness of Chinese subwords and presents subword-based story segmentation approaches. Section 5 reports our experiments and analyzes the results. Finally, conclusions are drawn in Section 6.

Section snippets

Related work

Automatic story segmentation of multimedia documents is a challenging task. Text documents are often clearly organized into titles, sentences and paragraphs, with typographic cues such as punctuation and capitalization. However, spoken or video documents lack such structural and typographic cues. Previous efforts on multimedia segmentation focus on three categories of cues: visual cues such as the presence of an anchor face [14] and motion changes [13], audio cues such as significant pauses and

Lexical cohesion

Lexical cohesion describes how a text with a central topic is built from words with related meanings, which hang together as a whole through cohesive relations [11]. Major lexical cohesion relations include word repetition, synonymy/antonymy, specialization/generalization, and part/whole relations. Some examples are shown in Table 1. Among these relations, repetition is a strong and frequently used cohesion indicator.
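Because repetition is the strongest and most frequent relation, a repetition-only lexical chainer is a natural baseline. The sketch below is a simplified illustration of the chain start/end idea behind the lexical chaining method [29], not that method's exact algorithm. It assumes sentence-segmented, tokenized input and a hypothetical hiatus `max_gap` (in sentences) beyond which a chain breaks; each gap between sentences is scored by the chains that start just after it plus the chains that end just before it.

```python
from collections import defaultdict


def build_chains(sentences, max_gap=2):
    """Repetition chains as (first_sentence, last_sentence) spans.

    A chain for a token breaks when two successive occurrences are more than
    max_gap sentences apart (max_gap is a hypothetical hiatus setting).
    """
    positions = defaultdict(list)
    for idx, sent in enumerate(sentences):
        for tok in set(sent):
            positions[tok].append(idx)
    chains = []
    for occ in positions.values():
        start = prev = occ[0]
        for p in occ[1:]:
            if p - prev > max_gap:
                chains.append((start, prev))
                start = p
            prev = p
        chains.append((start, prev))
    # A token seen in a single sentence gives no cohesion evidence.
    return [(s, e) for s, e in chains if e > s]


def gap_scores(sentences, max_gap=2):
    """scores[g] = chains starting at sentence g + chains ending at sentence g-1."""
    scores = [0] * len(sentences)
    for start, end in build_chains(sentences, max_gap):
        scores[start] += 1        # chain start: gap just before `start`
        if end + 1 < len(sentences):
            scores[end + 1] += 1  # chain end: gap just after `end`
    scores[0] = 0                 # the document start is not a candidate boundary
    return scores


# Invented toy input: three sports sentences, then three finance sentences.
sentences = [["球队", "比赛"], ["球队", "教练"], ["比赛", "教练"],
             ["股市", "银行"], ["股市", "利率"], ["银行", "利率"]]
print(gap_scores(sentences))  # [0, 1, 1, 4, 1, 1]: the peak at gap 3 marks the boundary
```

Replacing the word tokens with character or syllable n-grams turns this into a subword-based chainer without changing the chaining logic.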

A large body of research [12], [7], [29] has shown that lexical cohesion

Subword lexical cohesion approaches

Lexical cohesion-based story segmentation approaches usually involve word matching, e.g., word frequency counts in the sentence similarity measure of TextTiling [12] and connecting word repetitions in lexical chaining [29]. However, speech recognition errors induce severe word matching failures. At the subword level, we can conduct partial matching or “sound-like” matching that partially recovers the relations among words. This merit is especially important for Chinese due to its special characteristics. In

TDT2

The Topic Detection and Tracking Phase 2 (TDT2) Mandarin corpus, released by the LDC, contains about 53 h of Mandarin broadcast news audio from Voice of America. The 177 VOA recordings span from February to June 1998 and are accompanied by manually annotated metadata, including story boundaries, manual word transcripts (namely TDT2-Ref) and LVCSR transcripts (namely TDT2-LVCSR). The TDT2 audio was transcribed by the Dragon LVCSR system with word, character and

Conclusions

In this paper, we have proposed to use Chinese subword representations, i.e., character and syllable n-gram units, in lexical cohesion-based story segmentation of Chinese broadcast news transcripts. Unlike in Western languages, characters and syllables in Chinese play important semantic roles. Subwords are robust to speech recognition errors and can recover lexical cohesion in erroneous text via partial matching. We have studied the merits of Chinese subwords and performed an investigation on

Acknowledgements

This work was partially supported by a grant from the Research Grants Council of Hong Kong, China (CityU 118608), CityU Grant 7008026 and a grant from the National Natural Science Foundation of China (60802085).

References (45)

  • S. Banerjee, I.A. Rudnicky, A TextTiling based approach to topic boundary detection in meetings, in: Interspeech:...
  • D. Beeferman et al., Statistical models for text segmentation, Machine Learning (1999).
  • S.K. Chan, L. Xie, H. Meng, Modeling the statistical behavior of lexical chains to capture word cohesiveness for...
  • B. Chen et al., Discriminating capabilities of syllable-based features and approaches of utilizing them for voice retrieval of speech information in Mandarin Chinese, IEEE Transactions on Speech and Audio Processing (2002).
  • B. Chen, H.M. Wang, L.S. Lee, Spoken document retrieval and summarization, in: Advances in Chinese Spoken Language...
  • F.Y.Y. Choi, Advances in domain independent linear text segmentation, in: Human Language Technology Conference – North...
  • S. Dharanipragada, M. Franz, J. McCarley, S. Roukos, T. Ward, Story segmentation and topic detection in the broadcast...
  • J.S. Garofolo, C.G.P. Auzanne, E.M. Voorhees, The TREC spoken document retrieval track: a success story, in: Text...
  • M. Halliday et al., Cohesion in English (1976).
  • M.A. Hearst, TextTiling: segmenting text into multi-paragraph subtopic passages, Computational Linguistics (1997).
  • W. Hsu, S.F. Chang, A statistical framework for fusing mid-level perceptual features in news story segmentation, in:...
  • W. Hsu, S.F. Chang, C.W. Huang, L. Kennedy, C.Y. Lin, G. Iyengar, Discovery and fusion of salient multi-modal features...