On the effectiveness of subwords for lexical cohesion based story segmentation of Chinese broadcast news

doi:10.1016/j.ins.2011.02.013

Information Sciences

Volume 181, Issue 13, 1 July 2011, Pages 2873-2891

https://doi.org/10.1016/j.ins.2011.02.013

Abstract

Story segmentation divides a multimedia stream into homogeneous regions, each addressing a central topic. Lexical cohesion is a reasonable indicator of story boundaries. However, for story segmentation of Chinese broadcast news, directly measuring word-level lexical cohesion is not applicable, because the texts transcribed from audio are highly unreliable and the inevitable speech recognition errors may significantly break word cohesion, thus heavily degrading segmentation performance. To address this problem, we propose to use subword-level cohesion in story segmentation of Chinese broadcast news, because Chinese subwords play important semantic roles and are robust to speech recognition errors. We provide a comprehensive study of the effectiveness of subword units in story segmentation of Chinese speech recognition transcripts, and analyze the influence of recognition errors on segmentation performance. Specifically, we study subword-based TextTiling and lexical chaining approaches to story segmentation, in which lexical cohesion is measured using either character or syllable n-grams (n = 1, 2, 3, 4). Our extensive experiments demonstrate performance improvements of subword unigrams and bigrams over word-based methods. For instance, tested on the CCTV corpus, character unigram lexical chaining obtains a relative F1-measure gain of 12% over words on erroneous brief news transcripts (with a word error rate of 40.9%). Generally, we find that subword-based methods can often obtain better segmentation than word-based ones for both error-free and erroneous transcripts.

Introduction

With the exponential growth of multimedia data containing speech, such as TV and radio broadcast news, meetings, lectures, voice mails and web-sharing videos, the development of automatic methods to semantically access and efficiently manage spoken content has become increasingly important. Speech signals are semantically rich and usually cover subjects, concepts, topics, identities and emotions. For long streams such as a one-hour broadcast news episode, it is desirable to segment them into shorter clips that represent specific topics or stories. This would ideally allow users to jump directly to the start of relevant segments rather than search through the whole episode. Story segmentation aims to fulfill this task by partitioning a text, audio or video stream into a sequence of topically coherent segments known as stories [1]. It is an important precursor because various tasks, e.g., topic categorization and tracking, summarization, information extraction, indexing and retrieval, usually assume the presence of individual topical documents [17], [25]. Manual segmentation requires annotators to work through the entire audio/video stream, which is tedious and costly. The need for automated segmentation approaches has become pressing as a result of the huge volume of multimedia data being produced.

Recently, lexical cohesion-based methods have drawn much interest for story segmentation [12], [7], [29], [27], [2], [4]. Lexical cohesion [11] means that words in a story (or topic) hang together through semantic relations, while different stories tend to employ different sets of words. The TextTiling method [12] computes a word similarity measure across the text and takes the minima of lexical similarity as story boundaries. The lexical chaining method [29] chains up related words in a text and declares a story boundary where chain starts and ends are highly concentrated.
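The TextTiling idea described above can be illustrated with a minimal Python sketch. It is not the implementation used in the paper or in [12]: the function names, the block size `window`, the fixed `threshold` and the toy English tokens are all assumptions made for illustration. Each sentence is a list of lexical units (words, characters or syllables), adjacent blocks are compared by cosine similarity, and a gap is reported as a boundary when its similarity valley is deep enough.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def gap_similarities(sentences, window=2):
    """Similarity between the blocks left and right of every sentence gap."""
    sims = []
    for gap in range(1, len(sentences)):
        left = Counter(t for s in sentences[max(0, gap - window):gap] for t in s)
        right = Counter(t for s in sentences[gap:gap + window] for t in s)
        sims.append(cosine(left, right))
    return sims

def depth_scores(sims):
    """Depth of each valley: how far similarity rises on both sides of a gap."""
    depths = []
    for i, s in enumerate(sims):
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1
        right = i
        while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
            right += 1
        depths.append((sims[left] - s) + (sims[right] - s))
    return depths

def boundaries(sentences, window=2, threshold=0.5):
    """Indices i such that a story boundary is placed before sentence i.

    The fixed threshold is an assumption of this sketch.
    """
    depths = depth_scores(gap_similarities(sentences, window))
    return [i + 1 for i, d in enumerate(depths) if d >= threshold]

# Toy input (hypothetical): two three-sentence "stories".
toy = [["stock", "market"], ["market", "price"], ["stock", "price"],
       ["rain", "storm"], ["storm", "wind"], ["rain", "wind"]]
print(boundaries(toy))
```

Because the sentences are just lists of units, the same sketch applies unchanged to character or syllable n-grams in place of words.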

Traditionally, lexical cohesion-based story segmentation has been studied at word level. In this paper, we perform a comprehensive study on subword-based approaches to story segmentation of Chinese broadcast news. Our motivations are twofold.

1. First, unlike Western languages, Chinese is character-based and monosyllabic [45]. Chinese subwords, e.g., characters and syllables, play important semantic roles. The latent effectiveness of subwords warrants an investigation of lexical cohesion-based story segmentation of Chinese broadcast news.
2. Second, story segmentation of broadcast news is performed largely on erroneous textual transcripts. Previous approaches measure word relations on inaccurate texts transcribed from audio (via a speech recognizer) and do not apply any error compensation. However, it is known that speech recognition errors break lexical cohesion among words, leading to performance degradation. Our preliminary study shows that measuring lexical cohesion on subword units holds much promise for solving the problem [42]. Subword units may be robust to speech recognition errors because they allow partial matching: a misrecognized word may still contain some correctly recognized subword units, so matching at the subword level can recover word relations in noisy transcripts. However, the effectiveness of subwords for Chinese story segmentation deserves a comprehensive study using different lexical methods, data sets from different sources and different speech recognition error rates.
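The partial-matching argument in the second motivation can be made concrete with a small Python sketch. The word pair below is a hypothetical recognition error chosen for illustration, not an example taken from the paper's corpora, and the toneless pinyin syllables are written by hand rather than produced by any tool. The helper names `ngrams` and `overlap` are likewise assumptions.

```python
from collections import Counter

def ngrams(units, n):
    """All contiguous n-grams over a sequence of units (words, characters or syllables)."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

def overlap(a, b, n):
    """Number of n-grams shared by two unit sequences (multiset intersection)."""
    return sum((Counter(ngrams(a, n)) & Counter(ngrams(b, n))).values())

# Hypothetical error: the reference word is misrecognized as a similar-sounding word.
ref_word, hyp_word = "北京大学", "北京大雪"

# Word level: the two tokens differ, so nothing matches.
word_match = overlap([ref_word], [hyp_word], 1)

# Character level: the correctly recognized characters still match.
char_match = {n: overlap(list(ref_word), list(hyp_word), n) for n in (1, 2, 3, 4)}

# Syllable level (hand-written toneless pinyin, an assumption of this sketch):
# both words map to the same syllable sequence, so every unit matches.
ref_syl = ["bei", "jing", "da", "xue"]
hyp_syl = ["bei", "jing", "da", "xue"]
syl_match = overlap(ref_syl, hyp_syl, 1)

print(word_match, char_match, syl_match)
```

In this toy case the number of shared character n-grams falls as n grows, while the word-level comparison finds no match at all.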

Therefore, in this paper, we conduct an extensive study of the effectiveness of various subword representations in Chinese story segmentation. Our experimental study on two popular methods, two Mandarin corpora (TDT2 and CCTV) and transcripts with different speech recognition error rates demonstrates that Chinese subwords can achieve considerable performance gains over words in lexical cohesion-based story segmentation of Chinese broadcast news, in both error-free and erroneous transcripts.

The rest of this paper is organized as follows. Section 2 briefly surveys related work. Section 3 describes the TextTiling and lexical chaining methods for story segmentation. In Section 4, we study the robustness of Chinese subwords and subword-based story segmentation approaches. Section 5 presents our experiments and an analysis of the results. Finally, conclusions are drawn in Section 6.

Section snippets

Related work

Automatic story segmentation of multimedia documents is a challenging task. Text documents are often clearly organized with titles, sentences and paragraphs via typographic cues, e.g., punctuation and capitalization. However, spoken or video documents do not have such structural or typographic cues. Previous efforts on multimedia segmentation focus on three categories of cues: visual cues such as presence of an anchor face [14] and motion changes [13], audio cues such as significant pauses and

Lexical cohesion

Lexical cohesion describes how a text with a central topic is built from words with related meanings that hang together as a whole through cohesive relations [11]. Major lexical cohesion relations include word repetition, synonymy/antonymy, specialization/generalization, and part/whole relations. Some examples are shown in Table 1. Among these relations, repetition is a strong, frequently used cohesion indicator.
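Since repetition is the strongest of these relations, a repetition-only version of lexical chaining can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the formulation of [29] or of this paper: the names `build_chains` and `boundary_strength`, the `max_gap` parameter and the toy tokens are hypothetical. A chain links repetitions of the same unit that lie at most `max_gap` sentences apart, and the strength of a candidate boundary is the number of chains ending just before it plus the number starting just after it.

```python
from collections import defaultdict

def build_chains(sentences, max_gap=2):
    """Return (term, start, end) chains linking repetitions at most max_gap sentences apart."""
    positions = defaultdict(list)
    for idx, sent in enumerate(sentences):
        for term in set(sent):
            positions[term].append(idx)
    chains = []
    for term, idxs in positions.items():
        start = prev = idxs[0]
        for i in idxs[1:]:
            if i - prev > max_gap:      # repetition too far away: close the chain
                chains.append((term, start, prev))
                start = i
            prev = i
        chains.append((term, start, prev))
    return chains

def boundary_strength(sentences, max_gap=2):
    """strength[i] = chains ending at sentence i-1 plus chains starting at sentence i."""
    n = len(sentences)
    strength = [0] * n                  # strength[0] is unused: no gap before sentence 0
    for _, start, end in build_chains(sentences, max_gap):
        if end + 1 < n:
            strength[end + 1] += 1
        if start > 0:
            strength[start] += 1
    return strength

# Toy input (hypothetical): two three-sentence "stories".
toy = [["stock", "market"], ["market", "price"], ["stock", "price"],
       ["rain", "storm"], ["storm", "wind"], ["rain", "wind"]]
strength = boundary_strength(toy)
print(strength)
print(strength.index(max(strength)))    # gap with the highest concentration of chain starts and ends
```

As with the other sketches, the units in each sentence may be words, characters or syllable n-grams; only the tokenization changes.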

Plenty of research [12], [7], [29] has shown that lexical cohesion

Subword lexical cohesion approaches

Lexical-based story segmentation approaches usually involve word matching, e.g., word frequency counts in the sentence similarity measure of TextTiling [12] and the linking of word repetitions in lexical chaining [29]. However, speech recognition errors induce severe word matching failures. At the subword level, we can conduct partial matching or “sound-like” matching that can partially recover the relations among words. This merit is especially important for Chinese due to its special characteristics. In

TDT2

The Topic Detection and Tracking Phase 2 (TDT2) Mandarin corpus was released by the LDC and contains about 53 h of Mandarin broadcast news audio from Voice of America (VOA). The 177 VOA recordings span February to June 1998 and are accompanied by manually annotated metadata including story boundaries, manual word transcripts (namely TDT2-Ref) and LVCSR transcripts (namely TDT2-LVCSR). The TDT2 audio was transcribed by the Dragon LVCSR system with word, character and

Conclusions

In this paper, we have proposed to use Chinese subword representations, i.e., character and syllable n-gram units, in lexical cohesion-based story segmentation of Chinese broadcast news transcripts. Unlike in Western languages, Chinese characters and syllables play important semantic roles. Subwords are robust to speech recognition errors and can recover lexical cohesion in erroneous text via partial matching. We have studied the merits of Chinese subwords and performed an investigation on

Acknowledgements

This paper was partially supported by a Grant from the Research Grants Council of Hong Kong, China (CityU 118608), CityU Grant 7008026 and a Grant from the National Natural Science Foundation of China (60802085).

References (45)

G. Fu et al., Chinese word segmentation as morpheme-based lexical chunking, Information Sciences (2008)

K. Ng et al., Subword-based approaches for spoken document retrieval, Speech Communication (2000)

Y. Ni et al., Minimizing the expected complete influence time of a social network, Information Sciences (2010)

H.-J. Oh et al., Semantic passage segmentation based on sentence topics for question answering, Information Sciences (2007)

X.-Z. Wang et al., Training t-s norm neural networks to refine weights for fuzzy if-then rules, Neurocomputing (2007)

X.-Z. Wang et al., Learning fuzzy rules from fuzzy examples based on rough set techniques, Information Sciences (2007)

X.-Z. Wang et al., Induction of multiple fuzzy decision trees based on rough set technique, Information Sciences (2008)

L. Xie et al., A coupled HMM approach for video-realistic speech animation, Pattern Recognition (2007)

J. Zeng et al., Cascade Markov random fields for stroke extraction of Chinese characters, Information Sciences (2010)

S. Banerjee, I.A. Rudnicky, A TextTiling based approach to topic boundary detection in meetings, in: Interspeech:...

D. Beeferman et al., Statistical models for text segmentation, Machine Learning (1999)

S.K. Chan, L. Xie, H. Meng, Modeling the statistical behavior of lexical chains to capture word cohesiveness for...

B. Chen et al., Discriminating capabilities of syllable-based features and approaches of utilizing them for voice retrieval of speech information in Mandarin Chinese, IEEE Transactions on Speech and Audio Processing (2002)

B. Chen, H.M. Wang, L.S. Lee, Spoken document retrieval and summarization, in: Advance in Chinese Spoken Language...

F.Y.Y. Choi, Advances in domain independent linear text segmentation, in: Human Language Technology Conference – North...

S. Dharanipragada, M. Franz, J. Mccarley, S. Roukos, T. Ward, Story segmentation and topic detection in the broadcast...

J.S. Garofolo, C.G.P. Auzanne, E.M. Voorhees, The TREC spoken document retrieval track: a success story, in: Text...

M. Halliday et al., Cohesion in English (1976)

M.A. Hearst, TextTiling: segmenting text into multi-paragraph subtopic passages, Computational Linguistics (1997)

W. Hsu, S.F. Chang, A statistical framework for fusing mid-level perceptual features in news story segmentation, in:...

W. Hsu, S.F. Chang, C.W. Huang, L. Kennedy, C.Y. Lin, G. Iyengar, Discovery and fusion of salient multi-modal features...

