Abstract
This paper presents a subword normalized cut (N-cut) approach to automatic story segmentation of Chinese broadcast news (BN). We represent a speech recognition transcript as a weighted undirected graph in which nodes correspond to sentences and edge weights describe inter-sentence similarities. Story segmentation is then formalized as a graph-partitioning problem under the N-cut criterion, which simultaneously minimizes the similarity across partitions and maximizes the similarity within each partition. We measure inter-sentence similarities and perform N-cut segmentation on overlapping n-gram sequences of subword units, namely characters and syllables. The method works at the subword level because subword matching is robust to speech recognition errors and out-of-vocabulary words. Experiments on the TDT2 Mandarin BN corpus show that syllable-bigram-based N-cut achieves the best F1-measure of 0.6911, a relative improvement of 11.52% over the previous word-based N-cut, which has an F1-measure of 0.6197. These results indicate that N-cut at the subword level is more effective than at the word level for story segmentation of noisy Chinese BN transcripts.
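The graph formulation in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the similarity measure, the optimization procedure, or how multiple boundaries are found. The sketch assumes cosine similarity over overlapping character-bigram counts, a single boundary between contiguous sentences, and an exhaustive search over boundary positions. The function names and the toy sentences are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(sentence: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams, ignoring whitespace."""
    chars = "".join(sentence.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(sentences, n=2):
    """Edge weights of the sentence graph: pairwise n-gram similarities."""
    vecs = [char_ngrams(s, n) for s in sentences]
    return [[cosine(u, v) for v in vecs] for u in vecs]


def ncut(w, boundary):
    """Two-way normalized cut with A = sentences[:boundary], B = the rest.

    Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)
    """
    size = len(w)
    left, right = range(boundary), range(boundary, size)
    cut = sum(w[i][j] for i in left for j in right)
    assoc_left = sum(w[i][j] for i in left for j in range(size))
    assoc_right = sum(w[i][j] for i in right for j in range(size))
    if assoc_left == 0 or assoc_right == 0:
        return float("inf")
    return cut / assoc_left + cut / assoc_right


def best_boundary(sentences, n=2):
    """Exhaustively pick the boundary with the smallest Ncut value."""
    w = similarity_matrix(sentences, n)
    return min(range(1, len(sentences)), key=lambda b: ncut(w, b))


# Toy transcript: two sentences about the stock market, two about a typhoon.
sentences = [
    "股市今天大幅上涨",
    "股市上涨带动银行股",
    "台风明天登陆福建",
    "台风登陆带来暴雨",
]
print(best_boundary(sentences))  # index of the first sentence of the second story
```

In this toy transcript the two stories share no character bigrams, so the cut between the second and third sentences has zero weight and is selected. Syllable n-grams could be handled the same way by passing sequences of syllable tokens instead of characters, which would require a pronunciation lexicon that this sketch does not include.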
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, J., Xie, L., Feng, W., Zhang, Y. (2009). A Subword Normalized Cut Approach to Automatic Story Segmentation of Chinese Broadcast News. In: Lee, G.G., et al. Information Retrieval Technology. AIRS 2009. Lecture Notes in Computer Science, vol 5839. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04769-5_12
DOI: https://doi.org/10.1007/978-3-642-04769-5_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04768-8
Online ISBN: 978-3-642-04769-5
eBook Packages: Computer Science (R0)