Hostname: page-component-8448b6f56d-t5pn6 Total loading time: 0 Render date: 2024-04-24T11:51:30.239Z Has data issue: false hasContentIssue false

Finite state segmentation of discourse into clauses

Published online by Cambridge University Press:  01 December 1996

EVA EJERHED
Affiliation:
Department of Linguistics, University of Umeå, S-90187 Umeå, Sweden. e-mail: ejerhed@ling.umu.se

Abstract

The paper presents background and motivation for a processing model that segments discourse into units that are simple, non-nested clauses, prior to the recognition of clause internal phrasal constituents, and experimental results in support of this model. One set of results is derived from a statistical reanalysis of the Swedish empirical data in Strangert, Ejerhed and Huber 1993 concerning the linguistic structure of major prosodic units. The other set of results is derived from experiments in segmenting part of speech annotated Swedish text corpora into clauses, using a new clause segmentation algorithm. The clause segmented corpus data is taken from the Stockholm Umeå Corpus (SUC), 1 M words of Swedish texts from different genres, part of speech annotated by hand, and from the Umeå corpus DAGENS INDUSTRI 1993 (DI93), 5 M words of Swedish financial newspaper text, processed by fully automatic means consisting of tokenizing, lexical analysis, and probabilistic POS tagging. The results of these two experiments show that the proposed clause segmentation algorithm is 96% correct when applied to manually tagged text, and 91% correct when applied to probabilistically tagged text.

Type
Research Article
Copyright
1997 Cambridge University Press

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)