Abstract:
Many problems in natural language processing (NLP) can be cast as sequence segmentation. In this article, we combine semi-Markov conditional random fields (semi-CRFs) with neural networks to solve NLP segmentation problems. We focus on the segment representation in the neural semi-CRF, which is crucial to performance. Building on our preliminary work (Liu et al. [1]), we represent a segment both by encoding its subsequence and by embedding the segment string. We conduct a systematic study of the utility of various components in subsequence encoding and propose a method for constructing and deriving segment string embeddings. We carry out extensive experiments on three typical segmentation problems: shallow syntactic parsing, named entity recognition, and Chinese word segmentation. The results show that a concatenation network achieves subsequence encoding on par with previous work while running three times faster. They also show that segment string embeddings help our neural semi-CRF model achieve a macro-averaged error reduction of 13.15% over a strong baseline that uses deep contextualized embeddings and a bidirectional long short-term memory CRF, which also demonstrates the usefulness of semi-CRFs even with contextualized embeddings. These results are competitive with state-of-the-art segmentation systems.
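To make the semi-CRF setting concrete, the sketch below shows standard semi-CRF decoding: a Viterbi-style dynamic program over labeled segments rather than individual tokens. This is a minimal illustration of the general technique, not the authors' implementation; `segment_score` is a hypothetical stand-in for the paper's learned segment representation (subsequence encoding plus segment string embedding) fed through a scoring layer.

```python
def semi_crf_decode(n, labels, segment_score, max_len=4):
    """Return the highest-scoring segmentation of a length-n sequence.

    A segmentation is a list of (start, end, label) triples covering [0, n),
    with end exclusive and segment length end - start <= max_len.
    segment_score(i, j, label) scores the segment spanning positions [i, j).
    """
    NEG_INF = float("-inf")
    best = [NEG_INF] * (n + 1)   # best[j]: best score of segmenting prefix [0, j)
    best[0] = 0.0
    back = [None] * (n + 1)      # back[j]: (start, label) of the last segment
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            for lab in labels:
                s = best[i] + segment_score(i, j, lab)
                if s > best[j]:
                    best[j] = s
                    back[j] = (i, lab)
    # Recover the segments by walking the back pointers from position n.
    segs = []
    j = n
    while j > 0:
        i, lab = back[j]
        segs.append((i, j, lab))
        j = i
    return list(reversed(segs))
```

Because segments are scored as units, features of the whole span (such as a segment string embedding) can enter the score directly, which token-level BiLSTM-CRF decoding cannot do; the cost is the extra inner loop over segment lengths up to `max_len`.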
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume 28)