Combining hidden Markov models and latent semantic analysis for topic segmentation and labeling: Method and clinical application

https://doi.org/10.1016/j.ijmedinf.2009.02.003

Abstract

Motivation

Topic segmentation and labeling systems enable fine-grained information search. However, previously proposed methods require annotated data to adapt to different information needs and have limited applicability to texts with short segment length.

Methods

We introduce an unsupervised method based on a combination of hidden Markov models and latent semantic analysis which allows the topics of interest to be defined freely, without the need for data annotation, and can identify short segments.

Results

The method is evaluated on intensive care nursing narratives and motivated by information needs in this domain. The method is shown to considerably outperform a keyword-based heuristic baseline and to achieve a level of performance comparable to that of a related supervised method trained on 3600 manually annotated words.

Introduction

Topic segmentation (TS) and labeling systems enable fine-grained information search. We have previously applied a TS and labeling method, a supervised hidden Markov model (HMM), to Finnish intensive care unit (ICU) nursing narratives [1]. The problem was to automatically divide text into topically coherent segments with respect to pre-determined, repeatedly discussed topics to support information access and clinical decision-making (see Fig. 1). This type of structure has been empirically shown to increase the information search speed of clinicians [2]. However, this approach requires topic-labeled training data to induce the HMM; consequently, the TS topics cannot be changed without additional annotation effort tailored to the new information need.

In this paper, we introduce a TS and labeling method, where the topics are not fixed in advance but are provided by the user as freely chosen keywords (e.g., breathing or hemodynamics). It combines latent semantic analysis (LSA) with a graphical model closely related to HMMs and is unsupervised in the sense of not requiring labeled training data. This allows the topics of interest to be easily changed: the user simply specifies new keywords.

The applicability of existing TS methods is limited in our case. First, to allow a free, ad hoc choice of topics, we require an unsupervised approach. A TS problem is commonly solved in an unsupervised manner by analyzing the similarity (e.g., first uses of words, word co-occurrence, repetition or semantic relations) of the text before and after a proposed segment boundary (see, e.g. [3], [4]); a sudden drop in similarity indicates a likely change in topic. However, these techniques typically do not allow the topics of interest to be specified in advance, and methods that consider pre-specified topics (see, e.g. [5], [6], [7]) tend to be supervised. Second, the ICU narratives are characterized by extremely short segments: a single sentence may contain several topic changes, and the average topic length is only 18 tokens. Existing unsupervised TS methods require considerably longer segments (e.g., the TextTiling method [3] searches for topic boundaries between contexts of 200 tokens), and those specifically designed for short segments (see, e.g. [8], [9]) do not consider pre-specified topics. Third, our method is specifically designed for applications where almost all documents contain relevant information about the topics of interest.
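
The similarity-drop idea behind such unsupervised approaches can be illustrated with a minimal sketch. It is not TextTiling itself, nor any of the cited methods: the bag-of-words cosine measure, the window size of 3 tokens, the function names and the toy token sequence are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def gap_similarities(tokens, window=3):
    """Similarity of the text before and after each candidate boundary."""
    sims = []
    for gap in range(window, len(tokens) - window + 1):
        left = Counter(tokens[gap - window:gap])
        right = Counter(tokens[gap:gap + window])
        sims.append((gap, cosine(left, right)))
    return sims

def likely_boundary(tokens, window=3):
    """Return the gap position with the lowest before/after similarity."""
    return min(gap_similarities(tokens, window), key=lambda pair: pair[1])[0]

# Toy sequence (hypothetical): four breathing-related tokens, then four
# hemodynamics-related tokens.
tokens = ["oxygen", "ventilator", "oxygen", "ventilator",
          "pulse", "pressure", "pulse", "pressure"]
print(likely_boundary(tokens))  # position of the sharpest similarity drop
```

The sketch also hints at why short segments are difficult for this family of methods: the window on each side of a candidate boundary must fit inside a single segment for the similarity drop to be visible.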

Although ICU narratives motivated the development of the method, it is a general TS and labeling technique that we believe has the potential to support ad hoc information needs in many other application domains as well. For instance, the method could be applicable in more general information retrieval tasks such as rhetorical zone detection [10].

Clinical data

The dataset used in this study consists of nursing notes of 516 adult ICU patients. These Finnish patient-specific records are written during every shift mainly for intra-unit information exchange. The dataset consists of 17,140 nursing shifts.

We apply a simple domain-adapted tokenizer, obtaining 1.2 million tokens (including

Method

We first recall the basic notions of LSA and HMMs and then introduce the unsupervised TS and labeling method, which is based on their combination. The main insight of the proposed method is that the LSA similarity of words to the given topic keywords can be used to replace HMM emission probabilities. Whereas a supervised HMM requires labeled data to estimate the emission probabilities, the unsupervised method only requires a single keyword for each topic.
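
The main insight can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the formulation used in the paper: the word vectors are hypothetical toy values standing in for LSA vectors, and the similarity floor, the normalization of similarities into emission scores, and the fixed "stay" transition probability are all choices made here for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def emission_scores(tokens, keywords, vectors, floor=1e-3):
    """Per-token topic scores from similarity to each topic keyword.
    Assumption: similarities are floored and normalized to sum to 1."""
    rows = []
    for token in tokens:
        if token in vectors:
            sims = [max(cosine(vectors[token], vectors[k]), floor) for k in keywords]
        else:
            sims = [floor] * len(keywords)  # unknown word: no topic preference
        total = sum(sims)
        rows.append([s / total for s in sims])
    return rows

def viterbi(emissions, stay=0.8):
    """Most likely topic sequence; `stay` (assumed) favors topic continuity."""
    n = len(emissions[0])
    switch = (1.0 - stay) / (n - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(n)]
                 for i in range(n)]
    score = [math.log(1.0 / n) + math.log(e) for e in emissions[0]]
    backpointers = []
    for row in emissions[1:]:
        prev, score, pointers = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][j] + math.log(row[j]))
        backpointers.append(pointers)
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical 2-dimensional "LSA" vectors, for illustration only.
vectors = {"breathing": [1.0, 0.0], "hemodynamics": [0.0, 1.0],
           "ventilator": [0.9, 0.1], "oxygen": [0.8, 0.2],
           "pressure": [0.1, 0.9], "pulse": [0.2, 0.8]}
keywords = ["breathing", "hemodynamics"]
tokens = ["ventilator", "and", "oxygen", "pressure", "pulse"]
labels = [keywords[s] for s in viterbi(emission_scores(tokens, keywords, vectors))]
print(list(zip(tokens, labels)))
```

In this toy run the uninformative token "and" receives the label of its neighbors because the transition scores favor staying in the same topic, which is the role the sequence model plays on top of the per-word similarities.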

Evaluation

We evaluate the proposed method on manually annotated data (see Section 2) randomly selected from 135 patient reports and divided among training (198 shifts) and testing (204 shifts). If two shifts report on the same patient, both are placed either in the training set or in the test set. To deal with the highly inflective nature of Finnish, we lemmatize the text using the FinTWOL Finnish morphological analyzer [16] in all experiments. Our version has a lexicon extended
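
The patient-level constraint on the split described above can be sketched as follows. The function name, the `(patient_id, shift_text)` record layout, the test fraction and the random seed are assumptions for illustration; the excerpt does not state how the actual assignment was carried out beyond keeping each patient's shifts together.

```python
import random
from collections import defaultdict

def split_by_patient(shifts, test_fraction=0.5, seed=0):
    """Split (patient_id, shift_text) records so that all shifts of a
    patient end up on the same side of the train/test split."""
    by_patient = defaultdict(list)
    for patient_id, text in shifts:
        by_patient[patient_id].append(text)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)  # seeded for reproducibility
    n_test = round(len(patients) * test_fraction)
    test_patients = set(patients[:n_test])
    train = [(p, t) for p, t in shifts if p not in test_patients]
    test = [(p, t) for p, t in shifts if p in test_patients]
    return train, test

# Hypothetical records, for illustration only.
shifts = [("p1", "shift a"), ("p1", "shift b"), ("p2", "shift c"),
          ("p3", "shift d"), ("p3", "shift e"), ("p4", "shift f")]
train, test = split_by_patient(shifts)
print(len(train), len(test))
```

Splitting by patient rather than by shift prevents text about the same patient from appearing on both sides, which could otherwise inflate test performance.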

Results

The accuracy of the unsupervised model is considerably better than that of the keyword baseline, but, as expected, it is outperformed by the supervised HMM as the latter receives much more detailed information about the distribution of words with respect to topics (Table 1). To reach the performance of the unsupervised method, the supervised HMM requires approximately 3600 words of manually labeled training data (Fig. 3). For comparison, the learning curve for the unsupervised method is shown

Conclusions and future work

We have introduced an unsupervised method for TS and labeling based on a combination of HMMs and LSA and applied the proposed method in a clinically motivated setting. We have shown that, in order to reach the performance of the unsupervised method, a standard HMM would require 3600 words of labeled training data, as opposed to just one keyword per topic necessary for the unsupervised method.

Our study holds promise for improving the functionality of electronic patient records. A topic-wise

Acknowledgments

This work was supported by the Academy of Finland and the Finnish Funding Agency for Technology and Innovation, Tekes (40020/07). We thank Sari Ahonen and Simo Vihjanen from Lingsoft Inc. for extending FinTWOL, Philip Ogren for assistance with Knowtator and Heljä Lundgrén-Laine for help in annotation.

References (20)

  • H. Suominen et al. Automated text segmentation and topic labeling of clinical narratives.
  • H.J. Tange et al. The granularity of medical narratives and its effect on the speed and completeness of information retrieval. Journal of the American Medical Informatics Association (1998).
  • M.A. Hearst. TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics (1997).
  • O. Ferret. Using collocations for topic segmentation and link detection.
  • J.P. Yamron et al. A hidden Markov model approach to text segmentation and event tracking.
  • D.M. Blei et al. Topic segmentation with an aspect hidden Markov model.
  • A. Gruber et al. Hidden topic Markov models.
  • J.M. Ponte et al. Text segmentation by topic.
  • T.-H. Chang et al. Topic segmentation for short texts.
  • T. Mullen et al. A baseline feature set for learning rhetorical zones using full articles in the biomedical domain. SIGKDD Explorations (2005).
There are more references available in the full text version of this article.
