Combining hidden Markov models and latent semantic analysis for topic segmentation and labeling: Method and clinical application

doi:10.1016/j.ijmedinf.2009.02.003

International Journal of Medical Informatics

Volume 78, Issue 12, December 2009, Pages e1-e6

https://doi.org/10.1016/j.ijmedinf.2009.02.003

Abstract

Motivation

Topic segmentation and labeling systems enable fine-grained information search. However, previously proposed methods require annotated data to adapt to different information needs and have limited applicability to texts with short segment length.

Methods

We introduce an unsupervised method based on a combination of hidden Markov models and latent semantic analysis which allows the topics of interest to be defined freely, without the need for data annotation, and can identify short segments.

Results

The method is evaluated on intensive care nursing narratives and motivated by information needs in this domain. The method is shown to considerably outperform a keyword-based heuristic baseline and to achieve a level of performance comparable to that of a related supervised method trained on 3600 manually annotated words.

Introduction

Topic segmentation (TS) and labeling systems enable fine-grained information search. We have previously applied a TS and labeling method, a supervised hidden Markov model (HMM), to Finnish intensive care unit (ICU) nursing narratives [1]. The problem was to automatically divide text into topically coherent segments with respect to pre-determined, repeatedly discussed topics to support information access and clinical decision-making (see Fig. 1). This type of structure has been empirically shown to increase the information search speed of clinicians [2]. However, this approach requires topic-labeled training data to induce the HMM and, consequently, the TS topics cannot be changed without additional annotation effort for each new information need.

In this paper, we introduce a TS and labeling method, where the topics are not fixed in advance but are provided by the user as freely chosen keywords (e.g., breathing or hemodynamics). It combines latent semantic analysis (LSA) with a graphical model closely related to HMMs and is unsupervised in the sense of not requiring labeled training data. This allows the topics of interest to be easily changed: the user simply specifies new keywords.

The applicability of existing TS methods is limited in our case. First, to allow a free, ad hoc choice of topics, we require an unsupervised approach. Commonly, a TS problem is solved in an unsupervised manner by analyzing the similarity (based on, e.g., first uses of words, word co-occurrence, repetition or semantic relations) of the text before and after a proposed segment boundary (see, e.g., [3], [4]); a sudden drop in similarity indicates a likely change in topic. However, these techniques do not typically allow the topics of interest to be specified in advance, and methods that consider pre-specified topics (see, e.g., [5], [6], [7]) tend to be supervised. Second, the ICU narratives are characterized by extremely short segments; a single sentence may contain several topic changes and the average topic length is only 18 tokens. Existing unsupervised TS methods require considerably longer segments (e.g., the TextTiling method [3] searches for topic boundaries between contexts of 200 tokens) and those specifically designed for short segments (see, e.g., [8], [9]) do not consider pre-specified topics. Third, our method is specifically designed for applications where almost all documents contain relevant information about the topics of interest.
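
To make the similarity-drop idea concrete, the following is a minimal sketch of boundary detection between two adjacent token windows. It is a simplified illustration only, not the TextTiling algorithm [3] or any of the cited methods: the function names, the bag-of-words cosine similarity, and the tiny window size are all assumptions chosen for readability.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(tokens, window=3):
    """Similarity of the `window` tokens before and after each candidate gap.

    A gap at position g lies between tokens[g - 1] and tokens[g].
    Only gaps with a full window on both sides are scored.
    """
    sims = []
    for gap in range(window, len(tokens) - window + 1):
        left = Counter(tokens[gap - window:gap])
        right = Counter(tokens[gap:gap + window])
        sims.append((gap, cosine(left, right)))
    return sims


def most_likely_boundary(tokens, window=3):
    """Return the gap with the lowest before/after similarity.

    Assumes len(tokens) >= 2 * window; otherwise no gap can be scored.
    """
    return min(gap_similarities(tokens, window), key=lambda pair: pair[1])[0]


# Hypothetical toy input: four "breathing" tokens followed by four
# "hemodynamics" tokens.
toy_tokens = ["oxygen", "ventilator", "oxygen", "ventilator",
              "pulse", "pressure", "pulse", "pressure"]
boundary = most_likely_boundary(toy_tokens, window=3)
```

Because each gap needs a full window on both sides, a window-based approach of this kind cannot place boundaries inside very short segments, which is the limitation noted above for ICU narratives.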

Although ICU narratives motivated the development of the method, it is a general TS and labeling technique that we believe has the potential to support ad hoc information needs in many other application domains as well. For instance, the method could be applicable in more general information retrieval tasks such as rhetorical zone detection [10].

Section snippets

Clinical data

The dataset used in this study consists of nursing notes of 516 adult ICU patients.² These Finnish patient-specific records are written during every shift mainly for intra-unit information exchange. The dataset consists of 17,140 nursing shifts.

We apply a simple domain-adapted tokenizer, obtaining 1.2 million tokens (including

Method

We first recall the basic notions of LSA and HMMs and then introduce the unsupervised TS and labeling method, which is based on their combination. The main insight of the proposed method is that the LSA similarity of words to the given topic keywords can be used to replace HMM emission probabilities. Whereas a supervised HMM requires labeled data to estimate the emission probabilities, the unsupervised method only requires a single keyword for each topic.
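
The central idea, using word-to-keyword similarities in place of learned emission probabilities, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the model used in this work: the similarity values in `TOY_SIMILARITY` are invented rather than computed by LSA, the normalization in `emission_scores` and the single `stay_prob` transition parameter are assumptions, and plain Viterbi decoding stands in for inference in the actual graphical model.

```python
import math

# Hypothetical similarities of each word to the topic keywords
# ("breathing", "hemodynamics"); in practice these would come from LSA.
TOY_TOPICS = ["breathing", "hemodynamics"]
TOY_SIMILARITY = {
    "ventilator": [0.9, 0.1],
    "oxygen": [0.8, 0.2],
    "pressure": [0.2, 0.9],
    "pulse": [0.1, 0.8],
}


def emission_scores(sim_row, floor=1e-3):
    """Clip similarities to a small positive floor and normalize to sum to 1."""
    clipped = [max(s, floor) for s in sim_row]
    total = sum(clipped)
    return [c / total for c in clipped]


def viterbi_labels(tokens, topics, similarity, stay_prob=0.8):
    """Label each token with a topic using similarity-based emission scores.

    Assumes at least two topics. Words missing from `similarity` receive
    uniform emission scores, so their label is decided by the context.
    """
    if not tokens:
        return []
    n = len(topics)
    switch_prob = (1.0 - stay_prob) / (n - 1)
    log_trans = [[math.log(stay_prob if i == j else switch_prob)
                  for j in range(n)] for i in range(n)]
    log_emit = [[math.log(p)
                 for p in emission_scores(similarity.get(tok, [0.0] * n))]
                for tok in tokens]

    # Uniform prior over the topic of the first token.
    score = [math.log(1.0 / n) + log_emit[0][k] for k in range(n)]
    backpointers = []
    for t in range(1, len(tokens)):
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [topics[k] for k in path]


toy_labels = viterbi_labels(
    ["ventilator", "oxygen", "stable", "pressure", "pulse"],
    TOY_TOPICS, TOY_SIMILARITY)
```

In this toy run the word "stable" has no similarity entry, so its label depends only on the transition structure; the high `stay_prob` discourages isolated topic switches while still allowing segments of only a few tokens.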

Evaluation

We evaluate the proposed method on manually annotated data (see Section 2) randomly selected from 135 patient reports and divided into a training set (198 shifts) and a test set (204 shifts). If two shifts report on the same patient, both are placed either in the training set or in the test set. To deal with the highly inflective nature of Finnish, we lemmatize the text using the FinTWOL Finnish morphological analyzer⁴ [16] in all experiments. Our version has a lexicon extended
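
The patient-level constraint on the training/test division described above can be sketched as a grouped split. The function name, the `(patient_id, shift_text)` input format, and the `test_fraction` and `seed` parameters are hypothetical; the sketch only illustrates keeping all shifts of one patient on the same side of the split, not the exact procedure used here.

```python
import random
from collections import defaultdict


def split_by_patient(shifts, test_fraction=0.5, seed=0):
    """Split (patient_id, shift_text) pairs so that no patient spans both sets."""
    by_patient = defaultdict(list)
    for patient_id, text in shifts:
        by_patient[patient_id].append(text)

    # Shuffle patients, not shifts, so the split is made at patient level.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = round(len(patients) * test_fraction)
    test_patients = set(patients[:n_test])

    train = [(p, t) for p, t in shifts if p not in test_patients]
    test = [(p, t) for p, t in shifts if p in test_patients]
    return train, test


# Hypothetical toy data: three patients with one to three shifts each.
toy_shifts = [("p1", "shift a"), ("p1", "shift b"), ("p2", "shift c"),
              ("p3", "shift d"), ("p3", "shift e"), ("p3", "shift f")]
toy_train, toy_test = split_by_patient(toy_shifts, test_fraction=0.34)
```

Splitting at the patient level means the resulting shift counts are only approximately controlled by `test_fraction`, since patients contribute different numbers of shifts.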

Results

The accuracy of the unsupervised model is considerably better than that of the keyword baseline, but, as expected, it is outperformed by the supervised HMM as the latter receives much more detailed information about the distribution of words with respect to topics (Table 1). To reach the performance of the unsupervised method, the supervised HMM requires approximately 3600 words of manually labeled training data (Fig. 3). For comparison, the learning curve for the unsupervised method is shown

Conclusions and future work

We have introduced an unsupervised method for TS and labeling based on a combination of HMMs and LSA and applied the proposed method in a clinically motivated setting. We have shown that, in order to reach the performance of the unsupervised method, a standard HMM would require 3600 words of labeled training data, as opposed to just one keyword per topic necessary for the unsupervised method.

Our study holds promise for improving the functionality of electronic patient records. A topic-wise

Acknowledgments

This work was supported by the Academy of Finland and the Finnish Funding Agency for Technology and Innovation, Tekes (40020/07). We thank Sari Ahonen and Simo Vihjanen from Lingsoft Inc. for extending FinTWOL, Philip Ogren for assistance with Knowtator and Heljä Lundgrén-Laine for help in annotation.

References (20)

[1] H. Suominen et al. Automated text segmentation and topic labeling of clinical narratives.
[2] H.J. Tange et al. The granularity of medical narratives and its effect on the speed and completeness of information retrieval. Journal of American Medical Informatics Association (1998).
[3] M.A. Hearst. TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics (1997).
[4] O. Ferret. Using collocations for topic segmentation and link detection.
[5] J.P. Yamron et al. A hidden Markov model approach to text segmentation and event tracking.
[6] D.M. Blei et al. Topic segmentation with an aspect hidden Markov model.
[7] A. Gruber et al. Hidden topic Markov models.
[8] J.M. Ponte et al. Text segmentation by topic.
[9] T.-H. Chang et al. Topic segmentation for short texts.
[10] T. Mullen et al. A baseline feature set for learning rhetorical zones using full articles in the biomedical domain. SIGKDD Explorations (2005).

There are more references available in the full text version of this article.
