A methodology for partitioning a vocabulary hierarchy into trees¹

https://doi.org/10.1016/S0933-3657(98)00046-3

Abstract

Controlled medical vocabularies are useful in application areas such as medical information systems and decision-support systems. However, such vocabularies are large and complex, and working with them can be daunting. It is important to provide a means for orienting vocabulary designers and users to the vocabulary's contents. We describe a methodology for partitioning a vocabulary based on an IS-A hierarchy into small meaningful pieces. The methodology uses our disciplined modeling framework to refine the IS-A hierarchy according to prescribed rules in a process carried out by a user in conjunction with the computer. The partitioning of the hierarchy implies a partitioning of the vocabulary. We demonstrate the methodology with respect to a complex sample of the MED, an existing medical vocabulary.

Introduction

Controlled medical vocabularies (vocabularies for short) play an important role in many medical enterprises that employ a large number of disparate information systems (e.g. clinical databases). Often, each such system has its own inherent ‘language’ or terminology. A number of such vocabularies have appeared in the medical field [11], [34], [35]. Of note is the Medical Entities Dictionary (MED), developed and in use at Columbia-Presbyterian Medical Center (CPMC) [8], [9]. Controlled vocabularies have been shown to greatly facilitate the integration of medical information systems that use different terminologies [10]. They also help to standardize common information handling tasks and reduce the overall cost of data processing.

While a controlled vocabulary offers tremendous benefits, these benefits come at a price. A vocabulary can be quite extensive and can contain an overwhelming amount of structural and semantic complexity. For example, the MED contains over 48 000 concepts, over 61 000 IS-A links and over 71 000 other links. (We are referring to a particular version of the MED, dated 12/96, throughout this paper.) Comprehending such a vocabulary can therefore be an extremely difficult problem in itself.

In this paper, we are concerned with providing a tool to help users comprehend vocabularies. In particular, we present a methodology to make large and complex vocabularies easier to understand. Our approach is based on the partitioning of a vocabulary into manageably-sized, meaningful units. The partitioning assumes the existence of a vocabulary with an IS-A hierarchy and centers around this IS-A (or concept subsumption) hierarchy.

To enhance comprehension of the MED vocabulary [14], we have mapped it into an OODB schema representation based on partitioning the vocabulary into sets of concepts with the same sets of properties. In [21], we reported on implementing the InterMED (a partial revised version of the MED) [26], [33] using ONTOS, a commercial OODB system. We call the resultant OODB the Object-Oriented Healthcare Vocabulary Repository (OOHVR). Among other things, the OOHVR's schema captures the complete structure of the vocabulary in a compact form which aids in its comprehension. However, for the much larger MED, each class in the corresponding OODB schema summarizes on average 500 concepts. A vocabulary of 500 concepts is still hard to understand. Thus, further partitioning efforts are needed to enhance comprehension.

The backbone of many controlled vocabularies is the IS-A hierarchy which relates more specialized concepts (subconcepts) to more generalized concepts (superconcepts) that subsume them. The IS-A hierarchy also serves as the basis for property inheritance. In general, the IS-A hierarchy of a controlled vocabulary will be a directed acyclic graph, permitting multiple superconcepts and multiple inheritance. Our methodology is based on the following two premises: (1) a vocabulary's IS-A hierarchy taken alone is much more comprehensible than the entire vocabulary itself; (2) a ‘forest’ IS-A hierarchy (i.e. a collection of trees in which every link is an IS-A link and where, by definition, no concept has more than one superconcept) is easier to comprehend than a directed acyclic graph containing the same number of concepts.
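The difference between a directed acyclic graph and a forest in premise (2) can be made concrete with a small check. The following is a minimal sketch, not the authors' implementation: the toy concepts, the dictionary representation and the function names are all assumptions introduced for illustration.

```python
# Hypothetical toy hierarchy: each concept maps to its direct superconcepts.
IS_A = {
    "drug": [],
    "antibiotic": ["drug"],
    "steroid": ["drug"],
    "combination drug": ["antibiotic", "steroid"],  # two superconcepts
}


def is_acyclic(parents):
    """Return True if following IS-A links never leads back to the start."""
    state = {}  # concept -> "visiting" or "done"

    def visit(concept):
        if state.get(concept) == "done":
            return True
        if state.get(concept) == "visiting":
            return False  # we came back to a concept still being explored
        state[concept] = "visiting"
        ok = all(visit(p) for p in parents.get(concept, []))
        state[concept] = "done"
        return ok

    return all(visit(c) for c in parents)


def multi_parent_concepts(parents):
    """Concepts with more than one superconcept (what a forest forbids)."""
    return sorted(c for c, ps in parents.items() if len(ps) > 1)


def is_forest(parents):
    """A forest is acyclic and gives every concept at most one superconcept."""
    return is_acyclic(parents) and not multi_parent_concepts(parents)


if __name__ == "__main__":
    print(is_acyclic(IS_A))
    print(multi_parent_concepts(IS_A))
    print(is_forest(IS_A))
```

In this toy hierarchy the graph is acyclic but not a forest, because one concept has two superconcepts; concepts of that kind are what a partitioning into trees has to resolve.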

With these premises in mind, we develop a theoretical framework that reduces an entire vocabulary (typically represented as a large semantic network) into a forest hierarchy composed of small trees, each representing a logical unit whose graphical representation can fit on a computer screen. This reduction in size makes it easier for users and system designers alike to comprehend the contents of the vocabulary in a modular fashion.

Our methodology relies on an interaction between a user (presumably the vocabulary designer or administrator) and the computer. The process requires that a user refine the vocabulary's IS-A hierarchy according to some prescribed principles so that it conforms to what we call the rules of disciplined modeling. After the refinement, the computer can automatically reduce the vocabulary to a forest structure. We formally prove that our approach always finds a forest partition as long as the rules of disciplined modeling are adhered to. Let us note that partitioning networks (graphs) according to various criteria has been shown to be NP-complete, i.e. computationally intractable [12].
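The automatic step that follows the user's refinement might look roughly like the sketch below. It rests on a loud assumption: that the refinement leaves every concept with at most one IS-A link flagged to be kept in the forest. The `kept` flag, the toy concepts and the function names are hypothetical stand-ins, not the rules of disciplined modeling themselves, which this excerpt does not spell out.

```python
# Hypothetical refined hierarchy: (subconcept, superconcept, kept) triples.
# kept=True marks the single link a concept retains in the forest (assumption).
CONCEPTS = ["drug", "antibiotic", "steroid", "ordered drug", "ordered antibiotic"]
LINKS = [
    ("antibiotic", "drug", True),
    ("steroid", "drug", True),
    ("ordered drug", "drug", False),
    ("ordered antibiotic", "ordered drug", True),
    ("ordered antibiotic", "antibiotic", False),
]


def extract_forest(concepts, links):
    """Map each concept to its single kept superconcept (None for a root)."""
    parent = {c: None for c in concepts}
    for child, superconcept, kept in links:
        if not kept:
            continue
        if parent[child] is not None:
            # The assumed refinement was not completed for this concept.
            raise ValueError(f"{child!r} keeps more than one superconcept")
        parent[child] = superconcept
    return parent


def root_of(concept, parent):
    """Climb kept links to the tree root; assumes the kept links are acyclic."""
    while parent[concept] is not None:
        concept = parent[concept]
    return concept


def partition_into_trees(concepts, links):
    """Group concepts by the root of the tree that contains them."""
    parent = extract_forest(concepts, links)
    trees = {}
    for concept in concepts:
        trees.setdefault(root_of(concept, parent), []).append(concept)
    return trees


if __name__ == "__main__":
    for root, members in partition_into_trees(CONCEPTS, LINKS).items():
        print(root, "->", members)
```

Once the flags are in place the grouping needs only one pass over the links plus a climb to the root for each concept, which is consistent with leaving the judgment calls to the user and the mechanical reduction to the computer.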

In previous work [28], we have employed a similar paradigm to reduce the complexity of large object-oriented database (OODB) subclass hierarchies. In this paper, we rework and adapt the approach to the IS-A hierarchy of an extensive, complex vocabulary. Furthermore, we present an interactive methodology for partitioning the vocabulary. To ground our discussion in a real-world application, we will focus on the MED as our test-bed vocabulary. The methodology developed herein will be applied to a complex subnetwork of the MED.

Our approach is closely related to the principle of ‘orthogonal taxonomies’ as implemented in the GALEN project [30], [31]. There, a taxonomy is organized from the start by requiring that all primitive entities have only one primitive parent. In our methodology, an existing vocabulary is partitioned to achieve a similar effect.

The rest of this paper is organized as follows. In Section 2, we describe the notions of informational thinning and partitioning with respect to vocabularies. Section 3 introduces the rules of disciplined modeling and proves that they make it possible to obtain a meaningful forest hierarchy from a directed acyclic graph. In Section 4, we describe our methodology for partitioning the vocabulary. In Section 5, we apply the methodology to a very complex portion of the MED. Section 6 contains our conclusions. A short, preliminary version of this paper appeared in [15].

Section snippets

Informational thinning and partitioning

In this section, we describe two approaches which are used to enhance the comprehension of large and complex vocabularies. If a vocabulary network containing a vast number of objects (representing concepts), relationships and attributes is displayed on a screen, the user typically has difficulty comprehending and dealing with it. For such an overwhelming display of the InterMED, see [27].
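As a rough illustration of informational thinning, the sketch below keeps only the IS-A links of a small network and discards everything else. This reading is an assumption based on the paper's statement that thinning yields a directed acyclic graph hierarchy; the triple representation, the link names other than IS-A and the function name are hypothetical.

```python
# Hypothetical network: (source, link type, target) triples.
NETWORK = [
    ("antibiotic", "IS-A", "drug"),
    ("antibiotic", "treats", "infection"),
    ("steroid", "IS-A", "drug"),
    ("drug", "has-attribute", "dose"),
    ("combination drug", "IS-A", "antibiotic"),
    ("combination drug", "IS-A", "steroid"),
]


def thin_to_is_a(network):
    """Keep only IS-A links, returning concept -> list of superconcepts."""
    parents = {}
    for source, link_type, target in network:
        if link_type != "IS-A":
            continue  # relationships and attributes are set aside
        parents.setdefault(source, []).append(target)
        parents.setdefault(target, [])  # make sure roots appear as concepts
    return parents


if __name__ == "__main__":
    for concept, superconcepts in thin_to_is_a(NETWORK).items():
        print(concept, "IS-A", superconcepts)
```

For the MED figures quoted in the introduction, a filter of this kind would set aside the more than 71 000 non-IS-A links and leave the IS-A hierarchy to be partitioned.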

According to our experience, the difficulties of understanding a vocabulary stem more from the

Theoretical paradigm using disciplined modeling

In order to identify a meaningful forest subhierarchy of an IS-A hierarchy, we shall look into the nature of the specialization IS-A relationship. In previous OODB research [13], we and others [25] have identified two different kinds of SUBCLASS relationships between object classes, called category-of and role-of. Both are specialization relationships. Category-of relates the specialized class to the more general class where both are seen in the same application context. Role-of relates the

A methodology for context partitioning of a hierarchy

We have described a conceptual framework which guarantees that for the price of following the rules of disciplined modeling, there can be found a forest-structure subhierarchy of the given directed acyclic graph hierarchy. This forest structure serves as a skeleton supporting the comprehension of the hierarchy. Furthermore, the trees of the forest represent contexts which are logical subhierarchies concentrating on a specific subject area, further supporting the comprehension of the original

Applying the methodology to a complex hierarchy

In order to test the effectiveness of our methodology, we applied it to the previously mentioned cortisporin subnetwork of the MED. First, informational thinning was used to obtain a directed acyclic graph hierarchy out of this subnetwork (Fig. 1). Then we used the methodology introduced in the previous section to partition the hierarchy into trees. Each tree produced by the partitioning is a logical unit in the forest hierarchy. The root object of a tree defines the unifying context for the

Conclusions

Vocabularies promise to be important tools for many medical information processing tasks. They can help overcome differences in terminology between different databases and information systems and different categories of users. Unfortunately, the job of understanding and maintaining the vocabulary itself is difficult and time-consuming. A graphical representation can help in the process of understanding most vocabularies. However, if the vocabulary is very large, the graphical representation

Acknowledgements

We thank Jim Cimino for his important feedback on earlier drafts of this paper.

References (35)

  • J. Cimino et al. Knowledge-based approaches to the maintenance of a large controlled medical terminology. JAMIA (1994)
  • Cimino J, Hripcsak G, Johnson S, Clayton P. Designing an introspective, controlled medical vocabulary. In: Kingsland...
  • College of American Pathologists, Skokie, IL. Systematized Nomenclature of Medicine. Second edition,...
  • M. Garey et al. Computers and Intractability (1979)
  • J. Geller et al. Structure and semantics in OODB class specifications. SIGMOD Record (1991)
  • Gu H, Cimino J, Halper M, Geller J, Perl Y. Utilizing OODB schema modeling for vocabulary management. In: Proc. '96...
  • Gu H, Perl Y, Geller J, Halper M, Cimino J, Singh M. Partitioning a vocabulary's IS-A hierarchy into trees. In: Proc....
¹ This research was (partially) done under a cooperative agreement between the National Institute of Standards and Technology Advanced Technology Program (under the HIIT contract #70NANB5H1011) and the Healthcare Open Systems and Trials, Inc. consortium, and the Center for Manufacturing Systems.
