A methodology for partitioning a vocabulary hierarchy into trees¹
Introduction
Controlled medical vocabularies (vocabularies for short) play an important role in many medical enterprises that employ a large number of disparate information systems (e.g. clinical databases). Often, each such system has its own inherent ‘language’ or terminology. A number of such vocabularies have appeared in the medical field [11, 34, 35]. Of note is the Medical Entities Dictionary (MED), developed and in use at Columbia-Presbyterian Medical Center (CPMC) [8, 9]. Controlled vocabularies have been shown to greatly facilitate the integration of medical information systems that use different terminologies [10]. They also help to standardize common information handling tasks and reduce the overall cost of data processing.
While a controlled vocabulary offers tremendous benefits, these benefits come at a price. A vocabulary can be quite extensive and can contain an overwhelming amount of structural and semantic complexity. For example, the MED contains over 48 000 concepts, over 61 000 IS-A links and over 71 000 other links. (Throughout this paper, we refer to a particular version of the MED, dated 12/96.) Comprehending such a vocabulary can itself be an extremely difficult problem.
In this paper, we are concerned with providing a tool to help users comprehend vocabularies. In particular, we present a methodology to make large and complex vocabularies easier to understand. Our approach is based on the partitioning of a vocabulary into manageably-sized, meaningful units. The partitioning assumes the existence of a vocabulary with an IS-A hierarchy and centers around this IS-A (or concept subsumption) hierarchy.
To enhance comprehension of the MED vocabulary [14], we have mapped it into an OODB schema representation based on partitioning the vocabulary into sets of concepts with the same sets of properties. In [21], we reported on implementing the InterMED (a partial revised version of the MED) [26, 33] using ONTOS, a commercial OODB system. We call the resultant OODB the Object-Oriented Healthcare Vocabulary Repository (OOHVR). Among other things, the OOHVR's schema captures the complete structure of the vocabulary in a compact form which aids in its comprehension. However, for the much larger MED, each class in the corresponding OODB schema summarizes on average 500 concepts. A vocabulary of 500 concepts is still hard to understand. Thus, further partitioning efforts are needed to enhance comprehension.
The backbone of many controlled vocabularies is the IS-A hierarchy which relates more specialized concepts (subconcepts) to more generalized concepts (superconcepts) that subsume them. The IS-A hierarchy also serves as the basis for property inheritance. In general, the IS-A hierarchy of a controlled vocabulary will be a directed acyclic graph, permitting multiple superconcepts and multiple inheritance. Our methodology is based on the following two premises: (1) a vocabulary's IS-A hierarchy taken alone is much more comprehensible than the entire vocabulary itself; (2) a ‘forest’ IS-A hierarchy (i.e. a collection of trees in which every link is an IS-A link and where, by definition, no concept has more than one superconcept) is easier to comprehend than a directed acyclic graph containing the same number of concepts.
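The distinction between a directed acyclic graph and a ‘forest’ hierarchy can be made concrete with a minimal sketch. It assumes the IS-A hierarchy is available as (subconcept, superconcept) pairs and is already acyclic; the function names and the toy concepts are hypothetical and are not taken from the MED.

```python
from collections import defaultdict


def multi_parent_concepts(is_a_links):
    """Return the concepts that have more than one superconcept.

    is_a_links: iterable of (subconcept, superconcept) pairs.
    """
    parents = defaultdict(set)
    for child, parent in is_a_links:
        parents[child].add(parent)
    return {c: sorted(ps) for c, ps in parents.items() if len(ps) > 1}


def is_forest(is_a_links):
    """True when no concept has more than one superconcept.

    The hierarchy is assumed to be acyclic; cycles are not checked here.
    """
    return not multi_parent_concepts(is_a_links)


# Hypothetical toy hierarchy (illustration only).
links = [
    ("Antibiotic", "Drug"),
    ("Topical Preparation", "Drug"),
    ("Topical Antibiotic", "Antibiotic"),
    ("Topical Antibiotic", "Topical Preparation"),
]

print(multi_parent_concepts(links))
print(is_forest(links))
```

In this toy example the last link gives "Topical Antibiotic" a second superconcept, which is exactly what turns the hierarchy from a forest into a general directed acyclic graph.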
With these premises in mind, we develop a theoretical framework that reduces an entire vocabulary (typically represented as a large semantic network) into a forest hierarchy composed of small trees, each representing a logical unit whose graphical representation can fit on a computer screen. This reduction in size makes it easier for users and system designers alike to comprehend the contents of the vocabulary in a modular fashion.
Our methodology relies on an interaction between a user (presumably the vocabulary designer or administrator) and the computer. The process requires the user to refine the vocabulary's IS-A hierarchy according to some prescribed principles so that it conforms to what we call the rules of disciplined modeling. After the refinement, the computer can automatically reduce the vocabulary to a forest structure. We formally prove that our approach always finds a forest partition as long as the rules of disciplined modeling are adhered to. Note that partitioning networks (graphs) according to various criteria has been shown to be NP-complete, i.e. it is believed to be computationally intractable [12].
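The automatic reduction step can be illustrated with a rough sketch. The rules of disciplined modeling are not reproduced here; the sketch assumes only that the user's refinement leaves every non-root concept with exactly one designated superconcept, recorded in a hypothetical `primary_parent` mapping. Under that assumption, grouping concepts by the root they reach yields the trees of the forest. The concept names are invented for illustration.

```python
from collections import defaultdict


def partition_into_trees(concepts, primary_parent):
    """Group concepts into trees by following designated superconcepts to a root.

    concepts: iterable of concept names.
    primary_parent: dict mapping a concept to its single retained
        superconcept (an assumption of this sketch); roots have no entry.
    Returns a dict mapping each root to the sorted members of its tree.
    """
    root_of = {}
    for concept in concepts:
        path, seen = [], set()
        current = concept
        # Walk upward until a root or an already-resolved concept is reached.
        while current not in root_of and current in primary_parent:
            if current in seen:
                raise ValueError(f"cycle through {current!r}")
            seen.add(current)
            path.append(current)
            current = primary_parent[current]
        root = root_of.get(current, current)
        root_of[current] = root
        for node in path:
            root_of[node] = root

    trees = defaultdict(list)
    for concept, root in root_of.items():
        trees[root].append(concept)
    return {root: sorted(members) for root, members in trees.items()}


# Hypothetical input (illustration only).
concepts = ["Drug", "Antibiotic", "Topical Antibiotic", "Organism", "Bacterium"]
primary_parent = {
    "Antibiotic": "Drug",
    "Topical Antibiotic": "Antibiotic",
    "Bacterium": "Organism",
}
print(partition_into_trees(concepts, primary_parent))
```

Once each concept has a single designated superconcept, the grouping itself is a simple traversal; the hard part, which the sketch leaves to the user, is deciding which superconcept to retain.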
In previous work [28], we have employed a similar paradigm to reduce the complexity of large object-oriented database (OODB) subclass hierarchies. In this paper, we rework and adapt the approach to the IS-A hierarchy of an extensive, complex vocabulary. Furthermore, we present an interactive methodology for partitioning the vocabulary. To ground our discussion in a real-world application, we will focus on the MED as our test-bed vocabulary. The methodology developed herein will be applied to a complex subnetwork of the MED.
Our approach is closely related to the principle of ‘orthogonal taxonomies’ as implemented in the GALEN project [30, 31]. There, a taxonomy is organized from the start by requiring that all primitive entities have only one primitive parent. In our methodology, an existing vocabulary is partitioned to achieve a similar effect.
The rest of this paper is organized as follows. In Section 2, we describe the notions of informational thinning and partitioning with respect to vocabularies. Section 3 introduces the rules of disciplined modeling and proves that they make it possible to obtain a meaningful forest hierarchy from a directed acyclic graph. In Section 4, we describe our methodology for partitioning the vocabulary. In Section 5, we apply the methodology to a very complex portion of the MED. Section 6 contains our conclusions. A short, preliminary version of this paper appeared in [15].
Section snippets
Informational thinning and partitioning
In this section, we describe two approaches which are used to enhance the comprehension of large and complex vocabularies. If a vocabulary network containing a vast number of objects (representing concepts), relationships and attributes is displayed on a screen, the user typically has difficulty comprehending and dealing with it. For such an overwhelming display of the InterMED, see [27].
According to our experience, the difficulties of understanding a vocabulary stem more from the
Theoretical paradigm using disciplined modeling
In order to identify a meaningful forest subhierarchy of an IS-A hierarchy, we shall look into the nature of the specialization IS-A relationship. In previous OODB research [13], we and others [25] have identified two different kinds of SUBCLASS relationships between object classes, called category-of and role-of. Both are specialization relationships. Category-of relates the specialized class to the more general class where both are seen in the same application context. Role-of relates the
A methodology for context partitioning of a hierarchy
We have described a conceptual framework which guarantees that for the price of following the rules of disciplined modeling, there can be found a forest-structure subhierarchy of the given directed acyclic graph hierarchy. This forest structure serves as a skeleton supporting the comprehension of the hierarchy. Furthermore, the trees of the forest represent contexts which are logical subhierarchies concentrating on a specific subject area, further supporting the comprehension of the original
Applying the methodology to a complex hierarchy
In order to test the effectiveness of our methodology, we applied it to the previously mentioned cortisporin subnetwork of the MED. First, informational thinning was used to obtain a directed acyclic graph hierarchy out of this subnetwork (Fig. 1). Then we used the methodology introduced in the previous section to partition the hierarchy into trees. Each tree produced by the partitioning is a logical unit in the forest hierarchy. The root object of a tree defines the unifying context for the
Conclusions
Vocabularies promise to be important tools for many medical information processing tasks. They can help overcome differences in terminology between different databases and information systems and different categories of users. Unfortunately, the job of understanding and maintaining the vocabulary itself is difficult and time-consuming. A graphical representation can help in the process of understanding most vocabularies. However, if the vocabulary is very large, the graphical representation
Acknowledgements
We thank Jim Cimino for his important feedback on earlier drafts of this paper.
References (35)
- et al. A shifting algorithm for constrained min–max partition on trees. Discrete Appl Math (1993)
- et al. Shifting algorithms for tree partitioning with general weighting functions. J Algorithms (1983)
- et al. The shifting algorithm technique for the partitioning of trees. Discrete Appl Math (1995)
- et al. Most uniform path partitioning and its use in image processing. Discrete Appl Math (1993)
- et al. The GRAIL concept modelling language for medical terminology. Artif Intell Med (1997)
- et al. Data Structures and Algorithms (1983)
- et al. A shifting algorithm for min–max tree-partitioning. J ACM (1982)
- Buvac S, Fikes R. A declarative formalization of knowledge translation. In CIKM '95, Baltimore, MD,...
- Buvac S, Mason IM. Propositional logic of context. In Proceedings of the 11th National Conference on Artificial...
- et al. Automated translation between medical terminologies using semantic definitions. MD Comput (1990)
- Knowledge-based approaches to the maintenance of a large controlled medical terminology. JAMIA
- Computers and Intractability
- Structure and semantics in OODB class specifications. SIGMOD Record
Cited by (28)
Visual comprehension and orientation into the COVID-19 CIDO ontology
2021, Journal of Biomedical Informatics
Citation excerpt: Biomedical ontologies represent complex domain knowledge and are not easy to comprehend. We have previously designed various summarization networks that provide “big picture” views of ontologies to achieve better ontology comprehension [22,37,40,50]. A summarization network is composed of nodes, summarizing sets of similar concepts and connected by hierarchical child-of relationships.
Towards summarizing knowledge: Brief ontologies
2012, Expert Systems with Applications
Citation excerpt: The final sections give the conclusions derived from this research and the references cited. One of the first approaches to reducing an ontology to manageable size is described in Gu, Perl, Geller, Halper, and Singh (1999). In this study, the authors partitioned a large, complex semantic network into separate smaller sub-networks.
Structural group auditing of a UMLS semantic type's extent
2009, Journal of Biomedical Informatics
Auditing description-logic-based medical terminological systems by detecting equivalent concept definitions
2008, International Journal of Medical Informatics
Citation excerpt: The first step is to determine which concept category is to be audited. Generally, medical TSs can be regarded as consisting of various (more or less explicitly distinguishable) modules [26,27]. For example, SNOMED CT specifies not only concepts in the category “disease”, but, among others, also the categories “body structure”, “finding”, “organism”, “specimen”, and “substance”.
Description logic-based methods for auditing frame-based medical terminological systems
2005, Artificial Intelligence in Medicine
Designing metaschemas for the UMLS enriched semantic network
2003, Journal of Biomedical Informatics
¹ This research was (partially) done under a cooperative agreement between the National Institute of Standards and Technology Advanced Technology Program (under the HIIT contract #70NANB5H1011) and the Healthcare Open Systems and Trials, Inc. consortium, and the Center for Manufacturing Systems.