Instance-based classifiers applied to medical databases: Diagnosis and knowledge extraction
Introduction
Machine learning [1], [2], [3] is the field of artificial intelligence concerned with programmes that learn from experience and improve their performance. Within this field, the problem of supervised classification concerns the construction of classifier systems that, once trained, assign the proper class, among a set of possible classes, to each instance or object in the input.
Classifier systems can have two major purposes: to predict the class for new observations and to extract knowledge from past experiences. In the medical field they can be used as second-opinion tools for clinical decisions if they are embedded in a clinical decision support system (CDSS) [4, Chapter 25], and they can be used as tools for the phase of knowledge extraction and data mining in knowledge discovery processes [5], [6].
If we consider only the input–output behaviour of classifier systems – that is, if we use them as black boxes – they are useful in predicting the class of new examples: forecasting what will happen in new situations by relying on data that describe what happened in the past.
On the other hand, if we “open the box” and consider also the internal representation of classes that the systems inferred from the training set – that is, we use them as white boxes – then they are useful for extracting the knowledge that is implicitly already contained in the data set as provided by human experts.
This latter activity can be considered even more important than the former: “we are equally – perhaps more – interested in applications in which the result of “learning” is an actual description of a structure that can be used to classify examples. This structural description supports explanation, understanding, and prediction” [1, p. xxiv].
This particular purpose of machine learning is of interest in practically every field of application [7] and has some peculiarities in the medical field [8].
Instance-based (IB) classifier systems [9], [10], [11] constitute a family of classifiers whose main distinctive characteristic is to use the instances themselves as class representations. IB classification relies on the similarity between the new observation to be classified and instances chosen as representative of the learned class. The representative instances can be either previously observed exemplars or abstracted from the given data. In the first case, we are dealing with classifiers called memory-based, exemplar-based or case-based, while in the second case, we deal with those called prototype-based or abstraction-based.
Two well-known classifiers in this family are the nearest neighbour classifier (NNC) and the nearest prototype classifier (NPC), which are based on observed exemplars or abstracted prototypes, respectively.
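The difference between the two classifiers can be illustrated with a minimal sketch. It rests on assumptions that the text above does not state: similarity is measured by Euclidean distance, and each class prototype is the mean of the training instances of that class. The function names and the toy data are hypothetical and serve only as illustration.

```python
import math
from collections import defaultdict


def nnc_predict(train_X, train_y, query):
    """Nearest neighbour: return the label of the closest stored exemplar."""
    nearest = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return train_y[nearest]


def npc_fit(train_X, train_y):
    """Abstract one prototype per class (assumption: the class mean)."""
    groups = defaultdict(list)
    for x, label in zip(train_X, train_y):
        groups[label].append(x)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in groups.items()
    }


def npc_predict(prototypes, query):
    """Nearest prototype: return the label of the closest class prototype."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], query))


# Toy data, invented for illustration only.
train_X = [[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]]
train_y = ["A", "A", "B", "B"]

prototypes = npc_fit(train_X, train_y)
print(nnc_predict(train_X, train_y, [1.2, 1.1]))
print(npc_predict(prototypes, [5.5, 5.0]))
```

In this sketch the NNC must keep all four exemplars in memory, whereas the NPC keeps only two prototypes, one per class, none of which needs to coincide with an observed instance.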
The NNC is one of the first classifiers proposed in the literature and, with its various extensions and generalizations, is still widely used in the clinical setting (e.g., [12], [13], [14], [15]), both for its performance and for its ease of operation and implementation.
Prototype-based classifier systems have also been applied in the bio-medical field (e.g., [16], [17]).
IB classifiers, and in particular the k-NNC with an optimized k (an improved version of the NNC with a free parameter), are used in real-life problems because they achieve good performance. However, in situations where an explanation of the classifier's output is useful, they are not very effective in providing one, because they do not perform any “interpretation” or knowledge extraction from the data set provided by a given physician (e.g., [12, p. 231–2]).
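As a rough illustration of the role of the free parameter k, the following sketch implements a k-NNC by majority vote among the k closest exemplars. Euclidean distance, the function name `knn_predict` and the toy data are assumptions made for this example. How k is optimized is not specified in the text above, so k is simply passed as an argument.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training exemplars closest to the query."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # Equal counts are resolved in favour of the label seen first,
    # that is, the label of the nearer neighbour.
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration: one "B" exemplar lies inside the "A" cluster.
train_X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [1.6, 1.4], [5.0, 5.0], [6.0, 5.5]]
train_y = ["A", "A", "A", "B", "B", "B"]

query = [1.5, 1.4]
print(knn_predict(train_X, train_y, query, k=1))
print(knn_predict(train_X, train_y, query, k=3))
```

With k = 1 the isolated exemplar determines the answer, while with k = 3 it is outvoted by its neighbours. In both cases the output is only a label derived from stored exemplars, with no description of the classes themselves.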
Moreover, instance-based classifiers are usually slow in the classification phase and have large storage requirements.
These problems are not fully overcome by the extensions of IB classifiers proposed in the literature. The fuzzy version (e.g., [18]) uses the whole training set as the set of representative instances, while the reduction techniques for instances stored in memory (e.g., [11]) usually build sets of representative instances composed only of exemplars and decrease the classification accuracy.
It would be preferable for IB systems to extract a set of representative instances that is concise and meaningful in terms of the clinical knowledge contained in the database. Furthermore, a small set of representative instances requires little storage and allows quick classification; yet it should produce similar, or even improved, classification accuracy.
In the following sections, we will describe the general foundation of the instance-based approach, present some representative classifiers belonging to this family and introduce a new one. Then we will discuss the implications and the limitations of the instance-based approach in the knowledge discovery process as applied to medical domains. After a description of the three clinical databases used in this study, we present experimental results and discuss them.
The basic framework
The task of classification is carried out in two steps: learning (or training) and prediction (or, more properly, classification). In the former, a set of labelled data, called the training set, is used to learn the function that maps observations to classes. In the latter, data whose appropriate class is unknown are considered, and the classifying function learned during the training phase is used to predict their classes.
Classification algorithms depend heavily
Knowledge discovery process
Knowledge discovery has been defined as “a non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns from collections of data” [7]; its main aim is “to reveal some new and useful information from the data” [27, p. 149].
In the medical field, Cios et al. [5], [6] propose a six-step knowledge discovery process which can be summarized [27] as follows:
1. Understanding the problem domain. Learning the terminology and relevant prior knowledge;
Three diagnostic problems
We have considered three diagnostic problems: the differential diagnosis of erythemato-squamous diseases, the diagnosis of the onset of diabetes mellitus and a problem of diagnostic imaging in nuclear cardiology.
These three clinical problems were chosen for their practical interest and because, together with their respective databases, they cover different types of classification problem. In fact, the dermatological problem is multi-class, has high dimensionality and nominal attributes, while
Experimental results and discussion
We show here the comparison of the experimental results obtained by applying the classifier systems introduced above to the problem of classifying these databases.
We carried out different test suites for every classifier system: NPC, NNC, k-NNC, T.R.A.C.E. and PEL-C. Each test suite was prepared by using the leave-one-out procedure as a cross-validation technique [22], [42]. Using this technique, each instance of the database in turn is left out, and the learning method is trained on all the
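The leave-one-out procedure can be sketched as follows: each instance is held out in turn, the classifier is built from the remaining instances, and the held-out instance is used as the single test case. This is a minimal sketch and not the experimental code of the study. The nearest neighbour classifier with Euclidean distance, the function names and the toy data are all assumptions made for illustration.

```python
import math


def nnc_predict(train_X, train_y, query):
    """Nearest neighbour classifier used here as the method under evaluation."""
    nearest = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return train_y[nearest]


def leave_one_out_accuracy(X, y, predict):
    """Fraction of instances classified correctly when each is held out in turn."""
    correct = 0
    for i in range(len(X)):
        # Train on every instance except the i-th one.
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if predict(rest_X, rest_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Toy data, invented for illustration only.
X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [5.0, 5.0], [6.0, 5.5], [5.5, 4.5]]
y = ["A", "A", "A", "B", "B", "B"]

accuracy = leave_one_out_accuracy(X, y, nnc_predict)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Any classifier exposing the same `predict(train_X, train_y, query)` signature can be passed to `leave_one_out_accuracy`, so the same loop serves to compare several classifiers on one database.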
Conclusions
In this study, we have investigated the behaviour of five classifiers belonging to the family of IB learning, and we have applied them to three medical databases.
We have considered classifiers with different types of internal representation, which varies among prototype-based, exemplar-based and a hybrid of the two, to analyze not only variations in diagnostic classification performance, but also the kind of knowledge extracted by these classifiers. From this perspective, we have specified the
Acknowledgments
The author wishes to thank Roberto Cordeschi for the valuable discussions and suggestions, and the three anonymous referees for their constructive comments and suggestions, which helped to improve the present paper. Heartfelt thanks are due to Angela Brindisi for her support.
References (48)
- et al. Uniqueness of medical data mining. Artif Intell Med (2002)
- et al. Applying instance-based techniques to prediction of final outcome in acute stroke. Artif Intell Med (2005)
- et al. Supervised pattern recognition for the prediction of contrast-enhancement appearance in brain tumors from multivariate magnetic resonance imaging and spectroscopy. Artif Intell Med (2008)
- et al. Medical diagnosis of atherosclerosis from Carotid Artery Doppler Signals using principal component analysis (PCA), k-NN based weighting pre-processing and Artificial Immune Recognition System (AIRS). J Biomed Inform (2008)
- et al. Prediction of diagnosis in patients with early arthritis using a combined Kohonen mapping and instance-based evaluation criterion. Artif Intell Med (2004)
- et al. Prototype based fuzzy classification in clinical proteomics. Int J Approx Reason (2008)
- et al. Adaptive prototype-based fuzzy classification. Fuzzy Sets Syst (2008)
- et al. A software package for interactive motor unit potential classification using fuzzy k-NN classifier. Comput Methods Programs Biomed (2008)
- et al. Formal methods in pattern recognition: a review. Eur J Oper Res (2000)
- et al. Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif Intell Med (2001)