Abstract
Feature extraction is an important step in pattern classification, and the quality of the features in discriminating between classes plays an important role in classification performance. Real-life pattern classification may require a high dimensional feature space, and such a space cannot be visualized directly if its dimension is greater than four. In this paper, we propose a Similarity-Dissimilarity plot that projects a high dimensional feature space onto a two dimensional space while retaining the characteristics required to assess the discrimination quality of the features. The plot reveals the amount of overlap between the features of different classes. Separable data points of different classes, which an appropriate classifier can classify correctly, are also visible on the plot, so approximate classification accuracy can be predicted. Moreover, the plot indicates the class with which misclassified data points are likely to be confused by the classifier. Outlier data points can also be located on the plot. Various examples of synthetic data are used to highlight important characteristics of the proposed plot, and some real-life examples from biomedical data are also analyzed. The proposed plot is independent of the number of dimensions of the feature space.
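The abstract does not define the two axes of the plot, so the following is only an illustrative sketch under a stated assumption: each data point is mapped to the pair (distance to its nearest neighbor of the same class, distance to its nearest neighbor of any other class), using Euclidean distance. The function name `sd_coordinates` and the toy data are hypothetical and are not taken from the paper.

```python
import math


def sd_coordinates(points, labels):
    """Map each labeled point to a (similarity, dissimilarity) pair.

    ASSUMPTION (not confirmed by the abstract): similarity is the
    Euclidean distance to the nearest point of the same class, and
    dissimilarity is the distance to the nearest point of another class.
    A value of math.inf means no such neighbor exists.
    """
    coords = []
    for i, (p, label_p) in enumerate(zip(points, labels)):
        nearest_same = math.inf
        nearest_other = math.inf
        for j, (q, label_q) in enumerate(zip(points, labels)):
            if i == j:
                continue  # a point is not its own neighbor
            d = math.dist(p, q)  # works for any number of dimensions
            if label_q == label_p:
                nearest_same = min(nearest_same, d)
            else:
                nearest_other = min(nearest_other, d)
        coords.append((nearest_same, nearest_other))
    return coords


# Hypothetical 5-dimensional toy data: two classes, with one "b" point
# placed in the middle of class "a".
points = [
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0, 0.0, 0.0),
    (10.0, 0.0, 0.0, 0.0, 0.0),
    (11.0, 0.0, 0.0, 0.0, 0.0),
    (0.5, 0.0, 0.0, 0.0, 0.0),
]
labels = ["a", "a", "b", "b", "b"]

for label, (sim, dis) in zip(labels, sd_coordinates(points, labels)):
    # Under the assumption above, dis < sim flags a point that lies
    # closer to another class than to its own.
    print(label, sim, dis, "overlap" if dis < sim else "separable")
```

Under this assumed construction, the output is always two dimensional regardless of the dimension of the input, and points whose nearest other-class neighbor is closer than their nearest same-class neighbor fall on one side of the diagonal, which is one way the overlap described in the abstract could become visible.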
Cite this article
Arif, M. Similarity-Dissimilarity Plot for Visualization of High Dimensional Data in Biomedical Pattern Classification. J Med Syst 36, 1173–1181 (2012). https://doi.org/10.1007/s10916-010-9579-8
DOI: https://doi.org/10.1007/s10916-010-9579-8