Introducing semantic variables in mixed distance measures: Impact on hierarchical clustering

  • Regular Paper
  • Knowledge and Information Systems

Abstract

It is well known that taking into account the semantic information available for categorical variables appreciably improves the meaningfulness of the results of an analysis. This paper generalizes Gibert’s mixed metrics, which originally handled numerical and categorical variables, so that they also include semantic variables. Semantic variables are defined as categorical variables linked to a reference ontology, where an ontology is a formal structure that models the semantic relationships between the concepts of a given domain. The superconcept-based distance (SCD) is introduced to compare the values of semantic variables using the information provided by the reference ontology. A benchmark shows that SCD performs well compared with other measures from the literature for comparing semantic features. Gibert’s mixed metrics are then generalized by incorporating SCD. Finally, two real applications based on tourism data show the impact of the generalized metrics on clustering procedures and, consequently, the impact of taking the reference ontology into account in clustering. The main conclusion is that the reference ontology, when available, can appreciably improve the meaningfulness of the final clusters.
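The abstract does not reproduce any formulas, so the following Python sketch only illustrates the general idea of a mixed distance with a semantic component. Everything in it is an assumption made for illustration: the toy ontology `PARENTS`, the function names, the form of the semantic term (the square root of the share of superconcepts that two concepts do not have in common), the range-scaled numerical term, the simple-matching categorical term, and the weights. It should not be read as the exact definition of SCD or of the generalized metrics given in the paper.

```python
from math import sqrt

# Hypothetical toy ontology: each concept maps to its direct superconcepts.
PARENTS = {
    "activity": [],
    "sport": ["activity"],
    "culture": ["activity"],
    "sailing": ["sport"],
    "hiking": ["sport"],
    "museum": ["culture"],
}


def superconcepts(concept):
    """Return the concept together with all of its ancestors in PARENTS."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def semantic_distance(a, b):
    """Assumed form: sqrt of the fraction of superconcepts not shared by a and b."""
    sa, sb = superconcepts(a), superconcepts(b)
    union = sa | sb
    return sqrt((len(union) - len(sa & sb)) / len(union))


def mixed_sq_distance(x, y, weights=(1.0, 1.0, 1.0), num_range=1.0):
    """Weighted sum of squared numerical, categorical and semantic terms.

    The component definitions and weights are placeholders, not the paper's.
    """
    w_num, w_cat, w_sem = weights
    d_num = ((x["num"] - y["num"]) / num_range) ** 2   # range-scaled difference
    d_cat = 0.0 if x["cat"] == y["cat"] else 1.0       # simple matching
    d_sem = semantic_distance(x["sem"], y["sem"]) ** 2  # ontology-based term
    return w_num * d_num + w_cat * d_cat + w_sem * d_sem


# Two hypothetical tourist records: age, accommodation type, main activity.
tourist_a = {"num": 30, "cat": "hotel", "sem": "sailing"}
tourist_b = {"num": 40, "cat": "hotel", "sem": "hiking"}
d2 = mixed_sq_distance(tourist_a, tourist_b, num_range=50)
```

In this toy ontology, "sailing" and "hiking" share the superconcepts "sport" and "activity", so they come out closer to each other than "sailing" and "museum", which share only "activity". A plain categorical comparison would treat all three values as equally different, which is the kind of information loss the abstract attributes to ignoring the reference ontology.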

Acknowledgments

This work was partially supported by the Spanish Ministry of Science and Innovation (DAMASK, TIN2009-11005) under the Spanish Government's PlanE (Spanish Economy and Employment Stimulation Plan). Montserrat Batet was supported by a research grant from the Universitat Rovira i Virgili. The testing was made possible by the data provided by the “Observatori de la Fundació d’Estudis Turístics Costa Daurada” and the “Parc Nacional del Delta de l’Ebre (Departament de Medi Ambient i Habitatge, Generalitat de Catalunya)”. Thanks to S. Clavé for his close collaboration. The authors also acknowledge the collaboration of E. Fourier, D. Corcho, N. Malé and N. Corral in the data preparation.

Author information

Corresponding author

Correspondence to Karina Gibert.

About this article

Cite this article

Gibert, K., Valls, A. & Batet, M. Introducing semantic variables in mixed distance measures: Impact on hierarchical clustering. Knowl Inf Syst 40, 559–593 (2014). https://doi.org/10.1007/s10115-013-0663-5

  • DOI: https://doi.org/10.1007/s10115-013-0663-5
