Abstract
It is well known that taking into account the semantic information available for categorical variables appreciably improves the meaningfulness of the results of an analysis. This paper presents a generalization of Gibert’s mixed metrics, which originally handled numerical and categorical variables, to also include semantic variables. Semantic variables are defined as categorical variables linked to a reference ontology (ontologies are formal structures that model the semantic relationships between the concepts of a given domain). The superconcept-based distance (SCD) is introduced to compare semantic variables using the information provided by the reference ontology. A benchmark shows that SCD performs well against other proposals from the literature for comparing semantic features. Gibert’s mixed metrics are then generalized by incorporating SCD. Finally, two real applications based on tourism data show the impact of the generalized Gibert’s metrics on clustering procedures and, consequently, the impact of taking the reference ontology into account in clustering. The main conclusion is that the reference ontology, when available, can appreciably improve the meaningfulness of the final clusters.
Acknowledgments
This work was partially supported by the Spanish Ministry of Science and Innovation (DAMASK, TIN2009-11005) under the Spanish Government PlanE (Spanish Economy and Employment Stimulation Plan). Montserrat Batet was supported by a research grant provided by the Universitat Rovira i Virgili. The testing was made possible by the data provided by “Observatori de la Fundació d’Estudis Turístics Costa Daurada” and “Parc Nacional del Delta de l’Ebre (Departament de Medi Ambient i Habitatge, Generalitat de Catalunya)”. Thanks to S. Clavé for his close collaboration. The authors also acknowledge the collaboration of E. Fourier, D. Corcho, N. Malé and N. Corral in the data preparation.
Cite this article
Gibert, K., Valls, A. & Batet, M. Introducing semantic variables in mixed distance measures: Impact on hierarchical clustering. Knowl Inf Syst 40, 559–593 (2014). https://doi.org/10.1007/s10115-013-0663-5
DOI: https://doi.org/10.1007/s10115-013-0663-5