How to measure similarity for multiple categorical data sets?

Park, Simon Soon-Hyoung; Song, Justin JongSu; Lee, James Jung-Hoon; Lee, Wookey; Ree, Sangbok

doi:10.1007/s11042-014-1914-5

Published: 08 April 2014

Volume 74, pages 3489–3505, (2015)

Multimedia Tools and Applications

Simon Soon-Hyoung Park¹, Justin JongSu Song¹, James Jung-Hoon Lee¹, Wookey Lee¹ & Sangbok Ree²

Abstract

How should similarity or distance be measured for multiple categorical data sets? Measuring the similarity or distance between objects appropriately is an important step in data mining and knowledge management. Measures for continuous data are well defined and relatively easy to calculate. The notion of similarity for categorical data is less simple: categorical data usually cannot be translated directly into a numerical format, and categories have their own priorities, structures, and data distributions. In this paper, we propose a new measure for multiple categorical data sets that uses the data distribution. Our measure, MCSM (Multiple Categorical Similarity Measure), addresses the drawbacks of conventional measures for multiple categorical data sets; we verify it with mathematical proofs and experiments. The experimental results show that the measure is effective for multiple categorical data sets with proper data distributions.
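As a rough illustration of why the data distribution matters for categorical similarity, the minimal sketch below contrasts plain attribute overlap with a frequency-weighted variant in which a match on a rare value counts more than a match on a common one. This is not MCSM, whose definition is not given in this abstract. The function names, the weighting rule (a match on a value with relative frequency p contributes 1 - p²), and the toy records are all assumptions made for illustration.

```python
from collections import Counter


def overlap_similarity(a, b):
    """Fraction of attributes on which two records take the same value."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def value_frequencies(records):
    """Per-attribute relative frequency of each categorical value."""
    n = len(records)
    columns = zip(*records)
    return [{v: c / n for v, c in Counter(col).items()} for col in columns]


def frequency_weighted_similarity(a, b, freqs):
    """Overlap in which a match on a rare value counts more than a match
    on a common value (hypothetical weighting: 1 - p**2 per match).
    Assumes both records come from the data set used to build freqs."""
    total = 0.0
    for x, y, f in zip(a, b, freqs):
        if x == y:
            total += 1.0 - f[x] ** 2
    return total / len(a)


# Toy data set (illustrative only): two categorical attributes.
records = [
    ("red", "small"),
    ("red", "large"),
    ("red", "small"),
    ("blue", "small"),
]
freqs = value_frequencies(records)

# Identical records: plain overlap reports a perfect match, while the
# weighted variant discounts it because both values are common.
print(overlap_similarity(records[0], records[2]))
print(frequency_weighted_similarity(records[0], records[2], freqs))
```

Plain overlap treats every match alike, so it cannot distinguish agreement on a value shared by most of the data from agreement on a rare one. Any distribution-aware measure has to make some choice of this kind, and the weighting here is only one simple option.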

Acknowledgments

This work was supported by Inha University, Seokyeong University, and a National Research Foundation of Korea (NRF) grant funded by the Korean Government (MOE) (NRF-2013R1A1A2012887).

Author information

Authors and Affiliations

Department of Industrial Engineering, INHA University, Incheon, South Korea
Simon Soon-Hyoung Park, Justin JongSu Song, James Jung-Hoon Lee & Wookey Lee
Department of Industrial Engineering, Seokyeong University, Seoul, South Korea
Sangbok Ree

Corresponding author

Correspondence to Wookey Lee.

Appendix: Experimental results on ACM classification

A1 INTRODUCTORY AND SURVEY 2

C2 COMPUTER-COMMUNICATION NETWORKS 2

C20 Security and protection (e.g., firewalls) 1

C21 Network Architecture and Design 3

C22 Network Protocols 4

C24 Distributed Systems 13

C25 Internet 1

C26 Standards (e.g., TCP/IP) 1

C2m Miscellaneous 1

C4 Design studies 14

D15 Object-oriented Programming 1

D2 SOFTWARE ENGINEERING 1

D20 Protection mechanisms 2

D21 Requirements/Specifications 2

D211 Information hiding 4

D212 Distributed objects 6

D213 Reuse models 2

D22 Design Tools and Techniques 6

D24 Formal methods 4

D25 Testing and Debugging 6

D26 Programming Environments 3

D28 Performance measures 8

D29 Management 5

D3 PROGRAMMING LANGUAGES 1

D31 Formal Definitions and Theory 4

D32 Language Classifications 2

D33 Frameworks 7

D34 Retargetable compilers 9

D46 Security and Protection 3

E1 Graphs and networks 5

E2 DATA STORAGE REPRESENTATIONS 2

E4 Error control codes 3

F11 Models of Computation 2

F2 ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY 1

F20 General 5

F22 Nonnumerical Algorithms and Problems 1

F32 Semantics of Programming Languages 2

F43 Formal Languages 2

G16 Optimization 1

G21 Combinatorial algorithms 2

G22 Network problems 3

G3 PROBABILITY AND STATISTICS 8

H0 GENERAL 1

H1 MODELS AND PRINCIPLES 2

H10 General 2

H11 Systems and Information Theory 7

H12 Human factor 2

H1m Miscellaneous 1

H20 General 1

H21 Schema and subschema 3

H23 Data description languages (DDL) 6

H24 Query processing 11

H27 Security, integrity, and protection 2

H28 Database applications 16

H2m Miscellaneous 1

H3 INFORMATION STORAGE AND RETRIEVAL 4

H30 General 4

H31 Content Analysis and Indexing 19

H32 Information Storage 1

H33 Information Search and Retrieval 119

H34 Systems and Software 22

H35 On-line Information Services 9

H36 Library Automation 37

H37 Dissemination 1

H3m Miscellaneous 4

H4 INFORMATION SYSTEMS APPLICATIONS 2

H40 General 2

H43 Communications Applications 9

H4m Miscellaneous 20

H5 INFORMATION INTERFACES AND PRESENTATION 1

H51 Multimedia Information Systems 3

H52 User Interfaces 21

H53 Group and Organization Interfaces 23

H54 Hypertext/Hypermedia 17

Hm MISCELLANEOUS 4

I2 ARTIFICIAL INTELLIGENCE 1

I20 General 1

I22 Program verification 1

I23 Deduction and Theorem Proving 2

I24 Knowledge Representation Formalisms and Methods 20

I26 Learning 11

I27 Text analysis 10

I28 Graph and tree search strategies 1

I2m Miscellaneous 1

I2n Distributed Artificial Intelligence 3

I36 Interaction techniques 1

I4 IMAGE PROCESSING AND COMPUTER VISION 1

I51 Neural nets 1

I52 Classifier design and evaluation 5

I53 Algorithms 1

I54 Text processing 2

I65 Model Development 2

I6m Miscellaneous 1

I7 DOCUMENT AND TEXT PROCESSING 1

I72 Document Preparation 3

I75 Document analysis 1

I7m Miscellaneous 2

J0 GENERAL 3

J2 Chemistry 2

J4 SOCIAL AND BEHAVIORAL SCIENCES 15

J5 Performing arts (e.g., dance, music) 1

Jm MISCELLANEOUS 1

K31 Computer Uses in Education 5

K41 Public Policy Issues 4

K42 Assistive technologies for persons with disabilities 2

K43 Organizational Impacts 3

K44 Electronic Commerce 11

K4m Miscellaneous 1

K52 Governmental Issues 1

K63 Software Management 1

K64 System Management 2

K65 Security and Protection 12

About this article

Cite this article

Park, S.SH., Song, J.J., Lee, J.JH. et al. How to measure similarity for multiple categorical data sets?. Multimed Tools Appl 74, 3489–3505 (2015). https://doi.org/10.1007/s11042-014-1914-5

Issue Date: May 2015
DOI: https://doi.org/10.1007/s11042-014-1914-5
