Abstract
In this paper, we propose an efficient similarity measure as a pre-processing method for clustering categorical and sequential attributes. The measure is based on a new dynamic programming algorithm that computes a sequence comparison score from a gap penalty matrix; the similarity is obtained by normalizing this score. We evaluate the proposed measure through clustering experiments, since clustering is an unsupervised learning technique whose results depend strongly on the similarity measure between clusters. The experiments use tcpdump data from the DARPA 1999 Intrusion Detection Evaluation Data Sets, which consist of sequences of network packet data. Finally, we discuss the results of the comparison experiments.
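The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea: score two categorical sequences with dynamic programming, then normalize the score. It uses a standard global alignment (Needleman-Wunsch style) recurrence with a single linear gap penalty, not the paper's gap penalty matrix. The function names, the scoring values (`match`, `mismatch`, `gap`) and the min-max normalization are all assumptions made for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (assumed scoring scheme)."""
    cols = len(b) + 1
    # Row 0: aligning an empty prefix of a against b costs one gap per symbol.
    prev = [j * gap for j in range(cols)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two symbols, gap in b, gap in a.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def normalized_similarity(a, b, match=1, mismatch=-1, gap=-1):
    """Rescale the alignment score to [0, 1]; identical sequences give 1.0.

    Assumes match > 0 and mismatch, gap <= 0, so the score lies between
    min(mismatch, gap) * n and match * n, where n is the longer length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    best = match * longest
    worst = min(mismatch, gap) * longest
    return (alignment_score(a, b, match, mismatch, gap) - worst) / (best - worst)


if __name__ == "__main__":
    # Hypothetical categorical sequences, e.g. protocol labels of packets.
    s1 = ["tcp", "tcp", "udp", "icmp"]
    s2 = ["tcp", "udp", "icmp"]
    print(normalized_similarity(s1, s2))
```

A value in [0, 1] like this can be converted to a distance (for example `1 - similarity`) before being passed to a clustering routine; whether the paper does so is not stated in the abstract.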
This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment) (IITA-2006-C1090-0603-0027).
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Noh, SK., Kim, YM., Kim, D., Noh, BN. (2006). An Efficient Similarity Measure for Clustering of Categorical Sequences. In: Sattar, A., Kang, Bh. (eds) AI 2006: Advances in Artificial Intelligence. AI 2006. Lecture Notes in Computer Science, vol 4304. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11941439_41
DOI: https://doi.org/10.1007/11941439_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49787-5
Online ISBN: 978-3-540-49788-2
eBook Packages: Computer Science, Computer Science (R0)