Abstract
In pattern discovery, there is much interest in finding small sets of patterns that characterize the data well. When each dataset is represented by such a small set of characterizing patterns, an interesting problem is to compare datasets by comparing their representative pattern sets. In this paper, we propose a novel kernel function for measuring the similarity between two sets of patterns. The kernel evaluates the structural similarities between the patterns in the two sets, weighted by their relative frequencies in the data. We define the kernel for injective serial episodes and for itemsets, and present an efficient algorithm for computing it. We demonstrate the effectiveness of the kernel in classification scenarios and in change detection, using sequential datasets and transaction databases.
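The general idea of a frequency-weighted similarity between two pattern sets can be illustrated with a minimal sketch for itemsets. This is not the kernel defined in the paper, whose exact form is not given in the abstract. It assumes a hypothetical base similarity that counts the non-empty itemsets contained in both patterns, and sums it over all pattern pairs weighted by the product of their relative frequencies. The function names and toy datasets are illustrative only.

```python
from math import sqrt


def common_subset_kernel(a, b):
    """Assumed base similarity: number of non-empty itemsets contained in both a and b."""
    return 2 ** len(a & b) - 1


def pattern_set_similarity(P, Q, base=common_subset_kernel):
    """Frequency-weighted sum of pairwise structural similarities.

    P and Q map each pattern (a frozenset of items) to its relative frequency.
    """
    return sum(fp * fq * base(p, q) for p, fp in P.items() for q, fq in Q.items())


def normalized_similarity(P, Q, base=common_subset_kernel):
    """Cosine-style normalization so that a pattern set has similarity 1 with itself."""
    denom = sqrt(pattern_set_similarity(P, P, base) * pattern_set_similarity(Q, Q, base))
    return pattern_set_similarity(P, Q, base) / denom if denom > 0 else 0.0


# Toy pattern sets (hypothetical data): pattern -> relative frequency.
d1 = {frozenset({"bread", "milk"}): 0.6, frozenset({"beer"}): 0.4}
d2 = {frozenset({"bread", "milk", "eggs"}): 0.5, frozenset({"beer", "chips"}): 0.5}
d3 = {frozenset({"tea"}): 1.0}

if __name__ == "__main__":
    print(normalized_similarity(d1, d2))  # overlapping patterns
    print(normalized_similarity(d1, d3))  # no shared items
```

Because the assumed base similarity is an inner product over indicator features of shared sub-itemsets, the weighted sum is itself a valid kernel, so the resulting Gram matrix could be passed to a kernel classifier such as an SVM.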
Notes
1. A preliminary version of this paper was presented as a poster at the 2nd IKDD Conference on Data Sciences, CoDS 2015 [6].
2. Various definitions of frequency have been proposed for episodes [1]. We impose no condition on which definition is used; hence \(fr(\alpha )\) could be any measure of the relative significance of episode \(\alpha \) in the data.
3. Injective because itemsets, by definition, do not contain repeated items.
References
1. Achar, A., Laxman, S., Sastry, P.S.: A unified view of the apriori-based algorithms for frequent episode discovery. Knowl. Inf. Syst. 31(2), 223–250 (2012)
2. Archer, B., Shivakumar, S., Rowe, A., Rajkumar, R.: Profiling primitives of networked embedded automation. In: IEEE International Conference on Automation Science and Engineering, CASE 2009, pp. 531–536. IEEE (2009)
3. Fernando, B., Fromont, E., Tuytelaars, T.: Effective use of frequent itemset mining for image classification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part I. LNCS, vol. 7572, pp. 214–227. Springer, Heidelberg (2012)
4. Gärtner, T., Flach, P.A., Wrobel, S.: On graph kernels: hardness results and efficient alternatives. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 129–143. Springer, Heidelberg (2003)
5. Ibrahim, A.: Effective characterization of sequence data through frequent episodes. Ph.D. thesis (under review), Indian Institute of Science, Bangalore (2015, submitted)
6. Ibrahim, A., Sastry, P.S., Sastry, S.: Pattern set kernel. In: Proceedings of the Second ACM IKDD Conference on Data Sciences, pp. 122–123. ACM (2015)
7. Ibrahim, A., Sastry, S., Sastry, P.S.: Discovering compressing serial episodes from event sequences. Knowl. Inf. Syst. 1–28 (2015). http://link.springer.com/article/10.1007/s10115-015-0854-3
8. Kondor, R., Jebara, T.: A kernel between sets of vectors. In: ICML, vol. 20, p. 361 (2003)
9. Lam, H.T., Mörchen, F., Fradkin, D., Calders, T.: Mining compressing sequential patterns. Stat. Anal. Data Min. 7(1), 34–52 (2014)
10. Lichman, M.: UCI machine learning repository (2013)
11. Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444 (2002)
12. Lyu, S.: A kernel between unordered sets of data: the gaussian mixture approach. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 255–267. Springer, Heidelberg (2005)
13. Tatti, N., Vreeken, J.: The long, the short of it: summarising event sequences with serial episodes. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 462–470. ACM (2012)
14. van Leeuwen, M., Vreeken, J., Siebes, A.: Compression picks item sets that matter. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 585–592. Springer, Heidelberg (2006)
15. Vreeken, J., Van Leeuwen, M., Siebes, A.: Characterising the difference. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 765–774. ACM (2007)
16. Vreeken, J., Van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress. Data Min. Knowl. Disc. 23(1), 169–214 (2011)
17. Xin, D., Han, J., Yan, X., Cheng, H.: Mining compressed frequent-pattern sets. In: Proceedings of the 31st International Conference on Very Large Data Bases, pp. 709–720. VLDB Endowment (2005)
18. Yan, X., Han, J., Afshar, R.: Clospan: mining closed sequential patterns in large datasets. In: Proceedings of SIAM International Conference on Data Mining, pp. 166–177. SIAM (2003)
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ibrahim, A., Sastry, P.S., Sastry, S. (2016). Analyzing Similarities of Datasets Using a Pattern Set Kernel. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science(), vol 9651. Springer, Cham. https://doi.org/10.1007/978-3-319-31753-3_22
DOI: https://doi.org/10.1007/978-3-319-31753-3_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31752-6
Online ISBN: 978-3-319-31753-3
eBook Packages: Computer Science, Computer Science (R0)