Abstract
Sampling is an important preprocessing step for mining large data efficiently. Although a simple random sample often works well at a reasonable sample size, accuracy falls sharply as the sample size is reduced. At KDD’03 we proposed EASE, which outputs a sample based on its ‘closeness’ to the original sample. Reported results show that EASE outperforms simple random sampling (SRS). In this paper we propose EASIER, which extends EASE in two ways. (1) EASE is a halving algorithm: to achieve the required sample ratio, it starts from a suitably large initial sample and iteratively halves it. EASIER, by contrast, does away with the repeated halving and obtains the required sample ratio directly in one iteration. (2) EASE was shown to work on the IBM QUEST dataset, which is a categorical count dataset. EASIER is additionally shown to work on continuous data such as the Color Structure Descriptor of images. Two mining tasks, classification and association rule mining, are used to validate the efficacy of EASIER samples vis-à-vis EASE and SRS samples.
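The abstract does not define the ‘closeness’ criterion. As a rough illustration only, the sketch below assumes one simple stand-in: the mean absolute difference in per-item frequency between a sample and the full dataset. It draws simple random samples at decreasing ratios from a synthetic transaction dataset and prints that distance, which gives a feel for why small random samples can become unrepresentative. The function names, the distance measure, and the toy data are all hypothetical; this is not the EASE or EASIER algorithm.

```python
import random
from collections import Counter


def item_frequencies(transactions):
    """Return the fraction of transactions that contain each item."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}


def frequency_distance(sample, data):
    """Mean absolute difference in item frequency between sample and data.

    Assumed stand-in for 'closeness'; the paper's own measure may differ.
    """
    f_data = item_frequencies(data)
    f_sample = item_frequencies(sample)
    total = sum(abs(f_sample.get(item, 0.0) - f) for item, f in f_data.items())
    return total / len(f_data)


def simple_random_sample(data, ratio, rng):
    """Draw a simple random sample (without replacement) of the given ratio."""
    k = max(1, round(ratio * len(data)))
    return rng.sample(data, k)


# Synthetic toy data: 2000 transactions of 5 items drawn from 20 items.
rng = random.Random(0)
items = list(range(20))
data = [tuple(rng.sample(items, 5)) for _ in range(2000)]

for ratio in (0.5, 0.05, 0.005):
    sample = simple_random_sample(data, ratio, rng)
    print(ratio, len(sample), round(frequency_distance(sample, data), 4))
```

A distance of zero means the sample reproduces every item frequency exactly. A sampler that selects for closeness would aim to keep this value low at ratios where a purely random draw does not.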
References
Brönnimann, H., Chen, B., Dash, M., Haas, P., Scheuermann, P.: Efficient data reduction with EASE. In: Proc. 9th Int. Conf. on KDD, pp. 59–68 (2003)
Chen, B., Haas, P., Scheuermann, P.: A new two-phase sampling based algorithm for discovering association rules. In: Proc. Int. Conf. on ACM SIGKDD (2002)
Chapelle, O., Haffner, P., Vapnik, V.N.: Support vector machine for histogram based image classification. IEEE Trans. on Neural Networks 10 (1999)
Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. Int. Conf. on VLDB (1994)
ISO/IEC15938-8/FDIS3: Information Technology - Multimedia Content Description Interface - Part 8 (Extraction and use of MPEG-7 descriptions)
Ojala, T., Aittola, M., Matinmikko, E.: Empirical evaluation of MPEG-7 XM color descriptors in content-based retrieval of semantic image categories. In: Proc. 16th Int. Conf. on Pattern Recognition, pp. 1021–1024 (2002)
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. Int. Conf. on ACM SIGMOD (2000)
Jin, R., Yan, R., Hauptmann, A.: Image classification using a bigram model. In: AAAI Spring Symposium on Intelligent Multimedia Knowledge Management (2003)
Vitter, J.: Random sampling with a reservoir. ACM Trans. Math. Software (1985)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, S., Dash, M., Chia, LT. (2005). Efficient Sampling: Application to Image Data. In: Ho, T.B., Cheung, D., Liu, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11430919_53
DOI: https://doi.org/10.1007/11430919_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26076-9
Online ISBN: 978-3-540-31935-1
eBook Packages: Computer Science, Computer Science (R0)