Efficient data reduction in multimedia data

Wang, Surong; Dash, Manoranjan; Chia, Liang-Tien; Xu, Min

doi:10.1007/s10489-006-0112-1

Published: December 2006

Volume 25, pages 359–374, (2006)

Applied Intelligence

Surong Wang¹,
Manoranjan Dash¹,
Liang-Tien Chia¹ &
…
Min Xu¹

Abstract

As the amount of multimedia data grows daily, thanks to cheaper storage devices and an increasing number of information sources, machine learning algorithms face ever larger datasets. When the original data is huge, small samples are preferred for many applications; this is typically the case for multimedia applications. But a simple random sample may not give satisfactory results, because random fluctuations in the sampling process can leave it unrepresentative of the entire data set. The difficulty is particularly apparent when small sample sizes are needed. Fortunately, a good sampling set for training can improve the final results significantly. In KDD’03 we proposed EASE, which outputs a sample based on its ‘closeness’ to the original sample. Reported results show that EASE outperforms simple random sampling (SRS). In this paper we propose EASIER, which extends EASE in two ways. (1) EASE is a halving algorithm, i.e., to achieve the required sample ratio it starts from a suitably large initial sample and iteratively halves it. EASIER, on the other hand, does away with the repeated halving by directly obtaining the required sample ratio in one iteration. (2) EASE was shown to work on the IBM QUEST dataset, which is a categorical count data set. EASIER, in addition, is shown to work on continuous data of image and audio features. We have successfully applied EASIER to image classification and audio event identification applications. Experimental results show that EASIER outperforms SRS significantly.
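The abstract's two points, that a small simple random sample can drift away from the full data and that a halving scheme needs several passes to reach a small ratio, can be illustrated with a minimal sketch. This is not the EASE or EASIER algorithm. The closeness measure below (Euclidean distance between item-frequency vectors) and the names `item_frequencies`, `frequency_distance`, `simple_random_sample` and `halvings_needed` are assumptions made for illustration only, and the data is synthetic.

```python
import random
from collections import Counter


def item_frequencies(transactions):
    """Fraction of transactions that contain each item."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}


def frequency_distance(sample, data):
    """Assumed closeness measure (not taken from the paper): Euclidean
    distance between item-frequency vectors. The sample is expected to be
    drawn from `data`, so only items seen in `data` are compared."""
    fs, fd = item_frequencies(sample), item_frequencies(data)
    return sum((fs.get(i, 0.0) - f) ** 2 for i, f in fd.items()) ** 0.5


def simple_random_sample(data, ratio, rng):
    """Simple random sample (SRS) without replacement at the given ratio."""
    k = max(1, round(len(data) * ratio))
    return rng.sample(data, k)


def halvings_needed(initial_ratio, target_ratio):
    """How many times a halving scheme must halve its sample to get from
    initial_ratio down to target_ratio."""
    steps, ratio = 0, initial_ratio
    while ratio > target_ratio:
        ratio /= 2
        steps += 1
    return steps


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic categorical data: 1000 transactions over 20 items.
    data = [[i for i in range(20) if rng.random() < 0.3] for _ in range(1000)]
    for ratio in (0.5, 0.1, 0.02):
        sample = simple_random_sample(data, ratio, rng)
        print(ratio, len(sample), frequency_distance(sample, data))
    print(halvings_needed(1.0, 0.125))  # 1.0 -> 0.5 -> 0.25 -> 0.125, so 3
```

Running the script prints the distance for each sample ratio, which gives a rough feel for how far an SRS can stray from the full data at small ratios. The halving count shows why obtaining the target ratio in one iteration can save passes over the data.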

References

Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of International Conference on Management of Data
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of International Conference on Very Large Databases
Angluin D (1988) Queries and concept learning. Mach Learn 2(4):319–342
Astashyn A (2004) Deterministic data reduction methods for transactional data sets. Master thesis
Atlas L, Cohn D, Ladner R, El-Sharkawi MA, Marks RJ II (1990) Training connectionist networks with queries and selective sampling. In: Advances in neural information processing systems, vol 2, Morgan Kaufmann Publishers Inc., pp 566–573
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Brönnimann H, Chen B, Dash M, Haas P, Scheuermann P (2003) Efficient data reduction with EASE. In: Proceedings of 9th International Conference on Knowledge Discovery and Data Mining, pp 59–68
Chapelle O, Haffner P, Vapnik VN (1999) Support vector machine for histogram based image classification. IEEE Trans Neur Netw 10(5):1055–1064
Chawla N, Eschrich S, Hall LO (2001) Creating ensembles of classifiers. In: Proceedings of International Conference on Data Mining, pp 580–581
Chen B, Haas P, Scheuermann P (2002) A new two-phase sampling based algorithm for discovering association rules. In: Proceedings of International Conference on Knowledge Discovery and Data Mining
Cohn DA, Ghahramani Z, Jordan MI (1995) Active learning with statistical models. In: Advances in neural information processing systems, vol 7, The MIT Press, pp 705–712
Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: International Conference on Machine Learning, pp 194–202
Duan LY, Xu M, Chua TS, Tian Q, Xu CS (2003) A mid-level representation framework for semantic sports video analysis. In: Proceedings of ACM Multimedia
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of International Conference on Management of Data
ISO/IEC15938-8/FDIS3. Information Technology—Multimedia Content Description Interface—Part 8: Extraction and use of MPEG-7 descriptions
Iyengar VS, Apte C, Zhang T (2000) Active learning using adaptive resampling. In: Proceedings of International Conference on Knowledge Discovery and Data Mining, pp 92–98
Jin R, Yan R, Hauptmann A (2003) Image classification using a bigram model. In: AAAI Spring Symposium on Intelligent Multimedia Knowledge Management
Lewis DD, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In: Cohen WW, Hirsh H (eds) Proceedings of 11th International Conference on Machine Learning, New Brunswick, US. Morgan Kaufmann Publishers, San Francisco, US, pp 148–156
Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: W. Bruce Croft and Cornelis J. van Rijsbergen (eds) Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, Dublin, IE. Springer Verlag, Heidelberg, DE, pp 3–12
Manjunath BS, Salembier P, Sikora T (2002) Introduction to MPEG-7. John Wiley & Sons, Ltd
Meek C, Thiesson B, Heckerman D (2002) The learning-curve sampling method applied to model-based clustering. J Mach Learn Res 2(3):397–418
Nepal S, Srinivasan U, Reynolds G (2001) Automatic detection of goal segments in basketball videos. In: Proceedings of ACM Multimedia, Los Angeles, CA
Ojala T, Aittola M, Matinmikko E (2002) Empirical evaluation of MPEG-7 XM color descriptors in content-based retrieval of semantic image categories. In: Proceedings of 16th International Conference on Pattern Recognition, Quebec, Canada, pp 1021–1024
Plutowski M, White H (1993) Selecting concise training sets from clean data. IEEE Trans Neur Netw 4(2):305–318
Rui Y, Gupta A, Acero A (2000) Automatically extracting highlights for TV baseball programs. In: Proceedings of ACM Multimedia, pp 105–115
Saar-Tsechansky M, Provost F (2001) Active learning for class probability estimation and ranking. In: Proceedings of 17th International Joint Conference on Artificial Intelligence, pp 911–920
Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of 8th ACM International Conference on Knowledge Discovery and Data Mining
Scheffer T, Decomain C, Wrobel S (2001) Active hidden Markov models for information extraction. In: Proceedings of the International Symposium on Intelligent Data Analysis
Tong S, Koller D (2000) Support vector machine active learning with applications to text classification. In: Langley P (ed) Proceedings of 17th International Conference on Machine Learning, Stanford, US. Morgan Kaufmann Publishers, San Francisco, US, pp 999–1006
Vitter JS (1985) Random sampling with a reservoir. ACM Trans Mathem Softw 11(1):37–57
Wang S, Dash M, Chia L-T (2005) Efficient sampling for image application. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining
Wang S, Xu M, Chia L-T, Dash M (2005) EASIER sampling for audio event identification. In: Proceedings of International Conference on Multimedia and Expo
Xu M, Duan L-Y, Cai J, Chia L-T, Xu C-S, Tian Q (2004) HMM-based audio keyword generation. In: Proceedings of Pacific Conference on Multimedia, vol 3, pp 566–574
Xu M, Duan L-Y, Chia L-T, Xu C-S (2004) Audio keywords generation for sports video analysis. In: Proceedings of ACM Multimedia
Young S et al (2002) The HTK Book (for HTK Version 3.1). Cambridge University Engineering Department

Author information

Authors and Affiliations

Center for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University, Singapore, 639798
Surong Wang, Manoranjan Dash, Liang-Tien Chia & Min Xu

Corresponding author

Correspondence to Surong Wang.

Additional information

Surong Wang received the B.E. and M.E. degrees from the School of Information Engineering, University of Science and Technology Beijing, China, in 1999 and 2002, respectively. She is currently studying toward the Ph.D. degree at the School of Computer Engineering, Nanyang Technological University, Singapore. Her research interests include multimedia data processing, image processing and content-based image retrieval.

Manoranjan Dash obtained Ph.D. and M.Sc. (Computer Science) degrees from the School of Computing, National University of Singapore. He has worked extensively in academic and research institutes and has published more than 30 research papers (mostly refereed) in various reputable machine learning and data mining journals, conference proceedings, and books. His research interests include machine learning and data mining, and their applications in bioinformatics, image processing, and GPU programming. Before joining the School of Computer Engineering (SCE), Nanyang Technological University, Singapore, as Assistant Professor, he worked as a postdoctoral fellow at Northwestern University. He is a member of IEEE and ACM. He has served as a program committee member of many conferences and is on the editorial board of “International journal of Theoretical and Applied Computer Science.”

Liang-Tien Chia received the B.S. and Ph.D. degrees from Loughborough University, in 1990 and 1994, respectively. He is an Associate Professor in the School of Computer Engineering, Nanyang Technological University, Singapore. He has recently been appointed as Head, Division of Computer Communications and he also holds the position of Director, Centre for Multimedia and Network Technology.

His research interests include image/video processing & coding, multimodal data fusion, multimedia adaptation/transmission and multimedia over the Semantic Web. He has published over 80 research papers.

About this article

Cite this article

Wang, S., Dash, M., Chia, LT. et al. Efficient data reduction in multimedia data. Appl Intell 25, 359–374 (2006). https://doi.org/10.1007/s10489-006-0112-1

Issue Date: December 2006
DOI: https://doi.org/10.1007/s10489-006-0112-1
