ABSTRACT
With the rapid growth of multimedia data, it is increasingly important to develop semantic concept modeling approaches that are consistently effective, highly efficient, and easily scalable. To this end, we first propose the robust subspace bagging (RB-SBag) algorithm, which augments random subspace bagging with forward model selection. Compared with traditional modeling approaches, RB-SBag offers a considerably faster learning process while reducing the risk of overfitting. Its ensemble structure also maps naturally onto the MapReduce parallel programming framework. To further improve scalability, we develop a task scheduling algorithm that optimizes task placement for heterogeneous tasks. On a collection of more than 250,000 images and several standard TRECVID benchmark datasets, RB-SBag achieved more than a 10-fold speedup over baseline SVMs with comparable or even better classification performance. We also deployed the MapReduce implementation on a 16-node Hadoop cluster, where the proposed task scheduler demonstrated significantly better scalability than the baseline scheduler in the presence of task heterogeneity.
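The two ingredients of RB-SBag named above can be illustrated with a minimal sketch: each base model is trained on a random subset of features (a random subspace) and a bootstrap sample of examples, and the final ensemble is then assembled by forward model selection on a held-out validation set. This is not the paper's implementation; for self-containedness it substitutes a nearest-centroid base learner for the SVMs used in the paper, and all function names and parameters (`rb_sbag`, `feat_frac`, `samp_frac`) are illustrative assumptions.

```python
import random
from statistics import mean

def train_centroid(train, feats):
    """Nearest-centroid base learner restricted to a feature subspace.

    train: list of (feature_vector, label); feats: indices of the subspace.
    Stand-in for the SVM base learners of the original algorithm.
    """
    cents = {}
    for label in {y for _, y in train}:
        pts = [x for x, y in train if y == label]
        cents[label] = [mean(p[f] for p in pts) for f in feats]

    def predict(x):
        # squared distance to each class centroid, measured in the subspace
        def dist(c):
            return sum((x[f] - cv) ** 2 for f, cv in zip(feats, c))
        return min(cents, key=lambda lbl: dist(cents[lbl]))
    return predict

def ensemble_accuracy(models, data):
    """Majority-vote accuracy of an ensemble of predictors."""
    correct = 0
    for x, y in data:
        votes = [m(x) for m in models]
        if max(set(votes), key=votes.count) == y:
            correct += 1
    return correct / len(data)

def rb_sbag(train, valid, n_models=20, feat_frac=0.5, samp_frac=0.7, seed=0):
    """Random subspace bagging followed by forward model selection."""
    rng = random.Random(seed)
    dim = len(train[0][0])
    # Step 1: build a pool of base models, each on a random subspace
    # of the features and a bootstrap sample of the training examples.
    pool = []
    for _ in range(n_models):
        feats = rng.sample(range(dim), max(1, int(feat_frac * dim)))
        boot = [rng.choice(train) for _ in range(int(samp_frac * len(train)))]
        pool.append(train_centroid(boot, feats))
    # Step 2: forward model selection -- greedily add the pool model that
    # most improves validation accuracy; stop when no candidate helps.
    selected, best = [], 0.0
    while pool:
        cand = max(pool, key=lambda m: ensemble_accuracy(selected + [m], valid))
        score = ensemble_accuracy(selected + [cand], valid)
        if score < best:
            break
        selected.append(cand)
        pool.remove(cand)
        best = score
    return selected
```

Because each base model in the pool is trained independently, Step 1 is embarrassingly parallel, which is what makes the algorithm a natural fit for a MapReduce-style deployment: each map task can train one base model, and a reduce phase can perform the selection.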
Index Terms
- Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce