Abstract
Data imbalance is common in real-world datasets, especially massive video datasets. Traditional machine learning algorithms, however, assume a balanced data distribution and equal misclassification costs, so they struggle to describe the true data distribution and tend to misclassify. In this paper, we study the data imbalance problem in semantic extraction from massive video datasets and propose an enhanced and hierarchical structure (EHS) algorithm. The proposed algorithm integrates data sampling, filtering, and model training in a compact hierarchical structure, so that model performance improves step by step and remains robust and stable as features and datasets change. Experiments on TRECVID2010 Semantic Indexing demonstrate that the proposed algorithm performs substantially better than traditional machine learning algorithms and remains stable and robust when different kinds of features are employed. Extended experiments on TRECVID2010 Surveillance Event Detection further show that the EHS algorithm is efficient and effective, achieving top performance in four of seven events.
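The abstract does not give the details of EHS, so the following is only a minimal sketch of the general idea of combining sampling, filtering, and model training level by level. It is not the authors' implementation. The nearest-centroid base learner, the filtering rule (keep only the negatives that the current level still misclassifies), the score averaging, and all function names are assumptions made for illustration.

```python
import random


def centroid(points):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_level(pos, neg_sample):
    """Hypothetical base learner: a nearest-centroid classifier."""
    return centroid(pos), centroid(neg_sample)


def score(model, x):
    """Positive when x is closer to the positive (minority) centroid."""
    c_pos, c_neg = model
    return sq_dist(x, c_neg) - sq_dist(x, c_pos)


def train_hierarchy(pos, neg, levels=3, seed=0):
    """Train up to `levels` models, each on a balanced sample of negatives."""
    rng = random.Random(seed)
    models = []
    remaining = list(neg)
    for _ in range(levels):
        if not remaining:
            break
        # Sampling: draw a negative subset no larger than the positive set.
        sample = rng.sample(remaining, min(len(pos), len(remaining)))
        model = train_level(pos, sample)
        models.append(model)
        # Filtering (assumed rule): pass on only the negatives that this
        # level still scores as positive, i.e. the hard negatives.
        remaining = [x for x in remaining if score(model, x) > 0]
    return models


def predict(models, x):
    """Average the per-level scores; a positive mean predicts the minority class."""
    return sum(score(m, x) for m in models) / len(models) > 0


# Toy imbalanced data: 3 positives against 30 negatives.
pos = [[5.0, 5.0], [5.5, 4.5], [4.5, 5.5]]
neg = [[0.1 * i, -0.1 * i] for i in range(30)]
models = train_hierarchy(pos, neg)
```

Each level sees a balanced training set, and later levels concentrate on the negatives that earlier levels got wrong, which is one plausible reading of performance improving "step by step". In practice the base learner would be replaced by a stronger classifier.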
Acknowledgements
This material is based in part upon work supported by the National Science Foundation under Grants No. 0624236 and 0751185. Zan Gao is partially supported by the NSFC (No.90920001), and Key project in Science and Technology Pillar Program of Tianjin, P.R. China (10ZCKFGX00400). We also thank the anonymous reviewers for their valuable suggestions.
Cite this article
Gao, Z., Zhang, Lf., Chen, My. et al. Enhanced and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset. Multimed Tools Appl 68, 641–657 (2014). https://doi.org/10.1007/s11042-012-1071-7
DOI: https://doi.org/10.1007/s11042-012-1071-7