Abstract
One of the challenges in Multimedia Event Retrieval is the integration of data from multiple modalities. A modality is defined as a single channel of sensory input, such as visual or audio. We also refer to a modality as a data source. Previous research has shown that integrating different data sources can improve performance compared to using only one source, but clear insight into the success factors of alternative fusion methods is still lacking. We introduce several new blind late fusion methods based on inversions and ratios of the state-of-the-art blind fusion methods and compare their performance both in simulations and on TRECVID MED, an international benchmark data set for multimedia event retrieval. The results show that five of the proposed methods outperform the state-of-the-art methods when sufficient training examples (100 examples) are available. The novel fusion method JRER is not only the best method with dependent data sources but is also robust in all simulations with sufficient training examples.
Acknowledgements
We would like to thank the TNO Early Research Program Making Sense of Big Data (MSoBD) for financial support. The work described in this paper was supported in part by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 120213).
Cite this article
de Boer, M.H.T., Schutte, K., Zhang, H. et al. Blind late fusion in multimedia event retrieval. Int J Multimed Info Retr 5, 203–217 (2016). https://doi.org/10.1007/s13735-016-0112-9
DOI: https://doi.org/10.1007/s13735-016-0112-9