ABSTRACT
Early fusion techniques have been proposed in video analysis tasks as a way to improve efficacy by generating compact data models capable of preserving the semantic clues present in multimodal data. First attempts to fuse multimodal data employed fusion operators at the low-level feature space, losing data representativeness. This drove later research efforts to evolve simple operators into complex operations, which became, in general, inseparable from the processing of multimodal semantic clues. In this paper, we investigate the application of early multimodal fusion operators at the mid-level feature space. Five different operators (Concatenation, Sum, Gram, Average and Maximum) were employed to fuse mid-level multimodal video features. The fused data derived from each operator were then used as input for two different video analysis tasks: Temporal Video Scene Segmentation and Video Classification. For each task, we performed a comparative analysis among the operators and techniques from related work that were designed for these tasks and use complex fusion operations. The efficacy results reached by the operators were very close to those reached by these techniques, providing strong evidence that working on a more homogeneous feature space can reduce known low-level fusion drawbacks. In addition, the operators make data fusion separable, allowing researchers to keep the focus on developing representations of semantic clues.
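As a rough illustration of the five operators named in the abstract, the following minimal Python sketch fuses two mid-level feature vectors element by element. Several details are assumptions and not taken from the paper: the function name `fuse`, the use of exactly two modalities (called `visual` and `aural` here), the requirement that both vectors have the same length, and in particular the Gram operator, which is interpreted here as the flattened matrix of pairwise inner products between the modality vectors. The paper's own definitions may differ.

```python
from typing import Dict, List

Vector = List[float]


def fuse(visual: Vector, aural: Vector) -> Dict[str, List[float]]:
    """Apply five simple fusion operators to two mid-level feature vectors.

    Hypothetical helper for illustration only; not the authors' code.
    """
    # Assumption: both modalities share one dimensionality (for example,
    # histograms of equal size), which sum, average and maximum require.
    if len(visual) != len(aural):
        raise ValueError("this sketch expects vectors of equal length")

    modalities = [visual, aural]
    # Gram (assumed definition): inner product of every pair of modality
    # vectors, flattened row by row into a single vector.
    gram = [
        sum(a * b for a, b in zip(u, v))
        for u in modalities
        for v in modalities
    ]

    return {
        "concatenation": visual + aural,
        "sum": [a + b for a, b in zip(visual, aural)],
        "average": [(a + b) / 2 for a, b in zip(visual, aural)],
        "maximum": [max(a, b) for a, b in zip(visual, aural)],
        "gram": gram,
    }


fused = fuse([1.0, 0.0, 2.0], [0.0, 3.0, 2.0])
for name, vector in fused.items():
    print(name, vector)
```

Each operator is independent of how the input vectors were produced, which is the sense in which such fusion stays separable from feature extraction. Note that concatenation grows the output with the number of modalities, while sum, average and maximum keep the original dimensionality.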
Index Terms
- Evaluating Early Fusion Operators at Mid-Level Feature Space