Evaluating Early Fusion Operators at Mid-Level Feature Space

Authors:
Antonio A. R. Beserra
University of São Paulo, Instituto de Ciências Matemáticas e de Computação, São Carlos, SP, Brazil

Rodrigo M. Kishi
Federal University of Mato Grosso do Sul, Três Lagoas Campus, Três Lagoas, MS, Brazil

Rudinei Goularte
University of São Paulo, Instituto de Ciências Matemáticas e de Computação, São Carlos, SP, Brazil

WebMedia '20: Proceedings of the Brazilian Symposium on Multimedia and the Web, November 2020, Pages 113–120, https://doi.org/10.1145/3428658.3431079

Published: 30 November 2020

ABSTRACT

Early fusion techniques have been proposed for video analysis tasks as a way to improve efficacy by generating compact data models that preserve the semantic clues present in multimodal data. The first attempts to fuse multimodal data applied fusion operators in the low-level feature space, at the cost of data representativeness. This drove later research to evolve simple operators into complex operations that became, in general, inseparable from the processing of multimodal semantic clues. In this paper, we investigate the application of early multimodal fusion operators in the mid-level feature space. Five operators (Concatenation, Sum, Gram, Average, and Maximum) were used to fuse mid-level multimodal video features. The fused data produced by each operator were then used as input for two video analysis tasks: Temporal Video Scene Segmentation and Video Classification. For each task, we compared the operators with related techniques that were designed for that task and rely on complex fusion operations. The efficacy of the operators was very close to that of those techniques, which is strong evidence that working in a more homogeneous feature space can reduce the known drawbacks of low-level fusion. In addition, the operators make data fusion separable, allowing researchers to focus on developing representations of semantic clues.
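To make the five operators concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that each modality has already been encoded as a mid-level feature vector (for example, a hypothetical bag-of-features histogram) and that the element-wise operators receive vectors of equal length. The abstract does not define the Gram operator; the version here, which flattens the matrix of pairwise inner products between the modality vectors, is only one plausible reading. All function and variable names are illustrative.

```python
from typing import List, Sequence

Vector = List[float]


def _check_same_length(features: Sequence[Vector]) -> None:
    """Element-wise operators only make sense for equal-length vectors."""
    if not features:
        raise ValueError("need at least one modality")
    if len({len(f) for f in features}) != 1:
        raise ValueError("all modality vectors must have the same length")


def fuse_concatenation(features: Sequence[Vector]) -> Vector:
    """Append the modality vectors one after another."""
    return [x for f in features for x in f]


def fuse_sum(features: Sequence[Vector]) -> Vector:
    """Element-wise sum across modalities."""
    _check_same_length(features)
    return [sum(column) for column in zip(*features)]


def fuse_average(features: Sequence[Vector]) -> Vector:
    """Element-wise mean across modalities."""
    _check_same_length(features)
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def fuse_maximum(features: Sequence[Vector]) -> Vector:
    """Element-wise maximum across modalities."""
    _check_same_length(features)
    return [max(column) for column in zip(*features)]


def fuse_gram(features: Sequence[Vector]) -> Vector:
    """Assumed reading of 'Gram': flattened matrix of pairwise inner
    products between modality vectors (row-major order)."""
    _check_same_length(features)
    return [
        sum(a * b for a, b in zip(f_i, f_j))
        for f_i in features
        for f_j in features
    ]


# Hypothetical mid-level histograms for one video segment.
visual_hist = [1.0, 0.0, 2.0]
aural_hist = [3.0, 4.0, 0.0]
modalities = [visual_hist, aural_hist]

fused = {
    "concatenation": fuse_concatenation(modalities),
    "sum": fuse_sum(modalities),
    "gram": fuse_gram(modalities),
    "average": fuse_average(modalities),
    "maximum": fuse_maximum(modalities),
}
```

Because every operator maps a list of modality vectors to a single vector, the fused output can be handed to any downstream segmentation or classification method without changing that method, which is the sense in which the abstract calls the fusion separable. Note that Sum, Average, and Maximum keep the input dimensionality, Concatenation grows with the number of modalities times the vector length, and this Gram variant grows with the square of the number of modalities.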

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Video segmentation
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Combination, fusion and federated search
    2. Retrieval tasks and goals
      1. Information extraction
  2. Information systems applications
    1. Multimedia information systems

Published in
WebMedia '20: Proceedings of the Brazilian Symposium on Multimedia and the Web
November 2020
364 pages
ISBN: 9781450381963
DOI: 10.1145/3428658
General Chair:
Carlos de Salles Soares Neto
UFMA
Copyright © 2020 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Early fusion
Fusion operators
Video analysis
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
WebMedia '20 Paper Acceptance Rate: 34 of 87 submissions, 39%
Overall Acceptance Rate: 270 of 873 submissions, 31%