DOI: 10.1145/3428658.3431079
research-article

Evaluating Early Fusion Operators at Mid-Level Feature Space

Published: 30 November 2020

ABSTRACT

Early fusion techniques have been proposed for video analysis tasks as a way to improve efficacy by generating compact data models that preserve the semantic clues present in multimodal data. The first attempts to fuse multimodal data applied fusion operators at the low-level feature space, losing data representativeness. This drove later research to evolve simple operators into complex operations, which in general became inseparable from the processing of multimodal semantic clues. In this paper, we investigate the application of early multimodal fusion operators at the mid-level feature space. Five operators (Concatenation, Sum, Gram, Average and Maximum) were employed to fuse mid-level multimodal video features. The fused data derived from each operator were then used as input for two video analysis tasks: Temporal Video Scene Segmentation and Video Classification. For each task, we performed a comparative analysis among the operators and related techniques designed for that task using complex fusion operations. The efficacy achieved by the operators was very close to that achieved by those techniques, providing strong evidence that working on a more homogeneous feature space can reduce known drawbacks of low-level fusion. In addition, operators make data fusion separable, allowing researchers to keep their focus on developing representations of semantic clues.
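
The five operators named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not state: each modality is taken to be already encoded as a fixed-length mid-level vector (for example a normalized bag-of-features histogram), the element-wise operators are taken to require vectors of equal length, and the Gram operator is read as the flattened Gram matrix of pairwise inner products between the modality vectors. All function and variable names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def _check_same_length(features: Sequence[Sequence[float]]) -> None:
    # Assumption: element-wise operators need modality vectors of equal size.
    if not features:
        raise ValueError("need at least one modality vector")
    if len({len(v) for v in features}) != 1:
        raise ValueError("element-wise operators need equal-length vectors")


def fuse_concatenation(features: Sequence[Sequence[float]]) -> Vector:
    # Place the modality vectors one after another.
    return [x for v in features for x in v]


def fuse_sum(features: Sequence[Sequence[float]]) -> Vector:
    # Element-wise sum across modalities.
    _check_same_length(features)
    return [sum(column) for column in zip(*features)]


def fuse_average(features: Sequence[Sequence[float]]) -> Vector:
    # Element-wise arithmetic mean across modalities.
    _check_same_length(features)
    return [sum(column) / len(features) for column in zip(*features)]


def fuse_maximum(features: Sequence[Sequence[float]]) -> Vector:
    # Element-wise maximum across modalities.
    _check_same_length(features)
    return [max(column) for column in zip(*features)]


def fuse_gram(features: Sequence[Sequence[float]]) -> Vector:
    # Assumed reading of "Gram": G[i][j] = <v_i, v_j>, flattened row by row.
    _check_same_length(features)
    return [
        sum(a * b for a, b in zip(u, v))
        for u in features
        for v in features
    ]


# Hypothetical mid-level descriptors for one video shot (illustrative values).
visual = [0.5, 0.0, 0.25, 0.25]
aural = [0.25, 0.25, 0.5, 0.0]

fused = {
    "concatenation": fuse_concatenation([visual, aural]),
    "sum": fuse_sum([visual, aural]),
    "average": fuse_average([visual, aural]),
    "maximum": fuse_maximum([visual, aural]),
    "gram": fuse_gram([visual, aural]),
}
```

Under these assumptions each operator is a pure function of the per-modality vectors, so the fusion step can be swapped without touching feature extraction or the downstream segmentation or classification model, which is one way to read the separability the abstract describes.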

Published in

WebMedia '20: Proceedings of the Brazilian Symposium on Multimedia and the Web
November 2020
364 pages
ISBN: 9781450381963
DOI: 10.1145/3428658

Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

WebMedia '20 paper acceptance rate: 34 of 87 submissions, 39%. Overall acceptance rate: 270 of 873 submissions, 31%.
