
Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

Published: 22 May 2018

Abstract

While convolutional neural networks (CNNs) have excelled at object recognition, the greater spatial variability in scene images typically renders standard full-image CNN features suboptimal for scene classification. In this article, we investigate a framework that allows greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities (RGB, HHA, and surface normals) extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity, i.e., that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability; and (2) modal nonsparsity, i.e., that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUN RGB-D dataset and NYU Depth Dataset V2. We further apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other well-structured multimodal features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts and intermodal nonsparsity to let informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking datasets show that our feature fusion is also effective for action recognition, achieving competitive performance compared with the state of the art.
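The two regularizers named in the abstract (group lasso over GMM components for component sparsity, exclusive group lasso across modalities for modal nonsparsity) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the weight layout (one row per GMM component, one column per modality) and the function and parameter names are assumptions, and the paper's full objective also includes a regression loss term that is omitted here.

```python
import numpy as np

def structured_penalty(W, lam_comp=0.1, lam_modal=0.1):
    """Sketch of the two structured-sparsity terms on a weight matrix W
    of shape (n_components, n_modalities).

    - Group lasso over rows (GMM components): sum_k ||w_k||_2,
      which encourages whole components to be zeroed out
      (component sparsity).
    - Exclusive group lasso within each row: sum_k (sum_m |w_km|)^2,
      the squared l1 norm across modalities, which penalizes
      concentrating weight in a single modality and so encourages
      features from all modalities to coexist (modal nonsparsity).
    """
    group_lasso = np.sum(np.linalg.norm(W, axis=1))     # sum_k ||w_k||_2
    exclusive = np.sum(np.sum(np.abs(W), axis=1) ** 2)  # sum_k (sum_m |w_km|)^2
    return lam_comp * group_lasso + lam_modal * exclusive
```

Note the opposing effects: the row-wise l2 term is non-differentiable at whole-row zeros and so drives entire components to zero, while the squared l1 term within a row grows faster when weight piles onto one modality, favoring a spread across modalities.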




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 14, Issue 2s
    April 2018
    287 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3210485

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 May 2018
    Accepted: 01 May 2017
    Revised: 01 April 2017
    Received: 01 October 2016
    Published in TOMM Volume 14, Issue 2s


    Author Tags

    1. Feature fusion
    2. RGB-D scene classification
    3. action recognition
    4. group sparsity
    5. multimodal analytics

    Qualifiers

    • Research-article
    • Research
    • Refereed


Cited By

    • (2024) A comprehensive construction of deep neural network-based encoder–decoder framework for automatic image captioning systems. IET Image Processing 18(14), 4778–4798. DOI: 10.1049/ipr2.13287. Online publication date: 25 Nov 2024.
    • (2023) Integration of the latent variable knowledge into deep image captioning with Bayesian modeling. IET Image Processing 17(7), 2256–2271. DOI: 10.1049/ipr2.12790. Online publication date: 25 Mar 2023.
    • (2020) FusAtNet: Dual Attention based SpectroSpatial Multimodal Fusion Network for Hyperspectral and LiDAR Classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 416–425. DOI: 10.1109/CVPRW50498.2020.00054. Online publication date: Jun 2020.
    • (2020) An Efficient RGB-D Scene Recognition Method Based on Multi-Information Fusion. IEEE Access 8, 212351–212360. DOI: 10.1109/ACCESS.2020.3039873. Online publication date: 2020.
    • (2019) Image classification and captioning model considering a CAM-based disagreement loss. ETRI Journal. DOI: 10.4218/etrij.2018-0621. Online publication date: 25 Jul 2019.
    • (2019) An End-to-End Attention-Based Neural Model for Complementary Clothing Matching. ACM Transactions on Multimedia Computing, Communications, and Applications 15(4), 1–16. DOI: 10.1145/3368071. Online publication date: 16 Dec 2019.
    • (2019) Deep learning for Coating Condition Assessment with Active perception. In Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference, 75–80. DOI: 10.1145/3341069.3342966. Online publication date: 22 Jun 2019.
    • (2019) Multi-source Multi-level Attention Networks for Visual Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 15(2s), 1–20. DOI: 10.1145/3316767. Online publication date: 19 Jul 2019.
    • (2019) JPEG image tampering localization based on normalized gray level co-occurrence matrix. Multimedia Tools and Applications 78(8), 9895–9918. DOI: 10.1007/s11042-018-6611-3. Online publication date: 25 May 2019.
    • (2018) LAWN. In Proceedings of the 55th Annual Design Automation Conference, 1–6. DOI: 10.1145/3195970.3196066. Online publication date: 24 Jun 2018.
