DOI: 10.1145/3495243.3560519

Cosmo: contrastive fusion learning with small data for multimodal human activity recognition

Published: 14 October 2022

Abstract

Human activity recognition (HAR) is a key enabling technology for a wide range of emerging applications. Although multimodal sensing systems are essential for capturing complex and dynamic human activities in real-world settings, they bring several new challenges including limited labeled multimodal data. In this paper, we propose Cosmo, a new system for contrastive fusion learning with small data in multimodal HAR applications. Cosmo features a novel two-stage training strategy that leverages both unlabeled data on the cloud and limited labeled data on the edge. By integrating novel fusion-based contrastive learning and quality-guided attention mechanisms, Cosmo can effectively extract both consistent and complementary information across different modalities for efficient fusion. Our evaluation on a cloud-edge testbed using two public datasets and a new multimodal HAR dataset shows that Cosmo delivers significant improvement over state-of-the-art baselines in both recognition accuracy and convergence delay.
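The abstract names two techniques (fusion-based contrastive learning and quality-guided attention) without specifying them. The sketch below is a minimal illustration only, assuming an InfoNCE-style objective between a quality-weighted fused embedding and each per-modality embedding; the function names and the norm-based "quality" proxy are hypothetical stand-ins, not Cosmo's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-9)

def quality_attention(features):
    """Hypothetical quality proxy: softmax over each modality's mean feature norm.

    Cosmo learns its attention weights; this fixed heuristic only illustrates
    how per-modality weights could gate the fusion.
    """
    scores = np.array([np.linalg.norm(f, axis=-1).mean() for f in features])
    exp = np.exp(scores - scores.max())          # numerically stable softmax
    return exp / exp.sum()

def fused_contrastive_loss(features, temperature=0.1):
    """InfoNCE between a quality-weighted fused view and each modality view.

    features: list of (batch, dim) arrays, one per modality; row i of every
    array comes from the same unlabeled sample, so matched pairs sit on the
    diagonal of each similarity matrix.
    """
    w = quality_attention(features)
    fused = l2_normalize(sum(wi * f for wi, f in zip(w, features)))
    total = 0.0
    for f in features:
        z = l2_normalize(f)
        logits = fused @ z.T / temperature                     # (batch, batch)
        logits = logits - logits.max(axis=1, keepdims=True)    # stability shift
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -np.mean(np.diag(log_prob))   # pull matched pairs together
    return total / len(features)

# Toy usage standing in for per-modality encoder outputs
rng = np.random.default_rng(0)
feats = [rng.normal(size=(8, 16)) for _ in range(3)]  # 3 modalities, batch of 8
print(fused_contrastive_loss(feats))
```

Contrasting the fused embedding against every individual modality is one way to encourage the fused representation to retain both the consistent and the complementary information the abstract describes; the paper's actual loss and attention design should be taken from the full text.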




Published In

MobiCom '22: Proceedings of the 28th Annual International Conference on Mobile Computing And Networking
October 2022
932 pages
ISBN:9781450391818
DOI:10.1145/3495243

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. contrastive learning
  2. heterogeneous multimodal fusion
  3. human activity recognition

Qualifiers

  • Research-article

Funding Sources

  • Shenzhen Institute of Artificial Intelligence and Robotics for Society
  • GRF Grants of Research Grants Council (RGC) of Hong Kong
  • Guangdong Basic and Applied Basic Research Foundation
  • Shenzhen Science and Technology Program
  • Alzheimer's Drug Discovery Foundation

Conference

ACM MobiCom '22

Acceptance Rates

Overall Acceptance Rate 440 of 2,972 submissions, 15%


Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 1,052
  • Downloads (Last 6 weeks): 82
Reflects downloads up to 13 Feb 2025

Citations

Cited By

  • (2025) Real-Time Continuous Activity Recognition With a Commercial mmWave Radar. IEEE Transactions on Mobile Computing 24, 3 (Mar 2025), 1684--1698. DOI: 10.1109/TMC.2024.3483813
  • (2025) Temporal Contrastive Learning for Sensor-Based Human Activity Recognition: A Self-Supervised Approach. IEEE Sensors Journal 25, 1 (Jan 2025), 1839--1850. DOI: 10.1109/JSEN.2024.3491933
  • (2024) BodyFlow: An Open-Source Library for Multimodal Human Activity Recognition. Sensors 24, 20 (Oct 2024), 6729. DOI: 10.3390/s24206729
  • (2024) Robust Human Activity Recognition for Intelligent Transportation Systems Using Smartphone Sensors: A Position-Independent Approach. Applied Sciences 14, 22 (Nov 2024), 10461. DOI: 10.3390/app142210461
  • (2024) MultimodalHD: Federated Learning Over Heterogeneous Sensor Modalities using Hyperdimensional Computing. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1--6. DOI: 10.23919/DATE58400.2024.10546794
  • (2024) SemiCMT: Contrastive Cross-Modal Knowledge Transfer for IoT Sensing with Semi-Paired Multi-Modal Signals. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 4 (Nov 2024), 1--30. DOI: 10.1145/3699779
  • (2024) Self-supervised Learning for Accelerometer-based Human Activity Recognition: A Survey. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 4 (Nov 2024), 1--42. DOI: 10.1145/3699767
  • (2024) ContrastSense: Domain-invariant Contrastive Learning for In-the-Wild Wearable Sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 4 (Nov 2024), 1--32. DOI: 10.1145/3699744
  • (2024) BP3: Improving Cuff-less Blood Pressure Monitoring Performance by Fusing mmWave Pulse Wave Sensing and Physiological Factors. Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, 730--743. DOI: 10.1145/3666025.3699370
  • (2024) Towards Efficient Heterogeneous Multi-Modal Federated Learning with Hierarchical Knowledge Disentanglement. Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, 592--605. DOI: 10.1145/3666025.3699360
