
Cross-Modality Graph-based Language and Sensor Data Co-Learning of Human-Mobility Interaction

Published: 27 September 2023

Abstract

Learning the human--mobility interaction (HMI) in interactive scenes (e.g., how a vehicle turns at an intersection in response to traffic lights and other oncoming vehicles) can enhance the safety, efficiency, and resilience of smart mobility systems (e.g., autonomous vehicles) and many other ubiquitous computing applications. Toward ubiquitous and understandable HMI learning, this paper considers both the "spoken language" (e.g., human textual annotations) and the "unspoken language" (e.g., visual and sensor-based behavioral mobility information related to the HMI scenes) as the information modalities drawn from real-world HMI scenarios. We aim to extract the important but possibly implicit HMI concepts (as named entities) from the textual annotations (provided by human annotators) through a novel human language and sensor data co-learning design.
To this end, we propose CG-HMI, a novel Cross-modality Graph fusion approach that extracts important Human-Mobility Interaction concepts by co-learning from textual annotations as well as visual and behavioral sensor data. To fuse the unspoken and spoken "languages", we have designed a unified representation called the human--mobility interaction graph (HMIG) for each modality related to the HMI scenes, i.e., textual annotations, visual video frames, and behavioral sensor time-series (e.g., from on-board or smartphone inertial measurement units). The nodes of the HMIG in these modalities correspond to the textual words (tokenized for ease of processing) related to HMI concepts, the detected traffic participant/environment categories, and the vehicle maneuver behavior types determined from the behavioral sensor time-series. To extract the inter- and intra-modality semantic correspondences and interactions in the HMIGs, we have designed a novel graph interaction fusion approach with differentiable pooling-based graph attention. The resulting graph embeddings are then processed to identify and retrieve the HMI concepts within the annotations, which can benefit downstream human-computer interaction and ubiquitous computing applications. We have developed and implemented CG-HMI into a system prototype, and performed extensive studies on three real-world HMI datasets (two on car driving and the third on e-scooter riding). We have corroborated the excellent performance (on average 13.11% higher than the other baselines in precision, recall, and F1 measure) and effectiveness of CG-HMI in recognizing and extracting the important HMI concepts through cross-modality learning. Our CG-HMI studies also provide real-world implications (e.g., on road safety and driving behaviors) about the interactions between drivers and other traffic participants.
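To make the pipeline above concrete, the following is a minimal, self-contained PyTorch sketch (not the authors' released implementation) of the two building blocks the abstract describes: a single-head graph attention layer over the nodes of a modality-specific HMIG, and a DiffPool-style differentiable pooling readout that coarsens each graph into a fixed-size embedding before a simple concatenation-based cross-modality head. All layer sizes, the number of clusters, and the concatenation-based tagging head are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DenseGATLayer(nn.Module):
        """Single-head graph attention over a dense adjacency matrix.
        adj is assumed to include self-loops so every node attends to at least itself."""
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.proj = nn.Linear(in_dim, out_dim, bias=False)
            self.attn_src = nn.Linear(out_dim, 1, bias=False)
            self.attn_dst = nn.Linear(out_dim, 1, bias=False)

        def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # x: (N, in_dim) node features; adj: (N, N) 0/1 adjacency
            h = self.proj(x)                                   # (N, out_dim)
            scores = self.attn_src(h) + self.attn_dst(h).T     # (N, N) pairwise attention logits
            scores = F.leaky_relu(scores, negative_slope=0.2)
            scores = scores.masked_fill(adj == 0, float("-inf"))
            alpha = torch.softmax(scores, dim=-1)              # attention over neighbors
            return F.elu(alpha @ h)                            # (N, out_dim) updated node features

    class DiffPoolReadout(nn.Module):
        """DiffPool-style soft cluster assignment yielding a fixed-size graph embedding."""
        def __init__(self, in_dim: int, hidden: int, num_clusters: int):
            super().__init__()
            self.embed_gat = DenseGATLayer(in_dim, hidden)          # node embeddings
            self.assign_gat = DenseGATLayer(in_dim, num_clusters)   # soft cluster assignments

        def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            z = self.embed_gat(x, adj)                              # (N, hidden)
            s = torch.softmax(self.assign_gat(x, adj), dim=-1)      # (N, K) soft assignment
            pooled = s.T @ z                                        # (K, hidden) cluster embeddings
            return pooled.flatten()                                 # fixed-size graph vector

    class CrossModalityFusion(nn.Module):
        """Pools the textual, visual, and behavioral HMIGs and fuses them for concept tagging."""
        def __init__(self, in_dims, hidden=64, clusters=4, num_tags=5):
            super().__init__()
            self.readouts = nn.ModuleList(DiffPoolReadout(d, hidden, clusters) for d in in_dims)
            self.tagger = nn.Linear(len(in_dims) * hidden * clusters, num_tags)  # illustrative head

        def forward(self, graphs):
            # graphs: list of (node_features, adjacency) pairs, one per modality
            fused = torch.cat([r(x, a) for r, (x, a) in zip(self.readouts, graphs)])
            return self.tagger(fused)                               # logits over HMI concept tags

In such a sketch, the textual HMIG nodes would carry token embeddings, the visual HMIG nodes the detected traffic participant/environment categories, and the behavioral HMIG nodes the recognized maneuver types; the code only fixes the tensor interfaces between these components, not the full CG-HMI training procedure.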


Cited By

  • (2024) TS2ACT. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(4), 1-22. DOI: 10.1145/3631445. Online publication date: 12 January 2024.



    Published In

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies  Volume 7, Issue 3
    September 2023
    1734 pages
    EISSN: 2474-9567
    DOI: 10.1145/3626192
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 September 2023
    Published in IMWUT Volume 7, Issue 3


    Author Tags

    1. Human-mobility interaction
    2. cross-modality graph interaction fusion
    3. human-mobility interaction concept extraction
    4. language and sensor data co-learning
    5. named entity recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Article Metrics

    • Downloads (last 12 months): 187
    • Downloads (last 6 weeks): 9

    Reflects downloads up to 17 January 2025.
