ChatCam: Embracing LLMs for Contextual Chatting-to-Camera with Interest-Oriented Video Summarization

Authors:

Wei Dong

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 8, Issue 4

Article No.: 168, Pages 1 - 34

https://doi.org/10.1145/3699731

Published: 21 November 2024

Abstract

Cameras are ubiquitous in society, and users increasingly look to them to extract insights about the physical world. Current human-to-camera interaction methods, while advanced, do not yet support the intuitive, conversational interaction one would expect in human-to-human communication. To achieve a more natural interaction between humans and cameras, we propose a novel contextual chatting-to-camera paradigm. This paradigm allows users to interact with the camera in natural language, including expressing interests and asking questions. In response, the camera customizes tasks tailored to these interests and attempts to answer the questions asked. We design ChatCam, which embraces LLMs for contextual chatting-to-camera with interest-oriented video summarization. Using a novel prompting scheme with an actor-critic LLM approach, ChatCam can understand users' interests and translate them into tasks and objects. ChatCam can also customize the relevant models on the resource-constrained edge with the help of a multimodal large language model and deep reinforcement learning, while maintaining high accuracy. Results show that ChatCam achieves improvements of up to 43.9% in understanding user interests and 21.1% in model accuracy compared with state-of-the-art methods in multiple settings. Various examples and a user study further demonstrate the effectiveness of ChatCam in practice.

Index Terms

Human-centered computing → Ubiquitous and mobile computing → Ubiquitous and mobile computing systems and tools

Information & Contributors

Information

Published In

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies Volume 8, Issue 4

December 2024

1788 pages

EISSN:2474-9567

DOI:10.1145/3705705

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed
