E2-VOR: An End-to-End En/Decoder Architecture for Efficient Video Object Recognition

Published: 10 December 2022

Abstract

High-resolution video object recognition (VOR) is evolving rapidly, but it is very compute-intensive because it leverages deep neural networks (DNNs) for better accuracy. Although many works have been proposed for speedup, they mostly focus on DNN algorithms and hardware acceleration on the edge side. We observe that most video streams must be compressed before going online, so the encoder already holds all of the video information. Moreover, since the cloud has abundant computing power for sophisticated VOR algorithms, we propose to make a one-shot effort for a modified VOR algorithm at the encoding stage in the cloud and to integrate full VOR regeneration into a slightly extended decoder on the device. The scheme enables lightweight VOR with server-class accuracy by simply leveraging the classic, economical video decoder that is universal to any mobile device. Meanwhile, the scheme saves massive computing power by not repetitively processing the same video on different user devices, which makes it extremely sustainable for green computing across the whole network.
We propose E2-VOR, an end-to-end encoder and decoder architecture for efficient VOR. We carefully design the scheme to have minimal impact on the transmitted video bitstream. In the cloud, the VOR-extended video encoder tracks objects on a macro-block basis and packs the recognition information into the video stream for increased VOR accuracy and a fast regeneration process. On the edge device, we extend the traditional video decoder with a small piece of dedicated hardware to enable efficient VOR regeneration. Our experiments show that E2-VOR achieves a 5.0× performance improvement with less than 0.4% VOR accuracy loss compared to the state-of-the-art FAVOS scheme. On average, E2-VOR runs at over 54 frames per second (FPS) on 480P videos on an edge device.
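
To make the encoder/decoder division of labor concrete, below is a minimal sketch of the two halves of such a scheme. It assumes a per-frame side payload carrying one object ID per 16×16 macro-block and a decoder that propagates labels using the motion vectors it already decodes; the payload layout and the function names (pack_vor_payload, regenerate_masks) are illustrative assumptions, not the paper's actual bitstream syntax.

```python
from typing import Optional

import numpy as np

MB = 16  # macro-block size in pixels (16x16, as in H.264/HEVC)


def pack_vor_payload(object_ids: np.ndarray) -> bytes:
    """Encoder side (cloud, one-shot): serialize one object ID per macro-block.

    `object_ids` has shape (H // MB, W // MB); 0 denotes background.
    A raw uint8 grid stands in for the side information packed into the
    bitstream -- the real E2-VOR payload layout is not reproduced here.
    """
    return object_ids.astype(np.uint8).tobytes()


def regenerate_masks(prev_ids: np.ndarray,
                     motion_vectors: np.ndarray,
                     payload: Optional[bytes],
                     grid_hw: tuple) -> np.ndarray:
    """Decoder side (edge): rebuild the per-macro-block label map for one frame.

    For key frames the payload carries fresh labels from the cloud; for
    inter frames the decoder reuses the motion vectors it has already
    decoded, copying each block's label from the block its MV points to.
    """
    h, w = grid_hw
    if payload is not None:  # key frame: labels shipped in the stream
        return np.frombuffer(payload, dtype=np.uint8).reshape(h, w).copy()
    ids = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            dy, dx = motion_vectors[y, x]       # MV in macro-block units
            sy = min(max(y + dy, 0), h - 1)     # clamp source to frame
            sx = min(max(x + dx, 0), w - 1)
            ids[y, x] = prev_ids[sy, sx]        # propagate the label
    return ids
```

Upsampling the recovered grid by a factor of MB yields a pixel-level mask. Because the propagation step touches only data the decoder produces anyway, the edge-side cost stays tiny compared with running a full DNN per frame, which is consistent with the efficiency argument above.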


Cited By

  • (2023) Real-Time Video Recognition via Decoder-Assisted Neural Network Acceleration Framework. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 7 (2023), 2238–2251. https://doi.org/10.1109/TCAD.2022.3217667. Online publication date: 1-Jul-2023.

    Published In

    ACM Transactions on Design Automation of Electronic Systems, Volume 28, Issue 1
    January 2023
    321 pages
    ISSN: 1084-4309
    EISSN: 1557-7309
    DOI: 10.1145/3573313

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 December 2022
    Online AM: 17 June 2022
    Accepted: 03 June 2022
    Revised: 28 April 2022
    Received: 28 May 2021
    Published in TODAES Volume 28, Issue 1

    Author Tags

    1. Video object recognition
    2. neural network
    3. accelerator
    4. end-to-end

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Key Research and Development Program of China
    • National Natural Science Foundation of China
