
Clip Fusion with Bi-level Optimization for Human Mesh Reconstruction from Monocular Videos

Authors:

Peng Wu,

Xiankai Lu,

Jianbing Shen,

Yilong Yin

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 105 - 115

https://doi.org/10.1145/3581783.3611978

Published: 27 October 2023

Abstract

Human mesh reconstruction (HMR) from monocular video is a key step in many mixed reality and robotics applications. Although existing methods show promising results by capturing temporal information across frames, they predict the human mesh with implicit temporal learning modules in a sequence-to-frame manner. To mine more temporal information from the video, we present a bi-level clip inference network for HMR, which explicitly leverages both local motion and global context for dense 3D reconstruction. Specifically, we propose a novel bi-level temporal fusion strategy that takes both neighboring and long-range relations into consideration. In addition, unlike traditional frame-wise operation, we investigate an alternative perspective that treats video-based HMR as clip-wise inference. We evaluate the proposed method quantitatively and qualitatively on multiple datasets (3DPW, Human3.6M, and MPI-INF-3DHP), demonstrating a significant improvement over existing methods (in terms of PA-MPJPE, ACC-Error, etc.). Furthermore, we extend the proposed method to the more challenging multiple-shot HMR task to demonstrate its generalizability. Visual demos are available at https://github.com/bicf0/bicf_demo.
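The abstract reports results in PA-MPJPE and ACC-Error. For readers unfamiliar with these metrics, the following is a minimal NumPy sketch of how they are commonly defined in video-based HMR evaluation: PA-MPJPE aligns each predicted frame to the ground truth with a least-squares similarity (Procrustes) transform before averaging per-joint Euclidean errors, and the acceleration error compares second finite differences of predicted and ground-truth joint trajectories. The function names (`procrustes_align`, `pa_mpjpe`, `accel_error`) are hypothetical, and the joint set, units, and exact evaluation protocol used in the paper are assumptions not stated here; this is not the authors' evaluation code.

```python
import numpy as np

# Hypothetical helper names; a sketch of commonly used metric definitions,
# not the paper's evaluation code. Inputs are joint positions in the same units
# (e.g., millimetres): single frames of shape (J, 3), sequences of shape (T, J, 3).


def procrustes_align(pred, gt):
    """Align pred (J, 3) to gt (J, 3) with the least-squares similarity
    transform (rotation, uniform scale, translation)."""
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    x = pred - mu_pred  # centered prediction
    y = gt - mu_gt      # centered ground truth
    # SVD of the 3x3 cross-covariance gives the optimal rotation.
    u, s, vt = np.linalg.svd(x.T @ y)
    # Flip the last axis if needed so the result is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    corr = np.diag([1.0, 1.0, d])
    rot = u @ corr @ vt  # row-vector convention: aligned = x @ rot
    scale = np.trace(np.diag(s) @ corr) / np.sum(x ** 2)
    return scale * x @ rot + mu_gt


def pa_mpjpe(pred, gt):
    """Mean per-joint position error after per-frame Procrustes alignment."""
    aligned = np.stack([procrustes_align(p, g) for p, g in zip(pred, gt)])
    return float(np.linalg.norm(aligned - gt, axis=-1).mean())


def accel_error(pred, gt):
    """Mean L2 distance between predicted and ground-truth joint accelerations,
    each estimated with the second finite difference x[t-1] - 2*x[t] + x[t+1]."""
    acc_pred = pred[:-2] - 2.0 * pred[1:-1] + pred[2:]
    acc_gt = gt[:-2] - 2.0 * gt[1:-1] + gt[2:]
    return float(np.linalg.norm(acc_pred - acc_gt, axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(8, 14, 3))  # 8 frames, 14 joints (toy data)
    theta = 0.3
    rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                      [np.sin(theta), np.cos(theta), 0.0],
                      [0.0, 0.0, 1.0]])
    # A rotated, scaled and shifted copy of the ground truth.
    pred = 2.0 * gt @ rot_z + np.array([0.5, -1.0, 3.0])
    print(pa_mpjpe(pred, gt))     # ~0 up to floating-point error: alignment undoes the transform
    print(accel_error(pred, gt))  # > 0: scaling and rotation change the accelerations
```

Because the alignment is applied per frame, PA-MPJPE ignores global rotation, scale, and translation errors and measures only the articulated pose, whereas the acceleration error looks across frames and rewards temporally smooth predictions, which is where explicit temporal modeling would be expected to help.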

Supplemental Material

MP4 File

This is the presentation video for the paper "Clip Fusion with Bi-level Optimization for Human Mesh Reconstruction from Monocular Videos", accepted at the ACM International Conference on Multimedia. The presentation has six parts. First, we formulate human mesh reconstruction and introduce related work. Next, we discuss the main challenges of reconstructing human meshes from monocular videos. Then, we introduce our idea for addressing these issues. After that, we present the details of the proposed method and demonstrate its effectiveness. Finally, we summarize the main content of the paper.
