Research Progress
Researchers Propose MMGT for High-Precision Co-Speech Gesture Video Generation from Audio and Image
Co-speech gesture video generation aims to synthesize realistic videos of talking heads and accompanying gestures from an audio track and a single static reference image. The task is challenging because motion amplitude differs markedly across body parts, and each part's movement correlates with the audio in a different way.
Recently, a research team from the Intelligent Detection and Equipment Department at the Shenyang Institute of Automation (SIA) of the Chinese Academy of Sciences proposed an innovative method for high-quality co-speech gesture video generation.
The research findings, titled "MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation", have been published in the international journal IEEE Transactions on Multimedia. The first author is WANG Siyuan, with co-authors LIU Jiawei, WANG Ting, JIN Yinye, DU Jinsong, and HAN Zhi.
Traditional audio-driven gesture video generation methods that rely solely on audio often struggle to capture large-scale gestures, leading to blurry and distorted results.

The core innovation of MMGT lies in its two-stage generation strategy. In the first stage, the research team designed a spatial mask-guided audio-to-pose generation network: the model extracts speech rhythm and emotional cues from the audio to generate pose sequences synchronized with the speech, while simultaneously identifying the regions of significant motion in each frame, referred to as the "motion mask." In the second stage, a motion mask-guided hierarchical audio attention module takes the generated pose sequence and motion mask, together with the input audio, as conditions and feeds them into a video diffusion model for refined rendering. As an end-to-end framework, MMGT generates high-quality, temporally consistent videos from audio and a single reference image, achieving natural gestures and precise audio–visual synchronization of facial expressions and lip movements.
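To make the two-stage structure concrete, the following is a minimal PyTorch sketch under stated assumptions: the module names (AudioToPoseNet, MaskGuidedDenoiser), tensor shapes, and the toy convolutional denoiser are illustrative, not the authors' released implementation; a real system would use a pretrained video diffusion U-Net with cross-attention in stage two.

```python
# A minimal, hypothetical sketch of the two-stage idea described above.
# Module names, shapes, and the toy denoiser are assumptions for illustration.
import torch
import torch.nn as nn


class AudioToPoseNet(nn.Module):
    """Stage 1 (sketch): map audio features to a pose sequence plus a soft motion mask."""

    def __init__(self, audio_dim=128, hidden=256, pose_dim=134, mask_hw=(32, 32)):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.pose_head = nn.Linear(hidden, pose_dim)             # keypoints per frame
        self.mask_head = nn.Linear(hidden, mask_hw[0] * mask_hw[1])
        self.mask_hw = mask_hw

    def forward(self, audio_feats):                              # (B, T, audio_dim)
        h, _ = self.encoder(audio_feats)                         # (B, T, hidden)
        poses = self.pose_head(h)                                # (B, T, pose_dim)
        masks = torch.sigmoid(self.mask_head(h))                 # soft per-frame motion masks
        B, T, _ = masks.shape
        return poses, masks.view(B, T, *self.mask_hw)            # (B, T, H, W)


class MaskGuidedDenoiser(nn.Module):
    """Stage 2 (toy stand-in): a denoiser conditioned on pose, motion mask, audio,
    and the reference image. A real system would be a video diffusion U-Net."""

    def __init__(self, pose_dim=134, audio_dim=128, width=64):
        super().__init__()
        # Per-frame channels: noisy frame (3) + reference image (3) + motion mask (1).
        self.inp = nn.Conv3d(7, width, kernel_size=3, padding=1)
        self.act = nn.SiLU()
        self.out = nn.Conv3d(width, 3, kernel_size=3, padding=1)
        self.cond_proj = nn.Linear(pose_dim + audio_dim, width)  # pose + audio conditioning

    def forward(self, noisy, ref, mask, pose, audio):
        # noisy: (B, T, 3, H, W), ref: (B, 3, H, W), mask: (B, T, H, W)
        B, T, _, H, W = noisy.shape
        ref = ref.unsqueeze(1).expand(B, T, 3, H, W)             # broadcast image to all frames
        x = torch.cat([noisy, ref, mask.unsqueeze(2)], dim=2)    # (B, T, 7, H, W)
        h = self.inp(x.permute(0, 2, 1, 3, 4))                   # (B, width, T, H, W)
        cond = self.cond_proj(torch.cat([pose, audio], dim=-1))  # (B, T, width)
        h = h + cond.permute(0, 2, 1)[..., None, None]           # add per-frame conditioning
        return self.out(self.act(h)).permute(0, 2, 1, 3, 4)      # noise pred (B, T, 3, H, W)


if __name__ == "__main__":
    B, T, H = 2, 16, 32
    audio = torch.randn(B, T, 128)                               # e.g. precomputed audio features
    ref_image = torch.randn(B, 3, H, H)                          # the static reference image
    poses, masks = AudioToPoseNet()(audio)                       # stage 1
    noise_pred = MaskGuidedDenoiser()(torch.randn(B, T, 3, H, H),
                                      ref_image, masks, poses, audio)
    print(noise_pred.shape)                                      # torch.Size([2, 16, 3, 32, 32])
```

In an actual diffusion pipeline, the motion mask would additionally reweight the attention or the training loss so that high-motion regions such as the hands and mouth receive finer detail, which is the intuition behind the hierarchical audio attention module.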
The research team conducted extensive experiments on the public PATS dataset. The results demonstrate that MMGT outperforms state-of-the-art audio-driven video generation methods across multiple metrics, including video quality, lip-sync accuracy, and gesture realism. Notably, MMGT's Fréchet Video Distance (FVD), a metric for which lower is better, is 6% below that of the previous best method, indicating improved video quality and temporal coherence.
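For context, FVD compares Gaussians fitted to features of real and generated videos (the features are conventionally extracted with a pretrained I3D action-recognition network); a lower score means the generated distribution lies closer to the real one:

```latex
\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where (μ_r, Σ_r) and (μ_g, Σ_g) are the feature means and covariances of real and generated videos, respectively.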
The MMGT model, characterized by its efficiency, accuracy, and practical usability, shows strong potential for applications in virtual digital human generation and human–computer interaction, including virtual avatars, online education, video conferencing, and gaming. It opens new opportunities for AI-driven content generation in the metaverse and multimedia applications.