HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

Abstract

A girl plays "house" with plush toys in the living room.
A woman takes a selfie in a busy city. She holds a smartphone in one hand and makes a peace sign with the other. The background is a bustling street scene with various signs and pedestrians.
A dog is chasing a cat in the park.
A woman is holding a paintbrush, drawing a picture of a cat on her home blackboard.
A man is presenting the chips in his hand beside the swimming pool.
On the modern city streets, a man asks a woman for directions, but she doesn't understand what he's saying.
A woman wearing Hanfu is reading a book in the study room.
A person is wearing a burgundy suit with a black shirt underneath. They are wearing a yellow hat with panda ears and are smiling at the camera. Against the gray wall background, it seems that they are about to perform, or are in the middle of, a magic show.
Two people wearing light yellow dresses are attentively tending to a gray-white puppy in front of them.

Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation.

Method


We propose HunyuanCustom, a multi-modal, conditional, and controllable generation model centered on subject consistency, built upon the HunyuanVideo generation framework. It enables the generation of subject-consistent videos conditioned on text, image, audio, and video inputs. Specifically, HunyuanCustom introduces an image-text fusion module based on LLaVA to facilitate interaction between images and text, allowing identity information from images to be effectively integrated into textual descriptions. Additionally, an image ID enhancement module is proposed, which concatenates image information along the temporal axis and leverages the video model's efficient temporal modeling ability to reinforce subject identity throughout the video. To support conditional injection of audio and video, HunyuanCustom designs a distinct injection mechanism for each modality, each effectively disentangled from the identity-condition module. HunyuanCustom thus achieves decoupled control over image, audio, and video conditions, demonstrating strong potential for subject-centric multi-modal video generation.
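The image ID enhancement described above can be sketched in a few lines: the reference image's latent is treated as an extra frame and concatenated along the temporal axis, so the video model's temporal attention can propagate identity features to every generated frame. This is a minimal illustration, not the paper's implementation; the function name and latent layout (B, C, T, H, W) are assumptions.

```python
import torch

def inject_identity_latent(video_latents: torch.Tensor,
                           image_latent: torch.Tensor) -> torch.Tensor:
    """Prepend the reference-image latent along the temporal axis.

    video_latents: (B, C, T, H, W) latent of the (noisy) video
    image_latent:  (B, C, H, W)    latent of the identity image

    Hypothetical helper illustrating temporal concatenation; the actual
    HunyuanCustom module may differ in layout and conditioning details.
    """
    # Treat the image as a one-frame clip and concatenate on the time axis,
    # letting the DiT's temporal attention carry identity across frames.
    image_as_frame = image_latent.unsqueeze(2)            # (B, C, 1, H, W)
    return torch.cat([image_as_frame, video_latents], 2)  # (B, C, T+1, H, W)
```

The extra frame is typically dropped (or ignored) when decoding the final video, so only the identity signal it contributes survives.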

Single-Subject Video Customization

Consistency Story Generation
Comparison with State-of-the-Art Methods

Prompt: A man is listening to music and cooking snail noodles in the kitchen.

Prompt: In a music rehearsal room, a woman holds a violin, concentrating on playing; the melodious sound reverberates within.

Prompt: A girl plays "house" with plush toys in the living room.

Prompt: A dog is chasing a cat in the park.

Prompt: Family members sit around and share the dumplings on the porcelain plate with dipping sauce, full of reunion happiness.

Prompt: In an outdoor cafe, a woman wearing a dress is enjoying a delicious dessert. She sits in a comfortable chair with a cup of coffee in her hand, her face full of happiness.

Multi-Subject Video Customization

Comparison with State-of-the-Art Methods

Prompt: A woman is holding a paintbrush, drawing a picture of a cat on her home blackboard.

Prompt: A woman rides a tiger, wandering through the fields.

Prompt: A woman is boxing with a panda, and they are at a stalemate.

Prompt: A man is presenting the chips in his hand beside the swimming pool.

Prompt: A man is drinking Moutai in the pavilion.

Prompt: A man is sitting in a spacious and bright living room, smiling and greeting a cute penguin. The penguin nods back at him in a friendly manner, as if responding to his greeting.

Prompt: A woman wearing Hanfu is reading a book in the study room.

Prompt: A man wearing Hanfu walks across an ancient stone bridge holding an umbrella, raindrops tapping against it.

Audio-Driven Video Customization

HunyuanCustom further enables audio-driven human customization, in which the character speaks the given audio within a text-described scene, allowing for more flexible and controllable audio-driven human animation. Please turn on the sound to watch the videos.
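The abstract describes the AudioNet injection as hierarchical alignment via spatial cross-attention. A simplified stand-in for that idea: video tokens act as queries and audio embeddings as keys/values, with a residual add so the injected signal does not overwrite identity features. The class name, dimensions, and single-level (rather than hierarchical) design here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Inject audio features into video tokens via cross-attention.

    Illustrative sketch of audio conditioning, not the paper's AudioNet:
    video tokens query the audio sequence, and the attended audio features
    are added back residually.
    """
    def __init__(self, dim: int = 64, audio_dim: int = 32, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)

    def forward(self, video_tokens: torch.Tensor,
                audio_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_video, dim); audio_tokens: (B, N_audio, audio_dim)
        out, _ = self.attn(video_tokens, audio_tokens, audio_tokens)
        # Residual injection keeps the identity-conditioned features intact.
        return video_tokens + out
```

In the full model this injection would be applied at multiple levels of the network to realize the hierarchical alignment the paper describes.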

Video-Driven Video Customization

Building on subject customization, HunyuanCustom also supports using a video as input to replace or add specified objects. Specifically, given a source video with its actions and environment, together with an image specifying the target identity, HunyuanCustom can replace the designated object in the video with the subject from the image.
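Per the abstract, the conditional video is first compressed into the latent space and then injected through a patchify-based feature-alignment network. A minimal sketch of that pathway, with all class names, dimensions, and the add-based fusion being assumptions for illustration:

```python
import torch
import torch.nn as nn

class VideoConditionInjector(nn.Module):
    """Patchify a latent-compressed conditional video and align its tokens
    to the generator's feature space before additive injection.

    Hypothetical sketch of the video-driven injection path; the real
    feature-alignment network is likely deeper.
    """
    def __init__(self, latent_ch: int = 16, dim: int = 64, patch: int = 2):
        super().__init__()
        # 3D patchify: fold space-time patches of the latent into tokens.
        self.patchify = nn.Conv3d(latent_ch, dim,
                                  kernel_size=patch, stride=patch)
        self.align = nn.Linear(dim, dim)  # lightweight alignment head

    def forward(self, cond_latent: torch.Tensor,
                video_tokens: torch.Tensor) -> torch.Tensor:
        # cond_latent: (B, C, T, H, W) latent of the conditioning video
        # video_tokens: (B, N, dim) tokens of the video being generated
        tokens = self.patchify(cond_latent).flatten(2).transpose(1, 2)
        return video_tokens + self.align(tokens)  # add-based injection
```

Compressing the condition with the same VAE as the generated video keeps both streams in one latent space, so a shallow alignment head suffices to match token distributions.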

BibTeX

@article{HunyuanCustom,
  author  = {Tencent Hunyuan},
  title   = {HunyuanCustom: A Multimodal-Driven Architecture for Subject-Centric Video Generation},
  journal = {Tech report},
  year    = {2025},
}