ByteDance’s AI Model Can Generate Video from Single Image
February 6, 2025
ByteDance has developed a generative model that can use a single photo to generate photorealistic video of humans in motion. Called OmniHuman-1, the multimodal system supports various visual and audio styles and can generate people singing, dancing, speaking and moving in a natural fashion. ByteDance says the new technology clears hurdles that hinder existing human-animation generators, such as short play times and over-reliance on high-quality training data. The diffusion transformer-based OmniHuman addresses those challenges by mixing motion-related conditions into the training phase, an approach ByteDance researchers claim is new.
Although it’s not in any sort of release, not even a preview, ByteDance is discussing the model and has made some video demos available. Based on those, TechCrunch writes that OmniHuman-1 “can generate perhaps the most realistic deepfake videos to date,” adding that it manages “to clear the uncanny valley,” avoiding the usual “tell or obvious sign that AI was involved somewhere.”
It “supports various portrait contents (face close-up, portrait, half-body, full-body)” and “handles human-object interactions and challenging body poses, and accommodates different image styles,” ByteDance (which owns TikTok) explains in a technical paper.
OmniHuman “generates full-body videos that show people gesturing and moving in ways that match their speech, surpassing previous AI models that could only animate faces or upper bodies,” reports VentureBeat, characterizing the new Chinese model as “a breakthrough that could reshape digital entertainment and communications.”
OmniHuman leverages data-driven motion generation, supporting multiple driving modalities. “For audio-driven scenarios, all conditions except pose are activated. For pose-related combinations, all conditions are activated, but for pose-only driving, audio is disabled,” according to the paper. Driving modalities are the input signals that steer the generated motion; while they influence the system’s behavior, they are distinct from prompts.
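OmniHuman-1 has no public release or API, so the snippet below is only an illustrative sketch of the condition gating the paper describes. The condition names (text, reference image, audio, pose) and the set of driving modes are assumptions made for illustration, not ByteDance’s interface.

```python
# Hypothetical gating of OmniHuman-style driving conditions, based on the
# rules quoted from the paper. Condition and mode names are assumptions.

def active_conditions(driving_mode: str) -> dict[str, bool]:
    """Return which conditioning signals are enabled for a given driving mode."""
    if driving_mode == "audio":
        # Audio-driven scenario: all conditions except pose are activated.
        return {"text": True, "reference_image": True, "audio": True, "pose": False}
    if driving_mode == "pose+audio":
        # Pose-related combination: all conditions are activated.
        return {"text": True, "reference_image": True, "audio": True, "pose": True}
    if driving_mode == "pose":
        # Pose-only driving: audio is disabled.
        return {"text": True, "reference_image": True, "audio": False, "pose": True}
    raise ValueError(f"unknown driving mode: {driving_mode}")


print(active_conditions("audio"))
# {'text': True, 'reference_image': True, 'audio': True, 'pose': False}
```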
In terms of controlling inputs, OmniHuman accepts audio to guide lip sync and speech-related gestures, video clips to serve as motion references, text to steer specific actions or behaviors, and images to provide pose references that drive full-body animation.
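To make those controlling inputs concrete, here is a purely hypothetical bundle of them in code. The field names, types and file formats are assumptions for illustration and are not part of any ByteDance interface.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative-only bundle of the controlling inputs the article lists.
# OmniHuman-1 exposes no public API; every name here is an assumption.

@dataclass
class DrivingInputs:
    reference_image: str                 # path to the single source photo
    audio: Optional[str] = None          # speech or song track for lip sync and gestures
    motion_video: Optional[str] = None   # clip whose motion should be referenced
    text: Optional[str] = None           # description steering specific actions
    pose_sequence: Optional[str] = None  # pose reference driving full-body animation

    def modalities(self) -> list[str]:
        """List which driving signals are present besides the reference image."""
        names = ("audio", "motion_video", "text", "pose_sequence")
        return [name for name in names if getattr(self, name) is not None]


# Example: audio-driven generation from one portrait photo.
inputs = DrivingInputs(reference_image="portrait.png", audio="speech.wav")
print(inputs.modalities())  # ['audio']
```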
The ByteDance development team “trained OmniHuman on more than 18,700 hours of human video data using a novel approach that combines multiple types of inputs — text, audio and body movements. This ‘omni-conditions’ training strategy allows the AI to learn from much larger and more diverse datasets than previous methods,” VentureBeat reports.
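Neither the training code nor the exact mixing ratios are public, so the following is only a generic sketch of the mixed-condition idea described above: randomly keeping or dropping individual driving signals for each training sample so a single model learns to work with many condition subsets. The keep probabilities below are invented for illustration and do not come from the OmniHuman-1 paper.

```python
import random

# Illustrative-only sketch of "omni-conditions"-style mixing: each training
# sample keeps or drops individual conditioning signals, so the model sees
# clips conditioned on different subsets (audio only, pose only, both, ...).

CONDITION_KEEP_PROB = {"text": 0.9, "audio": 0.5, "pose": 0.3}  # made-up values

def sample_condition_mask(rng: random.Random) -> dict[str, bool]:
    """Randomly decide which conditioning signals accompany one sample."""
    return {name: rng.random() < prob for name, prob in CONDITION_KEEP_PROB.items()}

if __name__ == "__main__":
    rng = random.Random(0)
    for step in range(3):
        mask = sample_condition_mask(rng)
        kept = [name for name, keep in mask.items() if keep]
        # In a real training loop the dropped conditions would be replaced by
        # null embeddings before computing the diffusion loss.
        print(f"step {step}: conditions used -> {kept}")
```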