Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few predefined conditioning settings. To tackle this issue, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images into noisy representations in the latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. We also provide further insights into our method through detailed ablation studies and analysis.
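To give an intuition for the dynamic control mechanism mentioned above, the sketch below illustrates one plausible way to assign a per-frame conditioning strength that decays with a frame's distance to its nearest condition image. The function name `swap_ratios`, the linear decay schedule, and the default ratio bounds are our assumptions for illustration, not the exact schedule used by FlexTI2V.

```python
# Minimal illustrative sketch (assumed linear schedule, not the authors' exact one):
# each frame gets a patch-swap ratio that shrinks as its distance to the
# nearest condition image grows.
import torch

def swap_ratios(num_frames: int, cond_positions: list[int],
                max_ratio: float = 0.6, min_ratio: float = 0.1) -> torch.Tensor:
    """Return one conditioning-strength (patch-swap) ratio per frame."""
    frames = torch.arange(num_frames, dtype=torch.float32)
    positions = torch.tensor(cond_positions, dtype=torch.float32)
    # Distance from each frame to its closest condition image.
    dist = (frames[:, None] - positions[None, :]).abs().min(dim=1).values
    # Normalize distances to [0, 1] and map them to [min_ratio, max_ratio].
    norm = dist / max(dist.max().item(), 1.0)
    return max_ratio - (max_ratio - min_ratio) * norm

# Example: 16 frames conditioned on the first and last frame.
print(swap_ratios(16, [0, 15]))
```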
Comparison with classic TI2V tasks. Our task requires video generation conditioned on any number of images at arbitrary positions, which unifies existing classic TI2V tasks. Images with blue and pink edges are condition images, and images with green edges are generated video frames.
Overview of the proposed FlexTI2V approach. We invert the condition image embedding to the noisy representation `\tilde{\mathbf{x}}_t` at each step. The final noise `\tilde{\mathbf{x}}_T` is reused as the initialization for video synthesis. At step t, we directly replace the video frames with the condition images at the desired positions. Then, for each video frame, we randomly swap a portion of its patches with those of the bounding condition images, based on the relative distance between the frame and each image. Although this figure shows the special case of two condition images, our method naturally extends to any number of images at arbitrary positions. Note that all operations of our method occur in the latent space; we visualize RGB images and frames on the latent representations only for intuitive understanding.
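The sketch below outlines one denoising step of the patch-swapping idea described in this caption, operating purely on latent tensors: condition frames are replaced directly, and every other frame swaps a random subset of spatial patches with its nearest inverted condition image. The helper name `random_patch_swap`, the patch size, and the nearest-image source rule are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch (our assumptions, not the released code) of one random
# patch-swap step in latent space.
import torch

def random_patch_swap(video_latent: torch.Tensor,           # (F, C, H, W) video latent at step t
                      cond_latents: dict[int, torch.Tensor], # {frame index: (C, H, W) inverted image latent}
                      ratios: torch.Tensor,                  # (F,) swap ratio per frame
                      patch: int = 2) -> torch.Tensor:
    num_frames, _, height, width = video_latent.shape
    gh, gw = height // patch, width // patch                 # patch grid size
    out = video_latent.clone()
    positions = sorted(cond_latents.keys())
    for f in range(num_frames):
        if f in cond_latents:
            # Frames at condition positions are replaced directly.
            out[f] = cond_latents[f]
            continue
        # The nearest condition image supplies the swapped patches.
        src = cond_latents[min(positions, key=lambda p: abs(p - f))]
        num_swap = int(ratios[f].item() * gh * gw)
        idx = torch.randperm(gh * gw)[:num_swap]
        for i in idx.tolist():
            r, c = divmod(i, gw)
            rs, cs = r * patch, c * patch
            out[f, :, rs:rs + patch, cs:cs + patch] = src[:, rs:rs + patch, cs:cs + patch]
    return out
```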
Quantitative comparison with previous methods. “DC” is short for DynamiCrafter. “User” denotes the user study results. ↓ means a lower score on this metric indicates better performance. The best results are highlighted in boldface. The orange row refers to our FlexTI2V method.
@article{lai2025incorporating,
title={Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training},
author={Lai, Bolin and Lee, Sangmin and Cao, Xu and Li, Xiang and Rehg, James M},
journal={arXiv preprint arXiv:2505.20629},
year={2025}}