Latent diffusion has become the default paradigm for visual generation, yet we observe a persistent reconstruction-generation trade-off as latent dimensionality increases: higher-capacity autoencoders improve reconstruction fidelity, but generation quality eventually declines. We trace this gap to asymmetric encoder and decoder behaviors in the high-frequency bands. Through controlled perturbations in both the RGB and latent domains, we find that decoders depend strongly on high-frequency latent components to recover details, whereas encoders under-represent high-frequency content, leaving diffusion models insufficiently exposed to, and underfit in, the high-frequency bands during training. To address this issue, we introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training -- without modifying or retraining the autoencoder. Applied across several high-dimensional autoencoders, FreqWarm consistently improves generation quality: it decreases gFID by 14.11 on Wan2.2-VAE, 6.13 on LTX-VAE, and 4.42 on DC-AE-f32, while remaining architecture-agnostic and compatible with diverse backbones. Our study shows that explicitly managing frequency exposure can turn high-dimensional latent spaces into more diffusible targets.
There is a natural trade-off between reconstruction (how well the decoder can recover encoded images) and generation (how close the synthetic image distribution is to the real image distribution) with respect to the dimension of the latent space, as illustrated in Fig. 1. As the number of latent channels grows, reconstruction performance consistently improves (blue), while generation performance first improves and then degrades (green).
Finding 1: The decoder relies heavily on the information encoded in high-frequency latent embeddings to reconstruct details and semantics in RGB space (see the probe sketch after Finding 3).
Finding 2: In RGB space, most image information is concentrated in a narrow low-frequency band, which differs from the distribution observed in latent space.
Finding 3: Extremely high-frequency components in RGB space contribute only marginally to image quality, yet they may impede the encoding of other high-frequency signals.
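To make the perturbation analysis behind Finding 1 concrete, the following minimal sketch low-pass filters a tensor in the Fourier domain and measures how reconstruction degrades as high-frequency latent bands are removed. It assumes a vae object exposing encode/decode methods that map between image and latent tensors; lowpass_filter, radius_frac, and probe_decoder_sensitivity are illustrative names, not the paper's exact protocol.

```python
import torch


def lowpass_filter(x: torch.Tensor, radius_frac: float) -> torch.Tensor:
    """Zero out spatial-frequency components beyond a normalized cutoff radius.

    x: (B, C, H, W) tensor in RGB or latent space.
    radius_frac: cutoff as a fraction of the maximum frequency, in (0, 1].
    """
    _, _, H, W = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x.float()), dim=(-2, -1))
    # Centered circular low-pass mask over normalized frequency coordinates.
    fy = torch.linspace(-1.0, 1.0, H, device=x.device).view(H, 1)
    fx = torch.linspace(-1.0, 1.0, W, device=x.device).view(1, W)
    mask = ((fy ** 2 + fx ** 2).sqrt() <= radius_frac).float()
    out = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)))
    return out.real.to(x.dtype)


@torch.no_grad()
def probe_decoder_sensitivity(vae, images, radii=(0.25, 0.5, 0.75, 1.0)):
    """Reconstruction error after removing high-frequency latent bands (Finding 1 probe)."""
    latents = vae.encode(images)  # assumed encode() API of the pretrained autoencoder
    return {
        r: torch.mean((vae.decode(lowpass_filter(latents, r)) - images) ** 2).item()
        for r in radii            # assumed decode() API
    }
```

A sharp rise in error as radius_frac shrinks would indicate, as in Finding 1, that the decoder leans on high-frequency latent bands to recover detail.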
We filter out high-frequency components above a frequency threshold r0 in RGB space and forward the filtered images into a pretrained autoencoder. We then train diffusion or flow-matching models on the resulting latents during the early training stage as a warm-up. Note that the autoencoder is kept frozen throughout training in our method.
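The warm-up can be sketched as a simple switch in the training loop: for the first warm-up steps, images are low-pass filtered at cutoff r0 before being encoded by the frozen autoencoder, after which full-band images are used. The sketch below is a minimal illustration assuming a rectified-flow (flow-matching) objective, a backbone dit called as dit(x_t, t), and the lowpass_filter helper from the previous sketch; warmup_steps and r0 are placeholder values, not the paper's reported settings.

```python
import torch


def freqwarm_step(dit, vae, images, step, warmup_steps=10_000, r0=0.8):
    """One hypothetical FreqWarm training step (rectified-flow objective).

    For the first `warmup_steps`, RGB inputs are low-pass filtered at cutoff
    r0 before encoding; afterwards, full-band images are used. The pretrained
    autoencoder `vae` stays frozen throughout.
    """
    if step < warmup_steps:
        images = lowpass_filter(images, r0)  # helper from the probe sketch above
    with torch.no_grad():
        latents = vae.encode(images)         # assumed encode() API, frozen VAE
    # Rectified-flow / flow-matching target: velocity from noise to the latent.
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device).view(-1, 1, 1, 1)
    x_t = (1.0 - t) * noise + t * latents
    v_pred = dit(x_t, t.flatten())           # assumed backbone call signature
    return torch.mean((v_pred - (latents - noise)) ** 2)
```

Because the filtering acts only on the inputs to a frozen autoencoder, the curriculum remains architecture-agnostic and requires no change to the backbone or the autoencoder weights.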