SIGGRAPH 2026
Abstract
Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. White Gaussian noise is well suited for producing diverse outputs from a single text prompt due to its absence of structure. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the image’s global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components freely emerge as fine details, enabling variability across generated outputs.
Motivation
Many artistic processes, such as painting and sculpting, begin with coarse structure and gradually refine fine details. Diffusion models follow a similar process: early denoising steps recover low-frequency information (global structure), while later steps introduce high-frequency details. In this work, we leverage this property to condition image generation using an initial sketch. By constraining only the low frequencies, we preserve the overall structure while allowing the model to freely generate details. As shown below, this enables image generation from colorful sketches that naturally resemble the early stages of drawing. In contrast, traditional conditioning methods such as ControlNet often rely on detailed conditions, resulting in rigid generations with limited diversity, where variations are mostly restricted to low-frequency properties like color. Moreover, because these methods constrain the entire diffusion process, they can reduce output quality. Conditioning only the initial noise preserves generation flexibility and maintains image quality.
Method
Our approach manipulates the low-frequency components of the initial Gaussian noise before the denoising process begins. By replacing or blending the low-frequency bands of the noise with a downsampled color prior, the diffusion model’s generation is steered toward the desired palette and composition — without any retraining, fine-tuning, or additional model components. The high-frequency noise bands remain random, ensuring diversity in fine detail across different seeds and prompts.
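As a rough illustration of the idea described above, the following NumPy sketch swaps (or blends) the low-frequency Fourier coefficients of a Gaussian noise tensor with those of a prior. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the radial `cutoff`, the `blend` weight, and the choice of an FFT mask are all hypothetical, and the prior is assumed to already live in the same space and value range as the noise (for a latent diffusion model, that would mean an encoded and suitably scaled prior).

```python
import numpy as np


def low_pass_mask(height, width, cutoff):
    """Boolean (H, W) mask of 2D frequencies with radius <= cutoff (cycles/sample)."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2) <= cutoff


def inject_low_frequencies(noise, prior, cutoff=0.05, blend=1.0):
    """Replace (blend=1) or blend (0 < blend < 1) the low-frequency band of
    `noise` with that of `prior`. Both arrays have shape (C, H, W).

    Hypothetical helper: cutoff and blend are illustrative parameters.
    """
    if noise.shape != prior.shape:
        raise ValueError("noise and prior must have the same shape")
    mask = low_pass_mask(noise.shape[-2], noise.shape[-1], cutoff)
    noise_f = np.fft.fft2(noise, axes=(-2, -1))
    prior_f = np.fft.fft2(prior, axes=(-2, -1))
    # Inside the low-frequency band, mix the two spectra; elsewhere keep the noise.
    low_band = (1.0 - blend) * noise_f + blend * prior_f
    mixed_f = np.where(mask, low_band, noise_f)
    # The mask is symmetric under frequency negation, so the result is real
    # up to floating-point rounding; the imaginary residue is discarded.
    return np.fft.ifft2(mixed_f, axes=(-2, -1)).real


# Usage: the random seed only affects the high-frequency band that is kept.
rng = np.random.default_rng(0)
white = rng.standard_normal((4, 64, 64))
prior = np.full((4, 64, 64), 0.5)  # placeholder for a real low-frequency prior
colorful = inject_low_frequencies(white, prior, cutoff=0.05, blend=1.0)
```

Because only the coefficients inside the mask are touched, changing the seed changes every high-frequency coefficient while the low-frequency band stays fixed, which matches the behavior described above: structure and color are anchored, and fine detail varies.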
Results & Examples
The same structural input conditioned on different color priors. Low-frequency hue manipulation steers the global palette while preserving layout and fine detail.
The same input noise conditioned on different text prompts. Global color and structure remain consistent while semantic content shifts according to each prompt.
Multiple outputs from the same input and prompt, sampled with different high-frequency noise seeds. Fine details vary while the overall structure and color remain anchored to the condition.
A broader gallery of inputs and their generated outputs across diverse subjects, prompts, and color conditions.
Applications
Beyond direct color conditioning, Colorful-Noise enables a range of downstream creative and editing tasks. Because our method operates purely on the initial noise — with no model modifications — it composes naturally with any text-to-image pipeline.
Given a reference style image, we extract its low-frequency color palette and inject it into the initial noise. This guides the generated output to match the global color distribution and tonal character of the reference — without copying its content or structure. The result is a stylistically aligned image that remains fully controlled by the text prompt, enabling coherent multi-image sets that share a consistent visual language.
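The text above does not say how the low-frequency palette is extracted from the reference. One simple possibility, shown here purely as an assumption, is to average-pool the reference image into coarse blocks and repeat each block back to full resolution, which discards content and fine structure while keeping the coarse color layout. The function name `coarse_color_prior` and the `factor` parameter are hypothetical.

```python
import numpy as np


def coarse_color_prior(image, factor):
    """Average-pool an (H, W, C) image over factor x factor blocks, then
    upsample back to (H, W, C) by repeating each pooled value.

    Assumption: block averaging stands in for whatever low-pass step is
    actually used; H and W must be divisible by `factor`.
    """
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("image height and width must be divisible by factor")
    # Split each spatial axis into (blocks, within-block) and average within blocks.
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    pooled = blocks.mean(axis=(1, 3))
    # Nearest-neighbor upsampling back to the original resolution.
    return pooled.repeat(factor, axis=0).repeat(factor, axis=1)
```

A larger `factor` keeps less of the reference: at the extreme where `factor` equals the image size, only the mean color survives.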
Colorful-Noise can be effectively combined with other conditioning methods, such as Canny edge maps (via ControlNet [1]) and stylization techniques (e.g., Conditional Balanced [2], Style-Aligned [3]). As demonstrated, given a content image whose colors should be preserved, Colorful-Noise retains its low-frequency information, while ControlNet [1] constrains the high-frequency structural details. When stylization is applied, the resulting image preserves the original content colors while transferring the texture and geometric characteristics of the reference style. Furthermore, the strength of this effect can be controlled through linear interpolation between Colorful-Noise and white noise, enabling a gradual transition between content preservation and generative flexibility.
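The strength control mentioned above can be sketched as a plain linear interpolation between the conditioned noise and the unconditioned white noise. The function name and the `strength` parameter are hypothetical, and this is only one reading of the description. In particular, if the two inputs were independent unit-variance samples, a plain interpolation would reduce the variance to strength² + (1 − strength)²; the text does not say whether any rescaling is applied, so none is done here.

```python
import numpy as np


def interpolate_noise(colorful, white, strength):
    """Linearly interpolate between white noise (strength=0) and
    conditioned noise (strength=1). Both arrays must share a shape.

    Hypothetical helper; no variance renormalization is performed.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    if colorful.shape != white.shape:
        raise ValueError("inputs must have the same shape")
    return strength * colorful + (1.0 - strength) * white
```

With `strength=1.0` the condition is applied in full, and lower values move the input back toward unconstrained white noise.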
Citation
@misc{cohen2026colorfulnoise,
  title         = {Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation},
  author        = {Nadav Z. Cohen and Ofir Abramovich and Ariel Shamir},
  year          = {2026},
  eprint        = {2605.00548},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {https://doi.org/10.1145/3799902.3811104},
  url           = {https://arxiv.org/abs/2605.00548}
}