Conditional Balance:
Improving Multi-Conditioning Trade-Offs in Image Generation

Reichman University, Microsoft

Abstract

Balancing content fidelity and artistic style is a pivotal challenge in image generation. While traditional style transfer methods and modern Denoising Diffusion Probabilistic Models (DDPMs) strive to achieve this balance, they often struggle to do so without sacrificing either style, content, or sometimes both.

This work addresses this challenge by analyzing the ability of DDPMs to maintain content and style equilibrium. We introduce a novel method to identify sensitivities within the DDPM attention layers, identifying specific layers that correspond to different stylistic aspects. By directing conditional inputs only to these sensitive layers, our approach enables fine-grained control over style and content, significantly reducing issues arising from over-constrained inputs. Our findings demonstrate that this method enhances recent stylization techniques by better aligning style and content, ultimately improving the quality of generated visual content.

Conditional Balance: An easy-to-follow video explaining the idea behind our method.

How Does It Work?

Our experiments with current style-generation methods reveal that different methods apply style to the output image with varying intensity. Some apply excessive style, leading to content loss as the content conditionals are overshadowed, while others apply insufficient style, resulting in weak stylization as the content conditionals dominate. We hypothesize that achieving the optimal balance of conditioning for content and style, and directing it to the appropriate parts of the model, can produce a visually satisfying harmony between content and style conditionals. Below, we briefly explain how we detect these sensitivities.

Image Collections

We first generate conditional inputs that vary only the style aspect whose sensitive parts in the generation process we wish to identify. To keep the analysis clean, we fix all other aspects. In the following examples, we fix the content and structure of the image while varying the style.
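As a simplified illustration (not our exact prompt template), such a collection can be built by holding the content prompt and the random seed fixed while swapping only the style descriptor, so the generated images differ only in the analyzed aspect:

```python
def build_style_collection(content_prompt: str, styles: list) -> list:
    """Prompts that vary only the style token.

    The content prompt (and, at generation time, the seed) stays fixed,
    so the resulting images differ only in the aspect being analyzed.
    """
    return [f"{content_prompt}, in the style of {style}" for style in styles]


prompts = build_style_collection(
    "a portrait of a woman", ["watercolor", "cubism", "ukiyo-e"]
)
```

Each prompt would then be rendered from the same initial noise so that any feature differences are attributable to style alone.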

Collection Mapping

Using the pre-generated conditionals, we generate a collection of images that vary only in the specific aspect being analyzed. While generating the images, we extract the features from the attention layers of the denoising UNet and store them for further analysis.
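As a minimal sketch (not our exact implementation), this feature extraction can be done with PyTorch forward hooks. The `name_filter` below is illustrative; in a diffusers-style UNet, for instance, cross-attention blocks are typically named `attn2`:

```python
import torch
import torch.nn as nn


def register_attention_hooks(model: nn.Module, store: dict, name_filter: str = "attn"):
    """Cache the output of every attention-named submodule on each forward pass.

    `store` maps layer name -> list of output tensors, one entry per call
    (i.e. one per denoising timestep when used inside a diffusion loop).
    Returns the hook handles so they can be removed afterwards.
    """
    handles = []
    for name, module in model.named_modules():
        if name_filter in name:
            # Bind `name` at definition time; the hook fires during forward().
            def hook(mod, inp, out, name=name):
                store.setdefault(name, []).append(out.detach().cpu())

            handles.append(module.register_forward_hook(hook))
    return handles
```

After running the denoising loop over the whole collection, `store` holds, for every hooked layer, one feature tensor per timestep per image, ready for the ranking step.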


Layer Ranking

After storing the attention features for each layer at each timestep, we rank the layers by sensitivity at each timestep. The ranking is based on a clustering score that evaluates each layer by the closeness of its features for images sharing the same style (Inner Distance) and the separation of features for images with different styles (Outer Distance).
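Concretely, a clustering score of this kind can be sketched in plain Python (the exact metric may differ from ours; the helper below scores a layer by the ratio of its average between-style centroid distance to its average within-style spread):

```python
import math


def sensitivity_score(features_by_style: dict) -> float:
    """Score a layer's sensitivity to the analyzed aspect.

    features_by_style: {style_id: [feature_vector, ...]} — flattened
    attention features of the images generated with each style.
    Higher score = tight clusters per style, well separated across styles.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def centroid(vecs):
        return [sum(coord) / len(vecs) for coord in zip(*vecs)]

    # Inner distance: average spread of features within each style cluster.
    centroids, inner, n_inner = {}, 0.0, 0
    for style, vecs in features_by_style.items():
        c = centroid(vecs)
        centroids[style] = c
        for v in vecs:
            inner += dist(v, c)
            n_inner += 1
    inner /= max(n_inner, 1)

    # Outer distance: average separation between style centroids.
    ids = list(centroids)
    outer, n_outer = 0.0, 0
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            outer += dist(centroids[ids[i]], centroids[ids[j]])
            n_outer += 1
    outer /= max(n_outer, 1)

    return outer / (inner + 1e-8)
```

Computing this score per layer and per timestep yields the sensitivity ranking used in the next step.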

Balanced Inference

By analyzing the features for both style and structure aspects, we set λS (blue) and λT (orange) to determine the ratio of the most sensitive layers to use for conditioning. By directing the conditionals to layers with higher sensitivity to style and content, we enable all conditionals to be utilized effectively without overshadowing one another.
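A hedged sketch of the selection step, assuming each timestep's layers are pre-ranked from most to least sensitive (the function name is ours, for illustration):

```python
def select_conditioning_layers(ranked_layers: list, lam: float) -> list:
    """Return the top lam-fraction of layers that receive a conditional input.

    ranked_layers: layer names for one timestep, most sensitive first.
    lam in [0, 1]: 0 disables this conditioning entirely, 1 sends it to
    every layer (the default, over-constrained behavior).
    """
    k = round(lam * len(ranked_layers))
    return ranked_layers[:k]
```

Running this with the style ranking and λS, and separately with the structure ranking and λT, routes each conditional only to the layers where it matters most, so neither overshadows the other.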


Balancing Style ( λS )

Using our balanced style injection strategy, we can generate images conditioned by complex combinations of style and content inputs without compromising conditional information or image quality.
To achieve this balance, we introduce a balancing parameter, λS, which determines the number of 'style-sensitive' layers utilized for conditioning at each timestep.
Below, we illustrate the impact of λS on the balancing effect. When λS=0, relying solely on text stylization, the generated image appears highly realistic. At our recommended value of λS≈0.4, the output achieves a harmonious balance between content and style. However, as λS exceeds 0.5, the style begins to dominate the content, since style is injected into content-sensitive layers, leading to visible artifacts and a noticeable misalignment with the content.
Notice that the stylization potential reaches its maximum at λS≈0.4; using higher values leaves the style approximately the same.

[Interactive figure: slider from a realistic output (no text stylization) to the stylized results. Prompt: "A woman riding on the back of a large camel in a sandy desert. She is wearing a wide straw hat and holding a stick. The surrounding is covered with cactus, in the style of s*." Style reference: s*.]

Geometric Style Interpolation ( λT )

Our experiments showed that using structure-conditioning images, such as Canny and Depth maps, significantly overshadows the effect of the style-conditioning input. More specifically, as the structure-conditioning image imposes geometric information on the generated image, we found that the geometric aspects of a style are affected the most. To address this, we used our analysis method to identify the parts of the diffusion process that are most sensitive to the geometric aspects of an artistic style. Subsequently, we leveraged these layers as indicators of where we should not apply structure conditioning, enabling the model to generate a structure-conditioned image while preserving its geometric style freedom. To achieve this effect, we introduced an interpolation parameter, λT, which controls the geometric freedom of the model when conditioned on both a style image and a structure-conditioning image. Using λT=0 applies the default conditioning, which enforces a strong structure constraint on the generated image, while increasing λT up to λT=1 gradually enhances the model's geometric style freedom.
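A minimal sketch of this exclusion logic (names are illustrative, not our exact implementation): given the layers ranked by sensitivity to geometric style, λT controls what fraction of the top-ranked layers are skipped when applying structure conditioning:

```python
def structure_conditioning_mask(geometry_ranked_layers: list, lam_t: float) -> dict:
    """Decide, per layer, whether structure conditioning is applied.

    geometry_ranked_layers: layer names ranked by sensitivity to the
    geometric aspects of style, most sensitive first.
    lam_t=0 keeps the default behavior (structure applied everywhere);
    increasing lam_t excludes the top geometry-sensitive layers,
    restoring the model's geometric style freedom.
    """
    k = round(lam_t * len(geometry_ranked_layers))
    excluded = set(geometry_ranked_layers[:k])
    return {name: name not in excluded for name in geometry_ranked_layers}
```

The mask is then consulted when injecting the Canny or Depth conditioning: masked-out layers see only the style and text conditionals.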

[Interactive figure: slider from the content input to the results as λT increases. Prompt: "A portrait of a woman, in the style of s*." Style reference: s*.]

Additional Applications

In addition to balanced style generation, our method can be utilized for other style-oriented applications. We present a few ideas in the following sections.


Copying the works of old masters is a time-honored tradition in the art world, dating back to the origins of painting itself. This practice serves as a tool for artists to refine their techniques and develop their unique personal styles. Throughout history, many renowned painters have engaged in this approach - examples include Vincent van Gogh, who copied works by Jean-François Millet, and Pablo Picasso, who reinterpreted works by Diego Velázquez such as Las Meninas. This tradition has even given rise to several iconic artworks, such as Edgar Degas' studies of Old Masters like Nicolas Poussin and Rembrandt, or Francis Bacon's reimaginings of Diego Velázquez's Portrait of Pope Innocent X. Inspired by this classical method of artistic learning, we utilize our stylization approach, which enables the application of distinctive styles to the works of old masters - a process we call ReStyle. Additionally, our method extends the model's geometric flexibility, allowing for the reimagining of an artwork's content in innovative ways - a feature we refer to as ReContent.


Style transfer is a well-known technique that applies the artistic style of a given style image to a separate content image. While diffusion models have significantly advanced computer-based artistic generation, they still face challenges in achieving traditional style transfer. This is because diffusion models generate images starting from noise rather than directly modifying a content image, requiring innovative methods for representing style within conditioning techniques rather than relying on direct optimization. In our experiments, we found that integrating our layer selection strategy into existing methods substantially enhances the balance between style and content in style transfer tasks. Here, we present results using B-LoRA (Frenkel et al.), with parameters set to λS ≈ 0.57 and λT ≈ 0.85.

BibTeX

@misc{cohen2024conditionalbalanceimprovingmulticonditioning,
      title={Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation}, 
      author={Nadav Z. Cohen and Oron Nir and Ariel Shamir},
      year={2024},
      eprint={2412.19853},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.19853}, 
}