Conditional Balance:
Improving Multi-Conditioning Trade-Offs in Image Generation

Reichman University, Microsoft

Abstract

Balancing content fidelity and artistic style is a pivotal challenge in image generation. While traditional style transfer methods and modern Denoising Diffusion Probabilistic Models (DDPMs) strive to achieve this balance, they often struggle to do so without sacrificing either style, content, or sometimes both.

This work addresses this challenge by analyzing the ability of DDPMs to maintain content and style equilibrium. We introduce a novel method to identify sensitivities within the DDPM attention layers, identifying specific layers that correspond to different stylistic aspects. By directing conditional inputs only to these sensitive layers, our approach enables fine-grained control over style and content, significantly reducing issues arising from over-constrained inputs. Our findings demonstrate that this method enhances recent stylization techniques by better aligning style and content, ultimately improving the quality of generated visual content.

Conditional Balance: An easy-to-follow video explaining the idea behind our method.

How Does It Work?

Our experiments with current style-generation methods reveal that different methods apply style to the output image with varying intensity. Some apply excessive style, leading to content loss as the content conditionals are overshadowed, while others apply insufficient style, resulting in weak stylization as the content conditionals dominate. We hypothesize that achieving the optimal balance of conditioning for content and style, and directing it to the appropriate parts of the model, can produce a visually satisfying harmony between content and style conditionals. Below, we briefly explain how we detect these sensitivities.

Image Collections

We first generate conditional inputs that vary only the style aspect whose sensitive parts in the generation process we wish to identify. To keep the analysis clean, we fix all other aspects. In the following examples, we fix the content and structure of the image while varying the style.
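As a simplified illustration (not our exact prompt template), such a collection can be built by holding the content prompt and the random seed fixed while swapping only the style descriptor, so the generated images differ only in the analyzed aspect:

```python
def build_style_collection(content_prompt: str, styles: list) -> list:
    """Prompts that vary only the style token.

    The content prompt (and, at generation time, the seed) stays fixed,
    so the resulting images differ only in the aspect being analyzed.
    """
    return [f"{content_prompt}, in the style of {style}" for style in styles]


prompts = build_style_collection(
    "a portrait of a woman", ["watercolor", "cubism", "ukiyo-e"]
)
```

Each prompt would then be rendered from the same initial noise so that any feature differences are attributable to style alone.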

Collection Mapping

Using the pre-generated conditionals, we generate a collection of images that vary only in the specific aspect being analyzed. While generating the images, we extract the features from the attention layers of the denoising UNet and store them for further analysis.
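As a minimal sketch (not our exact implementation), this feature extraction can be done with PyTorch forward hooks. The `name_filter` below is illustrative; in a diffusers-style UNet, for instance, cross-attention blocks are typically named `attn2`:

```python
import torch
import torch.nn as nn


def register_attention_hooks(model: nn.Module, store: dict, name_filter: str = "attn"):
    """Cache the output of every attention-named submodule on each forward pass.

    `store` maps layer name -> list of output tensors, one entry per call
    (i.e. one per denoising timestep when used inside a diffusion loop).
    Returns the hook handles so they can be removed afterwards.
    """
    handles = []
    for name, module in model.named_modules():
        if name_filter in name:
            # Bind `name` at definition time; the hook fires during forward().
            def hook(mod, inp, out, name=name):
                store.setdefault(name, []).append(out.detach().cpu())

            handles.append(module.register_forward_hook(hook))
    return handles
```

After running the denoising loop over the whole collection, `store` holds, for every hooked layer, one feature tensor per timestep per image, ready for the ranking step.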


Layer Ranking

After storing the attention features for each layer at each timestep, we rank the layers by sensitivity at each timestep. The ranking is based on a clustering score that evaluates each layer by the closeness of its features for images sharing the same style (Inner Distance) and the separation of features for images with different styles (Outer Distance).
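Concretely, a clustering score of this kind can be sketched in plain Python (the exact metric may differ from ours; the helper below scores a layer by the ratio of its average between-style centroid distance to its average within-style spread):

```python
import math


def sensitivity_score(features_by_style: dict) -> float:
    """Score a layer's sensitivity to the analyzed aspect.

    features_by_style: {style_id: [feature_vector, ...]} — flattened
    attention features of the images generated with each style.
    Higher score = tight clusters per style, well separated across styles.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def centroid(vecs):
        return [sum(coord) / len(vecs) for coord in zip(*vecs)]

    # Inner distance: average spread of features within each style cluster.
    centroids, inner, n_inner = {}, 0.0, 0
    for style, vecs in features_by_style.items():
        c = centroid(vecs)
        centroids[style] = c
        for v in vecs:
            inner += dist(v, c)
            n_inner += 1
    inner /= max(n_inner, 1)

    # Outer distance: average separation between style centroids.
    ids = list(centroids)
    outer, n_outer = 0.0, 0
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            outer += dist(centroids[ids[i]], centroids[ids[j]])
            n_outer += 1
    outer /= max(n_outer, 1)

    return outer / (inner + 1e-8)
```

Computing this score per layer and per timestep yields the sensitivity ranking used in the next step.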

Balanced Inference

By analyzing the features for both style and structure aspects, we set λS (blue) and λT (orange) to determine the ratio of the most sensitive layers to use for conditioning. By directing the conditionals to layers with higher sensitivity to style and content, we enable all conditionals to be utilized effectively without overshadowing one another.
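A hedged sketch of the selection step, assuming each timestep's layers are pre-ranked from most to least sensitive (the function name is ours, for illustration):

```python
def select_conditioning_layers(ranked_layers: list, lam: float) -> list:
    """Return the top lam-fraction of layers that receive a conditional input.

    ranked_layers: layer names for one timestep, most sensitive first.
    lam in [0, 1]: 0 disables this conditioning entirely, 1 sends it to
    every layer (the default, over-constrained behavior).
    """
    k = round(lam * len(ranked_layers))
    return ranked_layers[:k]
```

Running this with the style ranking and λS, and separately with the structure ranking and λT, routes each conditional only to the layers where it matters most, so neither overshadows the other.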


Balancing Style ( λS )

Using our balanced style injection strategy, we can generate images conditioned by complex combinations of style and content inputs without compromising conditional information or image quality.
To achieve this balance, we introduce a balancing parameter, λS, which determines the number of 'style-sensitive' layers utilized for conditioning at each timestep.
Below, we illustrate the impact of λS on the balancing effect. When λS=0, relying solely on text stylization, the generated image appears highly realistic. At our recommended value of λS≈0.4, the output achieves a harmonious balance between content and style. However, as λS exceeds 0.5, the style begins to dominate the content, since style is injected into content-sensitive layers, leading to visible artifacts and a noticeable misalignment with the content.
Notice that the stylization potential reaches its maximum at λS≈0.4; using higher values leaves the style approximately the same.

[Interactive figure: slider from a realistic output (no text stylization) to the stylized results. Prompt: "A woman riding on the back of a large camel in a sandy desert. She is wearing a wide straw hat and holding a stick. The surrounding is covered with cactus, in the style of s*." Style reference: s*.]

Geometric Style Interpolation ( λT )

Our experiments showed that using structure-conditioning images, such as Canny and Depth maps, significantly overshadows the effect of the style-conditioning input. More specifically, as the structure-conditioning image imposes geometric information on the generated image, we found that the geometric aspects of a style are affected the most. To address this, we used our analysis method to identify the parts of the diffusion process that are most sensitive to the geometric aspects of an artistic style. Subsequently, we leveraged these layers as indicators of where we should not apply structure conditioning, enabling the model to generate a structure-conditioned image while preserving its geometric style freedom. To achieve this effect, we introduced an interpolation parameter, λT, which controls the geometric freedom of the model when conditioned on both a style image and a structure-conditioning image. Using λT=0 applies the default conditioning, which enforces a strong structure constraint on the generated image, while increasing λT up to λT=1 gradually enhances the model's geometric style freedom.
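A minimal sketch of this exclusion logic (names are illustrative, not our exact implementation): given the layers ranked by sensitivity to geometric style, λT controls what fraction of the top-ranked layers are skipped when applying structure conditioning:

```python
def structure_conditioning_mask(geometry_ranked_layers: list, lam_t: float) -> dict:
    """Decide, per layer, whether structure conditioning is applied.

    geometry_ranked_layers: layer names ranked by sensitivity to the
    geometric aspects of style, most sensitive first.
    lam_t=0 keeps the default behavior (structure applied everywhere);
    increasing lam_t excludes the top geometry-sensitive layers,
    restoring the model's geometric style freedom.
    """
    k = round(lam_t * len(geometry_ranked_layers))
    excluded = set(geometry_ranked_layers[:k])
    return {name: name not in excluded for name in geometry_ranked_layers}
```

The mask is then consulted when injecting the Canny or Depth conditioning: masked-out layers see only the style and text conditionals.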

[Interactive figure: slider from the content input to the results as λT increases. Prompt: "A portrait of a woman, in the style of s*." Style reference: s*.]

Additional Applications

In addition to balanced style generation, our method can be utilized for other style-oriented applications. We present a few ideas in the following sections.


Copying the works of old masters is a time-honored tradition in the art world, dating back to the origins of painting itself. This practice serves as a tool for artists to refine their techniques and develop their unique personal styles. Throughout history, many renowned painters have engaged in this approach - examples include Vincent van Gogh, who copied works by Jean-François Millet, and Pablo Picasso, who reinterpreted works by Diego Velázquez such as Las Meninas. This tradition has even given rise to several iconic artworks, such as Edgar Degas' studies of Old Masters like Nicolas Poussin and Rembrandt, or Francis Bacon's reimaginings of Diego Velázquez's Portrait of Pope Innocent X. Inspired by this classical method of artistic learning, we utilize our stylization approach, which enables the application of distinctive styles to the works of old masters - a process we call ReStyle. Additionally, our method extends the model's geometric flexibility, allowing for the reimagining of an artwork's content in innovative ways - a feature we refer to as ReContent.


Style transfer is a well-known technique that applies the artistic style of a given style image to a separate content image. While diffusion models have significantly advanced computer-based artistic generation, they still face challenges in achieving traditional style transfer. This is because diffusion models generate images starting from noise rather than directly modifying a content image, requiring innovative methods for representing style within conditioning techniques rather than relying on direct optimization. In our experiments, we found that integrating our layer selection strategy into existing methods substantially enhances the balance between style and content in style transfer tasks. Here, we present results using B-LoRA (Frenkel et al.), with parameters set to λS ≈ 0.57 and λT ≈ 0.85.

BibTeX

@misc{cohen2024conditionalbalanceimprovingmulticonditioning,
      title={Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation}, 
      author={Nadav Z. Cohen and Oron Nir and Ariel Shamir},
      year={2024},
      eprint={2412.19853},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.19853}, 
}