Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

*Equal contribution
1KAIST
ICML 2026 (Spotlight)

TL;DR

AGSM is a lightweight, reward-free post-training method that improves text-image alignment in diffusion models. The key design is a dual-token architecture that separates positive and negative alignment guidance during training. AGSM improves semantic faithfulness, reduces counting errors, works across multiple backbones, and is complementary to RL post-training methods.

Key Contributions

  • Reward-free Plackett-Luce formulation. AGSM models text-image alignment as preference learning over diffusion scores, avoiding external reward models.
  • Explicit negative guidance. The method avoids unbounded contrastive divergence by assigning bounded score-level directions to negative pairs.
  • Dual-token architecture. AGSM separates positive and negative alignment regions with \(\psi^+\) and \(\psi^-\), preventing conflicting gradients from being absorbed into one shared token space.
  • Efficiency and versatility. AGSM optimizes a small set of soft tokens (1.8M), works across multiple diffusion backbones, and can be combined with diffusion-RL post-training methods.

Method Overview

AGSM treats text-image alignment as a reward-free preference problem inside the diffusion score-matching objective. It first estimates an intrinsic alignment reward from denoising likelihood, normalizes text candidates with a Plackett–Luce model, and then modifies the score-matching target with bounded alignment guidance.
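As a rough illustration of the first two steps, the sketch below scores several text candidates for one noisy image by their denoising error and normalizes those scores with a softmax, which is the top-1 choice probability of a Plackett-Luce model. The function names, the temperature `beta`, and the use of a negative squared error as the reward are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def implicit_rewards(eps_pred, eps_true, beta=1.0):
    """Hypothetical implicit reward: negative squared denoising error.

    eps_pred: (K, D) noise predictions for one noisy image under K text candidates.
    eps_true: (D,) the noise that was actually added.
    Returns a (K,) array; a smaller error gives a larger reward.
    """
    err = np.sum((eps_pred - eps_true) ** 2, axis=-1)
    return -beta * err

def plackett_luce_top1(rewards):
    """Top-1 Plackett-Luce probabilities, i.e. a numerically stable softmax."""
    z = rewards - rewards.max()
    p = np.exp(z)
    return p / p.sum()

# Toy data: candidate 0 plays the matching caption (small error),
# candidates 1 and 2 play mismatched captions (large error).
rng = np.random.default_rng(0)
eps_true = rng.standard_normal(16)
eps_pred = np.stack([
    eps_true + 0.1 * rng.standard_normal(16),
    eps_true + 1.0 * rng.standard_normal(16),
    eps_true + 1.0 * rng.standard_normal(16),
])
probs = plackett_luce_top1(implicit_rewards(eps_pred, eps_true))
print(probs)
```

The normalization is what makes the signal relative: a candidate is rewarded for denoising better than the other candidates, not for reaching some absolute error level.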

Overview diagram of Alignment-Guided Score Matching with positive and negative soft-token guidance

AGSM increases alignment rewards for positive pairs and decreases them for negative pairs. The noise predictions $\epsilon_\theta^+$ and $\epsilon_\theta^-$ are conditioned on the positive and negative soft tokens $(\psi^+, \psi^-)$, respectively. The target noise is adjusted using alignment guidance derived from implicit-reward-weighted EMA predictions $(\hat\epsilon_\theta^+, \hat\epsilon_\theta^-)$.
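The caption mentions EMA predictions. A minimal sketch of a standard exponential-moving-average update is given below, assuming the EMA is kept over the trainable soft tokens; the dictionary layout, the key name `psi_pos`, and the decay value are hypothetical and only serve the example.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """Return EMA parameters after one step: ema <- decay * ema + (1 - decay) * params."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

# Toy usage: the EMA copy trails the live parameters.
live = {"psi_pos": np.ones(3)}
ema = {"psi_pos": np.zeros(3)}
ema = ema_update(ema, live, decay=0.9)
print(ema["psi_pos"])
```

Predictions made with the slowly moving copy change less from step to step than predictions from the live parameters, which is the usual reason to build a training target from an EMA model.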

From Contrastive Pushes[1] to Score Guidance

  • Reward-free alignment signal. Derive alignment rewards from the diffusion model’s own denoising likelihood, without external reward models.
  • Normalized preference guidance. Use a Plackett-Luce formulation to model alignment probabilities.
  • Target-level score correction. Shift the score-matching target rather than directly maximizing negative pairs' denoising errors.
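The last point can be made concrete with a toy comparison. The sketch below contrasts a loss that directly maximizes a negative pair's denoising error with a regression loss onto a shifted target. The guidance direction and the `scale` value are placeholders chosen for illustration, not the paper's actual target.

```python
import numpy as np

def contrastive_negative_loss(eps_pred, eps_true):
    """Pushes the negative pair's error up: the loss has no lower bound."""
    return -np.mean((eps_pred - eps_true) ** 2)

def shifted_target_loss(eps_pred, eps_true, guidance, scale=0.5):
    """Regresses onto a shifted target: the loss is bounded below by zero."""
    target = eps_true + scale * guidance  # hypothetical bounded shift
    return np.mean((eps_pred - target) ** 2)

eps_true = np.zeros(4)
guidance = np.ones(4)  # placeholder direction

# Move the prediction further and further along the guidance direction.
for k in (1.0, 10.0, 100.0):
    pred = k * guidance
    print(k,
          contrastive_negative_loss(pred, eps_true),
          shifted_target_loss(pred, eps_true, guidance))
```

In this toy setting the contrastive loss keeps decreasing as the prediction drifts away, so an optimizer is never told to stop, whereas the shifted-target loss has a finite minimum at `eps_true + scale * guidance` and penalizes overshooting.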

Dual Soft Tokens

  • Architectural separation. \(\psi^+\) and \(\psi^-\) decouple positive and negative alignment regions.
  • Cleaner loss design. The split objective assigns separate score-matching targets to positive and negative pairs, enabling bounded negative guidance instead of over-optimizing negative pairs.
  • Ablation-backed benefit. With the same total token budget, the dual-token design outperforms positive-only and shared-token variants on alignment metrics.
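A minimal sketch of the dual-token idea follows, assuming the soft tokens are prepended to the text-encoder output. The class name, shapes, initialization scale, and the prepending itself are assumptions for illustration, since this page does not specify how the tokens are injected.

```python
import numpy as np

class DualSoftTokens:
    """Two separate sets of learnable tokens, one per pair type (hypothetical layout)."""

    def __init__(self, num_tokens, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.psi_pos = 0.02 * rng.standard_normal((num_tokens, dim))  # for matched pairs
        self.psi_neg = 0.02 * rng.standard_normal((num_tokens, dim))  # for mismatched pairs

    def condition(self, text_emb, is_positive):
        """Prepend the tokens for this pair type to a (seq_len, dim) text embedding."""
        tokens = self.psi_pos if is_positive else self.psi_neg
        return np.concatenate([tokens, text_emb], axis=0)

tokens = DualSoftTokens(num_tokens=4, dim=8)
text_emb = np.zeros((77, 8))  # stand-in for a text-encoder output
pos_cond = tokens.condition(text_emb, is_positive=True)
neg_cond = tokens.condition(text_emb, is_positive=False)
print(pos_cond.shape, neg_cond.shape)
```

Because the two pair types read from different parameters, a gradient from a negative pair only touches `psi_neg` and cannot overwrite what `psi_pos` has learned, which is the separation the bullets above describe.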

Text-to-Image Generation

AGSM improves text-image alignment across SD1.5, SDXL, and SD3 while keeping image quality competitive.

ImageReward versus FID plot comparing baseline, SoftREPA, and AGSM for generation

AGSM achieves a better Pareto frontier across complementary metrics (toward the top-left of the plot is better) for both image generation and editing.
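For readers unfamiliar with the term, the sketch below extracts a Pareto frontier from (FID, ImageReward) pairs, where lower FID and higher ImageReward are better. The points are arbitrary placeholder values chosen for the example and are not measurements from the paper.

```python
def pareto_front(points):
    """Return the non-dominated (fid, image_reward) pairs, sorted by FID.

    A point is dominated if another point is at least as good on both
    metrics (lower or equal FID, higher or equal ImageReward) and strictly
    better on at least one of them.
    """
    front = []
    for i, (f_i, r_i) in enumerate(points):
        dominated = any(
            (f_j <= f_i and r_j >= r_i) and (f_j < f_i or r_j > r_i)
            for j, (f_j, r_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((f_i, r_i))
    return sorted(front)

# Placeholder values, not results from the paper.
points = [(20.0, 0.9), (22.0, 1.1), (25.0, 1.0), (21.0, 0.8)]
front = pareto_front(points)
print(front)
```

A method has a better frontier when its points sit closer to the top-left corner, so that no competing point matches it on both metrics at once.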

Qualitative generation comparisons between SD3, SoftREPA, and AGSM

AGSM follows the prompt more faithfully and avoids repeated objects more often than SoftREPA.

Text-Guided Image Editing

AGSM improves target-text alignment for image editing while preserving the source image structure.

Qualitative image editing comparisons between baseline methods, SoftREPA, and AGSM

AGSM demonstrates a superior balance between text alignment and structural consistency.

Complementary to Diffusion RL

AGSM can be combined with existing preference optimization methods. Adding AGSM to DiffusionDPO[2], SPO[3], and InPO[4] improves COCO val5K metrics on SD1.5 and SDXL.

Additional qualitative generation comparisons with diffusion RL methods

AGSM improves prompt fidelity when added to diffusion RL methods.

References

[1] Lee et al., Aligning Text to Image in Diffusion Models is Easier Than You Think.

[2] Wallace et al., Diffusion Model Alignment Using Direct Preference Optimization.

[3] Liang et al., Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-Step Preference Optimization.

[4] Lu et al., InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment.
