Aligning Text-to-Image Diffusion Models With Constrained Reinforcement Learning

Ziyi Zhang, Sen Zhang, Li Shen, Yibing Zhan, Yong Luo*, Han Hu, Bo Du, Yonggang Wen, Dacheng Tao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Reward finetuning has emerged as a powerful technique for aligning diffusion models with specific downstream objectives or user preferences. However, current approaches suffer from the persistent challenge of reward overoptimization, where models exploit imperfect reward feedback at the expense of overall performance. In this work, we identify three key contributors to overoptimization: (1) a granularity mismatch between the multi-step diffusion process and sparse rewards; (2) a loss of plasticity that limits the model's ability to adapt and generalize; and (3) an overly narrow focus on a single reward objective that neglects complementary performance criteria. Accordingly, we introduce Constrained Diffusion Policy Optimization (CDPO), a novel reinforcement learning framework that addresses reward overoptimization from multiple angles. First, CDPO tackles the granularity mismatch through a temporal policy optimization strategy that delivers step-specific rewards throughout the entire diffusion trajectory, reducing the risk of overfitting to sparse final-step rewards. Second, we incorporate a neuron reset strategy that selectively resets overactive neurons in the model, preventing overoptimization induced by plasticity loss. Finally, to avoid overfitting to a narrow reward objective, we integrate constrained reinforcement learning with auxiliary reward objectives serving as explicit constraints, ensuring balanced optimization across diverse performance metrics.
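
To make the constrained formulation concrete, the sketch below illustrates a generic Lagrangian-style policy update with step-specific rewards and an auxiliary reward treated as an explicit constraint, in the spirit of the abstract's description. It is not the authors' algorithm or released code: the toy policy, the reward functions (main_reward, aux_reward), and the hyperparameters (AUX_BUDGET, LAMBDA_LR) are hypothetical placeholders, and PyTorch is assumed.

    # Illustrative sketch only (not the paper's implementation): REINFORCE on a
    # Lagrangian objective, with an auxiliary reward treated as a constraint.
    import torch

    torch.manual_seed(0)
    T, D = 10, 4                        # denoising steps, toy action dimension
    policy = torch.nn.Linear(D, D)      # stand-in for a diffusion denoiser
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    lam = torch.zeros(1)                # Lagrange multiplier (dual variable)
    AUX_BUDGET, LAMBDA_LR = -50.0, 1e-2 # hypothetical constraint level, dual lr

    def main_reward(a):                 # stand-in for the alignment reward
        return -(a ** 2).sum(dim=-1)

    def aux_reward(a):                  # stand-in for an auxiliary criterion
        return -(a - 1.0).abs().sum(dim=-1)

    for it in range(100):
        x = torch.randn(32, D)          # batch of toy latent states
        log_probs, r_main, r_aux = [], [], []
        for t in range(T):              # collect step-specific rewards
            dist = torch.distributions.Normal(policy(x), 1.0)
            a = dist.sample()
            log_probs.append(dist.log_prob(a).sum(dim=-1))
            r_main.append(main_reward(a))
            r_aux.append(aux_reward(a))
            x = a.detach()
        lp = torch.stack(log_probs).sum(dim=0)
        Rm = torch.stack(r_main).sum(dim=0)
        Ra = torch.stack(r_aux).sum(dim=0)
        # Primal step: ascend the Lagrangian R_main + lam * (R_aux - budget).
        loss = -(lp * (Rm + lam.item() * (Ra - AUX_BUDGET))).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        # Dual step: increase lam while the auxiliary constraint is violated.
        lam = (lam - LAMBDA_LR * (Ra.mean().item() - AUX_BUDGET)).clamp(min=0.0)

The dual update raises lam whenever the auxiliary reward falls below its budget, so the policy cannot inflate the primary reward by sacrificing the constrained criterion; the per-step reward accumulation loosely mirrors the step-specific rewards of the temporal strategy, though the paper's actual method and the neuron reset component are not reproduced here.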

Original language: English
Pages (from-to): 9550-9562
Number of pages: 13
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 47
Issue number: 11
DOIs
Publication status: Published - 2025
Externally published: Yes

Keywords

  • Diffusion models
  • constrained optimization
  • reinforcement learning
  • reward overoptimization
  • text-to-image
