Reinforcement learning (RL), particularly GRPO, significantly improves image generation quality by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs that lack creativity and visual diversity, which restricts the model's application scenarios. This issue can be analyzed from both the reward modeling and the generation dynamics perspectives. First, traditional GRPO relies on single-sample quality as the reward signal, driving the model to converge toward a few high-reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early-stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality–diversity trade-off. Motivated by these insights, we revisit the diversity degradation problem from both perspectives. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption, and adaptively allocate exploratory rewards according to group sizes to encourage the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization that enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency. Experiments demonstrate that our method achieves a 13%–18% improvement in semantic diversity at matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.
This phenomenon arises because reward maximization encourages the model to focus on a small set of high-reward outputs. As a result, "safe" or high-scoring patterns are repeatedly reinforced, while creative or less common behaviors are gradually suppressed (Cui et al.; Xiao et al.).
However, we raise a deeper question: Is diversity degradation an inevitable byproduct of reward optimization, or is it a symptom of misaligned learning objectives and generation dynamics?
From a dynamical viewpoint, the model’s conditional distribution can be written as a mixture of semantic modes:
\[
p_\theta(x \mid c) \;=\; \sum_{k} w_k \, p_k(x \mid c), \qquad \sum_{k} w_k = 1,
\]
where each component \(p_k\) corresponds to a distinct semantic mode.
When training relies on single-sample rewards, the mixture weights \(w_k\) evolve according to replicator dynamics:
\[
\dot{w}_k \;=\; w_k \Big( \bar{r}_k - \sum_{j} w_j \bar{r}_j \Big),
\]
where \(\bar{r}_k\) is the average reward of mode \(k\). Modes with above-average reward grow, while others shrink. Over time, this process converges to a degenerate equilibrium,
\[
w_{k^\ast} \to 1, \qquad w_k \to 0 \;\; \text{for } k \neq k^\ast, \qquad k^\ast = \arg\max_{k} \bar{r}_k,
\]
in which only the highest-reward mode survives. The outcome is a unimodal and homogenized output distribution, leading to mode extinction and loss of diversity.
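To make these dynamics concrete, the short simulation below iterates a discrete-time version of the replicator update for four hypothetical modes with fixed average rewards; the reward values, number of iterations, and step size `eta` are illustrative choices, not quantities from our experiments.

```python
# Discrete-time replicator dynamics: w_k <- w_k * (1 + eta * (r_k - <r>)),
# renormalized each step. With fixed mode rewards, all probability mass
# concentrates on the highest-reward mode (mode extinction).
import numpy as np

r = np.array([0.80, 0.75, 0.60, 0.50])   # hypothetical average reward per mode
w = np.full(4, 0.25)                      # start from a uniform mixture
eta = 0.5                                 # step size of the discrete update

for _ in range(500):
    mean_r = w @ r                        # group-average reward <r>
    w = w * (1.0 + eta * (r - mean_r))    # above-average modes grow
    w /= w.sum()                          # keep weights on the simplex

print(np.round(w, 4))   # ~[1, 0, 0, 0]: only the highest-reward mode survives
```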
Motivation 1: Single-sample rewards overlook the global output distribution and readily induce mode collapse, necessitating the adoption of distribution-level rewards.
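As a rough sketch of what such a distribution-level reward could look like (under our own simplifying assumptions, not the exact formulation used in this work), one can cluster the images sampled for a caption with spectral clustering and grant larger exploratory bonuses to samples that fall into smaller, rarer clusters; the embedding source, `n_clusters`, and the bonus form below are all placeholders.

```python
# Illustrative distribution-level creativity bonus: samples generated for the
# same caption are grouped by spectral clustering over their embeddings, and
# samples in rarer clusters receive a larger exploratory bonus.
import numpy as np
from sklearn.cluster import SpectralClustering


def creativity_bonus(embeddings: np.ndarray, n_clusters: int = 4,
                     scale: float = 1.0) -> np.ndarray:
    """embeddings: (G, D) features of the G images sampled for one caption."""
    G = embeddings.shape[0]
    # Cosine-similarity affinity between samples in the group.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(normed @ normed.T, 0.0, 1.0)
    labels = SpectralClustering(
        n_clusters=min(n_clusters, G), affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    # Per-sample cluster size: smaller clusters (rarer modes) get a larger bonus.
    sizes = np.bincount(labels, minlength=labels.max() + 1)[labels]
    return scale * (1.0 - sizes / G)


# Example: combine with per-sample quality scores before group normalization.
if __name__ == "__main__":
    feats = np.random.randn(8, 512)     # stand-in for image embeddings
    quality = np.random.rand(8)         # stand-in for reward-model scores
    print(quality + creativity_bonus(feats))
```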
Beyond external reward signals, we identify a key insight in the intrinsic denoising dynamics of diffusion models. By measuring perceptual similarity with DreamSim, we find that samples sharing more denoising steps become increasingly similar.
Crucially, diversity collapses much faster in the early denoising phase: the first one-third of the steps accounts for nearly 66% of the total diversity loss. This shows that early denoising plays a dominant role in determining visual diversity, while later steps mainly perform refinement.
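A minimal sketch of this kind of measurement is given below: a group of samples shares the first k denoising steps of a trajectory, branches into independent continuations, and the mean pairwise perceptual distance of the final images is recorded per branching point. The callables `run_prefix`, `denoise_from`, and `perceptual_distance` are placeholders (DreamSim would supply the last one in our setting), and the branch fractions are illustrative.

```python
# Sketch: quantify how much visual diversity remains when samples share the
# first k denoising steps and then continue independently.
from itertools import combinations
from typing import Callable

import numpy as np


def diversity_vs_shared_steps(
    prompt: str,
    num_steps: int,
    group_size: int,
    run_prefix: Callable[[str, int], np.ndarray],           # k shared steps -> latent
    denoise_from: Callable[[np.ndarray, int], np.ndarray],   # continuation -> final image
    perceptual_distance: Callable[[np.ndarray, np.ndarray], float],
    branch_fracs=(0.0, 0.33, 0.66),
) -> dict:
    """Mean pairwise perceptual distance of final images, per branching point."""
    out = {}
    for frac in branch_fracs:
        k = round(frac * num_steps)        # number of shared denoising steps
        shared = run_prefix(prompt, k)     # one trajectory prefix (k = 0: only the initial latent)
        # Each group member continues from the shared latent with its own noise.
        images = [denoise_from(shared, k) for _ in range(group_size)]
        dists = [perceptual_distance(a, b) for a, b in combinations(images, 2)]
        out[frac] = float(np.mean(dists))
    return out
```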
From the perspective of mitigating mode collapse, the denoising trajectory therefore carries an imbalanced diversity budget. The traditional KL penalty, however, is effectively at its weakest exactly where the budget should be highest, resulting in a structural mismatch that accelerates mode extinction.
Motivation 2: This structural mismatch accelerates mode extinction, highlighting a fundamental limitation of the regularization used in current diffusion RL training.
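One plausible way to instantiate a structure-aware regularizer, sketched here under our own assumptions rather than as the exact schedule used in this work, is to front-load the per-step KL coefficients so that the early, diversity-critical steps receive most of the regularization budget; the exponential-decay form and the `decay` value are illustrative.

```python
# Sketch of a structure-aware KL schedule: concentrate the regularization
# budget on early denoising steps, where diversity is determined, instead of
# spreading it uniformly across the trajectory.
import numpy as np


def structure_aware_kl_weights(num_steps: int, total_budget: float,
                               decay: float = 4.0) -> np.ndarray:
    """Per-step KL coefficients that sum to `total_budget`, front-loaded."""
    t = np.linspace(0.0, 1.0, num_steps)   # 0 = first (noisiest) denoising step
    raw = np.exp(-decay * t)                # larger weight for early steps
    return total_budget * raw / raw.sum()


# A uniform baseline spends the same budget evenly:
#   uniform = np.full(num_steps, total_budget / num_steps)
# The structure-aware schedule shifts that budget toward the early phase; the
# coefficient weights[t] would then scale the per-step KL term in the RL loss.
```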
Figure: Top row: Flow-GRPO | Bottom row: DiverseGRPO