DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO

Henglin Liu1,2,   Huijuan Huang2,   Jing Wang2,3,   Chang Liu1,   Xiu Li1,   Xiangyang Ji1
1Tsinghua University,   2Kling Team, Kuaishou Technology,   3Shenzhen Campus of Sun Yat-Sen University  

Abstract

Reinforcement learning (RL), particularly GRPO, significantly improves image generation quality by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs that lack creativity and visual diversity, restricting its application scenarios. This issue can be analyzed from both the reward modeling and generation dynamics perspectives. First, traditional GRPO relies on single-sample quality as the reward signal, driving the model to converge toward a few high-reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early-stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality–diversity trade-off. Motivated by these insights, we address diversity degradation at both levels. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption and adaptively allocate exploratory rewards according to group sizes, encouraging the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization that enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency. Experiments demonstrate that our method achieves a 13%–18% improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.


Overview

DiverseGRPO enhances image quality while mitigating mode collapse during GRPO training. Watch our video presentation for a quick summary of our key contributions.


Problem


A significant decline in the diversity of generated images is observed as GRPO training progresses.

This phenomenon arises because optimizing for reward maximization encourages the model to focus on a small set of high-reward outputs. As a result, "safe" or high-scoring patterns are repeatedly reinforced, while creative or less common behaviors are gradually suppressed (Cui et al.; Xiao et al.).

However, we raise a deeper question: Is diversity degradation an inevitable byproduct of reward optimization, or is it a symptom of misaligned learning objectives and generation dynamics?


Motivation


From a dynamical viewpoint, the model’s conditional distribution can be written as a mixture of semantic modes:

\[ \pi_{\theta}(x \mid p) = \sum_{k=1}^{K} w_k \, \pi_{\theta}^{k}(x \mid p) \]

When training relies on single-sample rewards, the mixture weights \(w_k\) evolve according to replicator dynamics:

\[ \frac{dw_k}{dt} = w_k \bigl( \bar{r}_k - \mathbb{E}_{j}[ \bar{r}_j ] \bigr) \]

where \(\bar{r}_k\) is the average reward of mode \(k\). Modes with above-average reward grow, while others shrink. Over time, this process converges to a degenerate equilibrium,

\[ w_k = \mathbb{1}\{ k = \arg\max_{j} \bar{r}_j \} \]

in which only the highest-reward mode survives. The outcome is a unimodal and homogenized output distribution, leading to mode extinction and loss of diversity.
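
To make these dynamics concrete, the following minimal Python sketch (our own illustration, not code from the paper) numerically integrates the replicator equation above, interpreting \(\mathbb{E}_{j}[\bar{r}_j]\) as the mixture-weighted mean reward; the per-mode rewards are hypothetical values.

```python
# Illustrative simulation of replicator dynamics under single-sample rewards:
# dw_k/dt = w_k (r_k - sum_j w_j r_j). Reward values below are hypothetical.
import numpy as np

K = 4                                        # number of semantic modes
w = np.full(K, 1.0 / K)                      # start from uniform mixture weights
r_bar = np.array([0.70, 0.72, 0.75, 0.80])   # hypothetical per-mode average rewards
dt = 0.1

for _ in range(2000):
    mean_r = np.dot(w, r_bar)                # mixture-weighted mean reward
    w = w + dt * w * (r_bar - mean_r)        # Euler step of the replicator ODE
    w = np.clip(w, 0.0, None)
    w /= w.sum()                             # renormalize to a probability vector

print(np.round(w, 4))                        # mass concentrates on the arg-max mode
```

Even small reward gaps between modes are enough for essentially all probability mass to collapse onto the arg-max-reward mode, which is precisely the mode-extinction behavior described above.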

Motivation 1: Single-sample rewards overlook the global output distribution and readily induce mode collapse, which necessitates the adoption of distribution-level rewards.



Beyond external reward signals, we reveal a key insight from the intrinsic denoising dynamics of diffusion models. By measuring perceptual similarity with DreamSim, we find that samples sharing more denoising steps become increasingly similar.

Crucially, diversity collapses much faster in the early denoising phase: the first one-third of steps accounts for nearly 66% of the total diversity loss. This shows that early denoising plays a dominant role in determining visual diversity, while later steps focus on refinement.
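
A minimal measurement sketch along these lines is shown below, assuming the `dreamsim` package's documented `model, preprocess = dreamsim(pretrained=True)` interface; the file paths are placeholders, and the exact protocol in the paper may differ. It scores a set of same-caption samples by their mean pairwise DreamSim distance, which can be compared across batches that share different numbers of early denoising steps.

```python
# Sketch: diversity of a group of images as the mean pairwise DreamSim distance.
# Assumes the `dreamsim` pip package; paths below are placeholders.
import itertools

import torch
from PIL import Image
from dreamsim import dreamsim

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = dreamsim(pretrained=True, device=device)

def pairwise_diversity(image_paths):
    """Average DreamSim distance over all image pairs (higher = more diverse)."""
    imgs = [preprocess(Image.open(p)).to(device) for p in image_paths]
    dists = [model(a, b).item() for a, b in itertools.combinations(imgs, 2)]
    return sum(dists) / len(dists)

# Example: compare a batch whose samples share the first third of denoising steps
# against a fully independent batch.
# print(pairwise_diversity(["shared_prefix_0.png", "shared_prefix_1.png", "shared_prefix_2.png"]))
```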

From the perspective of mitigating mode collapse, the denoising trajectory forms an imbalanced diversity budget. However, the traditional KL penalty becomes effectively weakest exactly when the budget should be highest, resulting in a structural mismatch that accelerates mode extinction.

Motivation 2: The structural mismatch between uniform KL regularization and the denoising diversity budget accelerates mode extinction, highlighting a fundamental limitation of current regularization in diffusion RL training.
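
As an illustration of how the regularization budget can be realigned (an assumed schedule for exposition, not the paper's exact formulation), the KL coefficient can be made timestep-dependent so that the strongest constraint falls on the early, high-noise denoising steps where diversity is decided:

```python
# Sketch of an early-weighted KL schedule (assumed form, not the released method):
# spend most of the regularization budget on the first part of the trajectory.
def kl_weight(step_idx, num_steps, base=0.01, early_frac=1 / 3, boost=4.0):
    """Return a KL coefficient that is `boost`x larger for the first `early_frac`
    of denoising steps and falls back to `base` afterwards."""
    return base * boost if step_idx < early_frac * num_steps else base

# Schematic per-step objective: policy-gradient term plus weighted KL penalty.
# loss_t = -advantage * logprob_ratio_t + kl_weight(t, T) * kl_div_t
print([kl_weight(t, 30) for t in range(30)][:12])  # first third gets the larger penalty
```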


Method Overview

[Figure: DiverseGRPO method overview]
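
The distributional creativity bonus described in the abstract can be sketched as follows. This is a minimal illustration under assumed hyperparameters (cluster count, neighbor count, bonus form), not the released implementation: samples generated from the same caption are embedded, grouped by spectral clustering, and samples falling into rarer clusters receive a larger exploratory reward.

```python
# Sketch of a cluster-size-aware creativity bonus (hyperparameters are assumptions).
import numpy as np
from sklearn.cluster import SpectralClustering

def creativity_bonus(embeddings, n_clusters=4, scale=1.0):
    """embeddings: (N, D) semantic features for N samples of one caption.
    Returns a per-sample bonus that is larger for samples in rarer clusters."""
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="nearest_neighbors", n_neighbors=5
    ).fit_predict(embeddings)
    counts = np.bincount(labels, minlength=n_clusters)
    return scale * (1.0 - counts[labels] / len(labels))   # small cluster -> big bonus

# Usage sketch: combine with the per-sample quality reward before the GRPO update.
# rewards = quality_scores + creativity_bonus(semantic_embeddings)
```

Samples in under-populated clusters are exactly the rare visual modes that the replicator analysis shows would otherwise be driven extinct, so boosting their reward counteracts the collapse toward the single highest-reward mode.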


Quantitative Results


To comprehensively evaluate our approach, we conduct experiments with different backbones (SD3.5-M / FLUX.1-dev) and reward functions (PickScore / HPSv3), as shown in the table. Our approach consistently improves all diversity metrics, achieving a superior quality–diversity Pareto front under comparable visual quality. This confirms that our reward mechanism encourages exploration of novel visual modes and prevents convergence to a few high-reward patterns.


Ablation Study


We evaluate the separate effects of the Structure-Aware Regularization (SA-Reg) and Creativity Reward modules. As shown in Figure (a), using both modules together achieves the best trade-off between quality and diversity. This indicates that structure-aware regularization helps maintain diverse image patterns, while the creativity reward pushes the model to explore even more semantic variations. Figures (b) and (c) examine the effects of the creativity reward coefficient and the number of structure-aware regularization steps, respectively. Increasing the coefficient boosts diversity by encouraging exploration, most notably at the highest value, but the additional gains begin to plateau, suggesting that beyond a certain point the model reaches a balance between exploration and exploitation. Likewise, more regularization steps enhance diversity, but higher step counts incur greater computational expense with diminishing returns (further details in the appendix).


Qualitative Results

We provide a visual comparison of images generated by the two methods. The top row in each figure set shows results from the baseline method, while the bottom row shows results from our method. The comparison reveals that, although different backbone networks produce stylistic differences, the baseline method suffers from significant mode collapse after training.


Human Evaluation


We conducted a human preference evaluation, and the results clearly demonstrate the superiority of DiverseGRPO in both image quality (visual quality and text alignment) and diversity (style diversity and content diversity).


More Analysis


Training Process: The figure illustrates the evolution of image quality and diversity throughout the training process. Compared to the baseline, our method maintains comparable image quality while experiencing a significantly slower decline in diversity, underscoring its effectiveness in balancing both critical aspects.


Impact of Exploration Bonus: At training step 720, a rare side-view elephant sample emerges. By constructing a distance matrix that reveals the distinct disparity between side-view samples and the majority, as shown in the heatmap, our method effectively identifies such samples as highly valuable for diversity. By assigning them higher exploration rewards, the model is guided not only to continue generating them in subsequent training stages, but also to refine the overall quality of its outputs, achieving more diverse and high-fidelity synthesis.