
Implementation Challenges & Resolutions

This document records the substantive engineering challenges encountered during development, suitable as raw material for the report's "Implementation Details — Challenges" section. Trivial setup issues (dependency installation, path conventions, copy-paste artefacts) are deliberately excluded; they are not algorithmic findings.


1. Throughput — single-environment rollout was CPU-bound

Symptom

Initial single-environment training achieved only ~20 steps per second on an RTX 4060 Laptop GPU. Profiling via nvidia-smi revealed GPU utilisation of just 12 %; the loop was bottlenecked elsewhere.

Root cause

  1. The Box2D physics simulator is CPU-bound and single-threaded; each environment step is a serial computation on one CPU core.
  2. Per-step agent.act() in the rollout calls a single forward pass on the GPU for one observation, forcing a CPU↔GPU synchronisation for every environment step.

Resolution

Switched the rollout loop to use Gymnasium's AsyncVectorEnv with 8 parallel worker processes. This:

  • runs 8 Box2D simulations on 8 CPU cores in parallel,
  • batches GPU calls so each forward pass amortises across 8 observations.

Throughput rose to ~95 steps per second, a 4.5× speedup. Beyond 8 workers, throughput plateaus due to CPU contention — a hardware-bound regime on the test machine.
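
For concreteness, a minimal sketch of the vectorised rollout pattern (illustrative only: the environment id, the 256-step segment length and the stand-in linear policy are assumptions, not the project's actual code):

    import gymnasium as gym
    import torch

    NUM_ENVS = 8
    device = "cuda" if torch.cuda.is_available() else "cpu"

    def make_env():
        # Each worker process owns its own single-threaded Box2D simulation.
        return gym.make("CarRacing-v3", continuous=False)

    envs = gym.vector.AsyncVectorEnv([make_env for _ in range(NUM_ENVS)])

    # Stand-in for the project's Actor-Critic: a single linear layer over
    # flattened pixels is enough to show the batched-inference pattern.
    policy = torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(96 * 96 * 3, 5),
    ).to(device)

    obs, _ = envs.reset(seed=0)
    for _ in range(256):  # one rollout segment
        with torch.no_grad():
            # One forward pass serves all 8 observations, amortising the
            # CPU<->GPU round trip that throttled the single-env loop.
            batch = torch.as_tensor(obs, dtype=torch.float32, device=device) / 255.0
            actions = torch.distributions.Categorical(logits=policy(batch)).sample()
        obs, rewards, terminations, truncations, infos = envs.step(actions.cpu().numpy())
    envs.close()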

Why it matters for the report

This is the dominant engineering decision in the project: it transformed the 1.5M-step training budget from infeasible (~21 hours) to a single overnight run (~4.5 hours).


2. Policy collapse under hard entropy annealing

Symptom

The first training run used the textbook PPO recipe in which both the learning rate and the entropy coefficient are annealed linearly to zero. Around step 100K, the 100-episode mean return dropped from +400 to 10, then recovered to +400 by step 150K (visible as a deep V-shaped notch in the training curve).

Root cause analysis

At step 100K, policy entropy had fallen to ~0.4 (from initial ln 5 ≈ 1.61). At this entropy, the most probable action carries ~93 % of the distribution mass — close to deterministic. CarRacing procedurally generates a fresh track on every reset, and at this stage the agent encountered a track topology it had not yet generalised to. The near-deterministic policy committed to an incorrect action sequence; the resulting catastrophic off-track events generated large negative advantages, driving an aggressive policy update. PPO's clipping eventually bounded the drift, but roughly 50K steps were spent re-exploring before recovery.
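
The ~93 % figure can be checked directly from the entropy of a 5-action categorical distribution (a worked example, not project code):

    import numpy as np

    # Near-deterministic 5-action policy: one dominant action, the rest uniform.
    p_max = 0.93
    probs = np.array([p_max] + [(1 - p_max) / 4] * 4)
    print(-(probs * np.log(probs)).sum())  # ~0.35 nats, close to the observed ~0.4
    print(np.log(5))                       # ~1.61 nats: the initial uniform policy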

Resolution

Introduced an entropy-coefficient floor of 0.005 (rather than letting it anneal to zero). The schedule now decays the entropy coefficient linearly from 0.01 toward 0.005, after which it remains constant. Retaining half of the initial exploration weight keeps the policy from going fully deterministic on rare tracks. We also clip per-frame rewards at a floor of -1.0 (rather than passing through the raw -100 catastrophe penalty), so that a single off-track frame cannot disproportionately shift the advantage distribution after normalisation.
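
A minimal sketch of the two floors (variable names and the shape of the decay are illustrative assumptions, not the exact training code):

    ENT_COEF_START, ENT_COEF_FLOOR = 0.01, 0.005
    REWARD_FLOOR = -1.0  # clip the raw -100 off-track penalty

    def entropy_coef(update_idx: int, total_updates: int) -> float:
        # Linear decay as before, but floored at 0.005 so the coefficient
        # plateaus instead of annealing all the way to zero.
        frac = min(1.0, update_idx / total_updates)
        return max(ENT_COEF_FLOOR, ENT_COEF_START * (1.0 - frac))

    def clip_reward(r: float) -> float:
        # A single catastrophic frame can no longer dominate the
        # normalised advantage distribution.
        return max(REWARD_FLOOR, r)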

Quantitative effect

The combination of the entropy floor and reward floor eliminated all subsequent collapse events in the 1.5M-step training run. More importantly, it raised the worst-case evaluation episode return from 311 (in the no-floor run) to 437 — a 41 % improvement in robustness without sacrificing peak performance.

Why it matters for the report

This is the core algorithmic finding: PPO's clipping objective guarantees a well-behaved local update but does not, on its own, guarantee good generalisation. Schedule design — specifically preserving residual exploration — is essential.


3. Final-checkpoint selection bias under annealed learning rates

Symptom

The literal end-of-training checkpoint exhibited high variance in 20-episode evaluation: mean return was high (~742) but the minimum episode return dropped to 327, and the standard deviation reached 185. Earlier checkpoints exhibited tighter distributions.

Root cause

Under a linearly annealed learning rate, the final ~10 % of training contributes negligible improvement to the running mean: the gradient steps are too small to refine the policy further. That same period, however, progressively reduces the residual stochasticity of the policy (as it approaches the entropy floor), which subtly amplifies sensitivity to out-of-distribution tracks. In effect, the final checkpoint trades robustness for a marginally higher mean, and this trade-off is invisible in the training-time diagnostics.

Resolution

Implemented a scan_checkpoints.py utility that:

  1. Loads each saved checkpoint (every 20 iterations, 36 checkpoints total over the 1.5M-step run);
  2. Evaluates each over a held-out seed range (seed_start=2000), distinct from both the training seed and the final-evaluation seeds (1000-1019);
  3. Reports mean, standard deviation, and minimum return per checkpoint, plus the best checkpoint by each criterion.

The submitted model is iter_0700.pt (training step ~1.43M), chosen for having the highest worst-case (minimum) return rather than the highest mean.
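
A sketch of the scan loop; evaluate_policy, the checkpoint path pattern and the 20-episode count are placeholders for the project's actual helpers:

    import glob
    import numpy as np
    import torch

    SEED_START, N_EPISODES = 2000, 20  # held-out seeds, disjoint from training and final evaluation

    results = []
    for path in sorted(glob.glob("checkpoints/iter_*.pt")):
        state = torch.load(path, map_location="cpu")
        # evaluate_policy is a placeholder: run one greedy episode and return its score.
        returns = [evaluate_policy(state, seed=SEED_START + i) for i in range(N_EPISODES)]
        results.append((path, np.mean(returns), np.std(returns), np.min(returns)))

    # Select on worst-case behaviour, not on the mean.
    best_by_min = max(results, key=lambda r: r[3])
    print("selected:", best_by_min[0])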

Quantitative effect on the submission

Compared to the literal final checkpoint:

  • Mean return: 742.0 → 705.0 (-5 %, acceptable)
  • Std: 185.2 → 160.3 (-13 %)
  • Minimum: 327.1 → 504.6 (+54 %)

Why it matters for the report

This is a methodological finding rather than a bug fix: the "submitted" checkpoint should be selected on a held-out seed distribution, not chosen as the literal last save. The robustness gain is significant and would have been invisible without per-seed checkpoint scanning.


4. Negative results: three attempted refinements that failed

After the v3 baseline (1.5M steps, mean 830, min 437) we attempted three sets of refinements drawn from recent PPO literature, each motivated by the desire to raise the worst-case (minimum) episode return. All three collapsed or under-performed. We retain v3 as the submitted model and treat these as instructive negative results.

4.1 Failed attempt: KL early stopping (target_kl=0.015)

Motivation. Stable-Baselines3 and CleanRL both support a KL early-stopping mechanism that aborts the remaining update epochs for the current rollout once the mean approx-KL exceeds 1.5×target_kl. Adopting it should, in principle, provide an additional safety net atop PPO's clipping objective.
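
The mechanism amounts to a single extra check inside the update loop (a sketch with placeholder helpers, mirroring the 1.5×target_kl trigger described above):

    TARGET_KL = 0.015

    stop = False
    for epoch in range(n_epochs):
        for batch in minibatches(rollout):  # placeholder minibatch iterator
            log_ratio = new_log_probs(batch) - batch.old_log_probs  # placeholder policy calls
            # Low-variance approx-KL estimator: E[(r - 1) - log r], with r = exp(log_ratio).
            approx_kl = ((log_ratio.exp() - 1.0) - log_ratio).mean()
            if approx_kl > 1.5 * TARGET_KL:
                stop = True
                break  # abandon the remaining updates for this rollout
            ppo_update(batch)  # placeholder clipped-surrogate step
        if stop:
            break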

Configuration. v3 hyperparameters + target_kl=0.015, batch_size=128, n_epochs=6, augmentation enabled.

Failure mode. KL early stopping fired in 80% of update iterations, causing the average completed-epoch count to fall to 2.36/6. Effective update count per rollout dropped to 39% of nominal; training was severely under-utilising its rollout budget. Final mean return was projected to be substantially below v3.

Diagnosis. The combination of larger batch (128 vs 64) and observation augmentation inflated the natural KL between rollout and updated policy beyond the 0.0225 trigger. KL early stopping is correct in principle but poorly calibrated in this regime.

4.2 Failed attempt: Random-shift data augmentation (RAD-style)

Motivation. Laskin et al. 2020 (RAD) and Yarats et al. 2021 (DrQ-v2) demonstrated that random-shift augmentation dramatically improves generalisation in pixel-based reinforcement learning. CarRacing's procedural track generation should benefit similarly.

Configuration. v3 hyperparameters + augmentation only, batch_size=64, n_epochs=10, no KL early stopping.

Failure mode. Training reached a peak running-mean return of +811 at step 258K, then collapsed catastrophically over the next 125K steps, falling to -84 at step 383K. Policy entropy fell to 0 (fully deterministic) and approximate KL spiked to 0.82 within a single update window.

Diagnosis. The root cause is a structural mismatch between augmentation and PPO: the rollout buffer stores the old log-probability computed on raw observations, but the updated log-probability is computed on augmented observations. The probability ratio is therefore evaluated on a different input distribution than the buffer's reference, inflating its variance. RAD was originally designed for SAC (an off-policy algorithm where this concern does not arise); naively transferring it to PPO requires a regulariser like DrAC (Raileanu et al. 2020) which we did not implement.
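
The mismatch is easiest to see in how the probability ratio is assembled at update time (sketch; random_shift, policy and the buffer fields are placeholders):

    # Old log-prob was stored for the RAW observation at rollout time ...
    old_log_prob = batch.old_log_probs            # computed on batch.obs
    # ... but the new log-prob is evaluated on a randomly shifted copy.
    aug_obs = random_shift(batch.obs)             # placeholder RAD-style augmentation
    new_log_prob = policy.log_prob(aug_obs, batch.actions)
    ratio = (new_log_prob - old_log_prob).exp()
    # The two terms no longer refer to the same input, so the ratio's variance
    # is inflated by the augmentation itself, not only by the policy change.
    # Consistent fixes: compute both terms on the raw input, or add a DrAC-style
    # term that ties pi(raw) and pi(augmented) together outside the clipped loss.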

4.3 Failed attempt: gamma=0.995 + 5M-step training

Motivation. SB3 RL-Zoo's tuned CarRacing configuration uses gamma=0.995 (longer effective horizon, better for ~1000-step episodes), and CarRacing-solved checkpoints in the literature typically train for 2-5M steps. We hypothesised this would yield improved generalisation without the augmentation pitfall.

Configuration. v3 hyperparameters + gamma 0.99→0.995 + 5M total steps, no augmentation, no clip annealing.

Failure mode. Training reached a peak of +770 at step 278K, then began a slow decline. By step 405K the return had fallen to +599, with policy entropy at 0.082 and KL spikes up to 0.31 over the most recent 30 iterations. We aborted the run at 8 % progress.

Diagnosis. A larger gamma propagates reward information over a longer effective horizon (roughly 1/(1-gamma), so ~100 steps at 0.99 versus ~200 at 0.995), increasing the magnitude of advantages and amplifying the size of policy updates. Combined with PPO's already-aggressive 10 update epochs per rollout, this drove entropy collapse through the same mechanism we observed in the augmentation experiment. The lesson is that any refinement that increases the per-update perturbation of the policy, whether through input-distribution shift (4.2) or through discount-factor amplification (4.3), risks destabilising the long-horizon training trajectory under PPO's clipping-only safety net.

4.4 What this teaches us

PPO's stability is not free; it is purchased through narrow hyperparameter ranges. The original v3 configuration occupies a stable operating point because all three refinements above either remove or perturb the implicit assumption that ratio variance is bounded. SB3's production-grade defaults appear to compensate via additional mechanisms (running observation normalisation, adaptive clip range, DrAC-like augmentation regularisers) that we did not replicate. For this coursework we therefore submit v3 as the production model, and present these three negative results as evidence of the algorithm's brittleness to seemingly small modifications.


Summary table

# | Challenge                                     | Resolution                               | Key metric
1 | Single-env rollout: 20 sps, GPU util 12 %     | AsyncVectorEnv, 8 workers                | sps 20 → 95 (4.5×)
2 | Policy collapse near step 100K, entropy ~0.4  | Entropy floor 0.005 + reward floor -1.0  | min return 311 → 437 (+41 %)
3 | Final checkpoint: high mean, high variance    | Per-checkpoint held-out evaluation       | min return 327 → 505 (+54 %)

These three resolutions together account for the difference between our submitted agent (mean 830.17, std 104.79, min 436.81) and the production SB3 PPO baseline (mean 664.32, std 173.93, min 309.40).