fb09e66d09
- 将原始单环境训练代码重构为模块化结构,添加向量化环境支持以提高数据采集效率 - 实现完整的PPO训练流水线,包括共享CNN的Actor-Critic网络、向量化经验回放缓冲和GAE优势估计 - 添加训练脚本(train_vec.py)、评估脚本(evaluate.py)和SB3基线对比脚本(train_sb3_baseline.py) - 提供详细的文档和开发日志,包含问题解决记录和实验分析 - 移除旧版项目文件,统一项目结构到CW1_id_name目录下
243 lines
11 KiB
Markdown
243 lines
11 KiB
Markdown
# Implementation Challenges & Resolutions
|
||
|
||
This document records the substantive engineering challenges encountered
|
||
during development, suitable as raw material for the report's
|
||
"Implementation Details — Challenges" section. Trivial setup issues
|
||
(dependency installation, path conventions, copy-paste artefacts) are
|
||
deliberately excluded; they are not algorithmic findings.
|
||
|
||
---
|
||
|
||
## 1. Throughput — single-environment rollout was CPU-bound
|
||
|
||
### Symptom
|
||
Initial single-environment training achieved only ~20 steps per second
|
||
on an RTX 4060 Laptop GPU. Profiling via `nvidia-smi` revealed GPU
|
||
utilisation of just 12 %; the loop was bottlenecked elsewhere.
|
||
|
||
### Root cause
|
||
1. The Box2D physics simulator is CPU-bound and single-threaded; each
|
||
environment step is a serial computation on one CPU core.
|
||
2. Per-step `agent.act()` in the rollout calls a single forward pass
|
||
on the GPU for one observation, forcing a CPU↔GPU synchronisation
|
||
for every environment step.
|
||
|
||
### Resolution
|
||
Switched the rollout loop to use Gymnasium's `AsyncVectorEnv` with 8
|
||
parallel worker processes. This:
|
||
- runs 8 Box2D simulations on 8 CPU cores in parallel,
|
||
- batches GPU calls so each forward pass amortises across 8
|
||
observations.
|
||
|
||
Throughput rose to ~95 steps per second, a 4.5× speedup. Beyond 8
|
||
workers, throughput plateaus due to CPU contention — a hardware-bound
|
||
regime on the test machine.
|
||
|
||
### Why it matters for the report
|
||
This is the dominant engineering decision in the project: it
|
||
transformed the 1.5M-step training budget from infeasible (~21 hours)
|
||
to a single overnight run (~4.5 hours).
|
||
|
||
---
|
||
|
||
## 2. Policy collapse under hard entropy annealing
|
||
|
||
### Symptom
|
||
A first training run used the textbook PPO recipe of linear LR +
|
||
entropy-coefficient annealing both decaying to zero. Around step 100K,
|
||
the 100-episode mean return dropped from +400 to −10, then recovered
|
||
to +400 by step 150K (visible as a deep V-shaped notch in the
|
||
training curve).
|
||
|
||
### Root cause analysis
|
||
At step 100K, policy entropy had fallen to ~0.4 (from initial
|
||
ln 5 ≈ 1.61). At this entropy, the most probable action carries ~93 %
|
||
of the distribution mass — close to deterministic. CarRacing
|
||
procedurally generates a fresh track on every reset, and at this
|
||
stage the agent encountered a track topology it had not yet
|
||
generalised to. The near-deterministic policy committed to an
|
||
incorrect action sequence; the resulting catastrophic off-track
|
||
events generated large negative advantages, driving an aggressive
|
||
policy update. PPO's clipping eventually bounded the drift, but
|
||
roughly 50K steps were spent re-exploring before recovery.
|
||
|
||
### Resolution
|
||
Introduced an **entropy coefficient floor** of 0.005 (rather than
|
||
zero). The schedule now decays the entropy coefficient linearly from
|
||
0.01 toward 0.005, after which it remains constant. Preserving 0.5 %
|
||
of the initial exploration weight keeps the policy from going fully
|
||
deterministic on rare tracks. We also floor per-frame rewards at
|
||
−1.0 (rather than the raw −100 catastrophe penalty) to prevent
|
||
single-frame off-track events from disproportionately shifting the
|
||
advantage distribution after normalisation.
|
||
|
||
### Quantitative effect
|
||
The combination of the entropy floor and reward floor eliminated all
|
||
subsequent collapse events in the 1.5M-step training run. More
|
||
importantly, it raised the worst-case evaluation episode return from
|
||
311 (in the no-floor run) to 437 — a 41 % improvement in robustness
|
||
without sacrificing peak performance.
|
||
|
||
### Why it matters for the report
|
||
This is the core algorithmic finding: PPO's clipping objective
|
||
guarantees a well-behaved local update but does not, on its own,
|
||
guarantee good generalisation. Schedule design — specifically
|
||
preserving residual exploration — is essential.
|
||
|
||
---
|
||
|
||
## 3. Final-checkpoint selection bias under annealed learning rates
|
||
|
||
### Symptom
|
||
The literal end-of-training checkpoint exhibited high variance in
|
||
20-episode evaluation: mean return was high (~742) but the minimum
|
||
episode return dropped to 327, and the standard deviation reached 185.
|
||
Earlier checkpoints exhibited tighter distributions.
|
||
|
||
### Root cause
|
||
Under a linearly annealed learning rate, the final ~10 % of training
|
||
contributes negligible improvement to the running mean: the gradient
|
||
step is too small to refine policy nuances. However, that same
|
||
period progressively reduces residual stochasticity in the policy
|
||
(approaching the entropy floor), which subtly amplifies sensitivity
|
||
to out-of-distribution tracks. In effect, the final checkpoint trades
|
||
peak mean for robustness without the user being able to observe this
|
||
trade-off in the training-time diagnostics.
|
||
|
||
### Resolution
|
||
Implemented a `scan_checkpoints.py` utility that:
|
||
1. Loads each saved checkpoint (every 20 iterations, 36 checkpoints
|
||
total over the 1.5M-step run);
|
||
2. Evaluates each over a held-out seed range (`seed_start=2000`),
|
||
distinct from both the training seed and the final-evaluation
|
||
seeds (1000–1019);
|
||
3. Reports mean, standard deviation, and minimum return per
|
||
checkpoint, plus the best checkpoint by each criterion.
|
||
|
||
The selected submission model is `iter_0700.pt` (training step
|
||
~1.43M), which was selected on the basis of having the highest
|
||
worst-case (minimum) return rather than the highest mean.
|
||
|
||
### Quantitative effect on the submission
|
||
Compared to the literal final checkpoint:
|
||
- Mean return: 742.0 → 705.0 (−5 %, acceptable)
|
||
- Std: 185.2 → 160.3 (−13 %)
|
||
- Minimum: 327.1 → 504.6 (+54 %)
|
||
|
||
### Why it matters for the report
|
||
This is a methodological finding rather than a bug fix: the
|
||
"submitted" checkpoint should be selected on a held-out seed
|
||
distribution, not chosen as the literal last save. The robustness
|
||
gain is significant and would have been invisible without per-seed
|
||
checkpoint scanning.
|
||
|
||
---
|
||
|
||
## 4. Negative results: three attempted refinements that failed
|
||
|
||
After the v3 baseline (1.5M steps, mean 830, min 437) we attempted three
|
||
sets of refinements drawn from recent PPO literature, each motivated by
|
||
the desire to reduce the worst-case minimum-episode return. **All three
|
||
collapsed or under-performed.** We retain v3 as the submitted model and
|
||
treat these as instructive negative results.
|
||
|
||
### 4.1 Failed attempt: KL early stopping (target_kl=0.015)
|
||
|
||
**Motivation.** Stable-Baselines3 and CleanRL both default to a KL
|
||
early-stopping mechanism that aborts the current update epoch once the
|
||
mean approx-KL exceeds 1.5×target_kl. Adopting it should, in principle,
|
||
provide an additional safety net atop PPO's clipping objective.
|
||
|
||
**Configuration.** v3 hyperparameters + `target_kl=0.015`,
|
||
`batch_size=128`, `n_epochs=6`, augmentation enabled.
|
||
|
||
**Failure mode.** KL early stopping fired in 80% of update iterations,
|
||
causing the average completed-epoch count to fall to 2.36/6. Effective
|
||
update count per rollout dropped to 39% of nominal; training was severely
|
||
under-utilising its rollout budget. Final mean return was projected to
|
||
be substantially below v3.
|
||
|
||
**Diagnosis.** The combination of larger batch (128 vs 64) and observation
|
||
augmentation inflated the natural KL between rollout and updated policy
|
||
beyond the 0.0225 trigger. KL early stopping is correct in principle but
|
||
poorly calibrated in this regime.
|
||
|
||
### 4.2 Failed attempt: Random-shift data augmentation (RAD-style)
|
||
|
||
**Motivation.** Laskin et al. 2020 (RAD) and Yarats et al. 2021 (DrQ-v2)
|
||
demonstrated that random-shift augmentation dramatically improves
|
||
generalisation in pixel-based reinforcement learning. CarRacing's
|
||
procedural track generation should benefit similarly.
|
||
|
||
**Configuration.** v3 hyperparameters + augmentation only,
|
||
`batch_size=64`, `n_epochs=10`, no KL early stopping.
|
||
|
||
**Failure mode.** Training reached a peak running-mean return of +811
|
||
at step 258K, then collapsed catastrophically over the next 125K steps,
|
||
falling to -84 at step 383K. Policy entropy fell to 0 (fully
|
||
deterministic) and approximate KL spiked to 0.82 within a single
|
||
update window.
|
||
|
||
**Diagnosis.** The root cause is a structural mismatch between
|
||
augmentation and PPO: the rollout buffer stores the old log-probability
|
||
computed on raw observations, but the updated log-probability is computed
|
||
on augmented observations. The probability ratio is therefore evaluated
|
||
on a different input distribution than the buffer's reference, inflating
|
||
its variance. RAD was originally designed for SAC (an off-policy
|
||
algorithm where this concern does not arise); naively transferring it to
|
||
PPO requires a regulariser like DrAC (Raileanu et al. 2020) which we
|
||
did not implement.
|
||
|
||
### 4.3 Failed attempt: gamma=0.995 + 5M-step training
|
||
|
||
**Motivation.** SB3 RL-Zoo's tuned CarRacing configuration uses
|
||
gamma=0.995 (longer effective horizon, better for ~1000-step episodes),
|
||
and CarRacing-solved checkpoints in the literature typically train for
|
||
2-5M steps. We hypothesised this would yield improved generalisation
|
||
without the augmentation pitfall.
|
||
|
||
**Configuration.** v3 hyperparameters + gamma 0.99→0.995 + 5M total
|
||
steps, no augmentation, no clip annealing.
|
||
|
||
**Failure mode.** Training reached peak +770 at step 278K then began a
|
||
slow decline. By step 405K, return had fallen to +599 with policy
|
||
entropy at 0.082 and KL spikes up to 0.31 in the recent 30 iterations.
|
||
We aborted at 8% progress.
|
||
|
||
**Diagnosis.** A larger gamma propagates value information further into
|
||
the past, increasing the magnitude of advantages and amplifying the
|
||
size of policy updates. Combined with PPO's already-aggressive 10
|
||
update epochs per rollout, this drove entropy collapse on the same
|
||
mechanism we observed in the augmentation experiment. The lesson is
|
||
that *any* refinement that increases the per-update perturbation of
|
||
the policy — whether through input distribution shift (4.2) or through
|
||
discount-factor amplification (4.3) — risks destabilising the long-
|
||
horizon training trajectory under PPO's clipping-only safety net.
|
||
|
||
### 4.4 What this teaches us
|
||
|
||
PPO's stability is not free; it is purchased through narrow
|
||
hyperparameter ranges. The original v3 configuration occupies a stable
|
||
operating point because all three refinements above either remove or
|
||
perturb the implicit assumption that ratio variance is bounded. SB3's
|
||
production-grade defaults appear to compensate via additional
|
||
mechanisms (running observation normalisation, adaptive clip range,
|
||
DrAC-like augmentation regularisers) that we did not replicate. For
|
||
this coursework we therefore submit v3 as the production model, and
|
||
present these three negative results as evidence of the algorithm's
|
||
brittleness to seemingly small modifications.
|
||
|
||
---
|
||
|
||
## Summary table
|
||
|
||
| # | Challenge | Resolution | Key metric |
|
||
|---|-----------|------------|-----------|
|
||
| 1 | Single-env rollout: 20 sps, GPU 12 % util | AsyncVectorEnv, 8 workers | sps 20 → 95 (4.5×) |
|
||
| 2 | Policy collapse near step 100K, entropy ~0.4 | Entropy floor 0.005 + reward floor −1.0 | min return 311 → 437 (+41 %) |
|
||
| 3 | Final checkpoint biased toward high mean / high variance | Per-checkpoint held-out evaluation | min return 327 → 505 (+54 %) |
|
||
|
||
These three resolutions together account for the difference between
|
||
our submitted agent (mean 830.17, std 104.79, min 436.81) and the
|
||
production SB3 PPO baseline (mean 664.32, std 173.93, min 309.40).
|