Serendipity fb09e66d09 feat: restructure the project and add vectorised PPO training and evaluation scripts
- Refactored the original single-environment training code into a modular structure, adding vectorised-environment support to improve data-collection throughput
- Implemented the full PPO training pipeline, including a shared-CNN Actor-Critic network, a vectorised rollout buffer, and GAE advantage estimation
- Added a training script (train_vec.py), an evaluation script (evaluate.py), and an SB3 baseline-comparison script (train_sb3_baseline.py)
- Provided detailed documentation and a development log, including problem-resolution records and experiment analysis
- Removed legacy project files; consolidated the project structure under the CW1_id_name directory
2026-05-02 13:44:08 +08:00

# Implementation Challenges & Resolutions
This document records the substantive engineering challenges encountered
during development, suitable as raw material for the report's
"Implementation Details — Challenges" section. Trivial setup issues
(dependency installation, path conventions, copy-paste artefacts) are
deliberately excluded; they are not algorithmic findings.
---
## 1. Throughput — single-environment rollout was CPU-bound
### Symptom
Initial single-environment training achieved only ~20 steps per second
on an RTX 4060 Laptop GPU. Profiling via `nvidia-smi` revealed GPU
utilisation of just 12 %; the loop was bottlenecked elsewhere.
### Root cause
1. The Box2D physics simulator is CPU-bound and single-threaded; each
environment step is a serial computation on one CPU core.
2. Per-step `agent.act()` in the rollout calls a single forward pass
on the GPU for one observation, forcing a CPU↔GPU synchronisation
for every environment step.
### Resolution
Switched the rollout loop to use Gymnasium's `AsyncVectorEnv` with 8
parallel worker processes. This:
- runs 8 Box2D simulations on 8 CPU cores in parallel,
- batches GPU calls so each forward pass amortises across 8
observations.
Throughput rose to ~95 steps per second, a 4.5× speedup. Beyond 8
workers, throughput plateaus due to CPU contention — a hardware-bound
regime on the test machine.
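The rollout pattern can be sketched with a toy stand-in (pure numpy; `ToyVecEnv` and `batched_policy` are illustrative substitutes for `AsyncVectorEnv` over CarRacing and the real agent, not the project's code):

```python
import numpy as np

class ToyVecEnv:
    """Toy stand-in for gymnasium's AsyncVectorEnv: one call advances
    all N environments (in the real setup, N Box2D simulations run in
    N worker processes in parallel)."""

    def __init__(self, num_envs, obs_dim):
        self.num_envs = num_envs
        self.obs_dim = obs_dim

    def reset(self):
        return np.zeros((self.num_envs, self.obs_dim), dtype=np.float32)

    def step(self, actions):
        obs = np.random.randn(self.num_envs, self.obs_dim).astype(np.float32)
        rewards = np.ones(self.num_envs, dtype=np.float32)
        dones = np.zeros(self.num_envs, dtype=bool)
        return obs, rewards, dones, {}


def batched_policy(obs_batch, num_actions=5):
    """One forward pass over the whole (N, obs_dim) batch: a single
    CPU<->GPU round-trip amortised across all N environments, instead
    of N serial single-observation calls."""
    logits = np.zeros((obs_batch.shape[0], num_actions))  # placeholder network
    return logits.argmax(axis=1)


envs = ToyVecEnv(num_envs=8, obs_dim=4)
obs = envs.reset()
for _ in range(16):
    actions = batched_policy(obs)  # one batched call per vector step
    obs, rewards, dones, info = envs.step(actions)
print(obs.shape)  # (8, 4)
```

The key point is that both the CPU work (`step`) and the GPU work (`batched_policy`) operate on the whole batch per loop iteration.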
### Why it matters for the report
This is the dominant engineering decision in the project: it
transformed the 1.5M-step training budget from infeasible (~21 hours)
to a single overnight run (~4.5 hours).
---
## 2. Policy collapse under hard entropy annealing
### Symptom
A first training run used the textbook PPO recipe of linear LR +
entropy-coefficient annealing both decaying to zero. Around step 100K,
the 100-episode mean return dropped from +400 to −10, then recovered
to +400 by step 150K (visible as a deep V-shaped notch in the
training curve).
### Root cause analysis
At step 100K, policy entropy had fallen to ~0.4 (from initial
ln 5 ≈ 1.61). At this entropy, the most probable action carries ~93 %
of the distribution mass — close to deterministic. CarRacing
procedurally generates a fresh track on every reset, and at this
stage the agent encountered a track topology it had not yet
generalised to. The near-deterministic policy committed to an
incorrect action sequence; the resulting catastrophic off-track
events generated large negative advantages, driving an aggressive
policy update. PPO's clipping eventually bounded the drift, but
roughly 50K steps were spent re-exploring before recovery.
### Resolution
Introduced an **entropy coefficient floor** of 0.005 (rather than
zero). The schedule now decays the entropy coefficient linearly from
0.01 toward 0.005, after which it remains constant. Preserving 0.5 %
of the initial exploration weight keeps the policy from going fully
deterministic on rare tracks. We also floor per-frame rewards at
−1.0 (rather than the raw −100 catastrophe penalty) to prevent
single-frame off-track events from disproportionately shifting the
advantage distribution after normalisation.
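Both fixes reduce to two small functions (a sketch; the decay horizon below is illustrative, not the exact schedule length used in `train_vec.py`):

```python
def entropy_coef(step, total_steps, start=0.01, floor=0.005):
    """Linear decay from `start` toward `floor`, then constant.
    The floor preserves residual exploration so the policy never
    becomes fully deterministic on unfamiliar tracks."""
    frac = min(step / total_steps, 1.0)
    return max(start + frac * (floor - start), floor)


def floored_reward(r, floor=-1.0):
    """Clip single-frame catastrophe penalties (raw -100 off-track)
    so one frame cannot dominate the normalised advantages."""
    return max(r, floor)


print(entropy_coef(0, 1_500_000))          # 0.01
print(entropy_coef(1_500_000, 1_500_000))  # 0.005
print(floored_reward(-100.0))              # -1.0
```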
### Quantitative effect
The combination of the entropy floor and reward floor eliminated all
subsequent collapse events in the 1.5M-step training run. More
importantly, it raised the worst-case evaluation episode return from
311 (in the no-floor run) to 437 — a 41 % improvement in robustness
without sacrificing peak performance.
### Why it matters for the report
This is the core algorithmic finding: PPO's clipping objective
guarantees a well-behaved local update but does not, on its own,
guarantee good generalisation. Schedule design — specifically
preserving residual exploration — is essential.
---
## 3. Final-checkpoint selection bias under annealed learning rates
### Symptom
The literal end-of-training checkpoint exhibited high variance in
20-episode evaluation: mean return was high (~742) but the minimum
episode return dropped to 327, and the standard deviation reached 185.
Earlier checkpoints exhibited tighter distributions.
### Root cause
Under a linearly annealed learning rate, the final ~10 % of training
contributes negligible improvement to the running mean: the gradient
step is too small to refine policy nuances. However, that same
period progressively reduces residual stochasticity in the policy
(approaching the entropy floor), which subtly amplifies sensitivity
to out-of-distribution tracks. In effect, the final checkpoint trades
robustness for peak mean, and this trade-off is invisible in the
training-time diagnostics.
### Resolution
Implemented a `scan_checkpoints.py` utility that:
1. Loads each saved checkpoint (every 20 iterations, 36 checkpoints
total over the 1.5M-step run);
2. Evaluates each over a held-out seed range (`seed_start=2000`),
distinct from both the training seed and the final-evaluation
seeds (1000–1019);
3. Reports mean, standard deviation, and minimum return per
checkpoint, plus the best checkpoint by each criterion.
The submitted model is `iter_0700.pt` (training step ~1.43M), chosen
for the highest worst-case (minimum) return rather than the highest
mean.
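The selection criterion reduces to the following sketch (the checkpoint name `iter_0740.pt` and all per-episode returns are made-up toy numbers; in `scan_checkpoints.py` each list would come from evaluating a loaded checkpoint over the held-out seed range):

```python
def select_checkpoint(results):
    """Pick the checkpoint with the highest worst-case (minimum)
    episode return, not the highest mean.
    `results`: checkpoint name -> list of episode returns on
    held-out seeds (e.g. seed_start=2000)."""
    stats = {name: (sum(r) / len(r), min(r)) for name, r in results.items()}
    best_by_min = max(stats, key=lambda name: stats[name][1])
    return best_by_min, stats


# Toy numbers: the later checkpoint has the higher mean but a worse tail.
results = {
    "iter_0700.pt": [705.0, 640.0, 504.6],
    "iter_0740.pt": [742.0, 690.0, 327.1],
}
best, stats = select_checkpoint(results)
print(best)  # iter_0700.pt
```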
### Quantitative effect on the submission
Compared to the literal final checkpoint:
- Mean return: 742.0 → 705.0 (−5 %, acceptable)
- Std: 185.2 → 160.3 (−13 %)
- Minimum: 327.1 → 504.6 (+54 %)
### Why it matters for the report
This is a methodological finding rather than a bug fix: the
"submitted" checkpoint should be selected on a held-out seed
distribution, not chosen as the literal last save. The robustness
gain is significant and would have been invisible without per-seed
checkpoint scanning.
---
## 4. Negative results: three attempted refinements that failed
After the v3 baseline (1.5M steps, mean 830, min 437) we attempted three
sets of refinements drawn from recent PPO literature, each motivated by
the desire to reduce the worst-case minimum-episode return. **All three
collapsed or under-performed.** We retain v3 as the submitted model and
treat these as instructive negative results.
### 4.1 Failed attempt: KL early stopping (target_kl=0.015)
**Motivation.** Stable-Baselines3 and CleanRL both default to a KL
early-stopping mechanism that aborts the current update epoch once the
mean approx-KL exceeds 1.5×target_kl. Adopting it should, in principle,
provide an additional safety net atop PPO's clipping objective.
**Configuration.** v3 hyperparameters + `target_kl=0.015`,
`batch_size=128`, `n_epochs=6`, augmentation enabled.
**Failure mode.** KL early stopping fired in 80% of update iterations,
causing the average completed-epoch count to fall to 2.36/6. Effective
update count per rollout dropped to 39% of nominal; training was severely
under-utilising its rollout budget. Final mean return was projected to
be substantially below v3.
**Diagnosis.** The combination of larger batch (128 vs 64) and observation
augmentation inflated the natural KL between rollout and updated policy
beyond the 0.0225 trigger. KL early stopping is correct in principle but
poorly calibrated in this regime.
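The mechanism is a break out of the epoch loop once the mean approx-KL exceeds 1.5×target_kl (a sketch using the SB3-style KL estimator; the toy arrays below keep KL constant across epochs, whereas in real training it grows as the policy updates, which is why partial epoch counts like 2.36/6 arise):

```python
import numpy as np

def approx_kl(old_logprob, new_logprob):
    """SB3-style estimator: mean((ratio - 1) - log(ratio))."""
    log_ratio = new_logprob - old_logprob
    return float(np.mean(np.exp(log_ratio) - 1.0 - log_ratio))


def run_update(minibatches, target_kl=0.015, n_epochs=6):
    """Abort the remaining epochs once mean approx-KL exceeds
    1.5 * target_kl; returns the number of completed epochs."""
    completed = 0
    for _ in range(n_epochs):
        kls = [approx_kl(old, new) for old, new in minibatches]
        if np.mean(kls) > 1.5 * target_kl:
            break  # early stop: skip the rest of this update
        completed += 1
        # ... gradient step per minibatch would go here ...
    return completed


old = np.zeros(64)
print(run_update([(old, old)]))        # 6: no divergence, all epochs run
print(run_update([(old, old + 0.5)]))  # 0: KL ~0.149 > 0.0225, aborts at once
```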
### 4.2 Failed attempt: Random-shift data augmentation (RAD-style)
**Motivation.** Laskin et al. 2020 (RAD) and Yarats et al. 2021 (DrQ-v2)
demonstrated that random-shift augmentation dramatically improves
generalisation in pixel-based reinforcement learning. CarRacing's
procedural track generation should benefit similarly.
**Configuration.** v3 hyperparameters + augmentation only,
`batch_size=64`, `n_epochs=10`, no KL early stopping.
**Failure mode.** Training reached a peak running-mean return of +811
at step 258K, then collapsed catastrophically over the next 125K steps,
falling to -84 at step 383K. Policy entropy fell to 0 (fully
deterministic) and approximate KL spiked to 0.82 within a single
update window.
**Diagnosis.** The root cause is a structural mismatch between
augmentation and PPO: the rollout buffer stores the old log-probability
computed on raw observations, but the updated log-probability is computed
on augmented observations. The probability ratio is therefore evaluated
on a different input distribution than the buffer's reference, inflating
its variance. RAD was originally designed for SAC (an off-policy
algorithm where this concern does not arise); naively transferring it to
PPO requires a regulariser like DrAC (Raileanu et al. 2020) which we
did not implement.
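The augmentation itself is simple; the bug was applying it on only one side of the ratio. A numpy sketch of RAD-style random shift (replicate-pad then random crop; `pad=4` is illustrative), with the mismatch noted in comments:

```python
import numpy as np

def random_shift(obs, pad=4, rng=None):
    """RAD-style augmentation: pad an (H, W, C) image by `pad` pixels
    (edge replication), then randomly crop back to the original size."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = obs.shape
    padded = np.pad(obs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]


# The structural mismatch: the buffer stores log pi_old(a|s) computed on
# the RAW observation s, while the update computes log pi_new(a|s') on an
# AUGMENTED s'. The ratio exp(log pi_new(a|s') - log pi_old(a|s)) then
# compares probabilities on different inputs, inflating its variance.
obs = np.random.rand(96, 96, 3).astype(np.float32)
aug = random_shift(obs)
print(aug.shape)  # (96, 96, 3)
```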
### 4.3 Failed attempt: gamma=0.995 + 5M-step training
**Motivation.** SB3 RL-Zoo's tuned CarRacing configuration uses
gamma=0.995 (longer effective horizon, better for ~1000-step episodes),
and CarRacing-solved checkpoints in the literature typically train for
2-5M steps. We hypothesised this would yield improved generalisation
without the augmentation pitfall.
**Configuration.** v3 hyperparameters + gamma 0.99→0.995 + 5M total
steps, no augmentation, no clip annealing.
**Failure mode.** Training reached peak +770 at step 278K then began a
slow decline. By step 405K, return had fallen to +599 with policy
entropy at 0.082 and KL spikes up to 0.31 in the recent 30 iterations.
We aborted at 8% progress.
**Diagnosis.** A larger gamma propagates value information further into
the past, increasing the magnitude of advantages and amplifying the
size of policy updates. Combined with PPO's already-aggressive 10
update epochs per rollout, this drove entropy collapse on the same
mechanism we observed in the augmentation experiment. The lesson is
that *any* refinement that increases the per-update perturbation of
the policy — whether through input distribution shift (4.2) or through
discount-factor amplification (4.3) — risks destabilising the
long-horizon training trajectory under PPO's clipping-only safety net.
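The discount-factor effect can be quantified with the standard rule of thumb that a geometric discount gamma has an effective credit-assignment horizon of roughly 1/(1 − gamma), so the change from 0.99 to 0.995 doubles how far value information propagates (and with it the typical advantage magnitudes):

```python
def effective_horizon(gamma):
    """Approximate number of steps over which discounted rewards
    retain non-negligible weight: ~ 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


print(round(effective_horizon(0.99)))   # 100
print(round(effective_horizon(0.995)))  # 200
```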
### 4.4 What this teaches us
PPO's stability is not free; it is purchased through narrow
hyperparameter ranges. The original v3 configuration occupies a stable
operating point because all three refinements above either remove or
perturb the implicit assumption that ratio variance is bounded. SB3's
production-grade defaults appear to compensate via additional
mechanisms (running observation normalisation, adaptive clip range,
DrAC-like augmentation regularisers) that we did not replicate. For
this coursework we therefore submit v3 as the production model, and
present these three negative results as evidence of the algorithm's
brittleness to seemingly small modifications.
---
## Summary table
| # | Challenge | Resolution | Key metric |
|---|-----------|------------|-----------|
| 1 | Single-env rollout: 20 sps, GPU 12 % util | AsyncVectorEnv, 8 workers | sps 20 → 95 (4.5×) |
| 2 | Policy collapse near step 100K, entropy ~0.4 | Entropy floor 0.005 + reward floor 1.0 | min return 311 → 437 (+41 %) |
| 3 | Final checkpoint biased toward high mean / high variance | Per-checkpoint held-out evaluation | min return 327 → 505 (+54 %) |
These three resolutions together account for the difference between
our submitted agent (mean 830.17, std 104.79, min 436.81) and the
production SB3 PPO baseline (mean 664.32, std 173.93, min 309.40).