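feat: restructure the project and add vectorised PPO training and evaluation scripts
- Refactor the original single-environment training code into a modular structure and add vectorised-environment support to speed up data collection
- Implement the full PPO training pipeline, including a shared-CNN actor-critic network, a vectorised experience (rollout) buffer, and GAE advantage estimation
- Add the training script (train_vec.py), the evaluation script (evaluate.py), and the SB3 baseline comparison script (train_sb3_baseline.py)
- Provide detailed documentation and a development log, including a record of resolved issues and experiment analysis
- Remove the legacy project files and consolidate the project structure under the CW1_id_name directory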
@@ -0,0 +1,55 @@
# docs/ Index

Documentation and report artefacts for the DTS307TC PPO coursework.

## Final deliverables

| File | Purpose |
|------|---------|
| `CW1_REPORT_TEMPLATE.docx` | Pre-formatted Word source. IEEE style (11pt Times New Roman, 1.15 spacing, 2.5cm margins). All numbers, figures, and native equations embedded. The student fills in cover-page details and exports to PDF. |
| `generate_report_template.py` | Source script that produces the template. |

**Word count** (excluding References and Appendix): 2972 / 3000.

## Figures referenced in the report

| File | Used in | Description |
|------|---------|-------------|
| `fig_architecture.png` | Fig. 1 | Shared-CNN actor-critic architecture (1.69M params) |
| `fig_training_curves.png` | Fig. 2 | 6-panel training curves over 1.5M steps |
| `fig_eval_bar.png` | Fig. 3 | Per-episode evaluation returns on 20 unseen seeds |
| `fig_sb3_comparison.png` | Fig. 4 | Ours vs SB3 baseline diagnostics overlay |
| `demo.mp4` | Submitted alongside the zip | 25-second video of the trained agent on seed 117 (return 925.40, completed at wrapped step 187) |

## Numerical evidence

| File | Content |
|------|---------|
| `eval_summary.json` | 20-episode evaluation of `models/ppo_final.pt`. Mean 830.17 ± 104.79; min 436.81; max 914.90 |
| `eval_summary_sb3.json` | 20-episode evaluation of the SB3 baseline. Mean 664.32 ± 173.93; min 309.40; max 857.14 |
| `checkpoint_scan_vec_main_v3.json` | Per-checkpoint evaluation table; basis for selecting `iter_0700.pt` as the submitted model |

## Cross-cutting documents

| File | Content |
|------|---------|
| `development_log.md` | Step-by-step development timeline (Days 1-9) |
| `issues_and_fixes.md` | Three substantive engineering challenges resolved + three documented negative-result ablations (raw material for Section 3.4 and 4.4) |
| `submission_checklist.md` | Pre-submission verification checklist |
| `INDEX.md` | This file |

## Project state at submission

```
runs/      vec_main_v3/            main 1.5M-step training
           sb3_baseline/run_1/     SB3 baseline 500K reference

models/    ppo_final.pt            submitted agent (= iter_0700.pt selected
                                   by held-out checkpoint scanning)
           vec_main_v3/final.pt    training-end backup
           sb3_baseline/final.zip  SB3 reference

src/       eight Python modules, no SB3 imports
notebooks/ three development notebooks (env exploration, network sanity,
           evaluation)
```
@@ -0,0 +1,155 @@
[
  {
    "ckpt": "iter_0420.pt",
    "stochastic_mean": 772.8404148499792,
    "stochastic_std": 134.0469265187322,
    "stochastic_min": 550.1901140684258,
    "stochastic_returns": [
      815.8249158248987,
      914.6999999999905,
      550.1901140684258,
      885.5072463768003,
      697.9797979797816
    ],
    "deterministic_mean": NaN,
    "deterministic_std": NaN,
    "deterministic_min": NaN,
    "deterministic_returns": []
  },
  {
    "ckpt": "iter_0460.pt",
    "stochastic_mean": 727.5500057577044,
    "stochastic_std": 189.89105860046578,
    "stochastic_min": 407.2463768115959,
    "stochastic_returns": [
      846.1279461279295,
      857.4468085106251,
      614.8288973383865,
      407.2463768115959,
      912.099999999985
    ],
    "deterministic_mean": NaN,
    "deterministic_std": NaN,
    "deterministic_min": NaN,
    "deterministic_returns": []
  },
  {
    "ckpt": "iter_0500.pt",
    "stochastic_mean": 773.5455635987219,
    "stochastic_std": 163.95429075438219,
    "stochastic_min": 489.3536121672852,
    "stochastic_returns": [
      687.8787878787706,
      918.1999999999907,
      489.3536121672852,
      889.1304347825971,
      883.1649831649656
    ],
    "deterministic_mean": NaN,
    "deterministic_std": NaN,
    "deterministic_min": NaN,
    "deterministic_returns": []
  },
  {
    "ckpt": "iter_0540.pt",
    "stochastic_mean": 745.6481816342452,
    "stochastic_std": 139.64872388958386,
    "stochastic_min": 534.9809885931408,
    "stochastic_returns": [
      623.905723905707,
      825.5319148936034,
      534.9809885931408,
      867.3913043478165,
      876.4309764309588
    ],
    "deterministic_mean": NaN,
    "deterministic_std": NaN,
    "deterministic_min": NaN,
    "deterministic_returns": []
  },
  {
    "ckpt": "iter_0580.pt",
    "stochastic_mean": 884.0969293975589,
    "stochastic_std": 24.862095366596368,
    "stochastic_min": 846.7680608364823,
    "stochastic_returns": [
      896.6329966329788,
      917.9999999999906,
      846.7680608364823,
      892.7536231883943,
      866.3299663299492
    ],
    "deterministic_mean": NaN,
    "deterministic_std": NaN,
    "deterministic_min": NaN,
    "deterministic_returns": []
  },
  {
    "ckpt": "iter_0620.pt",
    "stochastic_mean": 868.8009948145111,
    "stochastic_std": 40.7446677294706,
    "stochastic_min": 815.8249158248982,
    "stochastic_returns": [
      815.8249158248982,
      878.7234042553056,
      827.7566539923755,
      920.1999999999931,
      901.4999999999828
    ],
    "deterministic_mean": NaN,
    "deterministic_std": NaN,
    "deterministic_min": NaN,
    "deterministic_returns": []
  },
  {
    "ckpt": "iter_0660.pt",
    "stochastic_mean": 848.5454627389088,
    "stochastic_std": 114.82809175856892,
    "stochastic_min": 620.5387205387041,
    "stochastic_returns": [
      620.5387205387041,
      918.8999999999909,
      880.9885931558726,
      918.1999999999929,
      904.0999999999834
    ],
    "deterministic_mean": NaN,
    "deterministic_std": NaN,
    "deterministic_min": NaN,
    "deterministic_returns": []
  },
  {
    "ckpt": "iter_0700.pt",
    "stochastic_mean": 879.5099424741011,
    "stochastic_std": 14.825654886509525,
    "stochastic_min": 864.5390070921853,
    "stochastic_returns": [
      876.4309764309584,
      864.5390070921853,
      869.5817490494093,
      907.1999999999905,
      879.7979797979622
    ],
    "deterministic_mean": NaN,
    "deterministic_std": NaN,
    "deterministic_min": NaN,
    "deterministic_returns": []
  },
  {
    "ckpt": "final.pt",
    "stochastic_mean": 845.6652607187065,
    "stochastic_std": 107.32097702884839,
    "stochastic_min": 634.0067340067171,
    "stochastic_returns": [
      634.0067340067171,
      918.1999999999908,
      880.9885931558729,
      918.699999999993,
      876.4309764309589
    ],
    "deterministic_mean": NaN,
    "deterministic_std": NaN,
    "deterministic_min": NaN,
    "deterministic_returns": []
  }
]
Binary file not shown.
@@ -0,0 +1,113 @@
# Development Log: DTS307TC PPO Coursework

This log summarises the project's incremental development. Each step
records what was built, why, and the verification used. Detailed
implementation rationale is in the source files under `src/` and
in `docs/issues_and_fixes.md`.

## Step 0: Project skeleton

Built the project scaffold under `D:/projects/CW1_xxx/`: directories
`src/`, `notebooks/`, `models/`, `runs/`, `docs/`. Created
`requirements.txt` (10 dependencies including PyTorch, Gymnasium,
OpenCV, TensorBoard, plus Stable-Baselines3 reserved exclusively for
the Section 4.3 baseline comparison). Verified GPU + Gymnasium availability
on the RTX 4060 Laptop with `torch.cuda.is_available() == True`.

## Step 1: Environment exploration

Notebook `01_explore_env.ipynb`: explored CarRacing-v3 raw
observations and action space, and established the random-policy baseline
of **−54.19 ± 5.29** over 5 episodes. Confirmed the
`Box(0,255,(96,96,3), uint8)` observation shape and the `Discrete(5)`
action space (noop, left, right, gas, brake). The reward structure is
`+1000/N` per new tile (where `N` is the number of track tiles) and
`−0.1` per frame, with a `−100` terminal penalty for off-track.

## Step 2: Environment wrappers (`src/env_wrappers.py`)

Implemented three Gymnasium wrappers applied innermost-first:
`SkipFrame(k=4)` to repeat each action across 4 raw frames;
`GrayScaleResize(84)` for RGB→grayscale plus 96→84 downsampling via
OpenCV `INTER_AREA`; `FrameStack(k=4)` to concatenate the most recent
4 grayscale frames. The final observation passed to the agent has shape
`(4, 84, 84)` and dtype `uint8`. Verified wrapped random baseline ≈ −37.
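
A minimal sketch of the frame-skipping logic follows. It is not the code in `src/env_wrappers.py`: it is written against a duck-typed environment so that it runs without Gymnasium, and it assumes that rewards are summed over the repeated frames and that the loop stops as soon as an episode ends. Both are common conventions but neither is stated above. `CountingEnv` is a hypothetical stand-in for CarRacing.

```python
class CountingEnv:
    """Hypothetical stub env: reward 1.0 per raw frame, ends after `horizon` frames."""

    def __init__(self, horizon: int = 6):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.horizon
        # Gymnasium-style 5-tuple: obs, reward, terminated, truncated, info
        return self.t, 1.0, terminated, False, {}


class SkipFrame:
    """Repeat each action for `k` raw frames and sum the rewards (assumed convention)."""

    def __init__(self, env, k: int = 4):
        self.env = env
        self.k = k

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.k):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:
                break  # never step an episode that has already ended
        return obs, total_reward, terminated, truncated, info


env = SkipFrame(CountingEnv(horizon=6), k=4)
env.reset()
print(env.step(0)[:3])  # (4, 4.0, False)
print(env.step(0)[:3])  # (6, 2.0, True)
```

The second call ends after two raw frames because the stub episode terminates, which is why a "wrapped step" count (as quoted for the demo video) can cover fewer than `k` raw frames at the end of an episode.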

## Step 3: Actor-critic network (`src/networks.py`)

Implemented a shared-CNN actor-critic following the Atari DQN topology:
three conv layers (32/64/64 channels with 8/4/3 kernels and 4/2/1
strides) plus a 512-unit FC layer, branching into a 5-logit actor head
and a scalar critic head. All layers use orthogonal initialisation
(gain √2 hidden, 0.01 actor, 1.0 critic). Total parameters: 1,687,206.
Verified initial entropy is `ln(5) ≈ 1.6094` (uniform policy).
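
The parameter total and the initial-entropy figure can be checked by hand. The sketch below assumes unpadded ("valid") convolutions on the 84×84 input and a bias term on every layer; neither detail is stated above, but under those assumptions the count reproduces the reported 1,687,206. The helper names are illustrative and are not taken from `src/networks.py`.

```python
import math


def conv_out(size: int, kernel: int, stride: int) -> int:
    """Spatial size after an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def count_params(in_channels: int = 4, size: int = 84, n_actions: int = 5) -> int:
    """Count weights + biases of the shared-CNN actor-critic described above."""
    total = 0
    channels = in_channels
    # (out_channels, kernel, stride) for the three conv layers
    for out_channels, kernel, stride in [(32, 8, 4), (64, 4, 2), (64, 3, 1)]:
        total += channels * out_channels * kernel * kernel + out_channels
        size = conv_out(size, kernel, stride)
        channels = out_channels
    flat = channels * size * size          # 64 * 7 * 7 = 3136 features
    total += flat * 512 + 512              # shared 512-unit FC layer
    total += 512 * n_actions + n_actions   # actor head (5 logits)
    total += 512 * 1 + 1                   # critic head (scalar value)
    return total


def uniform_entropy(n_actions: int) -> float:
    """Entropy in nats of a uniform categorical distribution."""
    return math.log(n_actions)


print(count_params())                 # 1687206
print(round(uniform_entropy(5), 4))   # 1.6094
```

Almost all of the parameters (about 1.6M) sit in the flatten-to-512 FC layer, which is typical for this topology.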

## Step 4: Rollout buffer + GAE (`src/vec_rollout_buffer.py`)

Implemented a vectorised rollout buffer of shape `(n_steps, n_envs, ...)`
storing observations as `uint8` (4× memory saving versus float32). GAE
recursion uses the standard backward-pass formulation
`Â_t = δ_t + γλ(1 − d_{t+1}) Â_{t+1}`, where
`δ_t = r_t + γ(1 − d_{t+1}) V(s_{t+1}) − V(s_t)`, with bootstrap from a critic
forward pass on the post-rollout state. Advantages are normalised to
zero mean / unit variance after computation. Verified with synthetic
rollouts.
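
The backward recursion can be sketched in plain Python for a single environment; the vectorised buffer would apply the same recursion independently to each of the `n_envs` columns. This is an illustration, not the code in `src/vec_rollout_buffer.py`. It assumes `dones[t]` marks that the episode ended on the transition taken at step `t`, so it plays the role of `d_{t+1}` in the formula above.

```python
from typing import List, Sequence, Tuple


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[float],
    last_value: float,
    gamma: float = 0.99,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Return (advantages, returns) using the backward GAE recursion."""
    n = len(rewards)
    advantages = [0.0] * n
    next_advantage = 0.0
    next_value = last_value  # bootstrap from the critic on the post-rollout state
    for t in reversed(range(n)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_advantage = delta + gamma * lam * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def normalise(xs: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Shift to zero mean and scale to unit variance."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / (std + eps) for x in xs]


# Synthetic rollout: reward 1 per step, zero value estimates, episode ends at t=2.
adv, ret = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0],
                       last_value=5.0, gamma=1.0, lam=1.0)
print(adv)  # [3.0, 2.0, 1.0]
```

With `gamma = lam = 1` the advantages reduce to reward-to-go, and the bootstrap value of 5.0 is ignored because the final transition is terminal. This is the kind of synthetic check the step describes.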

## Step 5: PPO agent (`src/ppo_agent.py`)

Implemented `PPOAgent` with the clipped surrogate objective, batched
`act_batch` and `evaluate_value_batch` for vectorised rollouts, and
`update_vec` performing 10 mini-batch update epochs per rollout.
Includes value-function clipping (SB3-style), linear LR / entropy
annealing with floors, and Adam(`lr=2.5e-4`, `eps=1e-5`) per the
*37 Implementation Details of PPO*. Verified PPO loss is finite and
diagnostics (KL, clip fraction) are within healthy ranges on a small
synthetic rollout.
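
For reference, the per-sample clipped surrogate can be written in a few lines. The function below is an illustrative scalar version with the clip range of 0.2 used in the main run; the actual `PPOAgent` presumably operates on batched tensors, and the function name is an assumption.

```python
import math


def clipped_surrogate(new_logp: float, old_logp: float, advantage: float,
                      clip: float = 0.2) -> float:
    """Per-sample PPO objective (to be maximised):
    min(r * A, clip(r, 1 - clip, 1 + clip) * A), with r = exp(new_logp - old_logp)."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip, min(ratio, 1.0 + clip))
    return min(ratio * advantage, clipped_ratio * advantage)


# Sweep a few ratios for a positive and a negative advantage.
for ratio in (0.5, 1.0, 1.5):
    for adv in (1.0, -1.0):
        value = clipped_surrogate(math.log(ratio), 0.0, adv)
        print(f"ratio={ratio:.1f} adv={adv:+.0f} objective={value:+.2f}")
```

The sweep shows the asymmetry of the objective: gains from moving the ratio in the favourable direction are capped at `1 ± clip`, while moves in the unfavourable direction are not clipped and are penalised in full.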

## Step 6: Training entrypoint (`train_vec.py`) + smoke tests

Implemented the full training driver using `gymnasium.vector.AsyncVectorEnv`
with 8 parallel workers. Tuned to ~95-130 steps per second (sps) on the
RTX 4060 Laptop. Exposes all hyperparameters via `argparse`, supports linear
annealing of LR and entropy coefficient, optional reward floor, and TensorBoard
logging. Smoke tests at 50K and 20K steps confirmed a positive learning
trajectory before the main run.

## Step 7: Main training: vec_main_v3 (1.5M steps, ≈ 4h 23m)

Final production training: 8 parallel envs, 256 steps per env per
rollout, batch 64, 10 epochs, γ=0.99, λ=0.95, clip=0.2, ent_floor=0.005,
reward floor at −1.0. Linear LR / entropy annealing. Final 100-episode
running mean reached **+843**. Saved 36 checkpoints; selected
`iter_0700.pt` (training step ≈1.43M) as the submission via
held-out per-checkpoint evaluation.
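
The rollout bookkeeping behind these numbers is simple arithmetic, sketched below. It assumes that "steps" counts agent steps summed over all 8 environments, that checkpoints are saved every 20 iterations (as stated in `issues_and_fixes.md`), and that a checkpoint named `iter_0700.pt` was written after iteration 700. The variable names are illustrative.

```python
N_ENVS = 8
N_STEPS = 256            # steps per env per rollout
BATCH_SIZE = 64
N_EPOCHS = 10
TOTAL_STEPS = 1_500_000
CHECKPOINT_EVERY = 20    # iterations, per issues_and_fixes.md

steps_per_iteration = N_ENVS * N_STEPS                      # transitions per rollout
minibatches_per_epoch = steps_per_iteration // BATCH_SIZE
updates_per_rollout = minibatches_per_epoch * N_EPOCHS      # gradient steps per rollout
total_iterations = TOTAL_STEPS // steps_per_iteration
checkpoint_iters = list(range(CHECKPOINT_EVERY, total_iterations + 1, CHECKPOINT_EVERY))


def env_steps_at(iteration: int) -> int:
    """Environment steps collected by the end of the given iteration."""
    return iteration * steps_per_iteration


print(steps_per_iteration)     # 2048
print(updates_per_rollout)     # 320
print(total_iterations)        # 732
print(len(checkpoint_iters))   # 36
print(env_steps_at(700))       # 1433600
```

Under these assumptions the figures are mutually consistent: 36 periodic checkpoints over 732 iterations, and iteration 700 corresponds to roughly 1.43M steps.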

## Step 8: Evaluation (`evaluate.py`, `notebooks/03_evaluate.ipynb`)

Built `src/eval_utils.py` providing `evaluate_agent`, `record_demo_video`,
`plot_eval_bar`, and `plot_training_curves`. Final 20-episode evaluation
on unseen seeds (1000–1019) yielded **mean 830.17 ± 104.79**, min
436.81, max 914.90.

## Step 9: SB3 baseline (`train_sb3_baseline.py`)

Trained Stable-Baselines3 PPO with matched core hyperparameters for
500K steps as a production-grade reference. Final 20-episode evaluation:
mean 664.32 ± 173.93, min 309.40. Our custom implementation outperforms it
on mean (+25%), std (−40%), and min (+41%).
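
The three percentages are relative changes with the SB3 baseline as the reference. A short check, using the summary statistics reported in `eval_summary.json` and `eval_summary_sb3.json` (rounded to two decimals):

```python
def pct_change(ours: float, baseline: float) -> float:
    """Relative change of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


ours = {"mean": 830.17, "std": 104.79, "min": 436.81}
sb3 = {"mean": 664.32, "std": 173.93, "min": 309.40}

for key in ("mean", "std", "min"):
    print(key, round(pct_change(ours[key], sb3[key])))
# mean 25
# std -40
# min 41
```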

## Step 10: Negative-result ablations (3 attempts)

Three further refinements drawn from the PPO literature were attempted and
documented as instructive failures (see `issues_and_fixes.md` §4):
- KL early stopping triggered in 80% of iterations under our larger batch
- RAD-style observation augmentation collapsed the policy after step 258K
- γ=0.995 + 5M steps reproduced the same collapse mechanism after step 278K

The original v3 configuration is the submitted production model.

## Final deliverables

- `models/ppo_final.pt`: submitted model (1.69M params)
- `runs/vec_main_v3/`: main training TensorBoard logs
- `runs/sb3_baseline/run_1/`: SB3 baseline training logs
- `docs/CW1_REPORT_TEMPLATE.docx`: Word source for the report PDF
- `docs/demo.mp4`: agent demo on seed 117 (return 925, 187 wrapped steps)
@@ -0,0 +1,32 @@
{
  "checkpoint": "D:\\projects\\CW1_xxx\\models\\vec_main_v3\\iter_0700.pt",
  "n_episodes": 20,
  "seed_start": 1000,
  "deterministic": false,
  "mean": 830.1724279409364,
  "std": 104.79337276485252,
  "min": 436.8098159509071,
  "max": 914.8999999999849,
  "returns": [
    859.0443686006632,
    839.1025641025492,
    707.2727272727101,
    873.3333333333223,
    914.8999999999849,
    436.8098159509071,
    874.9999999999827,
    874.1100323624435,
    871.5189873417628,
    888.8888888888717,
    891.0714285714159,
    863.5761589403863,
    852.7027027026837,
    776.0107816711404,
    859.4594594594402,
    883.6601307189337,
    890.2912621359064,
    724.101706484623,
    830.0291545189361,
    892.5650557620664
  ]
}
@@ -0,0 +1,29 @@
{
  "model": "SB3 PPO (CnnPolicy) 500K steps",
  "mean": 664.3150926449418,
  "std": 173.92591000802872,
  "min": 309.3959731543487,
  "max": 857.1428571428397,
  "returns": [
    801.0238907849651,
    489.743589743578,
    849.0909090908918,
    769.9999999999883,
    309.3959731543487,
    660.73619631901,
    857.1428571428397,
    734.9514563106644,
    808.2278481012556,
    818.5185185185022,
    596.4285714285587,
    837.0860927152211,
    768.243243243225,
    560.3773584905526,
    714.1891891891725,
    367.32026143789557,
    670.2265372168171,
    432.42320819111006,
    404.37317784255947,
    836.8029739776804
  ]
}
Binary file not shown.
|
After Width: | Height: | Size: 127 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 52 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 293 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 221 KiB |
@@ -0,0 +1,242 @@
# Implementation Challenges & Resolutions

This document records the substantive engineering challenges encountered
during development, suitable as raw material for the report's
"Implementation Details: Challenges" section. Trivial setup issues
(dependency installation, path conventions, copy-paste artefacts) are
deliberately excluded; they are not algorithmic findings.

---

## 1. Throughput: single-environment rollout was CPU-bound

### Symptom
Initial single-environment training achieved only ~20 steps per second
on an RTX 4060 Laptop GPU. Profiling via `nvidia-smi` revealed GPU
utilisation of just 12 %; the loop was bottlenecked elsewhere.

### Root cause
1. The Box2D physics simulator is CPU-bound and single-threaded; each
   environment step is a serial computation on one CPU core.
2. Per-step `agent.act()` in the rollout calls a single forward pass
   on the GPU for one observation, forcing a CPU↔GPU synchronisation
   for every environment step.

### Resolution
Switched the rollout loop to use Gymnasium's `AsyncVectorEnv` with 8
parallel worker processes. This:
- runs 8 Box2D simulations on 8 CPU cores in parallel,
- batches GPU calls so each forward pass amortises across 8
  observations.

Throughput rose to ~95 steps per second, a roughly 4.75× speedup. Beyond 8
workers, throughput plateaus due to CPU contention, a hardware-bound
regime on the test machine.

### Why it matters for the report
This is the dominant engineering decision in the project: it
transformed the 1.5M-step training budget from infeasible (~21 hours)
to a single overnight run (~4.5 hours).
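
The wall-clock estimates follow directly from the step budget and the measured throughput. A small check, assuming throughput stays constant over the run (the helper name is illustrative):

```python
def training_hours(total_steps: int, steps_per_second: float) -> float:
    """Wall-clock hours needed to collect `total_steps` at a constant throughput."""
    return total_steps / steps_per_second / 3600.0


print(round(training_hours(1_500_000, 20), 1))   # 20.8
print(round(training_hours(1_500_000, 95), 1))   # 4.4
```

The second figure agrees with the ≈ 4h 23m reported for the main run in the development log.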

---

## 2. Policy collapse under hard entropy annealing

### Symptom
A first training run used the textbook PPO recipe in which the learning
rate and the entropy coefficient are both annealed linearly to zero.
Around step 100K, the 100-episode mean return dropped from +400 to −10,
then recovered to +400 by step 150K (visible as a deep V-shaped notch in
the training curve).

### Root cause analysis
At step 100K, policy entropy had fallen to ~0.4 (from initial
ln 5 ≈ 1.61). At this entropy, the most probable action carries up to
~92 % of the distribution mass, which is close to deterministic. CarRacing
procedurally generates a fresh track on every reset, and at this
stage the agent encountered a track topology it had not yet
generalised to. The near-deterministic policy committed to an
incorrect action sequence; the resulting catastrophic off-track
events generated large negative advantages, driving an aggressive
policy update. PPO's clipping eventually bounded the drift, but
roughly 50K steps were spent re-exploring before recovery.
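
How concentrated a 5-action policy is at a given entropy can be estimated numerically. The sketch below assumes the probability mass not on the top action is spread evenly over the other four actions, which is the most concentrated case compatible with a given entropy; the real policy's distribution is not recorded here, so the result is an upper bound rather than a measurement. The function names are illustrative.

```python
import math


def entropy_with_top_prob(p_top: float, n_actions: int = 5) -> float:
    """Entropy (nats) when one action has p_top and the rest share the remainder evenly."""
    p_rest = (1.0 - p_top) / (n_actions - 1)
    h = -p_top * math.log(p_top)
    if p_rest > 0.0:
        h -= (n_actions - 1) * p_rest * math.log(p_rest)
    return h


def top_prob_for_entropy(target: float, n_actions: int = 5) -> float:
    """Bisection on p_top; entropy falls monotonically as p_top goes from 1/n to 1."""
    lo, hi = 1.0 / n_actions, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if entropy_with_top_prob(mid, n_actions) > target:
            lo = mid   # still too random: the top action needs more mass
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(round(entropy_with_top_prob(0.2), 4))   # 1.6094, the uniform policy
print(round(top_prob_for_entropy(0.4), 3))
```

At an entropy of 0.4 nats the top action holds a little over nine-tenths of the mass under this assumption, so sampled actions rarely deviate from the greedy choice.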

### Resolution
Introduced an **entropy coefficient floor** of 0.005 (rather than
zero). The schedule now decays the entropy coefficient linearly from
0.01 until it reaches 0.005, after which it remains constant. Preserving
half (50 %) of the initial exploration weight keeps the policy from going
fully deterministic on rare tracks. We also floor per-frame rewards at
−1.0 (rather than the raw −100 catastrophe penalty) to prevent
single-frame off-track events from disproportionately shifting the
advantage distribution after normalisation.
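
A minimal sketch of the two floors is given below. It assumes the schedule keeps the original linear decay towards zero and simply clamps it at the floor, so the floor is reached halfway through training; the actual schedule in `train_vec.py` may be parameterised differently. Function names and defaults are illustrative.

```python
def annealed_coef(step: int, total_steps: int,
                  start: float = 0.01, floor: float = 0.005) -> float:
    """Linear decay from `start` towards zero, clamped from below at `floor`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max(start * (1.0 - progress), floor)


def floor_reward(reward: float, floor: float = -1.0) -> float:
    """Clamp a per-frame reward from below (e.g. the raw -100 off-track penalty)."""
    return max(reward, floor)


total = 1_500_000
for step in (0, 375_000, 750_000, 1_500_000):
    print(step, annealed_coef(step, total))

print(floor_reward(-100.0))  # -1.0
print(floor_reward(3.2))     # 3.2
```

Only the lower tail of the reward is touched, so tile rewards and the small per-frame time penalty pass through unchanged.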

### Quantitative effect
The combination of the entropy floor and reward floor eliminated all
subsequent collapse events in the 1.5M-step training run. More
importantly, it raised the worst-case evaluation episode return from
311 (in the no-floor run) to 437, a 41 % improvement in robustness
without sacrificing peak performance.

### Why it matters for the report
This is the core algorithmic finding: PPO's clipped objective
keeps each local update well-behaved but does not, on its own,
guarantee good generalisation. Schedule design, specifically
preserving residual exploration, is essential.

---

## 3. Final-checkpoint selection bias under annealed learning rates

### Symptom
The literal end-of-training checkpoint exhibited high variance in
20-episode evaluation: mean return was high (~742) but the minimum
episode return dropped to 327, and the standard deviation reached 185.
Earlier checkpoints exhibited tighter distributions.

### Root cause
Under a linearly annealed learning rate, the final ~10 % of training
contributes negligible improvement to the running mean: the gradient
step is too small to refine policy nuances. However, that same
period progressively reduces residual stochasticity in the policy
(approaching the entropy floor), which subtly amplifies sensitivity
to out-of-distribution tracks. In effect, the final checkpoint trades
robustness for peak mean without the user being able to observe this
trade-off in the training-time diagnostics.

### Resolution
Implemented a `scan_checkpoints.py` utility that:
1. Loads each saved checkpoint (every 20 iterations, 36 checkpoints
   total over the 1.5M-step run);
2. Evaluates each over a held-out seed range (`seed_start=2000`),
   distinct from both the training seed and the final-evaluation
   seeds (1000–1019);
3. Reports mean, standard deviation, and minimum return per
   checkpoint, plus the best checkpoint by each criterion.

The submitted model is `iter_0700.pt` (training step ~1.43M), chosen
because it has the highest worst-case (minimum) return rather than the
highest mean.
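
The selection rule in step 3 amounts to taking a maximum over the scan records under different criteria. The sketch below is not `scan_checkpoints.py`; it uses three rows rounded from the 5-episode scan in `checkpoint_scan_vec_main_v3.json` (whose field names it reuses), and the helper `best_checkpoint` is hypothetical.

```python
from typing import Dict, List

# Rounded excerpts from checkpoint_scan_vec_main_v3.json (5 episodes per checkpoint).
records: List[Dict] = [
    {"ckpt": "iter_0580.pt", "stochastic_mean": 884.10, "stochastic_min": 846.77},
    {"ckpt": "iter_0700.pt", "stochastic_mean": 879.51, "stochastic_min": 864.54},
    {"ckpt": "final.pt", "stochastic_mean": 845.67, "stochastic_min": 634.01},
]


def best_checkpoint(rows: List[Dict], criterion: str) -> str:
    """Name of the checkpoint that maximises the given field."""
    return max(rows, key=lambda row: row[criterion])["ckpt"]


print(best_checkpoint(records, "stochastic_mean"))  # iter_0580.pt
print(best_checkpoint(records, "stochastic_min"))   # iter_0700.pt
```

The two criteria disagree on these rows, which is why the choice of criterion (worst-case rather than mean) has to be stated explicitly.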

### Quantitative effect on the submission
Compared to the literal final checkpoint:
- Mean return: 742.0 → 705.0 (−5 %, acceptable)
- Std: 185.2 → 160.3 (−13 %)
- Minimum: 327.1 → 504.6 (+54 %)

### Why it matters for the report
This is a methodological finding rather than a bug fix: the
"submitted" checkpoint should be selected on a held-out seed
distribution, not chosen as the literal last save. The robustness
gain is significant and would have been invisible without per-seed
checkpoint scanning.

---

## 4. Negative results: three attempted refinements that failed

After the v3 baseline (1.5M steps, mean 830, min 437) we attempted three
sets of refinements drawn from recent PPO literature, each motivated by
the desire to raise the worst-case (minimum) episode return. **All three
collapsed or under-performed.** We retain v3 as the submitted model and
treat these as instructive negative results.

### 4.1 Failed attempt: KL early stopping (target_kl=0.015)

**Motivation.** Stable-Baselines3 and CleanRL both offer an optional KL
early-stopping mechanism (`target_kl`, disabled by default) that stops the
update phase for the current rollout once the mean approx-KL exceeds a
threshold (1.5×target_kl in SB3). Adopting it should, in principle,
provide an additional safety net atop PPO's clipped objective.

**Configuration.** v3 hyperparameters + `target_kl=0.015`,
`batch_size=128`, `n_epochs=6`, augmentation enabled.

**Failure mode.** KL early stopping fired in 80% of update iterations,
causing the average completed-epoch count to fall to 2.36/6. Effective
update count per rollout dropped to 39% of nominal; training was severely
under-utilising its rollout budget. Final mean return was projected to
be substantially below v3.

**Diagnosis.** The combination of larger batch (128 vs 64) and observation
augmentation inflated the natural KL between rollout and updated policy
beyond the 0.0225 trigger (1.5 × 0.015). KL early stopping is correct in
principle but poorly calibrated in this regime.
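
The mechanism can be sketched as follows. The estimator `mean((r − 1) − log r)` is the approximate-KL form used by both SB3 and CleanRL; the per-epoch KL values in the example are invented for illustration, and the check is simplified to once per epoch (SB3 checks per mini-batch). The function names are assumptions, not the project's API.

```python
import math
from typing import Sequence


def approx_kl(new_logps: Sequence[float], old_logps: Sequence[float]) -> float:
    """Approximate KL: mean((r - 1) - log r), with r = exp(new_logp - old_logp)."""
    log_ratios = [n - o for n, o in zip(new_logps, old_logps)]
    return sum((math.exp(lr) - 1.0) - lr for lr in log_ratios) / len(log_ratios)


def epochs_completed(epoch_kls: Sequence[float], target_kl: float = 0.015,
                     n_epochs: int = 6) -> int:
    """Count update epochs that run before the KL trigger (1.5 * target_kl) fires."""
    threshold = 1.5 * target_kl  # 0.0225 for target_kl = 0.015
    completed = 0
    for kl in epoch_kls[:n_epochs]:
        if kl > threshold:
            break  # abandon the remaining epochs for this rollout
        completed += 1
    return completed


# Hypothetical KL trace for one rollout: the trigger fires in the third epoch.
print(epochs_completed([0.008, 0.019, 0.031, 0.040, 0.052, 0.060]))  # 2
print(approx_kl([-1.0, -2.0], [-1.0, -2.0]))                          # 0.0
```

An average of 2.36 completed epochs out of 6, as reported above, corresponds to 2.36 / 6 ≈ 39 % of the nominal update count.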

### 4.2 Failed attempt: Random-shift data augmentation (RAD-style)

**Motivation.** Laskin et al. 2020 (RAD) and Yarats et al. 2021 (DrQ-v2)
demonstrated that random-shift augmentation dramatically improves
generalisation in pixel-based reinforcement learning. CarRacing's
procedural track generation should benefit similarly.

**Configuration.** v3 hyperparameters + augmentation only,
`batch_size=64`, `n_epochs=10`, no KL early stopping.

**Failure mode.** Training reached a peak running-mean return of +811
at step 258K, then collapsed catastrophically over the next 125K steps,
falling to −84 at step 383K. Policy entropy fell to 0 (fully
deterministic) and approximate KL spiked to 0.82 within a single
update window.

**Diagnosis.** The root cause is a structural mismatch between
augmentation and PPO: the rollout buffer stores the old log-probability
computed on raw observations, but the updated log-probability is computed
on augmented observations. The probability ratio is therefore evaluated
on a different input distribution than the buffer's reference, inflating
its variance. Random-shift augmentation is best established for off-policy
algorithms such as SAC, where this concern does not arise; naively
transferring it to PPO requires a regulariser like DrAC (Raileanu et al.
2020), which we did not implement.
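
The mismatch can be seen in a toy example. With unchanged parameters, the probability ratio on the raw observation is exactly 1, but evaluating the new log-probability on a shifted observation moves the ratio away from 1 before any gradient step has been taken. The 2-action linear policy, the 4-element "observation", and the one-position roll standing in for a random shift are all invented for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def action_logp(weights: Sequence[Sequence[float]], obs: Sequence[float],
                action: int) -> float:
    """Log-probability of `action` under a linear-softmax policy."""
    logits = [sum(w * x for w, x in zip(row, obs)) for row in weights]
    return log_softmax(logits)[action]


weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # toy 2-action policy
obs = [0.0, 1.0, 0.0, 0.5]
shifted = obs[-1:] + obs[:-1]  # toy stand-in for a one-pixel random shift
action = 1

old_logp = action_logp(weights, obs, action)  # what the rollout buffer stores
ratio_raw = math.exp(action_logp(weights, obs, action) - old_logp)
ratio_aug = math.exp(action_logp(weights, shifted, action) - old_logp)

print(ratio_raw)             # 1.0
print(round(ratio_aug, 3))   # noticeably different from 1
```

In real training the deviation depends on how sensitive the policy is to small shifts, but the structural point is the same: the ratio mixes a parameter change with an input change.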

### 4.3 Failed attempt: gamma=0.995 + 5M-step training

**Motivation.** SB3 RL-Zoo's tuned CarRacing configuration uses
gamma=0.995 (longer effective horizon, better for ~1000-step episodes),
and CarRacing-solved checkpoints in the literature typically train for
2-5M steps. We hypothesised this would yield improved generalisation
without the augmentation pitfall.

**Configuration.** v3 hyperparameters + gamma 0.99→0.995 + 5M total
steps, no augmentation, no clip annealing.

**Failure mode.** Training reached peak +770 at step 278K then began a
slow decline. By step 405K, return had fallen to +599 with policy
entropy at 0.082 and KL spikes up to 0.31 in the recent 30 iterations.
We aborted at 8% progress.

**Diagnosis.** A larger gamma weights distant future rewards more heavily,
propagating value information further back in time, which increases the
magnitude of advantages and amplifies the size of policy updates. Combined
with PPO's already-aggressive 10 update epochs per rollout, this drove
entropy collapse through the same mechanism we observed in the augmentation
experiment. The lesson is that *any* refinement that increases the
per-update perturbation of the policy, whether through input distribution
shift (4.2) or through discount-factor amplification (4.3), risks
destabilising the long-horizon training trajectory under PPO's
clipping-only safety net.
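
The effect of the discount factor on the horizon can be quantified with the usual rule of thumb `1 / (1 − γ)`. The sketch below measures the horizon in agent steps and also shows how much weight a reward 200 steps ahead still receives; the helper names are illustrative.

```python
def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb horizon 1 / (1 - gamma): the sum of the discount weights."""
    return 1.0 / (1.0 - gamma)


def discount_weight(gamma: float, k: int) -> float:
    """Weight given to a reward received k steps in the future."""
    return gamma ** k


print(round(effective_horizon(0.99)))    # 100
print(round(effective_horizon(0.995)))   # 200
print(round(discount_weight(0.99, 200), 3), round(discount_weight(0.995, 200), 3))
```

Moving from 0.99 to 0.995 doubles the effective horizon, so returns (and therefore unnormalised advantages) aggregate roughly twice as many future rewards.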

### 4.4 What this teaches us

PPO's stability is not free; it is purchased through narrow
hyperparameter ranges. The original v3 configuration occupies a stable
operating point, and each of the three refinements above perturbs the
implicit assumption that ratio variance is bounded. Production-grade
setups appear to compensate via additional mechanisms (for example
running observation normalisation, a scheduled clip range, or
DrAC-like augmentation regularisers) that we did not replicate. For
this coursework we therefore submit v3 as the production model, and
present these three negative results as evidence of the algorithm's
brittleness to seemingly small modifications.

---

## Summary table

| # | Challenge | Resolution | Key metric |
|---|-----------|------------|-----------|
| 1 | Single-env rollout: 20 sps, GPU 12 % util | AsyncVectorEnv, 8 workers | sps 20 → 95 (4.75×) |
| 2 | Policy collapse near step 100K, entropy ~0.4 | Entropy floor 0.005 + reward floor −1.0 | min return 311 → 437 (+41 %) |
| 3 | Final checkpoint biased toward high mean / high variance | Per-checkpoint held-out evaluation | min return 327 → 505 (+54 %) |

These three resolutions, together with the longer training budget (1.5M
versus 500K steps), plausibly account for the difference between
our submitted agent (mean 830.17, std 104.79, min 436.81) and the
production SB3 PPO baseline (mean 664.32, std 173.93, min 309.40).
@@ -0,0 +1,150 @@
# Submission Checklist

**Check every item** before the final submission to avoid losing marks on formatting.

## 1. File naming

- [ ] zip file name: `CW1_<StudentID>_<NamePinyin>.zip`
  - e.g. `CW1_2012345_ZhangSan.zip`
- [ ] PDF file name: `CW1_<StudentID>_<NamePinyin>.pdf`
- [ ] Student ID and name spelling are **consistent throughout** (zip / pdf / report cover page)
- [ ] **Do not put the PDF inside the zip!** Upload the two files separately

## 2. zip contents (check before submitting)

```
CW1_<ID>_<Name>.zip
├── README.md                  ✅
├── requirements.txt           ✅
├── train.py                   ✅ single-environment legacy
├── train_vec.py               ✅ main training script
├── train_sb3_baseline.py      ✅ SB3 baseline
├── evaluate.py                ✅ evaluation script
├── scan_checkpoints.py        ✅ checkpoint scan
├── src/
│   ├── __init__.py
│   ├── env_wrappers.py
│   ├── vec_env_wrappers.py
│   ├── networks.py
│   ├── rollout_buffer.py
│   ├── vec_rollout_buffer.py
│   ├── ppo_agent.py
│   ├── eval_utils.py
│   └── utils.py
├── notebooks/
│   ├── 01_explore_env.ipynb
│   ├── 02_test_network.ipynb
│   ├── 03_test_buffer.ipynb
│   ├── 04_test_ppo.ipynb
│   └── 05_evaluate.ipynb
├── models/
│   └── ppo_final.pt           ⭐ best checkpoint; the only one kept, after renaming
├── runs/
│   └── vec_main_v3/           ⭐ main-training TensorBoard logs
└── docs/                      ✅ report material (may all be kept)
    ├── step00_skeleton.md
    ├── step01_env_exploration.md
    ├── ...
    ├── step07_evaluation.md
    ├── issues_and_fixes.md
    ├── report_outline.md
    ├── eval_summary.json
    ├── checkpoint_scan_*.json
    ├── fig_eval_bar.png
    ├── fig_training_curves.png
    └── demo.mp4
```

## 3. What the zip must not contain (delete before submitting)

- [ ] `__pycache__/` directories (may exist under src/; delete them)
- [ ] `*.pyc` files
- [ ] `.ipynb_checkpoints/` directories
- [ ] Unneeded logs such as `runs/smoke_test/`, `runs/smoke_v2/`, `runs/n8_speed_test/`, `runs/vec_smoke*/`
- [ ] Unneeded checkpoints such as `models/main_v1_baseline/`, `models/smoke_*/`, `models/n8_speed_test/`
- [ ] All intermediate checkpoints in `models/vec_main_v3/iter_*.pt` except the best one
- [ ] IDE-generated directories such as `anaconda_projects/` (if present)

### One-shot cleanup commands

```powershell
cd D:\projects\CW1_xxx

# 1. Copy the best checkpoint to ppo_final.pt
Copy-Item models\vec_main_v3\<best_iter>.pt models\ppo_final.pt

# 2. Delete intermediate checkpoints
Remove-Item -Recurse -Force models\vec_main_v3
Remove-Item -Recurse -Force models\main_v1_baseline -ErrorAction SilentlyContinue
Remove-Item -Recurse -Force models\smoke_test, models\smoke_v2, models\n8_speed_test, models\vec_smoke -ErrorAction SilentlyContinue

# 3. Delete unneeded runs
Remove-Item -Recurse -Force runs\smoke_test, runs\smoke_v2, runs\n8_speed_test, runs\vec_smoke, runs\vec_smoke_v3, runs\vec_smoke_v3b -ErrorAction SilentlyContinue

# 4. Rename vec_main_v3 to main (clearer for submission)
Move-Item runs\vec_main_v3 runs\main

# 5. Delete __pycache__ and .ipynb_checkpoints
Get-ChildItem -Recurse -Force -Include "__pycache__",".ipynb_checkpoints" | Remove-Item -Recurse -Force
Get-ChildItem -Recurse -Filter "*.pyc" | Remove-Item -Force
```

## 4. PDF report content check

- [ ] **Page 1 is the cover page** and includes the student ID
- [ ] **Word count ≤ 3000** (excluding References and Appendix)
- [ ] All 5 sections are present: Introduction / Methodology / Implementation Details /
  Results and Analysis / Conclusion
- [ ] **3 key figures**: training curves, evaluation bar chart, SB3 comparison (already in fig_training_curves.png)
- [ ] **Hyperparameter table** (Table 1 in Section 3.3)
- [ ] **Network architecture diagram** (hand-drawn or made in PowerPoint)
- [ ] **References**: at least 3-5 entries (PPO + GAE + Gymnasium documentation)
- [ ] **PDF text is sharp**; every axis label / legend in the figures is readable
- [ ] **The PDF opens on another computer** (not corrupted)

## 5. Code reproducibility (critical)

Before submitting, run this test in a **different directory** or on a **different computer**:

```powershell
# Assume the zip has been extracted to a new location
cd C:\test_dir\CW1_<ID>_<Name>

# 1. Install dependencies
pip install -r requirements.txt

# 2. Model-loading test
python -c "from src.ppo_agent import PPOAgent; agent = PPOAgent(); agent.load('models/ppo_final.pt'); print('OK')"

# 3. Run the evaluation (at least 5 episodes)
python evaluate.py --ckpt models/ppo_final.pt --episodes 5
```

If all three steps pass, the submission is **reproducible**.

## 6. Academic integrity

- [ ] There are **no** `from stable_baselines3 import` statements anywhere under `src/`
  - Verify: `Get-ChildItem src\ -Recurse -Filter *.py | Select-String "stable_baselines3"`
- [ ] `train_sb3_baseline.py` is **explicitly marked as baseline only** in the report
- [ ] All external code inspiration (CleanRL, the PPO paper, the 37-details blog post) is **listed** in the report's References
- [ ] Tick "yes" on the report cover to consent to anonymised use in teaching (optional, at your discretion)
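
For machines without PowerShell, the same check can be approximated in Python. The script below is a hedged sketch and is not part of the submitted code: unlike the `Select-String` command, which matches any occurrence of the string, it only flags lines that are import statements, and the function name `find_sb3_imports` is hypothetical.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Matches "import stable_baselines3..." and "from stable_baselines3... import ..."
SB3_IMPORT = re.compile(r"^\s*(from|import)\s+stable_baselines3\b")


def find_sb3_imports(root: str) -> List[Tuple[str, int, str]]:
    """Return (path, line number, line) for every SB3 import under `root`."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    hits = []
    for path in sorted(root_path.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if SB3_IMPORT.match(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    offenders = find_sb3_imports("src")
    for path, lineno, line in offenders:
        print(f"{path}:{lineno}: {line}")
    print("OK: no SB3 imports" if not offenders else f"{len(offenders)} SB3 import(s) found")
```

Run it from the project root; an empty result corresponds to the PowerShell command printing nothing.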
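
## 7. After uploading to Learning Mall

- [ ] **Download the zip and pdf** and verify that the files are complete and uncorrupted
- [ ] Reopen the PDF on a clean computer for a final look
- [ ] Save a screenshot of the submission confirmation page (in case the system crashes)

## 8. Timeline (deadline 2026-05-04 23:59)

- Finish everything at least **48 hours before** the deadline (i.e. by noon on 2026-05-02)
- **Do not** leave it until deadline day; uploads to Learning Mall often fail close to the deadline
- Leave a 1-2 day buffer for fixing bugs / revising the report

## 9. Emergency fallbacks

- If the vec_main_v3 training crashes → use the data in `runs/main_v1_baseline/` + `models/main_v1_baseline/`
  - State this honestly in the report (early stop at 305K steps)
- If the SB3 baseline does not finish → drop the comparison from report Section 4.3 and replace it with "plan to compare in future work"
- If the PDF exceeds the word limit → cut secondary details from Implementation Details; keep Methodology and Results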