feat: restructure the project and add vectorized PPO training and evaluation scripts

- Refactor the original single-environment training code into a modular structure and add vectorized-environment support to improve data-collection throughput
- Implement the full PPO training pipeline, including a shared-CNN actor-critic network, a vectorized rollout buffer, and GAE advantage estimation
- Add the training script (train_vec.py), evaluation script (evaluate.py), and SB3 baseline comparison script (train_sb3_baseline.py)
- Provide detailed documentation and a development log, including issue-and-fix records and experiment analysis
- Remove legacy project files and consolidate the project structure under the CW1_id_name directory
2026-05-02 13:44:08 +08:00
parent 79ffb90823
commit fb09e66d09
80 changed files with 2971 additions and 4822 deletions
# Development Log — DTS307TC PPO Coursework
This log summarises the project's incremental development. Each step
records what was built, why, and the verification used. Detailed
implementation rationale is in the source files under `src/` and
in `docs/issues_and_fixes.md`.
## Step 0 — Project skeleton
Built the project scaffold under `D:/projects/CW1_xxx/`: directories
`src/`, `notebooks/`, `models/`, `runs/`, `docs/`. Created
`requirements.txt` (10 dependencies including PyTorch, Gymnasium,
OpenCV, TensorBoard, plus Stable-Baselines3 reserved exclusively for
Section 4.3 baseline comparison). Verified GPU + Gymnasium availability
on RTX 4060 Laptop with `torch.cuda.is_available() == True`.
## Step 1 — Environment exploration
Notebook `01_explore_env.ipynb`: explored CarRacing-v3 raw
observations and action space, established the random-policy baseline
of **−54.19 ± 5.29** over 5 episodes. Confirmed the `Box(0, 255, (96, 96, 3),
uint8)` observation space and `Discrete(5)` action space (noop, left,
right, gas, brake). The reward is `+1000/N` per newly visited tile
(N = total tiles on the track) and `−0.1` per frame, with a `−100`
terminal penalty for driving off-track.
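
A minimal sketch of the baseline measurement, using only the standard Gymnasium API (the notebook's exact cells and seeding scheme are not reproduced here):

```python
import gymnasium as gym
import numpy as np

env = gym.make("CarRacing-v3", continuous=False)   # discrete-action variant
print(env.observation_space)    # Box(0, 255, (96, 96, 3), uint8)
print(env.action_space)         # Discrete(5): noop, left, right, gas, brake

returns = []
for episode in range(5):
    obs, info = env.reset(seed=episode)
    done, ep_return = False, 0.0
    while not done:
        action = env.action_space.sample()          # uniform random policy
        obs, reward, terminated, truncated, info = env.step(action)
        ep_return += reward
        done = terminated or truncated
    returns.append(ep_return)

print(f"random baseline: {np.mean(returns):.2f} +/- {np.std(returns):.2f}")
env.close()
```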
## Step 2 — Environment wrappers (`src/env_wrappers.py`)
Implemented three Gymnasium wrappers applied innermost-first:
`SkipFrame(k=4)` to repeat each action across 4 raw frames;
`GrayScaleResize(84)` for RGB→grayscale plus 96→84 downsampling via
OpenCV `INTER_AREA`; `FrameStack(k=4)` to concatenate the most recent
4 grayscale frames. Final observation passed to the agent is shape
`(4, 84, 84) uint8`. Verified the wrapped random-policy baseline at ≈ −37.
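
A sketch of how the three wrappers compose. The wrapper class names and parameters come from `src/env_wrappers.py` as described above; the `make_wrapped_env` factory name and the constructor argument order are assumptions for illustration:

```python
import gymnasium as gym
from src.env_wrappers import SkipFrame, GrayScaleResize, FrameStack

def make_wrapped_env(render_mode=None):
    """Apply the wrappers innermost-first: skip -> grayscale/resize -> stack."""
    env = gym.make("CarRacing-v3", continuous=False, render_mode=render_mode)
    env = SkipFrame(env, k=4)         # repeat each action across 4 raw frames
    env = GrayScaleResize(env, 84)    # RGB -> grayscale, 96x96 -> 84x84 (INTER_AREA)
    env = FrameStack(env, k=4)        # stack the 4 most recent grayscale frames
    return env                        # observations: (4, 84, 84) uint8
```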
## Step 3 — Actor-critic network (`src/networks.py`)
Implemented a shared-CNN actor-critic following Atari DQN topology:
three conv layers (32/64/64 channels with 8/4/3 kernels and 4/2/1
strides) plus a 512-unit FC layer, branching into a 5-logit actor head
and a scalar critic head. All layers use orthogonal initialisation
(gain √2 hidden, 0.01 actor, 1.0 critic). Total parameters: 1,687,206.
Verified initial entropy is `ln(5) ≈ 1.6094` (uniform policy).
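
A sketch matching this description (and reproducing the 1,687,206-parameter count); the class name `ActorCritic`, the `ortho` helper, and the in-forward `/255` scaling are illustrative choices, not necessarily what `src/networks.py` does:

```python
import math
import torch
import torch.nn as nn

def ortho(layer, gain):
    """Orthogonal weight init with zero bias, per the scheme above."""
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

class ActorCritic(nn.Module):
    def __init__(self, n_actions=5):
        super().__init__()
        self.body = nn.Sequential(
            ortho(nn.Conv2d(4, 32, 8, stride=4), math.sqrt(2)), nn.ReLU(),
            ortho(nn.Conv2d(32, 64, 4, stride=2), math.sqrt(2)), nn.ReLU(),
            ortho(nn.Conv2d(64, 64, 3, stride=1), math.sqrt(2)), nn.ReLU(),
            nn.Flatten(),
            ortho(nn.Linear(64 * 7 * 7, 512), math.sqrt(2)), nn.ReLU(),
        )
        self.actor = ortho(nn.Linear(512, n_actions), 0.01)   # policy logits
        self.critic = ortho(nn.Linear(512, 1), 1.0)           # state value

    def forward(self, obs):
        x = self.body(obs.float() / 255.0)   # uint8 -> [0, 1]
        return self.actor(x), self.critic(x)

# Sanity checks matching the log: ~1.69M parameters, initial entropy ln(5).
net = ActorCritic()
print(sum(p.numel() for p in net.parameters()))                   # 1687206
logits, value = net(torch.zeros(1, 4, 84, 84, dtype=torch.uint8))
print(torch.distributions.Categorical(logits=logits).entropy())   # ~1.6094
```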
## Step 4 — Rollout buffer + GAE (`src/vec_rollout_buffer.py`)
Implemented a vectorised rollout buffer of shape `(n_steps, n_envs, ...)`
storing observations as `uint8` (4× memory saving versus float32). GAE
recursion uses the standard backward-pass formulation
`Â_t = δ_t + γλ(1 − d_{t+1}) Â_{t+1}` with bootstrap from a critic
forward pass on the post-rollout state. Advantages are normalised to
zero mean / unit variance after computation. Verified with synthetic
rollouts.
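
The backward recursion as a NumPy sketch over a `(n_steps, n_envs)` rollout; the function name and array layout are assumptions, and the real buffer in `src/vec_rollout_buffer.py` stores more fields:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Backward-pass GAE; `last_value` is the critic's bootstrap estimate
    for the post-rollout state, `dones[t]` flags episode end after step t."""
    n_steps, n_envs = rewards.shape
    advantages = np.zeros((n_steps, n_envs), dtype=np.float32)
    next_adv = np.zeros(n_envs, dtype=np.float32)
    next_value = last_value
    for t in reversed(range(n_steps)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_adv = delta + gamma * lam * not_done * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    returns = advantages + values
    # Normalise advantages to zero mean / unit variance after computation.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```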
## Step 5 — PPO agent (`src/ppo_agent.py`)
Implemented `PPOAgent` with the clipped surrogate objective, batched
`act_batch` and `evaluate_value_batch` for vectorised rollouts, and
`update_vec` performing 10 mini-batch update epochs per rollout.
Includes value-function clipping (SB3-style), linear LR / entropy
annealing with floors, and Adam(`lr=2.5e-4`, `eps=1e-5`) per the
*37 Implementation Details of PPO*. Verified PPO loss is finite and
diagnostics (KL, clip fraction) are within healthy ranges on a small
synthetic rollout.
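
A sketch of the loss terms described above; names are illustrative, and the mini-batching, entropy bonus, gradient clipping, and annealing handled by the real `update_vec` are omitted:

```python
import torch

def ppo_losses(new_logp, old_logp, adv, new_value, old_value, returns,
               clip_eps=0.2):
    ratio = torch.exp(new_logp - old_logp)

    # Clipped surrogate objective (pessimistic maximum of the negated terms).
    pg1 = -adv * ratio
    pg2 = -adv * torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = torch.max(pg1, pg2).mean()

    # SB3-style value clipping around the old value prediction.
    v_clipped = old_value + torch.clamp(new_value - old_value, -clip_eps, clip_eps)
    value_loss = torch.max((new_value - returns) ** 2,
                           (v_clipped - returns) ** 2).mean()

    # Diagnostics mentioned in the log.
    approx_kl = (old_logp - new_logp).mean()
    clip_frac = ((ratio - 1.0).abs() > clip_eps).float().mean()
    return policy_loss, value_loss, approx_kl, clip_frac
```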
## Step 6 — Training entrypoint (`train_vec.py`) + smoke tests
Implemented the full training driver using `gymnasium.vector.AsyncVectorEnv`
with 8 parallel workers. Tuned to ~95-130 sps on the RTX 4060 Laptop.
Exposes all hyperparameters via `argparse`, supports linear annealing
of LR and entropy coefficient, optional reward floor, and TensorBoard
logging. Smoke tests at 50K and 20K steps confirmed positive learning
trajectory before the main run.
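
A sketch of the vectorised data-collection setup, assuming the `make_wrapped_env` factory from the Step 2 sketch; the real `train_vec.py` also wires seeding, annealing schedules, checkpointing, and TensorBoard logging around this:

```python
import gymnasium as gym

# 8 parallel workers, each running the wrapped CarRacing-v3 env in a subprocess.
vec_env = gym.vector.AsyncVectorEnv([make_wrapped_env for _ in range(8)])

obs, infos = vec_env.reset(seed=0)         # worker i is seeded with seed + i
print(obs.shape)                           # (8, 4, 84, 84) uint8 batch
actions = vec_env.action_space.sample()    # one discrete action per worker
obs, rewards, terms, truncs, infos = vec_env.step(actions)
```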
## Step 7 — Main training: vec_main_v3 (1.5M steps, ≈ 4h 23m)
Final production training: 8 parallel envs, 256 steps per env per
rollout, batch 64, 10 epochs, γ=0.99, λ=0.95, clip=0.2, ent_floor=0.005,
reward floor at −1.0. Linear LR / entropy annealing. Final 100-episode
running mean reached **+843**. Saved 36 checkpoints; selected
`iter_0700.pt` (training step ≈1.43M) as the submission via
held-out per-checkpoint evaluation.
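
The vec_main_v3 configuration collected in one place for reference; the dictionary key names are illustrative and may not match `train_vec.py`'s actual argparse flags:

```python
VEC_MAIN_V3 = dict(
    n_envs=8,                 # AsyncVectorEnv workers
    n_steps=256,              # rollout length per env (8 * 256 = 2048 steps/rollout)
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_eps=0.2,
    learning_rate=2.5e-4,     # linearly annealed
    ent_coef_floor=0.005,     # entropy-coefficient annealing floor
    reward_floor=-1.0,        # minimum per-step reward, per the Step 7 text
    total_steps=1_500_000,
)
```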
## Step 8 — Evaluation (`evaluate.py`, `notebooks/03_evaluate.ipynb`)
Built `src/eval_utils.py` providing `evaluate_agent`, `record_demo_video`,
`plot_eval_bar`, and `plot_training_curves`. Final 20-episode evaluation
on unseen seeds (1000–1019) yielded **mean 830.17 ± 104.79**, min
436.81, max 914.90.
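
A sketch of the held-out evaluation in the spirit of `evaluate_agent`; the helper's real signature and the agent's `act(..., deterministic=True)` API are assumptions, and `make_wrapped_env` is the Step 2 sketch:

```python
import numpy as np

def evaluate(agent, n_episodes=20, first_seed=1000):
    """Greedy rollouts on held-out seeds 1000..1019, one fresh env per episode."""
    returns = []
    for i in range(n_episodes):
        env = make_wrapped_env()
        obs, info = env.reset(seed=first_seed + i)
        done, ep_return = False, 0.0
        while not done:
            action = agent.act(obs, deterministic=True)   # assumed agent API
            obs, reward, terminated, truncated, info = env.step(action)
            ep_return += reward
            done = terminated or truncated
        env.close()
        returns.append(ep_return)
    return np.mean(returns), np.std(returns), np.min(returns), np.max(returns)
```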
## Step 9 — SB3 baseline (`train_sb3_baseline.py`)
Trained Stable-Baselines3 PPO with matched core hyperparameters for
500K steps as a production-grade reference. Final 20-episode evaluation:
mean 664.32 ± 173.93, min 309.40. Our custom implementation outperforms it
on mean (+25%), std (−40%), and min (+41%).
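
A sketch of the matched SB3 reference run; the wrapper plumbing via `make_vec_env`, the entropy coefficient value, and the save path are assumptions, with only the core hyperparameters taken from the log:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# make_wrapped_env is the Step 2 sketch; 8 workers mirror the custom setup.
env = make_vec_env(make_wrapped_env, n_envs=8)
model = PPO(
    "CnnPolicy", env,
    n_steps=256, batch_size=64, n_epochs=10,
    gamma=0.99, gae_lambda=0.95, clip_range=0.2,
    learning_rate=2.5e-4, ent_coef=0.005,
    verbose=1, tensorboard_log="runs/sb3_baseline",
)
model.learn(total_timesteps=500_000)
model.save("models/sb3_baseline")
```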
## Step 10 — Negative-result ablations (4 attempts)
Three further refinements drawn from the PPO literature were attempted and
documented as instructive failures (see `issues_and_fixes.md` §4):
- KL-based early stopping triggered in 80% of iterations under our larger rollout batch
- RAD-style observation augmentation collapsed the policy at step 258K
- γ=0.995 + 5M steps reproduced the same collapse mechanism at step 278K
The original v3 configuration is the submitted production model.
## Final deliverables
- `models/ppo_final.pt` — submitted model (1.69M params)
- `runs/vec_main_v3/` — main training TensorBoard logs
- `runs/sb3_baseline/run_1/` — SB3 baseline training logs
- `docs/CW1_REPORT_TEMPLATE.docx` — Word source for the report PDF
- `docs/demo.mp4` — agent demo on seed 117 (return 925, 187 wrapped steps)