feat: restructure the project and add vectorized PPO training and evaluation scripts

- Refactor the original single-environment training code into a modular structure and add vectorized-environment support to improve data-collection throughput
- Implement the full PPO training pipeline, including a shared-CNN actor-critic network, a vectorized rollout buffer, and GAE advantage estimation
- Add the training script (train_vec.py), evaluation script (evaluate.py), and SB3 baseline comparison script (train_sb3_baseline.py)
- Provide detailed documentation and a development log, including issue-and-fix records and experiment analysis
- Remove legacy project files and consolidate the project structure under the CW1_id_name directory
2026-05-02 13:44:08 +08:00
parent 79ffb90823
commit fb09e66d09
80 changed files with 2971 additions and 4822 deletions
# Development Log — DTS307TC PPO Coursework
This log summarises the project's incremental development. Each step
records what was built, why, and the verification used. Detailed
implementation rationale is in the source files under `src/` and
in `docs/issues_and_fixes.md`.
## Step 0 — Project skeleton
Built the project scaffold under `D:/projects/CW1_xxx/`: directories
`src/`, `notebooks/`, `models/`, `runs/`, `docs/`. Created
`requirements.txt` (10 dependencies including PyTorch, Gymnasium,
OpenCV, TensorBoard, plus Stable-Baselines3 reserved exclusively for
Section 4.3 baseline comparison). Verified GPU + Gymnasium availability
on RTX 4060 Laptop with `torch.cuda.is_available() == True`.
## Step 1 — Environment exploration
Notebook `01_explore_env.ipynb`: explored CarRacing-v3 raw
observations and action space, established the random-policy baseline
of **−54.19 ± 5.29** over 5 episodes. Confirmed the `Box(0, 255, (96, 96, 3),
uint8)` observation space and `Discrete(5)` action space (noop, left,
right, gas, brake). The reward is `+1000/N` per newly visited tile
(N = total tiles on the track) and `−0.1` per frame, with a `−100`
terminal penalty for driving off-track.
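
A minimal sketch of the baseline measurement, using only the standard Gymnasium API (the notebook's exact cells and seeding scheme are not reproduced here):

```python
import gymnasium as gym
import numpy as np

env = gym.make("CarRacing-v3", continuous=False)   # discrete-action variant
print(env.observation_space)    # Box(0, 255, (96, 96, 3), uint8)
print(env.action_space)         # Discrete(5): noop, left, right, gas, brake

returns = []
for episode in range(5):
    obs, info = env.reset(seed=episode)
    done, ep_return = False, 0.0
    while not done:
        action = env.action_space.sample()          # uniform random policy
        obs, reward, terminated, truncated, info = env.step(action)
        ep_return += reward
        done = terminated or truncated
    returns.append(ep_return)

print(f"random baseline: {np.mean(returns):.2f} +/- {np.std(returns):.2f}")
env.close()
```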
## Step 2 — Environment wrappers (`src/env_wrappers.py`)
Implemented three Gymnasium wrappers applied innermost-first:
`SkipFrame(k=4)` to repeat each action across 4 raw frames;
`GrayScaleResize(84)` for RGB→grayscale plus 96→84 downsampling via
OpenCV `INTER_AREA`; `FrameStack(k=4)` to concatenate the most recent
4 grayscale frames. Final observation passed to the agent is shape
`(4, 84, 84) uint8`. Verified the wrapped random-policy baseline at ≈ −37.
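
A sketch of how the three wrappers compose. The wrapper class names and parameters come from `src/env_wrappers.py` as described above; the `make_wrapped_env` factory name and the constructor argument order are assumptions for illustration:

```python
import gymnasium as gym
from src.env_wrappers import SkipFrame, GrayScaleResize, FrameStack

def make_wrapped_env(render_mode=None):
    """Apply the wrappers innermost-first: skip -> grayscale/resize -> stack."""
    env = gym.make("CarRacing-v3", continuous=False, render_mode=render_mode)
    env = SkipFrame(env, k=4)         # repeat each action across 4 raw frames
    env = GrayScaleResize(env, 84)    # RGB -> grayscale, 96x96 -> 84x84 (INTER_AREA)
    env = FrameStack(env, k=4)        # stack the 4 most recent grayscale frames
    return env                        # observations: (4, 84, 84) uint8
```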
## Step 3 — Actor-critic network (`src/networks.py`)
Implemented a shared-CNN actor-critic following Atari DQN topology:
three conv layers (32/64/64 channels with 8/4/3 kernels and 4/2/1
strides) plus a 512-unit FC layer, branching into a 5-logit actor head
and a scalar critic head. All layers use orthogonal initialisation
(gain √2 hidden, 0.01 actor, 1.0 critic). Total parameters: 1,687,206.
Verified initial entropy is `ln(5) ≈ 1.6094` (uniform policy).
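
A sketch matching this description (and reproducing the 1,687,206-parameter count); the class name `ActorCritic`, the `ortho` helper, and the in-forward `/255` scaling are illustrative choices, not necessarily what `src/networks.py` does:

```python
import math
import torch
import torch.nn as nn

def ortho(layer, gain):
    """Orthogonal weight init with zero bias, per the scheme above."""
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

class ActorCritic(nn.Module):
    def __init__(self, n_actions=5):
        super().__init__()
        self.body = nn.Sequential(
            ortho(nn.Conv2d(4, 32, 8, stride=4), math.sqrt(2)), nn.ReLU(),
            ortho(nn.Conv2d(32, 64, 4, stride=2), math.sqrt(2)), nn.ReLU(),
            ortho(nn.Conv2d(64, 64, 3, stride=1), math.sqrt(2)), nn.ReLU(),
            nn.Flatten(),
            ortho(nn.Linear(64 * 7 * 7, 512), math.sqrt(2)), nn.ReLU(),
        )
        self.actor = ortho(nn.Linear(512, n_actions), 0.01)   # policy logits
        self.critic = ortho(nn.Linear(512, 1), 1.0)           # state value

    def forward(self, obs):
        x = self.body(obs.float() / 255.0)   # uint8 -> [0, 1]
        return self.actor(x), self.critic(x)

# Sanity checks matching the log: ~1.69M parameters, initial entropy ln(5).
net = ActorCritic()
print(sum(p.numel() for p in net.parameters()))                   # 1687206
logits, value = net(torch.zeros(1, 4, 84, 84, dtype=torch.uint8))
print(torch.distributions.Categorical(logits=logits).entropy())   # ~1.6094
```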
## Step 4 — Rollout buffer + GAE (`src/vec_rollout_buffer.py`)
Implemented a vectorised rollout buffer of shape `(n_steps, n_envs, ...)`
storing observations as `uint8` (4× memory saving versus float32). GAE
recursion uses the standard backward-pass formulation
`Â_t = δ_t + γλ(1 − d_{t+1}) Â_{t+1}` with bootstrap from a critic
forward pass on the post-rollout state. Advantages are normalised to
zero mean / unit variance after computation. Verified with synthetic
rollouts.
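
The backward recursion as a NumPy sketch over a `(n_steps, n_envs)` rollout; the function name and array layout are assumptions, and the real buffer in `src/vec_rollout_buffer.py` stores more fields:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Backward-pass GAE; `last_value` is the critic's bootstrap estimate
    for the post-rollout state, `dones[t]` flags episode end after step t."""
    n_steps, n_envs = rewards.shape
    advantages = np.zeros((n_steps, n_envs), dtype=np.float32)
    next_adv = np.zeros(n_envs, dtype=np.float32)
    next_value = last_value
    for t in reversed(range(n_steps)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_adv = delta + gamma * lam * not_done * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    returns = advantages + values
    # Normalise advantages to zero mean / unit variance after computation.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```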
## Step 5 — PPO agent (`src/ppo_agent.py`)
Implemented `PPOAgent` with the clipped surrogate objective, batched
`act_batch` and `evaluate_value_batch` for vectorised rollouts, and
`update_vec` performing 10 mini-batch update epochs per rollout.
Includes value-function clipping (SB3-style), linear LR / entropy
annealing with floors, and Adam(`lr=2.5e-4`, `eps=1e-5`) per the
*37 Implementation Details of PPO*. Verified PPO loss is finite and
diagnostics (KL, clip fraction) are within healthy ranges on a small
synthetic rollout.
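
A sketch of the loss terms described above; names are illustrative, and the mini-batching, entropy bonus, gradient clipping, and annealing handled by the real `update_vec` are omitted:

```python
import torch

def ppo_losses(new_logp, old_logp, adv, new_value, old_value, returns,
               clip_eps=0.2):
    ratio = torch.exp(new_logp - old_logp)

    # Clipped surrogate objective (pessimistic maximum of the negated terms).
    pg1 = -adv * ratio
    pg2 = -adv * torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = torch.max(pg1, pg2).mean()

    # SB3-style value clipping around the old value prediction.
    v_clipped = old_value + torch.clamp(new_value - old_value, -clip_eps, clip_eps)
    value_loss = torch.max((new_value - returns) ** 2,
                           (v_clipped - returns) ** 2).mean()

    # Diagnostics mentioned in the log.
    approx_kl = (old_logp - new_logp).mean()
    clip_frac = ((ratio - 1.0).abs() > clip_eps).float().mean()
    return policy_loss, value_loss, approx_kl, clip_frac
```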
## Step 6 — Training entrypoint (`train_vec.py`) + smoke tests
Implemented the full training driver using `gymnasium.vector.AsyncVectorEnv`
with 8 parallel workers. Tuned to ~95-130 sps on the RTX 4060 Laptop.
Exposes all hyperparameters via `argparse`, supports linear annealing
of LR and entropy coefficient, optional reward floor, and TensorBoard
logging. Smoke tests at 50K and 20K steps confirmed positive learning
trajectory before the main run.
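
A sketch of the vectorised data-collection setup, assuming the `make_wrapped_env` factory from the Step 2 sketch; the real `train_vec.py` also wires seeding, annealing schedules, checkpointing, and TensorBoard logging around this:

```python
import gymnasium as gym

# 8 parallel workers, each running the wrapped CarRacing-v3 env in a subprocess.
vec_env = gym.vector.AsyncVectorEnv([make_wrapped_env for _ in range(8)])

obs, infos = vec_env.reset(seed=0)         # worker i is seeded with seed + i
print(obs.shape)                           # (8, 4, 84, 84) uint8 batch
actions = vec_env.action_space.sample()    # one discrete action per worker
obs, rewards, terms, truncs, infos = vec_env.step(actions)
```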
## Step 7 — Main training: vec_main_v3 (1.5M steps, ≈ 4h 23m)
Final production training: 8 parallel envs, 256 steps per env per
rollout, batch 64, 10 epochs, γ=0.99, λ=0.95, clip=0.2, ent_floor=0.005,
reward floor at −1.0. Linear LR / entropy annealing. Final 100-episode
running mean reached **+843**. Saved 36 checkpoints; selected
`iter_0700.pt` (training step ≈1.43M) as the submission via
held-out per-checkpoint evaluation.
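
The vec_main_v3 configuration collected in one place for reference; the dictionary key names are illustrative and may not match `train_vec.py`'s actual argparse flags:

```python
VEC_MAIN_V3 = dict(
    n_envs=8,                 # AsyncVectorEnv workers
    n_steps=256,              # rollout length per env (8 * 256 = 2048 steps/rollout)
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_eps=0.2,
    learning_rate=2.5e-4,     # linearly annealed
    ent_coef_floor=0.005,     # entropy-coefficient annealing floor
    reward_floor=-1.0,        # minimum per-step reward, per the Step 7 text
    total_steps=1_500_000,
)
```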
## Step 8 — Evaluation (`evaluate.py`, `notebooks/03_evaluate.ipynb`)
Built `src/eval_utils.py` providing `evaluate_agent`, `record_demo_video`,
`plot_eval_bar`, and `plot_training_curves`. Final 20-episode evaluation
on unseen seeds (1000–1019) yielded **mean 830.17 ± 104.79**, min
436.81, max 914.90.
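
A sketch of the held-out evaluation in the spirit of `evaluate_agent`; the helper's real signature and the agent's `act(..., deterministic=True)` API are assumptions, and `make_wrapped_env` is the Step 2 sketch:

```python
import numpy as np

def evaluate(agent, n_episodes=20, first_seed=1000):
    """Greedy rollouts on held-out seeds 1000..1019, one fresh env per episode."""
    returns = []
    for i in range(n_episodes):
        env = make_wrapped_env()
        obs, info = env.reset(seed=first_seed + i)
        done, ep_return = False, 0.0
        while not done:
            action = agent.act(obs, deterministic=True)   # assumed agent API
            obs, reward, terminated, truncated, info = env.step(action)
            ep_return += reward
            done = terminated or truncated
        env.close()
        returns.append(ep_return)
    return np.mean(returns), np.std(returns), np.min(returns), np.max(returns)
```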
## Step 9 — SB3 baseline (`train_sb3_baseline.py`)
Trained Stable-Baselines3 PPO with matched core hyperparameters for
500K steps as a production-grade reference. Final 20-episode evaluation:
mean 664.32 ± 173.93, min 309.40. Our custom implementation outperforms it
on mean (+25%), std (−40%), and min (+41%).
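
A sketch of the matched SB3 reference run; the wrapper plumbing via `make_vec_env`, the entropy coefficient value, and the save path are assumptions, with only the core hyperparameters taken from the log:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# make_wrapped_env is the Step 2 sketch; 8 workers mirror the custom setup.
env = make_vec_env(make_wrapped_env, n_envs=8)
model = PPO(
    "CnnPolicy", env,
    n_steps=256, batch_size=64, n_epochs=10,
    gamma=0.99, gae_lambda=0.95, clip_range=0.2,
    learning_rate=2.5e-4, ent_coef=0.005,
    verbose=1, tensorboard_log="runs/sb3_baseline",
)
model.learn(total_timesteps=500_000)
model.save("models/sb3_baseline")
```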
## Step 10 — Negative-result ablations (4 attempts)
Three further refinements drawn from the PPO literature were attempted and
documented as instructive failures (see `issues_and_fixes.md` §4):
- KL-based early stopping triggered in 80% of iterations under our larger rollout batch
- RAD-style observation augmentation collapsed the policy at step 258K
- γ=0.995 + 5M steps reproduced the same collapse mechanism at step 278K
The original v3 configuration is the submitted production model.
## Final deliverables
- `models/ppo_final.pt` — submitted model (1.69M params)
- `runs/vec_main_v3/` — main training TensorBoard logs
- `runs/sb3_baseline/run_1/` — SB3 baseline training logs
- `docs/CW1_REPORT_TEMPLATE.docx` — Word source for the report PDF
- `docs/demo.mp4` — agent demo on seed 117 (return 925, 187 wrapped steps)