Serendipity fb09e66d09 feat: restructure the project and add vectorised PPO training and evaluation scripts
- Refactored the original single-environment training code into a modular structure and added vectorised-environment support to improve data-collection efficiency
- Implemented the full PPO training pipeline, including a shared-CNN actor-critic network, a vectorised rollout buffer, and GAE advantage estimation
- Added a training script (train_vec.py), an evaluation script (evaluate.py), and an SB3 baseline comparison script (train_sb3_baseline.py)
- Provided detailed documentation and a development log, including problem-solving records and experiment analysis
- Removed legacy project files and unified the project structure under the CW1_id_name directory
2026-05-02 13:44:08 +08:00

# Development Log — DTS307TC PPO Coursework
This log summarises the project's incremental development. Each step
records what was built, why, and the verification used. Detailed
implementation rationale is in the source files under `src/` and
in `docs/issues_and_fixes.md`.
## Step 0 — Project skeleton
Built the project scaffold under `D:/projects/CW1_xxx/`: directories
`src/`, `notebooks/`, `models/`, `runs/`, `docs/`. Created
`requirements.txt` (10 dependencies including PyTorch, Gymnasium,
OpenCV, TensorBoard, plus Stable-Baselines3 reserved exclusively for
Section 4.3 baseline comparison). Verified GPU + Gymnasium availability
on RTX 4060 Laptop with `torch.cuda.is_available() == True`.
## Step 1 — Environment exploration
Notebook `01_explore_env.ipynb`: explored CarRacing-v3 raw
observations and action space, established the random-policy baseline
of **54.19 ± 5.29** over 5 episodes. Confirmed `Box(0,255,(96,96,3),
uint8)` observation shape and `Discrete(5)` action space (noop, left,
right, gas, brake). The reward structure is `+1000/N` per newly
visited tile and `-0.1` per frame, with a `-100` terminal penalty for
going off-track.
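As a quick arithmetic check of this reward structure, the episode return can be sketched as follows (a minimal illustration; `episode_return` and its argument names are hypothetical, not functions from the coursework code):

```python
def episode_return(tiles_visited: int, total_tiles: int, frames: int,
                   went_off_track: bool = False) -> float:
    """Return under the CarRacing reward summarised above:
    +1000/N per newly visited tile, -0.1 per frame,
    and a -100 penalty if the episode ends off-track."""
    r = 1000.0 * tiles_visited / total_tiles - 0.1 * frames
    if went_off_track:
        r -= 100.0
    return r

# A full lap over all N tiles in 1000 frames scores 1000 - 100 = 900.
print(episode_return(300, 300, 1000))  # -> 900.0
```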
## Step 2 — Environment wrappers (`src/env_wrappers.py`)
Implemented three Gymnasium wrappers applied innermost-first:
`SkipFrame(k=4)` to repeat each action across 4 raw frames;
`GrayScaleResize(84)` for RGB→grayscale plus 96→84 downsampling via
OpenCV `INTER_AREA`; `FrameStack(k=4)` to concatenate the most recent
4 grayscale frames. Final observation passed to the agent is shape
`(4, 84, 84) uint8`. Verified wrapped random baseline ≈ 37.
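The frame-stacking step can be sketched with a deque (a simplified, numpy-only illustration of the wrapper logic; `FrameStackSketch` is a hypothetical name, and the real wrapper chain additionally performs the OpenCV grayscale/resize step):

```python
from collections import deque
import numpy as np

class FrameStackSketch:
    """Minimal sketch of the FrameStack(k=4) wrapper logic: keep the
    k most recent grayscale frames, exposed as a (k, 84, 84) uint8 array."""
    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        # On reset, fill the stack with copies of the first frame.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(frame)  # oldest frame drops out automatically
        return self.observation()

    def observation(self) -> np.ndarray:
        return np.stack(self.frames, axis=0)

stack = FrameStackSketch(k=4)
obs = stack.reset(np.zeros((84, 84), dtype=np.uint8))
print(obs.shape, obs.dtype)  # -> (4, 84, 84) uint8
```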
## Step 3 — Actor-critic network (`src/networks.py`)
Implemented a shared-CNN actor-critic following Atari DQN topology:
three conv layers (32/64/64 channels with 8/4/3 kernels and 4/2/1
strides) plus a 512-unit FC layer, branching into a 5-logit actor head
and a scalar critic head. All layers use orthogonal initialisation
(gain √2 hidden, 0.01 actor, 1.0 critic). Total parameters: 1,687,206.
Verified initial entropy is `ln(5) ≈ 1.6094` (uniform policy).
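The quoted parameter count can be verified by hand from the layer shapes above (a sketch; helper names are illustrative):

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output spatial size of a valid (no padding) convolution."""
    return (size - kernel) // stride + 1

# Spatial sizes after the three conv layers, starting from 84x84.
s1 = conv_out(84, 8, 4)   # 20
s2 = conv_out(s1, 4, 2)   # 9
s3 = conv_out(s2, 3, 1)   # 7

def conv_params(c_in: int, c_out: int, k: int) -> int:
    return c_out * (c_in * k * k + 1)   # weights + biases

params = (conv_params(4, 32, 8)        # conv1: 4 -> 32, 8x8
          + conv_params(32, 64, 4)     # conv2: 32 -> 64, 4x4
          + conv_params(64, 64, 3)     # conv3: 64 -> 64, 3x3
          + 512 * (64 * s3 * s3 + 1)   # shared FC: 3136 -> 512
          + 5 * (512 + 1)              # actor head: 512 -> 5 logits
          + 1 * (512 + 1))             # critic head: 512 -> 1 value
print(params)  # -> 1687206
```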
## Step 4 — Rollout buffer + GAE (`src/vec_rollout_buffer.py`)
Implemented a vectorised rollout buffer of shape `(n_steps, n_envs, ...)`
storing observations as `uint8` (4× memory saving versus float32). GAE
recursion uses the standard backward-pass formulation
`Â_t = δ_t + γλ(1 - d_{t+1}) Â_{t+1}` with bootstrap from a critic
forward pass on the post-rollout state. Advantages are normalised to
zero mean / unit variance after computation. Verified with synthetic
rollouts.
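The backward recursion can be sketched for a single environment as follows (a simplified, numpy-only illustration of the vectorised buffer's GAE pass; function and argument names are hypothetical):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value,
                gamma=0.99, lam=0.95):
    """Backward-pass GAE over one env's rollout; shapes are (n_steps,).
    Convention here: dones[t] = 1 if the step-t transition ended the
    episode, which masks both the bootstrap and the recursion."""
    n_steps = len(rewards)
    adv = np.zeros(n_steps, dtype=np.float32)
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(n_steps)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        adv[t] = delta + gamma * lam * not_done * next_adv
        next_adv, next_value = adv[t], values[t]
    return adv

# After computation, advantages are normalised to zero mean / unit
# variance, e.g.: adv = (adv - adv.mean()) / (adv.std() + 1e-8)
```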
## Step 5 — PPO agent (`src/ppo_agent.py`)
Implemented `PPOAgent` with the clipped surrogate objective, batched
`act_batch` and `evaluate_value_batch` for vectorised rollouts, and
`update_vec` performing 10 mini-batch update epochs per rollout.
Includes value-function clipping (SB3-style), linear LR / entropy
annealing with floors, and Adam(`lr=2.5e-4`, `eps=1e-5`) per the
*37 Implementation Details of PPO*. Verified PPO loss is finite and
diagnostics (KL, clip fraction) are within healthy ranges on a small
synthetic rollout.
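The clipped surrogate objective and the two diagnostics mentioned above can be sketched in numpy (an illustration of the technique, not the actual `ppo_agent.py` code; names are hypothetical):

```python
import numpy as np

def clipped_surrogate_loss(new_logp, old_logp, adv, clip_eps=0.2):
    """PPO clipped surrogate (to be minimised), plus the approximate-KL
    and clip-fraction diagnostics monitored during updates."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    loss = -np.mean(np.minimum(unclipped, clipped))
    approx_kl = np.mean(old_logp - new_logp)
    clip_frac = np.mean(np.abs(ratio - 1.0) > clip_eps)
    return loss, approx_kl, clip_frac
```

With unchanged log-probabilities the ratio is 1 everywhere, so the loss reduces to the negated mean advantage and both diagnostics are zero.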
## Step 6 — Training entrypoint (`train_vec.py`) + smoke tests
Implemented the full training driver using `gymnasium.vector.AsyncVectorEnv`
with 8 parallel workers, reaching roughly 95-130 environment steps per
second on the RTX 4060 Laptop GPU.
Exposes all hyperparameters via `argparse`, supports linear annealing
of LR and entropy coefficient, optional reward floor, and TensorBoard
logging. Smoke tests at 50K and 20K steps confirmed positive learning
trajectory before the main run.
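The linear annealing with a floor can be sketched as follows (one possible implementation; the 0.01 entropy-coefficient start value is an assumed example, not a value taken from the training run):

```python
def linear_anneal(start: float, floor: float,
                  step: int, total_steps: int) -> float:
    """Linearly interpolate from `start` toward `floor` over
    `total_steps`, never dropping below the floor."""
    frac = min(step / total_steps, 1.0)
    return max(start + (floor - start) * frac, floor)

# LR starts at 2.5e-4; entropy coefficient anneals down to its 0.005 floor.
print(linear_anneal(2.5e-4, 0.0, 0, 1_500_000))          # -> 0.00025
print(linear_anneal(0.01, 0.005, 1_500_000, 1_500_000))  # -> 0.005
```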
## Step 7 — Main training: vec_main_v3 (1.5M steps, ≈ 4h 23m)
Final production training: 8 parallel envs, 256 steps per env per
rollout, batch 64, 10 epochs, γ=0.99, λ=0.95, clip=0.2, ent_floor=0.005,
reward floor at 1.0. Linear LR / entropy annealing. Final 100-episode
running mean reached **+843**. Saved 36 checkpoints; selected
`iter_0700.pt` (training step ≈1.43M) as the submission via
held-out per-checkpoint evaluation.
## Step 8 — Evaluation (`evaluate.py`, `notebooks/03_evaluate.ipynb`)
Built `src/eval_utils.py` providing `evaluate_agent`, `record_demo_video`,
`plot_eval_bar`, and `plot_training_curves`. Final 20-episode evaluation
on unseen seeds (1000-1019) yielded **mean 830.17 ± 104.79**, min
436.81, max 914.90.
## Step 9 — SB3 baseline (`train_sb3_baseline.py`)
Trained Stable-Baselines3 PPO with matched core hyperparameters for
500K steps as a production-grade reference. Final 20-episode evaluation:
mean 664.32 ± 173.93, min 309.40. Our custom implementation outperforms
on mean (+25%), std (-40%, lower is better), and min (+41%).
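The quoted percentages follow directly from the two evaluations' statistics (a short sketch reproducing the arithmetic):

```python
# Relative differences behind the comparison percentages quoted above.
ours = {"mean": 830.17, "std": 104.79, "min": 436.81}
sb3  = {"mean": 664.32, "std": 173.93, "min": 309.40}
for k in ("mean", "std", "min"):
    rel = (ours[k] - sb3[k]) / sb3[k]
    print(f"{k}: {rel:+.0%}")
# -> mean: +25%, std: -40% (lower is better), min: +41%
```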
## Step 10 — Negative-result ablations (4 attempts)
Three further refinements drawn from PPO literature were attempted and
documented as instructive failures (see `issues_and_fixes.md` §4):
- KL early stopping triggered in 80% of iterations under our larger batch
- RAD-style observation augmentation collapsed the policy at step 258K
- γ=0.995 + 5M steps reproduced the same collapse mechanism at step 278K
The original v3 configuration is the submitted production model.
## Final deliverables
- `models/ppo_final.pt` — submitted model (1.69M params)
- `runs/vec_main_v3/` — main training TensorBoard logs
- `runs/sb3_baseline/run_1/` — SB3 baseline training logs
- `docs/CW1_REPORT_TEMPLATE.docx` — Word source for the report PDF
- `docs/demo.mp4` — agent demo on seed 117 (return 925, 187 wrapped steps)