# Development Log — DTS307TC PPO Coursework

This log summarises the project's incremental development. Each step records what was built, why, and the verification used. Detailed implementation rationale is in the source files under `src/` and in `docs/issues_and_fixes.md`.

## Step 0 — Project skeleton

Built the project scaffold under `D:/projects/CW1_xxx/`: directories `src/`, `notebooks/`, `models/`, `runs/`, `docs/`. Created `requirements.txt` (10 dependencies including PyTorch, Gymnasium, OpenCV, and TensorBoard, plus Stable-Baselines3 reserved exclusively for the Section 4.3 baseline comparison). Verified GPU and Gymnasium availability on an RTX 4060 Laptop GPU with `torch.cuda.is_available() == True`.

## Step 1 — Environment exploration

Notebook `01_explore_env.ipynb`: explored CarRacing-v3 raw observations and the action space, and established the random-policy baseline of **−54.19 ± 5.29** over 5 episodes. Confirmed the `Box(0, 255, (96, 96, 3), uint8)` observation space and `Discrete(5)` action space (noop, left, right, gas, brake). The reward structure is `+1000/N` per new tile visited and `−0.1` per frame, with a `−100` terminal penalty for going off-track.

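As a back-of-envelope sanity check on that reward structure (illustrative arithmetic only, not code from the notebook), the return of an episode decomposes as:

```python
def car_racing_return(tiles_visited, total_tiles, frames, off_track=False):
    """Approximate CarRacing-v3 return: +1000/N per new tile,
    -0.1 per frame, and a -100 terminal penalty when off-track."""
    r = 1000.0 * tiles_visited / total_tiles - 0.1 * frames
    return r - 100.0 if off_track else r

# Completing every tile of a track in 1000 frames:
print(car_racing_return(300, 300, 1000))  # 1000 - 100 = 900.0
```

This makes the −54 random baseline intuitive: a random policy visits few tiles, pays the per-frame cost, and usually ends with the off-track penalty.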
## Step 2 — Environment wrappers (`src/env_wrappers.py`)

Implemented three Gymnasium wrappers, applied innermost-first: `SkipFrame(k=4)` repeats each action across 4 raw frames; `GrayScaleResize(84)` converts RGB→grayscale and downsamples 96→84 via OpenCV `INTER_AREA`; `FrameStack(k=4)` concatenates the 4 most recent grayscale frames. The final observation passed to the agent has shape `(4, 84, 84) uint8`. Verified the wrapped random baseline ≈ −37.

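The grayscale-plus-stacking logic can be sketched in plain NumPy (the real wrapper uses `cv2.resize(..., interpolation=cv2.INTER_AREA)` for the 96→84 step, omitted here; padding the stack with the initial frame on reset is one common choice, not necessarily the wrapper's):

```python
from collections import deque
import numpy as np

def to_gray(rgb):
    # ITU-R BT.601 luma weights; keep uint8 like the wrappers do
    return (rgb @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

class FrameStackSketch:
    """Keep the k most recent grayscale frames, oldest first."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)  # pad with the initial frame
        return np.stack(self.frames)

    def step(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames)

stack = FrameStackSketch(k=4)
obs = stack.reset(to_gray(np.zeros((84, 84, 3), dtype=np.uint8)))
print(obs.shape, obs.dtype)  # (4, 84, 84) uint8
```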
## Step 3 — Actor-critic network (`src/networks.py`)

Implemented a shared-CNN actor-critic following the Atari DQN topology: three conv layers (32/64/64 channels with 8/4/3 kernels and 4/2/1 strides) plus a 512-unit fully connected layer, branching into a 5-logit actor head and a scalar critic head. All layers use orthogonal initialisation (gain √2 for hidden layers, 0.01 for the actor head, 1.0 for the critic head). Total parameters: 1,687,206. Verified the initial policy entropy is `ln(5) ≈ 1.6094` (uniform policy).

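The parameter count can be reproduced by hand from the layer sizes above (a quick arithmetic check, independent of the actual `src/networks.py` code):

```python
def conv_out(size, kernel, stride):
    # standard valid-convolution output size (no padding)
    return (size - kernel) // stride + 1

# Spatial sizes: 84 -> 20 -> 9 -> 7 through the 8/4, 4/2, 3/1 conv stages
s = conv_out(conv_out(conv_out(84, 8, 4), 4, 2), 3, 1)
flat = 64 * s * s  # 3136 features into the shared FC layer

params = (
    32 * (4 * 8 * 8) + 32      # conv1: 4 stacked frames -> 32 channels
    + 64 * (32 * 4 * 4) + 64   # conv2
    + 64 * (64 * 3 * 3) + 64   # conv3
    + flat * 512 + 512         # shared 512-unit FC layer
    + 512 * 5 + 5              # actor head (5 logits)
    + 512 * 1 + 1              # critic head (scalar value)
)
print(s, flat, params)  # 7 3136 1687206
```

The total matches the 1,687,206 figure reported above.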
## Step 4 — Rollout buffer + GAE (`src/vec_rollout_buffer.py`)

Implemented a vectorised rollout buffer of shape `(n_steps, n_envs, ...)`, storing observations as `uint8` (a 4× memory saving versus float32). The GAE recursion uses the standard backward-pass formulation `Â_t = δ_t + γλ(1 − d_{t+1}) Â_{t+1}`, bootstrapping from a critic forward pass on the post-rollout state. Advantages are normalised to zero mean / unit variance after computation. Verified with synthetic rollouts.

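That backward recursion can be sketched in NumPy (shapes follow the buffer's `(n_steps, n_envs)` layout; variable names are illustrative, not the buffer's actual API — here `dones[t]` flags termination after step t, which plays the role of `d_{t+1}` in the formula above):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """rewards, values, dones: arrays of shape (n_steps, n_envs);
    last_value: (n_envs,) critic bootstrap for the post-rollout state."""
    n_steps, _ = rewards.shape
    adv = np.zeros_like(rewards)
    gae = np.zeros(rewards.shape[1])
    for t in reversed(range(n_steps)):
        next_value = last_value if t == n_steps - 1 else values[t + 1]
        not_done = 1.0 - dones[t]          # zero the bootstrap at episode ends
        delta = rewards[t] + gamma * not_done * next_value - values[t]
        gae = delta + gamma * lam * not_done * gae
        adv[t] = gae
    returns = adv + values                 # targets for the value loss
    # normalise advantages to zero mean / unit variance, as described above
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    return adv, returns
```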
## Step 5 — PPO agent (`src/ppo_agent.py`)

Implemented `PPOAgent` with the clipped surrogate objective, batched `act_batch` and `evaluate_value_batch` methods for vectorised rollouts, and `update_vec`, which performs 10 mini-batch update epochs per rollout. Includes SB3-style value-function clipping, linear LR / entropy annealing with floors, and Adam (`lr=2.5e-4`, `eps=1e-5`) per *The 37 Implementation Details of PPO*. Verified that the PPO loss is finite and that diagnostics (approximate KL, clip fraction) stay within healthy ranges on a small synthetic rollout.

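The clipped surrogate at the core of the update can be written out in a few lines (a minimal NumPy sketch of the standard PPO-Clip loss and the clip-fraction diagnostic, not the agent's actual implementation):

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, adv, clip=0.2):
    """Standard PPO clipped surrogate (to be minimised), plus the
    fraction of samples whose ratio left the trust region."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    loss = -np.minimum(unclipped, clipped).mean()
    clip_frac = np.mean(np.abs(ratio - 1.0) > clip)
    return loss, clip_frac

# A ratio far outside the trust region contributes only its clipped value:
loss, frac = ppo_clip_loss(np.array([0.0]), np.array([-1.0]), np.array([1.0]))
print(loss, frac)  # ratio = e ~ 2.72, clipped to 1.2 -> loss = -1.2, frac = 1.0
```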
## Step 6 — Training entrypoint (`train_vec.py`) + smoke tests

Implemented the full training driver using `gymnasium.vector.AsyncVectorEnv` with 8 parallel workers, tuned to roughly 95–130 environment steps/s on the RTX 4060 Laptop GPU. It exposes all hyperparameters via `argparse` and supports linear annealing of the LR and entropy coefficient, an optional reward floor, and TensorBoard logging. Smoke tests at 50K and 20K steps confirmed a positive learning trajectory before the main run.

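Linear annealing with a floor reduces to a one-liner; a sketch (the clamp mirrors the `ent_floor` idea from the training config, but the starting coefficient `0.01` below is illustrative, not the project's value):

```python
def linear_anneal(start, floor, progress):
    """Decay linearly from `start` at progress=0 toward 0 at progress=1,
    but never below `floor`."""
    return max(floor, start * (1.0 - progress))

# e.g. an entropy coefficient starting at 0.01 with floor 0.005:
print(linear_anneal(0.01, 0.005, 0.25))  # 0.0075
print(linear_anneal(0.01, 0.005, 0.9))   # clamped at the 0.005 floor
```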
## Step 7 — Main training: vec_main_v3 (1.5M steps, ≈ 4h 23m)

Final production training: 8 parallel envs, 256 steps per env per rollout, batch size 64, 10 epochs, γ=0.99, λ=0.95, clip=0.2, ent_floor=0.005, reward floor at −1.0, with linear LR / entropy annealing. The final 100-episode running mean reached **+843**. Saved 36 checkpoints and selected `iter_0700.pt` (training step ≈ 1.43M) as the submission via held-out per-checkpoint evaluation.

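These settings imply the per-iteration arithmetic below (a consistency check on the configuration, not project code):

```python
n_envs, n_steps, batch, epochs = 8, 256, 64, 10

samples_per_rollout = n_envs * n_steps                 # 2048 transitions
minibatches_per_epoch = samples_per_rollout // batch   # 32 minibatches
updates_per_rollout = minibatches_per_epoch * epochs   # 320 gradient steps
iterations = 1_500_000 // samples_per_rollout          # ~732 rollouts in 1.5M steps
print(samples_per_rollout, updates_per_rollout, iterations)  # 2048 320 732
```

Note that `iter_0700.pt` then corresponds to 700 × 2048 ≈ 1.43M environment steps, consistent with the selected checkpoint above.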
## Step 8 — Evaluation (`evaluate.py`, `notebooks/03_evaluate.ipynb`)

Built `src/eval_utils.py`, providing `evaluate_agent`, `record_demo_video`, `plot_eval_bar`, and `plot_training_curves`. The final 20-episode evaluation on unseen seeds (1000–1019) yielded **mean 830.17 ± 104.79** (min 436.81, max 914.90).

## Step 9 — SB3 baseline (`train_sb3_baseline.py`)

Trained Stable-Baselines3 PPO with matched core hyperparameters for 500K steps as a production-grade reference. Final 20-episode evaluation: mean 664.32 ± 173.93, min 309.40. Our custom implementation outperforms it on mean (+25%), std (−40%), and min (+41%).

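The headline percentages follow directly from the two evaluations (arithmetic check only):

```python
ours = {"mean": 830.17, "std": 104.79, "min": 436.81}  # custom PPO, Step 8
sb3 = {"mean": 664.32, "std": 173.93, "min": 309.40}   # SB3 baseline

for k in ours:
    rel = (ours[k] - sb3[k]) / sb3[k] * 100  # relative difference vs SB3
    print(f"{k}: {rel:+.0f}%")  # mean: +25%, std: -40%, min: +41%
```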
## Step 10 — Negative-result ablations (4 attempts)

Three further refinements drawn from the PPO literature were attempted and documented as instructive failures (see `issues_and_fixes.md` §4):

- KL early stopping triggered in 80% of iterations under our larger batch
- RAD-style observation augmentation collapsed the policy at step 258K
- γ=0.995 + 5M steps reproduced the same collapse mechanism at step 278K

The original v3 configuration is the submitted production model.

## Final deliverables

- `models/ppo_final.pt` — submitted model (1.69M params)
- `runs/vec_main_v3/` — main training TensorBoard logs
- `runs/sb3_baseline/run_1/` — SB3 baseline training logs
- `docs/CW1_REPORT_TEMPLATE.docx` — Word source for the report PDF
- `docs/demo.mp4` — agent demo on seed 117 (return 925, 187 wrapped steps)