Development Log — DTS307TC PPO Coursework
This log summarises the project's incremental development. Each step
records what was built, why, and the verification used. Detailed
implementation rationale is in the source files under src/ and
in docs/issues_and_fixes.md.
Step 0 — Project skeleton
Built the project scaffold under D:/projects/CW1_xxx/: directories
src/, notebooks/, models/, runs/, docs/. Created
requirements.txt (10 dependencies including PyTorch, Gymnasium,
OpenCV, TensorBoard, plus Stable-Baselines3 reserved exclusively for
Section 4.3 baseline comparison). Verified GPU and Gymnasium availability
on an RTX 4060 Laptop GPU with torch.cuda.is_available() == True.
Step 1 — Environment exploration
Notebook 01_explore_env.ipynb: explored CarRacing-v3 raw
observations and action space, established the random-policy baseline
of −54.19 ± 5.29 over 5 episodes. Confirmed the Box(0, 255, (96, 96, 3), uint8)
observation space and the Discrete(5) action space (noop, left, right, gas,
brake). The reward structure is +1000/N per newly visited track tile (N =
total tiles on the track) and −0.1 per frame, with a −100 terminal penalty
for going off-track.
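A minimal sketch of how that random-policy baseline can be reproduced (the actual exploration lives in 01_explore_env.ipynb; the seeds used here are illustrative):

```python
import gymnasium as gym
import numpy as np

# Random-policy baseline over 5 episodes; continuous=False selects the
# Discrete(5) action space used throughout the project.
env = gym.make("CarRacing-v3", continuous=False)

returns = []
for episode in range(5):
    obs, info = env.reset(seed=episode)   # illustrative seeds
    done, ep_return = False, 0.0
    while not done:
        action = env.action_space.sample()            # uniform random action
        obs, reward, terminated, truncated, info = env.step(action)
        ep_return += reward
        done = terminated or truncated
    returns.append(ep_return)

print(f"random baseline: {np.mean(returns):.2f} +/- {np.std(returns):.2f}")
env.close()
```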
Step 2 — Environment wrappers (src/env_wrappers.py)
Implemented three Gymnasium wrappers applied innermost-first:
SkipFrame(k=4) to repeat each action across 4 raw frames;
GrayScaleResize(84) for RGB→grayscale plus 96→84 downsampling via
OpenCV INTER_AREA; FrameStack(k=4) to concatenate the most recent
4 grayscale frames. The final observation passed to the agent has shape
(4, 84, 84), dtype uint8. Verified a wrapped random-policy baseline of ≈ −37.
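A sketch of the wrapper composition (wrapper names match src/env_wrappers.py; the make_env factory and the keyword arguments shown are assumptions for illustration):

```python
import gymnasium as gym
from src.env_wrappers import SkipFrame, GrayScaleResize, FrameStack

def make_env(seed: int = 0):
    # Wrappers applied innermost-first: skip -> grayscale/resize -> stack.
    env = gym.make("CarRacing-v3", continuous=False)
    env = SkipFrame(env, k=4)            # repeat each action over 4 raw frames
    env = GrayScaleResize(env, size=84)  # RGB -> grayscale, 96x96 -> 84x84 (INTER_AREA)
    env = FrameStack(env, k=4)           # stack last 4 frames -> (4, 84, 84) uint8
    env.reset(seed=seed)
    return env
```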
Step 3 — Actor-critic network (src/networks.py)
Implemented a shared-CNN actor-critic following the Atari DQN topology:
three conv layers (32/64/64 channels with 8/4/3 kernels and 4/2/1
strides) plus a 512-unit FC layer, branching into a 5-logit actor head
and a scalar critic head. All layers use orthogonal initialisation
(gain √2 hidden, 0.01 actor, 1.0 critic). Total parameters: 1,687,206.
Verified initial entropy is ln(5) ≈ 1.6094 (uniform policy).
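A condensed sketch of the architecture (layer sizes and initialisation gains are the ones listed above; class and method names are illustrative, the real code is in src/networks.py):

```python
import math
import torch
import torch.nn as nn

def ortho(layer: nn.Module, gain: float) -> nn.Module:
    # Orthogonal weight init with the per-layer gain, zero bias.
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

class ActorCritic(nn.Module):
    def __init__(self, n_actions: int = 5):
        super().__init__()
        g = math.sqrt(2)
        self.backbone = nn.Sequential(
            ortho(nn.Conv2d(4, 32, kernel_size=8, stride=4), g), nn.ReLU(),
            ortho(nn.Conv2d(32, 64, kernel_size=4, stride=2), g), nn.ReLU(),
            ortho(nn.Conv2d(64, 64, kernel_size=3, stride=1), g), nn.ReLU(),
            nn.Flatten(),
            ortho(nn.Linear(64 * 7 * 7, 512), g), nn.ReLU(),  # 84x84 input -> 7x7 feature map
        )
        self.actor = ortho(nn.Linear(512, n_actions), 0.01)   # 5-logit policy head
        self.critic = ortho(nn.Linear(512, 1), 1.0)           # scalar value head

    def forward(self, obs: torch.Tensor):
        x = self.backbone(obs.float() / 255.0)   # uint8 frames scaled to [0, 1]
        return self.actor(x), self.critic(x)
```

The obs.float() / 255.0 scaling is an assumption about where normalisation happens; the layer shapes above do reproduce the stated total of 1,687,206 parameters.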
Step 4 — Rollout buffer + GAE (src/vec_rollout_buffer.py)
Implemented a vectorised rollout buffer of shape (n_steps, n_envs, ...)
storing observations as uint8 (4× memory saving versus float32). GAE
recursion uses the standard backward-pass formulation
Â_t = δ_t + γλ(1 − d_{t+1}) Â_{t+1}, where δ_t = r_t + γ(1 − d_{t+1}) V(s_{t+1}) − V(s_t),
with the bootstrap value taken from a critic forward pass on the
post-rollout state. Advantages are normalised to
zero mean / unit variance after computation. Verified with synthetic
rollouts.
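A minimal sketch of that backward pass (array names and the done-flag convention are illustrative; here dones[t] = 1 means the episode ended on step t, so neither the bootstrap value nor Â_{t+1} propagates across a reset):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """rewards, values, dones: (n_steps, n_envs); last_value: (n_envs,) critic bootstrap."""
    n_steps, _ = rewards.shape
    advantages = np.zeros_like(rewards)
    gae = np.zeros_like(last_value)
    for t in reversed(range(n_steps)):
        next_value = last_value if t == n_steps - 1 else values[t + 1]
        mask = 1.0 - dones[t]                      # blocks bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    returns = advantages + values                  # regression targets for the critic
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```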
Step 5 — PPO agent (src/ppo_agent.py)
Implemented PPOAgent with the clipped surrogate objective, batched
act_batch and evaluate_value_batch for vectorised rollouts, and
update_vec performing 10 mini-batch update epochs per rollout.
Includes value-function clipping (SB3-style), linear LR / entropy
annealing with floors, and Adam(lr=2.5e-4, eps=1e-5) following
"The 37 Implementation Details of Proximal Policy Optimization".
Verified the PPO loss is finite and
diagnostics (KL, clip fraction) are within healthy ranges on a small
synthetic rollout.
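The core of one mini-batch update, sketched (tensor names are illustrative; the value clipping shown is the pessimistic-max variant, one common formulation of the clipping mentioned above):

```python
import torch

def ppo_losses(new_logprob, old_logprob, advantages, new_value, old_value,
               returns, entropy, clip_eps=0.2, vf_clip=0.2,
               vf_coef=0.5, ent_coef=0.01):
    # Clipped surrogate: pessimistic minimum of the unclipped and clipped terms.
    ratio = torch.exp(new_logprob - old_logprob)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss, clipped around the value predicted at rollout time.
    value_clipped = old_value + torch.clamp(new_value - old_value, -vf_clip, vf_clip)
    value_loss = 0.5 * torch.max((new_value - returns) ** 2,
                                 (value_clipped - returns) ** 2).mean()

    # Entropy bonus (coefficient annealed during training) encourages exploration.
    entropy_loss = -entropy.mean()
    total = policy_loss + vf_coef * value_loss + ent_coef * entropy_loss

    # Diagnostics logged per update: approximate KL and clip fraction.
    with torch.no_grad():
        approx_kl = ((ratio - 1.0) - torch.log(ratio)).mean()
        clip_frac = ((ratio - 1.0).abs() > clip_eps).float().mean()
    return total, approx_kl, clip_frac
```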
Step 6 — Training entrypoint (train_vec.py) + smoke tests
Implemented the full training driver using gymnasium.vector.AsyncVectorEnv
with 8 parallel workers. Tuned to ~95–130 environment steps per second on the
RTX 4060 Laptop GPU.
Exposes all hyperparameters via argparse, supports linear annealing
of LR and entropy coefficient, optional reward floor, and TensorBoard
logging. Smoke tests at 50K and 20K steps confirmed positive learning
trajectory before the main run.
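How the parallel workers are constructed, in sketch form (the factory mirrors Step 2; seeds and the printed shape are illustrative):

```python
import gymnasium as gym

def make_env_fn(rank: int):
    # Each AsyncVectorEnv worker process builds its own fully wrapped env.
    def _init():
        env = gym.make("CarRacing-v3", continuous=False)
        # ...apply SkipFrame / GrayScaleResize / FrameStack as in Step 2...
        env.reset(seed=rank)
        return env
    return _init

if __name__ == "__main__":
    vec_env = gym.vector.AsyncVectorEnv([make_env_fn(i) for i in range(8)])
    obs, info = vec_env.reset()
    print(obs.shape)   # (8, 96, 96, 3) raw; (8, 4, 84, 84) once the Step 2 wrappers are applied
    vec_env.close()
```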
Step 7 — Main training: vec_main_v3 (1.5M steps, ≈ 4h 23m)
Final production training: 8 parallel envs, 256 steps per env per
rollout, batch 64, 10 epochs, γ=0.99, λ=0.95, clip=0.2, ent_floor=0.005,
reward floor at −1.0. Linear LR / entropy annealing. Final 100-episode
running mean reached +843. Saved 36 checkpoints; selected
iter_0700.pt (training step ≈1.43M) as the submission via
held-out per-checkpoint evaluation.
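The v3 configuration gathered in one place (values as listed above; the dictionary keys are illustrative of the argparse options exposed by train_vec.py):

```python
# Hyperparameters of the vec_main_v3 production run (key names illustrative).
V3_CONFIG = {
    "total_timesteps": 1_500_000,
    "num_envs": 8,
    "n_steps": 256,            # per env per rollout -> 2048 transitions per update
    "batch_size": 64,
    "n_epochs": 10,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "clip_range": 0.2,
    "learning_rate": 2.5e-4,   # linearly annealed
    "ent_coef_floor": 0.005,   # entropy coefficient annealed down to this floor
    "reward_floor": -1.0,      # reward floor described in Step 6
}
```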
Step 8 — Evaluation (evaluate.py, notebooks/03_evaluate.ipynb)
Built src/eval_utils.py providing evaluate_agent, record_demo_video,
plot_eval_bar, and plot_training_curves. Final 20-episode evaluation
on unseen seeds (1000–1019) yielded mean 830.17 ± 104.79, min
436.81, max 914.90.
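The evaluation protocol, sketched (evaluate_agent's real signature is in src/eval_utils.py; the agent.act interface shown here is an illustrative assumption):

```python
import numpy as np

def evaluate(agent, make_env, seeds=range(1000, 1020)):
    """Deterministic evaluation on held-out seeds never seen during training."""
    returns = []
    for seed in seeds:
        env = make_env(seed)                     # fully wrapped single env (Step 2)
        obs, info = env.reset(seed=seed)
        done, ep_return = False, 0.0
        while not done:
            action = agent.act(obs, deterministic=True)   # illustrative agent API
            obs, reward, terminated, truncated, info = env.step(action)
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
        env.close()
    returns = np.asarray(returns)
    return returns.mean(), returns.std(), returns.min(), returns.max()
```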
Step 9 — SB3 baseline (train_sb3_baseline.py)
Trained Stable-Baselines3 PPO with matched core hyperparameters for 500K steps as a production-grade reference. Final 20-episode evaluation: mean 664.32 ± 173.93, min 309.40. Our custom implementation outperforms on mean (+25%), std (−40%), and min (+41%).
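A sketch of the matched SB3 run (the real script is train_sb3_baseline.py; wrapper plumbing and some keyword arguments may differ from what is shown here):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Vectorised CarRacing with the discrete action space; the actual script applies
# the same preprocessing wrappers as the custom pipeline before training.
vec_env = make_vec_env("CarRacing-v3", n_envs=8,
                       env_kwargs={"continuous": False})

model = PPO(
    "CnnPolicy", vec_env,
    n_steps=256, batch_size=64, n_epochs=10,
    gamma=0.99, gae_lambda=0.95, clip_range=0.2,
    learning_rate=2.5e-4,
    tensorboard_log="runs/sb3_baseline",
    verbose=1,
)
model.learn(total_timesteps=500_000)
model.save("models/sb3_baseline")
```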
Step 10 — Negative-result ablations (4 attempts)
Three further refinements drawn from the PPO literature were attempted and
documented as instructive failures (see issues_and_fixes.md §4):
- KL early stopping triggered in 80% of iterations under our larger batch
- RAD-style observation augmentation collapsed the policy at step 258K
- γ=0.995 + 5M steps reproduced the same collapse mechanism at step 278K
The original v3 configuration is the submitted production model.
Final deliverables
- models/ppo_final.pt — submitted model (1.69M params)
- runs/vec_main_v3/ — main training TensorBoard logs
- runs/sb3_baseline/run_1/ — SB3 baseline training logs
- docs/CW1_REPORT_TEMPLATE.docx — Word source for the report PDF
- docs/demo.mp4 — agent demo on seed 117 (return 925, 187 wrapped steps)