Development Log — DTS307TC PPO Coursework

This log summarises the project's incremental development. Each step records what was built, why, and the verification used. Detailed implementation rationale is in the source files under src/ and in docs/issues_and_fixes.md.

Step 0 — Project skeleton

Built the project scaffold under D:/projects/CW1_xxx/: directories src/, notebooks/, models/, runs/, docs/. Created requirements.txt listing 10 dependencies (including PyTorch, Gymnasium, OpenCV, and TensorBoard; Stable-Baselines3 is reserved exclusively for the Section 4.3 baseline comparison). Verified GPU and Gymnasium availability on an RTX 4060 Laptop GPU with torch.cuda.is_available() == True.

Step 1 — Environment exploration

Notebook 01_explore_env.ipynb: explored CarRacing-v3 raw observations and the action space, and established the random-policy baseline of −54.19 ± 5.29 over 5 episodes. Confirmed the Box(0, 255, (96, 96, 3), uint8) observation space and Discrete(5) action space (noop, left, right, gas, brake). The reward is +1000/N per newly visited track tile (N = total tiles on the track) and −0.1 per frame, with a −100 terminal penalty for driving off-track.
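
A quick sanity check of the spaces (continuous=False is the Gymnasium switch that selects the discrete action variant):

    import gymnasium as gym

    env = gym.make("CarRacing-v3", continuous=False)
    print(env.observation_space)  # Box(0, 255, (96, 96, 3), uint8)
    print(env.action_space)       # Discrete(5): noop, left, right, gas, brake
    env.close()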

Step 2 — Environment wrappers (src/env_wrappers.py)

Implemented three Gymnasium wrappers applied innermost-first: SkipFrame(k=4) to repeat each action across 4 raw frames; GrayScaleResize(84) for RGB→grayscale conversion plus 96→84 downsampling via OpenCV INTER_AREA; FrameStack(k=4) to concatenate the most recent 4 grayscale frames. The final observation passed to the agent has shape (4, 84, 84) uint8. Verified the wrapped random baseline at ≈ −37.
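
A sketch of how the stack is assembled (the wrapper names come from src/env_wrappers.py, but the exact constructor signatures shown here are assumptions):

    import gymnasium as gym
    from src.env_wrappers import SkipFrame, GrayScaleResize, FrameStack

    def make_env(seed=0):
        env = gym.make("CarRacing-v3", continuous=False)
        env = SkipFrame(env, k=4)            # innermost: repeat each action over 4 raw frames
        env = GrayScaleResize(env, size=84)  # RGB -> grayscale, 96x96 -> 84x84 via INTER_AREA
        env = FrameStack(env, k=4)           # outermost: stack the 4 most recent frames
        env.reset(seed=seed)                 # final observation: (4, 84, 84) uint8
        return env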

Step 3 — Actor-critic network (src/networks.py)

Implemented a shared-CNN actor-critic following Atari DQN topology: three conv layers (32/64/64 channels with 8/4/3 kernels and 4/2/1 strides) plus a 512-unit FC layer, branching into a 5-logit actor head and a scalar critic head. All layers use orthogonal initialisation (gain √2 hidden, 0.01 actor, 1.0 critic). Total parameters: 1,687,206. Verified initial entropy is ln(5) ≈ 1.6094 (uniform policy).
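
A self-contained sketch consistent with that description (class and helper names are illustrative, not the actual src/networks.py code); summing p.numel() over its parameters reproduces the 1,687,206 figure:

    import torch.nn as nn

    def ortho(layer, gain):
        # orthogonal weights, zero bias, per the initialisation scheme above
        nn.init.orthogonal_(layer.weight, gain)
        nn.init.constant_(layer.bias, 0.0)
        return layer

    class ActorCritic(nn.Module):
        def __init__(self, n_actions=5):
            super().__init__()
            g = 2 ** 0.5
            self.body = nn.Sequential(
                ortho(nn.Conv2d(4, 32, 8, stride=4), g), nn.ReLU(),
                ortho(nn.Conv2d(32, 64, 4, stride=2), g), nn.ReLU(),
                ortho(nn.Conv2d(64, 64, 3, stride=1), g), nn.ReLU(),
                nn.Flatten(),
                ortho(nn.Linear(64 * 7 * 7, 512), g), nn.ReLU(),
            )
            self.actor = ortho(nn.Linear(512, n_actions), 0.01)  # 5 logits
            self.critic = ortho(nn.Linear(512, 1), 1.0)          # scalar value

        def forward(self, obs):
            h = self.body(obs.float() / 255.0)  # uint8 -> [0, 1]
            return self.actor(h), self.critic(h).squeeze(-1)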

Step 4 — Rollout buffer + GAE (src/vec_rollout_buffer.py)

Implemented a vectorised rollout buffer of shape (n_steps, n_envs, ...) storing observations as uint8 (a 4× memory saving versus float32). GAE uses the standard backward recursion Â_t = δ_t + γλ(1 − d_{t+1}) Â_{t+1}, with δ_t = r_t + γ(1 − d_{t+1}) V(s_{t+1}) − V(s_t) and a bootstrap value from a critic forward pass on the post-rollout state. Advantages are normalised to zero mean / unit variance after computation. Verified with synthetic rollouts.
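
A NumPy sketch of the recursion under the buffer layout above (function name illustrative; dones[t] is a float flag marking whether the step-t transition ended the episode, i.e. it plays the role of d_{t+1}):

    import numpy as np

    def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
        # rewards, values, dones: (n_steps, n_envs); last_value: (n_envs,) bootstrap
        adv = np.zeros_like(rewards)
        gae = np.zeros_like(last_value)
        next_value = last_value
        for t in reversed(range(rewards.shape[0])):
            not_done = 1.0 - dones[t]
            delta = rewards[t] + gamma * next_value * not_done - values[t]
            gae = delta + gamma * lam * not_done * gae  # the backward recursion
            adv[t] = gae
            next_value = values[t]
        returns = adv + values
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)   # normalise after computation
        return adv, returns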

Step 5 — PPO agent (src/ppo_agent.py)

Implemented PPOAgent with the clipped surrogate objective, batched act_batch and evaluate_value_batch for vectorised rollouts, and update_vec performing 10 mini-batch update epochs per rollout. Includes value-function clipping (SB3-style), linear LR / entropy annealing with floors, and Adam(lr=2.5e-4, eps=1e-5) per "The 37 Implementation Details of Proximal Policy Optimization". Verified that the PPO loss is finite and that diagnostics (approximate KL, clip fraction) stay within healthy ranges on a small synthetic rollout.
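
A sketch of the per-minibatch losses and diagnostics (tensor names are hypothetical; the real bookkeeping lives in update_vec):

    import torch

    def ppo_losses(logp_new, logp_old, adv, v_new, v_old, returns,
                   clip_eps=0.2, vf_clip=0.2):
        ratio = torch.exp(logp_new - logp_old)
        # clipped surrogate policy objective
        pg_loss = -torch.min(ratio * adv,
                             torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()
        # SB3-style clipping of the value prediction around the rollout-time estimate
        v_clipped = v_old + torch.clamp(v_new - v_old, -vf_clip, vf_clip)
        vf_loss = torch.max((v_new - returns) ** 2,
                            (v_clipped - returns) ** 2).mean()
        with torch.no_grad():
            approx_kl = (logp_old - logp_new).mean()                   # KL diagnostic
            clip_frac = ((ratio - 1).abs() > clip_eps).float().mean()  # clip fraction
        return pg_loss, vf_loss, approx_kl, clip_frac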

Step 6 — Training entrypoint (train_vec.py) + smoke tests

Implemented the full training driver using gymnasium.vector.AsyncVectorEnv with 8 parallel workers, tuned to ~95–130 environment steps per second on the RTX 4060 Laptop GPU. Exposes all hyperparameters via argparse, supports linear annealing of the LR and entropy coefficient, an optional reward floor, and TensorBoard logging. Smoke tests at 50K and 20K steps confirmed a positive learning trajectory before the main run.
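
The vectorised setup amounts to something like this (reusing the hypothetical make_env sketch from Step 2; AsyncVectorEnv itself is the real Gymnasium API):

    import gymnasium as gym

    # 8 parallel worker processes, each running the wrapped environment (Step 2 sketch)
    envs = gym.vector.AsyncVectorEnv([lambda i=i: make_env(seed=i) for i in range(8)])
    obs, info = envs.reset()  # obs batch: (8, 4, 84, 84) uint8

    def linear_anneal(start, floor, update_idx, n_updates):
        # linear decay to a floor, applied per iteration to the LR and entropy coef
        return max(start * (1.0 - update_idx / n_updates), floor)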

Step 7 — Main training: vec_main_v3 (1.5M steps, ≈ 4h 23m)

Final production training: 8 parallel envs, 256 steps per env per rollout, batch size 64, 10 epochs, γ=0.99, λ=0.95, clip=0.2, ent_floor=0.005, reward floor at −1.0, and linear LR / entropy annealing. The final 100-episode running mean reached +843. Saved 36 checkpoints; selected iter_0700.pt (training step ≈ 1.43M) as the submission via held-out per-checkpoint evaluation.

Step 8 — Evaluation (evaluate.py, notebooks/03_evaluate.ipynb)

Built src/eval_utils.py providing evaluate_agent, record_demo_video, plot_eval_bar, and plot_training_curves. The final 20-episode evaluation on unseen seeds (1000–1019) yielded mean 830.17 ± 104.79, min 436.81, max 914.90.
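
The core evaluation loop is along these lines (the agent's deterministic-action method is a hypothetical name; the real helpers are in src/eval_utils.py):

    import numpy as np

    def evaluate_agent(agent, n_episodes=20, first_seed=1000):
        returns = []
        for seed in range(first_seed, first_seed + n_episodes):
            env = make_env(seed=seed)           # wrapped env, as sketched in Step 2
            obs, _ = env.reset(seed=seed)
            done, ep_ret = False, 0.0
            while not done:
                action = agent.act_greedy(obs)  # hypothetical: argmax over policy logits
                obs, r, terminated, truncated, _ = env.step(action)
                ep_ret += float(r)
                done = terminated or truncated
            returns.append(ep_ret)
            env.close()
        return float(np.mean(returns)), float(np.std(returns))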

Step 9 — SB3 baseline (train_sb3_baseline.py)

Trained Stable-Baselines3 PPO with matched core hyperparameters for 500K steps as a production-grade reference. Final 20-episode evaluation: mean 664.32 ± 173.93, min 309.40. Our custom implementation outperforms on mean (+25%), std (−40%, i.e. lower spread), and min (+41%).
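
The matched-hyperparameter baseline reduces to roughly the following (a sketch; the exact plumbing in train_sb3_baseline.py may differ):

    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import SubprocVecEnv

    venv = SubprocVecEnv([lambda i=i: make_env(seed=i) for i in range(8)])  # Step 2 sketch
    model = PPO("CnnPolicy", venv,
                n_steps=256, batch_size=64, n_epochs=10,
                gamma=0.99, gae_lambda=0.95, clip_range=0.2,
                learning_rate=2.5e-4, tensorboard_log="runs/sb3_baseline")
    model.learn(total_timesteps=500_000)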

Step 10 — Negative-result ablations (4 attempts)

Further refinements drawn from the PPO literature were attempted and documented as instructive failures (see docs/issues_and_fixes.md §4):

  • KL early stopping triggered in 80% of iterations under our larger effective batch size
  • RAD-style observation augmentation collapsed the policy at step 258K
  • γ=0.995 + 5M steps reproduced the same collapse mechanism at step 278K

The original v3 configuration is the submitted production model.

Final deliverables

  • models/ppo_final.pt — submitted model (1.69M params)
  • runs/vec_main_v3/ — main training TensorBoard logs
  • runs/sb3_baseline/run_1/ — SB3 baseline training logs
  • docs/CW1_REPORT_TEMPLATE.docx — Word source for the report PDF
  • docs/demo.mp4 — agent demo on seed 117 (return 925, 187 wrapped steps)