Serendipity fb09e66d09 feat: restructure the project and add vectorised PPO training and evaluation scripts
- Refactored the original single-environment training code into a modular structure and added vectorised-environment support to speed up data collection
- Implemented a complete PPO training pipeline, including a shared-CNN Actor-Critic network, a vectorised experience buffer, and GAE advantage estimation
- Added a training script (train_vec.py), an evaluation script (evaluate.py), and an SB3 baseline comparison script (train_sb3_baseline.py)
- Provided detailed documentation and a development log, including records of problems solved and experiment analysis
- Removed legacy project files and unified the project layout under the CW1_id_name directory
2026-05-02 13:44:08 +08:00


# CW1: PPO on CarRacing-v3

XJTLU DTS307TC Reinforcement Learning Coursework 1. A from-scratch PyTorch implementation of Proximal Policy Optimization (PPO) that learns to play the Gymnasium CarRacing-v3 environment using a discrete 5-action space. Stable-Baselines3 is not used in the main implementation (only as an external baseline for the comparison plot).

## Author

- Name: <your name>
- Student ID: <your ID>

## Environment

- Python 3.9 / 3.10
- PyTorch 2.7.0 (CUDA 12.8)
- Tested on an NVIDIA RTX 4060 Laptop GPU (8 GB)
- Windows 11 (Linux/macOS untested but should work)

## Setup

```powershell
pip install -r requirements.txt
```

If box2d-py fails to compile on Windows:

```powershell
pip install swig
pip install Box2D
pip install gymnasium
```

## Project Structure

```
CW1_xxx/
├── README.md
├── requirements.txt
├── train.py                    Single-env training entry (legacy)
├── train_vec.py                Vectorised-env training entry (recommended)
├── train_sb3_baseline.py       SB3 PPO baseline for comparison only
├── evaluate.py                 CLI evaluation: returns + plots + video
├── src/
│   ├── env_wrappers.py         SkipFrame, GrayScaleResize, FrameStack
│   ├── vec_env_wrappers.py     Vectorised env factory
│   ├── networks.py             Shared-CNN ActorCritic
│   ├── rollout_buffer.py       Single-env rollout buffer + GAE
│   ├── vec_rollout_buffer.py   Vectorised rollout buffer + GAE
│   ├── ppo_agent.py            PPO-Clip agent (act, update, schedule)
│   ├── eval_utils.py           Evaluation / plotting / video helpers
│   └── utils.py                set_seed, format_seconds
├── notebooks/
│   ├── 01_explore_env.ipynb    Environment exploration
│   ├── 02_test_network.ipynb   Network sanity checks
│   ├── 03_test_buffer.ipynb    Buffer + GAE sanity checks
│   ├── 04_test_ppo.ipynb       PPO update sanity checks
│   └── 05_evaluate.ipynb       Trained-agent evaluation (thin wrapper)
├── models/
│   └── vec_main/final.pt       Trained agent (submitted)
├── runs/                       TensorBoard logs (one subdir per experiment)
└── docs/                       Per-step technical reports
```

## Training (recommended: vectorised)

A 500K-step run takes roughly 2.5 h on a single RTX 4060 Laptop GPU.
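Each training iteration collects 512 steps per environment (2048 samples in total across 4 envs), computes GAE advantages, then runs 10 PPO epochs over minibatches of 64. The GAE recursion at the heart of this is small enough to sketch here; this is a simplified single-environment NumPy illustration (the function name and signature are made up for this sketch), not the project's actual buffer code in `src/vec_rollout_buffer.py`:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation, single-env sketch.

    rewards, values, dones: length-T arrays from one rollout.
    last_value: V(s_T), used to bootstrap the final step.
    Returns (advantages, returns); returns = advantages + values
    are the regression targets for the value head.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        # GAE: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values
    return advantages, returns
```

A vectorised buffer presumably runs the same backward recursion once per environment; only the bookkeeping changes.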
Launch the run with:

```powershell
python train_vec.py --n-envs 4 --total-steps 500000 --run-name vec_main --anneal-lr --anneal-ent --reward-clip 1.0
```

Key flags:

- `--n-envs 4`: parallel environments (async, multi-process)
- `--anneal-lr`: linear LR decay to 0
- `--anneal-ent`: linear entropy-coefficient decay to 0
- `--reward-clip 1.0`: floor the per-frame reward at -1.0

## Single-environment training (legacy)

```powershell
python train.py --total-steps 500000 --run-name main
```

## Monitoring

In a separate PowerShell:

```powershell
tensorboard --logdir=runs --port=6006
```

Open http://localhost:6006 and tick whichever runs you want to compare.

## Evaluation

```powershell
# Numerical eval + bar chart + training curves
python evaluate.py --ckpt models/vec_main/final.pt

# Same, plus a demo mp4 written to docs/demo.mp4
python evaluate.py --ckpt models/vec_main/final.pt --video

# Deterministic-policy evaluation (argmax over logits)
python evaluate.py --ckpt models/vec_main/final.pt --deterministic
```

Outputs land in docs/:

- `eval_summary.json`: per-episode returns + mean / std
- `fig_eval_bar.png`: bar chart of evaluation returns
- `fig_training_curves.png`: 6-panel training curves (overlays available runs)
- `demo.mp4` (if `--video`)

The notebook notebooks/05_evaluate.ipynb is a thin wrapper around the same helpers in src/eval_utils.py.

## SB3 baseline (optional, for the comparison plot)

```powershell
python train_sb3_baseline.py --total-steps 500000 --run-name sb3_baseline
```

After this run finishes, re-running evaluate.py will automatically include the SB3 curve in fig_training_curves.png if runs/sb3_baseline exists.

## Key hyperparameters (vec_main run)

| Param | Value | Source |
|-------|-------|--------|
| Total steps | 500,000 | |
| Parallel envs | 4 | AsyncVectorEnv |
| Rollout per env | 512 | total per-iter samples = 2048 |
| Update epochs | 10 | PPO paper |
| Minibatch | 64 | PPO Atari |
| Learning rate | 2.5e-4 → 0 (linear) | annealed |
| Adam eps | 1e-5 | "37 details" |
| γ (discount) | 0.99 | |
| λ (GAE) | 0.95 | |
| ε (clip) | 0.2 | PPO paper |
| c1 (vf) | 0.5 | |
| c2 (ent) | 0.01 → 0 (linear) | annealed |
| max-grad-norm | 0.5 | |
| Reward floor | -1.0 | |

## Notes

- src/ contains no Stable-Baselines3 imports. SB3 is referenced only in train_sb3_baseline.py and is used purely for the external comparison plot in the report.
- train_vec.py requires the `if __name__ == "__main__"` guard at the bottom (already present) for AsyncVectorEnv to work on Windows.
- See docs/issues_and_fixes.md for a log of practical issues encountered during development and how they were resolved.

## License & academic integrity

This is an individual coursework submission. Any external code (e.g. inspiration from CleanRL or the PPO papers) is referenced in the report. No RL-specific libraries (Stable-Baselines3, RLlib, Tianshou, etc.) are used in the main src/ implementation.
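As a closing reference, the clipped surrogate term that a PPO-Clip update step is built around uses the ε = 0.2 from the hyperparameter table above. A minimal per-sample sketch (a standalone illustration with an invented function name, not the code in `src/ppo_agent.py`, which works on batched tensors):

```python
import math

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Per-sample PPO-Clip policy objective (to be maximised).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probs
    for numerical stability. Clipping the ratio to
    [1 - eps, 1 + eps] and taking the min makes large policy
    steps unprofitable in either direction.
    """
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

The full loss in the table's notation would also subtract c1 times the value loss and add c2 times the entropy bonus.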