# PPO for CarRacing-v3
A from-scratch PPO implementation for CarRacing-v3. No Stable-Baselines or other RL libraries are used.
## Setup

```bash
conda activate my_env
uv pip install -r requirements.txt
```
## Train

```bash
python train.py --steps 500000
```
## Evaluate

```bash
python src/evaluate.py --model models/ppo_carracing_final.pt --episodes 10
```
## TensorBoard

```bash
tensorboard --logdir logs/tensorboard
```
## Project Structure

```
src/
├── network.py        # Actor (Gaussian policy) and Critic (value) networks (sketched below)
├── replay_buffer.py  # Rollout buffer with GAE computation (sketched below)
├── trainer.py        # PPO update with clipped surrogate objective (sketched below)
├── utils.py          # Environment wrappers: grayscale, resize, frame stack (sketched below)
└── evaluate.py       # Evaluation script
train.py              # Main training entry point
models/               # Saved checkpoints
logs/tensorboard/     # TensorBoard logs
```
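`network.py` holds the policy and value networks. Below is a minimal sketch of the two heads, assuming a shared CNN encoder produces a flat feature vector and the actor uses a state-independent learned log-std, a common choice for continuous-control PPO; the repo's actual encoder and head layout may differ:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Diagonal Gaussian policy head: features -> Normal(mean, std)."""
    def __init__(self, feature_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Linear(feature_dim, action_dim)
        # State-independent log-std, one per action dimension (assumption).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, features: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean(features)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)

class Critic(nn.Module):
    """State-value head: features -> scalar V(s)."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.value(features).squeeze(-1)
```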
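`replay_buffer.py` computes advantages with Generalized Advantage Estimation. A sketch of the backward recursion, assuming the buffer stores per-step rewards, value estimates, and done flags as NumPy arrays (function and argument names here are illustrative, not the repo's API):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Backward pass over one rollout:
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    with bootstrapping cut wherever an episode ended (dones[t] == 1)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values  # regression targets for the value loss
    return advantages, returns
```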
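`trainer.py` applies the clipped surrogate objective. A sketch of the per-mini-batch loss using the coefficients from the hyperparameter table below; note that the log-probabilities are summed over action dimensions so each sample contributes a single scalar (names are illustrative):

```python
import torch

def ppo_loss(dist, actions, old_log_probs, advantages, values, returns,
             clip_eps=0.2, ent_coef=0.01, vf_coef=0.5):
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
    new_log_probs = dist.log_prob(actions).sum(-1)  # sum over action dims -> (batch,)
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Normalize advantages per mini-batch (a common stabilizer, assumption here).
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Clipped surrogate: pessimistic minimum of unclipped and clipped objectives.
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -torch.min(surr1, surr2).mean()

    value_loss = (returns - values).pow(2).mean()
    entropy = dist.entropy().sum(-1).mean()
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```

In the full update loop, gradients would also be clipped with `torch.nn.utils.clip_grad_norm_` to the max gradient norm listed below before each optimizer step.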
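`utils.py` preprocesses observations before they reach the networks. A sketch of an equivalent pipeline built from Gymnasium's stock wrappers; wrapper names are as of Gymnasium 1.x, the 84×84 resize and 4-frame stack are assumptions, and the repo may implement these steps by hand:

```python
import gymnasium as gym
from gymnasium.wrappers import (GrayscaleObservation, ResizeObservation,
                                FrameStackObservation)

def make_carracing_env(render_mode=None):
    env = gym.make("CarRacing-v3", continuous=True, render_mode=render_mode)
    env = GrayscaleObservation(env)          # (96, 96, 3) -> (96, 96)
    env = ResizeObservation(env, (84, 84))   # downsample (size is an assumption)
    env = FrameStackObservation(env, 4)      # stack 4 frames -> (4, 84, 84),
    return env                               # already channel-first for PyTorch
```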
## Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 3e-4 |
| Gamma | 0.99 |
| GAE lambda | 0.95 |
| Clip epsilon | 0.2 |
| PPO epochs | 4 |
| Mini-batch size | 64 |
| Rollout steps | 2048 |
| Entropy coefficient | 0.01 |
| Value coefficient | 0.5 |
| Max gradient norm | 0.5 |
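For illustration, the same values collected into a hypothetical config object; `train.py` may expose them as CLI flags instead:

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    learning_rate: float = 3e-4
    gamma: float = 0.99          # discount factor
    gae_lambda: float = 0.95
    clip_eps: float = 0.2
    ppo_epochs: int = 4          # optimization passes over each rollout
    minibatch_size: int = 64
    rollout_steps: int = 2048    # env steps collected per update
    ent_coef: float = 0.01
    vf_coef: float = 0.5
    max_grad_norm: float = 0.5
```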