# PPO for CarRacing-v3

A from-scratch PPO implementation for CarRacing-v3 in PyTorch. No Stable-Baselines or other RL libraries are used.

## Setup
```bash
conda activate my_env
uv pip install -r requirements.txt
```

## Train
```bash
python train.py --steps 500000
```

## Evaluate
```bash
python src/evaluate.py --model models/ppo_carracing_final.pt --episodes 10
```

## TensorBoard
```bash
tensorboard --logdir logs/tensorboard
```

## Project Structure
```
src/
├── network.py        # Actor (Gaussian policy) and Critic (value) networks
├── replay_buffer.py  # Rollout buffer with GAE computation
├── trainer.py        # PPO update with clipped surrogate objective
├── utils.py          # Environment wrappers (grayscale, resize, frame stack)
└── evaluate.py       # Evaluation script
train.py              # Main training entry point
models/               # Saved checkpoints
logs/tensorboard/     # TensorBoard logs
```
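
The `replay_buffer.py` entry above mentions GAE. As a rough illustration of that computation (not the repository's actual code), a minimal pure-Python sketch might look like the following. The function name `compute_gae`, its arguments, and the list-based layout are assumptions made for this example; the defaults reuse the Gamma and GAE lambda values from the Hyperparameters table. It also assumes `dones[t]` marks a true episode end, so time-limit truncation would need separate handling.

```python
from typing import List, Sequence, Tuple


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    last_value: float,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Return (advantages, returns) for one rollout of length T.

    values[t] is the critic estimate V(s_t); last_value is the bootstrap
    estimate for the state reached after the final step. dones[t] is True
    when the episode ended at step t. All names here are hypothetical.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the advantage of the next one.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Discounted sum of TD errors, reset at episode boundaries.
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Critic targets are the advantage plus the value baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Example call on a two-step rollout that terminates at the second step.
adv, ret = compute_gae(
    rewards=[1.0, 1.0], values=[0.5, 0.5], dones=[False, True], last_value=0.0
)
```

The backward loop keeps the cost linear in the rollout length, and multiplying by `not_done` stops value and advantage from leaking across episode boundaries inside a single rollout.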
## Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning rate | 3e-4 |
| Gamma | 0.99 |
| GAE lambda | 0.95 |
| Clip epsilon | 0.2 |
| PPO epochs | 4 |
| Mini-batch size | 64 |
| Rollout steps | 2048 |
| Entropy coefficient | 0.01 |
| Value coefficient | 0.5 |
| Max gradient norm | 0.5 |
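
To show how these values fit together, here is a small standard-library sketch, not taken from `trainer.py`. The `PPOConfig` class, its field names, and the helper functions are hypothetical. The loss is written per sample with plain floats for readability, and it assumes a squared-error value loss; the actual trainer may differ, for example by using a clipped value loss.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Values copied from the table above; field names are illustrative."""
    learning_rate: float = 3e-4
    gamma: float = 0.99
    gae_lambda: float = 0.95
    clip_epsilon: float = 0.2
    ppo_epochs: int = 4
    mini_batch_size: int = 64
    rollout_steps: int = 2048
    entropy_coef: float = 0.01
    value_coef: float = 0.5
    max_grad_norm: float = 0.5


def updates_per_rollout(cfg: PPOConfig) -> int:
    """Gradient steps per rollout: epochs times mini-batches per epoch."""
    mini_batches = math.ceil(cfg.rollout_steps / cfg.mini_batch_size)
    return cfg.ppo_epochs * mini_batches


def clipped_surrogate(ratio: float, advantage: float, clip_epsilon: float) -> float:
    """Per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_loss(
    ratio: float,
    advantage: float,
    value_error: float,
    entropy: float,
    cfg: PPOConfig,
) -> float:
    """Scalar loss to minimize for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); value_error is V(s) minus the
    return target. Squared-error value loss is an assumption.
    """
    policy_loss = -clipped_surrogate(ratio, advantage, cfg.clip_epsilon)
    value_loss = value_error ** 2
    # The entropy bonus is subtracted because the loss is minimized.
    return policy_loss + cfg.value_coef * value_loss - cfg.entropy_coef * entropy


cfg = PPOConfig()
steps = updates_per_rollout(cfg)
```

With the table's defaults, each rollout of 2048 steps is split into 32 mini-batches of 64, so 4 epochs give 128 gradient steps per rollout. The clip keeps a sample from contributing extra objective once its probability ratio leaves the range 0.8 to 1.2 in the direction the advantage favors.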