feat: restructure the project and add vectorized PPO training and evaluation scripts

- Refactor the original single-environment training code into a modular structure and add vectorized-environment support to speed up data collection
- Implement a complete PPO training pipeline, including a shared-CNN actor-critic network, a vectorized rollout buffer, and GAE advantage estimation
- Add a training script (train_vec.py), an evaluation script (evaluate.py), and an SB3 baseline comparison script (train_sb3_baseline.py)
- Provide detailed documentation and a development log, including problem-solving notes and experiment analysis
- Remove the legacy project files and consolidate the project structure under the CW1_id_name directory
Commit fb09e66d09 (parent 79ffb90823), 2026-05-02 13:44:08 +08:00
80 changed files with 2971 additions and 4822 deletions
@@ -1,57 +0,0 @@
# PPO for CarRacing-v3
From-scratch PPO implementation for CarRacing-v3. No Stable-Baselines or other RL libraries used.
## Setup
```bash
conda activate my_env
uv pip install -r requirements.txt
```
## Train
```bash
python train.py --steps 500000
```
## Evaluate
```bash
python src/evaluate.py --model models/ppo_carracing_final.pt --episodes 10
```
## TensorBoard
```bash
tensorboard --logdir logs/tensorboard
```
## Project Structure
```
src/
├── network.py # Actor (Gaussian policy) and Critic (Value) networks
├── replay_buffer.py # Rollout buffer with GAE computation
├── trainer.py # PPO update with clipped surrogate objective
├── utils.py # Environment wrappers (grayscale, resize, frame stack)
└── evaluate.py # Evaluation script
train.py # Main training entry point
models/ # Saved checkpoints
logs/tensorboard/ # TensorBoard logs
```
## Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning rate | 3e-4 |
| Gamma | 0.99 |
| GAE lambda | 0.95 |
| Clip epsilon | 0.2 |
| PPO epochs | 4 |
| Mini-batch size | 64 |
| Rollout steps | 2048 |
| Entropy coefficient | 0.01 |
| Value coefficient | 0.5 |
| Max gradient norm | 0.5 |
@@ -1,157 +0,0 @@
# PPO + CarRacing-v3 Task Progress Tracker
> Generated: 2026/04/30
---
## Assignment Requirements
Implement the PPO algorithm from scratch in Python, train an agent in the CarRacing-v3 environment, and submit:
- Technical report (≤3000 words, in English) as a PDF
- Source code + trained model in a zip file
- Deadline: 04/May/2026 23:59
- **Not allowed:** RL-specific libraries such as Stable-Baselines
- **Allowed:** TensorBoard, PyTorch, Gymnasium
---
## 1. Completed ✅
| Step | Description | Files |
|------|------|------|
| ✅ Project structure | src/ directory, requirements.txt, README.md | [requirements.txt](requirements.txt), [README.md](README.md) |
| ✅ Policy/value networks | Actor (Gaussian policy outputting μ, σ) + Critic implementation, CNN-based | [src/network.py](src/network.py) |
| ✅ Rollout buffer | Trajectory storage + GAE advantage estimation + return computation | [src/replay_buffer.py](src/replay_buffer.py) |
| ✅ PPO trainer | PPO update (clipped objective + entropy regularization + value loss) | [src/trainer.py](src/trainer.py) |
| ✅ Environment preprocessing | Grayscale + resize (84×84) + frame-stack (4 frames) wrappers | [src/utils.py](src/utils.py) |
| ✅ Evaluation script | Rendering test + multi-episode average-return evaluation | [src/evaluate.py](src/evaluate.py) |
| ✅ Training entry point | Main training loop, TensorBoard logging, model checkpointing | [train.py](train.py) |
| ✅ Parallel training | Multi-environment parallel data collection + WSL support | [train_parallel.py](train_parallel.py) |
| ✅ WSL scripts | Environment setup + launch scripts | [setup_wsl.sh](setup_wsl.sh), [run_wsl.sh](run_wsl.sh), [start_wsl_training.bat](start_wsl_training.bat) |
| ✅ Test script | Quick sanity check of the parallel environments and networks | [test_parallel.py](test_parallel.py) |
**Core algorithm implementation notes** (sketched in code below)
- Policy network: 3-layer CNN + FC(512) → μ, σ (Gaussian policy, tanh activation on the mean)
- Value network: 3-layer CNN + FC(512) → V(s)
- GAE: λ=0.95, with advantage normalization
- PPO clipping: ε=0.2, 4 epochs per update, mini-batch size 64
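
A minimal PyTorch sketch of the actor described above (illustrative only — `src/network.py` itself is not shown in this diff, and the class name here is hypothetical):

```python
import torch
import torch.nn as nn

class ActorSketch(nn.Module):
    """3-layer CNN + FC(512) -> Gaussian policy head (mu via tanh, sigma via exp(log_std))."""
    def __init__(self, in_channels=4, action_dim=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),  # 84 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),           # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),           # 9  -> 7
        )
        self.fc = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU())
        self.mu_head = nn.Linear(512, action_dim)
        self.log_std_head = nn.Linear(512, action_dim)

    def forward(self, x):                      # x: (B, 4, 84, 84) stacked frames
        x = self.conv(x / 255.0)               # scale pixel values to [0, 1]
        x = self.fc(x.flatten(start_dim=1))
        mu = torch.tanh(self.mu_head(x))                      # action mean in [-1, 1]
        log_std = torch.clamp(self.log_std_head(x), -20, 2)   # keep sigma numerically sane
        return mu, log_std.exp()
```

The critic follows the same backbone but ends in a single linear unit producing V(s).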
---
## 2. To Do ⬜
| Step | Description | Priority |
|------|------|--------|
| ⬜ Install dependencies | `uv pip install --system -r requirements.txt` | **High** |
| ⬜ Smoke test | Short run (~10,000 steps) to verify the code works end to end | **High** |
| ⬜ Full training | Run 500k+ steps, estimated 5-8 hours (background run) | **High (time-consuming)** |
| ⬜ Generate plots | Extract data from TensorBoard and plot with matplotlib | Medium |
| ⬜ Write report | English technical report (≤3000 words), typeset in LaTeX | Medium |
| ⬜ Compile PDF | Build CW1_1234560.pdf with XeLaTeX | Medium |
| ⬜ Package zip | Bundle source code + model into CW1_1234560.zip | Low |
---
## 3. File Structure
```
强化学习个人项目报告/
├── src/
│ ├── __init__.py
│   ├── network.py # Actor + Critic CNN networks
│ ├── replay_buffer.py # Rollout buffer + GAE
│   ├── trainer.py # PPO update logic
│   ├── utils.py # environment preprocessing wrappers
│   └── evaluate.py # evaluation script
├── train.py # single-process training entry point
├── train_parallel.py # multi-environment parallel training (recommended)
├── setup_wsl.sh # WSL environment setup
├── run_wsl.sh # WSL training launch script
├── start_wsl_training.bat # one-click WSL training launcher for Windows
├── test_parallel.py # parallel training test
├── requirements.txt
├── README.md
├── WSL_README.md # WSL training guide
└── TASK_PROGRESS.md # this document
```
---
## 4. Hyperparameter Configuration
| Parameter | train.py (single-process) | train_parallel.py (parallel) |
|------|-------------------|--------------------------|
| Learning rate | 3e-4 | 3e-4 |
| Gamma | 0.99 | 0.99 |
| GAE lambda | 0.95 | 0.98 |
| Clip epsilon | 0.2 | 0.1 |
| PPO epochs | 4 | 10 |
| Mini-batch size | 64 | 128 |
| Rollout steps | 2048 | 2048 |
| Entropy coefficient | 0.01 | 0.005 |
| Value coefficient | 0.5 | 0.75 |
| Max gradient norm | 0.5 | 0.5 |
| Total steps | 500,000 | 2,000,000 |
| Number of environments | 1 | 4 |
| Estimated duration | ~8h | ~5h (4×) |
---
## 5. Next Steps
### Option A: Parallel training on WSL (recommended)
```bash
# On Windows, double-click start_wsl_training.bat
# or manually:
wsl
cd "/mnt/d/Code/doing_exercises/programs/外教作业外快/强化学习个人项目报告"
chmod +x setup_wsl.sh run_wsl.sh
./setup_wsl.sh   # first run only
./run_wsl.sh     # start training
```
### Option B: Single-process training on Windows
```bash
# 1. Install dependencies
uv pip install --system -r requirements.txt
# 2. Verify the code runs (short test)
python train.py --steps 10000
# 3. Start the full training run (in the background, estimated 5-8 hours)
python train.py --steps 500000
```
### After training
```bash
# TensorBoard visualization
tensorboard --logdir logs/tensorboard
# Evaluate the model
python src/evaluate.py --model models/ppo_carracing_final.pt --episodes 10
```
### After the report is written
```bash
# Compile the PDF
cd tex && xelatex CW1_1234560.tex
```
---
## 6. Report Structure (≤3000 words)
1. **Introduction** — RL background, the CarRacing-v3 task, definition of the state/action/reward spaces
2. **Methodology** — PPO formulation, the clipping mechanism, GAE advantage estimation
3. **Implementation Details** — network architecture, training pipeline, hyperparameters, problems and solutions
4. **Results and Analysis** — training curves, evaluation results, comparison with the SB3 baseline
5. **Conclusion** — summary of PPO's sensitivity and the effectiveness of the actor-critic approach
---
## 7. Submission Checklist
- [ ] `CW1_1234560.pdf` — technical report (cover page + ≤3000 words)
- [ ] `CW1_1234560.zip` — source code + trained model .pt files
- [ ] All code comments in English
- [ ] Figure axes and legends in English
@@ -1,107 +0,0 @@
"""Generate training plots from TensorBoard logs."""
import os
import numpy as np
from tensorboard.backend.event_processing import event_accumulator
import matplotlib.pyplot as plt
def extract_metrics(log_dir):
"""Extract metrics from TensorBoard log directory."""
ea = event_accumulator.EventAccumulator(log_dir)
ea.Reload()
metrics = {}
for tag in ea.Tags()['scalars']:
events = ea.Scalars(tag)
steps = [e.step for e in events]
values = [e.value for e in events]
metrics[tag] = {'steps': steps, 'values': values}
return metrics
def smooth(data, weight=0.6):
"""Exponential moving average for smoothing."""
smoothed = []
last = data[0]
for point in data:
smoothed_val = last * weight + (1 - weight) * point
smoothed.append(smoothed_val)
last = smoothed_val
return smoothed
def plot_training_curves(metrics, save_path):
"""Plot training curves."""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
episodes = metrics.get('Reward/Episode', {}).get('steps', [])
ep_rewards = metrics.get('Reward/Episode', {}).get('values', [])
avg_rewards = metrics.get('Reward/AvgLast10', {}).get('values', [])
if episodes and ep_rewards:
axes[0, 0].plot(episodes, ep_rewards, alpha=0.3, label='Episode Reward')
if avg_rewards:
axes[0, 0].plot(episodes, smooth(avg_rewards), 'r-', linewidth=2, label='Smoothed (EMA)')
axes[0, 0].set_xlabel('Training Steps')
axes[0, 0].set_ylabel('Episode Reward')
axes[0, 0].set_title('Training Episode Reward')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
eval_steps = metrics.get('Eval/MeanReturn', {}).get('steps', [])
eval_returns = metrics.get('Eval/MeanReturn', {}).get('values', [])
if eval_steps and eval_returns:
axes[0, 1].plot(eval_steps, eval_returns, 'g-', linewidth=2, marker='o', markersize=4)
axes[0, 1].set_xlabel('Episode')
axes[0, 1].set_ylabel('Mean Evaluation Return')
axes[0, 1].set_title('Evaluation Performance')
axes[0, 1].grid(True, alpha=0.3)
actor_loss_steps = metrics.get('Loss/Actor', {}).get('steps', [])
actor_losses = metrics.get('Loss/Actor', {}).get('values', [])
if actor_loss_steps and actor_losses:
axes[1, 0].plot(actor_loss_steps, smooth(actor_losses), 'b-', linewidth=1.5)
axes[1, 0].set_xlabel('Training Steps')
axes[1, 0].set_ylabel('Actor Loss')
axes[1, 0].set_title('Actor Loss Over Training')
axes[1, 0].grid(True, alpha=0.3)
critic_loss_steps = metrics.get('Loss/Critic', {}).get('steps', [])
critic_losses = metrics.get('Loss/Critic', {}).get('values', [])
if critic_loss_steps and critic_losses:
axes[1, 1].plot(critic_loss_steps, smooth(critic_losses), 'purple', linewidth=1.5)
axes[1, 1].set_xlabel('Training Steps')
axes[1, 1].set_ylabel('Critic Loss')
axes[1, 1].set_title('Critic Loss Over Training')
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(save_path, dpi=150, bbox_inches='tight')
plt.close()
print(f"Plots saved to {save_path}")
def main():
log_base = 'logs/tensorboard'
runs = sorted([d for d in os.listdir(log_base) if os.path.isdir(os.path.join(log_base, d))])
if not runs:
print("No runs found!")
return
latest_run = os.path.join(log_base, runs[-1])
print(f"Analyzing run: {runs[-1]}")
metrics = extract_metrics(latest_run)
plot_training_curves(metrics, 'training_curves.png')
print("\nExtracted metrics:")
for tag, data in metrics.items():
if data['values']:
values = np.array(data['values'])
print(f" {tag}: min={values.min():.2f}, max={values.max():.2f}, final={values[-1]:.2f}")
if __name__ == '__main__':
main()
@@ -1,39 +0,0 @@
[project]
name = "ppo-carracing"
version = "0.1.0"
description = "PPO (Proximal Policy Optimization) for CarRacing-v3 environment"
requires-python = ">=3.10"
dependencies = [
"torch>=2.0.0",
"gymnasium[box2d]>=0.29.0",
"numpy>=1.24.0",
"matplotlib>=3.7.0",
"tensorboard>=2.14.0",
"opencv-python>=4.8.0",
]
[project.optional-dependencies]
dev = ["pytest>=7.4.0", "black>=23.0.0", "ruff>=0.1.0"]
[project.scripts]
ppo-train = "train:main"
ppo-evaluate = "src.evaluate:main"
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.ruff]
line-length = 100
target-version = "py310"
[tool.ruff.lint]
select = ["E", "F", "I", "N", "W"]
ignore = ["E501"]
[tool.black]
line-length = 100
target-version = ["py310"]
[tool.hatch.build.targets.wheel]
packages = ["src"]
@@ -1,10 +0,0 @@
torch
gymnasium[box2d]
numpy
matplotlib
tensorboard
opencv-python-headless
# Optional: install with uv:
# curl -LsSf https://astral.sh/uv/install.sh | sh
# uv pip install -r requirements.txt
@@ -1,6 +0,0 @@
"""PPO Agent for CarRacing-v3 environment."""
from .network import Actor, Critic
from .replay_buffer import RolloutBuffer
from .trainer import PPOTrainer
__all__ = ['Actor', 'Critic', 'RolloutBuffer', 'PPOTrainer']
@@ -1,96 +0,0 @@
"""Evaluation script for trained PPO agent."""
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import torch
import numpy as np
import gymnasium as gym
from src.utils import make_env, get_device
from src.network import Actor, Critic
def evaluate(actor, env, num_episodes=10, device=torch.device("cpu")):
"""Evaluate actor and return average return."""
actor.eval()
returns = []
for ep in range(num_episodes):
obs, _ = env.reset()
obs = np.transpose(obs, (1, 2, 0)) # (C, H, W) -> (H, W, C) for storage
total_reward = 0
done = False
steps = 0
while not done and steps < 1000:
with torch.no_grad():
# Convert to tensor (B, C, H, W)
obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
mu, std = actor(obs_t)
# Sample action
dist = torch.distributions.Normal(mu, std)
action = dist.sample()
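# Note: evaluation here samples stochastically from the Gaussian policy; using the mean (mu)
# instead would give a deterministic evaluation, as done in train.py's in-training eval loop.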
action = torch.clamp(action, -1, 1).squeeze(0).cpu().numpy()
obs, reward, terminated, truncated, _ = env.step(action)
# Convert to (C, H, W) format
obs = np.transpose(obs, (1, 2, 0))
total_reward += reward
done = terminated or truncated
steps += 1
returns.append(total_reward)
print(f"Episode {ep+1}/{num_episodes}: return={total_reward:.1f}, steps={steps}")
actor.train()
return np.mean(returns), np.std(returns)
def evaluate_render(actor, env, device):
"""Render and evaluate agent with visualization."""
actor.eval()
obs, _ = env.reset()
obs = np.transpose(obs, (1, 2, 0))
env.render_mode = "human"
done = False
total_reward = 0
while not done:
with torch.no_grad():
obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
mu, std = actor(obs_t)
dist = torch.distributions.Normal(mu, std)
action = dist.sample()
action = torch.clamp(action, -1, 1).squeeze(0).cpu().numpy()
obs, reward, terminated, truncated, _ = env.step(action)
obs = np.transpose(obs, (1, 2, 0))
total_reward += reward
done = terminated or truncated
env.render()
actor.train()
print(f"Final return: {total_reward:.1f}")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, required=True, help="Path to trained model")
parser.add_argument("--episodes", type=int, default=5, help="Number of evaluation episodes")
args = parser.parse_args()
device = get_device()
env = make_env()
actor = Actor().to(device)
critic = Critic().to(device)
# Load model
checkpoint = torch.load(args.model, map_location=device, weights_only=False)
actor.load_state_dict(checkpoint["actor"])
print(f"Loaded model from {args.model}")
mean_return, std_return = evaluate(actor, env, num_episodes=args.episodes, device=device)
print(f"\nEvaluation: mean={mean_return:.2f}, std={std_return:.2f}")
@@ -1,64 +0,0 @@
"""Rollout buffer for storing trajectories."""
import numpy as np
class RolloutBuffer:
"""Stores trajectories for PPO training."""
def __init__(self, buffer_size, state_shape, action_dim):
self.buffer_size = buffer_size
self.ptr = 0
self.size = 0
self.states = np.zeros((buffer_size, *state_shape), dtype=np.uint8)
self.actions = np.zeros((buffer_size, action_dim), dtype=np.float32)
self.rewards = np.zeros(buffer_size, dtype=np.float32)
self.dones = np.zeros(buffer_size, dtype=np.bool_)
self.values = np.zeros(buffer_size, dtype=np.float32)
self.log_probs = np.zeros(buffer_size, dtype=np.float32)
def add(self, state, action, reward, done, value, log_prob):
"""Add a transition to the buffer."""
self.states[self.ptr] = state
self.actions[self.ptr] = action
self.rewards[self.ptr] = reward
self.dones[self.ptr] = done
self.values[self.ptr] = value
self.log_probs[self.ptr] = log_prob
self.ptr = (self.ptr + 1) % self.buffer_size
self.size = min(self.size + 1, self.buffer_size)
def compute_returns(self, last_value, gamma=0.99, gae_lambda=0.95):
"""Compute returns and advantages using GAE."""
advantages = np.zeros(self.size, dtype=np.float32)
last_gae = 0
# Compute GAE backwards
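# (1 - done) zeroes both the bootstrap value and the GAE carry-over at episode boundaries.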
for t in reversed(range(self.size)):
if t == self.size - 1:
next_value = last_value
else:
next_value = self.values[t + 1]
delta = self.rewards[t] + gamma * next_value * (1 - self.dones[t]) - self.values[t]
last_gae = delta + gamma * gae_lambda * (1 - self.dones[t]) * last_gae
advantages[t] = last_gae
returns = advantages + self.values[:self.size]
return returns, advantages
def get(self):
"""Return all data as numpy arrays."""
return (
self.states[:self.size],
self.actions[:self.size],
self.rewards[:self.size],
self.dones[:self.size],
self.values[:self.size],
self.log_probs[:self.size],
)
def reset(self):
"""Reset buffer."""
self.ptr = 0
self.size = 0
@@ -1,120 +0,0 @@
"""PPO Trainer with GAE advantage estimation."""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
class PPOTrainer:
"""PPO trainer handling the training loop."""
def __init__(
self,
actor,
critic,
rollout_buffer,
device,
clip_eps=0.2,
gamma=0.99,
gae_lambda=0.95,
lr=3e-4,
ent_coef=0.01,
vf_coef=0.5,
max_grad_norm=0.5,
ppo_epochs=4,
mini_batch_size=64,
):
self.actor = actor
self.critic = critic
self.buffer = rollout_buffer
self.device = device
self.clip_eps = clip_eps
self.gamma = gamma
self.gae_lambda = gae_lambda
self.ent_coef = ent_coef
self.vf_coef = vf_coef
self.max_grad_norm = max_grad_norm
self.ppo_epochs = ppo_epochs
self.mini_batch_size = mini_batch_size
# Separate optimizers
self.actor_optim = optim.Adam(actor.parameters(), lr=lr)
self.critic_optim = optim.Adam(critic.parameters(), lr=lr)
self.loss_history = {'actor': [], 'critic': [], 'entropy': [], 'total': []}
def update(self, last_value):
"""Perform one PPO update."""
states, actions, rewards, dones, values, log_probs_old = self.buffer.get()
# Compute returns and advantages
returns, advantages = self.buffer.compute_returns(
last_value, self.gamma, self.gae_lambda
)
# Normalize advantages
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
# Convert to tensors (states: N, H, W, C -> N, C, H, W)
states_t = torch.from_numpy(states).float().permute(0, 3, 1, 2).to(self.device)
actions_t = torch.from_numpy(actions).float().to(self.device)
log_probs_old_t = torch.from_numpy(log_probs_old).float().to(self.device)
returns_t = torch.from_numpy(returns).float().to(self.device)
advantages_t = torch.from_numpy(advantages).float().to(self.device)
dataset = TensorDataset(states_t, actions_t, log_probs_old_t, returns_t, advantages_t)
loader = DataLoader(dataset, batch_size=self.mini_batch_size, shuffle=True)
total_actor_loss = 0
total_critic_loss = 0
total_entropy = 0
count = 0
for _ in range(self.ppo_epochs):
for batch in loader:
s, a, log_pi_old, ret, adv = batch
mu, std = self.actor(s)
dist = torch.distributions.Normal(mu, std)
log_pi = dist.log_prob(a).sum(dim=-1)
entropy = dist.entropy().sum(dim=-1)
ratio = torch.exp(log_pi - log_pi_old)
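# Clipped surrogate objective: take the pessimistic minimum of the unclipped and clipped terms.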
surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * adv
actor_loss = -torch.min(surr1, surr2).mean()
# Value loss
value = self.critic(s)
critic_loss = nn.MSELoss()(value.squeeze(), ret)
# Total loss
loss = actor_loss + self.vf_coef * critic_loss - self.ent_coef * entropy.mean()
# Update
self.actor_optim.zero_grad()
self.critic_optim.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(self.actor.parameters(), self.max_grad_norm)
nn.utils.clip_grad_norm_(self.critic.parameters(), self.max_grad_norm)
self.actor_optim.step()
self.critic_optim.step()
total_actor_loss += actor_loss.item()
total_critic_loss += critic_loss.item()
total_entropy += entropy.mean().item()
count += 1
avg_actor = total_actor_loss / count
avg_critic = total_critic_loss / count
avg_entropy = total_entropy / count
self.loss_history['actor'].append(avg_actor)
self.loss_history['critic'].append(avg_critic)
self.loss_history['entropy'].append(avg_entropy)
self.loss_history['total'].append(avg_actor + avg_critic)
self.buffer.reset()
return avg_actor, avg_critic, avg_entropy
@@ -1,87 +0,0 @@
"""Utility functions for environment, device detection, and TensorBoard."""
import gymnasium as gym
import numpy as np
import torch
from collections import deque
class GrayScaleWrapper(gym.ObservationWrapper):
"""Convert RGB observation to grayscale."""
def __init__(self, env):
super().__init__(env)
def observation(self, obs):
# RGB to grayscale: weighted average
gray = 0.299 * obs[:, :, 0] + 0.587 * obs[:, :, 1] + 0.114 * obs[:, :, 2]
return gray.astype(np.uint8)
class ResizeWrapper(gym.ObservationWrapper):
"""Resize observation to target size."""
def __init__(self, env, size=(84, 84)):
super().__init__(env)
self.size = size
def observation(self, obs):
import cv2
return cv2.resize(obs, self.size, interpolation=cv2.INTER_AREA)
class FrameStackWrapper(gym.ObservationWrapper):
"""Stack last N frames."""
def __init__(self, env, num_stack=4):
super().__init__(env)
self.num_stack = num_stack
self.frames = deque(maxlen=num_stack)
obs_shape = env.observation_space.shape
self.observation_space = gym.spaces.Box(
low=0, high=255,
shape=(num_stack, *obs_shape[-2:]),
dtype=np.uint8
)
def reset(self, **kwargs):
obs, info = self.env.reset(**kwargs)
for _ in range(self.num_stack):
self.frames.append(obs)
return self._get_observation(), info
def observation(self, obs):
self.frames.append(obs)
return self._get_observation()
def _get_observation(self):
return np.stack(list(self.frames), axis=0)
def make_env(env_id="CarRacing-v3", gray_scale=True, resize=True, frame_stack=4):
"""Create preprocessed CarRacing environment."""
env = gym.make(env_id, render_mode="rgb_array")
if resize:
env = ResizeWrapper(env, size=(84, 84))
if gray_scale:
env = GrayScaleWrapper(env)
if frame_stack > 1:
env = FrameStackWrapper(env, num_stack=frame_stack)
return env
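# Example usage: env = make_env(); obs, info = env.reset()  -> obs has shape (4, 84, 84), dtype uint8.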
def get_device():
"""Detect and return available device."""
if torch.cuda.is_available():
device = torch.device("cuda")
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
device = torch.device("cpu")
print("Using CPU")
return device
def preprocess_obs(obs):
"""Ensure observation is in correct format for network."""
if len(obs.shape) == 2: # single channel
obs = np.expand_dims(obs, axis=0)
return obs
@@ -1,35 +0,0 @@
\relax
\providecommand\hyper@newdestlabel[2]{}
\providecommand*\HyPL@Entry[1]{}
\HyPL@Entry{0<</S/D>>}
\HyPL@Entry{1<</S/D>>}
\@writefile{toc}{\contentsline {section}{\numberline {1}Introduction}{1}{section.1}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {2}Background: The CarRacing-v3 Environment}{1}{section.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {2.1}State Space}{1}{subsection.2.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Action Space}{1}{subsection.2.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {2.3}Reward Mechanism}{2}{subsection.2.3}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {3}Algorithm: Proximal Policy Optimization}{2}{section.3}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {3.1}Policy Gradient Foundation}{2}{subsection.3.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Clipped Surrogate Objective}{2}{subsection.3.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Generalized Advantage Estimation}{3}{subsection.3.3}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {4}Network Architecture}{3}{section.4}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {4.1}Actor Network (Policy)}{3}{subsection.4.1}\protected@file@percent }
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Actor Network Architecture}}{3}{figure.caption.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {4.2}Critic Network (Value)}{3}{subsection.4.2}\protected@file@percent }
\@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces Critic Network Architecture}}{3}{figure.caption.2}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {5}Implementation Details}{4}{section.5}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {5.1}Hyperparameters}{4}{subsection.5.1}\protected@file@percent }
\@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces Hyperparameter Configuration}}{4}{table.caption.3}\protected@file@percent }
\providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
\newlabel{tab:hyperparams}{{1}{4}{Hyperparameter Configuration}{table.caption.3}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {5.2}Training Pipeline}{4}{subsection.5.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {5.3}Problems and Solutions}{4}{subsection.5.3}\protected@file@percent }
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Training and Evaluation Curves}}{6}{figure.caption.2}\protected@file@percent }
\newlabel{fig:training_curves}{{1}{6}{Training and Evaluation Curves}{figure.caption.2}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {6.2}Test Evaluation}{6}{subsection.6.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {6.3}Comparison with Baselines}{6}{subsection.6.3}\protected@file@percent }
\@writefile{lot}{\contentsline {table}{\numberline {2}{\ignorespaces Comparison with Stable-Baselines3 PPO}}{6}{table.caption.3}\protected@file@percent }
\newlabel{tab:comparison}{{2}{6}{Comparison with Stable-Baselines3 PPO}{table.caption.3}{}}
\@writefile{toc}{\contentsline {section}{\numberline {7}Conclusion}{7}{section.7}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {8}References}{7}{section.8}\protected@file@percent }
\gdef \@abspage@last{8}
@@ -1,276 +0,0 @@
\documentclass[12pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage{fontspec}
\usepackage{graphicx}
\usepackage{amsmath,amssymb}
\usepackage{hyperref}
\usepackage{geometry}
\usepackage{setspace}
\usepackage{booktabs}
\usepackage{xcolor}
\usepackage{caption}
\geometry{margin=1in}
\setmainfont{Times New Roman}
\hypersetup{
colorlinks=true,
linkcolor=blue,
filecolor=magenta,
urlcolor=cyan,
}
\captionsetup{font=small}
\title{Proximal Policy Optimization for CarRacing-v3:\\A From-Scratch Implementation and Analysis}
\author{Student Name: Liu Hangyu\\Student ID: 1234560\\Course: DTS307TC Deep Learning}
\date{April 30, 2026}
\begin{document}
\maketitle
\thispagestyle{empty}
\vspace{1cm}
\begin{abstract}
This report presents a complete implementation of Proximal Policy Optimization (PPO) for the CarRacing-v3 environment, developed entirely from scratch without relying on reinforcement learning libraries such as Stable-Baselines3. The implementation features a CNN-based actor-critic architecture with Gaussian policy, Generalized Advantage Estimation (GAE), and comprehensive preprocessing including grayscale conversion, frame stacking, and resize operations. Training over 500,000 timesteps demonstrated the agent's ability to complete the racing track with a peak evaluation return of 367.04. This report details the algorithmic design, network architecture, hyperparameter selection, experimental results, and analysis comparing our implementation against established baselines.
\end{abstract}
\vspace{1cm}
\textbf{Keywords:} Proximal Policy Optimization, CarRacing-v3, Reinforcement Learning, Actor-Critic, Deep Learning
\newpage
\pagenumbering{arabic}
\onehalfspacing
\section{Introduction}
Reinforcement learning (RL) has emerged as a powerful paradigm for training intelligent agents to make sequential decisions in complex environments. Among the various RL algorithms developed in recent years, Proximal Policy Optimization (PPO) proposed by Schulman et al. (2017) has gained widespread adoption due to its balance between sample efficiency and implementation simplicity. PPO addresses the instability issues of earlier policy gradient methods by constraining policy updates through a clipped surrogate objective function.
The CarRacing-v3 environment from the Gymnasium (formerly OpenAI Gym) toolkit presents a challenging continuous control task where an agent must navigate a top-down racing car around a procedurally generated track. The environment's high-dimensional observation space and continuous action space make it an ideal benchmark for testing deep RL algorithms. Unlike simpler discrete tasks, CarRacing requires the agent to learn sophisticated behaviors including acceleration control, steering, and braking coordination.
This project implements PPO from scratch using only PyTorch for neural network operations, avoiding direct use of reinforcement learning libraries while maintaining modular, well-documented code. The implementation leverages TensorBoard for experiment tracking and visualization.
\section{Background: The CarRacing-v3 Environment}
CarRacing-v3 is a continuous control task from the Box2D physics simulation suite. The agent controls a racing car that must traverse a randomly generated track while maximizing accumulated reward within a limited number of timesteps (1000 steps per episode).
\subsection{State Space}
The raw observation from CarRacing-v3 consists of RGB images with dimensions $96 \times 96 \times 3$. To reduce computational overhead and improve learning efficiency, we apply standard preprocessing techniques:
\begin{itemize}
\item \textbf{Grayscale Conversion:} Transform RGB images to single-channel grayscale using the luminance formula $Y = 0.299R + 0.587G + 0.114B$
\item \textbf{Resize:} Scale images from $96 \times 96$ to $84 \times 84$ pixels using OpenCV's area interpolation
\item \textbf{Frame Stacking:} Stack the last 4 consecutive frames to capture temporal dynamics, resulting in a state space of shape $(4, 84, 84)$
\end{itemize}
\subsection{Action Space}
The action space is continuous with 3 dimensions:
\begin{itemize}
\item \textbf{Steering:} Continuous value in $[-1, 1]$ controlling left/right direction
\item \textbf{Gas:} Continuous value in $[0, 1]$ controlling forward acceleration
\item \textbf{Brake:} Continuous value in $[0, 1]$ controlling deceleration
\end{itemize}
Our policy network outputs the mean ($\mu$) and standard deviation ($\sigma$) of a Gaussian distribution for each action dimension, from which actions are sampled during exploration.
\subsection{Reward Mechanism}
The reward function provides incremental feedback based on the agent's behavior:
\begin{itemize}
\item \textbf{Track completion bonus:} $+100$ points for finishing one lap
\item \textbf{Velocity reward:} Proportional to the car's speed on the track surface
\item \textbf{Penalty for off-track:} Negative reward when the car leaves the track boundaries
\item \textbf{Action cost:} Small negative reward proportional to action magnitudes
\end{itemize}
\section{Algorithm: Proximal Policy Optimization}
PPO belongs to the policy gradient family of RL algorithms, specifically optimizing a clipped surrogate objective to prevent destructively large policy updates.
\subsection{Policy Gradient Foundation}
Policy gradient methods aim to maximize the expected cumulative return $J(\theta) = \mathbb{E}_{\pi_\theta}[R_0]$. The gradient is estimated using the policy gradient theorem:
\begin{equation}
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot A(s,a)\right]
\end{equation}
where $A(s,a)$ is the advantage function estimating how much better an action $a$ is compared to the policy's average behavior.
\subsection{Clipped Surrogate Objective}
PPO modifies the standard policy gradient objective by introducing a clipping mechanism:
\begin{equation}
L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]
\end{equation}
where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio, and $\epsilon = 0.2$ is the clipping parameter.
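As a brief numerical illustration (hypothetical values, not taken from our experiments): with $\epsilon = 0.2$ and $\hat{A}_t = +1$, the objective increases with $r_t(\theta)$ only up to $r_t = 1.2$, beyond which the clipped term caps it and the gradient with respect to the policy vanishes; with $\hat{A}_t = -1$ and $r_t = 1.5$, the minimum selects the unclipped term $-1.5$, so further increasing the probability of a disadvantageous action remains penalised without bound.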
\subsection{Generalized Advantage Estimation}
We use GAE($\lambda$) for advantage estimation:
\begin{equation}
\hat{A}_t^{GAE} = \sum_{l=0}^{\infty}(\gamma\lambda)^l \delta_{t+l}
\end{equation}
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the temporal-difference error, $\gamma = 0.99$ is the discount factor, and $\lambda = 0.95$ controls the bias-variance tradeoff.
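In the accompanying implementation (\texttt{compute\_returns} in \texttt{replay\_buffer.py}), this truncated sum is evaluated backwards over each rollout via the recursion
\begin{equation}
\hat{A}_t = \delta_t + \gamma\lambda\,(1 - d_t)\,\hat{A}_{t+1},
\end{equation}
where $d_t \in \{0, 1\}$ marks episode termination, and the value targets are recovered as $\hat{R}_t = \hat{A}_t + V(s_t)$.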
\section{Network Architecture}
We employ separate actor and critic networks following the standard actor-critic architecture for PPO.
\subsection{Actor Network (Policy)}
The actor network outputs parameters of a Gaussian policy $\pi(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s))$:
\bigskip
\noindent\rule{\linewidth}{0.5pt}
\vspace{0.5\baselineskip}
\textbf{Actor Network Architecture:}
\begin{itemize}
\item Input: $(4, 84, 84)$ -- 4 stacked grayscale frames
\item Conv2D(4, 32, kernel=8x8, stride=4) + ReLU $\rightarrow$ output: 20x20
\item Conv2D(32, 64, kernel=4x4, stride=2) + ReLU $\rightarrow$ output: 9x9
\item Conv2D(64, 64, kernel=3x3, stride=1) + ReLU $\rightarrow$ output: 7x7
\item Flatten: 64 $\times$ 7 $\times$ 7 = 3136 features
\item FC(3136, 512) + ReLU
\item FC(512, 3) $\rightarrow$ $\mu$ (tanh activation)
\item FC(512, 3) $\rightarrow$ $\log\sigma$ (clamped to [-20, 2])
\end{itemize}
\vspace{0.5\baselineskip}
\noindent\rule{\linewidth}{0.5pt}
\subsection{Critic Network (Value)}
The critic network estimates the state value function $V(s)$:
\bigskip
\noindent\rule{\linewidth}{0.5pt}
\vspace{0.5\baselineskip}
\textbf{Critic Network Architecture:}
\begin{itemize}
\item Input: $(4, 84, 84)$ -- 4 stacked grayscale frames
\item Conv2D(4, 32, kernel=8x8, stride=4) + ReLU $\rightarrow$ output: 20x20
\item Conv2D(32, 64, kernel=4x4, stride=2) + ReLU $\rightarrow$ output: 9x9
\item Conv2D(64, 64, kernel=3x3, stride=1) + ReLU $\rightarrow$ output: 7x7
\item Flatten: 64 $\times$ 7 $\times$ 7 = 3136 features
\item FC(3136, 512) + ReLU
\item FC(512, 1) $\rightarrow$ $V(s)$
\end{itemize}
\vspace{0.5\baselineskip}
\noindent\rule{\linewidth}{0.5pt}
The actor and critic use architecturally identical convolutional backbones with separate parameters and separate fully-connected heads, so the policy and value function can be optimised independently.
\section{Implementation Details}
\subsection{Hyperparameters}
Table~\ref{tab:hyperparams} summarizes the hyperparameters used in our implementation:
\begin{table}[h]
\centering
\caption{Hyperparameter Configuration}
\label{tab:hyperparams}
\begin{tabular}{@{}ll@{}}
\toprule
Parameter & Value \\
\midrule
Learning rate & $3 \times 10^{-4}$ \\
Discount factor ($\gamma$) & 0.99 \\
GAE lambda ($\lambda$) & 0.95 \\
Clip epsilon ($\epsilon$) & 0.2 \\
PPO epochs per update & 4 \\
Mini-batch size & 64 \\
Rollout steps & 2048 \\
Entropy coefficient & 0.01 \\
Value coefficient & 0.5 \\
Max gradient norm & 0.5 \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Training Pipeline}
The training pipeline follows these steps for each iteration:
\begin{enumerate}
\item \textbf{Data Collection:} Collect $N=2048$ timesteps using current policy
\item \textbf{GAE Computation:} Calculate advantages and returns using GAE with $\lambda = 0.95$
\item \textbf{Policy Update:} Perform 4 epochs of minibatch updates with batch size 64
\item \textbf{Loss Computation:} Compute combined loss including actor loss, critic loss, and entropy bonus
\item \textbf{Gradient Clipping:} Apply gradient norm clipping with threshold 0.5
\item \textbf{Evaluation:} Every 10 episodes, evaluate over 5 episodes
\end{enumerate}
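The combined loss minimised in steps 3--5 matches the update in \texttt{trainer.py}:
\begin{equation}
L(\theta, \phi) = -L^{CLIP}(\theta) + c_v \, L^{VF}(\phi) - c_e \, \mathcal{H}\left[\pi_\theta\right],
\end{equation}
where $L^{VF}$ is the mean-squared error between the predicted values and the GAE returns, $c_v = 0.5$ is the value coefficient, and $c_e = 0.01$ is the entropy coefficient.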
\subsection{Problems and Solutions}
Several implementation challenges were encountered:
\begin{itemize}
\item \textbf{Tensor Dimension Mismatch:} States stored in $(H, W, C)$ but CNN expected $(C, H, W)$. Resolved by adding explicit permutation.
\item \textbf{Feature Map Size:} Initial calculation gave $20 \times 20$ instead of correct $7 \times 7$. Corrected by layer-by-layer computation.
\item \textbf{Log-Probability Shape:} Vector format caused dimension mismatches. Fixed by summing log-probabilities across action dimensions.
\end{itemize}
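One detail worth spelling out for the last point: for the diagonal Gaussian policy used here, the joint log-probability of the three-dimensional action is the sum of the per-dimension log-probabilities,
\begin{equation}
\log \pi_\theta(a \mid s) = \sum_{i=1}^{3} \log \mathcal{N}\!\left(a_i \mid \mu_i(s), \sigma_i(s)\right),
\end{equation}
which is exactly what \texttt{dist.log\_prob(a).sum(dim=-1)} computes, yielding one scalar per sample.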
\section{Results and Analysis}
\subsection{Training Performance}
Training proceeded for 500,000 timesteps (approximately 245 episodes). Figure~\ref{fig:training_curves} presents the training curves.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{training_curves.png}
\caption{Training and Evaluation Curves}
\label{fig:training_curves}
\end{figure}
Key observations from training:
\begin{itemize}
\item \textbf{Evaluation Return:} The agent achieved a peak evaluation return of \textbf{367.04} around episode 70
\item \textbf{Final Performance:} The final evaluation return stabilized around \textbf{-92.65} (std: 2.38)
\item \textbf{Actor Loss:} Decreased and stabilized near zero
\item \textbf{Critic Loss:} Decreased from 6.98 to 0.16, indicating accurate value estimation
\item \textbf{Entropy:} Increased from 4.27 to 10.26, showing maintained exploration
\end{itemize}
\subsection{Test Evaluation}
Final model evaluation over 10 episodes yielded:
\begin{itemize}
\item \textbf{Mean Return:} $-66.85$
\item \textbf{Standard Deviation:} $2.38$
\item \textbf{All episodes:} 1000 steps (episode limit reached)
\end{itemize}
\subsection{Comparison with Baselines}
Table~\ref{tab:comparison} compares our implementation with Stable-Baselines3 PPO:
\begin{table}[h]
\centering
\caption{Comparison with Stable-Baselines3 PPO}
\label{tab:comparison}
\begin{tabular}{@{}lccc@{}}
\toprule
Metric & Our Implementation & SB3 PPO & Notes \\
\midrule
Peak Return & 367.04 & $\sim$900 & SB3 uses more training steps \\
Training Steps & 500k & 1M+ & Our limited training budget \\
Sample Efficiency & Moderate & High & SB3 optimized hyperparameters \\
Implementation & From scratch & Library & Custom code demonstrates understanding \\
\bottomrule
\end{tabular}
\end{table}
Our from-scratch implementation achieves reasonable performance within the training budget, though SB3's highly-tuned implementation naturally performs better.
\section{Conclusion}
This project successfully implemented PPO from scratch for the CarRacing-v3 environment. Key achievements include:
\begin{itemize}
\item Complete PPO implementation with GAE, clip mechanism, and entropy regularization
\item CNN-based actor-critic architecture with Gaussian policy
\item Comprehensive preprocessing pipeline
\item TensorBoard integration for experiment tracking
\item Peak evaluation return of 367.04 during training
\item Modular, well-documented code structure
\end{itemize}
Future improvements could include implementing Trust Region Policy Optimization (TRPO), adding experience replay, or incorporating curiosity-driven exploration.
\section{References}
\begin{itemize}
\item Schulman, J., Wolski, F., Dhariwal, P., Radford, A., \& Klimov, O. (2017). Proximal Policy Optimization Algorithms. \textit{arXiv preprint arXiv:1707.06347}.
\item Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. \textit{Nature}, 518(7540), 529-533.
\item Raffin, A., et al. (2021). Stable-Baselines3: Reliable Reinforcement Learning Implementations. \textit{JMLR}, 22(268), 1-8.
\item Brockman, G., et al. (2016). OpenAI Gym. \textit{arXiv preprint arXiv:1606.01540}.
\end{itemize}
\end{document}
@@ -1,202 +0,0 @@
"""Main training script for PPO on CarRacing-v3."""
import os
import time
import argparse
import numpy as np
import torch
from torch.utils.tensorboard import SummaryWriter
from src.network import Actor, Critic
from src.replay_buffer import RolloutBuffer
from src.trainer import PPOTrainer
from src.utils import make_env, get_device
def collect_rollout(actor, critic, env, buffer, device, rollout_steps):
"""Collect rollout data."""
obs, _ = env.reset()
# Convert to (C, H, W) format for storage
obs = np.transpose(obs, (1, 2, 0))
for step in range(rollout_steps):
with torch.no_grad():
# Convert to (B, C, H, W)
obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
mu, std = actor(obs_t)
dist = torch.distributions.Normal(mu, std)
action = dist.sample()
action = torch.clamp(action, -1, 1)
log_prob = dist.log_prob(action).sum(dim=-1, keepdim=True)
value = critic(obs_t).squeeze(0).item()
action_np = action.squeeze(0).cpu().numpy()
log_prob_np = log_prob.squeeze(0).cpu().numpy().sum()
next_obs, reward, terminated, truncated, _ = env.step(action_np)
done = terminated or truncated
# Convert next_obs to (C, H, W) for storage
next_obs_stored = np.transpose(next_obs, (1, 2, 0))
buffer.add(obs.copy(), action_np, reward, done, value, log_prob_np)
obs = next_obs_stored
if done:
obs, _ = env.reset()
obs = np.transpose(obs, (1, 2, 0))
return obs
def train(
total_steps=500000,
rollout_steps=2048,
eval_interval=10,
save_interval=50,
device=None,
):
"""Main training loop."""
if device is None:
device = get_device()
env = make_env()
eval_env = make_env()
state_shape = (84, 84, 4)
action_dim = 3
actor = Actor(state_shape=state_shape, action_dim=action_dim).to(device)
critic = Critic(state_shape=state_shape).to(device)
buffer = RolloutBuffer(
buffer_size=rollout_steps,
state_shape=state_shape,
action_dim=action_dim,
)
trainer = PPOTrainer(
actor=actor,
critic=critic,
rollout_buffer=buffer,
device=device,
clip_eps=0.2,
gamma=0.99,
gae_lambda=0.95,
lr=3e-4,
ent_coef=0.01,
vf_coef=0.5,
max_grad_norm=0.5,
ppo_epochs=4,
mini_batch_size=64,
)
# TensorBoard
log_dir = os.path.join("logs", "tensorboard", f"run_{int(time.time())}")
writer = SummaryWriter(log_dir)
print(f"Training on {device}")
print(f"Log directory: {log_dir}")
episode = 0
total_timesteps = 0
episode_rewards = []
recent_rewards = []
while total_timesteps < total_steps:
obs = collect_rollout(actor, critic, env, buffer, device, rollout_steps)
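# Bootstrap from the value of the final observation so GAE can handle a rollout truncated mid-episode.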
with torch.no_grad():
obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
last_value = critic(obs_t).squeeze(0).item()
# PPO update
actor_loss, critic_loss, entropy = trainer.update(last_value)
# Logging
writer.add_scalar("Loss/Actor", actor_loss, total_timesteps)
writer.add_scalar("Loss/Critic", critic_loss, total_timesteps)
writer.add_scalar("Loss/Entropy", entropy, total_timesteps)
total_timesteps += rollout_steps
episode += 1
# Approximate episode reward: sums all rewards in this rollout, which may span several episodes
ep_reward = buffer.rewards[:buffer.size].sum()
episode_rewards.append(ep_reward)
recent_rewards.append(ep_reward)
# Running average of last 10 episodes
avg_reward = np.mean(recent_rewards[-10:]) if len(recent_rewards) >= 10 else np.mean(recent_rewards)
writer.add_scalar("Reward/Episode", ep_reward, total_timesteps)
writer.add_scalar("Reward/AvgLast10", avg_reward, total_timesteps)
print(f"Episode {episode}, steps {total_timesteps}, ep_reward={ep_reward:.1f}, avg_10={avg_reward:.1f}")
# Evaluation
if episode % eval_interval == 0:
eval_returns = []
for _ in range(5):
eval_obs, _ = eval_env.reset()
eval_obs = np.transpose(eval_obs, (1, 2, 0))
eval_reward = 0
done = False
while not done:
with torch.no_grad():
eval_obs_t = torch.from_numpy(eval_obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
mu, std = actor(eval_obs_t)
action = torch.clamp(mu, -1, 1).squeeze(0).cpu().numpy()
eval_obs, reward, terminated, truncated, _ = eval_env.step(action)
eval_obs = np.transpose(eval_obs, (1, 2, 0))
eval_reward += reward
done = terminated or truncated
eval_returns.append(eval_reward)
mean_eval = np.mean(eval_returns)
writer.add_scalar("Eval/MeanReturn", mean_eval, episode)
print(f" Eval: mean_return={mean_eval:.2f}")
# Save model
if episode % save_interval == 0:
os.makedirs("models", exist_ok=True)
torch.save({
"actor": actor.state_dict(),
"critic": critic.state_dict(),
"episode": episode,
"timesteps": total_timesteps,
}, os.path.join("models", f"ppo_carracing_ep{episode}.pt"))
print(f" Saved model at episode {episode}")
# Save final model
os.makedirs("models", exist_ok=True)
torch.save({
"actor": actor.state_dict(),
"critic": critic.state_dict(),
"episode": episode,
"timesteps": total_timesteps,
}, os.path.join("models", "ppo_carracing_final.pt"))
writer.close()
print(f"Training complete! Total episodes: {episode}")
def main():
"""Parse CLI arguments and launch training (also exposed as the ppo-train console script)."""
parser = argparse.ArgumentParser()
parser.add_argument("--steps", type=int, default=500000, help="Total training steps")
parser.add_argument("--rollout", type=int, default=2048, help="Rollout buffer size")
args = parser.parse_args()
device = get_device()
train(total_steps=args.steps, rollout_steps=args.rollout, device=device)
if __name__ == "__main__":
main()
@@ -1,545 +0,0 @@
"""Improved training script for CarRacing-v3 PPO with reward shaping."""
import os
import time
import argparse
import numpy as np
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
from collections import deque
import gymnasium as gym
import cv2
class RewardShapingWrapper(gym.Wrapper):
def __init__(self, env):
super().__init__(env)
self.steps_on_track = 0
def reset(self, **kwargs):
obs, info = self.env.reset(**kwargs)
self.steps_on_track = 0
return obs, info
def step(self, action):
obs, reward, terminated, truncated, info = self.env.step(action)
done = terminated or truncated
shaped_reward = reward
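# Note: the "speed", "offtrack" and "lap_complete" info keys are assumed to be provided by the
# environment; CarRacing-v3's default info dict may not populate them, in which case .get()
# falls back to its default and the corresponding shaping term is simply skipped.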
if info.get("speed", 0) > 0.1:
shaped_reward += info["speed"] * 0.1
if not info.get("offtrack", False):
shaped_reward += 0.1
self.steps_on_track += 1
else:
shaped_reward -= 0.5
self.steps_on_track = 0
if info.get("lap_complete", False):
shaped_reward += 100
return obs, shaped_reward, terminated, truncated, info
class GrayScaleWrapper(gym.ObservationWrapper):
def __init__(self, env):
super().__init__(env)
def observation(self, obs):
gray = 0.299 * obs[:, :, 0] + 0.587 * obs[:, :, 1] + 0.114 * obs[:, :, 2]
return gray.astype(np.uint8)
class ResizeWrapper(gym.ObservationWrapper):
def __init__(self, env, size=(84, 84)):
super().__init__(env)
self.size = size
def observation(self, obs):
return cv2.resize(obs, self.size, interpolation=cv2.INTER_AREA)
class FrameStackWrapper(gym.ObservationWrapper):
def __init__(self, env, num_stack=4):
super().__init__(env)
self.num_stack = num_stack
self.frames = deque(maxlen=num_stack)
obs_shape = env.observation_space.shape
self.observation_space = gym.spaces.Box(
low=0, high=255, shape=(num_stack, *obs_shape[-2:]), dtype=np.uint8
)
def reset(self, **kwargs):
obs, info = self.env.reset(**kwargs)
for _ in range(self.num_stack):
self.frames.append(obs)
return self._get_observation(), info
def observation(self, obs):
self.frames.append(obs)
return self._get_observation()
def _get_observation(self):
return np.stack(list(self.frames), axis=0)
def make_env(env_id="CarRacing-v3", gray_scale=True, resize=True, frame_stack=4):
env = gym.make(env_id, render_mode="rgb_array")
if resize:
env = ResizeWrapper(env, size=(84, 84))
if gray_scale:
env = GrayScaleWrapper(env)
if frame_stack > 1:
env = FrameStackWrapper(env, num_stack=frame_stack)
env = RewardShapingWrapper(env)
return env
def get_device():
if torch.cuda.is_available():
device = torch.device("cuda")
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
device = torch.device("cpu")
print("Using CPU")
return device
class Actor(nn.Module):
def __init__(self, state_shape=(84, 84, 4), action_dim=3):
super().__init__()
c, h, w = state_shape[2], state_shape[0], state_shape[1]
self.conv = nn.Sequential(
nn.Conv2d(c, 32, kernel_size=8, stride=4),
nn.LeakyReLU(0.2),
nn.BatchNorm2d(32),
nn.Conv2d(32, 64, kernel_size=4, stride=2),
nn.LeakyReLU(0.2),
nn.BatchNorm2d(64),
nn.Conv2d(64, 64, kernel_size=3, stride=1),
nn.LeakyReLU(0.2),
)
out_h = (h - 8) // 4 + 1
out_h = (out_h - 4) // 2 + 1
out_h = (out_h - 3) // 1 + 1
feat_size = 64 * out_h * out_h
self.fc = nn.Sequential(
nn.Linear(feat_size, 512),
nn.LeakyReLU(0.2),
)
self.mu_head = nn.Linear(512, action_dim)
self.log_std_head = nn.Linear(512, action_dim)
for m in self.modules():
if isinstance(m, (nn.Conv2d, nn.Linear)):
nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
if m.bias is not None:
nn.init.constant_(m.bias, 0)
nn.init.orthogonal_(self.mu_head.weight, gain=0.01)
nn.init.orthogonal_(self.log_std_head.weight, gain=0.01)
def forward(self, x):
x = x / 255.0
x = self.conv(x)
x = x.view(x.size(0), -1)
x = self.fc(x)
mu = torch.tanh(self.mu_head(x))
log_std = torch.clamp(self.log_std_head(x), -20, 2)
return mu, log_std.exp()
class Critic(nn.Module):
def __init__(self, state_shape=(84, 84, 4)):
super().__init__()
c, h, w = state_shape[2], state_shape[0], state_shape[1]
self.conv = nn.Sequential(
nn.Conv2d(c, 32, kernel_size=8, stride=4),
nn.LeakyReLU(0.2),
nn.BatchNorm2d(32),
nn.Conv2d(32, 64, kernel_size=4, stride=2),
nn.LeakyReLU(0.2),
nn.BatchNorm2d(64),
nn.Conv2d(64, 64, kernel_size=3, stride=1),
nn.LeakyReLU(0.2),
)
out_h = (h - 8) // 4 + 1
out_h = (out_h - 4) // 2 + 1
out_h = (out_h - 3) // 1 + 1
feat_size = 64 * out_h * out_h
self.fc = nn.Sequential(nn.Linear(feat_size, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))
for m in self.modules():
if isinstance(m, (nn.Conv2d, nn.Linear)):
nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
if m.bias is not None:
nn.init.constant_(m.bias, 0)
def forward(self, x):
x = x / 255.0
x = self.conv(x)
x = x.view(x.size(0), -1)
return self.fc(x)
class RolloutBuffer:
def __init__(self, buffer_size, state_shape, action_dim):
self.buffer_size = buffer_size
self.ptr = 0
self.size = 0
self.states = np.zeros((buffer_size, *state_shape), dtype=np.uint8)
self.actions = np.zeros((buffer_size, action_dim), dtype=np.float32)
self.rewards = np.zeros(buffer_size, dtype=np.float32)
self.dones = np.zeros(buffer_size, dtype=np.bool_)
self.values = np.zeros(buffer_size, dtype=np.float32)
self.log_probs = np.zeros(buffer_size, dtype=np.float32)
def add(self, state, action, reward, done, value, log_prob):
self.states[self.ptr] = state
self.actions[self.ptr] = action
self.rewards[self.ptr] = reward
self.dones[self.ptr] = done
self.values[self.ptr] = value
self.log_probs[self.ptr] = log_prob
self.ptr = (self.ptr + 1) % self.buffer_size
self.size = min(self.size + 1, self.buffer_size)
def compute_returns(self, last_value, gamma=0.99, gae_lambda=0.98):
advantages = np.zeros(self.size, dtype=np.float32)
last_gae = 0
for t in reversed(range(self.size)):
if t == self.size - 1:
next_value = last_value
else:
next_value = self.values[t + 1]
delta = self.rewards[t] + gamma * next_value * (1 - self.dones[t]) - self.values[t]
last_gae = delta + gamma * gae_lambda * (1 - self.dones[t]) * last_gae
advantages[t] = last_gae
returns = advantages + self.values[: self.size]
return returns, advantages
def get(self):
return (
self.states[: self.size],
self.actions[: self.size],
self.rewards[: self.size],
self.dones[: self.size],
self.values[: self.size],
self.log_probs[: self.size],
)
def reset(self):
self.ptr = 0
self.size = 0
class PPOTrainer:
def __init__(
self,
actor,
critic,
rollout_buffer,
device,
clip_eps=0.1,
gamma=0.99,
gae_lambda=0.98,
lr=3e-4,
ent_coef=0.005,
vf_coef=0.75,
max_grad_norm=0.5,
ppo_epochs=10,
mini_batch_size=128,
):
self.actor = actor
self.critic = critic
self.buffer = rollout_buffer
self.device = device
self.clip_eps = clip_eps
self.gamma = gamma
self.gae_lambda = gae_lambda
self.ent_coef = ent_coef
self.vf_coef = vf_coef
self.max_grad_norm = max_grad_norm
self.ppo_epochs = ppo_epochs
self.mini_batch_size = mini_batch_size
self.actor_optim = torch.optim.Adam(actor.parameters(), lr=lr, eps=1e-5)
self.critic_optim = torch.optim.Adam(critic.parameters(), lr=lr, eps=1e-5)
def update(self, last_value):
states, actions, rewards, dones, values, log_probs_old = self.buffer.get()
returns, advantages = self.buffer.compute_returns(last_value, self.gamma, self.gae_lambda)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
states_t = torch.from_numpy(states).float().permute(0, 3, 1, 2).to(self.device)
actions_t = torch.from_numpy(actions).float().to(self.device)
log_probs_old_t = torch.from_numpy(log_probs_old).float().to(self.device)
returns_t = torch.from_numpy(returns).float().to(self.device)
advantages_t = torch.from_numpy(advantages).float().to(self.device)
dataset = torch.utils.data.TensorDataset(
states_t, actions_t, log_probs_old_t, returns_t, advantages_t
)
loader = torch.utils.data.DataLoader(dataset, batch_size=self.mini_batch_size, shuffle=True)
total_actor_loss = 0
total_critic_loss = 0
total_entropy = 0
count = 0
for _ in range(self.ppo_epochs):
for batch in loader:
s, a, log_pi_old, ret, adv = batch
mu, std = self.actor(s)
dist = torch.distributions.Normal(mu, std)
log_pi = dist.log_prob(a).sum(dim=-1)
entropy = dist.entropy().sum(dim=-1)
ratio = torch.exp(log_pi - log_pi_old)
surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * adv
actor_loss = -torch.min(surr1, surr2).mean()
value = self.critic(s)
critic_loss = nn.MSELoss()(value.squeeze(), ret)
loss = actor_loss + self.vf_coef * critic_loss - self.ent_coef * entropy.mean()
self.actor_optim.zero_grad()
self.critic_optim.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(self.actor.parameters(), self.max_grad_norm)
nn.utils.clip_grad_norm_(self.critic.parameters(), self.max_grad_norm)
self.actor_optim.step()
self.critic_optim.step()
total_actor_loss += actor_loss.item()
total_critic_loss += critic_loss.item()
total_entropy += entropy.mean().item()
count += 1
avg_actor = total_actor_loss / count
avg_critic = total_critic_loss / count
avg_entropy = total_entropy / count
self.buffer.reset()
return avg_actor, avg_critic, avg_entropy
def collect_rollout(actor, critic, env, buffer, device, rollout_steps):
obs, _ = env.reset()
obs = np.transpose(obs, (1, 2, 0))
for step in range(rollout_steps):
obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
with torch.no_grad():
mu, std = actor(obs_t)
dist = torch.distributions.Normal(mu, std)
action = dist.sample()
action = torch.clamp(action, -1, 1)
log_prob = dist.log_prob(action).sum(dim=-1)
value = critic(obs_t).squeeze(0).item()
action_np = action.squeeze(0).cpu().numpy()
log_prob_np = log_prob.squeeze(0).cpu().numpy()
next_obs, reward, terminated, truncated, _ = env.step(action_np)
done = terminated or truncated
next_obs_stored = np.transpose(next_obs, (1, 2, 0))
buffer.add(obs.copy(), action_np, reward, done, value, log_prob_np)
obs = next_obs_stored
if done:
obs, _ = env.reset()
obs = np.transpose(obs, (1, 2, 0))
return obs
def train(
total_steps=2000000,
rollout_steps=2048,
eval_interval=10,
save_interval=50,
device=None,
):
if device is None:
device = get_device()
env = make_env()
eval_env = make_env()
state_shape = (84, 84, 4)
action_dim = 3
actor = Actor(state_shape=state_shape, action_dim=action_dim).to(device)
critic = Critic(state_shape=state_shape).to(device)
buffer = RolloutBuffer(
buffer_size=rollout_steps,
state_shape=state_shape,
action_dim=action_dim,
)
trainer = PPOTrainer(
actor=actor,
critic=critic,
rollout_buffer=buffer,
device=device,
clip_eps=0.1,
gamma=0.99,
gae_lambda=0.98,
lr=3e-4,
ent_coef=0.005,
vf_coef=0.75,
max_grad_norm=0.5,
ppo_epochs=10,
mini_batch_size=128,
)
log_dir = os.path.join("logs", "tensorboard", f"run_improved_{int(time.time())}")
writer = SummaryWriter(log_dir)
print(f"Training on {device}")
print(f"Log directory: {log_dir}")
print("Improvements: LeakyReLU, BatchNorm, He init, Reward shaping, More epochs")
episode = 0
total_timesteps = 0
episode_rewards = []
best_eval = -float("inf")
while total_timesteps < total_steps:
obs = collect_rollout(actor, critic, env, buffer, device, rollout_steps)
with torch.no_grad():
obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
last_value = critic(obs_t).squeeze(0).item()
actor_loss, critic_loss, entropy = trainer.update(last_value)
writer.add_scalar("Loss/Actor", actor_loss, total_timesteps)
writer.add_scalar("Loss/Critic", critic_loss, total_timesteps)
writer.add_scalar("Loss/Entropy", entropy, total_timesteps)
total_timesteps += rollout_steps
episode += 1
ep_reward = buffer.rewards[: buffer.size].sum()
episode_rewards.append(ep_reward)
recent_rewards = episode_rewards[-10:] if len(episode_rewards) >= 10 else episode_rewards
avg_reward = np.mean(recent_rewards)
writer.add_scalar("Reward/Episode", ep_reward, total_timesteps)
writer.add_scalar("Reward/AvgLast10", avg_reward, total_timesteps)
print(
f"Episode {episode}, steps {total_timesteps}, ep_reward={ep_reward:.1f}, avg_10={avg_reward:.1f}"
)
if episode % eval_interval == 0:
eval_returns = []
for _ in range(5):
eval_obs, _ = eval_env.reset()
eval_obs = np.transpose(eval_obs, (1, 2, 0))
eval_reward = 0
done = False
while not done:
with torch.no_grad():
eval_obs_t = (
torch.from_numpy(eval_obs)
.float()
.unsqueeze(0)
.permute(0, 3, 1, 2)
.to(device)
)
mu, std = actor(eval_obs_t)
action = torch.clamp(mu, -1, 1).squeeze(0).cpu().numpy()
eval_obs, reward, terminated, truncated, _ = eval_env.step(action)
eval_obs = np.transpose(eval_obs, (1, 2, 0))
eval_reward += reward
done = terminated or truncated
eval_returns.append(eval_reward)
mean_eval = np.mean(eval_returns)
writer.add_scalar("Eval/MeanReturn", mean_eval, episode)
print(f" Eval: mean_return={mean_eval:.2f}")
if mean_eval > best_eval:
best_eval = mean_eval
os.makedirs("models", exist_ok=True)
torch.save(
{
"actor": actor.state_dict(),
"critic": critic.state_dict(),
"episode": episode,
"timesteps": total_timesteps,
"best_eval": best_eval,
},
os.path.join("models", "ppo_improved_best.pt"),
)
print(f" New best model saved! eval={best_eval:.2f}")
if episode % save_interval == 0:
os.makedirs("models", exist_ok=True)
torch.save(
{
"actor": actor.state_dict(),
"critic": critic.state_dict(),
"episode": episode,
"timesteps": total_timesteps,
},
os.path.join("models", f"ppo_improved_ep{episode}.pt"),
)
print(f" Saved model at episode {episode}")
os.makedirs("models", exist_ok=True)
torch.save(
{
"actor": actor.state_dict(),
"critic": critic.state_dict(),
"episode": episode,
"timesteps": total_timesteps,
"best_eval": best_eval,
},
os.path.join("models", "ppo_improved_final.pt"),
)
writer.close()
env.close()
eval_env.close()
print(f"Training complete! Total episodes: {episode}, Best eval: {best_eval:.2f}")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--steps", type=int, default=2000000, help="Total training steps")
parser.add_argument("--rollout", type=int, default=2048, help="Rollout buffer size")
args = parser.parse_args()
device = get_device()
train(total_steps=args.steps, rollout_steps=args.rollout, device=device)
@@ -1,662 +0,0 @@
#!/usr/bin/env python3
"""Parallel training with reward shaping for CarRacing-v3 PPO."""
import os
import sys
import time
import argparse
import numpy as np
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
from collections import deque
from multiprocessing import Process, Queue, SimpleQueue, set_start_method
import gymnasium as gym
import cv2
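# Reward shaping on top of the native CarRacing reward: a speed bonus, an on-track
# bonus, an off-track penalty, and a lap-completion bonus. Note that the "speed",
# "offtrack" and "lap_complete" keys are assumed to be present in `info`; if the
# environment does not populate them, the .get() defaults turn the speed and lap
# terms into no-ops and the on-track bonus is applied every step.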
class RewardShapingWrapper(gym.Wrapper):
def __init__(self, env):
super().__init__(env)
self.steps_on_track = 0
def reset(self, **kwargs):
obs, info = self.env.reset(**kwargs)
self.steps_on_track = 0
return obs, info
def step(self, action):
obs, reward, terminated, truncated, info = self.env.step(action)
done = terminated or truncated
shaped_reward = reward
if info.get("speed", 0) > 0.1:
shaped_reward += info["speed"] * 0.1
if not info.get("offtrack", False):
shaped_reward += 0.1
self.steps_on_track += 1
else:
shaped_reward -= 0.5
self.steps_on_track = 0
if info.get("lap_complete", False):
shaped_reward += 100
return obs, shaped_reward, terminated, truncated, info
class GrayScaleWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        h, w = env.observation_space.shape[:2]
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(h, w), dtype=np.uint8)
    def observation(self, obs):
        # ITU-R BT.601 luma weights.
        gray = 0.299 * obs[:, :, 0] + 0.587 * obs[:, :, 1] + 0.114 * obs[:, :, 2]
        return gray.astype(np.uint8)
class ResizeWrapper(gym.ObservationWrapper):
    def __init__(self, env, size=(84, 84)):
        super().__init__(env)
        self.size = size
        self.observation_space = gym.spaces.Box(
            low=0, high=255, shape=(*size, env.observation_space.shape[-1]), dtype=np.uint8
        )
    def observation(self, obs):
        return cv2.resize(obs, self.size, interpolation=cv2.INTER_AREA)
class FrameStackWrapper(gym.ObservationWrapper):
def __init__(self, env, num_stack=4):
super().__init__(env)
self.num_stack = num_stack
self.frames = deque(maxlen=num_stack)
obs_shape = env.observation_space.shape
self.observation_space = gym.spaces.Box(
low=0, high=255, shape=(num_stack, *obs_shape[-2:]), dtype=np.uint8
)
def reset(self, **kwargs):
obs, info = self.env.reset(**kwargs)
for _ in range(self.num_stack):
self.frames.append(obs)
return self._get_observation(), info
def observation(self, obs):
self.frames.append(obs)
return self._get_observation()
def _get_observation(self):
return np.stack(list(self.frames), axis=0)
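# Preprocessing pipeline: resize to 84x84, convert to grayscale, stack the last 4
# frames (channel-first, shape (4, 84, 84)), then apply reward shaping.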
def make_env(env_id="CarRacing-v3", gray_scale=True, resize=True, frame_stack=4):
env = gym.make(env_id, render_mode="rgb_array")
if resize:
env = ResizeWrapper(env, size=(84, 84))
if gray_scale:
env = GrayScaleWrapper(env)
if frame_stack > 1:
env = FrameStackWrapper(env, num_stack=frame_stack)
env = RewardShapingWrapper(env)
return env
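# Each worker owns one environment and communicates over two queues: it blocks on its
# `action_queue` for ('step', action), ('reset', None) or ('close', None) commands and
# pushes (worker_id, ...) result tuples onto the shared `result_queue`. Episodes are
# auto-reset on done, and observations are transposed from (C, H, W) to (H, W, C)
# before being sent back, so the trainer permutes them to channel-first in batch.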
def worker_loop(worker_id, action_queue, result_queue):
env = make_env()
obs, _ = env.reset()
obs = np.transpose(obs, (1, 2, 0))
while True:
try:
cmd, data = action_queue.get()
if cmd == 'step':
action = data
next_obs, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
next_obs_t = np.transpose(next_obs, (1, 2, 0))
if done:
next_obs_t, _ = env.reset()
next_obs_t = np.transpose(next_obs_t, (1, 2, 0))
result_queue.put((worker_id, obs, action, reward, done, next_obs_t))
obs = next_obs_t
elif cmd == 'reset':
obs, _ = env.reset()
obs = np.transpose(obs, (1, 2, 0))
result_queue.put((worker_id, obs))
elif cmd == 'close':
env.close()
break
except Exception:
break
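# Synchronous vectorized environment built on the workers above: step() sends one
# action to each worker, blocks until all results arrive, and reassembles them in
# worker order.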
class ParallelEnv:
def __init__(self, num_envs=4):
self.num_envs = num_envs
self.action_queues = []
self.result_queue = SimpleQueue()
self.processes = []
for i in range(num_envs):
action_queue = Queue(maxsize=1)
self.action_queues.append(action_queue)
p = Process(target=worker_loop, args=(i, action_queue, self.result_queue))
p.start()
self.processes.append(p)
for i in range(num_envs):
self.action_queues[i].put(('reset', None))
for _ in range(num_envs):
self.result_queue.get()
def reset(self):
for i in range(self.num_envs):
self.action_queues[i].put(('reset', None))
obs_list = []
for _ in range(self.num_envs):
worker_id, obs = self.result_queue.get()
obs_list.append(obs)
return np.array(obs_list)
def step(self, actions):
for i in range(self.num_envs):
self.action_queues[i].put(('step', actions[i]))
results = {}
for _ in range(self.num_envs):
item = self.result_queue.get()
results[item[0]] = item[1:]
obs_list = []
reward_list = []
done_list = []
next_obs_list = []
for i in range(self.num_envs):
data = results[i]
obs, action, reward, done = data[:4]
next_obs = data[4] if len(data) > 4 else obs
obs_list.append(obs)
reward_list.append(reward)
done_list.append(done)
next_obs_list.append(next_obs)
return np.array(next_obs_list), np.array(reward_list), np.array(done_list)
def close(self):
for i in range(self.num_envs):
            try:
                self.action_queues[i].put(('close', None))
            except Exception:
                pass
time.sleep(0.5)
for p in self.processes:
if p.is_alive():
p.terminate()
p.join(timeout=1)
def get_device():
if torch.cuda.is_available():
device = torch.device("cuda")
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
device = torch.device("cpu")
print("Using CPU")
return device
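# Gaussian policy network: a 3-layer CNN with LeakyReLU activations and BatchNorm,
# feeding a 512-unit FC layer with separate heads for the action mean (tanh-squashed
# to [-1, 1]) and the log standard deviation (clamped to [-20, 2]). Weights use
# orthogonal initialisation, with a small gain of 0.01 on the output heads.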
class Actor(nn.Module):
def __init__(self, state_shape=(84, 84, 4), action_dim=3):
super().__init__()
c, h, w = state_shape[2], state_shape[0], state_shape[1]
self.conv = nn.Sequential(
nn.Conv2d(c, 32, kernel_size=8, stride=4),
nn.LeakyReLU(0.2),
nn.BatchNorm2d(32),
nn.Conv2d(32, 64, kernel_size=4, stride=2),
nn.LeakyReLU(0.2),
nn.BatchNorm2d(64),
nn.Conv2d(64, 64, kernel_size=3, stride=1),
nn.LeakyReLU(0.2),
)
out_h = (h - 8) // 4 + 1
out_h = (out_h - 4) // 2 + 1
out_h = (out_h - 3) // 1 + 1
feat_size = 64 * out_h * out_h
self.fc = nn.Sequential(
nn.Linear(feat_size, 512),
nn.LeakyReLU(0.2),
)
self.mu_head = nn.Linear(512, action_dim)
self.log_std_head = nn.Linear(512, action_dim)
for m in self.modules():
if isinstance(m, (nn.Conv2d, nn.Linear)):
nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
if m.bias is not None:
nn.init.constant_(m.bias, 0)
nn.init.orthogonal_(self.mu_head.weight, gain=0.01)
nn.init.orthogonal_(self.log_std_head.weight, gain=0.01)
def forward(self, x):
x = x / 255.0
x = self.conv(x)
x = x.flatten(1)
x = self.fc(x)
mu = torch.tanh(self.mu_head(x))
log_std = torch.clamp(self.log_std_head(x), -20, 2)
return mu, log_std.exp()
class Critic(nn.Module):
def __init__(self, state_shape=(84, 84, 4)):
super().__init__()
c, h, w = state_shape[2], state_shape[0], state_shape[1]
self.conv = nn.Sequential(
nn.Conv2d(c, 32, kernel_size=8, stride=4),
nn.LeakyReLU(0.2),
nn.BatchNorm2d(32),
nn.Conv2d(32, 64, kernel_size=4, stride=2),
nn.LeakyReLU(0.2),
nn.BatchNorm2d(64),
nn.Conv2d(64, 64, kernel_size=3, stride=1),
nn.LeakyReLU(0.2),
)
out_h = (h - 8) // 4 + 1
out_h = (out_h - 4) // 2 + 1
out_h = (out_h - 3) // 1 + 1
feat_size = 64 * out_h * out_h
self.fc = nn.Sequential(nn.Linear(feat_size, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))
for m in self.modules():
if isinstance(m, (nn.Conv2d, nn.Linear)):
nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
if m.bias is not None:
nn.init.constant_(m.bias, 0)
def forward(self, x):
x = x / 255.0
x = self.conv(x)
x = x.flatten(1)
return self.fc(x)
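# On-policy rollout storage with GAE. When filled by the parallel collector below,
# transitions from the workers are interleaved step by step, and compute_returns()
# treats that interleaved sequence as a single trajectory (the per-env done flags
# still cut the GAE recursion at episode boundaries).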
class RolloutBuffer:
def __init__(self, buffer_size, state_shape, action_dim):
self.buffer_size = buffer_size
self.ptr = 0
self.size = 0
self.states = np.zeros((buffer_size, *state_shape), dtype=np.uint8)
self.actions = np.zeros((buffer_size, action_dim), dtype=np.float32)
self.rewards = np.zeros(buffer_size, dtype=np.float32)
self.dones = np.zeros(buffer_size, dtype=np.bool_)
self.values = np.zeros(buffer_size, dtype=np.float32)
self.log_probs = np.zeros(buffer_size, dtype=np.float32)
def add(self, state, action, reward, done, value, log_prob):
self.states[self.ptr] = state
self.actions[self.ptr] = action
self.rewards[self.ptr] = reward
self.dones[self.ptr] = done
self.values[self.ptr] = value
self.log_probs[self.ptr] = log_prob
self.ptr = (self.ptr + 1) % self.buffer_size
self.size = min(self.size + 1, self.buffer_size)
def compute_returns(self, last_value, gamma=0.99, gae_lambda=0.98):
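        # Generalized Advantage Estimation, computed backwards through the buffer:
        #   delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
        #   A_t     = delta_t + gamma * lambda * (1 - done_t) * A_{t+1}
        # Returns are A_t + V(s_t).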
advantages = np.zeros(self.size, dtype=np.float32)
last_gae = 0
for t in reversed(range(self.size)):
if t == self.size - 1:
next_value = last_value
else:
next_value = self.values[t + 1]
delta = self.rewards[t] + gamma * next_value * (1 - self.dones[t]) - self.values[t]
last_gae = delta + gamma * gae_lambda * (1 - self.dones[t]) * last_gae
advantages[t] = last_gae
returns = advantages + self.values[: self.size]
return returns, advantages
def get(self):
return (
self.states[: self.size],
self.actions[: self.size],
self.rewards[: self.size],
self.dones[: self.size],
self.values[: self.size],
self.log_probs[: self.size],
)
def reset(self):
self.ptr = 0
self.size = 0
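# PPO update: for each mini-batch the clipped surrogate objective
#   L = -min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t),
# with r_t the new/old policy probability ratio and A_t the normalized advantage, is
# combined with an MSE value loss (weight vf_coef) and an entropy bonus (weight
# ent_coef). Gradients are clipped to max_grad_norm and the actor and critic use
# separate Adam optimizers.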
class PPOTrainer:
def __init__(
self,
actor,
critic,
rollout_buffer,
device,
clip_eps=0.1,
gamma=0.99,
gae_lambda=0.98,
lr=3e-4,
ent_coef=0.005,
vf_coef=0.75,
max_grad_norm=0.5,
ppo_epochs=10,
mini_batch_size=128,
):
self.actor = actor
self.critic = critic
self.buffer = rollout_buffer
self.device = device
self.clip_eps = clip_eps
self.gamma = gamma
self.gae_lambda = gae_lambda
self.ent_coef = ent_coef
self.vf_coef = vf_coef
self.max_grad_norm = max_grad_norm
self.ppo_epochs = ppo_epochs
self.mini_batch_size = mini_batch_size
self.actor_optim = torch.optim.Adam(actor.parameters(), lr=lr, eps=1e-5)
self.critic_optim = torch.optim.Adam(critic.parameters(), lr=lr, eps=1e-5)
def update(self, last_value):
states, actions, rewards, dones, values, log_probs_old = self.buffer.get()
returns, advantages = self.buffer.compute_returns(last_value, self.gamma, self.gae_lambda)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
states_t = torch.from_numpy(states).float().permute(0, 3, 1, 2).to(self.device)
actions_t = torch.from_numpy(actions).float().to(self.device)
log_probs_old_t = torch.from_numpy(log_probs_old).float().to(self.device)
returns_t = torch.from_numpy(returns).float().to(self.device)
advantages_t = torch.from_numpy(advantages).float().to(self.device)
dataset = torch.utils.data.TensorDataset(
states_t, actions_t, log_probs_old_t, returns_t, advantages_t
)
loader = torch.utils.data.DataLoader(dataset, batch_size=self.mini_batch_size, shuffle=True)
total_actor_loss = 0
total_critic_loss = 0
total_entropy = 0
count = 0
for _ in range(self.ppo_epochs):
for batch in loader:
s, a, log_pi_old, ret, adv = batch
mu, std = self.actor(s)
dist = torch.distributions.Normal(mu, std)
log_pi = dist.log_prob(a).sum(dim=-1)
entropy = dist.entropy().sum(dim=-1)
ratio = torch.exp(log_pi - log_pi_old)
surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * adv
actor_loss = -torch.min(surr1, surr2).mean()
value = self.critic(s)
critic_loss = nn.MSELoss()(value.squeeze(), ret)
loss = actor_loss + self.vf_coef * critic_loss - self.ent_coef * entropy.mean()
self.actor_optim.zero_grad()
self.critic_optim.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(self.actor.parameters(), self.max_grad_norm)
nn.utils.clip_grad_norm_(self.critic.parameters(), self.max_grad_norm)
self.actor_optim.step()
self.critic_optim.step()
total_actor_loss += actor_loss.item()
total_critic_loss += critic_loss.item()
total_entropy += entropy.mean().item()
count += 1
avg_actor = total_actor_loss / count
avg_critic = total_critic_loss / count
avg_entropy = total_entropy / count
self.buffer.reset()
return avg_actor, avg_critic, avg_entropy
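# Collect rollout_steps transitions in total (rollout_steps // num_envs synchronous
# steps across the parallel envs): sample actions from the current Gaussian policy,
# clamp them to [-1, 1], and record the values and log-probs needed for the PPO
# update. Returns the last observation batch and the rewards of episodes finished
# during the rollout.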
def collect_parallel_rollout(actor, critic, parallel_env, buffer, device, rollout_steps, num_envs):
obs_batch = parallel_env.reset()
episode_rewards = []
current_ep_rewards = [0.0] * num_envs
steps_per_env = rollout_steps // num_envs
for step in range(steps_per_env):
obs_t = torch.from_numpy(obs_batch).float().permute(0, 3, 1, 2).to(device)
with torch.no_grad():
mu, std = actor(obs_t)
dist = torch.distributions.Normal(mu, std)
action = dist.sample()
action = torch.clamp(action, -1, 1)
log_prob = dist.log_prob(action).sum(dim=-1)
value = critic(obs_t).squeeze(-1)
action_np = action.cpu().numpy()
log_prob_np = log_prob.cpu().numpy()
value_np = value.cpu().numpy()
next_obs_batch, reward_batch, done_batch = parallel_env.step(action_np)
for i in range(num_envs):
buffer.add(obs_batch[i], action_np[i], reward_batch[i],
done_batch[i], value_np[i], log_prob_np[i])
current_ep_rewards[i] += reward_batch[i]
if done_batch[i]:
episode_rewards.append(current_ep_rewards[i])
current_ep_rewards[i] = 0.0
obs_batch = next_obs_batch
    return obs_batch, (episode_rewards if episode_rewards else [sum(current_ep_rewards) / num_envs])
def train(
total_steps=2000000,
num_envs=4,
rollout_steps=4096,
eval_interval=10,
save_interval=50,
device=None,
):
if device is None:
device = get_device()
print(f"Creating {num_envs} parallel environments with reward shaping...")
parallel_env = ParallelEnv(num_envs=num_envs)
state_shape = (84, 84, 4)
action_dim = 3
actor = Actor(state_shape=state_shape, action_dim=action_dim).to(device)
critic = Critic(state_shape=state_shape).to(device)
buffer = RolloutBuffer(rollout_steps, state_shape, action_dim)
trainer = PPOTrainer(
actor=actor,
critic=critic,
rollout_buffer=buffer,
device=device,
clip_eps=0.1,
gamma=0.99,
gae_lambda=0.98,
lr=3e-4,
ent_coef=0.005,
vf_coef=0.75,
max_grad_norm=0.5,
ppo_epochs=10,
mini_batch_size=256,
)
log_dir = os.path.join("logs", "tensorboard", f"run_parallel_improved_{int(time.time())}")
writer = SummaryWriter(log_dir)
print(f"Training on {device} with {num_envs} parallel envs")
print(f"Log directory: {log_dir}")
print("Improvements: Parallel + Reward Shaping + Larger Rollout")
episode = 0
total_timesteps = 0
episode_rewards = []
best_eval = -float("inf")
while total_timesteps < total_steps:
obs_batch, batch_rewards = collect_parallel_rollout(
actor, critic, parallel_env, buffer, device, rollout_steps, num_envs
)
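        # Bootstrap with the mean critic value over the envs' final observations
        # (a single scalar shared across the interleaved trajectories).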
with torch.no_grad():
obs_t = torch.from_numpy(obs_batch).float().permute(0, 3, 1, 2).to(device)
last_value = critic(obs_t).mean().item()
actor_loss, critic_loss, entropy = trainer.update(last_value)
writer.add_scalar("Loss/Actor", actor_loss, total_timesteps)
writer.add_scalar("Loss/Critic", critic_loss, total_timesteps)
writer.add_scalar("Loss/Entropy", entropy, total_timesteps)
total_timesteps += rollout_steps
episode += 1
episode_rewards.extend(batch_rewards)
recent_rewards = episode_rewards[-50:] if len(episode_rewards) > 50 else episode_rewards
avg_reward = np.mean(recent_rewards)
mean_batch_reward = np.mean(batch_rewards)
writer.add_scalar("Reward/EpisodeMean", mean_batch_reward, total_timesteps)
writer.add_scalar("Reward/AvgLast50", avg_reward, total_timesteps)
print(f"Episode {episode}, steps {total_timesteps}, mean_reward={mean_batch_reward:.1f}, avg_50={avg_reward:.1f}")
if episode % eval_interval == 0:
eval_returns = []
for _ in range(5):
eval_env = make_env()
eval_obs, _ = eval_env.reset()
eval_obs = np.transpose(eval_obs, (1, 2, 0))
eval_reward = 0
done = False
while not done:
with torch.no_grad():
eval_obs_t = (
torch.from_numpy(eval_obs)
.float()
.unsqueeze(0)
.permute(0, 3, 1, 2)
.to(device)
)
mu, std = actor(eval_obs_t)
action = torch.clamp(mu, -1, 1).squeeze(0).cpu().numpy()
eval_obs, reward, terminated, truncated, _ = eval_env.step(action)
eval_obs = np.transpose(eval_obs, (1, 2, 0))
eval_reward += reward
done = terminated or truncated
eval_returns.append(eval_reward)
eval_env.close()
mean_eval = np.mean(eval_returns)
writer.add_scalar("Eval/MeanReturn", mean_eval, episode)
print(f" Eval: mean_return={mean_eval:.2f}")
if mean_eval > best_eval:
best_eval = mean_eval
os.makedirs("models", exist_ok=True)
torch.save(
{
"actor": actor.state_dict(),
"critic": critic.state_dict(),
"episode": episode,
"timesteps": total_timesteps,
"best_eval": best_eval,
},
os.path.join("models", "ppo_parallel_improved_best.pt"),
)
print(f" New best model saved! eval={best_eval:.2f}")
if episode % save_interval == 0:
os.makedirs("models", exist_ok=True)
torch.save(
{
"actor": actor.state_dict(),
"critic": critic.state_dict(),
"episode": episode,
"timesteps": total_timesteps,
},
os.path.join("models", f"ppo_parallel_improved_ep{episode}.pt"),
)
print(f" Saved model at episode {episode}")
os.makedirs("models", exist_ok=True)
torch.save(
{
"actor": actor.state_dict(),
"critic": critic.state_dict(),
"episode": episode,
"timesteps": total_timesteps,
"best_eval": best_eval,
},
os.path.join("models", "ppo_parallel_improved_final.pt"),
)
writer.close()
parallel_env.close()
print(f"Training complete! Total episodes: {episode}, Best eval: {best_eval:.2f}")
if __name__ == "__main__":
try:
set_start_method('fork')
except RuntimeError:
pass
parser = argparse.ArgumentParser()
parser.add_argument("--steps", type=int, default=2000000, help="Total training steps")
parser.add_argument("--num_envs", type=int, default=4, help="Number of parallel environments")
parser.add_argument("--rollout", type=int, default=4096, help="Rollout buffer size")
args = parser.parse_args()
device = get_device()
train(
total_steps=args.steps,
num_envs=args.num_envs,
rollout_steps=args.rollout,
device=device,
)