feat: add RL project report and refactor coursework report code structure

- Add the reinforcement learning individual project report, including a PPO algorithm implemented from scratch in PyTorch
- Refactor the coursework report code structure, extracting runtime path management and notebook execution logic into standalone modules
- Update requirements.txt with the reinforcement-learning-related dependencies
- Simplify the model comparison results table, keeping only the baseline logistic regression results
2026-04-30 16:54:41 +08:00
parent 6ac02ba4fe
commit d353133b31
21 changed files with 1639 additions and 102 deletions
README.md (+57)
@@ -0,0 +1,57 @@
# PPO for CarRacing-v3
A from-scratch PPO implementation for CarRacing-v3; no Stable-Baselines or other dedicated RL libraries are used.
## Setup
```bash
conda activate my_env
uv pip install -r requirements.txt
```
## Train
```bash
python train.py --steps 500000
```
## Evaluate
```bash
python src/evaluate.py --model models/ppo_carracing_final.pt --episodes 10
```
## TensorBoard
```bash
tensorboard --logdir logs/tensorboard
```
## Project Structure
```
src/
├── network.py # Actor (Gaussian policy) and Critic (Value) networks
├── replay_buffer.py # Rollout buffer with GAE computation
├── trainer.py # PPO update with clipped surrogate objective
├── utils.py # Environment wrappers (grayscale, resize, frame stack)
└── evaluate.py # Evaluation script
train.py # Main training entry point
models/ # Saved checkpoints
logs/tensorboard/ # TensorBoard logs
```
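The modules above can also be driven directly from Python. A minimal usage sketch, assuming a trained checkpoint already exists at `models/ppo_carracing_final.pt`:

```python
import torch
from src.network import Actor
from src.utils import make_env, get_device

device = get_device()
env = make_env()        # grayscale + resize(84x84) + 4-frame stack -> obs shape (4, 84, 84)
actor = Actor().to(device)

ckpt = torch.load("models/ppo_carracing_final.pt", map_location=device, weights_only=False)
actor.load_state_dict(ckpt["actor"])

obs, _ = env.reset()
obs_t = torch.from_numpy(obs).float().unsqueeze(0).to(device)   # (1, 4, 84, 84)
with torch.no_grad():
    mu, std = actor(obs_t)          # deterministic action = tanh mean, already in [-1, 1]
action = mu.squeeze(0).cpu().numpy()
obs, reward, terminated, truncated, _ = env.step(action)
```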
## Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning rate | 3e-4 |
| Gamma | 0.99 |
| GAE lambda | 0.95 |
| Clip epsilon | 0.2 |
| PPO epochs | 4 |
| Mini-batch size | 64 |
| Rollout steps | 2048 |
| Entropy coefficient | 0.01 |
| Value coefficient | 0.5 |
| Max gradient norm | 0.5 |
TASK_PROGRESS.md
@@ -0,0 +1,136 @@
# PPO + CarRacing-v3 Task Progress Tracker
> Generated: 2026/04/30
---
## Assignment Requirements
Implement the PPO algorithm from scratch in Python, train an agent in the CarRacing-v3 environment, and submit:
- Technical report (≤3000 words, in English) as a PDF
- Source code + trained model as a zip file
- Deadline: 04/May/2026 23:59
- **Not allowed:** Stable-Baselines and other dedicated RL libraries
- **Allowed:** TensorBoard, PyTorch, Gymnasium
---
## 1. Completed ✅
| Step | Description | Files |
|------|------|------|
| ✅ Project skeleton | src/ directory, requirements.txt, README.md | [requirements.txt](requirements.txt), [README.md](README.md) |
| ✅ Policy/value networks | Actor (Gaussian policy outputting μ, σ) + Critic, CNN backbone | [src/network.py](src/network.py) |
| ✅ Rollout buffer | Trajectory storage + GAE advantage estimation + return computation | [src/replay_buffer.py](src/replay_buffer.py) |
| ✅ PPO trainer | PPO update (clipped surrogate objective + entropy regularization + value loss) | [src/trainer.py](src/trainer.py) |
| ✅ Environment preprocessing | Grayscale + resize (84×84) + 4-frame stack wrappers | [src/utils.py](src/utils.py) |
| ✅ Evaluation script | Rendered test run + multi-episode average score | [src/evaluate.py](src/evaluate.py) |
| ✅ Training entry point | Main training loop, TensorBoard logging, model checkpointing | [train.py](train.py) |
**Core implementation notes** (formulas sketched below)
- Policy network: 3-layer CNN + FC(512) → μ, σ (Gaussian policy, tanh on the mean)
- Value network: 3-layer CNN + FC(512) → V(s)
- GAE: λ=0.95, with advantage normalization
- PPO clip: ε=0.2, 4 epochs per update, mini-batch size 64
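For reference, a sketch of the two standard formulas these components implement (PPO clipped surrogate objective and GAE; generic notation, not lifted from the report draft):

```latex
% Probability ratio: r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\mathrm{old}}(a_t \mid s_t)
% Clipped surrogate objective:
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]

% GAE advantage, with TD error \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t):
\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}
```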
---
## 2. To Do ⬜
| Step | Description | Priority |
|------|------|--------|
| ⬜ Install dependencies | `uv pip install --system -r requirements.txt` | **High** |
| ⬜ Smoke test | Short run (~10,000 steps) to verify the code runs end to end | **High** |
| ⬜ Full training | Run 500k+ steps, estimated 5-8 hours (in the background) | **High (time-consuming)** |
| ⬜ Generate plots | Extract data from TensorBoard and plot with matplotlib (sketch below) | Medium |
| ⬜ Write report | Technical report in English (≤3000 words), typeset in LaTeX | Medium |
| ⬜ Compile PDF | Build CW1_1234560.pdf with XeLaTeX | Medium |
| ⬜ Package zip | Bundle source code + model into CW1_1234560.zip | Low |
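A minimal plotting sketch for the "Generate plots" step, assuming the default log layout and the scalar tags written by `train.py` (`Reward/Episode`, `Reward/AvgLast10`); the run directory name is a placeholder:

```python
"""Extract scalars from a TensorBoard run and plot them with matplotlib."""
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

run_dir = "logs/tensorboard/run_XXXXXXXXXX"   # placeholder: pick the actual run folder

acc = EventAccumulator(run_dir)
acc.Reload()                                  # read the event files from disk

# Each tag yields a list of events with .step and .value
episode = acc.Scalars("Reward/Episode")
avg10 = acc.Scalars("Reward/AvgLast10")

plt.plot([e.step for e in episode], [e.value for e in episode], label="Rollout reward")
plt.plot([e.step for e in avg10], [e.value for e in avg10], label="Average of last 10")
plt.xlabel("Environment steps")
plt.ylabel("Reward")
plt.legend()
plt.savefig("training_curve.png", dpi=200)
```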
---
## 3. File Structure
```
强化学习个人项目报告/
├── src/
│ ├── __init__.py
│ ├── network.py # Actor + Critic CNN networks
│ ├── replay_buffer.py # Rollout buffer + GAE
│ ├── trainer.py # PPO update logic
│ ├── utils.py # environment preprocessing wrappers
│ └── evaluate.py # evaluation script
├── train.py # main training entry point
├── requirements.txt
├── README.md
└── TASK_PROGRESS.md # this document
```
---
## 4. Hyperparameter Configuration
| Parameter | Value |
|------|-----|
| Learning rate | 3e-4 |
| Gamma | 0.99 |
| GAE lambda | 0.95 |
| Clip epsilon | 0.2 |
| PPO epochs | 4 |
| Mini-batch size | 64 |
| Rollout steps | 2048 |
| Entropy coefficient | 0.01 |
| Value coefficient | 0.5 |
| Max gradient norm | 0.5 |
| State shape | (84, 84, 4) |
| Action dim | 3 (continuous: steer, gas, brake) |
---
## 5. Next Actions
### Run now
```bash
# 1. Install dependencies
uv pip install --system -r requirements.txt
# 2. Verify the code runs (short test)
python train.py --steps 10000
# 3. Start full training (run in the background, est. 5-8 hours)
python train.py --steps 500000
```
### After training finishes
```bash
# TensorBoard visualization
tensorboard --logdir logs/tensorboard
# Evaluate the model
python src/evaluate.py --model models/ppo_carracing_final.pt --episodes 10
```
### After the report is written
```bash
# Compile the PDF
cd tex && xelatex CW1_1234560.tex
```
---
## 6. Report Structure (≤3000 words)
1. **Introduction**: RL background, the CarRacing-v3 task, state/action/reward space definitions
2. **Methodology**: PPO mathematical formulation, the clipping mechanism, GAE advantage estimation
3. **Implementation Details**: network architecture, training pipeline, hyperparameters, problems encountered and their solutions
4. **Results and Analysis**: training curves, evaluation results, comparison against an SB3 baseline
5. **Conclusion**: summary of PPO's hyperparameter sensitivity and the effectiveness of the actor-critic approach
---
## 7. Submission Checklist
- [ ] `CW1_1234560.pdf`: technical report (cover page + ≤3000 words)
- [ ] `CW1_1234560.zip`: source code + trained model .pt file
- [ ] All code comments in English
- [ ] Figure axes and legends in English
requirements.txt
@@ -0,0 +1,6 @@
torch
gymnasium[box2d]
numpy
opencv-python
matplotlib
tensorboard
src/__init__.py
@@ -0,0 +1,6 @@
"""PPO Agent for CarRacing-v3 environment."""
from .network import Actor, Critic
from .replay_buffer import RolloutBuffer
from .trainer import PPOTrainer
__all__ = ['Actor', 'Critic', 'RolloutBuffer', 'PPOTrainer']
src/evaluate.py
@@ -0,0 +1,92 @@
"""Evaluation script for trained PPO agent."""
import torch
import numpy as np
import gymnasium as gym
from src.utils import make_env, get_device
from src.network import Actor, Critic
def evaluate(actor, env, num_episodes=10, device=torch.device("cpu")):
"""Evaluate actor and return average return."""
actor.eval()
returns = []
for ep in range(num_episodes):
obs, _ = env.reset()
obs = np.transpose(obs, (1, 2, 0)) # (C, H, W) -> (H, W, C) for storage
total_reward = 0
done = False
steps = 0
while not done and steps < 1000:
with torch.no_grad():
# Convert to tensor (B, C, H, W)
obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
mu, std = actor(obs_t)
# Sample action
dist = torch.distributions.Normal(mu, std)
action = dist.sample()
action = torch.clamp(action, -1, 1).squeeze(0).cpu().numpy()
obs, reward, terminated, truncated, _ = env.step(action)
            # Convert back to (H, W, C) to match the stored layout
obs = np.transpose(obs, (1, 2, 0))
total_reward += reward
done = terminated or truncated
steps += 1
returns.append(total_reward)
print(f"Episode {ep+1}/{num_episodes}: return={total_reward:.1f}, steps={steps}")
actor.train()
return np.mean(returns), np.std(returns)
def evaluate_render(actor, env, device):
"""Render and evaluate agent with visualization."""
actor.eval()
obs, _ = env.reset()
obs = np.transpose(obs, (1, 2, 0))
    # NOTE: gymnasium expects render_mode to be set when the env is created
    # (gym.make(..., render_mode="human")); assigning the attribute here may have no effect.
    env.render_mode = "human"
done = False
total_reward = 0
while not done:
with torch.no_grad():
obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
mu, std = actor(obs_t)
dist = torch.distributions.Normal(mu, std)
action = dist.sample()
action = torch.clamp(action, -1, 1).squeeze(0).cpu().numpy()
obs, reward, terminated, truncated, _ = env.step(action)
obs = np.transpose(obs, (1, 2, 0))
total_reward += reward
done = terminated or truncated
env.render()
actor.train()
print(f"Final return: {total_reward:.1f}")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, required=True, help="Path to trained model")
parser.add_argument("--episodes", type=int, default=5, help="Number of evaluation episodes")
args = parser.parse_args()
device = get_device()
env = make_env()
actor = Actor().to(device)
critic = Critic().to(device)
# Load model
checkpoint = torch.load(args.model, map_location=device, weights_only=False)
actor.load_state_dict(checkpoint["actor"])
print(f"Loaded model from {args.model}")
mean_return, std_return = evaluate(actor, env, num_episodes=args.episodes, device=device)
print(f"\nEvaluation: mean={mean_return:.2f}, std={std_return:.2f}")
src/network.py
@@ -0,0 +1,78 @@
"""Neural network architectures for Actor and Critic."""
import torch
import torch.nn as nn
import torch.nn.functional as F
class Actor(nn.Module):
"""Actor network outputting Gaussian policy parameters (mu, sigma)."""
def __init__(self, state_shape=(84, 84, 4), action_dim=3):
super().__init__()
c, h, w = state_shape[2], state_shape[0], state_shape[1] # channels, height, width
self.conv = nn.Sequential(
nn.Conv2d(c, 32, kernel_size=8, stride=4),
nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=4, stride=2),
nn.ReLU(),
nn.Conv2d(64, 64, kernel_size=3, stride=1),
nn.ReLU(),
)
        # Feature map size: 84x84 -> 20x20 -> 9x9 -> 7x7 through the three conv layers
        feat_size = 64 * 7 * 7
self.fc = nn.Sequential(
nn.Linear(feat_size, 512),
nn.ReLU(),
)
self.mu_head = nn.Linear(512, action_dim)
self.log_std_head = nn.Linear(512, action_dim)
# Initialize output layers
nn.init.orthogonal_(self.mu_head.weight, gain=0.01)
nn.init.orthogonal_(self.log_std_head.weight, gain=0.01)
def forward(self, x):
"""Forward pass returning (mu, log_std)."""
x = x / 255.0 # Normalize
x = self.conv(x)
x = x.view(x.size(0), -1)
x = self.fc(x)
mu = torch.tanh(self.mu_head(x))
log_std = self.log_std_head(x)
log_std = torch.clamp(log_std, -20, 2)
return mu, log_std.exp()
class Critic(nn.Module):
"""Critic network estimating state value V(s)."""
def __init__(self, state_shape=(84, 84, 4)):
super().__init__()
c, h, w = state_shape[2], state_shape[0], state_shape[1]
self.conv = nn.Sequential(
nn.Conv2d(c, 32, kernel_size=8, stride=4),
nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=4, stride=2),
nn.ReLU(),
nn.Conv2d(64, 64, kernel_size=3, stride=1),
nn.ReLU(),
)
        feat_size = 64 * 7 * 7  # 84x84 -> 7x7 through the conv stack
self.fc = nn.Sequential(
nn.Linear(feat_size, 512),
nn.ReLU(),
nn.Linear(512, 1)
)
def forward(self, x):
"""Forward pass returning V(s)."""
x = x / 255.0
x = self.conv(x)
x = x.view(x.size(0), -1)
return self.fc(x)
src/replay_buffer.py
@@ -0,0 +1,64 @@
"""Rollout buffer for storing trajectories."""
import numpy as np
class RolloutBuffer:
"""Stores trajectories for PPO training."""
def __init__(self, buffer_size, state_shape, action_dim):
self.buffer_size = buffer_size
self.ptr = 0
self.size = 0
self.states = np.zeros((buffer_size, *state_shape), dtype=np.uint8)
self.actions = np.zeros((buffer_size, action_dim), dtype=np.float32)
self.rewards = np.zeros(buffer_size, dtype=np.float32)
self.dones = np.zeros(buffer_size, dtype=np.bool_)
self.values = np.zeros(buffer_size, dtype=np.float32)
        # Joint log-probability of each stored action (summed over action dimensions)
        self.log_probs = np.zeros(buffer_size, dtype=np.float32)
def add(self, state, action, reward, done, value, log_prob):
"""Add a transition to the buffer."""
self.states[self.ptr] = state
self.actions[self.ptr] = action
self.rewards[self.ptr] = reward
self.dones[self.ptr] = done
self.values[self.ptr] = value
self.log_probs[self.ptr] = log_prob
self.ptr = (self.ptr + 1) % self.buffer_size
self.size = min(self.size + 1, self.buffer_size)
def compute_returns(self, last_value, gamma=0.99, gae_lambda=0.95):
"""Compute returns and advantages using GAE."""
advantages = np.zeros(self.size, dtype=np.float32)
last_gae = 0
# Compute GAE backwards
for t in reversed(range(self.size)):
if t == self.size - 1:
next_value = last_value
else:
next_value = self.values[t + 1]
delta = self.rewards[t] + gamma * next_value * (1 - self.dones[t]) - self.values[t]
last_gae = delta + gamma * gae_lambda * (1 - self.dones[t]) * last_gae
advantages[t] = last_gae
returns = advantages + self.values[:self.size]
return returns, advantages
def get(self):
"""Return all data as numpy arrays."""
return (
self.states[:self.size],
self.actions[:self.size],
self.rewards[:self.size],
self.dones[:self.size],
self.values[:self.size],
self.log_probs[:self.size],
)
def reset(self):
"""Reset buffer."""
self.ptr = 0
self.size = 0
src/trainer.py
@@ -0,0 +1,123 @@
"""PPO Trainer with GAE advantage estimation."""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
class PPOTrainer:
"""PPO trainer handling the training loop."""
def __init__(
self,
actor,
critic,
rollout_buffer,
device,
clip_eps=0.2,
gamma=0.99,
gae_lambda=0.95,
lr=3e-4,
ent_coef=0.01,
vf_coef=0.5,
max_grad_norm=0.5,
ppo_epochs=4,
mini_batch_size=64,
):
self.actor = actor
self.critic = critic
self.buffer = rollout_buffer
self.device = device
self.clip_eps = clip_eps
self.gamma = gamma
self.gae_lambda = gae_lambda
self.ent_coef = ent_coef
self.vf_coef = vf_coef
self.max_grad_norm = max_grad_norm
self.ppo_epochs = ppo_epochs
self.mini_batch_size = mini_batch_size
# Separate optimizers
self.actor_optim = optim.Adam(actor.parameters(), lr=lr)
self.critic_optim = optim.Adam(critic.parameters(), lr=lr)
self.loss_history = {'actor': [], 'critic': [], 'entropy': [], 'total': []}
def update(self, last_value):
"""Perform one PPO update."""
states, actions, rewards, dones, values, log_probs_old = self.buffer.get()
# Compute returns and advantages
returns, advantages = self.buffer.compute_returns(
last_value, self.gamma, self.gae_lambda
)
# Normalize advantages
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
# Convert to tensors
states_t = torch.from_numpy(states).float().to(self.device)
actions_t = torch.from_numpy(actions).float().to(self.device)
log_probs_old_t = torch.from_numpy(log_probs_old).float().to(self.device)
returns_t = torch.from_numpy(returns).float().to(self.device)
advantages_t = torch.from_numpy(advantages).float().to(self.device)
dataset = TensorDataset(states_t, actions_t, log_probs_old_t, returns_t, advantages_t)
loader = DataLoader(dataset, batch_size=self.mini_batch_size, shuffle=True)
total_actor_loss = 0
total_critic_loss = 0
total_entropy = 0
count = 0
for _ in range(self.ppo_epochs):
for batch in loader:
s, a, log_pi_old, ret, adv = batch
# Get current policy distribution
mu, std = self.actor(s)
dist = torch.distributions.Normal(mu, std)
                log_pi = dist.log_prob(a).sum(dim=-1)      # joint log-prob per sample, shape (batch,)
                entropy = dist.entropy().sum(dim=-1)       # joint entropy per sample, shape (batch,)
# Probability ratio
ratio = torch.exp(log_pi - log_pi_old)
# Clipped surrogate objective
surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * adv
actor_loss = -torch.min(surr1, surr2).mean()
# Value loss
value = self.critic(s)
critic_loss = nn.MSELoss()(value.squeeze(), ret)
# Total loss
loss = actor_loss + self.vf_coef * critic_loss - self.ent_coef * entropy.mean()
# Update
self.actor_optim.zero_grad()
self.critic_optim.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(self.actor.parameters(), self.max_grad_norm)
nn.utils.clip_grad_norm_(self.critic.parameters(), self.max_grad_norm)
self.actor_optim.step()
self.critic_optim.step()
total_actor_loss += actor_loss.item()
total_critic_loss += critic_loss.item()
total_entropy += entropy.mean().item()
count += 1
avg_actor = total_actor_loss / count
avg_critic = total_critic_loss / count
avg_entropy = total_entropy / count
self.loss_history['actor'].append(avg_actor)
self.loss_history['critic'].append(avg_critic)
self.loss_history['entropy'].append(avg_entropy)
self.loss_history['total'].append(avg_actor + avg_critic)
self.buffer.reset()
return avg_actor, avg_critic, avg_entropy
src/utils.py
@@ -0,0 +1,87 @@
"""Utility functions for environment, device detection, and TensorBoard."""
import cv2
import gymnasium as gym
import numpy as np
import torch
from collections import deque
class GrayScaleWrapper(gym.ObservationWrapper):
"""Convert RGB observation to grayscale."""
    def __init__(self, env):
        super().__init__(env)
        # Declare the single-channel observation space (drops the RGB channel axis)
        self.observation_space = gym.spaces.Box(
            low=0, high=255, shape=env.observation_space.shape[:2], dtype=np.uint8
        )
def observation(self, obs):
# RGB to grayscale: weighted average
gray = 0.299 * obs[:, :, 0] + 0.587 * obs[:, :, 1] + 0.114 * obs[:, :, 2]
return gray.astype(np.uint8)
class ResizeWrapper(gym.ObservationWrapper):
"""Resize observation to target size."""
    def __init__(self, env, size=(84, 84)):
        super().__init__(env)
        self.size = size
        # Declare the resized observation space (keeps the channel axis if one is present)
        old_shape = env.observation_space.shape
        new_shape = (*size, old_shape[2]) if len(old_shape) == 3 else tuple(size)
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=new_shape, dtype=np.uint8)
    def observation(self, obs):
        # cv2.resize takes dsize as (width, height); the target is square so the order does not matter
        return cv2.resize(obs, self.size, interpolation=cv2.INTER_AREA)
class FrameStackWrapper(gym.ObservationWrapper):
"""Stack last N frames."""
def __init__(self, env, num_stack=4):
super().__init__(env)
self.num_stack = num_stack
self.frames = deque(maxlen=num_stack)
obs_shape = env.observation_space.shape
self.observation_space = gym.spaces.Box(
low=0, high=255,
shape=(num_stack, *obs_shape[-2:]),
dtype=np.uint8
)
def reset(self, **kwargs):
obs, info = self.env.reset(**kwargs)
for _ in range(self.num_stack):
self.frames.append(obs)
return self._get_observation(), info
def observation(self, obs):
self.frames.append(obs)
return self._get_observation()
def _get_observation(self):
return np.stack(list(self.frames), axis=0)
def make_env(env_id="CarRacing-v3", gray_scale=True, resize=True, frame_stack=4):
"""Create preprocessed CarRacing environment."""
env = gym.make(env_id, render_mode="rgb_array")
if resize:
env = ResizeWrapper(env, size=(84, 84))
if gray_scale:
env = GrayScaleWrapper(env)
if frame_stack > 1:
env = FrameStackWrapper(env, num_stack=frame_stack)
return env
def get_device():
"""Detect and return available device."""
if torch.cuda.is_available():
device = torch.device("cuda")
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
device = torch.device("cpu")
print("Using CPU")
return device
def preprocess_obs(obs):
"""Ensure observation is in correct format for network."""
if len(obs.shape) == 2: # single channel
obs = np.expand_dims(obs, axis=0)
return obs
train.py (+192)
@@ -0,0 +1,192 @@
"""Main training script for PPO on CarRacing-v3."""
import os
import time
import argparse
import numpy as np
import torch
from torch.utils.tensorboard import SummaryWriter
from src.network import Actor, Critic
from src.replay_buffer import RolloutBuffer
from src.trainer import PPOTrainer
from src.utils import make_env, get_device
def collect_rollout(actor, critic, env, buffer, device, rollout_steps):
"""Collect rollout data."""
obs, _ = env.reset()
    # The wrapped env returns (C, H, W); store as (H, W, C) to match the buffer's state_shape
    obs = np.transpose(obs, (1, 2, 0))
for step in range(rollout_steps):
with torch.no_grad():
# Convert to (B, C, H, W)
obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
mu, std = actor(obs_t)
dist = torch.distributions.Normal(mu, std)
action = dist.sample()
action = torch.clamp(action, -1, 1)
            log_prob = dist.log_prob(action).sum(dim=-1)  # joint log-prob over action dims, shape (1,)
value = critic(obs_t).squeeze(0).item()
action_np = action.squeeze(0).cpu().numpy()
log_prob_np = log_prob.squeeze(0).cpu().numpy()
next_obs, reward, terminated, truncated, _ = env.step(action_np)
done = terminated or truncated
        # Convert next_obs from (C, H, W) to (H, W, C) for storage
        next_obs_stored = np.transpose(next_obs, (1, 2, 0))
buffer.add(obs.copy(), action_np, reward, done, value, log_prob_np)
obs = next_obs_stored
        if done:
            obs, _ = env.reset()
            obs = np.transpose(obs, (1, 2, 0))
    # Return the final observation so the caller can bootstrap the last value for GAE
    return obs
def train(
total_steps=500000,
rollout_steps=2048,
eval_interval=10,
save_interval=50,
device=None,
):
"""Main training loop."""
if device is None:
device = get_device()
env = make_env()
eval_env = make_env()
state_shape = (84, 84, 4)
action_dim = 3
actor = Actor(state_shape=state_shape, action_dim=action_dim).to(device)
critic = Critic(state_shape=state_shape).to(device)
buffer = RolloutBuffer(
buffer_size=rollout_steps,
state_shape=state_shape,
action_dim=action_dim,
)
trainer = PPOTrainer(
actor=actor,
critic=critic,
rollout_buffer=buffer,
device=device,
clip_eps=0.2,
gamma=0.99,
gae_lambda=0.95,
lr=3e-4,
ent_coef=0.01,
vf_coef=0.5,
max_grad_norm=0.5,
ppo_epochs=4,
mini_batch_size=64,
)
# TensorBoard
log_dir = os.path.join("logs", "tensorboard", f"run_{int(time.time())}")
writer = SummaryWriter(log_dir)
print(f"Training on {device}")
print(f"Log directory: {log_dir}")
episode = 0
total_timesteps = 0
episode_rewards = []
recent_rewards = []
while total_timesteps < total_steps:
        # Collect rollout; keep the final observation for value bootstrapping
        obs = collect_rollout(actor, critic, env, buffer, device, rollout_steps)
        # Get last value for GAE
        with torch.no_grad():
            obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
            last_value = critic(obs_t).squeeze(0).item()
        # Total reward collected over this rollout (read before the update, which resets the buffer)
        ep_reward = buffer.rewards[:buffer.size].sum()
        # PPO update
        actor_loss, critic_loss, entropy = trainer.update(last_value)
        # Logging
        writer.add_scalar("Loss/Actor", actor_loss, total_timesteps)
        writer.add_scalar("Loss/Critic", critic_loss, total_timesteps)
        writer.add_scalar("Loss/Entropy", entropy, total_timesteps)
        total_timesteps += rollout_steps
        episode += 1
        episode_rewards.append(ep_reward)
        recent_rewards.append(ep_reward)
# Running average of last 10 episodes
avg_reward = np.mean(recent_rewards[-10:]) if len(recent_rewards) >= 10 else np.mean(recent_rewards)
writer.add_scalar("Reward/Episode", ep_reward, total_timesteps)
writer.add_scalar("Reward/AvgLast10", avg_reward, total_timesteps)
print(f"Episode {episode}, steps {total_timesteps}, ep_reward={ep_reward:.1f}, avg_10={avg_reward:.1f}")
# Evaluation
if episode % eval_interval == 0:
eval_returns = []
for _ in range(5):
eval_obs, _ = eval_env.reset()
eval_obs = np.transpose(eval_obs, (1, 2, 0))
eval_reward = 0
done = False
while not done:
with torch.no_grad():
eval_obs_t = torch.from_numpy(eval_obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
mu, std = actor(eval_obs_t)
action = torch.clamp(mu, -1, 1).squeeze(0).cpu().numpy()
eval_obs, reward, terminated, truncated, _ = eval_env.step(action)
eval_obs = np.transpose(eval_obs, (1, 2, 0))
eval_reward += reward
done = terminated or truncated
eval_returns.append(eval_reward)
mean_eval = np.mean(eval_returns)
writer.add_scalar("Eval/MeanReturn", mean_eval, episode)
print(f" Eval: mean_return={mean_eval:.2f}")
# Save model
if episode % save_interval == 0:
os.makedirs("models", exist_ok=True)
torch.save({
"actor": actor.state_dict(),
"critic": critic.state_dict(),
"episode": episode,
"timesteps": total_timesteps,
}, os.path.join("models", f"ppo_carracing_ep{episode}.pt"))
print(f" Saved model at episode {episode}")
# Save final model
os.makedirs("models", exist_ok=True)
torch.save({
"actor": actor.state_dict(),
"critic": critic.state_dict(),
"episode": episode,
"timesteps": total_timesteps,
}, os.path.join("models", "ppo_carracing_final.pt"))
writer.close()
print(f"Training complete! Total episodes: {episode}")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--steps", type=int, default=500000, help="Total training steps")
parser.add_argument("--rollout", type=int, default=2048, help="Rollout buffer size")
args = parser.parse_args()
device = get_device()
train(total_steps=args.steps, rollout_steps=args.rollout, device=device)