chore: update project docs, dependencies, and training scripts

- Update requirements.txt: add opencv-python-headless and document uv installation
- Fix line endings in CSV files (CRLF → LF)
- Update TASK_PROGRESS.md: record the parallel-training implementation and WSL support
- Tidy train_improved.py formatting; remove redundant blank lines and comments
- Fix the character encoding of the coursework-requirements document
- Add new TensorBoard log files and trained models
2026-05-01 09:26:23 +08:00
parent 6b929e9790
commit d6860f1f15
16 changed files with 25712 additions and 25680 deletions
@@ -1,5 +1,5 @@
Complete an individual reinforcement learning course assignment report: implement a PPO (Proximal Policy Optimization) algorithm from scratch in Python, train an agent to complete the racing task in the CarRacing-v3 environment, and on that basis submit a technical report of no more than 3000 words that systematically presents your method and results. Specifically: introduce the reinforcement learning background of the task; define the state space, action space, and reward mechanism; explain PPO's objective function, clipping mechanism, and advantage estimation method; describe the policy and value network architectures, the training procedure, the hyperparameter settings, and the problems encountered during implementation along with their solutions; present training and test results with figures, analyzing model performance and trends; and briefly compare against baselines such as Stable-Baselines3 in terms of stability and sample efficiency. In addition, submit a zip file containing all source code and trained models plus a separate PDF report, with file naming and submission format following the requirements. The implementation may not directly use RL-specific libraries such as Stable-Baselines, but TensorBoard may reasonably be used to log experimental results.
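The three PPO ingredients the report must explain — the probability ratio, the clipped surrogate objective, and GAE — can be stated compactly in the standard notation of the original PPO paper ($\epsilon$ is the clip epsilon from the hyperparameter table later in this commit):

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \right]

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l}
```

These equations correspond directly to the `compute_returns` (GAE) loop and the `surr1`/`surr2` clipping in the `train_improved.py` diff further down.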
This PDF requires an individual reinforcement learning project report: choose an Atari game, implement and train a deep reinforcement learning algorithm of your choice to reach competitive performance, then submit a technical report of no more than 3000 words and a zip file containing all source code and trained models. The report should describe the chosen game and its challenges; survey the state of deep reinforcement learning, especially its application to Atari games; compare the algorithms considered and explain why the final method was chosen; detail the algorithm's principles and concrete implementation; evaluate the agent's performance, stating the chosen benchmarks and evaluation metrics, and analyze why the algorithm performs well or poorly on this game; and present results in figures with clearly labeled axes and legends. The assignment explicitly forbids implementing the algorithm directly with RL-specific libraries such as Stable-Baselines, though they may be used for benchmarking. Code quality, result analysis, report structure, use of figures, and citation practice are all graded, and the PDF and zip must be named and submitted in the specified format.
Complete an individual machine learning course assignment: on a health-insurance dataset, build and improve a multi-class model that predicts an applicant's premium risk level (Low / Standard / High). First complete the Jupyter Notebook part: data cleaning and preprocessing; identifying and removing data-leakage features; building a baseline model; comparing a random forest against one boosting model; tuning with an advanced hyperparameter-optimization method; completing the personalized improvement assigned by the last digit of your student ID plus at least one optional improvement; running an unsupervised exploration with K-Means and GMM; and finally selecting the final model based on validation results and exporting the hidden-test CSV in the required format. Also submit a Theory and Reflection PDF of roughly 1200 words covering bagging vs boosting, hyperparameter optimization, K-Means vs GMM, reflections on the personalized improvement, and an AI-usage statement, combining theory with experiment. All conclusions must be grounded in the tables, figures, and metric evidence in your own notebook. Finally submit the notebook, PDF, CSV, and any necessary supplementary code as required.
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -1,8 +1,8 @@
k,inertia,silhouette_x,log_likelihood,bic,aic,silhouette_y
2,1092962.434364126,0.174016661115075,181335.84491703784,-359250.54291550705,-362061.6898340757,0.41420390111182703
3,1018586.5047121042,0.17317021187208304,554291.2303605897,-1103445.131905755,-1107666.4607211794,0.2977020104302583
4,953249.4382030136,0.18080059886795355,972834.1094461675,-1938814.7081800548,-1944446.218892335,0.3964327255424141
5,889284.892342685,0.1964251564081267,1002913.0930748597,-1997256.4935405836,-2004298.1861497194,0.40146893512413845
6,818950.9117652641,0.17683056672008368,1180025.734163945,-2349765.5938218986,-2358217.46832789,0.24683353848428613
7,777658.2185885893,0.197056012688701,1203191.531501821,-2394381.006600795,-2404243.063003642,0.3109553553475885
8,691940.8330833976,0.20149802939267383,1261969.3739466753,-2510220.5095936474,-2521492.7478933507,0.17264064800570944
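The `bic`/`aic` columns above are consistent with the standard information-criterion definitions computed from the `log_likelihood` column. A minimal sketch of those definitions (the parameter count and sample size below are illustrative, not taken from the CSV):

```python
import math

def information_criteria(log_likelihood, n_params, n_samples):
    """AIC = 2p - 2*lnL;  BIC = p*ln(n) - 2*lnL (lower is better)."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_samples) - 2 * log_likelihood
    return aic, bic

aic, bic = information_criteria(log_likelihood=100.0, n_params=10, n_samples=1000)
print(aic, bic)  # -180.0 and ~-130.9; BIC penalizes parameters harder once n > e^2
```

Because a full-covariance GMM adds a fixed number of parameters per extra component, both criteria keep dropping with k here, which is why the silhouette columns are the more useful model-selection signal in this table.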
@@ -1,2 +1,2 @@
model,train_accuracy,val_accuracy,train_f1_macro,val_f1_macro,val_f1_High,val_f1_Low,val_f1_Standard
Baseline_LR,0.7595294117647059,0.7337904761904762,0.7493991157707756,0.7234383324236036,0.7663239074550129,0.6487372909150542,0.7552537989007436
@@ -1,7 +1,7 @@
model,train_accuracy,val_accuracy,train_f1_macro,val_f1_macro,val_f1_High,val_f1_Low,val_f1_Standard,train_time
Baseline_LR,0.7593680672268908,0.7341714285714286,0.7492574544185482,0.7237629331592531,0.7665209565440987,0.6489501312335958,0.7558177117000646,
RandomForest,1.0,0.7877333333333333,1.0,0.770789728543472,0.7874554916461244,0.7095334685598377,0.8153802254244543,57.91048526763916
XGBoost,0.8519529411764706,0.8371047619047619,0.8297116592669606,0.8143842728003406,0.8904623073719283,0.6944039941751612,0.8582865168539325,67.63970804214478
XGBoost_Tuned,0.9767663865546219,0.8700190476190476,0.9739400525375727,0.8519502714571496,0.9084439578486383,0.7620280474649407,0.8853788090578697,142.65462470054626
XGB_CatA_MissingHandling,0.9772638655462185,0.870552380952381,0.9745439553742655,0.8529411889528661,0.910207423580786,0.763542562338779,0.885073580939033,
Ensemble_SoftVoting,0.9972436974789916,0.8675047619047619,0.9969472283391928,0.851001101708816,0.9024125779343996,0.7684120902511707,0.8821786369408776,
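The `val_f1_macro` column is the unweighted mean of the three per-class validation F1 scores, which is easy to check against the Baseline_LR row:

```python
def macro_f1(per_class_f1):
    """Macro F1: the unweighted mean of per-class F1 scores."""
    scores = list(per_class_f1)
    return sum(scores) / len(scores)

# Baseline_LR per-class validation F1 (High, Low, Standard) from the table above
baseline = [0.7665209565440987, 0.6489501312335958, 0.7558177117000646]
print(macro_f1(baseline))  # matches the table's val_f1_macro ≈ 0.7237629331592531
```

Note also that RandomForest's perfect train accuracy against 0.788 validation accuracy is the overfitting signature the assignment's reflection section is expected to discuss.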
@@ -26,6 +26,9 @@
| ✅ Environment preprocessing | Grayscale + resize (84×84) + frame-stack (4 frames) wrappers | [src/utils.py](src/utils.py) |
| ✅ Evaluation script | Rendered test + multi-episode mean-score evaluation | [src/evaluate.py](src/evaluate.py) |
| ✅ Training entry point | Main training loop, TensorBoard logging, model saving | [train.py](train.py) |
| ✅ Parallel training | Multi-environment rollout collection + WSL support | [train_parallel.py](train_parallel.py) |
| ✅ WSL scripts | Environment setup + launch scripts | [setup_wsl.sh](setup_wsl.sh), [run_wsl.sh](run_wsl.sh), [start_wsl_training.bat](start_wsl_training.bat) |
| ✅ Test script | Quick sanity check of the parallel environments and networks | [test_parallel.py](test_parallel.py) |
**Core algorithm implementation notes**

- Policy network: 3-layer CNN + FC(512) → μ, σ (Gaussian policy, tanh activation)
@@ -60,36 +63,54 @@
│   ├── trainer.py           # PPO update logic
│   ├── utils.py             # environment preprocessing wrappers
│   └── evaluate.py          # evaluation script
├── train.py                 # single-process training entry point
├── train_parallel.py        # multi-environment parallel training (recommended)
├── setup_wsl.sh             # WSL environment setup
├── run_wsl.sh               # WSL training launch script
├── start_wsl_training.bat   # one-click WSL training launcher for Windows
├── test_parallel.py         # parallel-training tests
├── requirements.txt
├── README.md
├── WSL_README.md            # WSL training guide
└── TASK_PROGRESS.md         # this document
```
---
## 4. Hyperparameter Configuration

| Parameter | train.py (single-process) | train_parallel.py (parallel) |
|------|-------------------|--------------------------|
| Learning rate | 3e-4 | 3e-4 |
| Gamma | 0.99 | 0.99 |
| GAE lambda | 0.95 | 0.98 |
| Clip epsilon | 0.2 | 0.1 |
| PPO epochs | 4 | 10 |
| Mini-batch size | 64 | 128 |
| Rollout steps | 2048 | 2048 |
| Entropy coefficient | 0.01 | 0.005 |
| Value coefficient | 0.5 | 0.75 |
| Max gradient norm | 0.5 | 0.5 |
| Total steps | 500,000 | 2,000,000 |
| Number of environments | 1 | 4 |
| Estimated duration | ~8 h | ~5 h (4x) |
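The two columns can be captured as plain config dicts; a hypothetical sketch (the key names mirror the PPOTrainer arguments used later in this commit, but the dict layout itself is illustrative):

```python
# Hypothetical presets mirroring the hyperparameter table above.
SINGLE = dict(
    lr=3e-4, gamma=0.99, gae_lambda=0.95, clip_eps=0.2, ppo_epochs=4,
    mini_batch_size=64, rollout_steps=2048, ent_coef=0.01, vf_coef=0.5,
    max_grad_norm=0.5, total_steps=500_000, num_envs=1,
)
PARALLEL = {
    **SINGLE, "gae_lambda": 0.98, "clip_eps": 0.1, "ppo_epochs": 10,
    "mini_batch_size": 128, "ent_coef": 0.005, "vf_coef": 0.75,
    "total_steps": 2_000_000, "num_envs": 4,
}
# The parallel preset tightens the clip range and lowers the entropy bonus,
# trading exploration for update stability across 4 environments.
print(sorted(k for k in SINGLE if SINGLE[k] != PARALLEL[k]))
```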
---
## 5. Next Steps

### Option A: WSL parallel training (recommended)
```bash
# On Windows, double-click start_wsl_training.bat
# Or manually:
wsl
cd "/mnt/d/Code/doing_exercises/programs/外教作业外快/强化学习个人项目报告"
chmod +x setup_wsl.sh run_wsl.sh
./setup_wsl.sh   # first run only
./run_wsl.sh     # start training
```
### Option B: Windows single-process training
```bash
# 1. Install dependencies
uv pip install --system -r requirements.txt
```
@@ -2,4 +2,9 @@ torch
gymnasium[box2d]
numpy
matplotlib
tensorboard
opencv-python-headless
# Optional: install via uv:
# curl -LsSf https://astral.sh/uv/install.sh | sh
# uv pip install -r requirements.txt
@@ -1,4 +1,5 @@
"""Improved training script for CarRacing-v3 PPO with reward shaping."""

import os
import time
import argparse
@@ -12,36 +13,34 @@ import cv2
class RewardShapingWrapper(gym.Wrapper):
    """Add reward shaping for better learning."""

    def __init__(self, env):
        super().__init__(env)
        self.steps_on_track = 0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.steps_on_track = 0
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        shaped_reward = reward
        if info.get("speed", 0) > 0.1:
            shaped_reward += info["speed"] * 0.1
        if not info.get("offtrack", False):
            shaped_reward += 0.1
            self.steps_on_track += 1
        else:
            shaped_reward -= 0.5
            self.steps_on_track = 0
        if info.get("lap_complete", False):
            shaped_reward += 100
        return obs, shaped_reward, terminated, truncated, info
@@ -70,9 +69,7 @@ class FrameStackWrapper(gym.ObservationWrapper):
        self.frames = deque(maxlen=num_stack)
        obs_shape = env.observation_space.shape
        self.observation_space = gym.spaces.Box(
            low=0, high=255, shape=(num_stack, *obs_shape[-2:]), dtype=np.uint8
        )

    def reset(self, **kwargs):
@@ -115,7 +112,7 @@ class Actor(nn.Module):
    def __init__(self, state_shape=(84, 84, 4), action_dim=3):
        super().__init__()
        c, h, w = state_shape[2], state_shape[0], state_shape[1]
        self.conv = nn.Sequential(
            nn.Conv2d(c, 32, kernel_size=8, stride=4),
            nn.LeakyReLU(0.2),
@@ -126,28 +123,28 @@ class Actor(nn.Module):
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.LeakyReLU(0.2),
        )
        out_h = (h - 8) // 4 + 1
        out_h = (out_h - 4) // 2 + 1
        out_h = (out_h - 3) // 1 + 1
        feat_size = 64 * out_h * out_h
        self.fc = nn.Sequential(
            nn.Linear(feat_size, 512),
            nn.LeakyReLU(0.2),
        )
        self.mu_head = nn.Linear(512, action_dim)
        self.log_std_head = nn.Linear(512, action_dim)
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
        nn.init.orthogonal_(self.mu_head.weight, gain=0.01)
        nn.init.orthogonal_(self.log_std_head.weight, gain=0.01)

    def forward(self, x):
        x = x / 255.0
        x = self.conv(x)
@@ -162,7 +159,7 @@ class Critic(nn.Module):
    def __init__(self, state_shape=(84, 84, 4)):
        super().__init__()
        c, h, w = state_shape[2], state_shape[0], state_shape[1]
        self.conv = nn.Sequential(
            nn.Conv2d(c, 32, kernel_size=8, stride=4),
            nn.LeakyReLU(0.2),
@@ -173,24 +170,20 @@ class Critic(nn.Module):
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.LeakyReLU(0.2),
        )
        out_h = (h - 8) // 4 + 1
        out_h = (out_h - 4) // 2 + 1
        out_h = (out_h - 3) // 1 + 1
        feat_size = 64 * out_h * out_h
        self.fc = nn.Sequential(nn.Linear(feat_size, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = x / 255.0
        x = self.conv(x)
@@ -203,14 +196,14 @@ class RolloutBuffer:
        self.buffer_size = buffer_size
        self.ptr = 0
        self.size = 0
        self.states = np.zeros((buffer_size, *state_shape), dtype=np.uint8)
        self.actions = np.zeros((buffer_size, action_dim), dtype=np.float32)
        self.rewards = np.zeros(buffer_size, dtype=np.float32)
        self.dones = np.zeros(buffer_size, dtype=np.bool_)
        self.values = np.zeros(buffer_size, dtype=np.float32)
        self.log_probs = np.zeros(buffer_size, dtype=np.float32)

    def add(self, state, action, reward, done, value, log_prob):
        self.states[self.ptr] = state
        self.actions[self.ptr] = action
@@ -220,34 +213,34 @@ class RolloutBuffer:
        self.log_probs[self.ptr] = log_prob
        self.ptr = (self.ptr + 1) % self.buffer_size
        self.size = min(self.size + 1, self.buffer_size)

    def compute_returns(self, last_value, gamma=0.99, gae_lambda=0.98):
        advantages = np.zeros(self.size, dtype=np.float32)
        last_gae = 0
        for t in reversed(range(self.size)):
            if t == self.size - 1:
                next_value = last_value
            else:
                next_value = self.values[t + 1]
            delta = self.rewards[t] + gamma * next_value * (1 - self.dones[t]) - self.values[t]
            last_gae = delta + gamma * gae_lambda * (1 - self.dones[t]) * last_gae
            advantages[t] = last_gae
        returns = advantages + self.values[: self.size]
        return returns, advantages

    def get(self):
        return (
            self.states[: self.size],
            self.actions[: self.size],
            self.rewards[: self.size],
            self.dones[: self.size],
            self.values[: self.size],
            self.log_probs[: self.size],
        )

    def reset(self):
        self.ptr = 0
        self.size = 0
@@ -282,55 +275,53 @@ class PPOTrainer:
        self.max_grad_norm = max_grad_norm
        self.ppo_epochs = ppo_epochs
        self.mini_batch_size = mini_batch_size
        self.actor_optim = torch.optim.Adam(actor.parameters(), lr=lr, eps=1e-5)
        self.critic_optim = torch.optim.Adam(critic.parameters(), lr=lr, eps=1e-5)
        self.total_updates = 0

    def update(self, last_value):
        states, actions, rewards, dones, values, log_probs_old = self.buffer.get()
        returns, advantages = self.buffer.compute_returns(last_value, self.gamma, self.gae_lambda)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        states_t = torch.from_numpy(states).float().permute(0, 3, 1, 2).to(self.device)
        actions_t = torch.from_numpy(actions).float().to(self.device)
        log_probs_old_t = torch.from_numpy(log_probs_old).float().to(self.device)
        returns_t = torch.from_numpy(returns).float().to(self.device)
        advantages_t = torch.from_numpy(advantages).float().to(self.device)
        dataset = torch.utils.data.TensorDataset(
            states_t, actions_t, log_probs_old_t, returns_t, advantages_t
        )
        loader = torch.utils.data.DataLoader(dataset, batch_size=self.mini_batch_size, shuffle=True)
        total_actor_loss = 0
        total_critic_loss = 0
        total_entropy = 0
        count = 0
        for _ in range(self.ppo_epochs):
            for batch in loader:
                s, a, log_pi_old, ret, adv = batch
                mu, std = self.actor(s)
                dist = torch.distributions.Normal(mu, std)
                log_pi = dist.log_prob(a).sum(dim=-1)
                entropy = dist.entropy().sum(dim=-1)
                ratio = torch.exp(log_pi - log_pi_old)
                surr1 = ratio * adv
                surr2 = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * adv
                actor_loss = -torch.min(surr1, surr2).mean()
                value = self.critic(s)
                critic_loss = nn.MSELoss()(value.squeeze(), ret)
                loss = actor_loss + self.vf_coef * critic_loss - self.ent_coef * entropy.mean()
                self.actor_optim.zero_grad()
                self.critic_optim.zero_grad()
                loss.backward()
@@ -338,18 +329,16 @@ class PPOTrainer:
                nn.utils.clip_grad_norm_(self.critic.parameters(), self.max_grad_norm)
                self.actor_optim.step()
                self.critic_optim.step()
                total_actor_loss += actor_loss.item()
                total_critic_loss += critic_loss.item()
                total_entropy += entropy.mean().item()
                count += 1
        self.total_updates += 1
        avg_actor = total_actor_loss / count
        avg_critic = total_critic_loss / count
        avg_entropy = total_entropy / count
        self.buffer.reset()
        return avg_actor, avg_critic, avg_entropy
@@ -357,10 +346,10 @@ class PPOTrainer:
def collect_rollout(actor, critic, env, buffer, device, rollout_steps):
    obs, _ = env.reset()
    obs = np.transpose(obs, (1, 2, 0))
    for step in range(rollout_steps):
        obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
        with torch.no_grad():
            mu, std = actor(obs_t)
            dist = torch.distributions.Normal(mu, std)
@@ -368,27 +357,27 @@ def collect_rollout(actor, critic, env, buffer, device, rollout_steps):
        action = torch.clamp(action, -1, 1)
        log_prob = dist.log_prob(action).sum(dim=-1)
        value = critic(obs_t).squeeze(0).item()
        action_np = action.squeeze(0).cpu().numpy()
        log_prob_np = log_prob.squeeze(0).cpu().numpy()
        next_obs, reward, terminated, truncated, _ = env.step(action_np)
        done = terminated or truncated
        next_obs_stored = np.transpose(next_obs, (1, 2, 0))
        buffer.add(obs.copy(), action_np, reward, done, value, log_prob_np)
        obs = next_obs_stored
        if done:
            obs, _ = env.reset()
            obs = np.transpose(obs, (1, 2, 0))
    return obs


def train(
    total_steps=2000000,
    rollout_steps=2048,
    eval_interval=10,
@@ -397,22 +386,22 @@ def train_improved(
):
    if device is None:
        device = get_device()
    env = make_env()
    eval_env = make_env()
    state_shape = (84, 84, 4)
    action_dim = 3
    actor = Actor(state_shape=state_shape, action_dim=action_dim).to(device)
    critic = Critic(state_shape=state_shape).to(device)
    buffer = RolloutBuffer(
        buffer_size=rollout_steps,
        state_shape=state_shape,
        action_dim=action_dim,
    )
    trainer = PPOTrainer(
        actor=actor,
        critic=critic,
@@ -428,46 +417,48 @@ def train_improved(
        ppo_epochs=10,
        mini_batch_size=128,
    )
    log_dir = os.path.join("logs", "tensorboard", f"run_improved_{int(time.time())}")
    writer = SummaryWriter(log_dir)
    print(f"Training on {device}")
    print(f"Log directory: {log_dir}")
    print("Improvements: LeakyReLU, BatchNorm, He init, Reward shaping, More epochs")
    episode = 0
    total_timesteps = 0
    episode_rewards = []
    best_eval = -float("inf")
    while total_timesteps < total_steps:
        obs = collect_rollout(actor, critic, env, buffer, device, rollout_steps)
        # Sum the rollout reward before update(): trainer.update() resets the buffer.
        ep_reward = buffer.rewards[: buffer.size].sum()
        with torch.no_grad():
            obs_t = torch.from_numpy(obs).float().unsqueeze(0).permute(0, 3, 1, 2).to(device)
            last_value = critic(obs_t).squeeze(0).item()
        actor_loss, critic_loss, entropy = trainer.update(last_value)
        writer.add_scalar("Loss/Actor", actor_loss, total_timesteps)
        writer.add_scalar("Loss/Critic", critic_loss, total_timesteps)
        writer.add_scalar("Loss/Entropy", entropy, total_timesteps)
        total_timesteps += rollout_steps
        episode += 1
        episode_rewards.append(ep_reward)
        recent_rewards = episode_rewards[-10:] if len(episode_rewards) >= 10 else episode_rewards
        avg_reward = np.mean(recent_rewards)
        writer.add_scalar("Reward/Episode", ep_reward, total_timesteps)
        writer.add_scalar("Reward/AvgLast10", avg_reward, total_timesteps)
        print(
            f"Episode {episode}, steps {total_timesteps}, ep_reward={ep_reward:.1f}, avg_10={avg_reward:.1f}"
        )
        if episode % eval_interval == 0:
            eval_returns = []
            for _ in range(5):
@@ -475,54 +466,69 @@ def train_improved(
                eval_obs = np.transpose(eval_obs, (1, 2, 0))
                eval_reward = 0
                done = False
                while not done:
                    with torch.no_grad():
                        eval_obs_t = (
                            torch.from_numpy(eval_obs)
                            .float()
                            .unsqueeze(0)
                            .permute(0, 3, 1, 2)
                            .to(device)
                        )
                        mu, std = actor(eval_obs_t)
                        action = torch.clamp(mu, -1, 1).squeeze(0).cpu().numpy()
                    eval_obs, reward, terminated, truncated, _ = eval_env.step(action)
                    eval_obs = np.transpose(eval_obs, (1, 2, 0))
                    eval_reward += reward
                    done = terminated or truncated
                eval_returns.append(eval_reward)
            mean_eval = np.mean(eval_returns)
            writer.add_scalar("Eval/MeanReturn", mean_eval, episode)
            print(f"  Eval: mean_return={mean_eval:.2f}")
            if mean_eval > best_eval:
                best_eval = mean_eval
                os.makedirs("models", exist_ok=True)
                torch.save(
                    {
                        "actor": actor.state_dict(),
                        "critic": critic.state_dict(),
                        "episode": episode,
                        "timesteps": total_timesteps,
                        "best_eval": best_eval,
                    },
                    os.path.join("models", "ppo_improved_best.pt"),
                )
                print(f"  New best model saved! eval={best_eval:.2f}")
        if episode % save_interval == 0:
            os.makedirs("models", exist_ok=True)
            torch.save(
                {
                    "actor": actor.state_dict(),
                    "critic": critic.state_dict(),
                    "episode": episode,
                    "timesteps": total_timesteps,
                },
                os.path.join("models", f"ppo_improved_ep{episode}.pt"),
            )
            print(f"  Saved model at episode {episode}")
    os.makedirs("models", exist_ok=True)
    torch.save(
        {
            "actor": actor.state_dict(),
            "critic": critic.state_dict(),
"timesteps": total_timesteps, "episode": episode,
"best_eval": best_eval, "timesteps": total_timesteps,
}, os.path.join("models", "ppo_improved_final.pt")) "best_eval": best_eval,
},
os.path.join("models", "ppo_improved_final.pt"),
)
writer.close() writer.close()
env.close() env.close()
eval_env.close() eval_env.close()
@@ -534,6 +540,6 @@ if __name__ == "__main__":
parser.add_argument("--steps", type=int, default=2000000, help="Total training steps") parser.add_argument("--steps", type=int, default=2000000, help="Total training steps")
parser.add_argument("--rollout", type=int, default=2048, help="Rollout buffer size") parser.add_argument("--rollout", type=int, default=2048, help="Rollout buffer size")
args = parser.parse_args() args = parser.parse_args()
device = get_device() device = get_device()
train_improved(total_steps=args.steps, rollout_steps=args.rollout, device=device) train(total_steps=args.steps, rollout_steps=args.rollout, device=device)
@@ -1,251 +1,251 @@
XJTLU Entrepreneur College (Taicang) Cover Sheet
Module code and Title: DTS307TC Reinforcement Learning
School Title: School of AI and Advanced Computing
Assignment Title: Coursework 1
Submission Deadline: 04/May/2026 23:59
Final Word Count:
If you agree to let the university use your work anonymously for teaching and learning purposes, please type “yes” here.
I certify that I have read and understood the University's Policy for dealing with Plagiarism, Collusion and the Fabrication of Data (available on Learning Mall Online). With reference to this policy I certify that:
• My work does not contain any instances of plagiarism and/or collusion.
• My work does not contain any fabricated data.
By uploading my assignment onto Learning Mall Online, I formally declare that all of the above information is true to the best of my knowledge and belief.
Scoring (For Tutor Use)
Student ID:
Stage of Marking | Marker Code | Learning Outcomes Achieved (A B C) | F/P/M/D (please modify as appropriate) | Final Score
1st Marker (red pen)
Moderation IM (green pen): The original mark has been accepted by the moderator (please circle as appropriate): Y/N. Initials:
Data entry and score calculation have been checked by another tutor (please circle): Y
2nd Marker if needed (green pen)
For Academic Office Use: Possible Academic Infringement (please tick as appropriate)
Date Received | Days Late | Late Penalty | Total Academic Infringement Penalty (A, B, C, D, E, please modify where necessary) _____________________
☐ Category A  ☐ Category B  ☐ Category C  ☐ Category D  ☐ Category E
School of Artificial Intelligence and Advanced Computing
Xi'an Jiaotong-Liverpool University
DTS307TC Reinforcement Learning
Coursework - Individual Report
Due: 04/May/2026 23:59
Weight: 40%
Maximum score: 40 marks
Overview
The purpose of this assignment is to gain experience in Python programming and the design of reinforcement learning algorithms. You are expected to implement an RL algorithm that solves a specific environment and provide an explanation of the algorithm's methodology. You are expected to analyse your results, including challenges and your solutions.
Learning Outcomes Assessed
A: Systematically understand the fundamental concepts and principles of reinforcement learning.
B: Critically analyse real-life problem situations and expertly map them as reinforcement learning tasks.
C: Mastery of Monte Carlo Methods and Temporal Difference Learning.
D: Proficiency in Deep Reinforcement Learning algorithms.
Late policy
5% of the total marks available for the assessment shall be deducted from the assessment mark for each working day after the submission date, up to a maximum of five working days.
Avoid Plagiarism
• Do not submit work from other students.
• Do not share code/work with other students.
• Do not use open-source code as it is or without proper reference.
Risks
• Please read the coursework instructions and requirements carefully. Not following these instructions and requirements may result in a loss of marks.
• The assignment must be submitted via Learning Mall. Only electronic submission is accepted and no hard copy submission.
• All students must download their file and check that it is viewable after submission. Documents may become corrupted during the uploading process (e.g. due to slow internet connections). However, students are responsible for submitting a functional and correct file for assessments.
• Academic Integrity Policy is strictly followed.
Individual Report (40 marks)
The primary objective of this coursework is to familiarize students with the PPO algorithm using basic deep learning libraries, enabling them to improve their capability in transferring mathematical and theoretical knowledge into Python implementation, and further their understanding of the actor-critic algorithm.
Algorithm Overview
Proximal Policy Optimization (PPO) is a state-of-the-art reinforcement learning algorithm that optimizes a stochastic policy in an on-policy manner. To ensure stable training and avoid catastrophic performance collapse, PPO utilizes a clipped surrogate objective to prevent the policy update from stepping too far from the current behavior.
The Environment: CarRacing-v3
We will be using the Car Racing environment from the OpenAI Gymnasium. This environment features a top-down racing track where the agent must learn to navigate through tiles based on pixel inputs. You can find more details about this environment on their website (https://gymnasium.farama.org/environments/box2d/car_racing/).
Here's a code snippet for you to get started:
import gymnasium as gym
env = gym.make("CarRacing-v3", render_mode="rgb_array")
env.reset()
Since CarRacing-v3 is quite computationally expensive for a standard laptop (due to the pixel processing), you might want to consider using a gray-scaling or frame-stacking wrapper to speed up training. Alternatively, you can also use the lab computers, which have GPUs and have all the environment already set up.
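If you prefer not to depend on a particular wrapper API, the gray-scaling and frame-stacking mentioned above can also be done directly on the raw RGB frames. The class below is an illustrative sketch (the luma weights and the `FrameStacker` name are my own choices, not part of the coursework or of Gymnasium):

```python
from collections import deque

import numpy as np

def to_gray(frame):
    # ITU-R 601 luma weights on an HxWx3 uint8 frame -> float32 in [0, 1]
    return (frame @ np.array([0.299, 0.587, 0.114])).astype(np.float32) / 255.0

class FrameStacker:
    """Keeps the k most recent grayscale frames as a (k, H, W) array."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        g = to_gray(frame)
        for _ in range(self.k):
            self.frames.append(g)  # pad the stack with the first frame
        return np.stack(self.frames)

    def step(self, frame):
        self.frames.append(to_gray(frame))
        return np.stack(self.frames)
```

Feeding the (k, H, W) stack to the policy network gives the agent short-term motion information that a single frame cannot provide.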
The PPO Agent
You will implement an RL agent using PPO to play the CarRacing-v3 environment. The agent will use the standard observation and actions provided by the environment. You may edit the environment to speed up your training, but your agent must still perform well in the standard environment (i.e., removing the camera zoom at the beginning is allowed during training, but your agent should still be tested in the original environment). You should record your training and evaluation process using Tensorboard. You should also record important losses and other data for your analysis later.
The Report
Upon completion of your implementation, you are required to submit a comprehensive technical report. The report should document your engineering decisions, the theoretical grounding of your code, and a critical analysis of the agent's performance.
1. Introduction
• Provide a brief overview of Reinforcement Learning in the context of the CarRacing-v3 environment.
• Define the state space (pixels), action space (discrete commands), and the reward structure of the task.
2. Methodology
• Mathematical Foundation: Formulate the PPO objective function. Explain the significance of the clipping parameter and the probability ratio.
• Advantage Estimation: Describe your method for calculating advantages (e.g., standard advantage vs. Generalized Advantage Estimation (GAE)).
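For the Advantage Estimation point above, GAE is commonly written as a backward recursion over one rollout. This is an illustrative sketch; the argument layout (in particular, `values` having length T+1 so the last entry bootstraps the final state) is my own convention:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a rollout of length T.

    values must have length T+1: the trailing entry is the bootstrap
    value of the state after the last transition.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]  # zero out bootstrap at episode ends
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # exponentially weighted sum of future deltas
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```

Setting lam=0 recovers the one-step TD advantage, while lam=1 recovers the Monte Carlo return minus the value baseline, which is the bias-variance trade-off the report should discuss.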
3. Implementation Details
• Describe your implementation, including any challenges faced and how you addressed them.
• Explain the structure of your policy and value networks.
• Detail the training process and hyperparameters used.
4. Results and Analysis
• Present your results (use graphs for better clarity).
• Discuss the performance of your agent and any trends observed.
• Briefly compare your custom implementation's stability and sample efficiency against baseline benchmarks (e.g., Stable-Baselines3).
5. Conclusion
• Summarize your key findings regarding the sensitivity of PPO to hyperparameter tuning and the effectiveness of the actor-critic framework in continuous-input environments.
Note: All figures and plots must be clearly labeled with axes titles and legends. Raw code snippets should be kept to a minimum in the report; focus on high-level logic and pseudo-code where necessary.
Important Note
• Do NOT use Stable-Baselines libraries or any other reinforcement-learning-specific libraries in your implementation (you may use Tensorboard for recording your results).
• Do NOT exceed the word count limit of 3000 words for each report, references and appendix excluded.
• Although you are allowed to use any generative AI tools to assist your work, please keep in mind that you should be using them responsibly. (Good use: improve your report after writing it and always review its output to ensure that it is correct. Bad use: copy-pasting an entire report from AI without any effort of your own.)
Submission Requirements
Please prepare and submit the following documents:
• A cover page featuring your student ID. This page should be the first page of your report.
• A zip file containing all the source codes and your trained agent model, which should be named using your full name and student ID in the following format: CW1_ID_Name.zip
• One PDF file for your report. The file should be separated from the zip file, which contains your code. The files should be named in the following format: CW1_ID_Name.pdf
Note that the quality of the code, the clarity of your writing, and the format/style of your report will be taken into consideration during the evaluation. The detailed rubric is outlined below.
Rubric
CW1 (40 marks) | Criteria | Marks
Code Performance: Code runs without errors and performs tasks as specified. (6)
Code Quality: Code is well-organized, includes meaningful comments, and uses appropriate variable names. (6)
Methodology: Comprehensive coverage of topics with detailed explanations of approaches and methodologies. (6)
Result analysis: Insightful analysis of results. (6)
Report Quality: Report is well-structured, formatted, and free of grammatical errors. (6)
Evidence of Work: All required elements are included and correct. (6)
Submission: Follows all requirements for submission. (4)
@@ -1,260 +1,260 @@
XJTLU Entrepreneur College (Taicang) Cover Sheet
Module code: DTS304TC: Machine Learning
School title: School of AI and Advanced Computing
Assessment title: Coursework Task 1
Assessment type: Coursework
Submission deadline: 01/May/2026 23:59
I certify that I have read and understood the University's Policy for dealing with Plagiarism, Collusion and the Fabrication of Data (available on Learning Mall Online).
My work does not contain any instances of plagiarism and/or collusion.
My work does not contain any fabricated data.
By uploading my assignment onto Learning Mall Online, I formally declare that all of the above information is true to the best of my knowledge and belief.
Scoring (For Tutor Use)
Student ID:
Theory and Reflection PDF Word Count (Filled by Students):
Stage of Marking | Marker Code | Learning Outcomes Achieved (A B C) | F/P/M/D (please modify as appropriate) | Final Score
1st Marker (red pen)
Moderation IM (green pen): The original mark has been accepted by the moderator (please circle as appropriate): Y/N. Initials:
Data entry and score calculation have been checked by another tutor (please circle): Y
2nd Marker if needed (green pen)
For Academic Office Use: Possible Academic Infringement (please tick as appropriate)
Date Received | Days Late | Late Penalty | Total Academic Infringement Penalty (A, B, C, D, E, please modify where necessary) _____________________
☐ Category A  ☐ Category B  ☐ Category C  ☐ Category D  ☐ Category E
DTS304TC Machine Learning
Coursework - Assessment Task 1
• Percentage in final mark: 50%
• Assessment type: individual coursework
• Submission files: one Jupyter notebook (.ipynb), one Coursework Answer Sheet / Theory and Reflection PDF, and one hidden-test CSV
Learning outcomes assessed
• A. Demonstrate a solid understanding of the theoretical issues related to problems that machine-learning methods try to address.
• B. Demonstrate understanding of the properties of existing machine-learning algorithms and how they behave on practical data.
Notes
• Please read the coursework instructions and requirements carefully. Not following these instructions and requirements may result in a loss of marks.
• The formal procedure for submitting coursework at XJTLU is strictly followed. The submission link on Learning Mall will be provided in due course. The submission timestamp on Learning Mall will be used to check late submission.
• 5% of the total marks available for the assessment shall be deducted from the assessment mark for each working day after the submission date, up to a maximum of five working days.
• All modelling work must be completed individually. Discussion of general ideas is allowed, but code, experiments, and notebooks must be independently developed.
• You may not use ChatGPT to directly generate answers for the coursework. High-scoring work must demonstrate your own experimental design, controlled comparisons, failure analysis, and image-level interpretation. ChatGPT or similar tools may be used only in a limited support role such as code understanding, debugging, or grammar support. They must not replace your method design, ablation logic, qualitative analysis, or reflection. Generic AI-produced descriptions without matching evidence in code, tables, figures, and discussion will not receive high marks.
• If you use AI tools or outside code in any meaningful way, you must fully understand, verify, and take ownership of every method, number, figure, and written claim that appears in your submission.
Question 1: Notebook-Based Coding Exercise - Insurance Premium-Risk Classification (60 Question 1: Notebook-Based Coding Exercise - Insurance Premium-Risk Classification (60
Marks) Marks)
In this coursework you will build and improve a multiclass classifier for a fictionalised health-insurance dataset. The task is In this coursework you will build and improve a multiclass classifier for a fictionalised health-insurance dataset. The task is
to predict whether each applicant belongs to a Low, Standard, or High premium-risk group before pricing a policy. The to predict whether each applicant belongs to a Low, Standard, or High premium-risk group before pricing a policy. The
dataset is intentionally realistic: it mixes numerical and categorical variables, contains missing values and dirty entries, and dataset is intentionally realistic: it mixes numerical and categorical variables, contains missing values and dirty entries, and
includes some fields that require careful handling to avoid weak modelling practice or label leakage. includes some fields that require careful handling to avoid weak modelling practice or label leakage.
Your work should show a clear machine-learning workflow: build a sensible first pipeline, compare model families, apply Your work should show a clear machine-learning workflow: build a sensible first pipeline, compare model families, apply
stronger hyperparameter optimisation, complete one compulsory improvement category plus at least one optional category, stronger hyperparameter optimisation, complete one compulsory improvement category plus at least one optional category,
carry out a compact K-Means/Gaussian Mixture Model (GMM) exploration, and then produce a hidden-test CSV using carry out a compact K-Means/Gaussian Mixture Model (GMM) exploration, and then produce a hidden-test CSV using
validation evidence only. validation evidence only.
The prediction target variable is premium_risk, and it has 3 imbalanced classes: Standard, High, Low. The dataset The prediction target variable is premium_risk, and it has 3 imbalanced classes: Standard, High, Low. The dataset
contains 33 raw columns: admin/PII columns, synthetic noise features, 1 leakage feature, and genuine predictors. contains 33 raw columns: admin/PII columns, synthetic noise features, 1 leakage feature, and genuine predictors.
Unless otherwise stated, macro-F1 is the primary validation metric because the dataset is imbalanced; accuracy is reported Unless otherwise stated, macro-F1 is the primary validation metric because the dataset is imbalanced; accuracy is reported
as a secondary metric. as a secondary metric.
(A) Clean First Pipeline and Baseline Modelling (8 marks) (A) Clean First Pipeline and Baseline Modelling (8 marks)
• Load the provided training and validation files and define a consistent target / feature setup. • Load the provided training and validation files and define a consistent target / feature setup.
• Handle leakage features, dirty values, missing values, and categorical variables sensibly. A compact sanity check is enough; a • Handle leakage features, dirty values, missing values, and categorical variables sensibly. A compact sanity check is enough; a
long data-audit section is not required. long data-audit section is not required.
Important: The dataset contains a leakage feature. You must identify and remove it before proceeding to the next stage Important: The dataset contains a leakage feature. You must identify and remove it before proceeding to the next stage
of analysis; otherwise, the classification results will be severely biased by this leakage and will not be meaningful. If of analysis; otherwise, the classification results will be severely biased by this leakage and will not be meaningful. If
this occurs, multiple parts of your Coursework 1 may be affected, which could significantly impact your marks. this occurs, multiple parts of your Coursework 1 may be affected, which could significantly impact your marks.
• Build one baseline modelling pipeline. • Build one baseline modelling pipeline.
• Report at least one validation result using accuracy and macro-F1 score and include a confusion matrix for the baseline model. • Report at least one validation result using accuracy and macro-F1 score and include a confusion matrix for the baseline model.
• Keep preprocessing consistent across train, validation, and hidden-test files. • Keep preprocessing consistent across train, validation, and hidden-test files.
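The clean-first pipeline and baseline described above can be sketched as follows. This is a minimal illustration, not the required solution: the data here is synthetic, the column names (`annual_income`, `claims_last_year`, `region`) and the leakage column name are placeholders for whatever you find in the real files, and on the coursework you would fit on the training file and evaluate on the provided validation file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real train CSV; column names are illustrative only.
rng = np.random.default_rng(0)
n = 400
train = pd.DataFrame({
    "annual_income": rng.normal(50_000, 15_000, n),
    "claims_last_year": rng.poisson(1.0, n).astype(float),
    "region": rng.choice(["north", "south", "east"], n),
    "premium_risk": rng.choice(["Standard", "High", "Low"], n, p=[0.6, 0.25, 0.15]),
})
train.loc[rng.choice(n, 20, replace=False), "claims_last_year"] = np.nan  # missing values

# Drop IDs and the leakage feature you identified (name here is hypothetical).
LEAKY = ["settled_premium_band"]
drop_cols = [c for c in ["applicant_id", "customer_key", *LEAKY] if c in train.columns]
X = train.drop(columns=["premium_risk", *drop_cols])
y = train["premium_risk"]

num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()

# One preprocessing object reused for train, validation, and hidden test
# keeps the treatment consistent across all three files.
pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

baseline = Pipeline([("pre", pre),
                     ("clf", LogisticRegression(max_iter=1000, class_weight="balanced"))])
baseline.fit(X, y)

pred = baseline.predict(X)  # on the real data, predict on the validation split instead
acc = accuracy_score(y, pred)
macro_f1 = f1_score(y, pred, average="macro")
cm = confusion_matrix(y, pred, labels=["Standard", "High", "Low"])
print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")
```

Wrapping preprocessing and model in one `Pipeline` object means the identical transformations are applied wherever `.predict` is called, which is what "keep preprocessing consistent" amounts to in practice.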
(B) Controlled Comparison: Random Forest and One Boosting Model (8 marks) (B) Controlled Comparison: Random Forest and One Boosting Model (8 marks)
• Using the same preprocessing pipeline, validation split, and evaluation metric (the primary metric is macro-F1; also report accuracy),
carry out an initial controlled comparison between one Random Forest model and one boosting model. carry out an initial controlled comparison between one Random Forest model and one boosting model.
• Default XGBoost is recommended because it provides a richer tuning space later, but others may also be used. Default settings • Default XGBoost is recommended because it provides a richer tuning space later, but others may also be used. Default settings
or only light sensible adjustments are acceptable in this section. or only light sensible adjustments are acceptable in this section.
• In the notebook, report the validation result of each model and support the comparison with one or two additional analyses, such • In the notebook, report the validation result of each model and support the comparison with one or two additional analyses, such
as class-wise metrics, a confusion matrix, train-versus-validation behaviour, or stability / sensitivity after tuning. as class-wise metrics, a confusion matrix, train-versus-validation behaviour, or stability / sensitivity after tuning.
• Your goal is not to prove that one model type always wins. Your goal is to compare the two models fairly, explain the high-level • Your goal is not to prove that one model type always wins. Your goal is to compare the two models fairly, explain the high-level
learning difference between bagging and boosting, and use your own notebook evidence to give a careful, dataset-specific learning difference between bagging and boosting, and use your own notebook evidence to give a careful, dataset-specific
interpretation. A generic textbook answer without reference to your own results will receive limited credit. interpretation. A generic textbook answer without reference to your own results will receive limited credit.
(C) Advanced Hyperparameter Optimisation (12 marks) (C) Advanced Hyperparameter Optimisation (12 marks)
• At least one main model should be tuned with a genuinely advanced strategy such as Optuna/TPE, Bayesian optimisation, • At least one main model should be tuned with a genuinely advanced strategy such as Optuna/TPE, Bayesian optimisation,
Hyperopt, Ray Tune, or another comparably strong approach. Hyperopt, Ray Tune, or another comparably strong approach.
• Hyperparameter tuning should optimise macro-F1 score on the validation set, and the final tuned result should be reported • Hyperparameter tuning should optimise macro-F1 score on the validation set, and the final tuned result should be reported
using both accuracy and macro-F1. using both accuracy and macro-F1.
• RandomizedSearchCV alone is normally not enough for the top band. • RandomizedSearchCV alone is normally not enough for the top band.
• Explain briefly why your search space and optimiser are reasonable for the chosen model. • Explain briefly why your search space and optimiser are reasonable for the chosen model.
(D) Personalised Improvement Work (18 marks) (D) Personalised Improvement Work (18 marks)
You must complete one compulsory category based on the last digit of your XJTLU student ID, plus at least one additional You must complete one compulsory category based on the last digit of your XJTLU student ID, plus at least one additional
optional category of your choice. A second optional category is recommended for stronger differentiation but is not compulsory. optional category of your choice. A second optional category is recommended for stronger differentiation but is not compulsory.
You should report accuracy and macro-F1 for improved models and include class-wise metrics where helpful. A compact ablation table should normally be included in the notebook for the personalised improvement work.
Last digit Compulsory category Last digit Compulsory category
0-1 Category A - Data quality and missingness 0-1 Category A - Data quality and missingness
2-3 Category B - Feature representation and engineering 2-3 Category B - Feature representation and engineering
4-5 Category C - Imbalance and objective design 4-5 Category C - Imbalance and objective design
6-7 Category D - Model robustness, calibration, or ensembling 6-7 Category D - Model robustness, calibration, or ensembling
8-9 Category E - Fairness, diagnostics, or interpretability 8-9 Category E - Fairness, diagnostics, or interpretability
Category | Examples of what may be done | What good evidence looks like
A | Better missing-value strategy; MissForest or iterative imputation; sensible outlier handling; value cleaning | A concise before/after comparison with a short explanation of why the data handling changed the result
B | Feature crosses; grouped categories; alternative encodings; modest feature selection; transformations | A compact ablation showing what representation changed and whether it helped
C | Class weighting; focal-style loss if relevant; sampling / resampling; thresholding logic | Clear evidence of how minority or harder classes changed, even if overall score moved only slightly
D | Bagging/boosting variants; calibration checks; soft voting; stacking; robustness checks | A meaningful diagnostic or comparison rather than a large collection of loosely connected trials
E | SHAP / feature importance; subgroup-style fairness checks; error analysis; model interpretation | Concrete insight into model behaviour, not only screenshots
(E) K-Means and Gaussian Mixture Model (GMM) Exploration (6 marks) (E) K-Means and Gaussian Mixture Model (GMM) Exploration (6 marks)
This is a compact exploratory section. It is not the main performance section, and it does not require clusters to match the class This is a compact exploratory section. It is not the main performance section, and it does not require clusters to match the class
labels exactly. The aim is to show your understanding of unsupervised learning methods and your ability to interpret their results labels exactly. The aim is to show your understanding of unsupervised learning methods and your ability to interpret their results
carefully. carefully.
• Use a sensible processed numeric feature space and briefly explain what you clustered on. • Use a sensible processed numeric feature space and briefly explain what you clustered on.
• Explore a small range of cluster/component numbers, such as 2-8. • Explore a small range of cluster/component numbers, such as 2-8.
• For K-Means, provide sensible supporting evidence, such as inertia (SSE), cluster sizes, or another simple analysis.
• For Gaussian Mixture Model (GMM), provide sensible supporting evidence, such as component sizes, posterior • For Gaussian Mixture Model (GMM), provide sensible supporting evidence, such as component sizes, posterior
confidence/responsibility, or overlap/uncertainty between components. confidence/responsibility, or overlap/uncertainty between components.
• Include at least one compact table or figure comparing K-Means and GMM. • Include at least one compact table or figure comparing K-Means and GMM.
• If class labels are used for reference, explain clearly that unsupervised structure does not need to align exactly with supervised labels.
• Stronger work may additionally use silhouette score, log-likelihood trends, or a simple visualization. • Stronger work may additionally use silhouette score, log-likelihood trends, or a simple visualization.
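The exploration described above can be organised as one small loop over k = 2..8 that collects K-Means and GMM evidence side by side, ready to print as the required compact comparison table. The blob data below is a toy stand-in; on the coursework, cluster on your processed numeric feature space (scaled, with ID and leakage columns removed).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Toy numeric feature space standing in for the processed coursework features.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.5, random_state=0)
X = StandardScaler().fit_transform(X)

rows = []
for k in range(2, 9):  # the 2-8 range suggested in the brief
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    resp = gmm.predict_proba(X)  # soft assignments (responsibilities)
    rows.append({
        "k": k,
        "kmeans_inertia": km.inertia_,                   # SSE, for an elbow check
        "gmm_bic": gmm.bic(X),                           # lower is better
        "gmm_mean_confidence": resp.max(axis=1).mean(),  # posterior certainty
    })

for r in rows:
    print(r)
```

The mean maximum responsibility is one simple way to quantify the hard-versus-soft assignment contrast: K-Means commits every point fully, while a GMM confidence well below 1.0 signals overlap between components.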
(F) Final Model Choice and Hidden-Test Export (8 marks) (F) Final Model Choice and Hidden-Test Export (8 marks)
• Choose the final model using validation evidence only. • Choose the final model using validation evidence only.
• Retrain appropriately using both the train and validation datasets and generate the hidden-test CSV in the required format.
• Submit the hidden-test results as test_result_[your_student_id].csv. The first column must contain applicant_id, the second • Submit the hidden-test results as test_result_[your_student_id].csv. The first column must contain applicant_id, the second
column must contain customer_key, and the third column must contain the predicted premium_risk labels (Standard, High, column must contain customer_key, and the third column must contain the predicted premium_risk labels (Standard, High,
Low). Low).
Incorrect file naming or CSV formatting may prevent automated scoring and will result in an automatic deduction of 4 marks Incorrect file naming or CSV formatting may prevent automated scoring and will result in an automatic deduction of 4 marks
from this section. from this section.
• Do not tune on the hidden test and do not claim hidden test performance. • Do not tune on the hidden test and do not claim hidden test performance.
• Note: Hidden test score contributes only a small portion of the final marks. High leaderboard rank alone cannot compensate for • Note: Hidden test score contributes only a small portion of the final marks. High leaderboard rank alone cannot compensate for
weak experimental design or poor documentation. weak experimental design or poor documentation.
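Since incorrect naming or formatting costs 4 marks, it is worth writing the export with an explicit self-check, as in this sketch. The student ID, the ID/key values, and the predictions are placeholders; on the coursework the predictions come from the final model retrained on train plus validation.

```python
import pandas as pd

STUDENT_ID = "1234567"  # placeholder; substitute your own XJTLU student ID

# Hypothetical hidden-test identifiers and predictions for illustration only.
hidden = pd.DataFrame({"applicant_id": [101, 102, 103],
                       "customer_key": ["K-9", "K-10", "K-11"]})
hidden["premium_risk"] = ["Standard", "High", "Low"]

# Column order is mandated: applicant_id, customer_key, premium_risk.
out = hidden[["applicant_id", "customer_key", "premium_risk"]]
out_path = f"test_result_{STUDENT_ID}.csv"
out.to_csv(out_path, index=False)

# Re-read and verify the format before submitting.
check = pd.read_csv(out_path)
assert list(check.columns) == ["applicant_id", "customer_key", "premium_risk"]
assert set(check["premium_risk"]) <= {"Standard", "High", "Low"}
print("wrote", out_path, "with", len(check), "rows")
```

Re-reading the written file and asserting on its columns and label values catches the two failure modes the brief penalises (wrong columns, wrong labels) before upload.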
Coursework Answer Sheet / Theory and Reflection (PDF) - all questions below are compulsory Coursework Answer Sheet / Theory and Reflection (PDF) - all questions below are compulsory
(30 Marks) (30 Marks)
The Coursework Answer Sheet / Theory and Reflection PDF should not repeat the notebook section by section. All prompt areas The Coursework Answer Sheet / Theory and Reflection PDF should not repeat the notebook section by section. All prompt areas
below are compulsory. The PDF must be concise, directly linked to your own notebook evidence, and no longer than 4 pages / below are compulsory. The PDF must be concise, directly linked to your own notebook evidence, and no longer than 4 pages /
1,200 words in total. Exceeding either limit will incur a fixed deduction of 5 marks from the PDF section. You should aim to 1,200 words in total. Exceeding either limit will incur a fixed deduction of 5 marks from the PDF section. You should aim to
demonstrate both your theoretical or algorithmic understanding and your experimental findings or practical observations and demonstrate both your theoretical or algorithmic understanding and your experimental findings or practical observations and
clearly link your understanding of the algorithms to your experimental analysis. At least one table, figure, or metric from the clearly link your understanding of the algorithms to your experimental analysis. At least one table, figure, or metric from the
notebook must be referenced in each theory answer. notebook must be referenced in each theory answer.
Prompt area: What you should do

1. Bagging versus boosting
(1) Briefly state the definitions and key theoretical properties of bagging and boosting models; (2) report the validation results of each model; (3) support your comparison with one or two additional analyses, such as class-wise metrics, a confusion matrix, train-versus-validation behaviour, or stability/sensitivity after tuning; and (4) provide a careful interpretation of what this comparison suggests about this dataset and how it relates to the theoretical properties of bagging versus boosting methods. You are not expected to prove that one model type always performs better.

2. Hyperparameter optimisation
Explain why your optimiser and search space were reasonable for the chosen model, which hyperparameters you expected to matter most, whether the tuned results matched that intuition, and what you learned from the tuning process.

3. K-Means versus Gaussian Mixture Model (GMM)
Explain hard versus soft assignment and the main assumption difference between K-Means and GMM. Then use your own compact evidence to discuss whether the results matched your intuition and whether GMM revealed anything extra, such as soft membership, uncertainty, or a better fit to partial cluster structure.

4. Personalised reflection
Reflect on the compulsory category and on every optional category you implemented. Highlight any unique or interesting algorithm or strategy you tried, the personal challenges you faced, the effort you made to address them, and the key lessons you learned. Honest reflection on a neutral or negative result is acceptable if the reasoning is concrete.

5. AI-use declaration
State briefly what forms of AI assistance, if any, were used. Generic AI-written theory that does not match your notebook evidence will receive limited credit.
Coding Quality, Coursework Answer Sheet Quality, and Submission Guidelines (10 marks) Coding Quality, Coursework Answer Sheet Quality, and Submission Guidelines (10 marks)
• Submit your Jupyter Notebook in .ipynb format. It must be well organised, include clear commentary and clean code practices, • Submit your Jupyter Notebook in .ipynb format. It must be well organised, include clear commentary and clean code practices,
and show visible outputs. Do not write a second mini-report repeating notebook content. and show visible outputs. Do not write a second mini-report repeating notebook content.
• The notebook should be reproducible from start to finish without errors. Results cited in the PDF should be visible in the • The notebook should be reproducible from start to finish without errors. Results cited in the PDF should be visible in the
notebook and should match the reported values. notebook and should match the reported values.
• If you used supplementary code outside the notebook, submit that code as well so the full workflow remains reproducible. • If you used supplementary code outside the notebook, submit that code as well so the full workflow remains reproducible.
• Submit the hidden-test results as test_result_[your_student_id].csv. The first column must contain applicant_id, the second • Submit the hidden-test results as test_result_[your_student_id].csv. The first column must contain applicant_id, the second
column must contain customer_key, and the third column must contain the predicted premium_risk labels (Standard, High, column must contain customer_key, and the third column must contain the predicted premium_risk labels (Standard, High,
Low). Incorrect file naming or CSV formatting may prevent automated scoring and will result in an automatic deduction of 4 Low). Incorrect file naming or CSV formatting may prevent automated scoring and will result in an automatic deduction of 4
marks from this section. marks from this section.
• Submit the Coursework Answer Sheet / Theory and Reflection in PDF format. All questions in that section are compulsory. The • Submit the Coursework Answer Sheet / Theory and Reflection in PDF format. All questions in that section are compulsory. The
Coursework Answer Sheet / Theory and Reflection PDF must answer every required prompt, refer to your own notebook Coursework Answer Sheet / Theory and Reflection PDF must answer every required prompt, refer to your own notebook
evidence, and remain within 4 pages and 1,200 words in total. Exceeding either limit will incur a fixed deduction of 5 marks from evidence, and remain within 4 pages and 1,200 words in total. Exceeding either limit will incur a fixed deduction of 5 marks from
the PDF section. the PDF section.
• Include all required components: Jupyter notebooks (code), any additional experimental scripts or custom code, the hidden • Include all required components: Jupyter notebooks (code), any additional experimental scripts or custom code, the hidden
test-results CSV file, and the Coursework Answer Sheet PDF. Submit all files through the Learning Mall platform. After test-results CSV file, and the Coursework Answer Sheet PDF. Submit all files through the Learning Mall platform. After
submission, download your files to verify that they can be opened and viewed correctly to ensure the submission was submission, download your files to verify that they can be opened and viewed correctly to ensure the submission was
successful. successful.
Project Material Access Instructions Project Material Access Instructions
To access the complete set of materials for this project, please use the links below: To access the complete set of materials for this project, please use the links below:
• OneDrive Link: • OneDrive Link:
https://1drv.ms/f/c/18f09d1a39585f84/IgCXDMbXkFYSSZUZkkTyXyZzAQ1poX9mujUqF8N3JlL0GD0?e=uNhAHq https://1drv.ms/f/c/18f09d1a39585f84/IgCXDMbXkFYSSZUZkkTyXyZzAQ1poX9mujUqF8N3JlL0GD0?e=uNhAHq
• The same coursework materials have also been uploaded to Learning Mall. • The same coursework materials have also been uploaded to Learning Mall.
When extracting the materials, use the following password to unlock the zip file: DTS304TC (case-sensitive, enter in When extracting the materials, use the following password to unlock the zip file: DTS304TC (case-sensitive, enter in
uppercase). uppercase).