Add lecture materials for Model-Free, Control, and Value topics

- Added Lecture4 - ModelFree.pdf (3013 KB)
- Added Lecture5 - Control.pdf (2575 KB)
- Added Lecture6 - Value.pdf (3320 KB)
2026-04-28 20:28:00 +08:00
commit ceddbdd559
52 changed files with 117740 additions and 0 deletions
@@ -0,0 +1,22 @@
# OS files
Thumbs.db
.DS_Store
ehthumbs.db
# Editor configs
.vscode/
.trae/
.idea/
*.swp
*.swo
*~
# Python (if used)
__pycache__/
*.py[cod]
.env/
# Build output (if using C/C++)
*.o
*.exe
*.out
@@ -0,0 +1,5 @@
Complete an individual Reinforcement Learning coursework report: implement a PPO (Proximal Policy Optimization) algorithm from scratch in Python so that an agent completes the racing task in the CarRacing-v3 environment, and submit a technical report of no more than 3,000 words systematically describing your method and results. Specifically, the report must introduce the RL background of the task; define the state space, action space, and reward mechanism; explain PPO's objective function, clipping mechanism, and advantage estimation; describe the policy and value network architectures, training procedure, hyperparameter settings, and the problems encountered during implementation and how they were solved; present training and test results with figures, analysing model performance and trends; and briefly compare against baselines such as Stable-Baselines3 in terms of stability and sample efficiency. In addition, you must submit a zip file containing all source code and the trained model, plus a standalone PDF report. File naming and submission format must follow the requirements, and the implementation must not directly use RL-specific libraries such as Stable-Baselines, although TensorBoard may reasonably be used to log experimental results.
This PDF requires an individual Reinforcement Learning project report: choose an Atari game, implement and train a deep reinforcement learning algorithm of your choice to achieve competitive performance, then submit a technical report of no more than 3,000 words and a zip file containing all source code and the trained model. The report must describe the chosen game and its challenges; survey the state of deep RL, especially as applied to Atari games; compare the algorithms considered and justify the final choice; detail the algorithm's principles and implementation; evaluate the agent's performance, stating the chosen benchmarks and evaluation metrics; analyse why the algorithm performs well or poorly on this game; and present results with figures that have clearly labelled axes and legends. The assignment explicitly forbids implementing the algorithm directly with RL-specific libraries such as Stable-Baselines, although they may be used for benchmarking. Code quality, result analysis, report structure, use of figures, and citation style are all graded, and the PDF and zip must be named and submitted in the specified format.
Complete an individual Machine Learning coursework: build and improve a multi-class model that predicts an applicant's premium risk level (Low / Standard / High) on a health insurance dataset. You must first complete the Jupyter Notebook part, including data cleaning and preprocessing; identifying and removing the data-leakage feature; building a baseline model; comparing Random Forest against one boosting model; tuning with an advanced hyperparameter optimisation method; completing the personalised improvement assigned by the last digit of your student ID plus at least one optional improvement; running an unsupervised exploration with K-Means and GMM; and finally selecting the final model based on validation results and exporting a hidden-test CSV in the required format. You must also submit a Theory and Reflection PDF of no more than about 1,200 words covering bagging vs boosting, hyperparameter optimisation, K-Means vs GMM, reflection on the personalised improvement, and an AI-use declaration, combining theory with your experiments. All conclusions must be tied to the tables, figures, and metrics in your own notebook. Submit the notebook, PDF, CSV, and any necessary supplementary code as required.
Binary file not shown.


Binary file not shown.


Binary file not shown.


@@ -0,0 +1,170 @@
# Coursework Implementation Plan Analysis
## 📚 Overview of the Three Assignments
---
### 🔴 Task 1: Individual RL Project Report (Atari game track)
**Core task**
- Pick an Atari game, implement and train a deep reinforcement learning algorithm from scratch, and reach competitive performance
- Submit a technical report of no more than 3,000 words + a zip file with source code and the trained model
**Recommended games**: Space Invaders or Breakout (relatively simple, well-benchmarked)
**Tech stack**
- PyTorch deep learning framework
- Gymnasium (ALE Atari environments)
- NumPy for data handling
**Algorithm candidates**
| Algorithm | Pros | Cons |
|------|------|------|
| DQN | Classic and stable, relatively simple to implement | Slow training, sample-hungry |
| Double DQN | Fixes Q-value overestimation | Slightly more complex |
| Dueling DQN | Faster convergence | Moderate implementation complexity |
**Implementation steps**
1. Environment setup + preprocessing (frame stacking, grayscale, resize)
2. Implement the replay buffer
3. Implement the Q-network (CNN architecture)
4. Implement the DQN training loop
5. Hyperparameter tuning + training monitoring
6. Compare against a Stable-Baselines3 baseline
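Step 2 above can be sketched in a few lines; this is a minimal, framework-free replay buffer, with an illustrative capacity and integer "states" standing in for preprocessed frames.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch; zip(*...) regroups tuples into per-field sequences
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Toy usage with integer "states" in place of stacked frames
buf = ReplayBuffer(capacity=1000)
for i in range(50):
    buf.push(i, 0, 1.0, i + 1, False)
states, actions, rewards, next_states, dones = buf.sample(8)
print(len(buf), len(states))  # → 50 8
```

In a real DQN loop the sampled fields would be converted to tensors before the TD-target update.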
**Report requirements**
1. Describe the chosen game and its challenges
2. Survey applications of deep RL to Atari games
3. Compare algorithms and justify the final choice
4. Detail the algorithm's principles and implementation
5. Evaluate agent performance
6. Analyse why the algorithm performs well or poorly on this game
7. Present experimental results with figures
**Estimated workload**: ⭐⭐⭐⭐⭐ (highest, roughly 40% of total effort)
---
### 🟡 Task 2: PPO + CarRacing-v3
**Core task**
- Implement PPO (Proximal Policy Optimization) from scratch in Python
- Train an agent to complete the racing task in the CarRacing-v3 environment
- Submit a technical report of no more than 3,000 words + a zip file with source code and the model
**Tech stack**
- PyTorch
- Gymnasium (CarRacing-v3 environment)
- TensorBoard for training visualisation
**PPO core components**
```
1. Policy network: CNN → fully connected layers → μ, σ (continuous action output)
2. Value network: CNN → fully connected layers → V(s)
3. GAE (Generalized Advantage Estimation) for advantages
4. PPO-Clip objective: L^CLIP(θ) = E[min(r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A)]
5. Entropy regularisation + value function loss
```
**Implementation details**
- State space: 96×96×3 RGB images
- Action space: continuous (steering, throttle, brake)
- Reward shaping: centre-line reward + speed reward + tyre-wear penalty
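The GAE and clipped-objective components above can be sketched in NumPy; the rollout values, probability ratios, and ε here are illustrative, and in a real implementation the ratio would come from the policy network's log-probabilities.

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.
    `values` has length len(rewards) + 1 (bootstrap value appended at the end)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv

def ppo_clip_objective(ratio, adv, eps=0.2):
    """L^CLIP = E[min(r·A, clip(r, 1-eps, 1+eps)·A)] — to be maximised."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))

# Toy 3-step rollout (invented numbers)
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.3, 0.0])   # V(s_0..s_2) plus bootstrap V(s_3)
dones = np.array([0.0, 0.0, 1.0])
adv = gae(rewards, values, dones)
ratio = np.array([1.5, 0.9, 1.0])          # π_new / π_old per step
print(round(ppo_clip_objective(ratio, adv), 4))
```

Note how the `min` keeps the objective from rewarding ratios outside the clip range: a ratio of 1.5 contributes only as if it were 1.2.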
**Report requirements**
1. Introduce the RL background of the task
2. Define the state space, action space, and reward mechanism
3. Explain PPO's objective function, clipping mechanism, and advantage estimation
4. Describe network architectures, training procedure, and hyperparameter settings
5. Record the problems encountered during implementation and how they were solved
6. Present training and test results with figures
7. Briefly compare against the Stable-Baselines3 baseline
**Estimated workload**: ⭐⭐⭐⭐ (high, roughly 35% of total effort)
---
### 🟢 Task 3: Health Insurance Multi-Class Classification (Machine Learning)
**Core task**
- Build and improve a multi-class model on a health insurance dataset
- Predict applicants' premium risk level (Low / Standard / High)
- Submit a Jupyter Notebook, a Theory and Reflection PDF of about 1,200 words, and a hidden-test CSV
**Jupyter Notebook steps**
| Step | Content |
|---|---|
| 1 | Data cleaning + missing-value handling |
| 2 | Feature engineering + identifying and removing the leakage feature |
| 3 | Baseline model (Logistic Regression) |
| 4 | Random Forest vs XGBoost/LightGBM comparison |
| 5 | Hyperparameter tuning with Bayesian optimisation / GridSearchCV |
| 6 | Personalised improvement (determined by the last digit of the student ID) |
| 7 | K-Means + GMM unsupervised exploration |
| 8 | Final model selection + hidden-test prediction |
**PDF report topics**
1. Bagging vs Boosting comparison
2. Discussion of hyperparameter optimisation methods
3. K-Means vs GMM comparison
4. Reflection on the personalised improvement
5. AI-use declaration
**Estimated workload**: ⭐⭐⭐ (moderate, roughly 25% of total effort)
---
## ⏰ Suggested Schedule
| Weeks | Task | Notes |
|------|----------|------|
| **Week 1-2** | Task 3 (Machine Learning) | Most time available; can be finished early |
| **Week 3-4** | Task 2 (PPO) | Core algorithm; needs thorough tuning |
| **Week 5-7** | Task 1 (Atari) | Longest training time; start early |
| **Final week** | Overall review, report writing, formatting | Buffer time |
---
## ⚠️ Key Notes
1. **Tasks 1 and 2** forbid implementing the core algorithm with RL-specific libraries such as Stable-Baselines
2. **Task 3**'s personalised improvement is determined by the last digit of the student ID
3. All tasks must be named and submitted in the specified format: **PDF + zip (code + model)**
4. Task 1 may use Stable-Baselines3 as a benchmark for comparison
---
## 📝 Document Language Requirement (note for the internationally taught course)
⚠️ **Important**: these assignments belong to a course taught in English, so all submitted **PDF documents must be written in English**, including:
- Technical Report
- Theory and Reflection PDF
- Code Comments
Suggestions:
- Write the report body in English
- Use English for figure titles and legends
- Use English for variable names and code comments
- Chinese may be kept only in personal notes / scratch work (not submitted)
---
## ❓ Open Questions
- What is the **last digit of the student ID**? (determines Task 3's personalised improvement category)
- Is there a preferred Atari game?
@@ -0,0 +1,129 @@
# Consolidated Coursework Requirements, Task Breakdown, and Schedule
## 📋 Consolidated Coursework Requirements
### 1. Individual RL Project Report (Atari game track)
**Core task**
- Pick an Atari game, implement and train a deep reinforcement learning algorithm from scratch, and reach competitive performance
- Submit a technical report of no more than 3,000 words + a zip file containing all source code and the trained model
**Report requirements**
1. Describe the chosen game and its challenges
2. Survey the state of deep RL applied to Atari games
3. Compare algorithms and justify the final choice
4. Detail the algorithm's principles and implementation
5. Evaluate agent performance, stating the chosen benchmarks and metrics
6. Analyse why the algorithm performs well or poorly on this game
7. Present results with figures that have clearly labelled axes and legends
**Implementation constraints**
- Implementing the algorithm directly with RL-specific libraries such as Stable-Baselines is forbidden
- Libraries such as Stable-Baselines may be used as benchmarks
- Grading dimensions: code quality, result analysis, report structure, use of figures, citation style
- The PDF and zip must be named and submitted in the specified format
---
### 2. Individual RL Coursework (PPO + CarRacing-v3 track)
**Core task**
- Implement PPO (Proximal Policy Optimization) from scratch in Python so that an agent completes the racing task in the CarRacing-v3 environment
- Submit a technical report of no more than 3,000 words + a zip file containing all source code and the trained model
**Report requirements**
1. Introduce the RL background of the task
2. Define the state space, action space, and reward mechanism
3. Explain PPO's objective function, clipping mechanism, and advantage estimation
4. Describe the policy and value network architectures, training procedure, and hyperparameter settings
5. Record the problems encountered during implementation and how they were solved
6. Present training and test results with figures, analysing performance and trends
7. Briefly compare against baselines such as Stable-Baselines3 in terms of stability and sample efficiency
**Implementation constraints**
- Implementing the algorithm directly with RL-specific libraries such as Stable-Baselines is forbidden
- TensorBoard may be used to log experimental results
- The PDF and zip must be named and submitted in the specified format
---
### 3. Individual ML Coursework (health insurance multi-class track)
**Core task**
- Build and improve a multi-class model on a health insurance dataset to predict applicants' premium risk level (Low / Standard / High)
- Submit a Jupyter Notebook, a Theory and Reflection PDF of about 1,200 words, a hidden-test CSV, and any supplementary code
**Jupyter Notebook requirements**
1. Data cleaning and preprocessing
2. Identify and remove the data-leakage feature
3. Build a baseline model
4. Compare Random Forest against one boosting model
5. Tune with an advanced hyperparameter optimisation method
6. Complete the personalised improvement assigned by the last digit of the student ID, plus at least one optional improvement
7. Run a K-Means and GMM unsupervised exploration
8. Select the final model based on validation results and export the hidden-test CSV in the required format
**PDF report requirements**
- Summarise the following topics, combining theory with experimental evidence (tables, figures, metrics):
  1. bagging vs boosting comparison
  2. hyperparameter optimisation methods
  3. K-Means vs GMM comparison
  4. reflection on the personalised improvement
  5. AI-use declaration
---
## ⚠️ Document Language Requirement
⚠️ **Important**: these assignments belong to a course taught in English, so all submitted **PDF documents must be written in English**, including:
- Technical Report — English
- Theory and Reflection PDF — English
- Code Comments — English
Suggestions:
- Write the report body in English
- Use English for figure titles and legends
- Use English for variable names and code comments
- Chinese may be kept only in personal notes / scratch work (not submitted)
File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -0,0 +1,174 @@
# Theory and Reflection PDF — Summary of Official Requirements
> Sources: original PDF `DTS304TC_Assessment1_(word)_2026(1).pdf` + course notes
---
## 1. Basic Submission Requirements
| Item | Requirement |
|------|------|
| Filename | `Coursework Answer Sheet / Theory and Reflection PDF` |
| Format | **PDF** |
| Marks | **30 marks** (30% of the assignment, which itself counts for 50%) |
| Where to submit | Learning Mall platform |
---
## 2. Hard Constraints (penalties for exceeding limits)
| Constraint | Details | Penalty |
|------|------|---------|
| **Pages** | ≤ **4 pages** | flat **5-mark** deduction |
| **Words** | ≤ **1,200 words** (total body text) | flat **5-mark** deduction |
| **Content** | must not repeat the notebook section by section | marks deducted |
If either the page or word limit is exceeded, 5 marks are deducted from the PDF component, with no exceptions.
---
## 3. The 5 Compulsory Topics
### Q1 - Bagging versus Boosting (linked to 8 marks)
The PDF must include:
1. **Brief definitions** of the theoretical properties of bagging and boosting
2. **Report** the validation results of both model families (from your own notebook)
3. **Supporting comparison**: at least 1-2 additional analyses, e.g.
   - class-wise F1 metrics
   - confusion matrix
   - train-vs-validation behaviour
   - stability/sensitivity after tuning
4. **Dataset-specific explanation**: using your own results, explain why bagging and boosting differ on this dataset
> ⚠️ The brief explicitly warns that generic textbook answers (without notebook evidence) receive limited marks.
---
### Q2 - Hyperparameter Optimisation (linked to 12 marks)
You must explain:
- why your optimiser and search space are reasonable for the chosen model
- which hyperparameters you expected to matter most
- whether the tuning results matched your expectations
- what you learned from the tuning process
---
### Q3 - K-Means versus GMM (linked to 6 marks)
Must include:
- an explanation of **hard assignment vs soft assignment**
- the **core difference in assumptions** between the two
- a discussion, using your own results, of whether the outcome matches intuition
- whether GMM revealed additional information, e.g.:
  - soft membership
  - uncertainty
  - partial cluster structure
---
### Q4 - Personalised Reflection
You must reflect on:
- your **compulsory category** (determined by the last digit of your student ID)
- every **optional category** you attempted
- the strategies you tried, the challenges you hit, and how you solved them
- the key lessons learned
> 💡 Even neutral or negative results are acceptable, as long as the reflection is concrete.
---
### Q5 - AI Use Declaration
You must state:
- whether you used AI tools, and what form the assistance took
- **Generic AI-written theory** that does not match the notebook evidence will receive very limited marks
---
## 4. Evidence Citation Requirement (every question)
| Requirement | Details |
|------|------|
| Cite at least **1** piece of notebook evidence per question | a table, figure, or metric |
| Tie every conclusion to your own results | no unsupported generalisations |
Quoted from the brief:
> *"At least one table, figure, or metric from the notebook must be referenced in each theory answer."*
---
## 5. AI Usage Limits (hard constraints)
| Allowed | Not allowed |
|------|--------|
| ✅ code understanding | ❌ generating answers directly with ChatGPT |
| ✅ debugging | ❌ replacing method design |
| ✅ grammar support | ❌ replacing ablation logic |
| ✅ language polishing | ❌ replacing qualitative analysis |
| - | ❌ replacing reflection |
Quoted from the brief:
> *"High-scoring work must demonstrate your own experimental design, controlled comparisons, failure analysis, and image-level interpretation."*
If you used AI tools or external code in any meaningful way, you must:
- fully understand every method, number, figure, and written claim
- verify everything and take responsibility for it
---
## 6. Submission Format Requirements (additional deduction risks)
| Item | Risk |
|------|------|
| Wrong CSV filename format | **-4 marks** (automatic) |
| Wrong CSV column order | **-4 marks** (automatic) |
| Missing CSV columns (e.g. no `customer_key` or `premium_risk`) | **-4 marks** (automatic) |
Correct CSV format:
- Column 1: `applicant_id`
- Column 2: `customer_key`
- Column 3: `premium_risk` (only Standard / High / Low)
---
## 7. Overall Mark Breakdown
| Component | Marks | Share |
|------|------|------|
| Q1: Notebook-Based Coding Exercise | **60 marks** | 60% |
| **Theory and Reflection PDF** | **30 marks** | 30% |
| Coding Quality / Answer Sheet Quality / Submission Guidelines | **10 marks** | 10% |
| **Total** | **100 marks** | 100% |
---
## 8. Self-Check for the Current Draft
| Check | Status |
|--------|---------|
| Total pages ≤ 4 | ✅ 3 pages |
| Total words ≤ 1,200 | ✅ ~941 words |
| All 5 topics answered | ✅ yes |
| ≥ 1 piece of notebook evidence cited per question | ✅ yes |
| Does not repeat the notebook section order | ✅ yes |
| Written entirely in English | ✅ yes |
| AI-use statement restrained, honest, verifiable | ✅ yes |
| CSV filename format correct | ✅ `test_result_1234560.csv` |
| CSV column order correct | ✅ applicant_id, customer_key, premium_risk |
---
## 9. Reference Files
- `DTS304TC_Assessment1_(word)_2026(1).pdf` — original marking brief (in `docs/`)
- `机器学习个人课程作业_需求分析与实现方案.md` — requirements analysis document (in `docs/`)
- `theory_and_reflection_1234560.pdf` — the submitted PDF (in `tex/`)
- `theory_and_reflection_1234560.tex` — the submitted TeX source (in `tex/`)
@@ -0,0 +1,942 @@
# ML Individual Coursework: Requirements Analysis and Implementation Plan
## 1. Purpose of This Document
This document consolidates the following materials:
- `外教课/原文要求.txt`
- `外教课/课程作业实现方案分析.md`
- `外教课/课程作业整合及任务拆解与时间规划清单.md`
- `资料/DTS304TC_Assessment1_(word)_2026(1).pdf`
- the file structure of `资料/dataset_final(1).zip`
The goal is a detailed, execution-oriented requirements analysis and implementation plan guiding the complete delivery of `Jupyter Notebook + Theory and Reflection PDF + hidden-test CSV + supplementary code`.
---
## 2. Core Requirements of the Original Task
### 2.1 Objective
Build and improve a multi-class model on a fictional but realistic health insurance dataset, predicting applicants' premium risk level:
- `Low`
- `Standard`
- `High`
The task is not about chasing a leaderboard score; it asks for a complete, disciplined, reproducible machine-learning workflow, including:
- data cleaning and preprocessing
- identification and removal of the leakage feature
- a baseline model
- a fair comparison between `Random Forest` and one `Boosting` model
- advanced hyperparameter optimisation of at least one main model
- the compulsory improvement category assigned by the last digit of the student ID, plus at least one optional category
- an unsupervised exploration with `K-Means` and `GMM`
- final model selection based on validation evidence
- export of a hidden-test prediction CSV in the required format
### 2.2 Deliverables
The following files must be submitted:
- a `Jupyter Notebook` (`.ipynb`)
- a `Coursework Answer Sheet / Theory and Reflection PDF`
- a hidden-test prediction `CSV`
- any helper scripts used outside the notebook
### 2.3 Mark Structure
According to the original PDF, the assignment is marked as follows:
- `Question 1: Notebook-Based Coding Exercise`: `60` marks
- `Theory and Reflection PDF`: `30` marks
- `Coding Quality / Answer Sheet Quality / Submission Guidelines`: `10` marks
This means model scores alone are not enough: documentation quality, experimental organisation, result interpretation, and submission format directly affect the grade.
---
## 3. Hard Constraints Extracted from the Original PDF
The following are key hard constraints from the formal requirements and must be satisfied first.
### 3.1 Evaluation Metric Constraints
- The primary metric is `macro-F1`
- `accuracy` is only a secondary metric
- Every important model comparison, tuning result, and improvement result should report both:
  - `macro-F1`
  - `accuracy`
The reason is the pronounced class imbalance: accuracy alone cannot judge model quality.
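A tiny pure-Python illustration of that point, on made-up predictions: a classifier that always predicts the majority class looks acceptable on accuracy while macro-F1 exposes the collapsed minority classes.

```python
def f1_per_class(y_true, y_pred, cls):
    """Per-class F1 from scratch: harmonic mean of precision and recall."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Imbalanced toy labels, and a degenerate classifier that always says 'Standard'
y_true = ['Standard'] * 8 + ['High'] * 1 + ['Low'] * 1
y_pred = ['Standard'] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1_per_class(y_true, y_pred, c) for c in ['Low', 'Standard', 'High']) / 3
print(accuracy, round(macro_f1, 3))  # accuracy is 0.8, macro-F1 only ~0.296
```

Because macro-F1 averages the per-class scores unweighted, the two ignored classes each contribute a 0 and drag the score down.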
### 3.2 Data Leakage Constraint
The PDF states explicitly:
- the data contains `1` leakage feature
- it must be identified and removed before any further analysis
- if it is not removed, several later parts are treated as invalid or badly distorted
So "find the leakage feature" is not a suggestion; it is a key checkpoint of the assignment.
### 3.3 Model Comparison Constraints
The notebook must include:
- one baseline pipeline
- one `Random Forest`
- one `Boosting` model
and the comparison must be controlled:
- the same preprocessing pipeline
- the same train/validation split
- the same evaluation metric
Otherwise the comparison's conclusions do not hold.
### 3.4 Advanced Tuning Constraint
At least one main model must be tuned with a genuinely advanced method, e.g.:
- `Optuna/TPE`
- Bayesian optimisation
- `Hyperopt`
- `Ray Tune`
The PDF states explicitly:
- `RandomizedSearchCV` alone is usually not enough for the top mark band
The recommended primary choice is therefore `Optuna`.
### 3.5 Personalised Improvement Constraint
You must complete:
- the `1` compulsory category matching the last digit of your student ID
- at least `1` additional optional category
Recommended but not required:
- a `2nd` optional category, to stand out further
### 3.6 Unsupervised Exploration Constraint
A compact exploration is required with:
- `K-Means`
- `Gaussian Mixture Model (GMM)`
The point is not for clustering to beat the supervised models, but to demonstrate:
- understanding of how the unsupervised methods work
- cautious interpretation of the results
- awareness of the `hard assignment` vs `soft assignment` distinction
### 3.7 Hidden-Test Export Constraints
The final CSV must:
- be named `test_result_[your_student_id].csv`
- have `applicant_id` as the first column
- have `customer_key` as the second column
- have `premium_risk` as the third column
- use only these prediction labels:
  - `Standard`
  - `High`
  - `Low`
The PDF states explicitly:
- naming or format errors cost an automatic `4` marks in that component
- tuning on the hidden test is not allowed
- claiming hidden-test performance is not allowed
### 3.8 PDF Writing Constraints
The `Theory and Reflection PDF` must:
- be at most `4` pages
- be at most `1200` words
- not simply repeat the notebook's content
- cite at least one table, figure, or metric from the notebook in every theory answer
Exceeding the page or word limit costs a flat `5` marks.
### 3.9 AI Usage Constraints
The PDF sets stricter AI boundaries than a typical course:
- generating assignment answers directly with ChatGPT is not allowed
- AI may only be a limited support tool for:
  - code understanding
  - debugging
  - language polishing
- AI must not replace:
  - method design
  - ablation logic
  - qualitative analysis
  - reflective writing
The final submission must therefore clearly show that you ran the experiments and analysis yourself.
---
## 4. Known Facts About the Dataset
Inspecting the structure and file headers of `dataset_final(1).zip` confirms the following.
### 4.1 Data Files
The archive contains:
- `dataset_final/train.csv`
- `dataset_final/val.csv`
- `dataset_final/test_features.csv`
### 4.2 Data Size
- `train.csv`: `74375 x 33`
- `val.csv`: `13125 x 33`
- `test_features.csv`: `12500 x 32`
Notes:
- the train and validation sets contain the label column `premium_risk`
- the hidden-test file has no label
### 4.3 Column Structure
Training-set columns include:
- identifier fields: `applicant_id`, `customer_key`, `applicant_ref_code`
- time/categorical fields: `application_month`, `employment_sector`, `prior_debt_products`, `debt_portfolio_quality`, `account_tenure`, `minimum_payment_only`, `spending_profile`
- numeric features: income, debt, credit-limit changes, inquiry counts, payment delays, investment amounts, balances, etc.
- an obviously suspicious field: `bureau_risk_index`
- noise fields: `noise_feature_1` to `noise_feature_5`
- the label: `premium_risk`
### 4.4 Class Distribution
Training-set label counts:
- `Standard`: `39686`
- `High`: `21586`
- `Low`: `13103`
Conclusions:
- the data is clearly imbalanced
- `macro-F1` as the primary metric is entirely reasonable
- for the personalised improvement, `Category C` ideas such as resampling, class weights, and threshold logic fit naturally
### 4.5 Missing Values Overview
Fields currently observed to have many missing values include:
- `net_monthly_income_gbp`
- `avg_payment_delay_days`
- `monthly_investment_gbp`
- `prior_debt_products`
- `account_tenure`
- `late_payment_count`
- `credit_limit_change_pct`
- `credit_inquiry_count`
- `end_month_balance_gbp`
Preprocessing therefore cannot just drop rows; a pipelined imputation strategy is more appropriate.
---
## 5. What the Task Really Tests
This assignment does not test "who can tune the highest score"; it tests four abilities:
- building a disciplined ML experimental workflow
- spotting unreasonable features and avoiding data leakage
- fair comparison, sensible tuning, and evidence-driven analysis
- mapping theoretical concepts one-to-one onto your own experimental results
A high-scoring solution must therefore simultaneously have:
- reasonable model results
- a disciplined experimental process
- sufficient analysis and argumentation
- strict correspondence between notebook and PDF
---
## 6. Notebook Requirements Breakdown
Below, the notebook work is broken into executable pieces following the original mark structure.
### 6.1 Part A: Cleaning, Preprocessing, and the Baseline
Required content:
- load `train.csv`, `val.csv`, `test_features.csv`
- define `X_train / y_train / X_val / y_val / X_test` explicitly
- identify and remove the leakage feature
- handle dirty values, missing values, and categorical variables
- build one baseline pipeline
- report the baseline's:
  - `accuracy`
  - `macro-F1`
  - confusion matrix
- make sure train, val, and test use exactly the same preprocessing rules
Suggested implementation:
- use `ColumnTransformer + Pipeline`
- numeric features:
  - `SimpleImputer(strategy='median')`
- categorical features:
  - `SimpleImputer(strategy='most_frequent')`
  - `OneHotEncoder(handle_unknown='ignore')`
- baseline model:
  - `LogisticRegression`
  - or a simple baseline before moving to `HistGradientBoosting`
The safer recommendation is:
- use `LogisticRegression(class_weight='balanced')` as the baseline
Reasons:
- simple
- interpretable
- a good starting point for comparison with the later tree models
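The suggested baseline pipeline can be sketched as follows; the toy DataFrame is invented for illustration and does not use the coursework's column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data: one numeric and one categorical column, both with missing values
df = pd.DataFrame({
    'income': [1000.0, np.nan, 3000.0, 4000.0, 2500.0, np.nan],
    'sector': ['retail', 'tech', None, 'tech', 'retail', 'retail'],
    'risk':   ['Low', 'High', 'Standard', 'High', 'Low', 'Standard'],
})

numeric = Pipeline([('imputer', SimpleImputer(strategy='median')),
                    ('scaler', StandardScaler())])
categorical = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer([('num', numeric, ['income']),
                                  ('cat', categorical, ['sector'])])

baseline = Pipeline([('prep', preprocessor),
                     ('clf', LogisticRegression(class_weight='balanced', max_iter=1000))])
baseline.fit(df[['income', 'sector']], df['risk'])
print(baseline.predict(df[['income', 'sector']])[:3])
```

Because imputation and encoding live inside the pipeline, the same fitted rules apply identically to train, validation, and the hidden test.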
### 6.2 Leakage Identification Strategy
Since the PDF stresses that you must identify the feature yourself, do not just assert in the formal notebook that some column "obviously" leaks.
A recommended chain of evidence:
1. Shortlist high-risk columns by business semantics
   Check first:
   - `bureau_risk_index`
   - any field that looks like a post-hoc statistic or near-restatement of the label
2. Run a univariate or minimal-model screen
   For example:
   - train a simple model on one column at a time
   - compare the validation `macro-F1` each column achieves alone
3. Flag "abnormally high" predictive power
   If a single column alone gets unusually close to the target label, it is strongly suspected leakage
4. Rebuild the baseline after removing the feature
   State explicitly in the notebook:
   - the risk before removal
   - the reason for removal
   - why all later analysis must use the cleaned data
Notes:
- judging by its name, `bureau_risk_index` is the first candidate to suspect
- but the formal submission should say it was identified "through field semantics plus validation evidence", not by guesswork
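Step 2 of that evidence chain might look like this sketch. The synthetic data is invented, with one column (`leaky`) deliberately built as a near-copy of the label so the screen has something to catch.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 3, n)                  # three risk classes
leaky = y + rng.normal(0, 0.05, n)         # near-restatement of the label (planted leakage)
honest = rng.normal(0, 1, n)               # genuinely uninformative feature
features = {'leaky': leaky, 'honest': honest}

# Score each feature alone with a shallow tree; suspiciously high macro-F1 => leakage
scores = {}
for name, col in features.items():
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    scores[name] = cross_val_score(clf, col.reshape(-1, 1), y,
                                   cv=3, scoring='f1_macro').mean()
print({k: round(v, 3) for k, v in scores.items()})
```

On the real data, the same loop over all candidate columns yields the single-feature ranking described above.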
### 6.3 Part B: Controlled Comparison of Random Forest and Boosting
Required:
- keep the same preprocessing
- compare:
  - `RandomForestClassifier`
  - one boosting model
Recommended boosting models, in order:
1. `XGBoost`
2. `LightGBM`
3. `HistGradientBoostingClassifier`
Why:
- the PDF explicitly recommends `XGBoost`
- it has a richer tuning space
- it makes high-quality hyperparameter optimisation easier
This part should output at least:
- a model comparison table:
  - accuracy
  - macro-F1
  - training time
- a confusion matrix per model
- a classification report or class-wise F1
- a short explanation of how bagging and boosting differ on this dataset
Key point:
- the goal is not to prove one method is always stronger
- it is to show which bias-variance profile fits this particular dataset
### 6.4 Part C: Advanced Hyperparameter Optimisation
Required:
- pick at least one main model
- tune it with an advanced optimisation method
- target the validation `macro-F1` in the objective function
Recommended main model:
- `XGBoost`
Recommended optimiser:
- `Optuna` with `TPESampler`
Example search space:
- `n_estimators`
- `max_depth`
- `learning_rate`
- `min_child_weight`
- `subsample`
- `colsample_bytree`
- `gamma`
- `reg_alpha`
- `reg_lambda`
Suggested outputs:
- a table of best parameters
- a summary of the top trials
- a before/after tuning comparison table
- an interpretation of which hyperparameters mattered
Notes:
- briefly justify why the search space is set this way
- state which parameters you expected to matter most
- state whether the tuning results matched expectations
This part feeds directly into Q2 of the PDF.
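Whatever optimiser is chosen, the tuning loop has the same shape: sample parameters from the search space, evaluate the validation macro-F1, keep the best. This hedged sketch uses a random-search stand-in over part of the search space above, with an invented toy objective in place of "train XGBoost, score on validation"; with Optuna, `study.optimize(objective, n_trials=...)` would replace the explicit loop.

```python
import random

random.seed(0)

# Part of the suggested XGBoost search space (ranges are illustrative)
def sample_params():
    return {
        'max_depth': random.randint(3, 10),
        'learning_rate': 10 ** random.uniform(-3, -1),   # log-uniform in [0.001, 0.1]
        'subsample': random.uniform(0.6, 1.0),
    }

def validation_macro_f1(params):
    # Toy stand-in for "fit the model, score the validation set";
    # invented surface that peaks near depth 6, lr 0.05, full subsample
    return (1.0
            - 0.01 * abs(params['max_depth'] - 6)
            - abs(params['learning_rate'] - 0.05)
            - 0.05 * (1.0 - params['subsample']))

best_score, best_params = float('-inf'), None
for trial in range(50):
    params = sample_params()
    score = validation_macro_f1(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

The before/after comparison table then simply reports the metric at the default parameters versus at `best_params`.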
### 6.5 Part D: Personalised Improvement
This is the largest share of the notebook and the easiest place to stand out.
#### Mapping from the last digit of the student ID
- `0-1` -> `Category A`: data quality and missingness mechanisms
- `2-3` -> `Category B`: feature representation and feature engineering
- `4-5` -> `Category C`: class imbalance and objective design
- `6-7` -> `Category D`: robustness, calibration, or ensembling
- `8-9` -> `Category E`: fairness, diagnostics, or interpretability
#### Suggested concrete options per category
`Category A` options:
- `IterativeImputer`
- finer-grained missingness indicators
- outlier clipping or winsorization
- unified dirty-value cleaning
`Category B` options:
- feature crosses
- category merging
- safe representations beyond target/unsupervised encoding
- log transforms, ratio features, derived variables such as debt-to-income
`Category C` options:
- `class_weight`
- `SMOTE` or other resampling
- a focal-loss-like alternative
- validation-based threshold adjustment
`Category D` options:
- probability calibration
- soft voting
- stacking
- bootstrap stability tests
`Category E` options:
- `SHAP`
- permutation importance
- group fairness checks
- error-sample analysis
- summarising high-risk misclassification patterns
#### Strongly recommended approach
Whatever your compulsory category is, add one optional category that yields visible results:
- if the main model is a tree model, prefer adding:
  - `Category E` interpretability
  - or `Category D` ensembling/calibration
Because:
- the PDF explicitly welcomes "concrete insight"
- interpretability and error analysis make the reflection much easier to write
- they give the PDF evidence and keep it from being vague
#### Evidence the personalised improvement must include
- a compact ablation table
- before/after accuracy / macro-F1 comparison
- class-wise F1 where relevant
- a short explanation of:
  - what was done
  - why
  - whether it improved results
  - if not, why it was still valuable
### 6.6 Part E: K-Means and GMM Exploration
Keep this part compact; it should not become the main storyline.
Suggested workflow:
1. Take a clustering-friendly numeric feature space from the cleaned data
2. Scale first if needed
3. Optionally reduce dimensionality:
   - PCA to 2D for visualisation
4. For `k=2~8`, run both:
   - `KMeans`
   - `GaussianMixture`
Suggested outputs:
- `K-Means`:
  - inertia / SSE curve
  - cluster sizes
  - silhouette score (optional extra)
- `GMM`:
  - BIC / AIC or log-likelihood trend
  - component sizes
  - posterior probability / responsibility statistics
Finish with a comparison table or figure answering:
- why `K-Means` is hard assignment
- why `GMM` is soft assignment
- whether the data has fuzzy cluster boundaries
- whether `GMM` additionally reveals uncertainty or overlapping structure
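The hard-vs-soft distinction in that comparison can be shown directly in NumPy; the two 1-D cluster centres and the query point below are invented for illustration.

```python
import numpy as np

# Two 1-D Gaussian components with equal weight (invented parameters)
means = np.array([0.0, 4.0])
sigma = 1.5
x = 2.4  # a point sitting between the two clusters

# K-Means-style hard assignment: the nearest centre takes the point outright
hard = int(np.argmin(np.abs(x - means)))

# GMM-style soft assignment: responsibilities proportional to component densities
dens = np.exp(-0.5 * ((x - means) / sigma) ** 2)  # unnormalised Gaussian pdfs
resp = dens / dens.sum()

print(hard, np.round(resp, 3))
```

Here K-Means assigns the point entirely to cluster 1, while the GMM responsibilities split roughly one third / two thirds: exactly the extra uncertainty information the comparison table should discuss.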
### 6.7 Part F: Final Model and Hidden-Test Export
Principles that must hold:
- the final model may only be chosen from validation results
- tuning based on hidden-test results is not allowed
Suggested workflow:
1. Freeze the final pipeline
2. Retrain on the merged `train + val`
3. Predict on `test_features.csv`
4. Generate the CSV in strict compliance with the format
Recommended export logic:
- keep from `test_features.csv`:
  - `applicant_id`
  - `customer_key`
- add one column:
  - `premium_risk`
The final column order must be:
1. `applicant_id`
2. `customer_key`
3. `premium_risk`
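A sketch of that export logic with pandas; the `test_features` rows and `predictions` are invented stand-ins, and the `1234560` student ID is the example ID used elsewhere in the repo.

```python
import pandas as pd

# Invented stand-ins for the real test_features.csv rows and the model's predictions
test_features = pd.DataFrame({
    'applicant_id': [101, 102, 103],
    'customer_key': ['K01', 'K02', 'K03'],
    'other_feature': [1.2, 3.4, 5.6],
})
predictions = ['Standard', 'High', 'Low']

student_id = '1234560'  # example ID used elsewhere in this repo
out = test_features[['applicant_id', 'customer_key']].copy()
out['premium_risk'] = predictions  # order: applicant_id, customer_key, premium_risk
filename = f'test_result_{student_id}.csv'
out.to_csv(filename, index=False)

print(list(pd.read_csv(filename).columns))
```

Reading the file back and asserting the column order and label set is a cheap final check against the automatic 4-mark format deduction.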
---
## 7. Recommended Overall Implementation Route
Below is a practical route that scores well and stays feasible.
### 7.1 Tech Stack
- Python
- `pandas`
- `numpy`
- `scikit-learn`
- `xgboost`
- `optuna`
- `matplotlib`
- `seaborn`
- `shap` (if taking the interpretability route)
- `imbalanced-learn` (if taking the resampling route)
### 7.2 Suggested Project Layout
```text
coursework_ml/
├─ notebook/
│ └─ insurance_premium_risk.ipynb
├─ src/
│ ├─ data_utils.py
│ ├─ features.py
│ ├─ metrics.py
│ ├─ tuning.py
│ └─ export.py
├─ outputs/
│ ├─ figures/
│ ├─ tables/
│ └─ predictions/
├─ report/
│ └─ theory_and_reflection.pdf
└─ README.md
```
To reduce complexity, you can also keep only:
- one main notebook
- one or two helper `.py` scripts
The key is reproducibility, not artificial complexity.
### 7.3 Suggested Notebook Sections
Organise the notebook in this order:
1. Introduction and Setup
2. Data Loading
3. Data Cleaning and Leakage Check
4. Baseline Pipeline
5. Controlled Comparison: Random Forest vs Boosting
6. Advanced Hyperparameter Optimisation
7. Personalised Improvement Work
8. K-Means and GMM Exploration
9. Final Model Selection
10. Hidden-Test Export
11. Conclusion
Benefits:
- aligns closely with the mark structure
- the PDF can directly back-reference tables and figures
---
## 8. Recommended Models
### 8.1 Baseline
Recommended:
- `LogisticRegression`
Role:
- provides the minimal comparable baseline
- verifies the preprocessing chain is stable
- contrasts a linear model against the tree models
### 8.2 Bagging Option
Recommended:
- `RandomForestClassifier`
Watch for:
- how it handles categorical variables after one-hot encoding
- whether it is stable but not aggressive enough
- whether it underperforms on the minority classes
### 8.3 Boosting Option
First choice:
- `XGBoost`
Alternative:
- `LightGBM`
Fallback if dependencies are constrained:
- `HistGradientBoostingClassifier`
### 8.4 Final Model Candidates
The most likely winning route is:
- remove the leakage feature
- use a stable preprocessing pipeline
- use `XGBoost` as the main model
- add the compulsory improvement matching your student-ID category
- add one optional interpretability or calibration improvement
A solid final combination is:
- main model: tuned `XGBoost`
- compulsory improvement: per the last digit of the student ID
- optional improvement: `SHAP + error analysis` or `probability calibration`
---
## 9. Mapping the PDF onto the Notebook
To keep the PDF and notebook aligned, prepare the following evidence slots from the start of notebook design.
### 9.1 Bagging vs Boosting
The PDF must answer:
- definitions and properties of bagging and boosting
- validation results of the two models
- supporting analyses
- a dataset-specific explanation
The notebook should prepare:
- an `RF vs XGB` comparison table
- confusion matrices
- class-wise F1
- pre/post-tuning stability or train/validation behaviour
### 9.2 Hyperparameter Optimisation
The PDF must answer:
- why the optimiser is reasonable
- why the search space is reasonable
- which parameters mattered most
- whether the results matched expectations
The notebook should prepare:
- the Optuna study results table
- the best parameters
- metric changes before/after tuning
- a parameter importance plot if possible
### 9.3 K-Means vs GMM
The PDF must answer:
- hard vs soft assignment
- the difference in assumptions
- observations on this dataset
The notebook should prepare:
- one clustering comparison figure
- one metric comparison table
- a paragraph on overlap and uncertainty
### 9.4 Personalised Reflection
The PDF must answer:
- what the compulsory category work was
- what the optional category work was
- the problems encountered
- the effort made
- the lessons learned
The notebook should prepare:
- an ablation table
- before/after result comparisons
- one or two key failed experiments kept as evidence
### 9.5 AI-use Declaration
The PDF should use honest, restrained, verifiable wording, e.g.:
- AI assisted with understanding errors, checking code logic, and polishing language
- all method design, experiment execution, result verification, and conclusions were done by the author
- all tables, figures, and conclusions rest on the notebook's experimental results
Notes:
- do not write "AI designed my model"
- do not make generic claims that the notebook evidence cannot back
---
## 10. Risk Analysis
### 10.1 Biggest risk: not removing the leakage feature first
Consequences:
- scores look high
- but the entire analysis is treated as distorted
- it taints the baseline, the comparison, the tuning, and the final model choice
### 10.2 Common risk: unfair comparison
Symptoms:
- the baseline and later models use different preprocessing
- one model trains on train, another on train+val
- a tuned model is compared head-to-head against defaults
Consequence:
- the conclusions lose credibility
### 10.3 Common risk: reporting only accuracy
Because of the class imbalance:
- accuracy alone hides minority-class problems
- the depth of analysis on the `Low` and `High` classes is easily lost
### 10.4 Common risk: personalised improvement becomes "many tries, no logic"
The PDF explicitly values:
- meaningful diagnostics
- concrete insight
rather than:
- a pile of scattered experiment screenshots
### 10.5 Common risk: a vague PDF
A PDF that only restates textbook knowledge, without citing specific notebook figures or metrics, loses marks visibly.
### 10.6 Common risk: CSV format errors
Especially avoid:
- a wrong filename
- wrong column order
- misspelled labels
- dropping `applicant_id` or `customer_key` on export
---
## 11. Suggested Execution Order
For best efficiency, proceed as follows:
1. Load and inspect the column structure of train / val / test
2. Find and remove the leakage feature
3. Build the unified preprocessing pipeline
4. Run the baseline
5. Run the initial `Random Forest vs XGBoost` comparison
6. Tune one main model with `Optuna`
7. Complete the compulsory improvement for your student-ID digit
8. Add one optional improvement
9. Run the compact `K-Means + GMM` exploration
10. Choose the final model and retrain
11. Export `test_result_[student_id].csv`
12. Write the PDF last
Reasons:
- every PDF answer depends on notebook evidence
- writing the PDF first invites vagueness and evidence mismatch
---
## 12. Suggested Time Allocation
For a complete, high-quality submission, allocate roughly:
- data cleaning and leakage identification: `15%`
- baseline and model comparison: `20%`
- advanced tuning: `20%`
- personalised improvement: `25%`
- K-Means / GMM: `10%`
- final export and submission checks: `5%`
- PDF writing: `5%`
If time is tight, the parts that must not be compressed are:
- leakage identification
- the controlled comparison
- the personalised improvement
- the PDF-to-notebook evidence mapping
---
## 13. Recommended Concrete Plan
Under the current requirements, a plan that balances marks against implementation cost:
### 13.1 Notebook Mainline
- remove the leakage feature
- build `ColumnTransformer + Pipeline`
- use `LogisticRegression` as the baseline
- run the controlled comparison with:
  - `RandomForestClassifier`
  - `XGBoost`
- tune `XGBoost` with `Optuna`
- for the personalised improvement:
  - the compulsory category for your student ID
  - plus one of `Category E` or `Category D`
- for the unsupervised exploration:
  - `KMeans`
  - `GaussianMixture`
- the final model will most likely be:
  - tuned `XGBoost` or an enhanced variant of it
### 13.2 PDF Mainline
Write one answer per compulsory prompt, binding every answer to at least one of the following from the notebook:
- a table
- a figure
- a metric result
Stick to this structure in the writing:
- brief theory first
- then the experimental data from this coursework
- then the dataset-specific interpretation
### 13.3 Pre-Submission Checklist
- the notebook runs top to bottom
- figures and tables are visible in the notebook
- every number cited in the PDF matches the notebook exactly
- the hidden-test CSV is named correctly
- the hidden-test CSV column order is correct
- the label names are spelled correctly
- all extra scripts are submitted together
---
## 14. Information Still to Confirm
The following must still be confirmed before this plan can go from generic to customised:
- the last digit of your student ID
- whether you intend to use `XGBoost`
- which direction the optional improvement should prioritise:
  - interpretability
  - ensembling/calibration
  - class imbalance
The most critical item is:
- the last digit of the student ID
because it directly determines which of `Category A/B/C/D/E` is compulsory for you.
---
## 15. Conclusion
The optimal strategy for this coursework is not "blindly chase the highest-scoring model", but:
- first ensure disciplined experiments
- then ensure fair comparisons
- then pull ahead via advanced tuning and the personalised improvement
- finally close the evidence loop strictly between the PDF and the notebook
When implementation begins, follow the execution order in Section `11` of this document, and confirm the last digit of the student ID first so the personalised improvement can be customised.
@@ -0,0 +1,463 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "170d0b4f",
"metadata": {},
"source": [
"# Insurance Premium Risk Classification\n",
"## DTS304TC Machine Learning - Coursework 1\n",
"\n",
"**Student ID**: 1234560 (Last digit = 0)\n",
"**Compulsory Category**: A - Data Quality & Missingness\n",
"**Optional Category**: D - Robustness & Soft Voting Ensemble\n",
"\n",
"**Primary metric**: macro-F1 (imbalanced dataset)\n",
"**Secondary metric**: accuracy\n",
"\n",
"---\n",
"\n",
"## Workflow\n",
"1. Data loading & EDA\n",
"2. Leakage feature identification & removal\n",
"3. Preprocessing pipeline construction\n",
"4. Baseline model (Logistic Regression)\n",
"5. Controlled comparison: Random Forest vs XGBoost\n",
"6. Advanced hyperparameter optimisation (Optuna/TPE)\n",
"7. Personalised improvement (Category A + Category D)\n",
"8. K-Means & GMM unsupervised exploration\n",
"9. Final model selection\n",
"10. Hidden-test CSV export"
]
},
{
"cell_type": "markdown",
"id": "463d3e6d",
"metadata": {},
"source": [
"## Step 1: Setup & Data Loading"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a12f069a",
"metadata": {},
"outputs": [],
"source": [
"import warnings\nwarnings.filterwarnings('ignore')\nimport os\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix, ConfusionMatrixDisplay\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.preprocessing import StandardScaler, LabelEncoder\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.ensemble import RandomForestClassifier, VotingClassifier\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.cluster import KMeans\nfrom sklearn.mixture import GaussianMixture\nfrom sklearn.metrics import silhouette_score\nfrom sklearn.decomposition import PCA\nimport xgboost as xgb\nimport optuna\noptuna.logging.set_verbosity(optuna.logging.WARNING)\n\nRANDOM_STATE = 42\nnp.random.seed(RANDOM_STATE)\nplt.rcParams['figure.figsize'] = (10, 6)\nplt.rcParams['font.size'] = 12\nsns.set_style('whitegrid')\nprint('All libraries imported successfully!')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c4b453a",
"metadata": {},
"outputs": [],
"source": [
"DATA_DIR = r'd:\\Code\\doing_exercises\\programs\\外教作业外快\\强化学习个人课程作业报告\\dataset_final'\nOUTPUT_DIR = r'd:\\Code\\doing_exercises\\programs\\外教作业外快\\强化学习个人课程作业报告\\outputs'\n\ntrain_df = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'))\nval_df = pd.read_csv(os.path.join(DATA_DIR, 'val.csv'))\ntest_df = pd.read_csv(os.path.join(DATA_DIR, 'test_features.csv'))\n\nprint(f'Train shape: {train_df.shape}')\nprint(f'Val shape: {val_df.shape}')\nprint(f'Test shape: {test_df.shape}')"
]
},
{
"cell_type": "markdown",
"id": "8b8e7ad9",
"metadata": {},
"source": [
"## Step 2: Exploratory Data Analysis (EDA)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45e520e2",
"metadata": {},
"outputs": [],
"source": [
"print('=== TARGET DISTRIBUTION (TRAIN) ===')\ntarget_counts = train_df['premium_risk'].value_counts()\nprint(target_counts)\nprint((target_counts / len(train_df) * 100).round(2))\n\nfig, ax = plt.subplots(figsize=(8, 5))\ncolors = ['#4CAF50', '#FFC107', '#F44336']\ntarget_counts.sort_index().plot(kind='bar', ax=ax, color=colors)\nax.set_title('Target Variable Distribution (Train)', fontsize=14)\nax.set_xlabel('Premium Risk')\nax.set_ylabel('Count')\nax.set_xticklabels(ax.get_xticklabels(), rotation=0)\nfor i, (idx, val) in enumerate(target_counts.sort_index().items()):\n ax.text(i, val + 300, f'{val}\\n({val/len(train_df)*100:.1f}%)', ha='center')\nplt.tight_layout()\nplt.savefig(os.path.join(OUTPUT_DIR, 'figures', 'target_distribution.png'), dpi=150)\nplt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e2428e4",
"metadata": {},
"outputs": [],
"source": [
"print('=== MISSING VALUES (TRAIN) ===')\nmissing = train_df.isnull().sum()\nmissing = missing[missing > 0].sort_values(ascending=False)\nprint(missing)\n\nfig, ax = plt.subplots(figsize=(12, 6))\nmissing.plot(kind='barh', ax=ax, color='coral')\nax.set_title('Missing Values per Column (Train)', fontsize=14)\nax.set_xlabel('Count')\nplt.tight_layout()\nplt.savefig(os.path.join(OUTPUT_DIR, 'figures', 'missing_values.png'), dpi=150)\nplt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5cafc5e",
"metadata": {},
"outputs": [],
"source": [
"noise_cols = [c for c in train_df.columns if 'noise' in c.lower()]\nprint(f'Noise features: {noise_cols}')\n\nprint('\\n=== bureau_risk_index stats ===')\nprint(train_df['bureau_risk_index'].describe())\n\nfig, ax = plt.subplots(figsize=(8, 5))\ntrain_df.boxplot(column='bureau_risk_index', by='premium_risk', ax=ax)\nax.set_title('bureau_risk_index by Premium Risk')\nax.set_xlabel('Premium Risk')\nax.set_ylabel('bureau_risk_index')\nplt.suptitle('')\nplt.tight_layout()\nplt.savefig(os.path.join(OUTPUT_DIR, 'figures', 'bureau_risk_boxplot.png'), dpi=150)\nplt.show()"
]
},
{
"cell_type": "markdown",
"id": "4db79797",
"metadata": {},
"source": [
"## Step 3: Leakage Feature Identification & Removal\n",
"\n",
"**Strategy**: Train a DecisionTree with each feature individually.\n",
"Features with abnormally high macro-F1 are suspected leakage."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fbf59f43",
"metadata": {},
"outputs": [],
"source": [
"def screen_single_feature_leakage(df, target_col, feature_cols, scoring='f1_macro'):\n from sklearn.tree import DecisionTreeClassifier\n results = []\n for col in feature_cols:\n temp_df = df[[col, target_col]].dropna()\n X_temp = temp_df[[col]].values\n y_temp = temp_df[target_col].values\n le = LabelEncoder()\n y_enc = le.fit_transform(y_temp)\n try:\n clf = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=3)\n scores = cross_val_score(clf, X_temp, y_enc, cv=3, scoring=scoring)\n results.append({'feature': col, 'mean_f1_macro': scores.mean(), 'std': scores.std()})\n except:\n results.append({'feature': col, 'mean_f1_macro': 0.0, 'std': 0.0})\n return pd.DataFrame(results).sort_values('mean_f1_macro', ascending=False)\n\nfeature_to_test = [c for c in train_df.columns if c not in ['applicant_id', 'customer_key', 'premium_risk']]\nprint('Screening single features for leakage detection (this may take a few minutes)...')\nleakage_results = screen_single_feature_leakage(train_df, 'premium_risk', feature_to_test)\nprint('\\n=== TOP 10 SINGLE-FEATURE F1 MACRO SCORES ===')\nprint(leakage_results.head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec03b578",
"metadata": {},
"outputs": [],
"source": [
"LEAKAGE_THRESHOLD = 0.85\nprint('=== LEAKAGE DETECTION RESULTS ===')\nprint(leakage_results.head(10))\n\nbureau_score = leakage_results[leakage_results['feature'] == 'bureau_risk_index']['mean_f1_macro'].values[0]\nprint(f'\\nbureau_risk_index F1 macro: {bureau_score:.4f}')\n\nif bureau_score > LEAKAGE_THRESHOLD:\n print('\\n*** ALERT: bureau_risk_index shows abnormally high predictive power! ***')\n print('*** This is consistent with a leakage feature. ***')\n print('*** ACTION: bureau_risk_index will be removed from features. ***')\n LEAKAGE_FEATURE = 'bureau_risk_index'\nelse:\n top_feat = leakage_results.iloc[0]['feature']\n top_score = leakage_results.iloc[0]['mean_f1_macro']\n print(f'\\nTop feature: {top_feat} with F1 macro = {top_score:.4f}')\n if top_score > 0.80:\n LEAKAGE_FEATURE = top_feat\n else:\n LEAKAGE_FEATURE = None"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f01746fe",
"metadata": {},
"outputs": [],
"source": [
"if LEAKAGE_FEATURE:\n print(f'Removing leakage feature: {LEAKAGE_FEATURE}')\n train_df_clean = train_df.drop(columns=[LEAKAGE_FEATURE])\n val_df_clean = val_df.drop(columns=[LEAKAGE_FEATURE])\n test_df_clean = test_df.drop(columns=[LEAKAGE_FEATURE])\nelse:\n print('No leakage feature to remove.')\n train_df_clean = train_df.copy()\n val_df_clean = val_df.copy()\n test_df_clean = test_df.copy()\n\nprint(f'After removal - Train: {train_df_clean.shape}, Val: {val_df_clean.shape}, Test: {test_df_clean.shape}')"
]
},
{
"cell_type": "markdown",
"id": "ed28be55",
"metadata": {},
"source": [
"## Step 4: Preprocessing Pipeline Construction"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f56180d",
"metadata": {},
"outputs": [],
"source": [
"ID_COLS = ['applicant_id', 'customer_key', 'applicant_ref_code']\nNOISE_COLS = ['noise_feature_1', 'noise_feature_2', 'noise_feature_3', 'noise_feature_4', 'noise_feature_5']\nTARGET_COL = 'premium_risk'\n\nall_cols = train_df_clean.columns.tolist()\nfeature_cols_all = [c for c in all_cols if c not in ID_COLS + NOISE_COLS + [TARGET_COL]]\n\nNUMERIC_FEATURES = train_df_clean[feature_cols_all].select_dtypes(include=[np.number]).columns.tolist()\nCATEGORICAL_FEATURES = train_df_clean[feature_cols_all].select_dtypes(include=['object']).columns.tolist()\n\nprint(f'Total features: {len(feature_cols_all)}')\nprint(f'Numeric ({len(NUMERIC_FEATURES)}): {NUMERIC_FEATURES}')\nprint(f'Categorical ({len(CATEGORICAL_FEATURES)}): {CATEGORICAL_FEATURES}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2fbe754d",
"metadata": {},
"outputs": [],
"source": [
"numeric_transformer = Pipeline(steps=[\n ('imputer', SimpleImputer(strategy='median')),\n ('scaler', StandardScaler())\n])\n\ncategorical_transformer = Pipeline(steps=[\n ('imputer', SimpleImputer(strategy='most_frequent')),\n ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))\n])\n\npreprocessor = ColumnTransformer(\n transformers=[\n ('num', numeric_transformer, NUMERIC_FEATURES),\n ('cat', categorical_transformer, CATEGORICAL_FEATURES)\n ],\n remainder='drop'\n)\nprint('Preprocessing pipeline created!')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e797d98",
"metadata": {},
"outputs": [],
"source": [
"X_train = train_df_clean[feature_cols_all]\ny_train = train_df_clean[TARGET_COL]\nX_val = val_df_clean[feature_cols_all]\ny_val = val_df_clean[TARGET_COL]\nX_test = test_df_clean[feature_cols_all]\n\nle_target = LabelEncoder()\ny_train_enc = le_target.fit_transform(y_train)\ny_val_enc = le_target.transform(y_val)\n\nprint(f'Classes: {le_target.classes_}')\nprint(f'X_train: {X_train.shape} | X_val: {X_val.shape} | X_test: {X_test.shape}')"
]
},
{
"cell_type": "markdown",
"id": "481e4b48",
"metadata": {},
"source": [
"## Step 5: Baseline Model - Logistic Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a900d26",
"metadata": {},
"outputs": [],
"source": [
"def evaluate_model(pipeline, X_tr, y_tr, X_v, y_v, le, model_name='Model'):\n y_tr_pred = pipeline.predict(X_tr)\n y_v_pred = pipeline.predict(X_v)\n results = {\n 'model': model_name,\n 'train_accuracy': accuracy_score(y_tr, y_tr_pred),\n 'val_accuracy': accuracy_score(y_v, y_v_pred),\n 'train_f1_macro': f1_score(y_tr, y_tr_pred, average='macro'),\n 'val_f1_macro': f1_score(y_v, y_v_pred, average='macro'),\n }\n f1_per_class = f1_score(y_v, y_v_pred, average=None)\n for i, cls in enumerate(le.classes_):\n results[f'val_f1_{cls}'] = f1_per_class[i]\n return results\n\ndef plot_confusion_matrix(pipeline, X_v, y_v, le, title, save_path):\n y_pred = pipeline.predict(X_v)\n fig, ax = plt.subplots(figsize=(8, 6))\n cm = confusion_matrix(y_v, y_pred)\n disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=le.classes_)\n disp.plot(ax=ax, cmap='Blues', values_format='d')\n ax.set_title(title, fontsize=14)\n plt.tight_layout()\n plt.savefig(save_path, dpi=150)\n plt.show()\n return cm"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8992d98",
"metadata": {},
"outputs": [],
"source": [
"print('Training Baseline: Logistic Regression...')\nbaseline_pipeline = Pipeline(steps=[\n ('preprocessor', preprocessor),\n ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000, random_state=RANDOM_STATE, n_jobs=-1))\n])\nbaseline_pipeline.fit(X_train, y_train_enc)\n\nbaseline_results = evaluate_model(baseline_pipeline, X_train, y_train_enc, X_val, y_val_enc, le_target, 'Baseline_LR')\n\nprint('\\n=== BASELINE MODEL RESULTS ===')\nfor k, v in baseline_results.items():\n if k != 'model':\n print(f'{k}: {v:.4f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ff29071",
"metadata": {},
"outputs": [],
"source": [
"plot_confusion_matrix(baseline_pipeline, X_val, y_val_enc, le_target,\n 'Baseline: Logistic Regression - Confusion Matrix',\n os.path.join(OUTPUT_DIR, 'figures', 'baseline_confusion_matrix.png'))\n\nprint('\\n=== CLASSIFICATION REPORT (VAL) ===')\ny_val_pred = baseline_pipeline.predict(X_val)\nprint(classification_report(y_val_enc, y_val_pred, target_names=le_target.classes_))\n\nall_results = [baseline_results]\npd.DataFrame(all_results).to_csv(\n os.path.join(OUTPUT_DIR, 'tables', 'model_comparison_summary.csv'), index=False)"
]
},
{
"cell_type": "markdown",
"id": "8675fd8e",
"metadata": {},
"source": [
"## Step 6: Controlled Comparison - Random Forest vs XGBoost"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30cd02ce",
"metadata": {},
"outputs": [],
"source": [
"import time\n\nprint('Training Random Forest...')\nstart = time.time()\nrf_pipeline = Pipeline(steps=[\n ('preprocessor', preprocessor),\n ('classifier', RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=RANDOM_STATE, n_jobs=-1))\n])\nrf_pipeline.fit(X_train, y_train_enc)\nrf_time = time.time() - start\n\nrf_results = evaluate_model(rf_pipeline, X_train, y_train_enc, X_val, y_val_enc, le_target, 'RandomForest')\nrf_results['train_time'] = rf_time\n\nprint('Training XGBoost...')\nstart = time.time()\nxgb_pipeline = Pipeline(steps=[\n ('preprocessor', preprocessor),\n ('classifier', xgb.XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=6,\n objective='multi:softmax', num_class=3,\n tree_method='gpu_hist', device='cuda', random_state=RANDOM_STATE, verbosity=0))\n])\nxgb_pipeline.fit(X_train, y_train_enc)\nxgb_time = time.time() - start\n\nxgb_results = evaluate_model(xgb_pipeline, X_train, y_train_enc, X_val, y_val_enc, le_target, 'XGBoost')\nxgb_results['train_time'] = xgb_time\n\nprint(f'RF time: {rf_time:.2f}s | XGB time: {xgb_time:.2f}s')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "814e6787",
"metadata": {},
"outputs": [],
"source": [
"all_results.append(rf_results)\nall_results.append(xgb_results)\nresults_df = pd.DataFrame(all_results)\n\nprint('\\n=== MODEL COMPARISON SUMMARY ===')\ndisplay_cols = ['model', 'train_accuracy', 'val_accuracy', 'train_f1_macro', 'val_f1_macro', 'train_time']\nprint(results_df[display_cols].round(4).to_string(index=False))\n\nprint('\\n=== CLASS-WISE F1 (VAL) ===')\nclass_cols = [c for c in results_df.columns if c.startswith('val_f1_') and c != 'val_f1_macro']\nprint(results_df[['model'] + class_cols].round(4).to_string(index=False))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "704d4061",
"metadata": {},
"outputs": [],
"source": [
"fig, axes = plt.subplots(1, 2, figsize=(14, 5))\nmodels = results_df['model'].tolist()\nval_f1 = results_df['val_f1_macro'].tolist()\nval_acc = results_df['val_accuracy'].tolist()\n\nbars1 = axes[0].bar(models, val_f1, color=['#2196F3', '#4CAF50', '#FF9800'])\naxes[0].set_title('Validation Macro-F1 Comparison', fontsize=13)\naxes[0].set_ylabel('Macro-F1')\naxes[0].set_ylim(0, 1)\nfor bar, val in zip(bars1, val_f1):\n axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, f'{val:.4f}', ha='center')\n\nbars2 = axes[1].bar(models, val_acc, color=['#2196F3', '#4CAF50', '#FF9800'])\naxes[1].set_title('Validation Accuracy Comparison', fontsize=13)\naxes[1].set_ylabel('Accuracy')\naxes[1].set_ylim(0, 1)\nfor bar, val in zip(bars2, val_acc):\n axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, f'{val:.4f}', ha='center')\n\nplt.tight_layout()\nplt.savefig(os.path.join(OUTPUT_DIR, 'figures', 'model_comparison.png'), dpi=150)\nplt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89747cf4",
"metadata": {},
"outputs": [],
"source": [
"plot_confusion_matrix(rf_pipeline, X_val, y_val_enc, le_target,\n 'Random Forest - Confusion Matrix',\n os.path.join(OUTPUT_DIR, 'figures', 'rf_confusion_matrix.png'))\n\nplot_confusion_matrix(xgb_pipeline, X_val, y_val_enc, le_target,\n 'XGBoost - Confusion Matrix',\n os.path.join(OUTPUT_DIR, 'figures', 'xgb_confusion_matrix.png'))"
]
},
{
"cell_type": "markdown",
"id": "d9e3d57d",
"metadata": {},
"source": [
"### Bagging vs Boosting Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81508463",
"metadata": {},
"outputs": [],
"source": [
"print('=== BAGGING VS BOOSTING ANALYSIS ===')\nrf_val_f1 = rf_results['val_f1_macro']\nrf_train_f1 = rf_results['train_f1_macro']\nrf_gap = rf_train_f1 - rf_val_f1\n\nxgb_val_f1 = xgb_results['val_f1_macro']\nxgb_train_f1 = xgb_results['train_f1_macro']\nxgb_gap = xgb_train_f1 - xgb_val_f1\n\nprint(f'Random Forest - val_f1_macro: {rf_val_f1:.4f}, overfitting gap: {rf_gap:.4f}')\nprint(f'XGBoost - val_f1_macro: {xgb_val_f1:.4f}, overfitting gap: {xgb_gap:.4f}')"
]
},
{
"cell_type": "markdown",
"id": "de4a5bc9",
"metadata": {},
"source": [
"## Step 7: Advanced Hyperparameter Optimisation (Optuna)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6361576",
"metadata": {},
"outputs": [],
"source": [
"def objective(trial):\n params = {\n 'n_estimators': trial.suggest_int('n_estimators', 100, 500),\n 'max_depth': trial.suggest_int('max_depth', 3, 10),\n 'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),\n 'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),\n 'subsample': trial.suggest_float('subsample', 0.5, 1.0),\n 'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),\n 'gamma': trial.suggest_float('gamma', 0, 5),\n 'reg_alpha': trial.suggest_float('reg_alpha', 1e-4, 10.0, log=True),\n 'reg_lambda': trial.suggest_float('reg_lambda', 1e-4, 10.0, log=True),\n 'objective': 'multi:softmax',\n 'num_class': 3,\n 'random_state': RANDOM_STATE,\n 'tree_method': 'gpu_hist', 'device': 'cuda',\n 'verbosity': 0\n }\n pipeline = Pipeline(steps=[\n ('preprocessor', preprocessor),\n ('classifier', xgb.XGBClassifier(**params))\n ])\n pipeline.fit(X_train, y_train_enc)\n y_pred = pipeline.predict(X_val)\n score = f1_score(y_val_enc, y_pred, average='macro')\n return score\n\nprint('Starting Optuna hyperparameter optimisation (30 trials)...')\nstudy = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE))\nstudy.optimize(objective, n_trials=30, show_progress_bar=False)\n\nprint(f'Best trial: {study.best_trial.number} | Best macro-F1: {study.best_value:.4f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7ba4f2a",
"metadata": {},
"outputs": [],
"source": [
"print('\\n=== BEST HYPERPARAMETERS ===')\nbest_params = study.best_params\nfor k, v in best_params.items():\n print(f' {k}: {v}')\n\nfig = optuna.visualization.matplotlib.plot_optimization_history(study)\nplt.title('Optuna Optimization History')\nplt.tight_layout()\nplt.savefig(os.path.join(OUTPUT_DIR, 'figures', 'optuna_optimization_history.png'), dpi=150)\nplt.show()\n\nfig = optuna.visualization.matplotlib.plot_param_importances(study)\nplt.title('Hyperparameter Importance')\nplt.tight_layout()\nplt.savefig(os.path.join(OUTPUT_DIR, 'figures', 'optuna_param_importance.png'), dpi=150)\nplt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "640263ea",
"metadata": {},
"outputs": [],
"source": [
"best_xgb_params = {\n **study.best_params,\n 'objective': 'multi:softmax',\n 'num_class': 3,\n 'random_state': RANDOM_STATE,\n 'tree_method': 'gpu_hist', 'device': 'cuda',\n 'verbosity': 0\n}\n\nprint('Training tuned XGBoost...')\nimport time\nstart = time.time()\ntuned_xgb_pipeline = Pipeline(steps=[\n ('preprocessor', preprocessor),\n ('classifier', xgb.XGBClassifier(**best_xgb_params))\n])\ntuned_xgb_pipeline.fit(X_train, y_train_enc)\ntuned_time = time.time() - start\n\ntuned_results = evaluate_model(tuned_xgb_pipeline, X_train, y_train_enc, X_val, y_val_enc, le_target, 'XGBoost_Tuned')\ntuned_results['train_time'] = tuned_time\n\nprint('\\n=== TUNED XGBOOST RESULTS ===')\nfor k, v in tuned_results.items():\n if k != 'model':\n print(f'{k}: {v:.4f}')\n\nprint(f'\\nTuning improvement (macro-F1): +{tuned_results[\"val_f1_macro\"] - xgb_results[\"val_f1_macro\"]:.4f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19742e63",
"metadata": {},
"outputs": [],
"source": [
"all_results.append(tuned_results)\nresults_df = pd.DataFrame(all_results)\n\nprint('\\n=== BEFORE VS AFTER TUNING ===')\nprint(results_df[['model', 'val_f1_macro', 'val_accuracy', 'train_time']].round(4).to_string(index=False))"
]
},
{
"cell_type": "markdown",
"id": "d01bcca7",
"metadata": {},
"source": [
"## Step 8: Personalised Improvement (Category A + Category D)\n",
"\n",
"**Student ID last digit = 0 → Category A (Compulsory) + Category D (Optional)**\n",
"\n",
"- **Category A** (Data Quality & Missingness): Add missing value indicator features\n",
"- **Category D** (Robustness & Ensemble): Soft Voting Ensemble (RF + XGBoost)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f662833",
"metadata": {},
"outputs": [],
"source": [
"print('=== CATEGORY A: IMPROVED MISSING VALUE HANDLING ===')\n\nMISSING_COLS = ['net_monthly_income_gbp', 'avg_payment_delay_days', 'monthly_investment_gbp',\n 'prior_debt_products', 'account_tenure']\n\nfor col in MISSING_COLS:\n missing_col_name = f'{col}_missing'\n train_df_clean[missing_col_name] = train_df_clean[col].isnull().astype(int)\n val_df_clean[missing_col_name] = val_df_clean[col].isnull().astype(int)\n test_df_clean[missing_col_name] = test_df_clean[col].isnull().astype(int)\n print(f'Added missing indicator: {missing_col_name}')\n\nfeature_cols_catA = feature_cols_all + [f'{c}_missing' for c in MISSING_COLS]\nprint(f'\\nFeature columns after adding indicators: {len(feature_cols_catA)}')\n\nX_train_A = train_df_clean[feature_cols_catA]\nX_val_A = val_df_clean[feature_cols_catA]\nX_test_A = test_df_clean[feature_cols_catA]\n\nNUMERIC_FEATURES_A = X_train_A.select_dtypes(include=[np.number]).columns.tolist()\nCATEGORICAL_FEATURES_A = X_train_A.select_dtypes(include=['object']).columns.tolist()\n\npreprocessor_A = ColumnTransformer(\n transformers=[\n ('num', numeric_transformer, NUMERIC_FEATURES_A),\n ('cat', categorical_transformer, CATEGORICAL_FEATURES_A)\n ],\n remainder='drop'\n)\n\ncatA_pipeline = Pipeline(steps=[\n ('preprocessor', preprocessor_A),\n ('classifier', xgb.XGBClassifier(**best_xgb_params))\n])\ncatA_pipeline.fit(X_train_A, y_train_enc)\n\ncatA_results = evaluate_model(catA_pipeline, X_train_A, y_train_enc, X_val_A, y_val_enc, le_target, 'XGB_CatA_MissingHandling')\n\nprint('\\n=== CATEGORY A RESULTS ===')\nprint(f'val_f1_macro: {catA_results[\"val_f1_macro\"]:.4f}')\nprint(f'val_accuracy: {catA_results[\"val_accuracy\"]:.4f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0069e9d5",
"metadata": {},
"outputs": [],
"source": [
"print('=== CATEGORY D: SOFT VOTING ENSEMBLE ===')\nprint('Training Soft Voting Ensemble (RF + XGBoost)...')\n\nrf_clf = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=RANDOM_STATE, n_jobs=-1)\nxgb_clf = xgb.XGBClassifier(**best_xgb_params)\n\nvoting_clf = VotingClassifier(\n estimators=[\n ('rf', rf_clf),\n ('xgb', xgb_clf)\n ],\n voting='soft',\n n_jobs=-1\n)\n\nensemble_pipeline = Pipeline(steps=[\n ('preprocessor', preprocessor),\n ('classifier', voting_clf)\n])\nensemble_pipeline.fit(X_train, y_train_enc)\n\nensemble_results = evaluate_model(ensemble_pipeline, X_train, y_train_enc, X_val, y_val_enc, le_target, 'Ensemble_SoftVoting')\n\nprint(f'Ensemble val_f1_macro: {ensemble_results[\"val_f1_macro\"]:.4f}')\nprint(f'Ensemble val_accuracy: {ensemble_results[\"val_accuracy\"]:.4f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c95e0008",
"metadata": {},
"outputs": [],
"source": [
"all_results.append(catA_results)\nall_results.append(ensemble_results)\nresults_df = pd.DataFrame(all_results)\n\nprint('\\n=== PERSONALISED IMPROVEMENT SUMMARY ===')\nprint(results_df[['model', 'val_f1_macro', 'val_accuracy']].round(4).to_string(index=False))\n\nresults_df.to_csv(\n os.path.join(OUTPUT_DIR, 'tables', 'personalised_improvement_summary.csv'), index=False)\n\nimprove_A = catA_results['val_f1_macro'] - tuned_results['val_f1_macro']\nimprove_D = ensemble_results['val_f1_macro'] - tuned_results['val_f1_macro']\nprint(f'\\nCategory A improvement (vs Tuned): +{improve_A:.4f}')\nprint(f'Category D improvement (vs Tuned): +{improve_D:.4f}')"
]
},
{
"cell_type": "markdown",
"id": "df4d2cc2",
"metadata": {},
"source": [
"## Step 9: K-Means & GMM Unsupervised Exploration"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ddfd4d3",
"metadata": {},
"outputs": [],
"source": [
"print('=== K-MEANS & GMM CLUSTERING ===')\n\npreprocessor_eval = ColumnTransformer(\n transformers=[\n ('num', numeric_transformer, NUMERIC_FEATURES),\n ('cat', categorical_transformer, CATEGORICAL_FEATURES)\n ],\n remainder='drop'\n)\n\nX_train_scaled = preprocessor_eval.fit_transform(X_train)\nprint(f'Scaled training data shape: {X_train_scaled.shape}')\n\npca = PCA(n_components=2, random_state=RANDOM_STATE)\nX_train_pca = pca.fit_transform(X_train_scaled)\nprint(f'PCA explained variance: {pca.explained_variance_ratio_.sum():.4f}')\n\nk_range = range(2, 9)\nkmeans_results = []\ngmm_results = []\n\nfor k in k_range:\n print(f' Running k={k}...')\n \n km = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10)\n km_labels = km.fit_predict(X_train_scaled)\n sil_km = silhouette_score(X_train_scaled, km_labels)\n \n gmm_model = GaussianMixture(n_components=k, random_state=RANDOM_STATE, n_init=5)\n gmm_labels = gmm_model.fit_predict(X_train_scaled)\n sil_gmm = silhouette_score(X_train_scaled, gmm_labels)\n \n kmeans_results.append({\n 'k': k,\n 'inertia': km.inertia_,\n 'silhouette_x': sil_km\n })\n gmm_results.append({\n 'k': k,\n 'log_likelihood': gmm_model.score(X_train_scaled) * X_train_scaled.shape[0],\n 'bic': gmm_model.bic(X_train_scaled),\n 'aic': gmm_model.aic(X_train_scaled),\n 'silhouette_y': sil_gmm\n })\n\nkm_df = pd.DataFrame(kmeans_results)\ngmm_df = pd.DataFrame(gmm_results)\ncluster_df = km_df.merge(gmm_df, on='k')\nprint('\\n=== CLUSTERING COMPARISON ===')\nprint(cluster_df.round(4).to_string(index=False))\n\ncluster_df.to_csv(os.path.join(OUTPUT_DIR, 'tables', 'clustering_comparison.csv'), index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d438c228",
"metadata": {},
"outputs": [],
"source": [
"fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n\naxes[0].plot(cluster_df['k'], cluster_df['inertia'], 'bo-', label='K-Means Inertia', linewidth=2)\naxes[0].set_xlabel('k')\naxes[0].set_ylabel('Inertia')\naxes[0].set_title('K-Means: Elbow Method')\naxes[0].grid(True)\n\naxes[1].plot(cluster_df['k'], cluster_df['bic'], 'g^-', label='BIC', linewidth=2)\naxes[1].plot(cluster_df['k'], cluster_df['aic'], 'rs--', label='AIC', linewidth=2)\naxes[1].set_xlabel('k')\naxes[1].set_ylabel('Score')\naxes[1].set_title('GMM: BIC & AIC (lower is better)')\naxes[1].legend()\naxes[1].grid(True)\n\naxes[2].plot(cluster_df['k'], cluster_df['silhouette_x'], 'bo-', label='K-Means', linewidth=2)\naxes[2].plot(cluster_df['k'], cluster_df['silhouette_y'], 'g^-', label='GMM', linewidth=2)\naxes[2].set_xlabel('k')\naxes[2].set_ylabel('Silhouette Score')\naxes[2].set_title('Silhouette Score Comparison (higher is better)')\naxes[2].legend()\naxes[2].grid(True)\n\nplt.tight_layout()\nplt.savefig(os.path.join(OUTPUT_DIR, 'figures', 'clustering_comparison.png'), dpi=150)\nplt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08ba45ed",
"metadata": {},
"outputs": [],
"source": [
"best_k = cluster_df.loc[cluster_df['silhouette_x'].idxmax(), 'k']\nprint(f'Best K for K-Means (by silhouette): {best_k}')\n\nkm_best = KMeans(n_clusters=int(best_k), random_state=RANDOM_STATE, n_init=10)\nkm_best_labels = km_best.fit_predict(X_train_scaled)\n\nfig, ax = plt.subplots(figsize=(8, 6))\nscatter = ax.scatter(X_train_pca[:, 0], X_train_pca[:, 1],\n c=km_best_labels, cmap='viridis', alpha=0.5, s=10)\nax.set_xlabel('PC1')\nax.set_ylabel('PC2')\nax.set_title(f'K-Means Clustering (k={best_k}) - PCA Visualization')\nplt.colorbar(scatter, ax=ax, label='Cluster')\nplt.tight_layout()\nplt.savefig(os.path.join(OUTPUT_DIR, 'figures', 'clustering_visualization.png'), dpi=150)\nplt.show()"
]
},
{
"cell_type": "markdown",
"id": "48c4ad67",
"metadata": {},
"source": [
"## Step 10: Final Model Selection & Hidden-Test Export"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "34692aa5",
"metadata": {},
"outputs": [],
"source": [
"print('=== FINAL MODEL SELECTION ===')\nprint('Based on val_f1_macro (primary metric):')\nfinal_model_name = results_df.loc[results_df['val_f1_macro'].idxmax(), 'model']\nprint(f'Selected model: {final_model_name} (val_f1_macro = {results_df[\"val_f1_macro\"].max():.4f})')\n\nif final_model_name == 'XGB_CatA_MissingHandling':\n final_pipeline = catA_pipeline\n X_test_final = X_test_A\nelif final_model_name == 'Ensemble_SoftVoting':\n final_pipeline = ensemble_pipeline\n X_test_final = X_test\nelse:\n final_pipeline = tuned_xgb_pipeline\n X_test_final = X_test\n\ny_val_final_pred = final_pipeline.predict(X_test_final if final_model_name == 'XGBoost_Tuned' else X_test)\ny_val_final_decoded = le_target.inverse_transform(y_val_final_pred)\n\nplot_confusion_matrix(final_pipeline, X_val_A if final_model_name == 'XGB_CatA_MissingHandling' else X_val,\n y_val_enc, le_target,\n f'Final Model: {final_model_name} - Confusion Matrix',\n os.path.join(OUTPUT_DIR, 'figures', 'final_model_confusion_matrix.png'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5d2526b",
"metadata": {},
"outputs": [],
"source": [
"print('\\n=== FINAL CLASSIFICATION REPORT (VAL) ===')\ny_val_pred_final = final_pipeline.predict(X_val_A if final_model_name == 'XGB_CatA_MissingHandling' else X_val)\nprint(classification_report(y_val_enc, y_val_pred_final, target_names=le_target.classes_))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89d712c0",
"metadata": {},
"outputs": [],
"source": [
"STUDENT_ID = '1234560'\n\nif final_model_name == 'XGB_CatA_MissingHandling':\n y_test_pred = final_pipeline.predict(X_test_A)\nelif final_model_name == 'Ensemble_SoftVoting':\n y_test_pred = final_pipeline.predict(X_test)\nelse:\n y_test_pred = final_pipeline.predict(X_test)\n\ny_test_labels = le_target.inverse_transform(y_test_pred)\n\nsubmission_df = pd.DataFrame({\n 'applicant_id': test_df['applicant_id'],\n 'customer_key': test_df['customer_key'],\n 'premium_risk': y_test_labels\n})\n\nprint('=== SUBMISSION CSV VALIDATION ===')\nprint(f'Shape: {submission_df.shape}')\nprint(f'Columns: {list(submission_df.columns)}')\nprint(submission_df.head())\n\nprint('\\nPrediction counts:')\nprint(submission_df['premium_risk'].value_counts())\n\ncsv_path = os.path.join(OUTPUT_DIR, 'predictions', f'test_result_{STUDENT_ID}.csv')\nsubmission_df.to_csv(csv_path, index=False)\nprint(f'\\n*** CSV saved to: {csv_path} ***')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "my_env",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
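The notebook's Category A step builds missingness indicators by hand with `isnull().astype(int)`. For reference, scikit-learn can produce the same 0/1 columns via `SimpleImputer(add_indicator=True)`; a minimal sketch on toy data (not the coursework dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for a column like net_monthly_income_gbp
df = pd.DataFrame({'income': [1000.0, np.nan, 3000.0, np.nan]})

# Manual indicator, as in the notebook's Step 8
df['income_missing'] = df['income'].isnull().astype(int)

# Built-in equivalent: add_indicator appends a 0/1 missingness column
imp = SimpleImputer(strategy='median', add_indicator=True)
out = imp.fit_transform(df[['income']])  # column 0 imputed, column 1 indicator

assert (out[:, 1] == df['income_missing'].to_numpy()).all()
```

Either route preserves the "was this value missing?" signal that imputation alone would erase, which is the point of the Category A improvement.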
File diff suppressed because it is too large
@@ -0,0 +1,8 @@
k,inertia,silhouette_x,log_likelihood,bic,aic,silhouette_y
2,1092962.434364126,0.174016661115075,181335.84491703784,-359250.54291550705,-362061.6898340757,0.41420390111182703
3,1018586.5047121042,0.17317021187208304,554291.2303605897,-1103445.131905755,-1107666.4607211794,0.2977020104302583
4,953249.4382030136,0.18080059886795355,972834.1094461675,-1938814.7081800548,-1944446.218892335,0.3964327255424141
5,889284.892342685,0.1964251564081267,1002913.0930748597,-1997256.4935405836,-2004298.1861497194,0.40146893512413845
6,818950.9117652641,0.17683056672008368,1180025.734163945,-2349765.5938218986,-2358217.46832789,0.24683353848428613
7,777658.2185885893,0.197056012688701,1203191.531501821,-2394381.006600795,-2404243.063003642,0.3109553553475885
8,691940.8330833976,0.20149802939267383,1261969.3739466753,-2510220.5095936474,-2521492.7478933507,0.17264064800570944
@@ -0,0 +1,5 @@
model,train_accuracy,val_accuracy,train_f1_macro,val_f1_macro,val_f1_High,val_f1_Low,val_f1_Standard,train_time
Baseline_LR,0.7593680672268908,0.7341714285714286,0.7492574544185482,0.7237629331592531,0.7665209565440987,0.6489501312335958,0.7558177117000646,
RandomForest,1.0,0.7877333333333333,1.0,0.770789728543472,0.7874554916461244,0.7095334685598377,0.8153802254244543,57.91048526763916
XGBoost,0.8519529411764706,0.8371047619047619,0.8297116592669606,0.8143842728003406,0.8904623073719283,0.6944039941751612,0.8582865168539325,67.63970804214478
XGBoost_Tuned,0.9767663865546219,0.8700190476190476,0.9739400525375727,0.8519502714571496,0.9084439578486383,0.7620280474649407,0.8853788090578697,142.65462470054626
@@ -0,0 +1,7 @@
model,train_accuracy,val_accuracy,train_f1_macro,val_f1_macro,val_f1_High,val_f1_Low,val_f1_Standard,train_time
Baseline_LR,0.7593680672268908,0.7341714285714286,0.7492574544185482,0.7237629331592531,0.7665209565440987,0.6489501312335958,0.7558177117000646,
RandomForest,1.0,0.7877333333333333,1.0,0.770789728543472,0.7874554916461244,0.7095334685598377,0.8153802254244543,57.91048526763916
XGBoost,0.8519529411764706,0.8371047619047619,0.8297116592669606,0.8143842728003406,0.8904623073719283,0.6944039941751612,0.8582865168539325,67.63970804214478
XGBoost_Tuned,0.9767663865546219,0.8700190476190476,0.9739400525375727,0.8519502714571496,0.9084439578486383,0.7620280474649407,0.8853788090578697,142.65462470054626
XGB_CatA_MissingHandling,0.9772638655462185,0.870552380952381,0.9745439553742655,0.8529411889528661,0.910207423580786,0.763542562338779,0.885073580939033,
Ensemble_SoftVoting,0.9972436974789916,0.8675047619047619,0.9969472283391928,0.851001101708816,0.9024125779343996,0.7684120902511707,0.8821786369408776,
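The Ensemble_SoftVoting row in the table above comes from the notebook's Category D model, which averages class probabilities across its base estimators. A minimal standalone sketch of the mechanism on synthetic data (scikit-learn, not the coursework pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the premium-risk task
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
lr = LogisticRegression(max_iter=1000, random_state=42)

vote = VotingClassifier(estimators=[('rf', rf), ('lr', lr)], voting='soft')
vote.fit(X, y)

# Soft voting = argmax of the (equally weighted) averaged class probabilities
avg_proba = (vote.named_estimators_['rf'].predict_proba(X)
             + vote.named_estimators_['lr'].predict_proba(X)) / 2
manual_pred = avg_proba.argmax(axis=1)
assert (manual_pred == vote.predict(X)).all()
```

Averaging probabilities rather than hard votes is what lets a confident minority estimator outvote an uncertain majority, which is why the notebook uses `voting='soft'`.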
@@ -0,0 +1,138 @@
[project]
name = "insurance-premium-risk"
version = "0.1.0"
description = "DTS304TC Coursework 1 - 健康保险保费风险预测"
requires-python = ">=3.10"
dependencies = [
# Core scientific computing
"numpy>=2.2.6",
"pandas>=2.3.3",
"scipy>=1.15.3",
# Machine learning core
"scikit-learn>=1.7.2",
"xgboost>=3.2.0",
# Hyperparameter optimisation
"optuna>=4.8.0",
# Visualisation
"matplotlib>=3.10.8",
"seaborn>=0.13.2",
# Data preprocessing and model explanation
"imbalanced-learn>=0.14.1",
"shap>=0.49.1",
# Utilities
"joblib>=1.5.3",
"tqdm>=4.67.1",
]
# Development dependencies
[project.optional-dependencies]
dev = ["pytest>=7.4.0", "black>=23.7.0", "ruff>=0.0.290", "mypy>=1.5.0"]
docs = ["sphinx>=7.1.0", "nbsphinx>=0.9.0", "sphinx-rtd-theme>=1.3.0"]
[tool.uv]
# Python environment settings - use the Anaconda my_env environment
python-version = "3.10"
# Explicitly pin the Python path (ensures the Anaconda my_env interpreter is used)
python-path = "D:\\ProgramData\\anaconda3\\envs\\my_env\\python.exe"
# Package index (Tsinghua mirror to speed up downloads)
index-url = "https://pypi.tuna.tsinghua.edu.cn/simple"
# Fallback indexes (in priority order)
extra-index-url = [
"https://pypi.org/simple",
"https://mirrors.aliyun.com/pypi/simple/",
]
# Parallel installation (faster installs)
parallel = true
# Do not auto-manage Python versions (use the existing Anaconda environment)
managed = false
# Do not allow automatic Python downloads
python-downloads = false
# Pre-release versions (enable if testing the latest features)
# prerelease = false
# Exclude specific packages (in case of compatibility issues)
# exclude-newer = "2025-01-01T00:00:00Z"
# Environment variables
[tool.uv.env]
# Network timeout (seconds)
UV_HTTP_TIMEOUT = "300"
# Number of concurrent downloads
UV_CONCURRENT_DOWNLOADS = "8"
[tool.black]
line-length = 100
target-version = ['py310']
include = '\.pyi?$'
exclude = '''
/(
\.git
| \.venv
| build
| dist
| __pycache__
)/
'''
[tool.ruff]
line-length = 100
target-version = "py310"
[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # pyflakes
"I", # isort
"B", # flake8-bugbear
"C4", # flake8-comprehensions
"UP", # pyupgrade
]
ignore = [
"E501", # line too long (handled by black)
"B008", # do not perform function calls in argument defaults
"C901", # too complex
]
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
addopts = "-v --tb=short"
[tool.mypy]
python_version = "3.10"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = false
ignore_missing_imports = true
[tool.jupyter]
# Jupyter configuration
kernel_name = "my_env"
[tool.jupyter.lab]
# JupyterLab configuration
autoreload = true
[tool.lazy-logs]
# Logging configuration
level = "INFO"
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
@@ -0,0 +1,65 @@
"""
运行 insurance_premium_risk.ipynb 的脚本
将 notebook 代码单元格提取出来逐个执行
"""
import json, sys, os, warnings, traceback, time
warnings.filterwarnings('ignore')
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as _real_mpl_plt
_real_mpl_plt.show = lambda *a, **kw: None
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import xgboost as xgb
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
sns.set_style('whitegrid')
# ===== Load the notebook =====
nb_path = r'd:\Code\doing_exercises\programs\外教作业外快\强化学习个人课程作业报告\notebooks\insurance_premium_risk.ipynb'
cells = json.load(open(nb_path, encoding='utf-8'))['cells']
code_cells = [c for c in cells if c['cell_type'] == 'code']
print(f"Total code cells: {len(code_cells)}")
# ===== Execute each cell =====
# Shared __main__-style namespace: variables persist across cells
main_ns = globals().copy()
for i, cell in enumerate(code_cells):
    src = ''.join(cell['source'])
    print(f"\n{'='*60}")
    print(f"Running cell {i+1}/{len(code_cells)}...")
    print(f"  Source: {src[:80].replace(chr(10), ' ')}")
    try:
        exec(compile(src, f'cell_{i+1}', 'exec'), main_ns)
    except Exception as e:
        print(f"ERROR in cell {i+1}: {e}")
        traceback.print_exc()
        print("Stopping execution.")
        break
else:
    # for/else: only reached when the loop finished without a break
    print("\n\nAll cells executed successfully!")
    print("Results saved to: outputs/figures/ and outputs/tables/")
@@ -0,0 +1,32 @@
"""
Part 2: 运行完整的 notebook cells 1-35
解决中文路径编码问题
"""
import warnings, time, os, sys, json, traceback
warnings.filterwarnings('ignore')
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as _p
_p.show = lambda *a, **kw: None
nb = r'D:\Code\doing_exercises\programs\外教作业外快\强化学习个人课程作业报告\notebooks\insurance_premium_risk.ipynb'
cells = json.load(open(nb, encoding='utf-8'))['cells']
code_cells = [c for c in cells if c['cell_type'] == 'code']
print(f"Total code cells: {len(code_cells)}")
main_ns = globals().copy()
main_ns['RANDOM_STATE'] = 42
for i, cell in enumerate(code_cells, start=1):
    src = ''.join(cell['source'])
    print(f"\n{'='*60}")
    print(f"Running cell {i}/{len(code_cells)}...")
    try:
        exec(compile(src, f'cell_{i}', 'exec'), main_ns)
    except Exception as e:
        print(f"ERROR cell {i}: {e}")
        traceback.print_exc()
        print("Stopping.")
        break
else:
    # for-else: only reached when every cell ran without error
    print("\n\nAll cells executed!")
@@ -0,0 +1,17 @@
\relax
\providecommand\hyper@newdestlabel[2]{}
\providecommand*\HyPL@Entry[1]{}
\HyPL@Entry{0<</S/D>>}
\@writefile{toc}{\contentsline {section}{\numberline {1}Bagging vs Boosting}{1}{section.1}\protected@file@percent }
\@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces Controlled supervised model comparison (identical pipeline and split).}}{1}{table.caption.1}\protected@file@percent }
\providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
\newlabel{tab: supervised-comparison}{{1}{1}{Controlled supervised model comparison (identical pipeline and split)}{table.caption.1}{}}
\@writefile{toc}{\contentsline {section}{\numberline {2}Hyperparameter Optimisation}{1}{section.2}\protected@file@percent }
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Optuna parameter importance. Larger bars indicate higher influence on validation macro-F1.}}{2}{figure.caption.2}\protected@file@percent }
\newlabel{fig: param-importance}{{1}{2}{Optuna parameter importance. Larger bars indicate higher influence on validation macro-F1}{figure.caption.2}{}}
\@writefile{toc}{\contentsline {section}{\numberline {3}K-Means vs GMM}{2}{section.3}\protected@file@percent }
\@writefile{lot}{\contentsline {table}{\numberline {2}{\ignorespaces Full clustering comparison across k=2 to k=8.}}{3}{table.caption.3}\protected@file@percent }
\newlabel{tab: clustering}{{2}{3}{Full clustering comparison across k=2 to k=8}{table.caption.3}{}}
\@writefile{toc}{\contentsline {section}{\numberline {4}Personalised Improvement Reflection}{3}{section.4}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {5}AI Use Declaration}{4}{section.5}\protected@file@percent }
\gdef \@abspage@last{4}
@@ -0,0 +1,660 @@
This is XeTeX, Version 3.141592653-2.6-0.999997 (TeX Live 2025) (preloaded format=xelatex 2025.6.5) 25 APR 2026 01:38
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
**theory_and_reflection_1234560.tex
(./theory_and_reflection_1234560.tex
LaTeX2e <2024-11-01> patch level 2
L3 programming layer <2025-01-18>
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/base/article.cls
Document Class: article 2024/06/29 v1.4n Standard LaTeX document class
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/base/size11.clo
File: size11.clo 2024/06/29 v1.4n Standard LaTeX file (size option)
)
\c@part=\count192
\c@section=\count193
\c@subsection=\count194
\c@subsubsection=\count195
\c@paragraph=\count196
\c@subparagraph=\count197
\c@figure=\count198
\c@table=\count199
\abovecaptionskip=\skip49
\belowcaptionskip=\skip50
\bibindent=\dimen141
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/geometry/geometry.sty
Package: geometry 2020/01/02 v5.9 Page Geometry
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/graphics/keyval.sty
Package: keyval 2022/05/29 v1.15 key=value parser (DPC)
\KV@toks@=\toks17
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/iftex/ifvtex.sty
Package: ifvtex 2019/10/25 v1.7 ifvtex legacy package. Use iftex instead.
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/iftex/iftex.sty
Package: iftex 2024/12/12 v1.0g TeX engine tests
))
\Gm@cnth=\count266
\Gm@cntv=\count267
\c@Gm@tempcnt=\count268
\Gm@bindingoffset=\dimen142
\Gm@wd@mp=\dimen143
\Gm@odd@mp=\dimen144
\Gm@even@mp=\dimen145
\Gm@layoutwidth=\dimen146
\Gm@layoutheight=\dimen147
\Gm@layouthoffset=\dimen148
\Gm@layoutvoffset=\dimen149
\Gm@dimlist=\toks18
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/fontspec/fontspec.sty
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/l3packages/xparse/xpars
e.sty
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/l3kernel/expl3.sty
Package: expl3 2025-01-18 L3 programming layer (loader)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/l3backend/l3backend-xet
ex.def
File: l3backend-xetex.def 2024-05-08 L3 backend support: XeTeX
\g__graphics_track_int=\count269
\l__pdf_internal_box=\box52
\g__pdf_backend_annotation_int=\count270
\g__pdf_backend_link_int=\count271
))
Package: xparse 2024-08-16 L3 Experimental document command parser
)
Package: fontspec 2024/05/11 v2.9e Font selection for XeLaTeX and LuaLaTeX
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/fontspec/fontspec-xetex
.sty
Package: fontspec-xetex 2024/05/11 v2.9e Font selection for XeLaTeX and LuaLaTe
X
\l__fontspec_script_int=\count272
\l__fontspec_language_int=\count273
\l__fontspec_strnum_int=\count274
\l__fontspec_tmp_int=\count275
\l__fontspec_tmpa_int=\count276
\l__fontspec_tmpb_int=\count277
\l__fontspec_tmpc_int=\count278
\l__fontspec_em_int=\count279
\l__fontspec_emdef_int=\count280
\l__fontspec_strong_int=\count281
\l__fontspec_strongdef_int=\count282
\l__fontspec_tmpa_dim=\dimen150
\l__fontspec_tmpb_dim=\dimen151
\l__fontspec_tmpc_dim=\dimen152
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/base/fontenc.sty
Package: fontenc 2021/04/29 v2.0v Standard LaTeX package
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/fontspec/fontspec.cfg))
) (d:/settings/Language/texlive/2025/texmf-dist/tex/latex/graphics/graphicx.sty
Package: graphicx 2021/09/16 v1.2d Enhanced LaTeX Graphics (DPC,SPQR)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/graphics/graphics.sty
Package: graphics 2024/08/06 v1.4g Standard LaTeX Graphics (DPC,SPQR)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/graphics/trig.sty
Package: trig 2023/12/02 v1.11 sin cos tan (DPC)
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/graphics-cfg/graphics.c
fg
File: graphics.cfg 2016/06/04 v1.11 sample graphics configuration
)
Package graphics Info: Driver file: xetex.def on input line 106.
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/graphics-def/xetex.def
File: xetex.def 2022/09/22 v5.0n Graphics/color driver for xetex
))
\Gin@req@height=\dimen153
\Gin@req@width=\dimen154
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/booktabs/booktabs.sty
Package: booktabs 2020/01/12 v1.61803398 Publication quality tables
\heavyrulewidth=\dimen155
\lightrulewidth=\dimen156
\cmidrulewidth=\dimen157
\belowrulesep=\dimen158
\belowbottomsep=\dimen159
\aboverulesep=\dimen160
\abovetopsep=\dimen161
\cmidrulesep=\dimen162
\cmidrulekern=\dimen163
\defaultaddspace=\dimen164
\@cmidla=\count283
\@cmidlb=\count284
\@aboverulesep=\dimen165
\@belowrulesep=\dimen166
\@thisruleclass=\count285
\@lastruleclass=\count286
\@thisrulewidth=\dimen167
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/tools/array.sty
Package: array 2024/10/17 v2.6g Tabular extension package (FMi)
\col@sep=\dimen168
\ar@mcellbox=\box53
\extrarowheight=\dimen169
\NC@list=\toks19
\extratabsurround=\skip51
\backup@length=\skip52
\ar@cellbox=\box54
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/tools/tabularx.sty
Package: tabularx 2023/12/11 v2.12a `tabularx' package (DPC)
\TX@col@width=\dimen170
\TX@old@table=\dimen171
\TX@old@col=\dimen172
\TX@target=\dimen173
\TX@delta=\dimen174
\TX@cols=\count287
\TX@ftn=\toks20
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/float/float.sty
Package: float 2001/11/08 v1.3d Float enhancements (AL)
\c@float@type=\count288
\float@exts=\toks21
\float@box=\box55
\@float@everytoks=\toks22
\@floatcapt=\box56
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/hyperref/hyperref.sty
Package: hyperref 2024-11-05 v7.01l Hypertext links for LaTeX
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/kvsetkeys/kvsetkeys.sty
Package: kvsetkeys 2022-10-05 v1.19 Key value parser (HO)
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/kvdefinekeys/kvdefine
keys.sty
Package: kvdefinekeys 2019-12-19 v1.6 Define keys (HO)
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/pdfescape/pdfescape.s
ty
Package: pdfescape 2019/12/09 v1.15 Implements pdfTeX's escape features (HO)
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/ltxcmds/ltxcmds.sty
Package: ltxcmds 2023-12-04 v1.26 LaTeX kernel commands for general use (HO)
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/pdftexcmds/pdftexcmds
.sty
Package: pdftexcmds 2020-06-27 v0.33 Utility functions of pdfTeX for LuaTeX (HO
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/infwarerr/infwarerr.s
ty
Package: infwarerr 2019/12/03 v1.5 Providing info/warning/error messages (HO)
)
Package pdftexcmds Info: \pdf@primitive is available.
Package pdftexcmds Info: \pdf@ifprimitive is available.
Package pdftexcmds Info: \pdfdraftmode not found.
))
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/hycolor/hycolor.sty
Package: hycolor 2020-01-27 v1.10 Color options for hyperref/bookmark (HO)
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/hyperref/nameref.sty
Package: nameref 2023-11-26 v2.56 Cross-referencing by name of section
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/refcount/refcount.sty
Package: refcount 2019/12/15 v3.6 Data extraction from label references (HO)
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/gettitlestring/gettit
lestring.sty
Package: gettitlestring 2019/12/15 v1.6 Cleanup title references (HO)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/kvoptions/kvoptions.sty
Package: kvoptions 2022-06-15 v3.15 Key value format for package options (HO)
))
\c@section@level=\count289
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/etoolbox/etoolbox.sty
Package: etoolbox 2025/02/11 v2.5l e-TeX tools for LaTeX (JAW)
\etb@tempcnta=\count290
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/stringenc/stringenc.s
ty
Package: stringenc 2019/11/29 v1.12 Convert strings between diff. encodings (HO
)
)
\@linkdim=\dimen175
\Hy@linkcounter=\count291
\Hy@pagecounter=\count292
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/hyperref/pd1enc.def
File: pd1enc.def 2024-11-05 v7.01l Hyperref: PDFDocEncoding definition (HO)
) (d:/settings/Language/texlive/2025/texmf-dist/tex/generic/intcalc/intcalc.sty
Package: intcalc 2019/12/15 v1.3 Expandable calculations with integers (HO)
)
\Hy@SavedSpaceFactor=\count293
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/hyperref/puenc.def
File: puenc.def 2024-11-05 v7.01l Hyperref: PDF Unicode definition (HO)
)
Package hyperref Info: Hyper figures OFF on input line 4157.
Package hyperref Info: Link nesting OFF on input line 4162.
Package hyperref Info: Hyper index ON on input line 4165.
Package hyperref Info: Plain pages OFF on input line 4172.
Package hyperref Info: Backreferencing OFF on input line 4177.
Package hyperref Info: Implicit mode ON; LaTeX internals redefined.
Package hyperref Info: Bookmarks ON on input line 4424.
\c@Hy@tempcnt=\count294
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/url/url.sty
\Urlmuskip=\muskip17
Package: url 2013/09/16 ver 3.4 Verb mode for urls, etc.
)
LaTeX Info: Redefining \url on input line 4763.
\XeTeXLinkMargin=\dimen176
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/bitset/bitset.sty
Package: bitset 2019/12/09 v1.3 Handle bit-vector datatype (HO)
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/bigintcalc/bigintcalc
.sty
Package: bigintcalc 2019/12/15 v1.5 Expandable calculations on big integers (HO
)
))
\Fld@menulength=\count295
\Field@Width=\dimen177
\Fld@charsize=\dimen178
Package hyperref Info: Hyper figures OFF on input line 6042.
Package hyperref Info: Link nesting OFF on input line 6047.
Package hyperref Info: Hyper index ON on input line 6050.
Package hyperref Info: backreferencing OFF on input line 6057.
Package hyperref Info: Link coloring OFF on input line 6062.
Package hyperref Info: Link coloring with OCG OFF on input line 6067.
Package hyperref Info: PDF/A mode OFF on input line 6072.
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/base/atbegshi-ltx.sty
Package: atbegshi-ltx 2021/01/10 v1.0c Emulation of the original atbegshi
package with kernel methods
)
\Hy@abspage=\count296
\c@Item=\count297
\c@Hfootnote=\count298
)
Package hyperref Info: Driver (autodetected): hxetex.
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/hyperref/hxetex.def
File: hxetex.def 2024-11-05 v7.01l Hyperref driver for XeTeX
\pdfm@box=\box57
\c@Hy@AnnotLevel=\count299
\HyField@AnnotCount=\count300
\Fld@listcount=\count301
\c@bookmark@seq@number=\count302
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/rerunfilecheck/rerunfil
echeck.sty
Package: rerunfilecheck 2022-07-10 v1.10 Rerun checks for auxiliary files (HO)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/base/atveryend-ltx.sty
Package: atveryend-ltx 2020/08/19 v1.0a Emulation of the original atveryend pac
kage
with kernel methods
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/generic/uniquecounter/uniquec
ounter.sty
Package: uniquecounter 2019/12/15 v1.4 Provide unlimited unique counter (HO)
)
Package uniquecounter Info: New unique counter `rerunfilecheck' on input line 2
85.
)
\Hy@SectionHShift=\skip53
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/caption/caption.sty
Package: caption 2023/08/05 v3.6o Customizing captions (AR)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/caption/caption3.sty
Package: caption3 2023/07/31 v2.4d caption3 kernel (AR)
\caption@tempdima=\dimen179
\captionmargin=\dimen180
\caption@leftmargin=\dimen181
\caption@rightmargin=\dimen182
\caption@width=\dimen183
\caption@indent=\dimen184
\caption@parindent=\dimen185
\caption@hangindent=\dimen186
Package caption Info: Standard document class detected.
)
\c@caption@flags=\count303
\c@continuedfloat=\count304
Package caption Info: float package is loaded.
Package caption Info: hyperref package is loaded.
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/setspace/setspace.sty
Package: setspace 2022/12/04 v6.7b set line spacing
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/parskip/parskip.sty
Package: parskip 2021-03-14 v2.0h non-zero parskip adjustments
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/enumitem/enumitem.sty
Package: enumitem 2025/02/06 v3.11 Customized lists
\labelindent=\skip54
\enit@outerparindent=\dimen187
\enit@toks=\toks23
\enit@inbox=\box58
\enit@count@id=\count305
\enitdp@description=\count306
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/titlesec/titlesec.sty
Package: titlesec 2025/01/04 v2.17 Sectioning titles
\ttl@box=\box59
\beforetitleunit=\skip55
\aftertitleunit=\skip56
\ttl@plus=\dimen188
\ttl@minus=\dimen189
\ttl@toksa=\toks24
\titlewidth=\dimen190
\titlewidthlast=\dimen191
\titlewidthfirst=\dimen192
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/amsmath/amsmath.sty
Package: amsmath 2024/11/05 v2.17t AMS math features
\@mathmargin=\skip57
For additional information on amsmath, use the `?' option.
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/amsmath/amstext.sty
Package: amstext 2021/08/26 v2.01 AMS text
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/amsmath/amsgen.sty
File: amsgen.sty 1999/11/30 v2.0 generic functions
\@emptytoks=\toks25
\ex@=\dimen193
))
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/amsmath/amsbsy.sty
Package: amsbsy 1999/11/29 v1.2d Bold Symbols
\pmbraise@=\dimen194
)
(d:/settings/Language/texlive/2025/texmf-dist/tex/latex/amsmath/amsopn.sty
Package: amsopn 2022/04/08 v2.04 operator names
)
\inf@bad=\count307
LaTeX Info: Redefining \frac on input line 233.
\uproot@=\count308
\leftroot@=\count309
LaTeX Info: Redefining \overline on input line 398.
LaTeX Info: Redefining \colon on input line 409.
\classnum@=\count310
\DOTSCASE@=\count311
LaTeX Info: Redefining \ldots on input line 495.
LaTeX Info: Redefining \dots on input line 498.
LaTeX Info: Redefining \cdots on input line 619.
\Mathstrutbox@=\box60
\strutbox@=\box61
LaTeX Info: Redefining \big on input line 721.
LaTeX Info: Redefining \Big on input line 722.
LaTeX Info: Redefining \bigg on input line 723.
LaTeX Info: Redefining \Bigg on input line 724.
\big@size=\dimen195
LaTeX Font Info: Redeclaring font encoding OML on input line 742.
LaTeX Font Info: Redeclaring font encoding OMS on input line 743.
\macc@depth=\count312
LaTeX Info: Redefining \bmod on input line 904.
LaTeX Info: Redefining \pmod on input line 909.
LaTeX Info: Redefining \smash on input line 939.
LaTeX Info: Redefining \relbar on input line 969.
LaTeX Info: Redefining \Relbar on input line 970.
\c@MaxMatrixCols=\count313
\dotsspace@=\muskip18
\c@parentequation=\count314
\dspbrk@lvl=\count315
\tag@help=\toks26
\row@=\count316
\column@=\count317
\maxfields@=\count318
\andhelp@=\toks27
\eqnshift@=\dimen196
\alignsep@=\dimen197
\tagshift@=\dimen198
\tagwidth@=\dimen199
\totwidth@=\dimen256
\lineht@=\dimen257
\@envbody=\toks28
\multlinegap=\skip58
\multlinetaggap=\skip59
\mathdisplay@stack=\toks29
LaTeX Info: Redefining \[ on input line 2953.
LaTeX Info: Redefining \] on input line 2954.
)
Package fontspec Info:
(fontspec) Font family 'TimesNewRoman(0)' created for font 'Times
(fontspec) New Roman' with options [Ligatures=TeX].
(fontspec)
(fontspec) This font family consists of the following NFSS
(fontspec) series/shapes:
(fontspec)
(fontspec) - 'normal' (m/n) with NFSS spec.: <->"Times New
(fontspec) Roman/OT:script=latn;language=dflt;mapping=tex-text;"
(fontspec) - 'small caps' (m/sc) with NFSS spec.: <->"Times New
(fontspec) Roman/OT:script=latn;language=dflt;+smcp;mapping=tex-tex
t;"
(fontspec) - 'bold' (b/n) with NFSS spec.: <->"Times New
(fontspec) Roman/B/OT:script=latn;language=dflt;mapping=tex-text;"
(fontspec) - 'bold small caps' (b/sc) with NFSS spec.: <->"Times
(fontspec) New
(fontspec) Roman/B/OT:script=latn;language=dflt;+smcp;mapping=tex-t
ext;"
(fontspec) - 'italic' (m/it) with NFSS spec.: <->"Times New
(fontspec) Roman/I/OT:script=latn;language=dflt;mapping=tex-text;"
(fontspec) - 'italic small caps' (m/scit) with NFSS spec.:
(fontspec) <->"Times New
(fontspec) Roman/I/OT:script=latn;language=dflt;+smcp;mapping=tex-t
ext;"
(fontspec) - 'bold italic' (b/it) with NFSS spec.: <->"Times New
(fontspec) Roman/BI/OT:script=latn;language=dflt;mapping=tex-text;"
(fontspec) - 'bold italic small caps' (b/scit) with NFSS spec.:
(fontspec) <->"Times New
(fontspec) Roman/BI/OT:script=latn;language=dflt;+smcp;mapping=tex-
text;"
Package fontspec Info:
(fontspec) Font family 'Arial(0)' created for font 'Arial' with
(fontspec) options [Ligatures=TeX].
(fontspec)
(fontspec) This font family consists of the following NFSS
(fontspec) series/shapes:
(fontspec)
(fontspec) - 'normal' (m/n) with NFSS spec.:
(fontspec) <->"Arial/OT:script=latn;language=dflt;mapping=tex-text;
"
(fontspec) - 'small caps' (m/sc) with NFSS spec.:
(fontspec) <->"Arial/OT:script=latn;language=dflt;+smcp;mapping=tex
-text;"
(fontspec) - 'bold' (b/n) with NFSS spec.:
(fontspec) <->"Arial/B/OT:script=latn;language=dflt;mapping=tex-tex
t;"
(fontspec) - 'bold small caps' (b/sc) with NFSS spec.:
(fontspec) <->"Arial/B/OT:script=latn;language=dflt;+smcp;mapping=t
ex-text;"
(fontspec) - 'italic' (m/it) with NFSS spec.:
(fontspec) <->"Arial/I/OT:script=latn;language=dflt;mapping=tex-tex
t;"
(fontspec) - 'italic small caps' (m/scit) with NFSS spec.:
(fontspec) <->"Arial/I/OT:script=latn;language=dflt;+smcp;mapping=t
ex-text;"
(fontspec) - 'bold italic' (b/it) with NFSS spec.:
(fontspec) <->"Arial/BI/OT:script=latn;language=dflt;mapping=tex-te
xt;"
(fontspec) - 'bold italic small caps' (b/scit) with NFSS spec.:
(fontspec) <->"Arial/BI/OT:script=latn;language=dflt;+smcp;mapping=
tex-text;"
Package fontspec Info:
(fontspec) Font family 'Consolas(0)' created for font 'Consolas'
(fontspec) with options
(fontspec) [WordSpace={1,0,0},HyphenChar=None,PunctuationSpace=Word
Space].
(fontspec)
(fontspec) This font family consists of the following NFSS
(fontspec) series/shapes:
(fontspec)
(fontspec) - 'normal' (m/n) with NFSS spec.:
(fontspec) <->"Consolas/OT:script=latn;language=dflt;"
(fontspec) - 'bold' (b/n) with NFSS spec.:
(fontspec) <->"Consolas/B/OT:script=latn;language=dflt;"
(fontspec) - 'italic' (m/it) with NFSS spec.:
(fontspec) <->"Consolas/I/OT:script=latn;language=dflt;"
(fontspec) - 'bold italic' (b/it) with NFSS spec.:
(fontspec) <->"Consolas/BI/OT:script=latn;language=dflt;"
(./theory_and_reflection_1234560.aux)
\openout1 = `theory_and_reflection_1234560.aux'.
LaTeX Font Info: Checking defaults for OML/cmm/m/it on input line 29.
LaTeX Font Info: ... okay on input line 29.
LaTeX Font Info: Checking defaults for OMS/cmsy/m/n on input line 29.
LaTeX Font Info: ... okay on input line 29.
LaTeX Font Info: Checking defaults for OT1/cmr/m/n on input line 29.
LaTeX Font Info: ... okay on input line 29.
LaTeX Font Info: Checking defaults for T1/cmr/m/n on input line 29.
LaTeX Font Info: ... okay on input line 29.
LaTeX Font Info: Checking defaults for TS1/cmr/m/n on input line 29.
LaTeX Font Info: ... okay on input line 29.
LaTeX Font Info: Checking defaults for TU/lmr/m/n on input line 29.
LaTeX Font Info: ... okay on input line 29.
LaTeX Font Info: Checking defaults for OMX/cmex/m/n on input line 29.
LaTeX Font Info: ... okay on input line 29.
LaTeX Font Info: Checking defaults for U/cmr/m/n on input line 29.
LaTeX Font Info: ... okay on input line 29.
LaTeX Font Info: Checking defaults for PD1/pdf/m/n on input line 29.
LaTeX Font Info: ... okay on input line 29.
LaTeX Font Info: Checking defaults for PU/pdf/m/n on input line 29.
LaTeX Font Info: ... okay on input line 29.
*geometry* driver: auto-detecting
*geometry* detected driver: xetex
*geometry* verbose mode - [ preamble ] result:
* driver: xetex
* paper: a4paper
* layout: <same size as paper>
* layoutoffset:(h,v)=(0.0pt,0.0pt)
* modes:
* h-part:(L,W,R)=(41.25641pt, 514.99506pt, 41.25641pt)
* v-part:(T,H,B)=(42.67912pt, 759.6886pt, 42.67912pt)
* \paperwidth=597.50787pt
* \paperheight=845.04684pt
* \textwidth=514.99506pt
* \textheight=759.6886pt
* \oddsidemargin=-31.01358pt
* \evensidemargin=-31.01358pt
* \topmargin=-66.59087pt
* \headheight=12.0pt
* \headsep=25.0pt
* \topskip=11.0pt
* \footskip=30.0pt
* \marginparwidth=50.0pt
* \marginparsep=10.0pt
* \columnsep=10.0pt
* \skip\footins=10.0pt plus 4.0pt minus 2.0pt
* \hoffset=0.0pt
* \voffset=0.0pt
* \mag=1000
* \@twocolumnfalse
* \@twosidefalse
* \@mparswitchfalse
* \@reversemarginfalse
* (1in=72.27pt=25.4mm, 1cm=28.453pt)
Package fontspec Info:
(fontspec) Adjusting the maths setup (use [no-math] to avoid
(fontspec) this).
\symlegacymaths=\mathgroup4
LaTeX Font Info: Overwriting symbol font `legacymaths' in version `bold'
(Font) OT1/cmr/m/n --> OT1/cmr/bx/n on input line 29.
LaTeX Font Info: Redeclaring math accent \acute on input line 29.
LaTeX Font Info: Redeclaring math accent \grave on input line 29.
LaTeX Font Info: Redeclaring math accent \ddot on input line 29.
LaTeX Font Info: Redeclaring math accent \tilde on input line 29.
LaTeX Font Info: Redeclaring math accent \bar on input line 29.
LaTeX Font Info: Redeclaring math accent \breve on input line 29.
LaTeX Font Info: Redeclaring math accent \check on input line 29.
LaTeX Font Info: Redeclaring math accent \hat on input line 29.
LaTeX Font Info: Redeclaring math accent \dot on input line 29.
LaTeX Font Info: Redeclaring math accent \mathring on input line 29.
LaTeX Font Info: Redeclaring math symbol \Gamma on input line 29.
LaTeX Font Info: Redeclaring math symbol \Delta on input line 29.
LaTeX Font Info: Redeclaring math symbol \Theta on input line 29.
LaTeX Font Info: Redeclaring math symbol \Lambda on input line 29.
LaTeX Font Info: Redeclaring math symbol \Xi on input line 29.
LaTeX Font Info: Redeclaring math symbol \Pi on input line 29.
LaTeX Font Info: Redeclaring math symbol \Sigma on input line 29.
LaTeX Font Info: Redeclaring math symbol \Upsilon on input line 29.
LaTeX Font Info: Redeclaring math symbol \Phi on input line 29.
LaTeX Font Info: Redeclaring math symbol \Psi on input line 29.
LaTeX Font Info: Redeclaring math symbol \Omega on input line 29.
LaTeX Font Info: Redeclaring math symbol \mathdollar on input line 29.
LaTeX Font Info: Redeclaring symbol font `operators' on input line 29.
LaTeX Font Info: Encoding `OT1' has changed to `TU' for symbol font
(Font) `operators' in the math version `normal' on input line 29.
LaTeX Font Info: Overwriting symbol font `operators' in version `normal'
(Font) OT1/cmr/m/n --> TU/TimesNewRoman(0)/m/n on input line 2
9.
LaTeX Font Info: Encoding `OT1' has changed to `TU' for symbol font
(Font) `operators' in the math version `bold' on input line 29.
LaTeX Font Info: Overwriting symbol font `operators' in version `bold'
(Font) OT1/cmr/bx/n --> TU/TimesNewRoman(0)/m/n on input line
29.
LaTeX Font Info: Overwriting symbol font `operators' in version `normal'
(Font) TU/TimesNewRoman(0)/m/n --> TU/TimesNewRoman(0)/m/n on
input line 29.
LaTeX Font Info: Overwriting math alphabet `\mathit' in version `normal'
(Font) OT1/cmr/m/it --> TU/TimesNewRoman(0)/m/it on input line
29.
LaTeX Font Info: Overwriting math alphabet `\mathbf' in version `normal'
(Font) OT1/cmr/bx/n --> TU/TimesNewRoman(0)/b/n on input line
29.
LaTeX Font Info: Overwriting math alphabet `\mathsf' in version `normal'
(Font) OT1/cmss/m/n --> TU/Arial(0)/m/n on input line 29.
LaTeX Font Info: Overwriting math alphabet `\mathtt' in version `normal'
(Font) OT1/cmtt/m/n --> TU/Consolas(0)/m/n on input line 29.
LaTeX Font Info: Overwriting symbol font `operators' in version `bold'
(Font) TU/TimesNewRoman(0)/m/n --> TU/TimesNewRoman(0)/b/n on
input line 29.
LaTeX Font Info: Overwriting math alphabet `\mathit' in version `bold'
(Font) OT1/cmr/bx/it --> TU/TimesNewRoman(0)/b/it on input lin
e 29.
LaTeX Font Info: Overwriting math alphabet `\mathsf' in version `bold'
(Font) OT1/cmss/bx/n --> TU/Arial(0)/b/n on input line 29.
LaTeX Font Info: Overwriting math alphabet `\mathtt' in version `bold'
(Font) OT1/cmtt/m/n --> TU/Consolas(0)/b/n on input line 29.
Package hyperref Info: Link coloring OFF on input line 29.
(./theory_and_reflection_1234560.out) (./theory_and_reflection_1234560.out)
\@outlinefile=\write3
\openout3 = `theory_and_reflection_1234560.out'.
Package caption Info: Begin \AtBeginDocument code.
Package caption Info: End \AtBeginDocument code.
[1
]
File: ../outputs/figures/optuna_param_importance.png Graphic file (type bmp)
<../outputs/figures/optuna_param_importance.png>
[2]
[3]
[4] (./theory_and_reflection_1234560.aux)
***********
LaTeX2e <2024-11-01> patch level 2
L3 programming layer <2025-01-18>
***********
Package rerunfilecheck Info: File `theory_and_reflection_1234560.out' has not c
hanged.
(rerunfilecheck) Checksum: 52A657089D543B64425D1F91299C4F1D;807.
)
Here is how much of TeX's memory you used:
14081 strings out of 473832
277944 string characters out of 5733159
744098 words of memory out of 5000000
36945 multiletter control sequences out of 15000+600000
564585 words of font info for 103 fonts, out of 8000000 for 9000
1348 hyphenation exceptions out of 8191
79i,11n,93p,1218b,425s stack positions out of 10000i,1000n,20000p,200000b,200000s
Output written on theory_and_reflection_1234560.pdf (4 pages).
@@ -0,0 +1,122 @@
\documentclass[11pt,a4paper]{article}
\usepackage[margin=1.45cm,top=1.5cm,bottom=1.5cm]{geometry}
\usepackage{fontspec}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{array}
\usepackage{tabularx}
\usepackage{float}
\usepackage{hyperref}
\usepackage{caption}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{enumitem}
\usepackage{titlesec}
\usepackage{amsmath}
\setmainfont{Times New Roman}
\setsansfont{Arial}
\setmonofont{Consolas}
\setstretch{1.03}
\setlist[itemize]{leftmargin=1.1em,itemsep=0.12em,topsep=0.12em}
\captionsetup{font=small,labelfont=bf}
\titlespacing*{\section}{0pt}{0.6em}{0.28em}
\titlespacing*{\subsection}{0pt}{0.28em}{0.12em}
\titleformat{\section}{\large\bfseries}{\thesection.}{0.4em}{}
\newcolumntype{Y}{>{\centering\arraybackslash}X}
\pagestyle{plain}
\begin{document}
\begin{center}
{\Large \textbf{Theory and Reflection}}\\
\vspace{0.2em}
{\normalsize DTS304TC Coursework 1 \quad Student ID: 1234560}
\vspace{0.15em}
\rule{0.6\linewidth}{0.4pt}
\end{center}
\section{Bagging vs Boosting}
Bagging (Bootstrap Aggregating) trains $B$ independent decision-tree base learners on bootstrapped samples drawn uniformly from the original dataset, then aggregates their predictions by majority vote for classification or averaging for regression. By making each tree learn on a different random subset of the data, bagging reduces variance through decorrelation: even if individual trees overfit, their errors partially cancel out when combined. Boosting, by contrast, trains base learners sequentially: each new learner is fitted to the residuals or misclassified instances from the current ensemble, with the effect that the ensemble reduces bias more aggressively. The key conceptual difference is that bagging treats all base learners as equally informative, while boosting adaptively reweights observations based on past mistakes.
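As a toy illustration of the variance-reduction argument (not the notebook's experiment), the sketch below uses a deliberately high-variance base learner and shows that averaging it over bootstrap resamples shrinks the spread of its predictions; all names and values here are invented for the demonstration.

```python
import random
import statistics

random.seed(0)

# Each "dataset" is 30 noisy draws around a true value of 5.0.
def draw_dataset():
    return [5.0 + random.gauss(0, 1) for _ in range(30)]

# A high-variance base learner: predicts a single random training point
# (an analogue of a fully grown, overfit tree).
def base_learner(data):
    return random.choice(data)

# Bagging: fit B base learners on bootstrap resamples, average predictions.
def bagged_prediction(data, B=50):
    preds = []
    for _ in range(B):
        boot = [random.choice(data) for _ in data]
        preds.append(base_learner(boot))
    return statistics.mean(preds)

single = [base_learner(draw_dataset()) for _ in range(200)]
bagged = [bagged_prediction(draw_dataset()) for _ in range(200)]
print(statistics.pstdev(single), statistics.pstdev(bagged))
```

The bagged predictor's standard deviation is far below the single learner's, while both remain centred near the true value: variance falls, bias does not.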
My notebook implemented a controlled comparison of Random Forest (representing bagging) and XGBoost (representing boosting) under identical preprocessing and identical train/validation split. This design is crucial: any difference in results must then reflect the learning algorithm itself, not differences in data preparation or evaluation.
Table~\ref{tab: supervised-comparison} summarises the key results drawn from \texttt{outputs/tables/personalised\_improvement\_summary.csv}. It shows model name, validation macro-F1, validation accuracy, the generalisation gap (train F1 minus val F1), per-class F1 scores, and training time. Four rows are shown: Baseline LR, Random Forest, untuned XGBoost, and Optuna-tuned XGBoost.
\begin{table}[H]
\centering
\caption{Controlled supervised model comparison (identical pipeline and split).}
\label{tab: supervised-comparison}
\small
\begin{tabularx}{\textwidth}{>{\raggedright\arraybackslash}p{2.3cm}YYYYYYY}
\toprule
Model & Val F1 & Val Acc & Gap & High F1 & Low F1 & Std F1 & Time (s)\\
\midrule
Baseline LR & 0.7238 & 0.7342 & 0.0146 & 0.7665 & 0.6490 & 0.7558 & --\\
Random Forest & 0.7708 & 0.7877 & \textbf{0.2292} & 0.7875 & 0.7095 & 0.8154 & 57.91\\
XGBoost & 0.8144 & 0.8371 & 0.0155 & 0.8905 & 0.6944 & 0.8583 & 67.64\\
Tuned XGBoost & 0.8520 & 0.8700 & 0.1219 & 0.9084 & 0.7620 & 0.8854 & 142.65\\
\bottomrule
\end{tabularx}\\[3pt]
{\small Gap = train\_F1 $-$ val\_F1.}
\end{table}
The results provide strong evidence for the theoretical predictions. Random Forest achieved a training macro-F1 of $1.0000$ (perfect fit on the training set) but a validation macro-F1 of only $0.7708$, yielding a generalisation gap of $0.2292$. This extreme overfitting is also confirmed visually in the Random Forest confusion matrix produced in the notebook. XGBoost, by contrast, had a training macro-F1 of $0.8297$ and a validation macro-F1 of $0.8144$, giving a gap of only $0.0155$. The difference in gaps is striking: RF's overfitting is roughly 15 times larger than XGBoost's.
The per-class F1 column in Table~\ref{tab: supervised-comparison} reveals further structure. Before tuning, RF achieved a Low-class F1 of $0.7095$, outperforming untuned XGBoost ($0.6944$) on the minority class---but this advantage disappears once XGBoost is tuned. After Optuna tuning, XGBoost's Low-class F1 rises to $0.7620$, a gain of $+0.0676$ over its untuned state and substantially higher than RF's $0.7095$. This demonstrates that boosting's sequential residual correction is better suited to learning the non-linear decision boundary between risk classes on this dataset. Bagging's variance-reduction mechanism averages the errors of independently grown trees but cannot sequentially correct the systematic mistakes those trees share on this mixed numerical and categorical feature space, which is why Random Forest underperforms here.
\section{Hyperparameter Optimisation}
I used Optuna with the TPE (Tree-structured Parzen Estimator) sampler for 30 trials, targeting maximisation of validation macro-F1. The search space covered nine XGBoost hyperparameters: n\_estimators (100--500), max\_depth (3--10), learning\_rate (0.01--0.3, log-scale), min\_child\_weight (1--10), subsample (0.5--1.0), colsample\_bytree (0.5--1.0), gamma (0--5), reg\_alpha ($10^{-4}$--10, log-scale), and reg\_lambda ($10^{-4}$--10, log-scale). The mixture of discrete and continuous parameters with multiple interactions makes a full grid search computationally prohibitive; TPE avoids exhaustive enumeration by modelling the density of good and bad trial configurations and directing subsequent searches toward promising regions of the parameter space.
Trial 22 produced the best validation macro-F1 of $0.8520$, a gain of $+0.0376$ over the untuned XGBoost baseline of $0.8144$. The optimal configuration was: n\_estimators$=276$, max\_depth$=9$, learning\_rate$\approx0.192$, subsample$\approx0.707$, colsample\_bytree$\approx0.799$, reg\_lambda$\approx5.0$, and gamma$\approx2.5$. These values align with expectations: a moderate learning rate combined with large tree depth and many estimators allows the model to fit complex interactions, while subsample and colsample ratios around $0.7$--$0.8$ provide regularisation. Figure~\ref{fig: param-importance} shows the Optuna parameter-importance plot, confirming that structural parameters and the learning rate dominated the optimisation.
\begin{figure}[H]
\centering
\fbox{\includegraphics[width=0.58\textwidth]{../outputs/figures/optuna_param_importance.png}}
\caption{Optuna parameter importance. Larger bars indicate higher influence on validation macro-F1.}
\label{fig: param-importance}
\end{figure}
The per-class F1 changes in Table~\ref{tab: supervised-comparison} deserve particular attention, because macro-F1 weights all three classes equally. Optuna's improvement in the \texttt{Low} class (minority) from $0.6944$ to $0.7620$ ($+0.0676$) is especially large, while the \texttt{High} class F1 increased from $0.8905$ to $0.9084$ ($+0.0179$) and the \texttt{Standard} class from $0.8583$ to $0.8854$ ($+0.0271$). This broad-based improvement across all three classes shows that TPE successfully optimised the class-balanced objective rather than overfitting to the majority class. The tuned model did not sacrifice performance on any single class to achieve higher overall metrics, which is exactly what the macro-F1 metric rewards.
\section{K-Means vs GMM}
K-Means assigns each sample $x_i$ to the cluster $c_i\in\{1,\dots,k\}$ whose centroid $\mu_c$ minimises the squared Euclidean distance $\|x_i-\mu_{c_i}\|^2$. This is a \textbf{hard assignment}: each sample belongs to exactly one cluster, with no notion of uncertainty or partial membership. GMM (Gaussian Mixture Model) takes a fundamentally different approach by modelling the data as a mixture of $k$ multivariate Gaussian distributions: $p(x)=\sum_{j=1}^{k}\pi_j\,\mathcal{N}(x\mid\mu_j,\Sigma_j)$, where $\pi_j$ are the mixing proportions. Each sample receives a posterior probability $p(c_j\mid x_i)$ for every component, enabling \textbf{soft assignment}: a sample can belong partially to multiple clusters. For insurance risk, where applicant profiles naturally overlap across risk bands rather than forming isolated groups, soft assignment is more aligned with the domain.
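The hard/soft distinction is easy to see in code. The sketch below uses synthetic overlapping Gaussians, not the notebook's insurance features: K-Means returns one label per sample, while the GMM returns a posterior probability per component that exposes boundary uncertainty.

```python
# Hard vs soft assignment on deliberately overlapping synthetic data
# (illustrative only; not the notebook's insurance features).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(1.5, 1.0, (200, 2))])  # overlapping groups

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hard = km.labels_                       # each sample -> exactly one cluster

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)             # posterior p(c_j | x_i) per component

# Soft assignment reveals samples the hard labels present as certain:
ambiguous = int(((soft[:, 0] > 0.3) & (soft[:, 0] < 0.7)).sum())
print(hard[:5], soft[0].round(3), ambiguous)
```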
Table~\ref{tab: clustering} reports the complete clustering results from \texttt{outputs/tables/clustering\_comparison.csv}, covering k=2 through k=8. The columns are: $k$, K-Means inertia, K-Means silhouette score, GMM BIC, GMM AIC, and GMM silhouette score. K-Means silhouette scores remain low across the entire range, peaking at only $0.2015$ at $k=8$. This confirms that even the best K-Means configuration fails to find well-separated spherical clusters in this data. GMM achieves substantially higher silhouette scores: $0.4142$ at $k=2$ and $0.4015$ at $k=5$, which are roughly double the best K-Means values. At $k=2$, the GMM silhouette of $0.4142$ versus K-Means's $0.1740$ is particularly revealing: it suggests that the two-cluster structure in this insurance dataset is inherently probabilistic (overlapping Gaussian components) rather than discrete (centroid-defined).
\begin{table}[H]
\centering
\caption{Full clustering comparison across k=2 to k=8.}
\label{tab: clustering}
\footnotesize
\begin{tabularx}{\textwidth}{YYYYYY}
\toprule
$k$ & K-Means Inertia & K-Means Sil & GMM BIC & GMM AIC & GMM Sil\\
\midrule
2 & 1,092,962 & 0.1740 & $-$359,251 & $-$362,062 & \textbf{0.4142}\\
3 & 1,018,587 & 0.1732 & $-$1,103,445 & $-$1,107,666 & 0.2977\\
4 & 953,249 & 0.1808 & $-$1,938,815 & $-$1,944,446 & 0.3964\\
5 & 889,285 & 0.1964 & $-$1,997,256 & $-$2,004,298 & 0.4015\\
6 & 818,951 & 0.1768 & $-$2,349,766 & $-$2,358,217 & 0.2468\\
7 & 777,658 & 0.1971 & $-$2,394,381 & $-$2,404,243 & 0.3110\\
8 & 691,941 & \textbf{0.2015} & $-$2,510,221 & $-$2,521,493 & 0.1726\\
\bottomrule
\end{tabularx}
\end{table}
The GMM BIC column in Table~\ref{tab: clustering} shows a monotonic decrease with larger $k$, which is expected since adding more components always allows a better fit to the training data. However, BIC also penalises model complexity, and the per-step improvements become smaller and more irregular at larger $k$ (compare the small drop from $k=4$ to $k=5$ with the larger one from $k=5$ to $k=6$), suggesting diminishing returns. The K-Means inertia curve is gradual with no sharp elbow, indicating the absence of a natural cluster count---another sign that the data does not contain clearly separable spherical structures. Overall, the GMM's consistently higher silhouette scores across most values of $k$ indicate that insurance applicants form probabilistic subtypes with soft boundaries. This validates the conceptual distinction between hard and soft assignment: GMM captures the overlapping nature of risk profiles that K-Means cannot represent. Importantly, neither clustering method is intended to replace the supervised classifier---they serve different objectives and the unsupervised analysis is purely exploratory.
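The sweep that produces such a table can be sketched as follows. Synthetic blobs stand in for the preprocessed insurance features, and the GMM silhouette column is omitted for brevity; the loop structure is what matters.

```python
# Sketch of a k = 2..8 clustering sweep (synthetic blobs, NOT the
# notebook's preprocessed features; GMM silhouettes omitted for brevity).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=2.5, random_state=0)

rows = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    rows.append((k, km.inertia_, silhouette_score(X, km.labels_),
                 gmm.bic(X), gmm.aic(X)))

for k, inertia, sil, bic, aic in rows:
    print(k, round(inertia), round(sil, 3), round(bic), round(aic))
```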
\section{Personalised Improvement Reflection}
My compulsory category was \textbf{Category A: Data Quality and Missingness}. Before any modelling, I conducted an EDA that identified substantial missingness in multiple columns. Five columns had notable missing rates: net\_monthly\_income\_gbp, avg\_payment\_delay\_days, monthly\_investment\_gbp, prior\_debt\_products, and account\_tenure (at 30.6, 19.0, 21.1, 7.6, and 4.3 percent respectively). Rather than treating missing values as noise and simply applying median imputation, I added five binary missing-indicator features---one for each of these five columns---appending them to the feature set alongside median imputation. This is based on the hypothesis that the \textit{pattern} of missingness itself may be informative: a missing income value might indicate financial instability or unemployment, which is a legitimate risk signal in insurance.
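The indicator-plus-imputation step can be sketched in a few lines of pandas. The column names follow the report; the four-row toy frame is illustrative, not the real data.

```python
# Sketch of the Category A missing-indicator step. Column names follow
# the report; the toy DataFrame below is illustrative, not the real data.
import numpy as np
import pandas as pd

flag_cols = ["net_monthly_income_gbp", "avg_payment_delay_days",
             "monthly_investment_gbp", "prior_debt_products", "account_tenure"]

df = pd.DataFrame({c: [1.0, np.nan, 3.0, np.nan] for c in flag_cols})

for c in flag_cols:
    df[f"{c}_missing"] = df[c].isna().astype(int)  # preserve the missingness signal
    df[c] = df[c].fillna(df[c].median())           # then impute with the median

print(df.filter(like="_missing").sum().to_dict())
```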
After adding the five missing indicators, validation macro-F1 rose from $0.8520$ (the Optuna-tuned model) to $0.8529$ (Category A XGBoost). The gain is modest ($+0.0009$) but meaningful, given that the tuned model was already strong and operating close to the performance ceiling implied by the feature space. More importantly, the gain confirms the hypothesis that missingness carries behavioural signal: in financial applications, missing income data does not occur at random and is therefore legitimately predictive. This also demonstrates an important methodological lesson: even small improvements should be interrogated to determine whether they reflect genuine signal or overfitting.
For my optional category, I implemented \textbf{Category D: Soft Voting Ensemble} by combining Random Forest and tuned XGBoost using soft voting (averaging predicted class probabilities). The ensemble achieved validation macro-F1 of $0.8510$, which is below both the Category A model ($0.8529$) and the tuned XGBoost alone ($0.8520$). This outcome is instructive: it shows that model diversity alone is insufficient for ensemble improvement. The two base learners had very different prediction profiles---RF overfitted dramatically while XGBoost was well-calibrated---and combining them diluted the boosting model's advantage rather than complementing it. In practice, effective ensembles typically require base learners that are both individually strong and diverse in their error patterns. My final model selection was therefore the Category A XGBoost, chosen strictly on the basis of validation macro-F1 evidence.
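Soft voting itself is a small mechanism: average the base learners' predicted class probabilities and take the arg-max. The sketch below uses synthetic data and a scikit-learn GradientBoostingClassifier as a stand-in for the tuned XGBoost model, so it illustrates the mechanism rather than reproducing the notebook's ensemble.

```python
# Soft-voting sketch on synthetic data. ASSUMPTION: the notebook combines
# the fitted RF and tuned XGBoost; GradientBoosting stands in for XGBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

vote = VotingClassifier(
    [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
     ("boost", GradientBoostingClassifier(random_state=0))],
    voting="soft")  # "soft" averages predict_proba outputs across learners
vote.fit(X, y)

proba = vote.predict_proba(X[:1])  # averaged class probabilities, one row
print(proba.shape)
```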
A critical prerequisite for all modelling steps was data leakage control. Before any model training, I screened all available features using single-feature DecisionTree cross-validation. The feature \texttt{bureau\_risk\_index} achieved a single-feature macro-F1 of $0.9999$---an extraordinarily high score that indicates near-perfect class separation. This immediately triggered the leakage detection threshold (set at $0.85$), and the feature was removed before any further experimentation. This step is fundamental: without removing the leakage feature, all subsequent validation scores in Table~\ref{tab: supervised-comparison} would be artificially inflated and every model comparison would be invalid. The leakage check also illustrates an important broader principle in applied machine learning: even when a feature appears to improve performance, it must be evaluated for its relationship to the target before being accepted.
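The screening loop can be sketched as follows, with the $0.85$ threshold from the report. The synthetic frame with a deliberately leaky column is illustrative; the notebook runs the same single-feature check over the real feature set.

```python
# Sketch of the single-feature leakage screen (threshold 0.85 as in the
# report). The synthetic "leaky" column is illustrative, not real data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
y = rng.integers(0, 3, 600)
features = {
    "honest_feature": rng.normal(size=600),
    "leaky_feature": y + rng.normal(scale=0.01, size=600),  # near-copy of target
}

THRESHOLD = 0.85
flagged = []
for name, col in features.items():
    # A shallow tree on ONE feature should not separate classes well
    # unless that feature essentially encodes the target.
    score = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                            col.reshape(-1, 1), y, cv=5,
                            scoring="f1_macro").mean()
    if score > THRESHOLD:
        flagged.append(name)
print(flagged)
```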
\section{AI Use Declaration}
AI tools were used only in a limited support role throughout this coursework: they assisted with environment debugging (resolving package import and GPU configuration issues), and with \LaTeX{} formatting to produce the final document. The experimental design, leakage detection decision, controlled model comparison, personalised improvement strategy, and all written interpretations of tables, figures, and metrics were derived from my own notebook results. No claims are made about hidden-test performance; the CSV file (\texttt{test\_result\_1234560.csv}) follows the required filename and column order from the assignment brief, generated solely for submission formatting.
\end{document}
LaTeX Font Info: Redeclaring math symbol \Gamma on input line 27.
LaTeX Font Info: Redeclaring math symbol \Delta on input line 27.
LaTeX Font Info: Redeclaring math symbol \Theta on input line 27.
LaTeX Font Info: Redeclaring math symbol \Lambda on input line 27.
LaTeX Font Info: Redeclaring math symbol \Xi on input line 27.
LaTeX Font Info: Redeclaring math symbol \Pi on input line 27.
LaTeX Font Info: Redeclaring math symbol \Sigma on input line 27.
LaTeX Font Info: Redeclaring math symbol \Upsilon on input line 27.
LaTeX Font Info: Redeclaring math symbol \Phi on input line 27.
LaTeX Font Info: Redeclaring math symbol \Psi on input line 27.
LaTeX Font Info: Redeclaring math symbol \Omega on input line 27.
LaTeX Font Info: Redeclaring math symbol \mathdollar on input line 27.
LaTeX Font Info: Redeclaring symbol font `operators' on input line 27.
LaTeX Font Info: Encoding `OT1' has changed to `TU' for symbol font
(Font) `operators' in the math version `normal' on input line 27.
LaTeX Font Info: Overwriting symbol font `operators' in version `normal'
(Font) OT1/cmr/m/n --> TU/lmr/m/n on input line 27.
LaTeX Font Info: Encoding `OT1' has changed to `TU' for symbol font
(Font) `operators' in the math version `bold' on input line 27.
LaTeX Font Info: Overwriting symbol font `operators' in version `bold'
(Font) OT1/cmr/bx/n --> TU/lmr/m/n on input line 27.
LaTeX Font Info: Overwriting symbol font `operators' in version `normal'
(Font) TU/lmr/m/n --> TU/lmr/m/n on input line 27.
LaTeX Font Info: Overwriting math alphabet `\mathit' in version `normal'
(Font) OT1/cmr/m/it --> TU/lmr/m/it on input line 27.
LaTeX Font Info: Overwriting math alphabet `\mathbf' in version `normal'
(Font) OT1/cmr/bx/n --> TU/lmr/b/n on input line 27.
LaTeX Font Info: Overwriting math alphabet `\mathsf' in version `normal'
(Font) OT1/cmss/m/n --> TU/lmss/m/n on input line 27.
LaTeX Font Info: Overwriting math alphabet `\mathtt' in version `normal'
(Font) OT1/cmtt/m/n --> TU/lmtt/m/n on input line 27.
LaTeX Font Info: Overwriting symbol font `operators' in version `bold'
(Font) TU/lmr/m/n --> TU/lmr/b/n on input line 27.
LaTeX Font Info: Overwriting math alphabet `\mathit' in version `bold'
(Font) OT1/cmr/bx/it --> TU/lmr/b/it on input line 27.
LaTeX Font Info: Overwriting math alphabet `\mathsf' in version `bold'
(Font) OT1/cmss/bx/n --> TU/lmss/b/n on input line 27.
LaTeX Font Info: Overwriting math alphabet `\mathtt' in version `bold'
(Font) OT1/cmtt/m/n --> TU/lmtt/b/n on input line 27.
Package hyperref Info: Link coloring OFF on input line 27.
(./theory_and_reflection_1234560_cn.out)
(./theory_and_reflection_1234560_cn.out)
\@outlinefile=\write3
\openout3 = `theory_and_reflection_1234560_cn.out'.
Package caption Info: Begin \AtBeginDocument code.
Package caption Info: End \AtBeginDocument code.
LaTeX Font Warning: Font shape `TU/SimSun(0)/b/n' undefined
(Font) using `TU/SimSun(0)/m/n' instead on input line 30.
Package xeCJK Warning: Unknown CJK family `\CJKttdefault' is being ignored.
(xeCJK)
(xeCJK) Try to use `\setCJKmonofont[<...>]{<...>}' to define
(xeCJK) it.
File: ../outputs/figures/optuna_param_importance.png Graphic file (type bmp)
<../outputs/figures/optuna_param_importance.png>
[1
]
LaTeX Font Warning: Font shape `TU/SimSun(0)/m/it' undefined
(Font) using `TU/SimSun(0)/m/n' instead on input line 109.
[2]
[3] (./theory_and_reflection_1234560_cn.aux)
***********
LaTeX2e <2024-11-01> patch level 2
L3 programming layer <2022/08/05>
***********
LaTeX Font Warning: Some font shapes were not available, defaults substituted.
Package rerunfilecheck Info: File `theory_and_reflection_1234560_cn.out' has no
t changed.
(rerunfilecheck) Checksum: A703C6812D998839E80788701035983A;582.
)
Here is how much of TeX's memory you used:
15601 strings out of 473832
329643 string characters out of 5733159
807732 words of memory out of 5000000
38476 multiletter control sequences out of 15000+600000
564951 words of font info for 81 fonts, out of 8000000 for 9000
1348 hyphenation exceptions out of 8191
74i,11n,92p,1221b,478s stack positions out of 10000i,1000n,20000p,200000b,200000s
Output written on theory_and_reflection_1234560_cn.pdf (3 pages).
@@ -0,0 +1,120 @@
\documentclass[11pt,a4paper]{article}
\usepackage[margin=1.45cm,top=1.5cm,bottom=1.5cm]{geometry}
\usepackage{xeCJK}
\setCJKmainfont{SimSun}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{array}
\usepackage{tabularx}
\usepackage{float}
\usepackage{hyperref}
\usepackage{caption}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{enumitem}
\usepackage{titlesec}
\usepackage{amsmath}
\setstretch{1.03}
\setlist[itemize]{leftmargin=1.1em,itemsep=0.12em,topsep=0.12em}
\captionsetup{font=small,labelfont=bf}
\titlespacing*{\section}{0pt}{0.6em}{0.28em}
\titlespacing*{\subsection}{0pt}{0.28em}{0.12em}
\titleformat{\section}{\large\bfseries}{\thesection.}{0.4em}{}
\newcolumntype{Y}{>{\centering\arraybackslash}X}
\pagestyle{plain}
\begin{document}
\begin{center}
{\Large \textbf{Theory and Reflection}}\\
\vspace{0.2em}
{\normalsize DTS304TC Coursework 1 \quad Student ID: 1234560}
\vspace{0.15em}
\rule{0.6\linewidth}{0.4pt}
\end{center}
\section{Bagging vs.\ Boosting}
Bagging (Bootstrap Aggregating) draws $B$ independent bootstrap samples from the original dataset, trains one decision tree as a base learner on each sample, and aggregates the base learners' predictions by majority vote (classification) or averaging (regression). Because each tree learns from a different random subset of the data, their prediction errors partially cancel when aggregated, so Bagging primarily reduces variance. Boosting, in contrast, trains base learners sequentially: each new learner focuses on the residuals or misclassified samples of the ensemble built so far, attacking bias more aggressively. The core conceptual difference is that Bagging treats all base learners as equally important, whereas Boosting dynamically reweights observations according to past mistakes.
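The variance-reduction claim can be made precise with a standard textbook decomposition (included here as an illustration, not a notebook result): for $B$ bagged estimators with common variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is
\[
\operatorname{Var}\!\Bigl(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\Bigr)=\rho\,\sigma^{2}+\frac{1-\rho}{B}\,\sigma^{2},
\]
so increasing $B$ shrinks the second term towards zero while leaving each base learner's bias untouched.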
My notebook runs a controlled comparison between Random Forest (representing Bagging) and XGBoost (representing Boosting) under the same preprocessing pipeline and the same train/validation split. This design is essential: any difference in results must reflect the learning algorithms themselves rather than differences in data preparation or evaluation.
Table~\ref{tab: supervised-comparison} summarises the key results from \texttt{outputs/tables/personalised\_improvement\_summary.csv}: a full comparison of the four models covering model name, validation macro-F1, validation accuracy, generalisation gap (training F1 minus validation F1), per-class F1, and training time.
\begin{table}[H]
\centering
\caption{Controlled supervised model comparison (same pipeline and split).}
\label{tab: supervised-comparison}
\small
\begin{tabularx}{\textwidth}{>{\raggedright\arraybackslash}p{2.3cm}YYYYYYY}
\toprule
Model & Val F1 & Val Acc & Gap & High F1 & Low F1 & Std F1 & Time (s)\\
\midrule
Baseline LR & 0.7238 & 0.7342 & 0.0146 & 0.7665 & 0.6490 & 0.7558 & --\\
Random Forest & 0.7708 & 0.7877 & \textbf{0.2292} & 0.7875 & 0.7095 & 0.8154 & 57.91\\
XGBoost & 0.8144 & 0.8371 & 0.0155 & 0.8905 & 0.6944 & 0.8583 & 67.64\\
Tuned XGBoost & 0.8520 & 0.8700 & 0.1219 & 0.9084 & 0.7620 & 0.8854 & 142.65\\
\bottomrule
\end{tabularx}\\[3pt]
{\small Gap = train\_F1 $-$ val\_F1 (i.e.\ the overfitting gap).}
\end{table}
The results strongly support the theoretical predictions. Random Forest achieves a perfect training macro-F1 ($1.0000$) but a validation macro-F1 of only $0.7708$, a generalisation gap of $0.2292$. This severe overfitting is also visually confirmed by the Random Forest confusion matrix in the notebook output. XGBoost, by contrast, reaches a training macro-F1 of $0.8297$ and a validation macro-F1 of $0.8144$, a gap of only $0.0155$. The contrast is striking: RF overfits roughly 15 times as much as XGBoost.
The per-class F1 columns in Table~\ref{tab: supervised-comparison} add further detail. Before tuning, RF outperforms untuned XGBoost on the minority class Low (F1 $=0.7095$ vs.\ $0.6944$), but this advantage disappears once XGBoost is tuned. After Optuna tuning, XGBoost's Low-class F1 rises to $0.7620$, an improvement of $+0.0676$ over its untuned value and well above RF's $0.7095$. This suggests that Boosting's sequential residual-correction mechanism is better suited to learning the non-linear decision boundaries between classes in this dataset. Bagging's variance reduction cannot compensate for the bias introduced by RF's fully grown trees in this mixed numerical and categorical feature space, which is why Random Forest underperforms here.
\section{Hyperparameter Optimisation}
I ran 30 trials of Optuna with the TPE (Tree-structured Parzen Estimator) sampler, maximising validation macro-F1. The search space covered nine XGBoost hyperparameters: n\_estimators (100--500), max\_depth (3--10), learning\_rate (0.01--0.3, log scale), min\_child\_weight (1--10), subsample (0.5--1.0), colsample\_bytree (0.5--1.0), gamma (0--5), reg\_alpha ($10^{-4}$--10, log scale), and reg\_lambda ($10^{-4}$--10, log scale). The mix of discrete and continuous parameters with multi-dimensional interactions makes exhaustive grid search computationally infeasible; TPE explores the space efficiently by modelling the densities of good and bad trial configurations and steering subsequent search towards promising regions.
Trial 22 achieved the best validation macro-F1 of $0.8520$, an improvement of $+0.0376$ over the untuned XGBoost baseline ($0.8144$). Its optimal configuration was n\_estimators $=276$, max\_depth $=9$, learning\_rate $\approx0.192$, subsample $\approx0.707$, colsample\_bytree $\approx0.799$, reg\_lambda $\approx5.0$, and gamma $\approx2.5$. These values match expectations: a moderate learning rate combined with deep trees and many estimators lets the model fit complex interactions, while subsample and colsample ratios around 0.7--0.8 provide regularisation. Figure~\ref{fig: param-importance} shows the Optuna parameter importance plot, confirming that the structural parameters and the learning rate dominated the optimisation.
\begin{figure}[H]
\centering
\fbox{\includegraphics[width=0.58\textwidth]{../outputs/figures/optuna_param_importance.png}}
\caption{Optuna hyperparameter importance plot. Longer bars indicate greater influence on validation macro-F1.}
\label{fig: param-importance}
\end{figure}
The per-class F1 changes in Table~\ref{tab: supervised-comparison} deserve particular attention, because macro-F1 weights all three classes equally. Optuna raised the Low (minority) class F1 from $0.6944$ to $0.7620$ ($+0.0676$), the High class F1 from $0.8905$ to $0.9084$ ($+0.0179$), and the Standard class F1 from $0.8583$ to $0.8854$ ($+0.0271$). This across-the-board improvement indicates that TPE genuinely optimised the class-balanced objective rather than overfitting to the majority class. The tuned model achieved this without any class losing performance, which is exactly what macro-F1 rewards.
\section{K-Means vs.\ GMM}
K-Means assigns each sample $x_i$ to the cluster $c_i\in\{1,\dots,k\}$ whose centroid $\mu_c$ minimises $\|x_i-\mu_{c_i}\|^2$. This is a \textbf{hard assignment}: each sample belongs to exactly one cluster, with no notion of uncertainty or partial membership. A GMM (Gaussian Mixture Model) takes a fundamentally different approach, modelling the data as a mixture of $k$ multivariate Gaussians: $p(x)=\sum_{j=1}^{k}\pi_j\,\mathcal{N}(x\mid\mu_j,\Sigma_j)$, where the $\pi_j$ are mixing proportions. Each sample receives a posterior probability $p(c_j\mid x_i)$ for every component, giving a \textbf{soft assignment}: a sample can partially belong to several clusters. In the insurance-risk setting, applicant profiles naturally overlap across risk levels rather than forming isolated groups, so soft assignment better matches the domain.
Table~\ref{tab: clustering} presents the full clustering results from \texttt{outputs/tables/clustering\_comparison.csv} for k = 2 to 8. The columns are: $k$, K-Means inertia, K-Means silhouette, GMM BIC, GMM AIC, and GMM silhouette. K-Means silhouette scores remain consistently low across all $k$, peaking at only $0.2015$ (at k = 8). This confirms that even the best K-Means configuration cannot find well-separated spherical clusters in the data. GMM achieves markedly higher silhouettes: $0.4142$ at k = 2 and $0.4015$ at k = 5, roughly double the best K-Means value. At k = 2 the contrast between GMM's $0.4142$ and K-Means's $0.1740$ is particularly telling: the two-cluster structure in the data is inherently probabilistic (overlapping Gaussian components) rather than discrete (centroid-defined).
\begin{table}[H]
\centering
\caption{Full clustering comparison (k = 2 to 8).}
\label{tab: clustering}
\footnotesize
\begin{tabularx}{\textwidth}{YYYYYY}
\toprule
$k$ & K-Means inertia & K-Means silhouette & GMM BIC & GMM AIC & GMM silhouette\\
\midrule
2 & 1,092,962 & 0.1740 & $-$359,251 & $-$362,062 & \textbf{0.4142}\\
3 & 1,018,587 & 0.1732 & $-$1,103,445 & $-$1,107,666 & 0.2977\\
4 & 953,249 & 0.1808 & $-$1,938,815 & $-$1,944,446 & 0.3964\\
5 & 889,285 & 0.1964 & $-$1,997,256 & $-$2,004,298 & 0.4015\\
6 & 818,951 & 0.1768 & $-$2,349,766 & $-$2,358,217 & 0.2468\\
7 & 777,658 & 0.1971 & $-$2,394,381 & $-$2,404,243 & 0.3110\\
8 & 691,941 & \textbf{0.2015} & $-$2,510,221 & $-$2,521,493 & 0.1726\\
\bottomrule
\end{tabularx}
\end{table}
The GMM BIC column in Table~\ref{tab: clustering} decreases (improves) monotonically as $k$ grows. This is expected, because adding components always improves the fit to the training data; since BIC also penalises model complexity, the rate of decrease slows at larger $k$, indicating diminishing marginal returns. The K-Means inertia curve shows no clear ``elbow'', suggesting the data has no natural cluster count; this is further evidence that it lacks well-separated spherical structure. Overall, GMM's consistently higher silhouettes across most values of $k$ indicate that insurance applicants do form probabilistic subtypes with soft boundaries. This validates the conceptual distinction between hard and soft assignment: GMM captures the overlapping nature of risk profiles that K-Means cannot represent. Importantly, neither clustering method is intended to replace the supervised classifier; they serve different goals, and the unsupervised analysis is purely exploratory.
\section{Personalised Improvement Reflection}
My compulsory category was \textbf{Category A: data quality and missing-value handling}. Before any modelling I carried out EDA and found substantial missingness in several columns. The five columns with the highest missing rates were net\_monthly\_income\_gbp, avg\_payment\_delay\_days, monthly\_investment\_gbp, prior\_debt\_products, and account\_tenure (30.6, 19.0, 21.1, 7.6, and 4.3 per cent respectively). Rather than treating missing values as noise and simply imputing medians, I added five binary missingness-indicator features, one per column above, while retaining median imputation. The approach rests on the hypothesis that the \textit{pattern} of missingness may itself carry information: a missing income value may signal financial instability or unemployment, a legitimate risk signal in insurance.
Adding these five missingness indicators raised validation macro-F1 from $0.8520$ (the Optuna-tuned model) to $0.8529$ (the Category A XGBoost). The gain is small ($+0.0009$), but it is meaningful given that the tuned model was already strong and close to the performance ceiling implied by the feature space. More importantly, it confirms the hypothesis that missingness carries behavioural signal: in financial applications, missing income data does not occur at random and is therefore legitimately predictive. It also illustrates an important methodological lesson: even small improvements should be scrutinised to judge whether they reflect genuine signal or overfitting.
For the optional category I implemented \textbf{Category D: a soft-voting ensemble}, combining Random Forest and the tuned XGBoost by averaging predicted class probabilities. The ensemble achieved a validation macro-F1 of $0.8510$, below both the Category A model ($0.8529$) and the tuned XGBoost alone ($0.8520$). This result is instructive: model diversity alone is not sufficient for an ensemble to improve. The two base learners have very different prediction profiles, with RF overfitting badly while XGBoost is well calibrated, and combining them diluted the Boosting model's strengths rather than complementing them. In practice, effective ensembles usually require base learners that are both individually strong and diverse in their error patterns. My final model choice, the Category A XGBoost, was therefore based strictly on validation macro-F1 evidence.
A critical prerequisite for all modelling steps was leakage control. Before training any model, I screened every available feature with single-feature decision-tree cross-validation. The feature bureau\_risk\_index achieved a single-feature macro-F1 of $0.9999$, an extreme score indicating near-perfect class separation. This immediately triggered the leakage-detection threshold (set at $0.85$), and the feature was removed before all further experiments. This step was essential: without removing the leaked feature, every high validation macro-F1 in Table~\ref{tab: supervised-comparison} would have been artificially inflated and every model comparison invalidated. The leakage check also illustrates a broader principle of applied machine learning: even when a feature appears to improve performance, its relationship to the target must be assessed before it is accepted.
\section{AI Use Declaration}
Throughout the coursework, AI tools were used only in a limited supporting role: helping with environment debugging (resolving package imports and GPU configuration) and with \LaTeX{} formatting of the final document. The experimental design, the leakage-detection decision, the controlled model comparisons, the personalised improvement strategy, and all written interpretation of tables, figures, and metrics derive from my own notebook results. No hidden-test performance is claimed; the CSV file (\texttt{test\_result\_1234560.csv}) follows the filename and column order required by the assignment brief and was generated solely to meet the submission format.
\end{document}
@@ -0,0 +1,942 @@
# Machine Learning Individual Coursework: Requirements Analysis and Implementation Plan
## 1. Purpose of This Document
This document is compiled from the following materials:
- `外教课/原文要求.txt`
- `外教课/课程作业实现方案分析.md`
- `外教课/课程作业整合及任务拆解与时间规划清单.md`
- `资料/DTS304TC_Assessment1_(word)_2026(1).pdf`
- the data file structure inside `资料/dataset_final(1).zip`
The goal is a detailed, execution-oriented requirements analysis and implementation plan to guide the complete delivery of `Jupyter Notebook + Theory and Reflection PDF + hidden-test CSV + supplementary code`.
---
## 2. Core Requirements of the Original Task
### 2.1 Assignment goal
Build and improve a multi-class model around a fictional but realistic health-insurance dataset, predicting each applicant's premium risk level:
- `Low`
- `Standard`
- `High`
The task is not about chasing a leaderboard ranking; it requires demonstrating a complete, disciplined, reproducible machine-learning workflow, including:
- data cleaning and preprocessing
- identification and removal of leakage features
- establishing a baseline model
- a fair comparison between `Random Forest` and one `Boosting` model
- advanced hyperparameter optimisation of at least one main model
- the compulsory improvement category determined by the last digit of the student ID, plus at least one additional optional category
- an unsupervised exploration with `K-Means` and `GMM`
- choosing the final model based on validation evidence
- exporting a hidden-test prediction CSV in the required format
### 2.2 Deliverables
The following files must be submitted:
- a `Jupyter Notebook` in `.ipynb` format
- a `Coursework Answer Sheet / Theory and Reflection PDF`
- a hidden-test prediction `CSV`
- any helper scripts used outside the notebook must also be submitted
### 2.3 Mark breakdown
According to the original PDF, the marks are structured as follows:
- `Question 1: Notebook-Based Coding Exercise`: `60` marks
- `Theory and Reflection PDF`: `30` marks
- `Coding Quality / Answer Sheet Quality / Submission Guidelines`: `10` marks
This means model scores alone are not enough: document quality, experiment organisation, result interpretation, and submission format all directly affect the grade.
---
---
## 3. 从原始 PDF 提炼出的硬性约束
以下内容属于正式要求中的关键硬约束,必须优先满足。
### 3.1 评价指标约束
- 主指标是 `macro-F1`
- `accuracy` 只是辅助指标
- 所有重要模型对比、调参结果、改进结果都应同时报告:
- `macro-F1`
- `accuracy`
原因是类别不平衡明显,不能只用准确率判断模型优劣。
### 3.2 数据泄露约束
PDF 明确指出:
- 数据中存在 `1` 个泄露特征
- 必须先识别并移除,再进入后续分析
- 如果没有删除,后续多个部分都会被视为无效或严重失真
这说明“找出泄露特征”不是建议项,而是作业关键检查点。
### 3.3 Model-comparison constraints
The notebook must include:
- a baseline pipeline
- a `Random Forest`
- a `Boosting` model
and the comparison must be controlled:
- the same preprocessing pipeline
- the same train/validation split
- the same evaluation metric
Otherwise the comparison conclusions do not hold.
### 3.4 Advanced-tuning constraints
At least one main model must be tuned with a genuinely advanced method, for example:
- `Optuna/TPE`
- Bayesian optimisation
- `Hyperopt`
- `Ray Tune`
The PDF states explicitly:
- `RandomizedSearchCV` alone is usually not enough for the top band
so the main plan should use `Optuna`.
### 3.5 Personalised-improvement constraints
You must complete:
- the `1` compulsory category determined by the last digit of your student ID
- at least `1` additional optional category
Recommended but not compulsory:
- a `2nd` optional category, for extra differentiation
### 3.6 Unsupervised-exploration constraints
A compact version of the following is compulsory:
- `K-Means`
- `Gaussian Mixture Model (GMM)`
The point is not for clustering scores to beat the supervised model, but to demonstrate:
- understanding of how the unsupervised methods work
- cautious interpretation of the results
- awareness of the difference between `hard assignment` and `soft assignment`
### 3.7 Hidden-test export constraints
The final CSV must:
- be named `test_result_[your_student_id].csv`
- have `applicant_id` as the first column
- have `customer_key` as the second column
- have `premium_risk` as the third column
- use only the following prediction labels:
- `Standard`
- `High`
- `Low`
The PDF states explicitly:
- naming or format errors automatically lose `4` marks in this part
- tuning on the hidden test is not allowed
- claiming hidden-test performance is not allowed
### 3.8 PDF writing constraints
The `Theory and Reflection PDF` must:
- not exceed `4` pages
- not exceed `1200` words
- not simply repeat the notebook content
- cite at least one table, figure, or metric from the notebook for every theory question
Exceeding the page or word limit costs a fixed `5` marks.
### 3.9 AI-use constraints
The original PDF sets stricter AI-use boundaries than a typical module:
- generating the answers directly with ChatGPT is not allowed
- AI may only serve as a limited support tool for:
- code comprehension
- debugging
- language polishing
- AI must not replace:
- method design
- ablation logic
- qualitative analysis
- reflective writing
The final submission must therefore clearly show that you ran the experiments and analysis yourself.
---
---
## 4. 数据集层面的已知信息
基于对 `dataset_final(1).zip` 的结构和文件头信息检查,目前已确认:
### 4.1 数据文件
压缩包内包含:
- `dataset_final/train.csv`
- `dataset_final/val.csv`
- `dataset_final/test_features.csv`
### 4.2 数据规模
- `train.csv``74375 x 33`
- `val.csv``13125 x 33`
- `test_features.csv``12500 x 32`
说明:
- 训练集和验证集包含标签列 `premium_risk`
- hidden-test 文件不包含标签
### 4.3 字段结构
训练集列名包括:
- 标识类字段:`applicant_id`, `customer_key`, `applicant_ref_code`
- 时间/类别字段:`application_month`, `employment_sector`, `prior_debt_products`, `debt_portfolio_quality`, `account_tenure`, `minimum_payment_only`, `spending_profile`
- 数值特征:收入、负债、额度变化、查询次数、逾期情况、投资金额、余额等
- 明显可疑字段:`bureau_risk_index`
- 噪声字段:`noise_feature_1``noise_feature_5`
- 标签:`premium_risk`
### 4.4 类别分布
训练集标签分布:
- `Standard`: `39686`
- `High`: `21586`
- `Low`: `13103`
结论:
- 数据明显不平衡
- 使用 `macro-F1` 作为主指标完全合理
- 在个性化改进中,`Category C` 类的重采样、类权重、阈值逻辑会很自然
### 4.5 缺失值概览
当前观察到缺失值较多的字段包括:
- `net_monthly_income_gbp`
- `avg_payment_delay_days`
- `monthly_investment_gbp`
- `prior_debt_products`
- `account_tenure`
- `late_payment_count`
- `credit_limit_change_pct`
- `credit_inquiry_count`
- `end_month_balance_gbp`
这说明预处理部分不能仅做“简单删行”,更适合使用 pipeline 化的缺失值处理方案。
---
## 5. Understanding the Nature of the Task
At its core, this assignment does not test "who tunes the highest-scoring model" but four abilities:
- can you build a disciplined machine-learning experimental workflow
- can you spot unreasonable features and avoid data leakage
- can you run fair comparisons, sensible tuning, and evidence-driven analysis
- can you map theoretical concepts one-to-one onto your own experimental results
A high-scoring solution must therefore simultaneously deliver:
- reasonable model results
- a disciplined experimental process
- sufficient analytical argument
- strict correspondence between notebook and PDF
---
## 6. Notebook Requirements Breakdown
The notebook work is broken down below, following the original marking structure.
### 6.1 Part A: cleaning, preprocessing, and baseline modelling
Required content:
- load `train.csv`, `val.csv`, `test_features.csv`
- define `X_train / y_train / X_val / y_val / X_test` explicitly
- identify and remove the leakage feature
- handle dirty values, missing values, and categorical variables
- build a baseline pipeline
- report for the baseline:
- `accuracy`
- `macro-F1`
- confusion matrix
- ensure train, val, and test share exactly the same preprocessing rules
Suggested implementation:
- use `ColumnTransformer + Pipeline`
- numerical features:
- `SimpleImputer(strategy='median')`
- categorical features:
- `SimpleImputer(strategy='most_frequent')`
- `OneHotEncoder(handle_unknown='ignore')`
- baseline model:
- `LogisticRegression`
- as a simple baseline before `HistGradientBoosting`
The safer recommendation is:
- use `LogisticRegression(class_weight='balanced')` as the baseline
Reasons:
- simple
- interpretable
- a good starting point for comparison against the later tree models
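The suggested baseline can be sketched as a single scikit-learn pipeline. This is a minimal illustration: the column names and the tiny demo frame are placeholders, not the real dataset schema.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column lists -- replace with the real dataset schema.
num_cols = ["income", "balance"]
cat_cols = ["sector"]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

baseline = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny synthetic demo so the pipeline runs end to end.
X = pd.DataFrame({
    "income": [1000.0, None, 3000.0, 4000.0],
    "balance": [10.0, 20.0, None, 40.0],
    "sector": ["a", "b", np.nan, "a"],
})
y = ["Low", "High", "Standard", "Low"]
baseline.fit(X, y)
preds = baseline.predict(X)
print(len(preds))
```

Because imputation and encoding live inside the pipeline, the same fitted rules are applied to train, val, and test automatically.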
### 6.2 Leakage-feature identification strategy
Since the PDF requires you to identify the leakage yourself, do not simply declare a column to be leakage in the final notebook.
Use the following evidence chain instead:
1. shortlist high-risk columns from business semantics
Priority candidates to check:
- `bureau_risk_index`
- any field that looks like a post-hoc statistic or a near-definition of the label
2. run a univariate or minimal-model screen
For example:
- train a simple model on one column at a time
- compare the validation `macro-F1` each column achieves on its own
3. check for "abnormally high" predictive power
If a single column alone gets abnormally close to the target label, it is highly suspect as leakage
4. rebuild the baseline after removing the feature
State clearly in the notebook:
- the risk before removal
- the reason for removal
- why all later analysis must be based on the post-removal data
Notes:
- judging by field names, `bureau_risk_index` is the top candidate to suspect
- but in the formal submission, phrase it as "identified via field semantics plus validation evidence as leakage or near-leakage"
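The single-feature screening step can be sketched as follows. The data here is synthetic (one pure-noise column and one column that deliberately leaks the label) and the `0.85` threshold is illustrative; on the real data you would loop over every candidate column.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)  # three risk classes

# Hypothetical features: one pure noise, one that leaks the label.
features = {
    "noise_feature": rng.normal(size=300),
    "leaky_feature": y + rng.normal(scale=0.01, size=300),
}

THRESHOLD = 0.85  # single-feature macro-F1 above this is treated as leakage
suspects = []
for name, col in features.items():
    # Cross-validated macro-F1 of a shallow tree trained on this column alone.
    score = cross_val_score(
        DecisionTreeClassifier(max_depth=3, random_state=0),
        col.reshape(-1, 1), y, cv=5, scoring="f1_macro",
    ).mean()
    if score > THRESHOLD:
        suspects.append(name)
print(suspects)
```

A noise column scores near chance level, while a near-copy of the label scores close to 1.0 and is flagged.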
### 6.3 Part B: controlled comparison of Random Forest vs boosting
Required:
- keep the same preprocessing
- compare:
- `RandomForestClassifier`
- one boosting model
Recommended boosting models, in order of preference:
1. `XGBoost`
2. `LightGBM`
3. `HistGradientBoostingClassifier`
Reasons:
- the PDF explicitly recommends `XGBoost`
- a richer tuning space later
- easier to do high-quality hyperparameter optimisation
This part should output at least:
- a model comparison table with:
- accuracy
- macro-F1
- training time
- a confusion matrix per model
- a classification report or per-class F1
- a short explanation of how bagging and boosting differ on this dataset
Key point:
- the goal is not to prove one is always stronger
- but to explain which bias-variance profile better suits the current dataset
### 6.4 Part C: advanced hyperparameter optimisation
Required:
- choose at least one main model
- tune it with an advanced optimisation method
- target the validation `macro-F1` as the objective
Recommended main model:
- `XGBoost`
Recommended optimiser:
- `Optuna` with `TPESampler`
Suggested search space, for example:
- `n_estimators`
- `max_depth`
- `learning_rate`
- `min_child_weight`
- `subsample`
- `colsample_bytree`
- `gamma`
- `reg_alpha`
- `reg_lambda`
Suggested outputs:
- a best-parameters table
- a summary of the top trials
- a before/after tuning comparison table
- an interpretation of which hyperparameters mattered most
Notes:
- briefly justify why the search space is set this way
- state which parameters you expected to matter most
- state whether the tuning results matched expectations
This part feeds directly into Question 2 of the PDF.
### 6.5 Part D: personalised improvement
This is the notebook part with the largest weight and the most room to stand out.
#### Mapping from the last digit of the student ID
- `0-1` -> `Category A`: data quality and missingness mechanisms
- `2-3` -> `Category B`: feature representation and feature engineering
- `4-5` -> `Category C`: class imbalance and objective design
- `6-7` -> `Category D`: robustness, calibration, or ensembling
- `8-9` -> `Category E`: fairness, diagnostics, or interpretability
#### Suggested concrete options per category
`Category A` options:
- `IterativeImputer`
- finer-grained missingness indicators
- outlier clipping or winsorization
- unified dirty-value cleaning
`Category B` options:
- feature crosses
- category merging
- safe representations other than target or unsupervised encodings
- log transforms, ratio features, derived variables such as debt-to-income
`Category C` options:
- `class_weight`
- `SMOTE` or other resampling
- focal-loss-style alternatives
- validation-based threshold adjustment
`Category D` options:
- probability calibration
- soft voting
- stacking
- bootstrap stability tests
`Category E` options:
- `SHAP`
- permutation importance
- group fairness checks
- error-sample analysis
- summarising high-risk misclassification patterns
#### Strongly recommended approach
Whatever your compulsory category, add one easy-to-demonstrate optional category as well:
- if the main model is a tree model, prefer adding:
- `Category E` interpretability
- or `Category D` ensembling/calibration
Because:
- the PDF explicitly welcomes "concrete insight"
- interpretability and error analysis feed the reflection writing
- they give the PDF real evidence and keep it from being vague
#### Evidence the personalised-improvement part must contain
- a compact ablation table
- before/after accuracy / macro-F1 comparison
- class-wise F1 where needed
- a short explanation of:
- what you did
- why you did it that way
- whether the results improved
- if not, why the work was still worthwhile
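The Category A missingness-indicator idea (one of the options above) can be sketched in a few lines of pandas; the column names and values are placeholders. In a real pipeline, `SimpleImputer(add_indicator=True)` inside the fitted preprocessing step is the safer equivalent, since it learns the fill values on the training split only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "net_monthly_income_gbp": [2500.0, np.nan, 3100.0, np.nan],
    "avg_payment_delay_days": [0.0, 4.0, np.nan, 2.0],
})

# Add one binary indicator per high-missingness column *before* imputation,
# so the pattern of missingness survives as a feature.
for col in ["net_monthly_income_gbp", "avg_payment_delay_days"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)
    df[col] = df[col].fillna(df[col].median())

print(df["net_monthly_income_gbp_missing"].tolist())
```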
### 6.6 Part E: K-Means and GMM exploration
Keep this part compact; it should not become the main thread.
Suggested workflow:
1. take a clustering-friendly numerical subspace from the cleaned features
2. scale first where necessary
3. optionally reduce dimensionality:
- PCA 2D for visualisation
4. for `k=2~8`, run both:
- `KMeans`
- `GaussianMixture`
Suggested outputs:
- `K-Means`:
- inertia / SSE curve
- cluster sizes
- silhouette score (optional extra)
- `GMM`:
- BIC / AIC or log-likelihood trends
- component sizes
- posterior probability / responsibility statistics
Finish with a comparison table or figure answering:
- why `K-Means` is a hard assignment
- why `GMM` is a soft assignment
- whether the current data has fuzzy boundaries
- whether `GMM` additionally reveals uncertainty or overlapping structure
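The hard vs. soft assignment contrast shows up directly in the two estimators' outputs; synthetic blobs stand in for the real feature space here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# K-Means: one integer label per sample (hard assignment).
hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# GMM: a full posterior over components per sample (soft assignment).
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
resp = gmm.predict_proba(X)  # shape (200, 3), each row sums to 1

print(hard[:5], resp[0].round(3))
```

`gmm.bic(X)` and `gmm.aic(X)` give the per-`k` values for the BIC/AIC trend plots.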
### 6.7 Part F: final model and hidden-test export
Principles that must hold:
- the final model may only be chosen from validation results
- re-tuning based on hidden-test results is not allowed
Suggested workflow:
1. freeze the final pipeline
2. retrain on the merged `train + val`
3. predict on `test_features.csv`
4. generate a CSV that strictly matches the required format
Recommended export logic:
- keep from `test_features.csv`:
- `applicant_id`
- `customer_key`
- add one column:
- `premium_risk`
The final column order must be:
1. `applicant_id`
2. `customer_key`
3. `premium_risk`
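The export step can be sketched as below. The two-row frame, the hard-coded predictions, and the student ID are placeholders for the real `test_features.csv`, the fitted final model, and your own ID.

```python
import pandas as pd

# Placeholder hidden-test frame and placeholder predictions.
test_features = pd.DataFrame({
    "applicant_id": [1, 2],
    "customer_key": ["K1", "K2"],
    "net_monthly_income_gbp": [2500.0, 3100.0],
})
predictions = ["Low", "High"]  # would come from final_model.predict(...)

# Keep the two identifier columns, then append the prediction column,
# so the required column order is guaranteed.
out = test_features[["applicant_id", "customer_key"]].copy()
out["premium_risk"] = predictions

student_id = "1234560"  # illustrative ID
out.to_csv(f"test_result_{student_id}.csv", index=False)
print(list(out.columns))
```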
---
## 7. Recommended Overall Implementation Route
Below is a practical, mark-oriented main line of implementation.
### 7.1 Suggested tech stack
- Python
- `pandas`
- `numpy`
- `scikit-learn`
- `xgboost`
- `optuna`
- `matplotlib`
- `seaborn`
- `shap` (if taking the interpretability direction)
- `imbalanced-learn` (if taking the resampling direction)
### 7.2 Suggested project structure
```text
coursework_ml/
├─ notebook/
│ └─ insurance_premium_risk.ipynb
├─ src/
│ ├─ data_utils.py
│ ├─ features.py
│ ├─ metrics.py
│ ├─ tuning.py
│ └─ export.py
├─ outputs/
│ ├─ figures/
│ ├─ tables/
│ └─ predictions/
├─ report/
│ └─ theory_and_reflection.pdf
└─ README.md
```
To reduce complexity, you can also keep just:
- one main notebook
- one or two helper `.py` scripts
The key is reproducibility, not artificial complexity.
### 7.3 Suggested notebook chapters
Organise the notebook in this order:
1. Introduction and Setup
2. Data Loading
3. Data Cleaning and Leakage Check
4. Baseline Pipeline
5. Controlled Comparison: Random Forest vs Boosting
6. Advanced Hyperparameter Optimisation
7. Personalised Improvement Work
8. K-Means and GMM Exploration
9. Final Model Selection
10. Hidden-Test Export
11. Conclusion
Advantages:
- aligns tightly with the marking structure
- makes it easy to back-reference tables and figures when writing the PDF
---
## 8. Recommended Model Plan
### 8.1 Baseline
Recommended:
- `LogisticRegression`
Purpose:
- provides the lowest comparable baseline
- verifies that the preprocessing chain is stable
- contrasts a linear model against the tree models
### 8.2 Bagging option
Recommended:
- `RandomForestClassifier`
Things to watch:
- how it copes with one-hot encoded categorical variables
- whether it is stable but not aggressive enough
- whether it underperforms on the minority classes
### 8.3 Boosting option
First choice:
- `XGBoost`
Alternative:
- `LightGBM`
If environment dependencies are constrained, fall back to:
- `HistGradientBoostingClassifier`
### 8.4 Final-model candidates
The most likely winning route is usually:
- remove the leakage feature
- use a stable preprocessing pipeline
- use `XGBoost` as the main model
- add the compulsory improvement matching your student-ID category
- add one interpretability or calibration optional improvement
A solid final candidate combination is:
- main model: tuned `XGBoost`
- compulsory improvement: by last digit of student ID
- optional improvement: `SHAP + error analysis` or `probability calibration`
---
## 9. Mapping Plan for the PDF Write-up
To keep the PDF and notebook in sync, prepare the following evidence slots from the start of the notebook design.
### 9.1 Bagging vs Boosting
The PDF must answer:
- definitions and properties of bagging and boosting
- validation results of the two models
- supporting analysis
- an interpretation tied to this dataset
The notebook should prepare in advance:
- the `RF vs XGB` comparison table
- confusion matrices
- class-wise F1
- stability or train/validation behaviour before and after tuning
### 9.2 Hyperparameter Optimisation
The PDF must answer:
- why the optimiser is a reasonable choice
- why the search space is reasonable
- which parameters matter most
- whether the tuning results matched expectations
The notebook should prepare in advance:
- the Optuna study results table
- the best parameters
- the metric change before/after tuning
- a parameter importance plot if possible
### 9.3 K-Means vs GMM
The PDF must answer:
- hard vs soft assignment
- the differing assumptions of the two methods
- observed conclusions on the current data
The notebook should prepare in advance:
- one clustering comparison figure
- one metric comparison table
- a paragraph on overlap and uncertainty
### 9.4 Personalised Reflection
The PDF must answer:
- what the compulsory category did
- what the optional category did
- the problems encountered
- the effort made
- what was learned
The notebook should prepare in advance:
- the ablation table
- before/after result comparisons
- if there were failed experiments, keep the one or two most important as evidence
### 9.5 AI-use Declaration
In the PDF, use honest, restrained, verifiable wording, for example:
- AI was used to help understand error messages, check code logic, and polish language
- all method design, experiment execution, result verification, and conclusion writing were done by me
- all tables, figures, and conclusions are grounded in the notebook's experimental results
Notes:
- do not write "AI completed the model design for me"
- do not make general claims that the notebook evidence cannot support
---
## 10. Risk Analysis
### 10.1 Biggest risk: not removing the leakage feature first
Consequences:
- scores look very high
- but the entire analysis is treated as distorted
- it undermines the baseline, the comparison, the tuning, and the final model choice
### 10.2 Common risk: unfair comparisons
Symptoms:
- the baseline and later models use different preprocessing
- one model trains on train, another on train+val
- a tuned model is compared head-to-head with default models
Consequence:
- the conclusions lose credibility
### 10.3 Common risk: reporting only accuracy
Because of the class imbalance:
- accuracy alone hides minority-class problems
- it is easy to lose analytical depth on the `Low` and `High` classes
### 10.4 Common risk: personalised improvement done as "many trials, no logic"
The original PDF explicitly values:
- meaningful diagnostics
- concrete insight
over:
- a pile of scattered experiment screenshots
### 10.5 Common risk: a vague PDF
If the PDF merely restates textbook knowledge without citing specific figures, tables, or metrics from the notebook, it will visibly lose marks.
### 10.6 Common risk: CSV format errors
In particular avoid:
- a wrong filename
- a wrong column order
- misspelled labels
- dropping `applicant_id` or `customer_key` on export
---
## 11. Suggested Execution Order
The most efficient order is:
1. load and inspect the column structure of train / val / test
2. find and remove the leakage feature
3. build the unified preprocessing pipeline
4. run the baseline
5. run the initial `Random Forest vs XGBoost` comparison
6. pick one main model and run `Optuna`
7. complete the compulsory improvement for your student-ID digit
8. add one optional improvement
9. run the compact `K-Means + GMM` exploration
10. select and retrain the final model
11. export `test_result_[student_id].csv`
12. write the PDF last
Reasons:
- every PDF answer depends on notebook evidence
- writing the PDF first leads to vagueness and evidence mismatch
---
## 12. Suggested Time Allocation
For a complete, high-quality submission, allocate roughly:
- data cleaning and leakage identification: `15%`
- baseline and model comparison: `20%`
- advanced tuning: `20%`
- personalised improvement: `25%`
- K-Means / GMM: `10%`
- final export and submission checks: `5%`
- PDF writing: `5%`
If time is tight, the parts that must not be compressed are:
- leakage identification
- the controlled comparison
- the personalised improvement
- the evidence correspondence between PDF and notebook
---
## 13. Most Recommended Concrete Plan
Under the current requirements, a relatively safe plan that balances marks and implementation cost is:
### 13.1 Notebook main line
- remove the leakage feature
- build `ColumnTransformer + Pipeline`
- use `LogisticRegression` as the baseline
- for the controlled comparison use:
- `RandomForestClassifier`
- `XGBoost`
- tune `XGBoost` with `Optuna`
- for the personalised improvement do:
- the compulsory category for your student ID
- plus one of `Category E` or `Category D`
- for the unsupervised exploration do:
- `KMeans`
- `GaussianMixture`
- the final model will most likely be:
- the tuned `XGBoost` or an enhanced version of it
### 13.2 PDF main line
Write one answer per compulsory prompt, each strictly bound to at least one of the following from the notebook:
- a table
- a figure
- a metric result
And structure each answer as:
- brief theory first
- then the experimental data from this coursework
- then a dataset-specific interpretation
### 13.3 Pre-submission checklist
- the notebook runs from start to finish
- figures and tables are visible in the notebook
- every number mentioned in the PDF matches the notebook exactly
- the hidden-test CSV is named correctly
- the hidden-test CSV column order is correct
- the label spellings are correct
- all extra scripts are submitted
---
## 14. Information Still to Be Confirmed
The following must still be confirmed before this generic plan can be specialised into a final implementation plan:
- the `last digit of your student ID`
- whether you intend to use `XGBoost`
- which direction you prefer for the optional improvement:
- interpretability
- ensembling/calibration
- class imbalance
The most critical item is:
- the `last digit of the student ID`
because it directly determines which of `Category A/B/C/D/E` is your compulsory item.
---
## 15. Conclusion
The optimal strategy for this coursework is not "blindly chase the highest-scoring model" but:
- first ensure a disciplined experimental process
- then ensure fair comparisons
- then create separation through advanced tuning and the personalised improvement
- finally make the PDF and the notebook form a tight evidence loop
When you start implementing, follow the execution order in Section `11` of this document, and confirm the last digit of the student ID first so the personalised improvement plan can be finalised.
+251
View File
@@ -0,0 +1,251 @@
XJTLU Entrepreneur College (Taicang) Cover Sheet
Module code and Title DTS307TC Reinforcement Learning
School Title School of AI and Advanced Computing
Assignment Title Coursework 1
Submission Deadline 04/May/2026 23:59
Final Word Count
If you agree to let the university use your work anonymously for teaching
and learning purposes, please type “yes” here.
I certify that I have read and understood the University's Policy for dealing with Plagiarism,
Collusion and the Fabrication of Data (available on Learning Mall Online). With reference to this
policy I certify that:
• My work does not contain any instances of plagiarism and/or collusion.
• My work does not contain any fabricated data.
By uploading my assignment onto Learning Mall Online, I formally declare
that all of the above information is true to the best of my knowledge and
belief.
Scoring For Tutor Use
Student ID
[Marking grid: stage of marking / marker code; learning outcomes achieved (A, B, C); F/P/M/D (modify as appropriate); final score. 1st marker: red pen. Moderation (green pen): original mark accepted by the moderator Y/N; initials; data entry and score calculation checked by another tutor Y. 2nd marker if needed: green pen.]
For Academic Office Use: possible academic infringement (tick as appropriate: Category A / B / C / D / E); date received; days late; late penalty; total academic infringement penalty.
School of Artificial Intelligence and Advanced Computing
Xi'an Jiaotong-Liverpool University
DTS307TC Reinforcement Learning
Coursework - Individual Report
Due: 04/May/2026 23:59
Weight: 40%
Maximum score: 40 marks
Overview
The purpose of this assignment is to gain experience in Python programming and the design of
reinforcement learning algorithms. You are expected to implement an RL algorithm that solves a
specific environment and to explain the algorithm's methodology. You are also expected to analyse
your results, including the challenges you faced and your solutions.
Learning Outcomes Assessed
A: Systematically understand the fundamental concepts and principles of reinforcement learning
B: Critically analyse real-life problem situations and expertly map them as reinforcement learning
tasks.
C: Mastery of Monte Carlo Methods and Temporal Difference Learning
D: Proficiency in Deep Reinforcement Learning algorithms
Late policy
5% of the total marks available for the assessment shall be deducted from the assessment mark for
each working day after the submission date, up to a maximum of five working days.
Avoid Plagiarism
• Do not submit work from other students.
• Do not share code/work with other students.
• Do not use open-source code as-is or without proper reference.
Risks
• Please read the coursework instructions and requirements carefully. Not following these instructions
and requirements may result in a loss of marks.
• The assignment must be submitted via Learning Mall. Only electronic submission is accepted
and no hard copy submission.
• All students must download their file and check that it is viewable after submission. Documents
may become corrupted during the uploading process (e.g. due to slow internet connections).
However, students are responsible for submitting a functional and correct file for assessments.
• Academic Integrity Policy is strictly followed.
Individual Report (40 marks)
The primary objective of this coursework is to familiarize students with the PPO algorithm using
basic deep learning libraries, enabling them to improve their capability in transferring mathematical
and theoretical knowledge into Python implementation, and to further their understanding of the
actor-critic algorithm.
Algorithm Overview
Proximal Policy Optimization (PPO) is a state-of-the-art reinforcement learning algorithm that optimizes
a stochastic policy in an on-policy manner. To ensure stable training and avoid catastrophic performance
collapse, PPO utilizes a clipped surrogate objective to prevent the policy update from stepping too
far from the current behavior.
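The clipped surrogate objective described above can be sketched numerically; this is a minimal numpy illustration with made-up sample values, not part of the assignment hand-out.

```python
import numpy as np


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Mean clipped surrogate: E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()


# ratio = pi_new(a|s) / pi_old(a|s); three illustrative samples.
ratio = np.array([0.5, 1.0, 1.5])
advantage = np.array([1.0, 1.0, -1.0])
print(ppo_clip_objective(ratio, advantage))
```

Taking the element-wise minimum removes the incentive to push the ratio outside `[1-eps, 1+eps]` in whichever direction would inflate the objective.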
The Environment: CarRacing-v3
We will be using the Car Racing environment from the OpenAI Gymnasium. This environment
features a top-down racing track where the agent must learn to navigate through tiles based on
pixel inputs. You can find more details about this environment on the website:
https://gymnasium.farama.org/environments/box2d/car_racing/
Here's a code snippet for you to get started:

import gymnasium as gym
env = gym.make("CarRacing-v3", render_mode="rgb_array")
env.reset()
Since CarRacing-v3 is quite computationally expensive for a standard laptop (due to the pixel
processing), you might want to consider using a gray-scaling or frame-stacking wrapper to speed up
training. Alternatively, you can use the lab computers, which have GPUs and already have the
environment set up.
The PPO Agent
You will implement an RL agent using PPO to play the CarRacing-v3 environment. The agent
will use the standard observations and actions provided by the environment. You may edit the
environment to speed up your training, but your agent must still perform well in the standard
environment (i.e., removing the camera zoom at the beginning is allowed during training, but
your agent should still be tested in the original environment). You should record your training and
evaluation process using TensorBoard. You should also record important losses and other data for
your later analysis.
The Report
Upon completion of your implementation, you are required to submit a comprehensive technical
report. The report should document your engineering decisions, the theoretical grounding of your
code, and a critical analysis of the agent's performance.
1. Introduction
• Provide a brief overview of Reinforcement Learning in the context of the CarRacing-v3
environment.
• Define the state space (pixels), action space (discrete commands), and the reward structure
of the task.
2. Methodology
• Mathematical Foundation: Formulate the PPO objective function. Explain the significance
of the clipping parameter and the probability ratio.
• Advantage Estimation: Describe your method for calculating advantages (e.g., standard
advantage vs. Generalized Advantage Estimation (GAE)).
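For reference, the clipped surrogate objective and the GAE recursion asked for above take the standard forms from the PPO literature (note that, following the original paper's notation, r_t(θ) is the probability ratio while the r_t inside δ_t is the reward at step t):

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```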
3. Implementation Details
• Describe your implementation, including any challenges faced and how you addressed
them.
• Explain the structure of your policy and value networks.
• Detail the training process and hyperparameters used.
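As a concrete reference for the clipped objective and advantage estimation discussed above, here is a minimal NumPy sketch; the function names and argument conventions are illustrative, and a real implementation would operate on framework tensors (e.g., PyTorch) so gradients can flow:

```python
import numpy as np


def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate (to be *maximised*; negate it for gradient descent)."""
    ratio = np.exp(log_probs_new - log_probs_old)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()


def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` has length len(rewards) + 1 (a bootstrap value is appended);
    `dones[t]` is 1.0 when the episode terminates at step t, else 0.0.
    """
    advantages = np.zeros(len(rewards), dtype=np.float64)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```

When the new and old policies coincide, the ratio is 1 and the surrogate reduces to the mean advantage; when the ratio drifts beyond 1 ± ε on a positive advantage, the clip caps the incentive, which is the mechanism your report should explain.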
4. Results and Analysis
• Present your results (use graphs for better clarity).
• Discuss the performance of your agent and any trends observed.
• Briefly compare your custom implementation's stability and sample efficiency against baseline
benchmarks (e.g., Stable-Baselines3).
5. Conclusion
• Summarize your key findings regarding the sensitivity of PPO to hyperparameter tuning
and the effectiveness of the actor-critic framework in continuous-input environments.
Note: All figures and plots must be clearly labeled with axes titles and legends. Raw code
snippets should be kept to a minimum in the report; focus on high-level logic and pseudo-
code where necessary.
Important Note
• Do NOT use Stable-baselines libraries or any other reinforcement learning specific libraries in
your implementation (You may use tensorboard for recording your results).
• Do NOT exceed the word count limit of 3000 words for the report (references and appendix
excluded).
• Although you are allowed to use any generative AI tools to assist your work, please keep in mind
that you should be using them responsibly. (Good use: Improve your report after writing it
and always review its output to ensure that it is correct. Bad use: copy-pasting an entire report
from AI without any effort of your own.)
Submission Requirements
Please prepare and submit the following documents:
• A cover page featuring your student ID. This page should be the first page of your report.
• A zip file containing all the source codes and your trained agent model, which should be named
using your full name and student ID in the following format: CW1_ID_Name.zip
• One PDF file for your report. The file should be separated from the zip file, which contains your
code. The files should be named in the following format: CW1_ID_Name.pdf
Note that the quality of the code, the clarity of your writing, and the format/style of your report will
be taken into consideration during the evaluation. The detailed rubric is outlined below.
Rubric
CW1 (40 marks)
• Code Performance (6 marks): Code runs without errors and performs tasks as specified.
• Code Quality (6 marks): Code is well-organized, includes meaningful comments, and uses appropriate variable names.
• Methodology (6 marks): Comprehensive coverage of topics with detailed explanations of approaches and methodologies.
• Result Analysis (6 marks): Insightful analysis of results.
• Report Quality (6 marks): Report is well-structured, formatted, and free of grammatical errors.
• Evidence of Work (6 marks): All required elements are included and correct.
• Submission (4 marks): Follows all requirements for submission.
XJTLU Entrepreneur College (Taicang) Cover Sheet

Module code: DTS304TC Machine Learning
School: School of AI and Advanced Computing
Assessment title: Coursework Task 1
Assessment type: Coursework
Submission deadline: 01/May/2026 23:59

I certify that I have read and understood the University's Policy for dealing with Plagiarism, Collusion and the Fabrication of Data (available on Learning Mall Online). My work does not contain any instances of plagiarism and/or collusion. My work does not contain any fabricated data. By uploading my assignment onto Learning Mall Online, I formally declare that all of the above information is true to the best of my knowledge and belief.

Student ID: ______
Theory and Reflection PDF word count (filled by student): ______

For Tutor Use: a scoring grid recording the stage of marking (1st marker in red pen; moderation in green pen, where the moderator indicates Y/N whether the original mark is accepted and another tutor confirms that data entry and score calculation have been checked; 2nd marker in green pen if needed), the learning outcomes achieved (A, B, C), F/P/M/D grading, and the final score.

For Academic Office Use: date received, days late, late penalty, and possible academic infringement category (A, B, C, D, or E, modified where necessary) with the total academic infringement penalty.
DTS304TC Machine Learning
Coursework - Assessment Task 1
• Percentage in final mark: 50%
• Assessment type: individual coursework
• Submission files: one Jupyter notebook (.ipynb), one Coursework Answer Sheet / Theory and Reflection PDF, and one
hidden-test CSV
Learning outcomes assessed
• A. Demonstrate a solid understanding of the theoretical issues related to problems that machine-learning methods try to
address.
• B. Demonstrate understanding of the properties of existing machine-learning algorithms and how they behave on practical data.
Notes
• Please read the coursework instructions and requirements carefully. Not following these instructions and requirements may
result in a loss of marks.
• The formal procedure for submitting coursework at XJTLU is strictly followed. Submission link on Learning Mall will be provided
in due course. The submission timestamp on Learning Mall will be used to check late submission.
• 5% of the total marks available for the assessment shall be deducted from the assessment mark for each working day after the
submission date, up to a maximum of five working days.
• All modelling work must be completed individually. Discussion of general ideas is allowed, but code, experiments, and
notebooks must be independently developed.
• You may not use ChatGPT to directly generate answers for the coursework. High-scoring work must demonstrate your own
experimental design, controlled comparisons, failure analysis, and image-level interpretation. ChatGPT or similar tools may be
used only in a limited support role such as code understanding, debugging, or grammar support. They must not replace your
method design, ablation logic, qualitative analysis, or reflection. Generic AI-produced descriptions without matching evidence in
code, tables, figures, and discussion will not receive high marks.
• If you use AI tools or outside code in any meaningful way, you must fully understand, verify, and take ownership of every
method, number, figure, and written claim that appears in your submission.
Question 1: Notebook-Based Coding Exercise - Insurance Premium-Risk Classification (60
Marks)
In this coursework you will build and improve a multiclass classifier for a fictionalised health-insurance dataset. The task is
to predict whether each applicant belongs to a Low, Standard, or High premium-risk group before pricing a policy. The
dataset is intentionally realistic: it mixes numerical and categorical variables, contains missing values and dirty entries, and
includes some fields that require careful handling to avoid weak modelling practice or label leakage.
Your work should show a clear machine-learning workflow: build a sensible first pipeline, compare model families, apply
stronger hyperparameter optimisation, complete one compulsory improvement category plus at least one optional category,
carry out a compact K-Means/Gaussian Mixture Model (GMM) exploration, and then produce a hidden-test CSV using
validation evidence only.
The prediction target variable is premium_risk, and it has 3 imbalanced classes: Standard, High, Low. The dataset
contains 33 raw columns: admin/PII columns, synthetic noise features, 1 leakage feature, and genuine predictors.
Unless otherwise stated, macro-F1 is the primary validation metric because the dataset is imbalanced; accuracy is reported
as a secondary metric.
(A) Clean First Pipeline and Baseline Modelling (8 marks)
• Load the provided training and validation files and define a consistent target / feature setup.
• Handle leakage features, dirty values, missing values, and categorical variables sensibly. A compact sanity check is enough; a
long data-audit section is not required.
Important: The dataset contains a leakage feature. You must identify and remove it before proceeding to the next stage
of analysis; otherwise, the classification results will be severely biased by this leakage and will not be meaningful. If
this occurs, multiple parts of your Coursework 1 may be affected, which could significantly impact your marks.
• Build one baseline modelling pipeline.
• Report at least one validation result using accuracy and macro-F1 score and include a confusion matrix for the baseline model.
• Keep preprocessing consistent across train, validation, and hidden-test files.
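As an illustration of what "one baseline modelling pipeline" with consistent preprocessing can look like, here is a scikit-learn sketch; the column names below are placeholders for illustration only, not the actual dataset schema, and the leakage feature is assumed to have been dropped beforehand:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder feature lists -- replace with the real columns after cleaning.
numeric_cols = ["age", "bmi"]
categorical_cols = ["smoker_status"]

# One preprocessing object reused for train, validation, and hidden test
# keeps the transformations consistent across all three files.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

baseline = Pipeline([("prep", preprocess),
                     ("clf", LogisticRegression(max_iter=1000))])
```

Fitting the whole `Pipeline` (rather than pre-transformed arrays) is what prevents validation or hidden-test information from leaking into the imputation and scaling statistics.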
(B) Controlled Comparison: Random Forest and One Boosting Model (8 marks)
• Using the same preprocessing pipeline, validation split, and evaluation metric (the primary metric is macro-F1; also report accuracy),
carry out an initial controlled comparison between one Random Forest model and one boosting model.
• Default XGBoost is recommended because it provides a richer tuning space later, but others may also be used. Default settings
or only light sensible adjustments are acceptable in this section.
• In the notebook, report the validation result of each model and support the comparison with one or two additional analyses, such
as class-wise metrics, a confusion matrix, train-versus-validation behaviour, or stability / sensitivity after tuning.
• Your goal is not to prove that one model type always wins. Your goal is to compare the two models fairly, explain the high-level
learning difference between bagging and boosting, and use your own notebook evidence to give a careful, dataset-specific
interpretation. A generic textbook answer without reference to your own results will receive limited credit.
(C) Advanced Hyperparameter Optimisation (12 marks)
• At least one main model should be tuned with a genuinely advanced strategy such as Optuna/TPE, Bayesian optimisation,
Hyperopt, Ray Tune, or another comparably strong approach.
• Hyperparameter tuning should optimise macro-F1 score on the validation set, and the final tuned result should be reported
using both accuracy and macro-F1.
• RandomizedSearchCV alone is normally not enough for the top band.
• Explain briefly why your search space and optimiser are reasonable for the chosen model.
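Whichever optimiser you choose, it ultimately maximises a scalar objective, and here that scalar should be validation macro-F1. One way to frame it, sketched with a Random Forest and illustrative names (an Optuna/TPE study, for example, would propose `params` and call such a function repeatedly), is:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score


def objective(params, X_train, y_train, X_val, y_val):
    """Validation macro-F1 for one hyperparameter configuration.

    The search library proposes `params`; this function is what it maximises.
    Names and the choice of Random Forest here are illustrative only.
    """
    model = RandomForestClassifier(**params, random_state=0)
    model.fit(X_train, y_train)
    return f1_score(y_val, model.predict(X_val), average="macro")
```

Framing tuning this way also makes the "why this search space" discussion concrete: each key of `params` is a dimension you must justify.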
(D) Personalised Improvement Work (18 marks)
You must complete one compulsory category based on the last digit of your XJTLU student ID, plus at least one additional
optional category of your choice. A second optional category is recommended for stronger differentiation but is not compulsory.
You should report accuracy and macro-F1 for improved models and include class-wise metrics where helpful. A compact ablation
table should normally be included in the notebook for the personalised improvement work.
Last digit Compulsory category
0-1 Category A - Data quality and missingness
2-3 Category B - Feature representation and engineering
4-5 Category C - Imbalance and objective design
6-7 Category D - Model robustness, calibration, or ensembling
8-9 Category E - Fairness, diagnostics, or interpretability
Category A
• Examples of what may be done: better missing-value strategy; MissForest or iterative imputation; sensible outlier handling; value cleaning.
• What good evidence looks like: a concise before/after comparison with a short explanation of why the data handling changed the result.
Category B
• Examples of what may be done: feature crosses; grouped categories; alternative encodings; modest feature selection; transformations.
• What good evidence looks like: a compact ablation showing what representation changed and whether it helped.
Category C
• Examples of what may be done: class weighting; focal-style loss if relevant; sampling / resampling; thresholding logic.
• What good evidence looks like: clear evidence of how minority or harder classes changed, even if the overall score moved only slightly.
Category D
• Examples of what may be done: bagging/boosting variants; calibration checks; soft voting; stacking; robustness checks.
• What good evidence looks like: a meaningful diagnostic or comparison rather than a large collection of loosely connected trials.
Category E
• Examples of what may be done: SHAP / feature importance; subgroup-style fairness checks; error analysis; model interpretation.
• What good evidence looks like: concrete insight into model behaviour, not only screenshots.
(E) K-Means and Gaussian Mixture Model (GMM) Exploration (6 marks)
This is a compact exploratory section. It is not the main performance section, and it does not require clusters to match the class
labels exactly. The aim is to show your understanding of unsupervised learning methods and your ability to interpret their results
carefully.
• Use a sensible processed numeric feature space and briefly explain what you clustered on.
• Explore a small range of cluster/component numbers, such as 2-8.
• For K-Means, provide sensible supporting evidence, such as inertia (SSE), cluster sizes, or another simple analysis.
• For Gaussian Mixture Model (GMM), provide sensible supporting evidence, such as component sizes, posterior
confidence/responsibility, or overlap/uncertainty between components.
• Include at least one compact table or figure comparing K-Means and GMM.
• If class labels are used for reference, explain clearly that unsupervised structure does not need to align exactly with supervised
labels.
• Stronger work may additionally use silhouette score, log-likelihood trends, or a simple visualization.
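A compact way to generate the supporting evidence listed above is to sweep the cluster count once and record K-Means inertia alongside GMM BIC. The sketch below assumes `X` is your already-processed numeric feature matrix; the function name and the choice of BIC as the GMM criterion are our own illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


def sweep_clusters(X, k_range=range(2, 9), seed=0):
    """Fit K-Means and a GMM for each k and record simple model-selection evidence."""
    results = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        results.append({
            "k": k,
            "kmeans_inertia": km.inertia_,  # SSE: lower is tighter, always falls with k
            "gmm_bic": gmm.bic(X),          # BIC: penalises extra components
        })
    return results
```

Because inertia decreases monotonically with k while BIC penalises model complexity, tabulating both side by side gives you the required compact K-Means-versus-GMM comparison in one table.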
(F) Final Model Choice and Hidden-Test Export (8 marks)
• Choose the final model using validation evidence only.
• Retrain appropriately using both train and validation dataset and generate the hidden-test CSV in the required format.
• Submit the hidden-test results as test_result_[your_student_id].csv. The first column must contain applicant_id, the second
column must contain customer_key, and the third column must contain the predicted premium_risk labels (Standard, High,
Low).
Incorrect file naming or CSV formatting may prevent automated scoring and will result in an automatic deduction of 4 marks
from this section.
• Do not tune on the hidden test and do not claim hidden test performance.
• Note: Hidden test score contributes only a small portion of the final marks. High leaderboard rank alone cannot compensate for
weak experimental design or poor documentation.
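To reduce the risk of the 4-mark formatting deduction, the export can be wrapped in a small helper that pins the column order and the file name in one place; the function name below is our own, but the column names and file-name pattern follow the brief:

```python
import pandas as pd


def export_hidden_test(applicant_ids, customer_keys, predictions, student_id):
    """Write hidden-test predictions in the required three-column format."""
    out = pd.DataFrame({
        "applicant_id": applicant_ids,
        "customer_key": customer_keys,
        "premium_risk": predictions,  # values must be Standard / High / Low
    })
    path = f"test_result_{student_id}.csv"
    out.to_csv(path, index=False)
    return path
```

Reading the file back after writing it (as the submission guidelines also suggest for uploaded files) is a cheap final check that the columns and labels survived intact.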
Coursework Answer Sheet / Theory and Reflection (PDF) - all questions below are compulsory
(30 Marks)
The Coursework Answer Sheet / Theory and Reflection PDF should not repeat the notebook section by section. All prompt areas
below are compulsory. The PDF must be concise, directly linked to your own notebook evidence, and no longer than 4 pages /
1,200 words in total. Exceeding either limit will incur a fixed deduction of 5 marks from the PDF section. You should aim to
demonstrate both your theoretical or algorithmic understanding and your experimental findings or practical observations and
clearly link your understanding of the algorithms to your experimental analysis. At least one table, figure, or metric from the
notebook must be referenced in each theory answer.
Prompt area and what you should do:
1. Bagging versus boosting: (1) Briefly state the definitions and key theoretical properties of bagging and boosting models; (2) report the validation results of each model; (3) support your comparison with one or two additional analyses, such as class-wise metrics, a confusion matrix, train-validation behaviour, or stability/sensitivity after tuning; and (4) provide a careful interpretation of what this comparison suggests about this dataset and how it relates to the theoretical properties of bagging versus boosting methods. You are not expected to prove that one model type always performs better.
2. Hyperparameter optimisation: Explain why your optimiser and search space were reasonable for the chosen model, which hyperparameters you expected to matter most, whether the tuned results matched that intuition, and what you learned from the tuning process.
3. K-Means versus Gaussian Mixture Model (GMM): Explain hard versus soft assignment and the main assumption difference between K-Means and GMM. Then use your own compact evidence to discuss whether the results matched your intuition and whether GMM revealed anything extra, such as soft membership, uncertainty, or a better fit to partial cluster structure.
4. Personalised reflection: Reflect on the compulsory category and on every optional category you implemented. Highlight any unique or interesting algorithm or strategy you tried, the personal challenges you faced, the effort you made to address them, and the key lessons you learned. Honest reflection on a neutral or negative result is acceptable if the reasoning is concrete.
5. AI-use declaration: State briefly what forms of AI assistance, if any, were used. Generic AI-written theory that does not match your notebook evidence will receive limited credit.
Coding Quality, Coursework Answer Sheet Quality, and Submission Guidelines (10 marks)
• Submit your Jupyter Notebook in .ipynb format. It must be well organised, include clear commentary and clean code practices,
and show visible outputs. Do not write a second mini-report repeating notebook content.
• The notebook should be reproducible from start to finish without errors. Results cited in the PDF should be visible in the
notebook and should match the reported values.
• If you used supplementary code outside the notebook, submit that code as well so the full workflow remains reproducible.
• Submit the hidden-test results as test_result_[your_student_id].csv. The first column must contain applicant_id, the second
column must contain customer_key, and the third column must contain the predicted premium_risk labels (Standard, High,
Low). Incorrect file naming or CSV formatting may prevent automated scoring and will result in an automatic deduction of 4
marks from this section.
• Submit the Coursework Answer Sheet / Theory and Reflection in PDF format. All questions in that section are compulsory. The
Coursework Answer Sheet / Theory and Reflection PDF must answer every required prompt, refer to your own notebook
evidence, and remain within 4 pages and 1,200 words in total. Exceeding either limit will incur a fixed deduction of 5 marks from
the PDF section.
• Include all required components: Jupyter notebooks (code), any additional experimental scripts or custom code, the hidden
test-results CSV file, and the Coursework Answer Sheet PDF. Submit all files through the Learning Mall platform. After
submission, download your files to verify that they can be opened and viewed correctly to ensure the submission was
successful.
Project Material Access Instructions
To access the complete set of materials for this project, please use the links below:
• OneDrive Link:
https://1drv.ms/f/c/18f09d1a39585f84/IgCXDMbXkFYSSZUZkkTyXyZzAQ1poX9mujUqF8N3JlL0GD0?e=uNhAHq
• The same coursework materials have also been uploaded to Learning Mall.
When extracting the materials, use the following password to unlock the zip file: DTS304TC (case-sensitive, enter in
uppercase).