Add lecture materials for Model-Free, Control, and Value topics
- Added Lecture4 - ModelFree.pdf (3013 KB) - Added Lecture5 - Control.pdf (2575 KB) - Added Lecture6 - Value.pdf (3320 KB)
XJTLU Entrepreneur College (Taicang) Cover Sheet

Module code: DTS304TC Machine Learning
School title: School of AI and Advanced Computing
Assessment title: Coursework Task 1
Assessment type: Coursework
Submission deadline: 01/May/2026 23:59

I certify that I have read and understood the University's Policy for dealing with Plagiarism, Collusion and the Fabrication of Data (available on Learning Mall Online).
My work does not contain any instances of plagiarism and/or collusion.
My work does not contain any fabricated data.
By uploading my assignment onto Learning Mall Online, I formally declare that all of the above information is true to the best of my knowledge and belief.

Scoring – For Tutor Use

Student ID:
Theory and Reflection PDF Word Count (filled by students):

Stage of Marking: 1st Marker – red pen / Moderation (IM) – green pen / 2nd Marker if needed – green pen
Marker Code:
Learning Outcomes Achieved (F/P/M/D, please modify as appropriate): A   B   C
Final Score:
Moderation – green pen (initials): The original mark has been accepted by the moderator (please circle as appropriate): Y / N
Data entry and score calculation have been checked by another tutor (please circle): Y

For Academic Office Use
Received: Date ____   Days Late ____   Late Penalty ____
Possible Academic Infringement (please tick as appropriate):
☐ Category A   ☐ Category B   ☐ Category C   ☐ Category D   ☐ Category E
Total Academic Infringement Penalty (A, B, C, D, E, please modify where necessary): _____________________
DTS304TC Machine Learning
Coursework – Assessment Task 1

• Percentage in final mark: 50%
• Assessment type: individual coursework
• Submission files: one Jupyter notebook (.ipynb), one Coursework Answer Sheet / Theory and Reflection PDF, and one hidden-test CSV

Learning outcomes assessed
• A. Demonstrate a solid understanding of the theoretical issues related to problems that machine-learning methods try to address.
• B. Demonstrate understanding of the properties of existing machine-learning algorithms and how they behave on practical data.
Notes
• Please read the coursework instructions and requirements carefully. Not following these instructions and requirements may result in a loss of marks.
• The formal procedure for submitting coursework at XJTLU is strictly followed. A submission link on Learning Mall will be provided in due course. The submission timestamp on Learning Mall will be used to check for late submission.
• 5% of the total marks available for the assessment shall be deducted from the assessment mark for each working day after the submission date, up to a maximum of five working days.
• All modelling work must be completed individually. Discussion of general ideas is allowed, but code, experiments, and notebooks must be independently developed.
• You may not use ChatGPT to directly generate answers for the coursework. High-scoring work must demonstrate your own experimental design, controlled comparisons, failure analysis, and image-level interpretation. ChatGPT or similar tools may be used only in a limited support role, such as code understanding, debugging, or grammar support. They must not replace your method design, ablation logic, qualitative analysis, or reflection. Generic AI-produced descriptions without matching evidence in code, tables, figures, and discussion will not receive high marks.
• If you use AI tools or outside code in any meaningful way, you must fully understand, verify, and take ownership of every method, number, figure, and written claim that appears in your submission.
Question 1: Notebook-Based Coding Exercise – Insurance Premium-Risk Classification (60 Marks)

In this coursework you will build and improve a multiclass classifier for a fictionalised health-insurance dataset. The task is to predict whether each applicant belongs to a Low, Standard, or High premium-risk group before pricing a policy. The dataset is intentionally realistic: it mixes numerical and categorical variables, contains missing values and dirty entries, and includes some fields that require careful handling to avoid weak modelling practice or label leakage.

Your work should show a clear machine-learning workflow: build a sensible first pipeline, compare model families, apply stronger hyperparameter optimisation, complete one compulsory improvement category plus at least one optional category, carry out a compact K-Means / Gaussian Mixture Model (GMM) exploration, and then produce a hidden-test CSV using validation evidence only.

The prediction target variable is ‘premium_risk’, and it has three imbalanced classes: Standard, High, and Low. The dataset contains 33 raw columns: admin/PII columns, synthetic noise features, one leakage feature, and genuine predictors.

Unless otherwise stated, macro-F1 is the primary validation metric because the dataset is imbalanced; accuracy is reported as a secondary metric.
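The metric choice above is easy to motivate with a toy example: on imbalanced labels, a classifier that always predicts the majority class can look strong on accuracy while macro-F1, which averages per-class F1 equally, collapses. A minimal sketch (illustrative labels only):

```python
from sklearn.metrics import accuracy_score, f1_score

# 8 "Standard", 1 "High", 1 "Low": an imbalanced toy label set.
y_true = ["Standard"] * 8 + ["High", "Low"]
y_pred = ["Standard"] * 10                 # always guess the majority class

acc = accuracy_score(y_true, y_pred)                         # 0.80
macro = f1_score(y_true, y_pred, average="macro",
                 zero_division=0)                            # ~0.30
print(f"accuracy={acc:.2f}  macro-F1={macro:.2f}")
```

The minority classes score an F1 of zero, dragging macro-F1 far below accuracy, which is exactly why macro-F1 is the primary metric here.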
(A) Clean First Pipeline and Baseline Modelling (8 marks)

• Load the provided training and validation files and define a consistent target/feature setup.
• Handle leakage features, dirty values, missing values, and categorical variables sensibly. A compact sanity check is enough; a long data-audit section is not required.

Important: The dataset contains a leakage feature. You must identify and remove it before proceeding to the next stage of analysis; otherwise, the classification results will be severely biased by this leakage and will not be meaningful. If this occurs, multiple parts of your Coursework 1 may be affected, which could significantly impact your marks.

• Build one baseline modelling pipeline.
• Report at least one validation result using accuracy and macro-F1, and include a confusion matrix for the baseline model.
• Keep preprocessing consistent across the train, validation, and hidden-test files.
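One way to satisfy the consistency requirement is a single sklearn Pipeline wrapping a ColumnTransformer, so the identical preprocessing is applied to train, validation, and hidden-test data. The sketch below uses a tiny synthetic frame as a stand-in for the provided CSVs; the column names (`age`, `bmi`, `smoker`) are placeholders, and in your notebook you would also drop the ID columns and the leakage feature you identified.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the released train/validation files.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "bmi": rng.normal(27.0, 4.0, n),
    "smoker": rng.choice(["yes", "no"], n),
    "premium_risk": rng.choice(["Low", "Standard", "High"], n, p=[0.2, 0.6, 0.2]),
})
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan      # missing numeric
df.loc[rng.choice(n, 15, replace=False), "smoker"] = np.nan   # missing categorical

TARGET = "premium_risk"
train, valid = df.iloc[:240], df.iloc[240:]   # in practice: the provided files
X_tr, y_tr = train.drop(columns=[TARGET]), train[TARGET]
X_va, y_va = valid.drop(columns=[TARGET]), valid[TARGET]

num_cols = X_tr.select_dtypes("number").columns.tolist()
cat_cols = [c for c in X_tr.columns if c not in num_cols]

# One ColumnTransformer keeps preprocessing identical across all splits.
pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
baseline = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
baseline.fit(X_tr, y_tr)

pred = baseline.predict(X_va)
acc = accuracy_score(y_va, pred)
macro_f1 = f1_score(y_va, pred, average="macro")
cm = confusion_matrix(y_va, pred, labels=["Low", "Standard", "High"])
print(f"baseline: accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")
print(cm)
```

Because the fitted pipeline carries its own imputation, scaling, and encoding, applying it to the hidden-test file later is a single `baseline.predict(...)` call with no risk of mismatched preprocessing.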
(B) Controlled Comparison: Random Forest and One Boosting Model (8 marks)

• Using the same preprocessing pipeline, validation split, and evaluation metric (macro-F1 as the primary metric; also report accuracy), carry out an initial controlled comparison between one Random Forest model and one boosting model.
• Default XGBoost is recommended because it provides a richer tuning space later, but other boosting models may also be used. Default settings or only light, sensible adjustments are acceptable in this section.
• In the notebook, report the validation result of each model and support the comparison with one or two additional analyses, such as class-wise metrics, a confusion matrix, train-versus-validation behaviour, or stability/sensitivity after tuning.
• Your goal is not to prove that one model type always wins. Your goal is to compare the two models fairly, explain the high-level learning difference between bagging and boosting, and use your own notebook evidence to give a careful, dataset-specific interpretation. A generic textbook answer without reference to your own results will receive limited credit.
(C) Advanced Hyperparameter Optimisation (12 marks)

• At least one main model should be tuned with a genuinely advanced strategy such as Optuna/TPE, Bayesian optimisation, Hyperopt, Ray Tune, or another comparably strong approach.
• Hyperparameter tuning should optimise macro-F1 on the validation set, and the final tuned result should be reported using both accuracy and macro-F1.
• RandomizedSearchCV alone is normally not enough for the top band.
• Explain briefly why your search space and optimiser are reasonable for the chosen model.
(D) Personalised Improvement Work (18 marks)

You must complete one compulsory category based on the last digit of your XJTLU student ID, plus at least one additional optional category of your choice. A second optional category is recommended for stronger differentiation but is not compulsory. You should report accuracy and macro-F1 for improved models and include class-wise metrics where helpful. A compact ablation table should normally be included in the notebook for the personalised improvement work.

Last digit   Compulsory category
0-1          Category A – Data quality and missingness
2-3          Category B – Feature representation and engineering
4-5          Category C – Imbalance and objective design
6-7          Category D – Model robustness, calibration, or ensembling
8-9          Category E – Fairness, diagnostics, or interpretability
Category A
  Examples of what may be done: a better missing-value strategy; MissForest or iterative imputation; sensible outlier handling; value cleaning.
  What good evidence looks like: a concise before/after comparison with a short explanation of why the data handling changed the result.

Category B
  Examples of what may be done: feature crosses; grouped categories; alternative encodings; modest feature selection; transformations.
  What good evidence looks like: a compact ablation showing what representation changed and whether it helped.

Category C
  Examples of what may be done: class weighting; focal-style loss if relevant; sampling/resampling; thresholding logic.
  What good evidence looks like: clear evidence of how minority or harder classes changed, even if the overall score moved only slightly.

Category D
  Examples of what may be done: bagging/boosting variants; calibration checks; soft voting; stacking; robustness checks.
  What good evidence looks like: a meaningful diagnostic or comparison rather than a large collection of loosely connected trials.

Category E
  Examples of what may be done: SHAP / feature importance; subgroup-style fairness checks; error analysis; model interpretation.
  What good evidence looks like: concrete insight into model behaviour, not only screenshots.
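To make the ablation-table expectation concrete, here is a sketch of one Category-C option, class weighting, on toy data. Whether weighting actually helps on the real dataset is exactly what your before/after comparison should show; nothing here guarantees an improvement.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy imbalanced 3-class data standing in for the processed features.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           weights=[0.1, 0.7, 0.2], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

scores = {}
for name, cw in [("unweighted", None), ("balanced", "balanced")]:
    # Only the class_weight setting changes between the two runs.
    clf = LogisticRegression(max_iter=1000, class_weight=cw).fit(X_tr, y_tr)
    scores[name] = f1_score(y_va, clf.predict(X_va), average="macro")

print(scores)   # the two rows of a minimal ablation table
```

Pairing each such one-change run with class-wise metrics shows whether the minority classes moved, which is the kind of evidence the table above asks for.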
(E) K-Means and Gaussian Mixture Model (GMM) Exploration (6 marks)

This is a compact exploratory section. It is not the main performance section, and it does not require clusters to match the class labels exactly. The aim is to show your understanding of unsupervised learning methods and your ability to interpret their results carefully.

• Use a sensible processed numeric feature space and briefly explain what you clustered on.
• Explore a small range of cluster/component numbers, such as 2-8.
• For K-Means, provide sensible supporting evidence, such as inertia (SSE), cluster sizes, or another simple analysis.
• For the Gaussian Mixture Model (GMM), provide sensible supporting evidence, such as component sizes, posterior confidence/responsibility, or overlap/uncertainty between components.
• Include at least one compact table or figure comparing K-Means and GMM.
• If class labels are used for reference, explain clearly that unsupervised structure does not need to align exactly with supervised labels.
• Stronger work may additionally use silhouette score, log-likelihood trends, or a simple visualisation.
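The bullets above amount to one compact sweep over k = 2..8 collecting evidence for both methods. A sketch, with toy blob data standing in for your processed numeric feature space:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Toy numeric features standing in for the processed feature space.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

rows = []
for k in range(2, 9):                      # the 2-8 range from the brief
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    resp = gmm.predict_proba(X)            # soft assignments (responsibilities)
    rows.append({
        "k": k,
        "kmeans_inertia": km.inertia_,                       # SSE
        "kmeans_silhouette": silhouette_score(X, km.labels_),
        "gmm_bic": gmm.bic(X),
        "gmm_mean_confidence": resp.max(axis=1).mean(),      # posterior confidence
    })

for r in rows:
    print(r)
```

Printing `rows` as a small table already satisfies the "compact table or figure" requirement, and the GMM confidence column is one way to discuss hard versus soft assignment in the PDF.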
(F) Final Model Choice and Hidden-Test Export (8 marks)

• Choose the final model using validation evidence only.
• Retrain appropriately using both the train and validation datasets, and generate the hidden-test CSV in the required format.
• Submit the hidden-test results as test_result_[your_student_id].csv. The first column must contain applicant_id, the second column must contain customer_key, and the third column must contain the predicted premium_risk labels (Standard, High, Low). Incorrect file naming or CSV formatting may prevent automated scoring and will result in an automatic deduction of 4 marks from this section.
• Do not tune on the hidden test, and do not claim hidden-test performance.
• Note: the hidden-test score contributes only a small portion of the final marks. A high leaderboard rank alone cannot compensate for weak experimental design or poor documentation.
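Since formatting mistakes here cost marks, it is worth validating the export programmatically. A sketch of the required layout: the column order and label set come from the brief, while the IDs, predictions, and student ID below are placeholders.

```python
import pandas as pd

student_id = "1234567"          # placeholder: replace with your XJTLU student ID
out = pd.DataFrame({
    "applicant_id": [101, 102, 103],               # from the hidden-test file
    "customer_key": ["K-01", "K-02", "K-03"],      # from the hidden-test file
    "premium_risk": ["Standard", "High", "Low"],   # your model's predictions
})

# Sanity checks before writing: required column order and valid label set.
assert list(out.columns) == ["applicant_id", "customer_key", "premium_risk"]
assert set(out["premium_risk"]) <= {"Low", "Standard", "High"}

out.to_csv(f"test_result_{student_id}.csv", index=False)
```

Running these assertions just before submission is a cheap guard against the automatic 4-mark formatting deduction.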
Coursework Answer Sheet / Theory and Reflection (PDF) – all questions below are compulsory (30 Marks)

The Coursework Answer Sheet / Theory and Reflection PDF should not repeat the notebook section by section. All prompt areas below are compulsory. The PDF must be concise, directly linked to your own notebook evidence, and no longer than 4 pages / 1,200 words in total. Exceeding either limit will incur a fixed deduction of 5 marks from the PDF section. You should aim to demonstrate both your theoretical or algorithmic understanding and your experimental findings or practical observations, and clearly link your understanding of the algorithms to your experimental analysis. At least one table, figure, or metric from the notebook must be referenced in each theory answer.
Prompt areas and what you should do:

1. Bagging versus boosting
   (1) Briefly state the definitions and key theoretical properties of bagging and boosting models; (2) report the validation results of each model; (3) support your comparison with one or two additional analyses, such as class-wise metrics, a confusion matrix, train–validation behaviour, or stability/sensitivity after tuning; and (4) provide a careful interpretation of what this comparison suggests about this dataset and how it relates to the theoretical properties of bagging versus boosting methods. You are not expected to prove that one model type always performs better.

2. Hyperparameter optimisation
   Explain why your optimiser and search space were reasonable for the chosen model, which hyperparameters you expected to matter most, whether the tuned results matched that intuition, and what you learned from the tuning process.

3. K-Means versus Gaussian Mixture Model (GMM)
   Explain hard versus soft assignment and the main assumption difference between K-Means and GMM. Then use your own compact evidence to discuss whether the results matched your intuition and whether GMM revealed anything extra, such as soft membership, uncertainty, or a better fit to partial cluster structure.

4. Personalised reflection
   Reflect on the compulsory category and on every optional category you implemented. Highlight any unique or interesting algorithm or strategy you tried, the personal challenges you faced, the effort you made to address them, and the key lessons you learned. Honest reflection on a neutral or negative result is acceptable if the reasoning is concrete.

5. AI-use declaration
   State briefly what forms of AI assistance, if any, were used. Generic AI-written theory that does not match your notebook evidence will receive limited credit.
Coding Quality, Coursework Answer Sheet Quality, and Submission Guidelines (10 marks)

• Submit your Jupyter notebook in .ipynb format. It must be well organised, include clear commentary and clean code practices, and show visible outputs. Do not write a second mini-report repeating notebook content.
• The notebook should be reproducible from start to finish without errors. Results cited in the PDF should be visible in the notebook and should match the reported values.
• If you used supplementary code outside the notebook, submit that code as well so the full workflow remains reproducible.
• Submit the hidden-test results as test_result_[your_student_id].csv. The first column must contain applicant_id, the second column must contain customer_key, and the third column must contain the predicted premium_risk labels (Standard, High, Low). Incorrect file naming or CSV formatting may prevent automated scoring and will result in an automatic deduction of 4 marks from this section.
• Submit the Coursework Answer Sheet / Theory and Reflection in PDF format. All questions in that section are compulsory. The PDF must answer every required prompt, refer to your own notebook evidence, and remain within 4 pages and 1,200 words in total. Exceeding either limit will incur a fixed deduction of 5 marks from the PDF section.
• Include all required components: the Jupyter notebook (code), any additional experimental scripts or custom code, the hidden-test results CSV file, and the Coursework Answer Sheet PDF. Submit all files through the Learning Mall platform. After submission, download your files to verify that they can be opened and viewed correctly, to ensure the submission was successful.
Project Material Access Instructions

To access the complete set of materials for this project, please use the links below:

• OneDrive link: https://1drv.ms/f/c/18f09d1a39585f84/IgCXDMbXkFYSSZUZkkTyXyZzAQ1poX9mujUqF8N3JlL0GD0?e=uNhAHq
• The same coursework materials have also been uploaded to Learning Mall.

When extracting the materials, use the following password to unlock the zip file: DTS304TC (case-sensitive, enter in uppercase).