XJTLU Entrepreneur College (Taicang) Cover Sheet

Module Code and Title: DTS307TC Reinforcement Learning
School Title: School of AI and Advanced Computing
Assignment Title: Coursework 1
Submission Deadline: 04/May/2026 23:59
Final Word Count:

If you agree to let the university use your work anonymously for teaching and learning purposes, please type “yes” here: ______

I certify that I have read and understood the University’s Policy for dealing with Plagiarism, Collusion and the Fabrication of Data (available on Learning Mall Online). With reference to this policy I certify that:

• My work does not contain any instances of plagiarism and/or collusion.
• My work does not contain any fabricated data.

By uploading my assignment onto Learning Mall Online, I formally declare that all of the above information is true to the best of my knowledge and belief.

Scoring – For Tutor Use

Student ID:

Stage of Marking                 Marker Code (please modify as appropriate)   Learning Outcomes Achieved (F/P/M/D): A / B / C   Final Score
1st Marker – red pen
Moderation (IM) – green pen      The original mark has been accepted by the moderator (please circle as appropriate): Y/N   Initials:
                                 Data entry and score calculation have been checked by another tutor (please circle): Y
2nd Marker if needed – green pen

For Academic Office Use          Possible Academic Infringement (please tick as appropriate)
Date Received   Days Late   Late Penalty   ☐ Category A   ☐ Category B   ☐ Category C   ☐ Category D   ☐ Category E
Total Academic Infringement Penalty (A, B, C, D, E, please modify where necessary) _____________________
School of Artificial Intelligence and Advanced Computing
Xi’an Jiaotong-Liverpool University

DTS307TC Reinforcement Learning

Coursework - Individual Report

Due: 04/May/2026 23:59
Weight: 40%
Maximum score: 40 marks
Overview

The purpose of this assignment is to gain experience in Python programming and the design of reinforcement learning algorithms. You are expected to implement an RL algorithm that solves a specific environment and to provide an explanation of the algorithm’s methodology. You are also expected to analyse your results, including the challenges you faced and your solutions.
Learning Outcomes Assessed

A: Systematically understand the fundamental concepts and principles of reinforcement learning.
B: Critically analyse real-life problem situations and expertly map them as reinforcement learning tasks.
C: Mastery of Monte Carlo Methods and Temporal Difference Learning.
D: Proficiency in Deep Reinforcement Learning algorithms.
Late policy

5% of the total marks available for the assessment shall be deducted from the assessment mark for each working day after the submission date, up to a maximum of five working days.
Avoid Plagiarism

• Do not submit work from other students.

• Do not share code/work with other students.

• Do not use open-source code as-is or without proper reference.
Risks

• Please read the coursework instructions and requirements carefully. Not following these instructions and requirements may result in a loss of marks.

• The assignment must be submitted via Learning Mall. Only electronic submission is accepted; no hard-copy submissions.

• All students must download their file and check that it is viewable after submission. Documents may become corrupted during the uploading process (e.g. due to slow internet connections). However, students are responsible for submitting a functional and correct file for assessment.

• The Academic Integrity Policy is strictly enforced.
Individual Report (40 marks)

The primary objective of this coursework is to familiarize students with the PPO algorithm using basic deep learning libraries, enabling them to improve their ability to translate mathematical and theoretical knowledge into a Python implementation, and to deepen their understanding of the actor-critic algorithm.
Algorithm Overview

Proximal Policy Optimization (PPO) is a state-of-the-art reinforcement learning algorithm that optimizes a stochastic policy in an on-policy manner. To ensure stable training and avoid catastrophic performance collapse, PPO utilizes a clipped surrogate objective to prevent the policy update from stepping too far from the current behavior.
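To make the clipping mechanism concrete, here is a minimal pure-Python sketch of the clipped surrogate term for a single transition. The probabilities, advantage, and epsilon below are illustrative values only, not part of any required solution:

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate objective for one transition.

    ratio = pi_new(a|s) / pi_old(a|s); the objective takes the minimum of the
    unclipped and clipped terms, which bounds how far a single update can
    move the policy away from the behavior that collected the data.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# The new policy is 2.5x more likely to take the action; with a positive
# advantage the contribution is clipped at 1 + eps = 1.2.
print(clipped_surrogate(math.log(0.5), math.log(0.2), advantage=1.0))  # 1.2
```

Note that with a negative advantage the minimum keeps the unclipped (more pessimistic) term, which is exactly what makes the objective a lower bound on policy improvement.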
The Environment: CarRacing-v3

We will be using the Car Racing environment from Gymnasium (the maintained successor of OpenAI Gym). This environment features a top-down racing track where the agent must learn to navigate through tiles based on pixel inputs. You can find more details about this environment on the Gymnasium website: https://gymnasium.farama.org/environments/box2d/car_racing/

Here’s a code snippet for you to get started:
import gymnasium as gym

env = gym.make("CarRacing-v3", render_mode="rgb_array")
env.reset()
Since CarRacing-v3 is quite computationally expensive for a standard laptop (due to the pixel processing), you might want to consider using a gray-scaling or frame-stacking wrapper to speed up training. Alternatively, you can use the lab computers, which have GPUs and already have the environment set up.
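As a rough, hand-rolled illustration of what such wrappers do (this is not the Gymnasium wrapper API itself; the 96x96 shape simply mirrors the CarRacing pixel observations), grayscale conversion collapses the three colour channels and frame stacking keeps a short history of observations:

```python
from collections import deque
import numpy as np

def to_gray(frame):
    """Collapse an (H, W, 3) RGB frame to (H, W) using luminance weights."""
    return frame @ np.array([0.299, 0.587, 0.114], dtype=np.float32)

class FrameStacker:
    """Keep the last k grayscale frames stacked as a (k, H, W) array."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame):
        gray = to_gray(frame)
        if not self.frames:
            # On reset, pad the history by repeating the first frame.
            for _ in range(self.frames.maxlen):
                self.frames.append(gray)
        else:
            self.frames.append(gray)
        return np.stack(self.frames)

stacker = FrameStacker(k=4)
obs = np.zeros((96, 96, 3), dtype=np.uint8)  # same shape as CarRacing pixels
print(stacker.push(obs).shape)  # (4, 96, 96)
```

Grayscaling cuts the input channels by a factor of three, while the stacked history lets a feed-forward policy infer velocity from consecutive frames.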
The PPO Agent

You will implement an RL agent using PPO to play the CarRacing-v3 environment. The agent will use the standard observations and actions provided by the environment. You may edit the environment to speed up your training, but your agent must still perform well in the standard environment (i.e., removing the camera zoom at the beginning is allowed during training, but your agent should still be tested in the original environment). You should record your training and evaluation process using TensorBoard. You should also record important losses and other data for your later analysis.
The Report

Upon completion of your implementation, you are required to submit a comprehensive technical report. The report should document your engineering decisions, the theoretical grounding of your code, and a critical analysis of the agent’s performance.

1. Introduction

• Provide a brief overview of Reinforcement Learning in the context of the CarRacing-v3 environment.
• Define the state space (pixels), action space (discrete commands), and the reward structure of the task.
2. Methodology

• Mathematical Foundation: Formulate the PPO objective function. Explain the significance of the clipping parameter and the probability ratio.
• Advantage Estimation: Describe your method for calculating advantages (e.g., standard advantage vs. Generalized Advantage Estimation (GAE)).
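As one concrete possibility, Generalized Advantage Estimation can be computed in a single backward pass over a rollout. The sketch below uses made-up reward and value numbers purely to show the recursion; it is not a prescribed implementation:

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """GAE: A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1},
    where delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t).

    Sweeping backwards lets each advantage reuse the one after it.
    """
    advantages = [0.0] * len(rewards)
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # zero out bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        next_adv = delta + gamma * lam * mask * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# Toy 3-step rollout ending in a terminal state (illustrative numbers only).
adv = compute_gae(rewards=[1.0, 1.0, 1.0],
                  values=[0.5, 0.5, 0.5],
                  last_value=0.0,
                  dones=[0.0, 0.0, 1.0])
print(adv)
```

With lam = 0 this reduces to the one-step TD advantage, and with lam = 1 it approaches the full Monte Carlo return minus the value baseline, which is the bias-variance trade-off the report should discuss.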
3. Implementation Details

• Describe your implementation, including any challenges faced and how you addressed them.
• Explain the structure of your policy and value networks.
• Detail the training process and the hyperparameters used.
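One common, but by no means required, layout for these networks is a shared convolutional encoder with separate actor and critic heads. The sketch below assumes PyTorch, a 4-frame 96x96 grayscale input, and the 5-action discrete variant of CarRacing; all of these are illustrative choices, not part of the specification:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared CNN encoder with separate policy (actor) and value (critic) heads.

    Input: (batch, 4, 96, 96) stacked grayscale frames.
    """

    def __init__(self, n_actions=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # -> 32 x 23 x 23
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64 x 10 x 10
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64 x 8 x 8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
        )
        self.actor = nn.Linear(256, n_actions)  # action logits
        self.critic = nn.Linear(256, 1)         # state-value estimate

    def forward(self, obs):
        features = self.encoder(obs / 255.0)  # normalise raw pixel values
        return self.actor(features), self.critic(features)

model = ActorCritic(n_actions=5)
logits, value = model(torch.zeros(2, 4, 96, 96))
print(logits.shape, value.shape)  # torch.Size([2, 5]) torch.Size([2, 1])
```

Sharing the encoder halves the parameter count relative to two separate networks, at the cost of coupling the actor and critic gradients; either design is defensible in the report as long as the choice is justified.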
4. Results and Analysis

• Present your results (use graphs for better clarity).
• Discuss the performance of your agent and any trends observed.
• Briefly compare your custom implementation’s stability and sample efficiency against baseline benchmarks (e.g., Stable-Baselines3).
5. Conclusion

• Summarize your key findings regarding the sensitivity of PPO to hyperparameter tuning and the effectiveness of the actor-critic framework in continuous-input environments.

Note: All figures and plots must be clearly labeled with axis titles and legends. Raw code snippets should be kept to a minimum in the report; focus on high-level logic and pseudocode where necessary.
Important Note

• Do NOT use Stable-Baselines or any other reinforcement-learning-specific libraries in your implementation (you may use TensorBoard for recording your results).

• Do NOT exceed the word count limit of 3,000 words for the report, references and appendix excluded.

• Although you are allowed to use generative AI tools to assist your work, please keep in mind that you should use them responsibly. (Good use: improving your report after writing it, and always reviewing the output to ensure that it is correct. Bad use: copy-pasting an entire report from AI without any effort of your own.)
Submission Requirements

Please prepare and submit the following documents:

• A cover page featuring your student ID. This page should be the first page of your report.

• A zip file containing all of the source code and your trained agent model, named using your full name and student ID in the following format: CW1_ID_Name.zip

• One PDF file for your report. This file should be separate from the zip file containing your code, and should be named in the following format: CW1_ID_Name.pdf

Note that the quality of the code, the clarity of your writing, and the format/style of your report will be taken into consideration during the evaluation. The detailed rubric is outlined below.
Rubric

CW1 (40 marks)

Criteria            Description                                                                                     Marks
Code Performance    Code runs without errors and performs tasks as specified.                                       6
Code Quality        Code is well-organized, includes meaningful comments, and uses appropriate variable names.      6
Methodology         Comprehensive coverage of topics with detailed explanations of approaches and methodologies.   6
Result Analysis     Insightful analysis of results.                                                                 6
Report Quality      Report is well-structured, formatted, and free of grammatical errors.                           6
Evidence of Work    All required elements are included and correct.                                                 6
Submission          Follows all requirements for submission.                                                        4