# Development Log — DTS307TC PPO Coursework

This log summarises the project's incremental development. Each step records what was built, why, and the verification used. Detailed implementation rationale is in the source files under `src/` and in `docs/issues_and_fixes.md`.

## Step 0 — Project skeleton

Built the project scaffold under `D:/projects/CW1_xxx/`: directories `src/`, `notebooks/`, `models/`, `runs/`, `docs/`. Created `requirements.txt` (10 dependencies including PyTorch, Gymnasium, OpenCV, TensorBoard, plus Stable-Baselines3 reserved exclusively for the Section 4.3 baseline comparison). Verified GPU + Gymnasium availability on the RTX 4060 Laptop GPU with `torch.cuda.is_available() == True`.

## Step 1 — Environment exploration

Notebook `01_explore_env.ipynb`: explored CarRacing-v3 raw observations and action space, and established the random-policy baseline of **−54.19 ± 5.29** over 5 episodes. Confirmed the `Box(0, 255, (96, 96, 3), uint8)` observation shape and `Discrete(5)` action space (noop, left, right, gas, brake). The reward structure is `+1000/N` per new tile and `−0.1` per frame, with a `−100` terminal penalty for off-track.

## Step 2 — Environment wrappers (`src/env_wrappers.py`)

Implemented three Gymnasium wrappers, applied innermost-first: `SkipFrame(k=4)` to repeat each action across 4 raw frames; `GrayScaleResize(84)` for RGB→grayscale plus 96→84 downsampling via OpenCV `INTER_AREA`; `FrameStack(k=4)` to concatenate the most recent 4 grayscale frames. The final observation passed to the agent has shape `(4, 84, 84) uint8` (the stack is sketched after Step 6). Verified wrapped random baseline ≈ −37.

## Step 3 — Actor-critic network (`src/networks.py`)

Implemented a shared-CNN actor-critic following the Atari DQN topology: three conv layers (32/64/64 channels with 8/4/3 kernels and 4/2/1 strides) plus a 512-unit FC layer, branching into a 5-logit actor head and a scalar critic head (sketched below). All layers use orthogonal initialisation (gain √2 hidden, 0.01 actor, 1.0 critic). Total parameters: 1,687,206. Verified initial entropy is `ln(5) ≈ 1.6094` (uniform policy).

## Step 4 — Rollout buffer + GAE (`src/vec_rollout_buffer.py`)

Implemented a vectorised rollout buffer of shape `(n_steps, n_envs, ...)` storing observations as `uint8` (4× memory saving versus float32). The GAE recursion uses the standard backward-pass formulation `Â_t = δ_t + γλ(1 − d_{t+1}) Â_{t+1}`, with a bootstrap from a critic forward pass on the post-rollout state (sketched below). Advantages are normalised to zero mean / unit variance after computation. Verified with synthetic rollouts.

## Step 5 — PPO agent (`src/ppo_agent.py`)

Implemented `PPOAgent` with the clipped surrogate objective, batched `act_batch` and `evaluate_value_batch` for vectorised rollouts, and `update_vec` performing 10 mini-batch update epochs per rollout. Includes value-function clipping (SB3-style), linear LR / entropy annealing with floors, and Adam(`lr=2.5e-4`, `eps=1e-5`) per the *37 Implementation Details of PPO* (the losses are sketched below). Verified the PPO loss is finite and that diagnostics (KL, clip fraction) are within healthy ranges on a small synthetic rollout.

## Step 6 — Training entrypoint (`train_vec.py`) + smoke tests

Implemented the full training driver using `gymnasium.vector.AsyncVectorEnv` with 8 parallel workers, tuned to ~95–130 sps on the RTX 4060 Laptop GPU. Exposes all hyperparameters via `argparse`, supports linear annealing of the LR and entropy coefficient, an optional reward floor, and TensorBoard logging. Smoke tests at 50K and 20K steps confirmed a positive learning trajectory before the main run.
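The Step 2 stack and the Step 6 factory fit together as follows. This is a minimal sketch, not the submitted `src/env_wrappers.py` / `train_vec.py` code: it assumes Gymnasium ≥ 1.0 (whose built-in `FrameStackObservation` stands in for the custom `FrameStack`) and reimplements `SkipFrame` and `GrayScaleResize` from their descriptions above.

```python
import cv2
import numpy as np
import gymnasium as gym


class SkipFrame(gym.Wrapper):
    """Repeat each action for k raw frames, summing the reward."""

    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.k):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        return obs, total_reward, terminated, truncated, info


class GrayScaleResize(gym.ObservationWrapper):
    """RGB (96, 96, 3) -> grayscale (84, 84) uint8 via INTER_AREA."""

    def __init__(self, env, size=84):
        super().__init__(env)
        self.size = size
        self.observation_space = gym.spaces.Box(0, 255, (size, size), np.uint8)

    def observation(self, obs):
        gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
        return cv2.resize(gray, (self.size, self.size),
                          interpolation=cv2.INTER_AREA)


def make_env(seed):
    def thunk():
        env = gym.make("CarRacing-v3", continuous=False)  # Discrete(5) actions
        env = SkipFrame(env, k=4)
        env = GrayScaleResize(env, size=84)
        env = gym.wrappers.FrameStackObservation(env, stack_size=4)  # (4, 84, 84)
        env.reset(seed=seed)
        return env
    return thunk


# Step 6: eight parallel workers, one seed each.
envs = gym.vector.AsyncVectorEnv([make_env(i) for i in range(8)])
```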
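The Step 3 network is fully determined by the description above. The sketch below is an illustrative reconstruction (the `ortho` helper is hypothetical, not a name from `src/networks.py`); its printed parameter count matches the logged 1,687,206, and the 0.01 actor gain is what makes the initial policy near-uniform with entropy ≈ ln(5).

```python
import torch
import torch.nn as nn


def ortho(layer, gain):
    """Orthogonal weight init with zero bias (hypothetical helper)."""
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer


class ActorCritic(nn.Module):
    def __init__(self, n_actions=5):
        super().__init__()
        g = 2 ** 0.5  # sqrt(2) gain for hidden layers
        self.body = nn.Sequential(
            ortho(nn.Conv2d(4, 32, kernel_size=8, stride=4), g), nn.ReLU(),
            ortho(nn.Conv2d(32, 64, kernel_size=4, stride=2), g), nn.ReLU(),
            ortho(nn.Conv2d(64, 64, kernel_size=3, stride=1), g), nn.ReLU(),
            nn.Flatten(),
            ortho(nn.Linear(64 * 7 * 7, 512), g), nn.ReLU(),  # 84 -> 20 -> 9 -> 7
        )
        self.actor = ortho(nn.Linear(512, n_actions), 0.01)  # near-zero logits
        self.critic = ortho(nn.Linear(512, 1), 1.0)

    def forward(self, obs):  # obs: (B, 4, 84, 84) uint8
        x = self.body(obs.float() / 255.0)
        return self.actor(x), self.critic(x).squeeze(-1)


print(sum(p.numel() for p in ActorCritic().parameters()))  # 1687206
```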
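The Step 4 recursion is easiest to read as a NumPy backward pass over the `(n_steps, n_envs)` buffer. A sketch under the assumption that `dones[t]` flags a terminal `s_{t+1}` (i.e. it plays the role of `d_{t+1}`); names are illustrative rather than copied from `src/vec_rollout_buffer.py`.

```python
import numpy as np


def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """rewards/values/dones: (n_steps, n_envs); last_value: (n_envs,) bootstrap."""
    advantages = np.zeros_like(rewards)
    next_adv = np.zeros_like(last_value)
    next_value = last_value  # critic forward pass on the post-rollout state
    for t in reversed(range(rewards.shape[0])):
        not_done = 1.0 - dones[t]  # dones[t] == 1 when s_{t+1} was terminal
        delta = rewards[t] + gamma * not_done * next_value - values[t]
        next_adv = delta + gamma * lam * not_done * next_adv  # GAE recursion
        advantages[t] = next_adv
        next_value = values[t]
    returns = advantages + values  # critic regression targets
    # Normalise to zero mean / unit variance after computation (Step 4).
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```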
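The Step 5 losses in PyTorch. A sketch only: the names are stand-ins for the `update_vec` internals, with `log_probs`/`values` from a fresh forward pass and the `*_old` tensors recorded at rollout time.

```python
import torch


def ppo_losses(log_probs, log_probs_old, advantages,
               values, values_old, returns, clip_eps=0.2):
    ratio = torch.exp(log_probs - log_probs_old)
    # Clipped surrogate objective, negated for gradient descent.
    policy_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
    ).mean()
    # SB3-style value clipping: limit how far the new prediction may move
    # from the rollout-time value before taking the MSE.
    values_clipped = values_old + torch.clamp(values - values_old,
                                              -clip_eps, clip_eps)
    value_loss = ((values_clipped - returns) ** 2).mean()
    # Diagnostics checked against "healthy ranges" in the smoke tests.
    approx_kl = (log_probs_old - log_probs).mean()
    clip_frac = ((ratio - 1.0).abs() > clip_eps).float().mean()
    return policy_loss, value_loss, approx_kl, clip_frac
```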
## Step 7 — Main training: vec_main_v3 (1.5M steps, ≈ 4h 23m)

Final production training: 8 parallel envs, 256 steps per env per rollout, batch 64, 10 epochs, γ=0.99, λ=0.95, clip=0.2, ent_floor=0.005, reward floor at −1.0, with linear LR / entropy annealing. The final 100-episode running mean reached **+843**. Saved 36 checkpoints; selected `iter_0700.pt` (training step ≈ 1.43M) as the submission via held-out per-checkpoint evaluation.

## Step 8 — Evaluation (`evaluate.py`, `notebooks/03_evaluate.ipynb`)

Built `src/eval_utils.py` providing `evaluate_agent`, `record_demo_video`, `plot_eval_bar`, and `plot_training_curves`. The final 20-episode evaluation on unseen seeds (1000–1019) yielded **mean 830.17 ± 104.79**, min 436.81, max 914.90 (a minimal version of this loop is sketched at the end of this log).

## Step 9 — SB3 baseline (`train_sb3_baseline.py`)

Trained Stable-Baselines3 PPO with matched core hyperparameters for 500K steps as a production-grade reference. Final 20-episode evaluation: mean 664.32 ± 173.93, min 309.40. Our custom implementation outperforms on mean (+25%), std (−40%), and min (+41%).

## Step 10 — Negative-result ablations (4 attempts)

Three further refinements drawn from the PPO literature were attempted and documented as instructive failures (see `issues_and_fixes.md` §4):

- KL early stopping triggered in 80% of iterations under our larger batch
- RAD-style observation augmentation collapsed the policy at step 258K
- γ=0.995 + 5M steps reproduced the same collapse mechanism at step 278K

The original v3 configuration is the submitted production model.

## Final deliverables

- `models/ppo_final.pt` — submitted model (1.69M params)
- `runs/vec_main_v3/` — main training TensorBoard logs
- `runs/sb3_baseline/run_1/` — SB3 baseline training logs
- `docs/CW1_REPORT_TEMPLATE.docx` — Word source for the report PDF
- `docs/demo.mp4` — agent demo on seed 117 (return 925, 187 wrapped steps)
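For completeness, the Step 8 evaluation protocol as a minimal loop. This sketch assumes the `make_env` factory from the earlier wrapper sketch and an `agent.act` single-observation convenience around the batched `act_batch`; `evaluate_agent` in `src/eval_utils.py` is the authoritative implementation.

```python
import numpy as np


def evaluate(agent, make_env, seeds=range(1000, 1020)):
    """Run one episode per held-out seed and report mean/std return."""
    returns = []
    for seed in seeds:
        env = make_env(seed)()  # same wrapper stack as training
        obs, _ = env.reset(seed=seed)
        done, ep_return = False, 0.0
        while not done:
            action = agent.act(obs)  # assumed single-obs convenience method
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
        env.close()
    return float(np.mean(returns)), float(np.std(returns))
```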