# Development Log — DTS307TC PPO Coursework

This log summarises the project's incremental development. Each step records what was built, why, and the verification used. Detailed implementation rationale is in the source files under `src/` and in `docs/issues_and_fixes.md`.

## Step 0 — Project skeleton

Built the project scaffold under `D:/projects/CW1_xxx/`: directories `src/`, `notebooks/`, `models/`, `runs/`, `docs/`. Created `requirements.txt` (10 dependencies including PyTorch, Gymnasium, OpenCV, and TensorBoard, plus Stable-Baselines3 reserved exclusively for the Section 4.3 baseline comparison). Verified GPU and Gymnasium availability on an RTX 4060 Laptop GPU with `torch.cuda.is_available() == True`.

## Step 1 — Environment exploration

Notebook `01_explore_env.ipynb`: explored CarRacing-v3 raw observations and the action space, and established the random-policy baseline of **−54.19 ± 5.29** over 5 episodes. Confirmed the `Box(0, 255, (96, 96, 3), uint8)` observation space and `Discrete(5)` action space (noop, left, right, gas, brake). The reward structure is `+1000/N` per new tile visited and `−0.1` per frame, with a `−100` terminal penalty for going off-track.

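As a back-of-envelope sanity check on that reward structure (illustrative arithmetic only, not code from the notebook), the return of an episode decomposes as:

```python
def car_racing_return(tiles_visited, total_tiles, frames, off_track=False):
    """Approximate CarRacing-v3 return: +1000/N per new tile,
    -0.1 per frame, and a -100 terminal penalty when off-track."""
    r = 1000.0 * tiles_visited / total_tiles - 0.1 * frames
    return r - 100.0 if off_track else r

# Completing every tile of a track in 1000 frames:
print(car_racing_return(300, 300, 1000))  # 1000 - 100 = 900.0
```

This makes the −54 random baseline intuitive: a random policy visits few tiles, pays the per-frame cost, and usually ends with the off-track penalty.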
## Step 2 — Environment wrappers (`src/env_wrappers.py`)

Implemented three Gymnasium wrappers, applied innermost-first: `SkipFrame(k=4)` repeats each action across 4 raw frames; `GrayScaleResize(84)` converts RGB→grayscale and downsamples 96→84 via OpenCV `INTER_AREA`; `FrameStack(k=4)` concatenates the 4 most recent grayscale frames. The final observation passed to the agent has shape `(4, 84, 84) uint8`. Verified the wrapped random baseline ≈ −37.

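The grayscale-plus-stacking logic can be sketched in plain NumPy (the real wrapper uses `cv2.resize(..., interpolation=cv2.INTER_AREA)` for the 96→84 step, omitted here; padding the stack with the initial frame on reset is one common choice, not necessarily the wrapper's):

```python
from collections import deque
import numpy as np

def to_gray(rgb):
    # ITU-R BT.601 luma weights; keep uint8 like the wrappers do
    return (rgb @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

class FrameStackSketch:
    """Keep the k most recent grayscale frames, oldest first."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)  # pad with the initial frame
        return np.stack(self.frames)

    def step(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames)

stack = FrameStackSketch(k=4)
obs = stack.reset(to_gray(np.zeros((84, 84, 3), dtype=np.uint8)))
print(obs.shape, obs.dtype)  # (4, 84, 84) uint8
```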
## Step 3 — Actor-critic network (`src/networks.py`)

Implemented a shared-CNN actor-critic following the Atari DQN topology: three conv layers (32/64/64 channels with 8/4/3 kernels and 4/2/1 strides) plus a 512-unit fully connected layer, branching into a 5-logit actor head and a scalar critic head. All layers use orthogonal initialisation (gain √2 for hidden layers, 0.01 for the actor head, 1.0 for the critic head). Total parameters: 1,687,206. Verified the initial policy entropy is `ln(5) ≈ 1.6094` (uniform policy).

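The parameter count can be reproduced by hand from the layer sizes above (a quick arithmetic check, independent of the actual `src/networks.py` code):

```python
def conv_out(size, kernel, stride):
    # standard valid-convolution output size (no padding)
    return (size - kernel) // stride + 1

# Spatial sizes: 84 -> 20 -> 9 -> 7 through the 8/4, 4/2, 3/1 conv stages
s = conv_out(conv_out(conv_out(84, 8, 4), 4, 2), 3, 1)
flat = 64 * s * s  # 3136 features into the shared FC layer

params = (
    32 * (4 * 8 * 8) + 32      # conv1: 4 stacked frames -> 32 channels
    + 64 * (32 * 4 * 4) + 64   # conv2
    + 64 * (64 * 3 * 3) + 64   # conv3
    + flat * 512 + 512         # shared 512-unit FC layer
    + 512 * 5 + 5              # actor head (5 logits)
    + 512 * 1 + 1              # critic head (scalar value)
)
print(s, flat, params)  # 7 3136 1687206
```

The total matches the 1,687,206 figure reported above.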
## Step 4 — Rollout buffer + GAE (`src/vec_rollout_buffer.py`)

Implemented a vectorised rollout buffer of shape `(n_steps, n_envs, ...)`, storing observations as `uint8` (a 4× memory saving versus float32). The GAE recursion uses the standard backward-pass formulation `Â_t = δ_t + γλ(1 − d_{t+1}) Â_{t+1}`, bootstrapping from a critic forward pass on the post-rollout state. Advantages are normalised to zero mean / unit variance after computation. Verified with synthetic rollouts.

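That backward recursion can be sketched in NumPy (shapes follow the buffer's `(n_steps, n_envs)` layout; variable names are illustrative, not the buffer's actual API — here `dones[t]` flags termination after step t, which plays the role of `d_{t+1}` in the formula above):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """rewards, values, dones: arrays of shape (n_steps, n_envs);
    last_value: (n_envs,) critic bootstrap for the post-rollout state."""
    n_steps, _ = rewards.shape
    adv = np.zeros_like(rewards)
    gae = np.zeros(rewards.shape[1])
    for t in reversed(range(n_steps)):
        next_value = last_value if t == n_steps - 1 else values[t + 1]
        not_done = 1.0 - dones[t]          # zero the bootstrap at episode ends
        delta = rewards[t] + gamma * not_done * next_value - values[t]
        gae = delta + gamma * lam * not_done * gae
        adv[t] = gae
    returns = adv + values                 # targets for the value loss
    # normalise advantages to zero mean / unit variance, as described above
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    return adv, returns
```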
## Step 5 — PPO agent (`src/ppo_agent.py`)

Implemented `PPOAgent` with the clipped surrogate objective, batched `act_batch` and `evaluate_value_batch` methods for vectorised rollouts, and `update_vec`, which performs 10 mini-batch update epochs per rollout. Includes SB3-style value-function clipping, linear LR / entropy annealing with floors, and Adam (`lr=2.5e-4`, `eps=1e-5`) per *The 37 Implementation Details of PPO*. Verified that the PPO loss is finite and that diagnostics (approximate KL, clip fraction) stay within healthy ranges on a small synthetic rollout.

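The clipped surrogate at the core of the update can be written out in a few lines (a minimal NumPy sketch of the standard PPO-Clip loss and the clip-fraction diagnostic, not the agent's actual implementation):

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, adv, clip=0.2):
    """Standard PPO clipped surrogate (to be minimised), plus the
    fraction of samples whose ratio left the trust region."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    loss = -np.minimum(unclipped, clipped).mean()
    clip_frac = np.mean(np.abs(ratio - 1.0) > clip)
    return loss, clip_frac

# A ratio far outside the trust region contributes only its clipped value:
loss, frac = ppo_clip_loss(np.array([0.0]), np.array([-1.0]), np.array([1.0]))
print(loss, frac)  # ratio = e ~ 2.72, clipped to 1.2 -> loss = -1.2, frac = 1.0
```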
## Step 6 — Training entrypoint (`train_vec.py`) + smoke tests

Implemented the full training driver using `gymnasium.vector.AsyncVectorEnv` with 8 parallel workers, tuned to roughly 95–130 environment steps/s on the RTX 4060 Laptop GPU. It exposes all hyperparameters via `argparse` and supports linear annealing of the LR and entropy coefficient, an optional reward floor, and TensorBoard logging. Smoke tests at 50K and 20K steps confirmed a positive learning trajectory before the main run.

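Linear annealing with a floor reduces to a one-liner; a sketch (the clamp mirrors the `ent_floor` idea from the training config, but the starting coefficient `0.01` below is illustrative, not the project's value):

```python
def linear_anneal(start, floor, progress):
    """Decay linearly from `start` at progress=0 toward 0 at progress=1,
    but never below `floor`."""
    return max(floor, start * (1.0 - progress))

# e.g. an entropy coefficient starting at 0.01 with floor 0.005:
print(linear_anneal(0.01, 0.005, 0.25))  # 0.0075
print(linear_anneal(0.01, 0.005, 0.9))   # clamped at the 0.005 floor
```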
## Step 7 — Main training: vec_main_v3 (1.5M steps, ≈ 4h 23m)

Final production training: 8 parallel envs, 256 steps per env per rollout, batch size 64, 10 epochs, γ=0.99, λ=0.95, clip=0.2, ent_floor=0.005, reward floor at −1.0, with linear LR / entropy annealing. The final 100-episode running mean reached **+843**. Saved 36 checkpoints and selected `iter_0700.pt` (training step ≈ 1.43M) as the submission via held-out per-checkpoint evaluation.

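These settings imply the per-iteration arithmetic below (a consistency check on the configuration, not project code):

```python
n_envs, n_steps, batch, epochs = 8, 256, 64, 10

samples_per_rollout = n_envs * n_steps                 # 2048 transitions
minibatches_per_epoch = samples_per_rollout // batch   # 32 minibatches
updates_per_rollout = minibatches_per_epoch * epochs   # 320 gradient steps
iterations = 1_500_000 // samples_per_rollout          # ~732 rollouts in 1.5M steps
print(samples_per_rollout, updates_per_rollout, iterations)  # 2048 320 732
```

Note that `iter_0700.pt` then corresponds to 700 × 2048 ≈ 1.43M environment steps, consistent with the selected checkpoint above.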
## Step 8 — Evaluation (`evaluate.py`, `notebooks/03_evaluate.ipynb`)

Built `src/eval_utils.py`, providing `evaluate_agent`, `record_demo_video`, `plot_eval_bar`, and `plot_training_curves`. The final 20-episode evaluation on unseen seeds (1000–1019) yielded **mean 830.17 ± 104.79** (min 436.81, max 914.90).

## Step 9 — SB3 baseline (`train_sb3_baseline.py`)

Trained Stable-Baselines3 PPO with matched core hyperparameters for 500K steps as a production-grade reference. Final 20-episode evaluation: mean 664.32 ± 173.93, min 309.40. Our custom implementation outperforms it on mean (+25%), std (−40%), and min (+41%).

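The headline percentages follow directly from the two evaluations (arithmetic check only):

```python
ours = {"mean": 830.17, "std": 104.79, "min": 436.81}  # custom PPO, Step 8
sb3 = {"mean": 664.32, "std": 173.93, "min": 309.40}   # SB3 baseline

for k in ours:
    rel = (ours[k] - sb3[k]) / sb3[k] * 100  # relative difference vs SB3
    print(f"{k}: {rel:+.0f}%")  # mean: +25%, std: -40%, min: +41%
```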
## Step 10 — Negative-result ablations (4 attempts)

Three further refinements drawn from the PPO literature were attempted and documented as instructive failures (see `issues_and_fixes.md` §4):

- KL early stopping triggered in 80% of iterations under our larger batch
- RAD-style observation augmentation collapsed the policy at step 258K
- γ=0.995 + 5M steps reproduced the same collapse mechanism at step 278K

The original v3 configuration is the submitted production model.

## Final deliverables

- `models/ppo_final.pt` — submitted model (1.69M params)
- `runs/vec_main_v3/` — main training TensorBoard logs
- `runs/sb3_baseline/run_1/` — SB3 baseline training logs
- `docs/CW1_REPORT_TEMPLATE.docx` — Word source for the report PDF
- `docs/demo.mp4` — agent demo on seed 117 (return 925, 187 wrapped steps)