RL Foundations

CartPole — Deep Q-Network from Scratch

A clean-room DQN implementation on the classic CartPole control task — experience replay, target networks, and epsilon-greedy exploration built from first principles.

Role Developer
Domain Reinforcement Learning
Status Open Source

Overview

A from-scratch implementation of Deep Q-Networks on OpenAI Gym's CartPole-v1. No Stable-Baselines3, no RLlib — the Q-network, replay buffer, target-network sync, and epsilon-greedy exploration schedule are all written directly in PyTorch. The goal was to internalize the moving parts of DQN and build a reference implementation small enough to read end-to-end.

The Problem

DQN is easy to pip-install but hard to understand unless you build it yourself. The subtle parts — why experience replay matters, why a separate target network is needed, how the epsilon schedule trades exploration for exploitation, how to diagnose an agent that isn't learning — only click once you've debugged each one. This project deliberately reinvents the wheel as a learning exercise.

My Role & Contribution

  • Implemented the full DQN algorithm — network, replay buffer, target network, training loop
  • Tuned hyperparameters (learning rate, buffer size, target sync frequency, epsilon schedule) to reach CartPole-v1's solved threshold (average return of 475 over 100 consecutive episodes)
  • Documented the implementation so it reads as a reference for others learning DQN

Approach

  • Small MLP Q-network in PyTorch — two hidden layers, ReLU activations, linear output over the action space (sketched after this list)
  • Replay buffer implemented as a fixed-size deque with uniform random sampling (see the buffer sketch below)
  • Separate target network, soft- or hard-synced from the online network at a configured interval (both variants sketched below)
  • Epsilon-greedy exploration with a decaying schedule from full exploration to near-greedy (schedule sketch below)
  • Smooth-L1 (Huber) loss between predicted Q-values and Bellman targets (training-step sketch below)
  • Matplotlib training curves showing reward, loss, and epsilon over episodes
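
The sketches below flesh out the bullets above. First, the Q-network: a minimal PyTorch module matching the two-hidden-layer shape described. The hidden width of 128 is illustrative, not the project's actual setting.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # linear head: raw Q-values, no activation
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```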
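The replay buffer bullet corresponds to something like the following; the 50,000-transition capacity is a placeholder, not the project's tuned value.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    """Fixed-size FIFO buffer with uniform random sampling."""
    def __init__(self, capacity: int = 50_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop off automatically

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)  # uniform, without replacement

    def __len__(self):
        return len(self.buffer)
```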
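Target-network syncing can be done either way mentioned above. A sketch of both variants follows, with tau = 0.005 as an assumed soft-update rate rather than a project-specific one.

```python
import torch

def hard_sync(target: torch.nn.Module, online: torch.nn.Module) -> None:
    """Periodic hard update: copy online weights into the target network."""
    target.load_state_dict(online.state_dict())

def soft_sync(target: torch.nn.Module, online: torch.nn.Module, tau: float = 0.005) -> None:
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    with torch.no_grad():
        for t_param, o_param in zip(target.parameters(), online.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * o_param)
```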
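The epsilon schedule might look like the exponential decay below; the start/end values and decay constant are illustrative assumptions, not the tuned hyperparameters.

```python
import math
import random
import torch

def epsilon_at(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
               decay_steps: float = 10_000) -> float:
    """Exponential decay from full exploration toward near-greedy."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay_steps)

def select_action(q_net, state: torch.Tensor, step: int, n_actions: int) -> int:
    if random.random() < epsilon_at(step):
        return random.randrange(n_actions)  # explore: uniform random action
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())  # exploit: greedy action
```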
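Finally, a sketch of the training step tying the pieces together: Smooth-L1 (Huber) loss between predicted Q-values and one-step Bellman targets computed from the target network. gamma = 0.99 is a conventional default assumed here.

```python
import torch
import torch.nn.functional as F

def train_step(online, target, optimizer, batch, gamma: float = 0.99) -> float:
    """One gradient step on the Huber loss between Q(s, a) and the Bellman target."""
    states, actions, rewards, next_states, dones = batch  # batched tensors; dones are 0./1. floats

    # Q-values of the actions that were actually taken
    q_sa = online(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Bootstrapped one-step target from the frozen target network
    with torch.no_grad():
        next_q = target(next_states).max(dim=1).values
        bellman_target = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_sa, bellman_target)  # Smooth-L1 == Huber
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```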

Tech Stack

Python · PyTorch · OpenAI Gym / Gymnasium · NumPy · Matplotlib

Results & Impact

  • Agent reliably solves CartPole-v1 (sustained 500-step episodes) within a small training budget
  • Reference implementation short enough to read end-to-end