RL Foundations

CartPole — Deep Q-Network from Scratch

A clean-room DQN implementation on the classic CartPole control task — experience replay, target networks, and epsilon-greedy exploration built from first principles.

Role Developer
Domain Reinforcement Learning
Status Open Source

Overview

A from-scratch implementation of Deep Q-Networks on OpenAI Gym's CartPole-v1. No Stable-Baselines3, no RLlib — the Q-network, replay buffer, target-network sync, and epsilon-greedy exploration schedule are all written directly in PyTorch. The goal was to internalize the moving parts of DQN and build a reference implementation small enough to read end-to-end.

The Problem

DQN is easy to pip-install but hard to understand unless you build it yourself. The subtle parts — why experience replay matters, why a separate target network is needed, how the epsilon schedule trades exploration for exploitation, how to diagnose an agent that isn't learning — only click once you've debugged each one. This project deliberately reinvents the wheel as a learning exercise.

My Role & Contribution

  • Implemented the full DQN algorithm — network, replay buffer, target network, training loop
  • Tuned hyperparameters (learning rate, buffer size, target sync frequency, epsilon schedule) to reach CartPole-v1's solved threshold (average return of 475 over 100 consecutive episodes)
  • Documented the implementation so it reads as a reference for others learning DQN

Approach

  • Small MLP Q-network in PyTorch — two hidden layers, ReLU activations, linear output over the action space (sketched after this list)
  • Replay buffer implemented as a fixed-size deque with uniform random sampling (see the buffer sketch below)
  • Separate target network, soft- or hard-synced from the online network at a configured interval (both variants sketched below)
  • Epsilon-greedy exploration with a decaying schedule from full exploration to near-greedy (schedule sketch below)
  • Smooth-L1 (Huber) loss between predicted Q-values and Bellman targets (training-step sketch below)
  • Matplotlib training curves showing reward, loss, and epsilon over episodes
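
The sketches below flesh out the bullets above. First, the Q-network: a minimal PyTorch module matching the two-hidden-layer shape described. The hidden width of 128 is illustrative, not the project's actual setting.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # linear head: raw Q-values, no activation
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```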
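The replay buffer bullet corresponds to something like the following; the 50,000-transition capacity is a placeholder, not the project's tuned value.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    """Fixed-size FIFO buffer with uniform random sampling."""
    def __init__(self, capacity: int = 50_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop off automatically

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)  # uniform, without replacement

    def __len__(self):
        return len(self.buffer)
```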
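Target-network syncing can be done either way mentioned above. A sketch of both variants follows, with tau = 0.005 as an assumed soft-update rate rather than a project-specific one.

```python
import torch

def hard_sync(target: torch.nn.Module, online: torch.nn.Module) -> None:
    """Periodic hard update: copy online weights into the target network."""
    target.load_state_dict(online.state_dict())

def soft_sync(target: torch.nn.Module, online: torch.nn.Module, tau: float = 0.005) -> None:
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    with torch.no_grad():
        for t_param, o_param in zip(target.parameters(), online.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * o_param)
```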
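The epsilon schedule might look like the exponential decay below; the start/end values and decay constant are illustrative assumptions, not the tuned hyperparameters.

```python
import math
import random
import torch

def epsilon_at(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
               decay_steps: float = 10_000) -> float:
    """Exponential decay from full exploration toward near-greedy."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay_steps)

def select_action(q_net, state: torch.Tensor, step: int, n_actions: int) -> int:
    if random.random() < epsilon_at(step):
        return random.randrange(n_actions)  # explore: uniform random action
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())  # exploit: greedy action
```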
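Finally, a sketch of the training step tying the pieces together: Smooth-L1 (Huber) loss between predicted Q-values and one-step Bellman targets computed from the target network. gamma = 0.99 is a conventional default assumed here.

```python
import torch
import torch.nn.functional as F

def train_step(online, target, optimizer, batch, gamma: float = 0.99) -> float:
    """One gradient step on the Huber loss between Q(s, a) and the Bellman target."""
    states, actions, rewards, next_states, dones = batch  # batched tensors; dones are 0./1. floats

    # Q-values of the actions that were actually taken
    q_sa = online(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Bootstrapped one-step target from the frozen target network
    with torch.no_grad():
        next_q = target(next_states).max(dim=1).values
        bellman_target = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_sa, bellman_target)  # Smooth-L1 == Huber
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```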

Tech Stack

Python · PyTorch · OpenAI Gym / Gymnasium · NumPy · Matplotlib

Results & Impact

  • Agent reliably solves CartPole-v1 (sustained 500-step episodes) within a small training budget
  • Reference implementation short enough to read end-to-end