M.Tech Research

PPO in Matrix Games — Iterated Prisoner's Dilemma

Applying Proximal Policy Optimization to classical matrix games to study emergent cooperation, policy convergence, and game-theoretic stability between learning agents.

Role: Researcher
Domain: Reinforcement Learning / Game Theory
Status: Open Source

Overview

A study of how Proximal Policy Optimization behaves in classical matrix games — the Iterated Prisoner's Dilemma in particular. Two PPO agents learn concurrently in the same environment, and the interesting question is what they converge to: mutual defection, tit-for-tat-like reciprocity, or something stranger. The project implements a custom Gymnasium environment, trains the agents, and analyzes the resulting policies through a game-theoretic lens.

The Problem

Classical game theory gives closed-form predictions for matrix games, but modern deep RL agents don't necessarily converge to those equilibria — especially when both agents learn at once. Understanding the gap between theoretical equilibria and empirically learned policies is important for any multi-agent system that will operate in mixed-motive settings.
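The closed-form prediction for the one-shot Prisoner's Dilemma is easy to verify directly: defection strictly dominates cooperation, so mutual defection is the unique Nash equilibrium that the learned policies are compared against. A minimal sketch (the payoff values follow the common T=5, R=3, P=1, S=0 convention and are illustrative, not necessarily the project's exact matrix):

```python
import numpy as np

# Row player's payoffs for the one-shot Prisoner's Dilemma.
# Rows: my action (0 = cooperate, 1 = defect); columns: opponent's action.
payoff = np.array([[3, 0],
                   [5, 1]])

# Defection pays more against either opponent action, i.e. it strictly
# dominates cooperation, so (defect, defect) is the unique Nash equilibrium.
best_response = payoff.argmax(axis=0)  # best row for each opponent column
print(best_response)  # [1 1] -> defect against both cooperate and defect
```

Whether two co-adapting PPO agents actually land on that equilibrium in the *iterated* game is exactly the empirical question the project studies.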

My Role & Contribution

  • Built the custom Gymnasium matrix-game environment supporting arbitrary payoff matrices
  • Ran training sweeps across hyperparameters and analyzed convergence behavior
  • Compared learned policies against game-theoretic baselines (tit-for-tat, always-defect, always-cooperate)
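The environment's core logic is small: track the recent joint-action history as the observation, look rewards up in the payoff matrix, and terminate after a fixed number of rounds. A sketch of that logic follows; class and parameter names are illustrative (the actual project implements Gymnasium's `Env` interface, which this plain class only mirrors):

```python
import numpy as np

class IteratedMatrixGame:
    """Two-player iterated matrix game with a reset/step interface
    mirroring Gymnasium's. Names here are illustrative, not the
    project's actual API.

    Observation: the last `history_len` joint actions, flattened,
    padded with -1 at the start of an episode.
    """

    def __init__(self, payoff_row, n_rounds=10, history_len=1):
        self.payoff_row = np.asarray(payoff_row)  # row player's payoffs
        self.payoff_col = self.payoff_row.T       # symmetric game
        self.n_rounds = n_rounds
        self.history_len = history_len

    def reset(self):
        self.t = 0
        self.history = [(-1, -1)] * self.history_len
        return self._obs()

    def _obs(self):
        return np.array(self.history[-self.history_len:]).ravel()

    def step(self, a_row, a_col):
        r_row = self.payoff_row[a_row, a_col]
        r_col = self.payoff_col[a_row, a_col]
        self.history.append((a_row, a_col))
        self.t += 1
        terminated = self.t >= self.n_rounds
        return self._obs(), (r_row, r_col), terminated

# Standard Prisoner's Dilemma payoffs (0 = cooperate, 1 = defect)
env = IteratedMatrixGame([[3, 0], [5, 1]])
obs = env.reset()
obs, (r1, r2), done = env.step(0, 1)  # I cooperate, opponent defects
print(r1, r2)  # 0 5 -> the cooperator gets the sucker's payoff
```

Making the payoff matrix a constructor argument is what lets the same environment cover arbitrary matrix games, not just the Prisoner's Dilemma.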

Approach

  • Custom Gymnasium environment wrapping the iterated matrix game with configurable payoff matrix and history length
  • Stable-Baselines3 PPO as the learning algorithm, with recurrent and MLP policies for comparison
  • Self-play and fixed-opponent training regimes to isolate the effect of co-adaptation
  • Analysis of learned action distributions, cooperation rates over training, and stability under perturbation
  • Visualization of training dynamics and equilibrium regions with Matplotlib
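The game-theoretic baselines used for comparison are simple handcrafted strategies, and the cooperation-rate metric falls out of playing them against each other. A sketch, assuming cooperation is action 0 (function names are illustrative):

```python
def tit_for_tat(history):
    # Cooperate on the first move, then copy the opponent's last action.
    return 0 if not history else history[-1][1]

def always_defect(history):
    return 1

def play(strat_a, strat_b, rounds=100):
    """Play two strategies against each other; return each side's
    cooperation rate (fraction of rounds it played action 0).
    `history` entries are (own action, opponent action) pairs."""
    hist_a, hist_b = [], []
    coop_a = coop_b = 0
    for _ in range(rounds):
        a = strat_a(hist_a)
        b = strat_b(hist_b)
        hist_a.append((a, b))
        hist_b.append((b, a))
        coop_a += (a == 0)
        coop_b += (b == 0)
    return coop_a / rounds, coop_b / rounds

print(play(tit_for_tat, tit_for_tat))    # (1.0, 1.0): reciprocity locks in cooperation
print(play(tit_for_tat, always_defect))  # (0.01, 0.0): exploited once, then mutual defection
```

Replacing one of the handcrafted strategies with a trained PPO policy's action function gives the fixed-opponent evaluation; plotting the cooperation rate over training checkpoints gives the convergence curves.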

Tech Stack

Python PyTorch Stable-Baselines3 NumPy Matplotlib Gymnasium

Results & Impact

  • Empirical characterization of where PPO's learned policies converge relative to the game-theoretic equilibria
  • Open-source code and environment others can build on for further matrix-game RL study