M.Tech Research

PPO in Matrix Games — Iterated Prisoner's Dilemma

Applying Proximal Policy Optimization to classical matrix games to study emergent cooperation, policy convergence, and game-theoretic stability between learning agents.

Role: Researcher
Domain: Reinforcement Learning / Game Theory
Status: Open Source

Overview

A study of how Proximal Policy Optimization behaves in classical matrix games — the Iterated Prisoner's Dilemma in particular. Two PPO agents learn concurrently in the same environment, and the interesting question is what they converge to: mutual defection, tit-for-tat-like reciprocity, or something stranger. The project implements a custom Gymnasium environment, trains the agents, and analyzes the resulting policies through a game-theoretic lens.

The Problem

Classical game theory gives closed-form predictions for matrix games, but modern deep RL agents don't necessarily converge to those equilibria — especially when both agents learn at once. Understanding the gap between theoretical equilibria and empirically learned policies is important for any multi-agent system that will operate in mixed-motive settings.
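The closed-form prediction for the one-shot Prisoner's Dilemma is easy to verify directly: defection strictly dominates cooperation, so mutual defection is the unique Nash equilibrium that the learned policies are compared against. A minimal sketch (the payoff values follow the common T=5, R=3, P=1, S=0 convention and are illustrative, not necessarily the project's exact matrix):

```python
import numpy as np

# Row player's payoffs for the one-shot Prisoner's Dilemma.
# Rows: my action (0 = cooperate, 1 = defect); columns: opponent's action.
payoff = np.array([[3, 0],
                   [5, 1]])

# Defection pays more against either opponent action, i.e. it strictly
# dominates cooperation, so (defect, defect) is the unique Nash equilibrium.
best_response = payoff.argmax(axis=0)  # best row for each opponent column
print(best_response)  # [1 1] -> defect against both cooperate and defect
```

Whether two co-adapting PPO agents actually land on that equilibrium in the *iterated* game is exactly the empirical question the project studies.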

My Role & Contribution

  • Built the custom Gymnasium matrix-game environment supporting arbitrary payoff matrices
  • Ran training sweeps across hyperparameters and analyzed convergence behavior
  • Compared learned policies against game-theoretic baselines (tit-for-tat, always-defect, always-cooperate)
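The environment's core logic is small: track the recent joint-action history as the observation, look rewards up in the payoff matrix, and terminate after a fixed number of rounds. A sketch of that logic follows; class and parameter names are illustrative (the actual project implements Gymnasium's `Env` interface, which this plain class only mirrors):

```python
import numpy as np

class IteratedMatrixGame:
    """Two-player iterated matrix game with a reset/step interface
    mirroring Gymnasium's. Names here are illustrative, not the
    project's actual API.

    Observation: the last `history_len` joint actions, flattened,
    padded with -1 at the start of an episode.
    """

    def __init__(self, payoff_row, n_rounds=10, history_len=1):
        self.payoff_row = np.asarray(payoff_row)  # row player's payoffs
        self.payoff_col = self.payoff_row.T       # symmetric game
        self.n_rounds = n_rounds
        self.history_len = history_len

    def reset(self):
        self.t = 0
        self.history = [(-1, -1)] * self.history_len
        return self._obs()

    def _obs(self):
        return np.array(self.history[-self.history_len:]).ravel()

    def step(self, a_row, a_col):
        r_row = self.payoff_row[a_row, a_col]
        r_col = self.payoff_col[a_row, a_col]
        self.history.append((a_row, a_col))
        self.t += 1
        terminated = self.t >= self.n_rounds
        return self._obs(), (r_row, r_col), terminated

# Standard Prisoner's Dilemma payoffs (0 = cooperate, 1 = defect)
env = IteratedMatrixGame([[3, 0], [5, 1]])
obs = env.reset()
obs, (r1, r2), done = env.step(0, 1)  # I cooperate, opponent defects
print(r1, r2)  # 0 5 -> the cooperator gets the sucker's payoff
```

Making the payoff matrix a constructor argument is what lets the same environment cover arbitrary matrix games, not just the Prisoner's Dilemma.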

Approach

  • Custom Gymnasium environment wrapping the iterated matrix game with configurable payoff matrix and history length
  • Stable-Baselines3 PPO as the learning algorithm, with recurrent and MLP policies for comparison
  • Self-play and fixed-opponent training regimes to isolate the effect of co-adaptation
  • Analysis of learned action distributions, cooperation rates over training, and stability under perturbation
  • Visualization of training dynamics and equilibrium regions with Matplotlib
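The game-theoretic baselines used for comparison are simple handcrafted strategies, and the cooperation-rate metric falls out of playing them against each other. A sketch, assuming cooperation is action 0 (function names are illustrative):

```python
def tit_for_tat(history):
    # Cooperate on the first move, then copy the opponent's last action.
    return 0 if not history else history[-1][1]

def always_defect(history):
    return 1

def play(strat_a, strat_b, rounds=100):
    """Play two strategies against each other; return each side's
    cooperation rate (fraction of rounds it played action 0).
    `history` entries are (own action, opponent action) pairs."""
    hist_a, hist_b = [], []
    coop_a = coop_b = 0
    for _ in range(rounds):
        a = strat_a(hist_a)
        b = strat_b(hist_b)
        hist_a.append((a, b))
        hist_b.append((b, a))
        coop_a += (a == 0)
        coop_b += (b == 0)
    return coop_a / rounds, coop_b / rounds

print(play(tit_for_tat, tit_for_tat))    # (1.0, 1.0): reciprocity locks in cooperation
print(play(tit_for_tat, always_defect))  # (0.01, 0.0): exploited once, then mutual defection
```

Replacing one of the handcrafted strategies with a trained PPO policy's action function gives the fixed-opponent evaluation; plotting the cooperation rate over training checkpoints gives the convergence curves.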

Tech Stack

Python PyTorch Stable-Baselines3 NumPy Matplotlib Gymnasium

Results & Impact

  • Empirical characterization of where PPO's learned policies converge relative to the game-theoretic equilibria
  • Open-source code and environment others can build on for further matrix-game RL study