Building an Oware Engine and a Ladder of RL Agents

/Machine Learning

Contents

Table of Contents

Oware[*](The variant we implement is Abapa, the tournament ruleset standardized across West Africa and the Caribbean. It is also called Ayoayo, Awale, Warri, Adji-Boto, or Awele depending on where you grew up.) is an ancient, two-player abstract strategy game belonging to the Mancala family, widely played across West Africa and the Caribbean. Two rows of six pits, forty-eight seeds, anti-clockwise sowing, first to twenty-five captured seeds wins. The branching factor never exceeds six, games end in a few hundred plies, and the state space is well within reach of a laptop. That smallness is what makes it a good target: you can stand up an engine, a search baseline, and three learned agents on a single machine, and watch each one fail in its own family-specific way.

This post is the engineering story behind that ladder: a pure-Python engine for the Abapa ruleset, Minimax with iterative deepening as the search baseline, then DQN, PPO, and a deliberately compute-starved AlphaZero-lite. A round-robin tournament scores all seven (three search depths + three learned + Random) on Elo, and a FastAPI + React deployment makes them playable in the browser.

You can play any of them right here:

Live deployment: Pick an opponent from the agent picker. Each agent's tournament Elo is shown beside its name. The AlphaZero-lite number, in particular, should be read with the caveat developed later in this post.

The Game as an RL Problem

Oware is played on a board of two rows of six houses, with a store at each end. The internal indexing the engine uses:

        Opponent (north)
   11  10   9   8   7   6
    0   1   2   3   4   5
        Player (south)

A move consists of scooping all seeds from one of your own houses and sowing them, one at a time, anti-clockwise. If the last seed lands in an opponent's house that now contains exactly 2 or 3 seeds, those seeds are captured; if the house immediately before it on the opponent's row also contains 2 or 3, those are captured too, and so on backwards along the row. There are two structural rules that complicate this: on a long lap of 12 or more seeds, the source house is skipped, and a move that would leave your opponent with no legal reply on their next turn is forbidden.[*](The "no-starvation" rule (sometimes called the grand slam rule in Awari literature): if every legal move available to you would capture every remaining seed on your opponent's side, you must play a move that doesn't capture, leaving them at least one seed to play. This single rule is responsible for most of the late-game complexity.)

The reward signal we expose to every agent is terminal-only:

rT={+1if the agent wins+0if the game is a draw (24–24)1if the agent losesr_T = \begin{cases} +1 & \text{if the agent wins} \\ \phantom{+}0 & \text{if the game is a draw (24--24)} \\ -1 & \text{if the agent loses} \end{cases}

No reward shaping. No partial credit for captures, for holding seeds, for board control. The discount factor is γ=0.99\gamma = 0.99. This is a deliberate constraint: reward shaping would speed up early learning but bake my own theory of Oware into every agent. With terminal-only reward, the agents are free to find their own.

The observation is small enough to write out in one line. Twelve pit counts plus two store counts, viewed from the side-to-move's perspective, plus a normalized ply counter:

sR15,si=piti48 for i[0,13],s14=ply200\mathbf{s} \in \mathbb{R}^{15}, \quad s_i = \tfrac{\text{pit}_i}{48} \text{ for } i \in [0, 13], \quad s_{14} = \tfrac{\text{ply}}{200}

The action space is A={0,1,2,3,4,5}\mathcal{A} = \{0, 1, 2, 3, 4, 5\}, indexing the six houses on the side-to-move's row. Illegal actions are masked to -\infty on the policy side and excluded from the argmax on the value side.[*]("Masked on the policy side" means before the softmax, not after. Zeroing out illegal probabilities post-softmax leaves gradient leaking into the masked logits, and the policy quietly learns to push probability mass toward moves it can never play. The PPO loss curves look fine and the agent gets weaker. It is the single easiest bug to ship in board-game RL.)

Architecture: One Engine, Many Brains

The system is layered so that the agents share absolutely everything except their decision logic.

1. Engine

A pure Python engine, no NumPy in the hot path (a list of twelve ints is faster than a tiny ndarray here). The public surface is six functions on an immutable State:

class State:
    pits: tuple[int, ...]            # length 12
    stores: tuple[int, int]
    to_move: int                     # 0 = south, 1 = north
    ply: int
    plies_since_capture: int

def initial_state() -> State: ...
def legal_moves(s: State) -> list[int]: ...
def step(s: State, action: int) -> tuple[State, int]: ...
def terminal(s: State) -> tuple[bool, int]: ...
def encode(s: State) -> np.ndarray: ...

Stateless because MCTS rollouts and self-play workers want to fork positions cheaply across threads, and an immutable State is trivially safe to share. This single property is what makes AlphaZero feasible at all on a CPU: each self-play worker can run its own MCTS without ever locking on the engine.

2. Gymnasium Env Wrapper

The engine is wrapped as a Gymnasium environment[*](Framework reference: Gymnasium, the maintained fork of OpenAI Gym.) where the opponent is configurable: Random, any Minimax depth, or a frozen NN snapshot. For self-play training the env exposes a two-agent mode that yields trajectories for both sides, with the value sign flipped at the boundary so the agent's reward is always from its own perspective.

3. Agents

Every agent (heuristic, search, or neural) implements one protocol:

class Agent(Protocol):
    def choose_move(self, state: State, *, time_budget_ms: int | None = None) -> int: ...
    name: str
    metadata: dict

The server, the tournament, and the eval scripts don't care which family a given agent belongs to. The eval scripts in particular are family-agnostic: python scripts/tournament.py discovers every agent that's currently present on disk and plays them round-robin.

4. Server + Frontend

FastAPI[*](Framework reference: FastAPI, the same async Python framework I've used in previous projects.) with one WebSocket endpoint (/play) and a tiny REST surface (GET /agents returns the agent ladder with Elo, GET /healthz). Inference for NN agents runs in a threadpool executor so a slow MCTS rollout never blocks the event loop. Minimax is synchronous but bounded by an iterative-deepening time budget, so the server always responds within UX limits.

Frontend is React + Vite + TypeScript. State management is useReducer for local game state plus a small Zustand store for the WebSocket connection. No Redux on a single-page game.

Search Baseline: Minimax

Before any learned agent, three Minimax variants at depths 2, 4, and 6. Alpha-beta pruning, a transposition table keyed on the canonical position hash, iterative deepening so the agent can always return a move within budget. The leaf evaluator is intentionally crude:

V(s)=(own_storeopp_store)+0.1(own_seedsopp_seeds)V(\mathbf{s}) = (\text{own\_store} - \text{opp\_store}) + 0.1 \cdot (\text{own\_seeds} - \text{opp\_seeds})

The first term rewards captured seeds, which is what wins the game. The second term breaks ties by preferring positions with more material on your side of the board, a weak proxy for "future capture potential."

These are not strong opponents in any absolute sense. Oware is weakly solved by retrograde analysis[*](Romein and Bal (2002), "Solving Awari"; they showed the game is a draw under perfect play, using a complete endgame database.) and a real engine like Awari is exact. But Minimax-d2/d4/d6 are calibrated, deterministic, and cheap to run, which is exactly what we want as a yardstick. The whole training protocol hangs off these three opponents: every learned-agent checkpoint is evaluated against d2, d4, and d6, and promotion to latest.pt requires winning at least 55% of a 400-game head-to-head against the previous best.

Agent 1: DQN as a Sanity Check

DQN exists in this project mainly to confirm the engine and the reward are wired correctly. An off-policy value method on a 15-vector input is not where the interesting research lives; it's where you find out whether your environment is broken.

The network is a three-layer MLP with 128 hidden units and ReLU activations, with a dueling head that decomposes the value:[*](Algorithm reference: Wang et al. (2016), "Dueling Network Architectures for Deep Reinforcement Learning".)

Q(s,a)=V(s)+(A(s,a)1AaA(s,a))Q(s, a) = V(s) + \left( A(s, a) - \tfrac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)

This is almost free on a six-action problem and helps the agent learn the position value V(s)V(s) independently from the action-advantages, which matters when most actions are equally good in early-game positions.

The training loop is Double DQN (target network refreshed every 2k env steps), with prioritized experience replay (PER, α=0.6\alpha = 0.6, β\beta annealed 0.41.00.4 \to 1.0).[*](Algorithm references: van Hasselt et al. (2016), "Deep Reinforcement Learning with Double Q-learning", and Schaul et al. (2016), "Prioritized Experience Replay".) Epsilon decays from 1.0 to 0.05 over 200k steps. The loss is Huber on the TD error, with illegal-action QQ-values masked out of the target's max\max:

L(θ)=E(s,a,r,s)B[ρHδ ⁣(r+γmaxaAlegal(s)Qθ(s,a)Qθ(s,a))]\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{B}} \left[ \rho \cdot \mathcal{H}_\delta \!\left( r + \gamma \max_{a' \in \mathcal{A}_{\text{legal}}(s')} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right) \right]

where Hδ\mathcal{H}_\delta is the Huber loss with threshold δ=1\delta = 1 and ρ\rho is the PER importance-sampling weight. The masking in the target's max\max is non-optional. If you forget it, the agent learns to value illegal moves (because the target network thinks they exist) and saturates around Minimax-d2.

Training configuration:

  • Algorithm: Double DQN with dueling head and prioritized replay
  • Learning rate: α=104\alpha = 10^{-4} (Adam)
  • Discount factor: γ=0.99\gamma = 0.99
  • Timesteps: 1.3×1061.3 \times 10^{6} env steps
  • Batch size: 256
  • Replay buffer: 2×1052 \times 10^{5} transitions
  • Opponent schedule: Random (0–50k) → 50/50 Random + Minimax-d2 (50k–150k) → snapshot pool + Minimax-d4 anchor (150k+)

After 1.3M env steps:

DQN training curves

DQN learning curves. Left: Huber loss on the TD error (log scale); the EMA decays cleanly across two orders of magnitude. Right: eval winrate against the three opponents in the schedule. The agent saturates against Random, climbs to ~50% against Minimax-d4, but never touches Minimax-d2 in this run.

The Random curve hitting 1.0\sim 1.0 is the sanity check passing: masking is correct, reward direction is correct, the engine is not lying to the agent. The flat 0\sim 0 line against Minimax-d2 is the more interesting artifact. Minimax-d2 is not a brick wall (it plays a thin two-ply lookahead), and yet our DQN climbs to ~50% against d4 while staying at zero against d2.

The likely explanation: the curriculum forgets d2. The schedule rotates Minimax-d2 out of training at 150k steps, and from that point on the agent only practices against snapshot pool opponents and the d4 anchor. By 1.3M steps, the policy has specialized against the kind of mistakes d4's deeper but greedier search makes, and has lost the ability to handle d2's shallower, more stylistically varied moves. The fix is mechanical (keep d2 in the eval-time opponent rotation, and ideally in the training pool with a small weight) but I'd rather flag the failure mode honestly than re-run training to make the curve look monotonic.

Agent 2: PPO with League Play

PPO is the agent I expected to be useful, and broadly it is, though not as strong as I expected.

The network has a shared MLP trunk (two hidden layers of 256, ReLU) with separate policy and value heads. The PPO objective[*](Algorithm reference: Schulman et al. (2017), "Proximal Policy Optimization Algorithms".) with GAE[*](Algorithm reference: Schulman et al. (2016), "High-Dimensional Continuous Control Using Generalized Advantage Estimation".) is:

LCLIP(θ)=Et ⁣[min ⁣(ρt(θ)A^t,  clip(ρt(θ),1ε,1+ε)A^t)]\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \!\left[ \min\!\left( \rho_t(\theta) \hat{A}_t, \; \text{clip}(\rho_t(\theta), 1 - \varepsilon, 1 + \varepsilon) \hat{A}_t \right) \right]

with ρt(θ)=πθ(atst)/πθold(atst)\rho_t(\theta) = \pi_\theta(a_t | s_t) / \pi_{\theta_{\text{old}}}(a_t | s_t), clip ε=0.2\varepsilon = 0.2, GAE λ=0.95\lambda = 0.95, value coefficient 0.50.5, and entropy coefficient annealed from 0.05 down to 0.005 over training.

The entropy bonus is the part you cannot get wrong on board games. Action collapse (the policy converging to a single opening for every position) is the canonical PPO failure mode, and the symptom is the entropy diving below ~0.3 nats early. So entropy gets its own dedicated logging axis and the anneal schedule is slow on purpose.

The league is the structural piece I care most about. Each of the 8 parallel self-play envs picks an opponent at the start of every game from a mixture:

πopp=0.6πlatest+0.3Uniform({πtk}k=110)+0.1πMinimax-d4\pi_{\text{opp}} = 0.6 \cdot \pi_{\text{latest}} + 0.3 \cdot \text{Uniform}(\{\pi_{t-k}\}_{k=1}^{10}) + 0.1 \cdot \pi_{\text{Minimax-d4}}

The snapshot pool prevents the policy from forgetting how to beat its earlier selves (the rock-paper-scissors collapse that pure self-play is famous for), and the Minimax anchor pins the league to an external opponent so it can't drift into a closed cycle.

Training configuration:

  • Algorithm: PPO with shared trunk, separate policy + value heads, GAE
  • Learning rate: α=3×104\alpha = 3 \times 10^{-4} (Adam)
  • Discount factor / GAE: γ=0.99\gamma = 0.99, λ=0.95\lambda = 0.95
  • Rollout: 2048 steps × 8 parallel envs
  • Timesteps: 5×1065 \times 10^{6} env steps
  • PPO epochs / minibatch: 4 epochs, minibatch 256
  • Clip: ε=0.2\varepsilon = 0.2; value coef 0.5; entropy coef 0.050.0050.05 \to 0.005

Five million env steps:

PPO training curves

PPO learning curves. Left: policy and value losses (EMA); both decay smoothly, value an order of magnitude faster than policy. Middle: policy entropy (purple) tracking the annealed entropy coefficient (orange dashed). Entropy collapses from ~1.2 nats to ~0.7, healthy given the 6-action ceiling is log61.79\log 6 \approx 1.79. Right: eval winrate; Random saturates around 0.85, Minimax-d4 stays at zero. The faint purple verticals are checkpoint promotions to latest.pt.

The entropy curve is the success story. The EMA stays comfortably above 0.7 nats, the annealing coefficient does its job, and the policy keeps exploring across all six houses. No action collapse.

The not-success story is the winrate against Minimax-d4, which never leaves zero. PPO posted five promotion events (the faint purple verticals on the right panel) but each one was based on head-to-head improvement against prior snapshots, not against the d4 anchor. The agent gets better at beating versions of itself; it does not get measurably better at beating two-ply-deeper search.

Two things are likely going on:

  1. The league is too inward-facing. Minimax-d4 is one opponent out of eleven in the mixture (ten snapshots plus the anchor), so the policy gets relatively little gradient signal from games against it. Raising the d4 anchor weight from 10% to ~30%, and adding a low-weight d6 anchor, is the cheapest experiment that might unstick this.
  2. Terminal-only reward on a 300-ply game is a brutally sparse signal for a policy gradient method. PPO is doing well to discover anything coherent here at all. DQN's off-policy bootstrap from the replay buffer is, in this specific setting, a more sample-efficient way to propagate terminal credit backwards.

The Elo confirms both stories: PPO lands 245 points below DQN in the tournament.

Agent 3: AlphaZero-lite, or What Compute Starvation Looks Like

AlphaZero[*](Algorithm reference: Silver et al. (2018), "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play".) is the agent I expected to be strongest. It is the weakest. Worth being precise about why.

The architecture is the canonical recipe in miniature:

  • Body: a small ResNet (4 residual blocks of (Conv-BN-ReLU)×2, 64 channels) on the 2×6 board.
  • Policy head: 6 logits, π(as)\pi(a | s).
  • Value head: scalar v(s)[1,1]v(s) \in [-1, 1].

Total parameters ~150k. An MLP would work on a board this small, but the ResNet keeps the code reusable for larger boards.

MCTS at inference uses the PUCT formula:

a=argmaxa(Q(s,a)+cpuctπ(as)bN(s,b)1+N(s,a))a^* = \arg\max_{a} \left( Q(s, a) + c_{\text{puct}} \cdot \pi(a | s) \cdot \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)} \right)

with cpuct=1.5c_{\text{puct}} = 1.5. During self-play, the root prior is perturbed with Dirichlet noise, π(as0)=0.75π(as0)+0.25ηa\pi'(a | s_0) = 0.75 \cdot \pi(a | s_0) + 0.25 \cdot \eta_a with ηDir(0.5)\eta \sim \text{Dir}(0.5), to ensure the search explores moves the current policy thinks are bad. Default 200 simulations per move at training time; 400 at served inference.

The self-play loop is the standard three-actor setup:

  1. Self-play workers (CPU) play games using MCTS + current net and write (s,π,z)(s, \pi, z) tuples to a shared FIFO buffer, where π\pi is the visit-count distribution at the root and zz is the final outcome from that state's perspective.
  2. Trainer (GPU) samples minibatches and minimizes the AlphaZero loss:
L(θ)=(zvθ(s))2πlogpθ(s)+cθ2\mathcal{L}(\theta) = (z - v_\theta(s))^2 - \boldsymbol{\pi}^\top \log \mathbf{p}_\theta(s) + c \, \|\theta\|^2
  1. Evaluator plays the candidate against the current best every NN trainer steps and promotes if winrate 55%\geq 55\% over 100 games.

The training metrics, such as they are:

AlphaZero-lite training curves

AlphaZero-lite. Left: policy, value, and total loss across 13k trainer steps. Policy loss drops from 1.12 to 0.98; value loss is essentially flat at ~0.14; total loss tracks policy. Right: replay buffer fills from 1k to 52k positions, roughly a tenth of the configured 500k capacity.

The honest read of these curves is that this run never finished its warmup. The replay buffer never hit its 500k cap. The policy loss moved 12% in absolute terms, non-trivial but nowhere near the plateau where a learned prior starts to make MCTS dramatically more sample-efficient. The value loss is flat at 0.14, which means the value head has not yet learned to discriminate winning positions from losing ones beyond what you'd get from a near-constant predictor. With terminal-only reward and a tiny labelled buffer, that's exactly where you'd expect it to be after 13k trainer steps.

The tournament result reflects this:

Round-robin tournament Elo ratings

Tournament Elo. Round-robin across all present agents, 100 games per pair. Minimax-d6 anchors the top; DQN comes in second; PPO third; AlphaZero-lite below Random.

It is tempting to call this a bug. It is not. It is the predictable behaviour of an under-trained policy. With 13k trainer steps and a buffer that filled to a tenth of capacity, the network's prior π(as)\pi(a | s) is roughly uniform-with-noise, and MCTS at 200 simulations on top of a uniform prior is not meaningfully better than picking moves at random. The reason it ranks below Random in the round-robin is that its move distribution, while close to uniform, is biased toward specific pits by the noise in the small training set, and those biases happen to be exploitable by everything in the tournament, including Random's lack of bias.

The plan document for this project, written before training started, predicted this almost word for word:

AlphaZero will look weak for the first 100k self-play games then suddenly improve; this is normal. Don't kill the run too early.

This run killed too early, by a wide margin. The fix is not algorithmic; it is compute. AlphaZero is a method that pays back generously once the value head crosses a competence threshold, and gives you almost nothing before that. The threshold is somewhere between 100k and 500k self-play games on a board this size, and at the rate this run was generating data on CPU, that's days to a week of wall clock. I stopped at the first run because the other two agents had hit their interesting bits and I wanted to ship the tournament. But the AlphaZero number on the chart is "what an under-trained AlphaZero looks like," not "what AlphaZero on Oware looks like." Calling the difference would require another order of magnitude of compute, on a GPU, with the trainer and self-play workers properly scaled.

The Tournament

Round-robin, 100 games per pair, fixed time budgets per agent. The final Elo table:

AgentElo
Minimax-d61396
DQN1171
Minimax-d21064
Minimax-d41064
PPO926
Random777
AlphaZero-lite602

Three things in there are worth not glossing over.

Minimax-d2 and Minimax-d4 are tied. That is suspicious on the face of it: d4 sees two plies further. But the leaf evaluator is the same crude (store delta)+0.1(seed delta)(\text{store delta}) + 0.1 \cdot (\text{seed delta}) for both, and on Oware the marginal value of two extra plies is small when the judgment is this crude. d4 sees further but doesn't judge better; the extra search depth amplifies noise in the leaf evaluator as often as it sharpens the choice. A real strength gap between d2 and d4 would require either a better evaluator or much deeper search (d8+).

DQN sits between Minimax-d4 and Minimax-d6. That is the most surprising number in the table. A 15-vector MLP trained on terminal-only reward should not be beating two-ply-deeper alpha-beta. The most plausible explanation (and the first thing I'd test with a controlled rerun) is that DQN has learned to exploit the same evaluator-shaped blindspots that pin Minimax-d4 to a fixed Elo. The d4 search is greedy on a crude leaf signal; a value function that learned the distribution of outcomes, not just the immediate seed delta, will systematically beat a search that doesn't. This also explains why DQN's winrate against d4 climbs while its winrate against d2 stays at zero: d2's shallow search is more stylistically varied move-to-move and harder to model as a fixed exploitable opponent.

PPO underperforms DQN by 245 Elo. Annoying but consistent with the entropy/league diagnosis above: PPO got good at beating its league and never punched up against the external d4 anchor.

Inference Latency: The UX Knob

This is the table that determines what is actually playable in a browser:

AgentPer-move budgetNotes
Random< 1 mstrivially fast
Minimax-d250 msiterative deepening, alpha-beta, TT
Minimax-d4200 ms
Minimax-d6800 ms
DQN~5 mssingle forward pass
PPO~5 mssingle forward pass over masked logits
AlphaZero-lite400–800 ms200 MCTS sims (configurable)

These numbers double as a UX pacing knob. Too fast and the player feels the agent cheated; too slow and they feel the server died. The agents that move in 5 ms (DQN and PPO) get an artificial 300 ms delay on the wire before their move animates, just to make the game feel like a game. Minimax and AlphaZero don't need it.

What I'd Do Next

The postmortem, in order of expected payoff:

  1. Train AlphaZero properly. This is the single highest-value next step and the only one that requires real compute. A 500k-self-play-game run on a GPU, with the trainer and self-play workers properly parallelized and the buffer allowed to actually fill, is the difference between "AlphaZero on Oware" and "AlphaZero with the trainer turned off early." This was a deliberate scope cut for v1, not a design flaw. But the chart needs the asterisk until that compute is spent.
  2. Reweight the PPO league. Keep the snapshot pool but raise the Minimax-d4 anchor weight from 10% to ~30%, and add a Minimax-d6 anchor at low weight (5–10%). The policy needs gradient signal from opponents it actually has to beat, not opponents it can already dominate.
  3. Keep Minimax-d2 in the DQN curriculum past 150k steps. The current schedule rotates it out early; the eval-time winrate-vs-d2 staying at zero suggests the agent stops practicing against the d2 distribution and never recovers. Add d2 to the late-curriculum opponent mix with a small weight (~10%) and re-run.
  4. Replace the Minimax leaf evaluator. The d2/d4 Elo tie in the tournament is a leaf-evaluator artifact. A better static evaluator (even just the DQN's value head used heuristically) would unstick those two from each other and give every learned agent a less flat ceiling to push against.

Closing

The point of building three agent families on the same engine was never to pick a winner. It was to see what each method's failure mode looks like when you give it a small board, a terminal-only reward, and modest single-machine compute. What we got:

  • DQN: works as a sanity check; surprisingly competitive when paired against a crude search, with a curriculum bug that's easy to fix.
  • PPO: stable, well-behaved entropy, but bottlenecked by a league it dominates without learning to beat the anchor.
  • AlphaZero-lite: predictably weak before the compute threshold, and we stopped before that threshold. The chart is the chart.

Watching three different families of RL agent grow up against the same game, fail in three completely different ways, and produce three completely different shaped curves on the same axes. That was the actual joy of this project. The tournament numbers are interesting; the shape of how each one got there is more so.