Building an Oware Engine and a Ladder of RL Agents
Contents
Table of Contents
▼
Oware[*](The variant we implement is Abapa, the tournament ruleset standardized across West Africa and the Caribbean. It is also called Ayoayo, Awale, Warri, Adji-Boto, or Awele depending on where you grew up.) is an ancient, two-player abstract strategy game belonging to the Mancala family, widely played across West Africa and the Caribbean. Two rows of six pits, forty-eight seeds, anti-clockwise sowing, first to twenty-five captured seeds wins. The branching factor never exceeds six, games end in a few hundred plies, and the state space is well within reach of a laptop. That smallness is what makes it a good target: you can stand up an engine, a search baseline, and three learned agents on a single machine, and watch each one fail in its own family-specific way.
This post is the engineering story behind that ladder: a pure-Python engine for the Abapa ruleset, Minimax with iterative deepening as the search baseline, then DQN, PPO, and an AlphaZero-lite run that was cut short by limited compute. A round-robin tournament scores all seven (three search depths + three learned + Random) on Elo, and a FastAPI + React deployment makes them playable in the browser.
You can play any of them right here:
Live deployment: Pick an opponent from the agent picker. Each agent's tournament Elo is shown beside its name. The AlphaZero-lite number, in particular, should be read with the caveat developed later in this post.
The Game as an RL Problem
Oware is played on a board of two rows of six houses, with a store at each end. The internal indexing the engine uses:
Opponent (north)
11 10 9 8 7 6
0 1 2 3 4 5
Player (south)
A move consists of scooping all seeds from one of your own houses and sowing them, one at a time, anti-clockwise. If the last seed lands in an opponent's house that now contains exactly 2 or 3 seeds, those seeds are captured; if the house immediately before it on the opponent's row also contains 2 or 3, those are captured too, and so on backwards along the row. There are two structural rules that complicate this: on a long lap of 12 or more seeds, the source house is skipped, and a move that would leave your opponent with no legal reply on their next turn is forbidden.[*](The "no-starvation" rule (sometimes called the grand slam rule in Awari literature): if every legal move available to you would capture every remaining seed on your opponent's side, you must play a move that doesn't capture, leaving them at least one seed to play. This single rule is responsible for most of the late-game complexity.)
The reward signal we expose to every agent is terminal-only:
No reward shaping. No partial credit for captures, for holding seeds, for board control. The discount factor is . This is a deliberate constraint: reward shaping would speed up early learning but bake my own theory of Oware into every agent. With terminal-only reward, the agents are free to find their own.
The observation is small enough to write out in one line. Twelve pit counts plus two store counts, viewed from the side-to-move's perspective, plus a normalized ply counter:
The action space is , indexing the six houses on the side-to-move's row. Illegal actions are masked to on the policy side and excluded from the argmax on the value side.[*]("Masked on the policy side" means before the softmax, not after. Zeroing out illegal probabilities post-softmax leaves gradient leaking into the masked logits, and the policy quietly learns to push probability mass toward moves it can never play. The PPO loss curves look fine and the agent gets weaker. It is the single easiest bug to ship in board-game RL.)
Architecture
The system is layered so that the agents share absolutely everything except their decision logic.
1. Engine
A pure Python engine, no NumPy in the hot path (a list of twelve ints is faster than a tiny ndarray here). The public surface is six functions on an immutable State:
class State:
pits: tuple[int, ...] # length 12
stores: tuple[int, int]
to_move: int # 0 = south, 1 = north
ply: int
plies_since_capture: int
def initial_state() -> State: ...
def legal_moves(s: State) -> list[int]: ...
def step(s: State, action: int) -> tuple[State, int]: ...
def terminal(s: State) -> tuple[bool, int]: ...
def encode(s: State) -> np.ndarray: ...
Stateless because MCTS rollouts and self-play workers want to fork positions cheaply across threads, and an immutable State is trivially safe to share. This single property is what makes AlphaZero feasible at all on a CPU: each self-play worker can run its own MCTS without ever locking on the engine.
2. Gymnasium Env Wrapper
The engine is wrapped as a Gymnasium environment[*](Framework reference: Gymnasium, the maintained fork of OpenAI Gym.) where the opponent is configurable: Random, any Minimax depth, or a frozen NN snapshot. For self-play training the env exposes a two-agent mode that yields trajectories for both sides, with the value sign flipped at the boundary so the agent's reward is always from its own perspective.
3. Agents
Every agent (heuristic, search, or neural) implements one protocol:
class Agent(Protocol):
def choose_move(self, state: State, *, time_budget_ms: int | None = None) -> int: ...
name: str
metadata: dict
The server, the tournament, and the eval scripts don't care which family a given agent belongs to. The eval scripts in particular are family-agnostic: python scripts/tournament.py discovers every agent that's currently present on disk and plays them round-robin.
4. Server + Frontend
FastAPI[*](Framework reference: FastAPI, the same async Python framework I've used in previous projects.) with one WebSocket endpoint (/play) and a tiny REST surface (GET /agents returns the agent ladder with Elo, GET /healthz). Inference for NN agents runs in a threadpool executor so a slow MCTS rollout never blocks the event loop. Minimax is synchronous but bounded by an iterative-deepening time budget, so the server always responds within UX limits.
Frontend is React + Vite + TypeScript. State management is useReducer for local game state plus a small Zustand store for the WebSocket connection. No Redux on a single-page game.
Minimax
Before any learned agent, three Minimax variants at depths 2, 4, and 6. Alpha-beta pruning, a transposition table keyed on the canonical position hash, iterative deepening so the agent can always return a move within budget. The leaf evaluator is intentionally crude:
The first term rewards captured seeds, which is what wins the game. The second term breaks ties by preferring positions with more material on your side of the board, a weak proxy for "future capture potential."
These are not strong opponents in any absolute sense. Oware is weakly solved by retrograde analysis[*](Romein and Bal (2002), "Solving Awari"; they showed the game is a draw under perfect play, using a complete endgame database.) and a real engine like Awari is exact. But Minimax-d2/d4/d6 are calibrated, deterministic, and cheap to run, which is exactly what we want as a yardstick. The whole training protocol hangs off these three opponents: every learned-agent checkpoint is evaluated against d2, d4, and d6, and promotion to latest.pt requires winning at least 55% of a 400-game head-to-head against the previous best.
Agent 1: DQN
DQN earns its place in this project mostly by being unfussy. An off-policy value method on a 15-vector input is not where the interesting research lives, but it is cheap to stand up and forgiving to tune, which is what you want from the first learned agent on a new environment.
The network is a three-layer MLP, 128 hidden units, ReLU, with a dueling head[*](Algorithm reference: Wang et al. (2016), "Dueling Network Architectures for Deep Reinforcement Learning".) that splits the position value from the per-action advantages:
The split is almost free on six actions and matters in early-game positions where most moves are equally fine. The trunk is Double DQN[*](Algorithm references: van Hasselt et al. (2016), "Deep Reinforcement Learning with Double Q-learning", and Schaul et al. (2016), "Prioritized Experience Replay".) with a target net refreshed every 2k steps, prioritized replay (, ), and epsilon decayed from 1.0 to 0.05 over 200k steps. The loss is Huber on the TD error with illegal actions masked out of the target's :
That masking is non-optional. Forget it, and the target net keeps quietly assigning value to moves the agent can never play; the policy drifts toward them and saturates somewhere around Minimax-d2.
Trained for env steps with Adam at , batch 256, , replay buffer of 200k transitions. Opponents rotate on a schedule: Random up to 50k, a 50/50 Random + Minimax-d2 mix from 50k to 150k, then a snapshot pool with a Minimax-d4 anchor after that.
The Random curve pinned at is the sanity check passing: masking works, reward direction is correct, the engine is not lying to the agent. The flat zero line against Minimax-d2 is the more interesting artifact, because d2 is not a brick wall and the same agent climbs to ~50% against the strictly deeper d4.
The most likely cause is that the curriculum forgets d2. The schedule rotates it out at 150k steps, and from then on the agent only sees snapshot opponents and the d4 anchor. By 1.3M steps the policy has specialized against the mistakes d4's deeper-but-greedier search makes, and has lost the ability to handle d2's shallower, more varied play. The fix is mechanical: leave d2 in the late-curriculum opponent mix at a small weight. I'd rather flag the failure mode honestly than re-run training to make the curve look monotonic.
Agent 2: PPO
PPO was the agent I expected to be broadly useful, and broadly it is. Just not as strong as I'd hoped.
The network is a shared MLP trunk (two hidden layers of 256, ReLU) with separate policy and value heads. The objective[*](Algorithm reference: Schulman et al. (2017), "Proximal Policy Optimization Algorithms".) is the standard clipped surrogate with GAE[*](Algorithm reference: Schulman et al. (2016), "High-Dimensional Continuous Control Using Generalized Advantage Estimation".):
with , clip , GAE , value coef 0.5, and entropy coef annealed from 0.05 to 0.005.
That entropy term is the piece you cannot get wrong on a board game. Action collapse, where the policy converges to one opening for every position, is the canonical PPO failure mode, and the symptom is entropy nosediving below ~0.3 nats early. So entropy gets its own logging axis and a deliberately slow anneal.
The other structural piece is the league. Each of the 8 parallel self-play envs picks an opponent at the start of every game from a mixture:
The snapshot pool keeps the policy from forgetting how to beat its earlier selves (the rock-paper-scissors collapse that pure self-play is famous for), and the Minimax anchor pins the league to an external opponent so it cannot drift into a closed cycle.
Trained for env steps with Adam at , rollouts of 2048 steps across 8 envs, 4 PPO epochs per update, minibatch 256, .
latest.pt.The entropy curve is the success story. The EMA stays comfortably above 0.7 nats, the anneal does its job, and the policy keeps exploring across all six houses. No action collapse.
The not-success story is the winrate against Minimax-d4, which never leaves zero. PPO posted five promotion events (the faint purple verticals) but each one was head-to-head improvement against prior snapshots, not against the d4 anchor. The agent gets better at beating versions of itself and does not get measurably better at beating two-ply-deeper search.
Two things are likely in play. The league is too inward-facing: Minimax-d4 is one opponent out of eleven in the mixture, so the policy gets little gradient signal from games against it. The cheapest experiment is raising the d4 anchor weight from 10% to ~30% and adding a low-weight d6 anchor. The second is structural: terminal-only reward on a 300-ply game is a brutally sparse signal for a policy gradient method, and PPO is doing well to discover anything coherent at all. DQN's off-policy bootstrap from a replay buffer happens to be a more sample-efficient way to propagate terminal credit backwards in this specific setting.
The Elo confirms both stories. PPO lands 245 points below DQN in the tournament.
Agent 3: AlphaZero-lite
This was supposed to be the headline act. AlphaZero[*](Algorithm reference: Silver et al. (2018), "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play".) on a board this small ought to demolish a depth-6 minimax. Instead it finished dead last, below Random.
The setup is the canonical recipe at laptop scale: a 150k-parameter ResNet (four residual blocks, 64 channels) reading the 2×6 board, with a six-logit policy head and a scalar value head .
At move time the net feeds MCTS, which picks children by PUCT:
, 200 simulations per move in training, 400 at serve time. The root prior is roughed up with Dirichlet noise, , , so the search keeps probing moves the current policy dismisses.
The loop is the textbook three-actor split. CPU workers self-play with the current net and stream tuples into a shared FIFO buffer. The trainer pulls minibatches off it and descends the AlphaZero loss,
so the policy mimics MCTS visit counts and the value head predicts the outcome. An evaluator periodically promotes the candidate past a 55% bar over 100 games.
What those three actors actually produced:
These curves say the run never finished its warmup. The buffer filled to a tenth of its 500k cap, policy loss moved 12%, and the value head sat flat at 0.14, which is what a predictor looks like before it learns to tell winning positions from losing ones. With terminal-only reward on a 300-ply game, 13k steps lands you exactly there.
The tournament hands in the verdict:
There is no bug to chase. This is what an under-trained AlphaZero looks like. The prior is essentially uniform-with-noise, and 200 MCTS sims on a uniform prior is not meaningfully better than a coin flip. The reason it lands below Random is the twist: the policy is almost uniform but slightly biased toward specific pits by noise in the tiny training set, and every other agent in the tournament happily exploits a consistent bias that a true random opponent does not have.
The fix is not algorithmic. It is compute. AlphaZero gives you almost nothing until the value head crosses a competence threshold, and pays back generously on the other side. On this board that threshold sits between 100k and 500k self-play games, which at CPU-worker rates was days-to-a-week of uninterrupted wall clock I did not have. I stopped at the first run because DQN and PPO had already produced something worth writing about and I wanted the tournament shipped. So the AlphaZero bar is "what an under-trained AlphaZero looks like," not "what AlphaZero on Oware looks like." Closing that gap is a GPU and a long weekend, not a redesign.
The Tournament
Round-robin, 100 games per pair, fixed time budgets per agent. The final Elo table:
| Agent | Elo |
|---|---|
| Minimax-d6 | 1396 |
| DQN | 1171 |
| Minimax-d2 | 1064 |
| Minimax-d4 | 1064 |
| PPO | 926 |
| Random | 777 |
| AlphaZero-lite | 602 |
Three rows in that table are worth not glossing over.
The first is that Minimax-d2 and Minimax-d4 land at the same Elo. On the face of it that should not happen: d4 sees two plies further. But both depths share the same crude leaf, and on Oware the marginal value of two extra plies is small when the judgment underneath is this rough. d4 sees further; it does not judge better, and the deeper search amplifies noise in the evaluator as often as it sharpens the choice. A real strength gap between the two would need either a better leaf eval or much deeper search (d8+).
The second is DQN sitting between Minimax-d4 and Minimax-d6, which is the most surprising number in the table. The honest answer is that I don't have a controlled experiment isolating why, only the observation that DQN's eval-time winrate climbs against d4 while staying flat at zero against d2. One reading is that d4, with its deeper but still crude leaf eval, ends up more predictable than d2's shallower play, and a learned value function can converge on whatever shape "predictable" takes. Confirming that would need a rerun against a stronger leaf evaluator on the Minimax side; until then the gap is interesting more than explained.
The third is PPO trailing DQN by 245 Elo. Annoying but consistent with the entropy and league diagnosis above. PPO got good at beating its league and never punched up against the external d4 anchor.
Inference Latency
This is the table that determines what is actually playable in a browser:
| Agent | Per-move budget | Notes |
|---|---|---|
| Random | < 1 ms | trivially fast |
| Minimax-d2 | 50 ms | iterative deepening, alpha-beta, TT |
| Minimax-d4 | 200 ms | |
| Minimax-d6 | 800 ms | |
| DQN | ~5 ms | single forward pass |
| PPO | ~5 ms | single forward pass over masked logits |
| AlphaZero-lite | 400–800 ms | 200 MCTS sims (configurable) |
These numbers double as a UX pacing knob. Too fast and the player feels the agent cheated; too slow and they feel the server died. The agents that move in 5 ms (DQN and PPO) get an artificial 300 ms delay on the wire before their move animates, just to make the game feel like a game. Minimax and AlphaZero don't need it.
What I'd Do Next
A short postmortem, ordered by expected payoff.
The highest-leverage move is to train AlphaZero properly. A 500k-self-play-game run on a GPU, with the trainer and workers actually parallelized and the buffer allowed to fill, is the difference between "AlphaZero on Oware" and "AlphaZero with the trainer turned off early." It was a scope cut for v1, not a design flaw, but the chart needs the asterisk until that compute is spent.
After that, reweight the PPO league. Keep the snapshot pool, raise the d4 anchor from 10% to ~30%, and add a d6 anchor at 5–10%. The policy needs gradient signal from opponents it actually has to beat, not opponents it already dominates. In the same spirit, keep Minimax-d2 in the DQN curriculum past 150k steps. The zero winrate against d2 is the agent forgetting a distribution it stopped seeing, and a small (~10%) weight in the late-curriculum mix should be enough to fix it.
Finally, replace the Minimax leaf evaluator. The d2/d4 Elo tie in the tournament is a leaf-evaluator artifact, not a search-depth one. A better static eval (even the DQN's value head used heuristically) would unstick those two and give every learned agent a less flat ceiling to push against.
Closing
The point of building three agent families on the same engine was never to pick a winner. It was to see what each method's failure mode looks like on a small board, a terminal-only reward, and modest single-machine compute. What I got was three completely different shapes on the same axes: DQN held up as a sanity check and turned out surprisingly competitive against a crude search, PPO stayed stable and well-behaved on entropy but got trapped inside a league it could already dominate, and AlphaZero-lite did exactly what an under-trained AlphaZero does, which is sit at the bottom of the chart until the compute arrives.
Watching three families of RL agent grow up against the same game and fail in three completely different ways was the actual joy of this project. The tournament numbers are interesting. The shape of how each one got there is more so.