Making an open source engine for Magic the Gathering in Python

July 02, 2026

The gist
What makes Magic particularly hard?
How other people have “hacked” Magic
What I did
Results
What I did with my account
GitHub

The gist

Months ago, I participated in a hackathon where the prompt was to make a programming language and then make a game in that language. There I built a few DSLs specifically for card games. Since then, I’ve been curious about what it would take to have a Stockfish for Magic, and I went through a tumultuous journey to put together the pieces I have. Hope you enjoy :)

What makes Magic particularly hard?

Magic the Gathering is one of the most popular collectible card games and literally the only reason the company that makes Monopoly is still alive. Due to being an incomplete information game as well as one whose mechanics are Turing complete, it is one of the hardest games in the world.

How can information be incomplete?

Unlike chess or Go, Magic operates with incomplete information, like poker or Mafia. By “incomplete information”, I mean that some knowledge relevant to the game is not available to all players. The simplest example is rock-paper-scissors: knowing what your opponent is going to select is the difference between the game being “solvable” and it being a disguised Monty Hall problem.

In the case of the Monty Hall problem, the “solution” can be determined by broadening out beyond what’s at face value. Let’s say you selected the first door and then learned behind the third door is a goat.

graph LR; first_door["First door (Selected)"] --- second_door[Second door] --- third_door["Third door (Goat)"]

This leaves us with two options: either the first door, or the second door.

graph LR; first_door["First door (Selected)"] --- second_door[Second door]

On first impression, you might think there’s a 50% chance of being right whether or not you switch. But two options does not mean two equally likely possibilities. You have a 2/3 chance of choosing a door with a goat in the beginning and, whenever that happens, switching wins 100% of the time (the other goat has already been revealed, so the remaining door must hide the car). Switching therefore wins 2/3 of the time, while staying wins only 1/3.
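The 2/3 figure can be checked by brute force. The sketch below enumerates every combination of car position and first pick with exact fractions; the function name `win_probability` is made up for illustration, and it assumes the host always opens a goat door other than the one you picked.

```python
from fractions import Fraction
from itertools import product

DOORS = (0, 1, 2)

def win_probability(switch: bool) -> Fraction:
    """Exact chance of winning the car, averaged over every car position and first pick."""
    wins = 0
    total = 0
    for car, pick in product(DOORS, DOORS):
        # The host opens a goat door that is not the player's pick.
        # When pick == car the host has two choices; either leads to the same result.
        opened = next(d for d in DOORS if d != pick and d != car)
        if switch:
            final = next(d for d in DOORS if d != pick and d != opened)
        else:
            final = pick
        wins += final == car
        total += 1
    return Fraction(wins, total)

print(win_probability(switch=False))  # 1/3
print(win_probability(switch=True))   # 2/3
```

Nothing here is random: of the nine equally likely cases, staying wins the three where the first pick was the car and switching wins the other six.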

A technique for incomplete information

Similar to the Monty Hall problem, the key technique in ReBeL (an algorithm that combines reinforcement learning with search for imperfect-information games) is the expansion from hidden to public knowledge. In the ReBeL paper, the authors suggest a version of rock-paper-scissors with a twist: whenever you win using scissors, you get two points and, whenever you lose with scissors, you lose two points.

Should we only consider what’s immediately in front of us (an opponent about to play a shape you don’t know), then there’s no reason to deliberate and you may as well pick at random.

Diagram of blind rock-paper-scissors player from the ReBeL paper

If we consider the possible responses from the opponent player, then we can work out an optimal policy for how often to make certain plays. This is thanks to being able to weigh the different expected points or rewards from these different outcomes.

Diagram of aware rock-paper-scissors player from the ReBeL paper
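To make "weighing expected points" concrete, here is a small sketch of the modified game. It assumes the game is zero-sum (what one player gains, the other loses), so any result involving scissors is worth 2 points either way. The names `PAYOFF` and `response_values` are illustrative, and the 2/5, 2/5, 1/5 mix is simply what you get from making the opponent indifferent between all three shapes.

```python
from fractions import Fraction

SHAPES = ("rock", "paper", "scissors")

# PAYOFF[mine][theirs] is my score: any result involving scissors is worth 2 points.
PAYOFF = {
    "rock":     {"rock": 0,  "paper": -1, "scissors": 2},
    "paper":    {"rock": 1,  "paper": 0,  "scissors": -2},
    "scissors": {"rock": -2, "paper": 2,  "scissors": 0},
}

def response_values(opponent_mix):
    """Expected score of each pure shape against an opponent's mixed strategy."""
    return {
        mine: sum(opponent_mix[theirs] * PAYOFF[mine][theirs] for theirs in SHAPES)
        for mine in SHAPES
    }

uniform = {shape: Fraction(1, 3) for shape in SHAPES}
balanced = {"rock": Fraction(2, 5), "paper": Fraction(2, 5), "scissors": Fraction(1, 5)}

print(response_values(uniform))   # rock scores 1/3 on average, so uniform play is exploitable
print(response_values(balanced))  # every reply scores 0, so nothing exploits this mix
```

Playing each shape a third of the time is no longer safe once scissors carries extra weight; the balanced mix plays scissors less often to compensate.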

With poker, you have ranges: the probabilities of winning with different hands in different seats. The important detail about ranges is that there is a finite number of probabilities to determine, whether you map exact holdings (ie a king of hearts with a jack of spades) or general hands (ie a seven-deuce offsuit). Unlike rock-paper-scissors or poker, Magic has no finite ceiling on the total amount of information a game can contain. This is thanks to it being Turing complete: an unbounded amount of information can emerge in a single game.
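For a sense of how small the poker side of that comparison is, the sketch below counts two-card starting hands both ways. It assumes Texas hold'em (two hole cards from a standard 52-card deck); the helper `hand_class` is a made-up name for illustration.

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [rank + suit for rank in RANKS for suit in SUITS]

def hand_class(card_a: str, card_b: str) -> str:
    """Collapse an exact two-card holding into its general class, e.g. '72o' or 'AKs'."""
    high, low = sorted((card_a, card_b), key=lambda c: RANKS.index(c[0]), reverse=True)
    if high[0] == low[0]:
        return high[0] + low[0]  # pocket pair
    suffix = "s" if high[1] == low[1] else "o"
    return high[0] + low[0] + suffix

exact_holdings = list(combinations(DECK, 2))
general_classes = {hand_class(a, b) for a, b in exact_holdings}

print(len(exact_holdings))   # 1326
print(len(general_classes))  # 169
```

Either way the table of probabilities has a fixed size before the first card is dealt, which is exactly what a game of Magic does not offer.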

Turing complete whatnow?

Back in 2019, it was shown that Magic itself could run a computer, much like a Redstone computer in Minecraft. By “computer”, I mean the simplest version of a computer (the Turing machine). This ties into the complexity of the game through the Halting problem, which sets a hard limit on what computers can work out, much like how physics says you can’t run into a wall and expect to pass through. (Source to below animation)

Unlike chess or poker, where players make one decision per turn (moving one piece in chess, a call or fold in poker), Magic allows a sequence of interactions in a single turn. Couple this with the large number of cards published and you can end up with infinite combos that may or may not ruin the friend group you’re playing with. Since infinite storage does not exist, we can lean on a claim made by the original authors: the game itself may not be computably decidable, but it may be transition computable.

They leave this as something they believe more than they can formally prove, given the corpus of 20,000+ cards at the time. However, I think it can be safely assumed given the following:

It’s possible to establish the static rules of the game for non-infinite situations
It’s possible to define how cards and effects apply to each other for non-infinite situations
There are rules for players to assign finite numbers to infinite loops

While this does not rigorously cover all cases, the amount of gameplay around the world and the work done by employees at Wizards of the Coast lead me to think we’ve exhausted most of the obvious ones. Therefore, we should be able to have some software that handles moving from one state in the game to the next and not fret over the technical un-computability of it all.

As a final note on Turing completeness: while PowerPoint and games such as Minecraft share the ability to have a computer run inside their environment, Magic is the only game that is both Turing complete and requires more than a single player by design. With this, it’s the closest thing to a capture-the-flag for people who choose fantasy over sci-fi.

How other people have “hacked” Magic

Magic the Gathering has been around for over 30 years and, in that time, plenty of software has been built around and for the game. Below are some of those different products and projects:

Official Products

Arena is the latest app distributed by Wizards of the Coast and it covers all the great stuff from cards to mechanics, and even Universes Beyond! While their first attempt at programming the game was infamously riddled with bugs, Arena stands strong at millions of downloads plus players. In the game, they have a bot who’s available to play against named “Sparky”. While not the strongest in performance, Sparky does act as their effective benchmark for a codified Magic player.

Relevant to one of the other projects, and to what I ended up building, is that Arena has a setting to write game events to a log file in real time. This means that, rather than having to OCR the entire screen to get data relevant to the game, you can parse predictable strings (this is also how Untapped.gg is able to “replay” historical games).

Java-based

For a reason that’s not entirely clear to me, a lot of card game development in and outside of research uses Java. The two big contenders here in open source are Forge and Mage. Both consist of engines to handle the game as well as UIs for local playing experiences. When I was attempting some early experiments, I found that the headless mode was not actually “headless”: UI methods still showed up in the stack trace. Running the engine in headless mode was like running a browser in headless mode; it may be running less stuff, but it kinda doesn’t work without the rendering parts.

Both Forge and Mage having this UI overhead (and Forge eating up my memory faster than an old person’s Alzheimer’s), ultimately contributed to my decision later on to develop a new engine. However, they do have extensive (or exhaustive depending on your views) heuristics already programmed so these two can also act as benchmarks for playing against.

MageZero

MageZero is, sadly, not a shipped bot or even a collection of bots. Instead, it’s a toolkit (remember when people wouldn’t shut up about LangGraph?) for training your own deck-specific RL agents. The purpose behind this project is simple: different decks of cards affect the game as much as the game itself does. When you train a chess or Go bot, you’re always starting with the same pieces on the same board with the same rules. In Magic, there are cards that make people draw extra cards at the start of their turn or prevent players from casting any spells so the cards involved in a player matchup matter a great deal.

While the implication of game-changing by cards is striking, it’s built on top of Mage which means it’s dependent on its Java-based engine; ergo another not-so-preferred target for me.

Deck building bot

I found this through a comment on Hacker News where a draft player built a bot to both choose cards during drafting as well as consolidate his final deck. While this did yield a comparable win rate to when he’s not using algorithms for deckbuilding, he was the one playing those games, not the bot.

Additionally, it’s unclear whether the sets he trained against were the same as the ones in the drafts he played. Even in the case that it’s true, it still presents an interesting result to be able to map his preferences onto a neural network based on his historical gameplay (one may have expected a more intricate model to be needed for a game like Magic).

Simplified Engines

Legends of Code and Magic

For a couple of years in a row, there was an online competition based on Legends of Code and Magic, a card game similar to Magic but designed so that bot matches would be fair. Having a simpler game to work with meant more people with less compute were able to participate as well. Over the years, various techniques from heuristics to neural networks have been employed, but entrants ended up either battling over marginal improvements or scoping each improvement into modular, plug-and-play strategies.

open-mtg

The open-mtg project might be the closest thing to what I was looking for in the first place but it falls short in a few ways.

Firstly, the last commit was seven years ago and the game has changed quite a bit since then. Secondly, it operates on a limited subset of the entire rules of the game which may or may not be misleading with respect to observed results (maybe there exists a ceiling for the number of rules that suffice in some given architecture in a non-obvious way).

What I did

Frustrations which led to a new engine

Of the experiments I tried after reading papers, the one I’m most disappointed didn’t work was making a DSL for game strategies (imagine COBOL for describing patterns or plays) and then letting NEAT improve the underlying syntactic graph of a program. Maybe I didn’t add enough richness to the DSL or the NEAT implementation wasn’t granular enough, but a lot of failures and crashes pointed to one common frustration: the Java-based engine was more than I needed and computationally expensive.

While Forge’s DSL for cards is neat, it exposes an inherent issue with keeping the engine up to date. In Magic, there are both rules and cards which may alter the game; all of these being relevant across an engine’s stack. Let’s take a look at what happens when a new set of cards is released with changes to the rules:

graph LR;
  subgraph LR forge[Engine with UI and Card DSL]
    engine[Engine] --- ui[UI]
    engine --- dsl[Card DSL]
  end
  new_set[New cards] -->|"New cards to apply"| engine
  new_set -->|"New cards to present"| ui
  new_set -->|"New cards to handle"| dsl
  changes_to_rules[Changes to rules] -->|"Rules to be added/updated"| engine
  changes_to_rules -->|"Changes to what/how is displayed"| ui
  changes_to_rules -->|"Rules that affect how cards are handled"| dsl
  linkStyle 0 stroke-width:4px,stroke:red
  linkStyle 1 stroke-width:4px,stroke:red
  linkStyle 2 stroke:blue
  linkStyle 3 stroke:blue
  linkStyle 4 stroke:blue
  linkStyle 5 stroke:green
  linkStyle 6 stroke:green
  linkStyle 7 stroke:green

New work needing to be done across the stack isn’t too terrible since there is a finite number of cards out there and only a finite number of new sets or rule changes happening each year. However, the intertwining of the pieces inside the engine does mean a “vibe refactor” like what Bun did going from Zig to Rust becomes a lot trickier. In fact, a “vibe refactor” for either of the Java engines would be even less trivial since, unlike Bun, the test suites are written in the same language as the project.

However, building a new engine with a focus on performance would help in a number of ways. First, it’d help wrestle with the question of “if we looked far enough ahead, could we generally find wins?”. Second, it’d only use the compute necessary for games so training models on more data (synthetic or self-played) becomes more realistic.

How I vibed the engine

So, if not a refactor, I needed some way to vibe code a new engine (there are tens of thousands of cards out there and I am but one mere mortal). Each time I reviewed a possible way to verify agent output, the pattern I was looking for was: take the latest rules (which are available on the Wizards of the Coast website as TXT or DOCX), loop over each rule, convert it to some spec, and append it to our verification tool (building up a test suite rule by rule).

graph LR;
  rules[Get latest rules] --> next_rule[Get next rule]
  next_rule --> translate[Translate to spec]
  translate --> add[Append to verification to eventually be used by agents]
  add --> next_rule
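The loop in that diagram can be sketched in a few lines. This is only an illustration of the shape of the pipeline, not the code from the project: the regular expression assumes rule lines start with a dotted number such as `704.5f`, the numbering in `sample` is illustrative, and `to_spec` is a placeholder for the real translation step.

```python
import re

# Lines shaped like the plain-text rules: a dotted rule number, then the rule text.
RULE_LINE = re.compile(r"^(?P<number>\d{3}\.\d+[a-z]?)\.?\s+(?P<text>.+)$")

def iter_rules(lines):
    """Yield (rule_number, rule_text) pairs, skipping headings and blank lines."""
    for line in lines:
        match = RULE_LINE.match(line.strip())
        if match:
            yield match["number"], match["text"]

def to_spec(number, text):
    """Placeholder translation step: here a spec is just a named, unimplemented check."""
    return {"id": f"rule_{number.replace('.', '_')}", "source": text, "status": "todo"}

sample = [
    "7. Additional Rules",
    "",
    "704.5. The state-based actions are as follows:",
    "704.5f If a creature has toughness 0 or less, it's put into its owner's graveyard.",
]

# Build up the suite rule by rule, as in the diagram.
suite = [to_spec(number, text) for number, text in iter_rules(sample)]
print([spec["id"] for spec in suite])  # ['rule_704_5', 'rule_704_5f']
```

The interesting work is hidden inside `to_spec`; everything around it is bookkeeping.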

At one point, I was looking into preparing another swarm with formal verification like TLA+ to verify the behavior in the same way that integration test suites are used for verifying “vibe refactors”. The one I had the most hope for was Dafny since it officially compiles to Python. However, it wasn’t clear whether the language would be able to represent enough logic to handle the rules of Magic. If involved, the purpose of the verification piece is to be the checkbox that matters with respect to completion; otherwise, it may as well be AI psychosis to think anything succeeded.

But then, another realization sank in: however I interpret the text of the rules when converting it into a formal verification language is, in and of itself, going to be easy to argue with or prone to faults. Then, I remembered that the rules are themselves subject to interpretation, an example being when a player named a card that was not the one on the board but was understood to be implied (the “Borborygmos Incident”). Arguments over the US Constitution aren’t about the content itself so much as they’re about the interpretation of said content. Given this as well as the general complexity in Magic, it seemed like the answer could be simpler:

graph LR;
  rules[Get latest rules] --> interpret[Interpret into engine code]
  cards[Get latest cards] --> interpret

Instead of architecting a new engine with its own complexity and costs as the game expands, boiling down the game into a means of interpreting the English texts from the rules and cards makes it more adaptable to changes to the game (which must be somewhat legalese given millions play the game and some compete for millions). Maintaining this “English interpreter” is less a problem with a codebase and more reviewing the interpretation of grammatical patterns in texts.

graph LR;
  rules[Get latest rules] --- r_can_interpret[Can interpret based on the last grammar]
  rules --- r_cannot_interpret[Cannot interpret based on the last grammar]
  cards[Get latest cards] --- c_can_interpret[Can interpret based on the last grammar]
  cards --- c_cannot_interpret[Cannot interpret based on the last grammar]
  r_can_interpret --> interpret[Interpret into engine code]
  c_can_interpret --> interpret
  r_cannot_interpret --> to_maintain[Incrementally novel English grammar is used]
  c_cannot_interpret --> to_maintain

This makes the choice of output language more significant. Had they been more functionally complete, I’d have gone for the DSLs I made at the hackathon months ago, which were built solely for this type of problem. There is, fortunately, a solution that not only follows what’s been used historically with games in academia but also offers a C++ build in the end (making an integration with Python similar to how PyTorch uses Python to drive a Torch C++ backend).

Datalog

In the world of academic work relating to games, there is a tool by the name of Game Description Language (or GDL for short), which is a variant of Datalog. While it was originally developed in the pursuit of general game playing, its base language is suited rather well for what we’re looking for. Unlike its parent language, Prolog (which was used to write the first version of Erlang!), Datalog is not Turing complete, so you never have to worry about “what if this goes off to infinity?”. This is thanks to the underlying bottom-up architecture, where it builds up truths from known facts, in contrast to Prolog, which starts with a query and then works backwards through established rules to check whether it holds.
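Bottom-up evaluation is easy to mimic in plain Python, which helps show why it terminates. The sketch below is a naive fixpoint loop, not how Souffle is implemented; the relation name `affects` and the card labels are made up for illustration.

```python
def bottom_up(facts, rules):
    """Apply every rule to the known facts until no new fact appears (a fixpoint)."""
    known = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(known)
        if derived <= known:
            return known
        known |= derived

def transitive(known):
    """affects(X, Z) :- affects(X, Y), affects(Y, Z)."""
    return {
        ("affects", x, z)
        for (_, x, y1) in known
        for (_, y2, z) in known
        if y1 == y2
    }

facts = {
    ("affects", "anthem", "knight"),
    ("affects", "knight", "squire"),
    ("affects", "squire", "anthem"),  # a cycle, which can trap a naive depth-first, top-down search
}

closure = bottom_up(facts, [transitive])
print(len(closure))  # 9
```

With three names there are only nine possible `affects` pairs, so the loop has to stop once it has found them all, cycle or not. A rule set that only recombines existing names can never produce an unbounded set of facts.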

Following the end of the Turing complete section where I describe Magic as being “transition computable” but not “game computable”, there cannot exist a deterministic program that accepts a Magic game in its entirety and strictly computes whether or not the game ends. But, there can be a program that would accept a game and tell you what happens next. This is what we would look for in an engine that describes the state of the game in order to provide to a bot which drives decisions according to wherever the engine is currently at.
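The distinction can be shown with a toy state, heavily simplified and not taken from the engine. `step` always answers "what happens next?", while `run` can only answer "does the game end?" within a step budget, returning `None` when it cannot tell. The `State` fields and the damage numbers are invented for the example.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class State:
    life: int      # opponent's life total
    looping: bool  # True while a life-gain loop keeps repeating

def step(state: State) -> State:
    """One transition: always returns in finite time, whatever the state."""
    if state.looping:
        return replace(state, life=state.life + 1)  # the loop gains the opponent 1 life
    return replace(state, life=state.life - 3)      # otherwise they take 3 damage

def run(state: State, budget: int):
    """Advance until the game ends or the step budget runs out (None means undecided)."""
    for _ in range(budget):
        if state.life <= 0:
            return state
        state = step(state)
    return state if state.life <= 0 else None

print(run(State(life=20, looping=False), budget=100))  # State(life=-1, looping=False)
print(run(State(life=20, looping=True), budget=100))   # None
```

An engine only needs `step` to be reliable; deciding how long to keep calling it is left to the players or the bot.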

Another variant of Datalog, Souffle, is the one I’m using and it also happens to have been used to find vulnerabilities in the Java JDK plus used by Amazon to verify their VPN connections. To remedy changes to larger game states (ie late in the game or a large number of creatures are in play), I had Claude translate the work from a 2021 paper and branch to a fork of Souffle. This is so our engine can handle editing the state versus refreshing from scratch and being able to discern which case is more optimal. This is useful for when a single creature kill does nothing versus when a single creature kill requires untangling a bunch of effects.

Transpiling English to Datalog

There is an old-fashioned way of looking at sentences called the Reed-Kellogg sentence diagram where you deconstruct a sentence into a tree where the branching describes structure and modifiers.

Programmatically, there are tools that provide “dependency parsing” where you take in some text and get back a graph like the ones shown above. The one I used is spaCy and, once you have a graph containing relations among words or tokens, then you have an abstract syntax tree!

This is powerful because programming languages from Rust to Odin all start, under the hood, by taking the source code (which is really just text), shuffling it around a graph structure, then finally producing your output. As an example in our application, let’s consider the following rule:

“If a creature has toughness 0 or less, it’s put into its owner’s graveyard.”

Right away, we know this is a conditional with the “If” at the beginning. We can prepare the conclusion “it’s put into its owner’s graveyard” as the effect from satisfying what is entailed in the condition clause. Here the only real value that matters is the “creature” and it having the “toughness” that is, in this case, “0 or less”, which maps to <= 0. One comment about the below diagram is the “O”s down in the final Souffle are uppercase o’s and not zeros (which have zero fills), this is the convention for referring to “object” variables in Souffle.
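As a rough sketch of the last step only, the snippet below renders that conditional as a Souffle-style rule. The `Conditional` object is filled in by hand here, standing in for what a dependency parse would provide, and every relation name (`creature`, `toughness`, `put_into_owners_graveyard`) is an assumption for illustration rather than the project's actual schema.

```python
import re
from dataclasses import dataclass

@dataclass
class Conditional:
    subject: str    # noun the condition is about, e.g. "creature"
    attribute: str  # property being tested, e.g. "toughness"
    bound: str      # English comparison, e.g. "0 or less"
    effect: str     # hypothetical relation name for the conclusion

COMPARISONS = {"or less": "<=", "or greater": ">=", "or more": ">="}

def comparison(bound: str) -> str:
    """Turn '0 or less' into '<= 0'; a bare number means equality."""
    match = re.fullmatch(r"(\d+)(?: (or less|or greater|or more))?", bound)
    if match is None:
        raise ValueError(f"unrecognized bound: {bound!r}")
    number, phrase = match.groups()
    return f"{COMPARISONS[phrase] if phrase else '='} {number}"

def to_souffle(rule: Conditional) -> str:
    """Render the conditional as a single Souffle-style rule over object variable O."""
    return (
        f"{rule.effect}(O) :- {rule.subject}(O), "
        f"{rule.attribute}(O, N), N {comparison(rule.bound)}."
    )

rule = Conditional("creature", "toughness", "0 or less", "put_into_owners_graveyard")
print(to_souffle(rule))
# put_into_owners_graveyard(O) :- creature(O), toughness(O, N), N <= 0.
```

The condition clause becomes the body of the rule and the conclusion becomes its head, which is the same split described above.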

Souffle being a logic oriented programming language also makes aspects of encoding the game rather neat, such as expressing some rules as assertions rather than a chain of conditions and function calls.

As a caveat, I tried to keep it faithful to this architecture, although some subagents wrote in straight regex replacements as the “interpreting”. As funky as that choice was, I did some refactoring to backtrack and tidy up the English-to-Souffle pipeline. Ideally, regex would only be used structurally (ie replacing URLs with placeholder strings that get picked up as a single word by a parser) before the text is fed into the more grammatically aware English-to-Souffle pipeline built on spaCy.
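A minimal sketch of that kind of structural regex use: URLs are swapped for single-word placeholders before parsing and restored afterwards. The `URLTOKEN` naming and both helper functions are made up for illustration, and the URL pattern is deliberately simple.

```python
import re

URL = re.compile(r"https?://\S+")

def mask_urls(text):
    """Swap each URL for a single-word placeholder and remember the originals."""
    originals = []

    def stash(match):
        originals.append(match.group(0))
        return f"URLTOKEN{len(originals) - 1}"

    return URL.sub(stash, text), originals

def unmask_urls(text, originals):
    """Put the original URLs back after parsing."""
    return re.sub(r"URLTOKEN(\d+)", lambda m: originals[int(m.group(1))], text)

masked, urls = mask_urls("See https://example.com/rules?id=1 for details.")
print(masked)                     # See URLTOKEN0 for details.
print(unmask_urls(masked, urls))  # See https://example.com/rules?id=1 for details.
```

The regex never decides what a sentence means; it only keeps awkward tokens from confusing the parser.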

Simple API

One of the first things I had wanted when I started looking into a “Stockfish for Magic” was there being something similar to python-chess where you could install as easily as:

$ pip install chess

Then play like so:

# Import the library
import chess

# Create a board (you need one to play a game of chess)
board = chess.Board()

# First move e4
board.push_san("e4")

mtg and a couple of other good names were already taken so I went for python-mtg for simplicity’s sake. Now you can install as simply as:

$ pip install python-mtg

Then play like so:

# Import the library
from mtg import Game, mountain

# Create a game (you need players and cards to play a game of magic)
game = Game.new(
  [mountain] * 40, # Every player has a basic deck
  starting_hand=lambda: [mountain], # Every player has at least a Mountain in their starting hand
)

# Play a land
game.play(mountain)

Bot

For making the bot, I set up a Player class that would make it easier to define and read programmatic Magic players. Rather than approach defining bots as an implementation problem, this lets heuristics be written more like how you would instruct someone to play your deck (ie when a friend borrows your cards to play a game). It also enables the below bot where all it does is play lands, cast spells, swing with everything, and never block.

from mtg import Player, PriorityOption as Do

class BlindAggroPlayer(Player):
  def choose_move(self, game):
    return game.prioritize(
      Do.LANDS,
      Do.SPELLS,
      Do.ATTACKS,
      Do.SKIP
    )

Or, to have more calculated preferences for different decision paths (ie if you were to MCTS):

from mtg import Move, Player, PriorityOption as Do

class HeuristicPlayer(Player):
  def choose_move(self, game) -> Move | None:
    self.bind(game)  # refresh self.creatures, self.opponent, self.life

    return game.prioritize(
      Do.LANDS.prefer(self.land_choice),

      Do.RESOLVE_TRIGGER.prefer(self.resolve_choice, floor=0.0),
      Do.SPELLS.prefer(self.develop_choice, floor=0.0),
      Do.ABILITIES.prefer(self.develop_choice, floor=0.0),

      Do.ATTACKS.prefer(self.attack_choice),
      Do.BLOCKS.prefer(self.block_choice),
      Do.SKIP,
    )

Or, my personal favorite, being able to search for a winning path and take it (verified against Forge with turn-1 kills). While the test of performing a Thassa’s Oracle win is forced with a certain starting hand, solving the problem of finishing a game is often the last problem in minimax search before optimizing or pruning the tree.
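The idea of searching for a winning path can be shown on a toy problem. This is a hedged sketch, not the search used in the project: it assumes a single turn with no opponent responses, a state of `(opponent_life, available_mana)`, and two burn spells with made-up `(name, cost, damage)` tuples.

```python
def find_win(state, moves, depth):
    """Depth-limited search: return a list of move names that wins, or None."""
    life, mana = state
    if life <= 0:
        return []
    if depth == 0:
        return None
    for name, cost, damage in moves:
        if cost <= mana:
            rest = find_win((life - damage, mana - cost), moves, depth - 1)
            if rest is not None:
                return [name] + rest
    return None

MOVES = [("shock", 1, 2), ("bolt", 1, 3)]

print(find_win((6, 2), MOVES, depth=4))  # ['bolt', 'bolt']
print(find_win((7, 2), MOVES, depth=4))  # None
```

When a line exists the search returns it and the bot can stop developing its board; when none exists it returns `None` and the ordinary heuristics take over.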

Vision interaction

It’s easy to assume that wrangling an entire screen with OCR would be difficult, and that’s generally correct. However, we can take advantage of something from web scraping. Take the home page for Hacker News and ask how to get the titles of the top posts.

Extracting the text itself from this screen would be a challenge but if we note that each of the elements of interest match the CSS selector span.titleline then we can go from that string to the strings we’re interested in.
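Here is a small sketch of that selector idea using only the standard library. The `page` string is a made-up snippet that mimics the `span.titleline` structure, not a copy of the real markup, and `TitleLines` is an illustrative name.

```python
from html.parser import HTMLParser

class TitleLines(HTMLParser):
    """Collect the link text inside every <span class="titleline">."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._span_depth = 0   # > 0 while inside a titleline span
        self._in_link = False  # True while inside the title's own <a>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and (self._span_depth or "titleline" in classes):
            self._span_depth += 1
        elif tag == "a" and self._span_depth == 1:
            # Only links directly inside the titleline span count as titles.
            self._in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self._span_depth:
            self._span_depth -= 1
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.titles[-1] += data

page = """
<tr><td><span class="titleline"><a href="https://example.com/a">First post</a>
<span class="sitebit">(<a href="from?site=example.com">example.com</a>)</span></span></td></tr>
<tr><td><span class="titleline"><a href="item?id=2">Second post</a></span></td></tr>
"""

parser = TitleLines()
parser.feed(page)
print(parser.titles)  # ['First post', 'Second post']
```

The parser never needs to understand the page as a whole; it only needs one structural hook that reliably marks the content of interest.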

By going bottom-up we’re able to go from the substance to the content of interest. Arena is a case where there are screens consisting of a lot of text.

The things we’re then more interested in are the general shapes appearing on the screen (ie orange or blue buttons) rather than trying to find the button that corresponds to a “Play” or “Pass” action. As such, automation here can be the effect of going from content to substance.

I use Moondream for this and it improves the problem from impossible to somewhat dependable similar to spaCy. A necessary disclaimer is both of these tools under the hood are stochastic models and therefore should not be treated as formalizations but applicable tools.

Results

After getting a minimal working setup to connect the “blind aggressive” bot defined at the top of the bot section to some simulated interactions, I was able to let it drive my seat autonomously and beat Arena’s Sparky following those minimal heuristics (play lands, cast spells, swing with everything, never block).

Following that, I began working on extending the simulated interactions further for a heuristic bot that can target cards (so blocking can be translated from the engine to Arena) and got it to the point where it drove its own seat autonomously and beat Sparky three times in a row. The cards chosen for that white life-gain deck were intentional so it had an easier time with targeting; call it convenience or laziness but it got the job done.

I then began iterating on an improved heuristic player to drive a deck I had successfully played with against others online (mono-red with Bayo, Irritable Instructor as the commander). On more than one occasion, it would drive the seat through 10+ moves and hit a snag with a new interaction with Arena that was not yet handled (ie warp casting). I’d then take over the seat, only to win using actions already validated with the bot (ie it could cast spells targeting creatures owned by the opponent or myself).

While the cards in my red deck were both competitive and annoying (some opponents may resign out of hindrance rather than surrender), the end of the third game in the video shows the bot skipping spells entirely and going straight into combat since its lookahead identified a path to win the game. This validates that the player framework is able to both develop a board following established heuristics and finish a game when a winning sequence of moves is identified.

Had I modified the player evaluation to always load in the heuristics before making a move, then it could have had a similar developer experience to live coding except for Magic (where I could edit the file and the player’s strategy updates in real time). But, after a couple of cycles of setting up a game, letting the bot take over and drive my seat, and hitting more snags with missing interactions, it felt less like I was validating the tech and more like I was developing cheat software. Since that’s not my interest here, I decided to let my work reach a halting point to tidy up the code and put together this write-up.

The latest inthearena work may be buggy or not even working and I’d think that’s possibly for the best to be that way. Sharing nonetheless for the sake of sharing work.

What I did with my account

Did I feel bad about doing this project with no transparency about being a bot? Yes. I tried setting up a new account solely for doing this but I already had built up a collection of cards in my first account and the onboarding took forever. Having played Magic since middle school, the game has a special place in my heart so I certainly wasn’t the most proud to be ruining a potentially fun experience for other players. At the same time, playing any game online does entail a bit of figuring out how natural or artificial your opponent’s gameplay is.

If you’re reading this, I’ve already submitted a ticket to delete my Wizards account and I have no interest in abusing this software (I also have already spent enough time playing the game, kudos to the folks who made it). By sharing publicly, I hope to reach both folks who’d be interested in the engine work I did and, potentially, folks working at Wizards of the Coast so they have an example of what to block with anticheat work.

My hope is that Wizards of the Coast does not remove the detailed logging setting or delay it by some time window (1), but, instead, sees this as a new API or game in and of itself. Lichess used to have bot tournaments and there are technically games you can code but these usually tend to be simple games or knockoffs from real ones. Universes Beyond can happen because the IP they borrow (ie Marvel, Doctor Who, Lord of the Rings) doesn’t have an involvement with card games (2). As such, I think even a separate tournament could be neat (don’t repeat Legends of Code and Magic where it’s just a research project, put up a small cash prize to attract attention).

Thank you for reading and, as always, hack the planet!

(1) A few-minute delay to the detailed log output would prevent it from being usable for automated gameplay but allow it to still be used by services like Untapped.gg

(2) Otherwise, I’d have really wanted to see a One Piece theme

GitHub

https://github.com/yevbar/witchcraft
