Can AI Beat Stockfish?

/ home / posts / Can AI Beat Stockfish?

December 07, 2025

Context
Planning
Experimenting
Results

Context

Chess, a game that’s not quite the oldest, has been representative of the peak of human intellect for centuries. Whether with being competitively skilled or inspiring advents in AI, it’s difficult for new advancements to not get recognition.

Along the way, folks have developed ensembles of chess engines and even pit the strongest against each other like BattleBots but for plastic medieval figures. Or, for the less competitively inclined, developing chess bots can a way to have fun with different languages.

So, a question immediately arises: is there a best programming language for describing the logic in a chess engine? As in a Whorf hypothesis for this particular task? Someone interested in data science may pick up Python or R. Someone interested in infrastructure may pick up Go or Terraform. Someone interested in an isomorphic or quickly iterable web application may pick up Rails or Next.js. While, technically, you could make an app solely in Assembly or any language that can handle a set of bytes, some paradigms and libraries are more equal than others.

Subsequently, could finding the “right language” for the problem of a chess engine result in something that’s comparable to the many hours of human optimizations that were done for Stockfish? While rewriting is an infamous directional mistake and there are several Medium articles about “handling legacy code”, if there were a tool that allowed for rapid invention of new programs, then perhaps something evolutionary could ensue. And all the better if this tool is able to quickly pivot between “languages” like one would when producing a markdown file with code snippets…

Obviously, LLMs seem like a particular fit for this. Yes, evolving with LLMs and coding is not new per se. Yes, folks have already hyperfocused on making the best chess models. What I’m interested in seeing specifically is if chess engines that are defined symbolically in different programming languages can result in different resulting “engine performance”. Beating Stockfish along the way would certainly be a cherry on top.

Planning

With the below questions:

Can chess engines be better by being written in another programming language?
Can LLMs iteratively improve a chess engine till it’s better than Stockfish?

We can see this done as an evolutionary problem which means we need to follow this structure:

graph TD; gen_n[Generation N]-->population[Randomly assemble population] population-->evaluation[Evaluation and Scoring] evaluation-->taking_best[Taking best scoring for Generation N+1] taking_best-->gen_n

Conveniently, for the step of populating new “generations”, hosted LLMs inherently add non-determinism. However, for evaluation, rather than apply a whole list of quantities (including algorithmic complexity, lint/type checking) that’d be needed if I wanted to truly make better programs, I will be sticking with just totaling -1 for every loss, 0.5 for every draw, and 1 for every win then comparing the score results among “versions” of engines made in a specific language.

For the very first “population” of programs, folders are copied from the base_templates folder.

graph TD; subgraph base [Base templates] end subgraph first_pop [Gen 0] python[Python] ruby[Ruby] other[Other code files...] end base-->first_pop

For every generation, copies are made of each language for the number of established coding agents that will be running against them. Subsequently, respective coding agents are fired off at those newly created directories.

graph LR; subgraph current [Generation N] cur_python[Python] cur_ruby[Ruby] cur_other[Other...] end subgraph for_agents [For agents] python_0["Python #1"] python_1["Python #2"] python_2["Python #3"] python_3["Python #4"] ruby_0["Ruby #1"] ruby_1["Ruby #2"] ruby_2["Ruby #3"] ruby_3["Ruby #4"] other[Other...] end cur_python-->python_0 cur_python-->python_1 cur_python-->python_2 cur_python-->python_3 cur_ruby-->ruby_0 cur_ruby-->ruby_1 cur_ruby-->ruby_2 cur_ruby-->ruby_3 cur_other-->other

Once done making changes, the engines all play against one another and the best scoring engines for each language (after totaling -1 for every loss, 0.5 for every draw, 1 for every win) are then copied into the starting population for the next generation.

graph LR; subgraph for_agents [For agents] python_0["Python #1"] python_1["Python #2"] python_2["Python #3"] python_3["Python #4"] ruby_0["Ruby #1"] ruby_1["Ruby #2"] ruby_2["Ruby #3"] ruby_3["Ruby #4"] _other[Other...] end subgraph next_gen [Generation N+1] python[Python] ruby[Ruby] other[Other] end style python_0 fill:#4caf50,stroke:#0d47a1,color:#fff style ruby_1 fill:#4caf50,stroke:#0d47a1,color:#fff python_0-->python ruby_1-->ruby _other-->other

After making a simple first version which only ran Ruby and Python with one generation: Python won but there were a few things that had to be improved prior to letting it go for more generations or incorporating more languages:

Coding agents do not consistently share security features around access to certain folders
Servers are run directly which can lead to hanging processes

The solution for both of these being some means of containerizing. I didn’t think this was the right project for doing something with Unikernels so I simply wrapped the code generation portion as well as the players into Docker containers. The players are standalone web servers that expect to only play one game and listen at random ports with the “orchestrator” node being effectively a tournament master.

After containerizing, I also encountered an interesting problem when implementing a PHP version: some languages will implement chess engines by wrapping around python-chess and others implement a new chess implementation all together. While this may sound like something that shouldn’t matter, it does when a distinct implementation treats 3-fold repetition as an illegal move rather than a draw.

Due to the above so I don’t overinvest in dealing with different chess engines or having to make new ones, I ultimately capped the scopes of this experiment to:

Languages: PHP, Python, R, Ruby
Coding agent: Claude Code
# of Generations: 5
# of Code Gens: 4
Stockfish players: One with minimal strength, one with some mild strength

Results

After shuffling through some failure tolerance features, running the end to end experiment was unexpectedly smooth. There are some cases where the implementation doesn’t account for all the possible errors (e.g. one chess implementation treating a three-fold repetition as an illegal move rather than a draw, a chess engine simply timing out due to move exploration).

If you’d like to see the game results, you can find them at https://yev.bar/chess-colliseum/.

If you’re just interested in the results for the questions I started with:

Do different languages yield different strengths for a task like a chess engine?
Can AI evolve a program to eventually beat stockfish?

For the first question, here’s what the overall breakdown looks like:

======================================================================
FINAL RANKINGS - CHESS COLLISEUM
======================================================================
Rank   Player                                   Score     W/D/L Equiv
----------------------------------------------------------------------
1      STOCKFISH                                112.0        ~112W/0L
2      ruby #2                                   17.0         ~17W/0L
2      ruby #3                                   17.0         ~17W/0L
4      python #0                                 16.5         ~16W/0L
5      RANDOM                                    14.5         ~14W/0L
6      ruby #0                                   14.0         ~14W/0L
7      r #0                                      13.0         ~13W/0L
8      ruby #1                                   12.5         ~12W/0L
9      python #1                                  8.0          ~8W/0L
10     python #3                                  5.5          ~5W/0L
11     php #0                                     4.0          ~4W/0L
12     r #2                                       3.0          ~3W/0L
13     php #2                                     2.5          ~2W/0L
13     php #3                                     2.5          ~2W/0L
15     php #1                                    -3.5          ~0W/3L
16     python #2                                 -4.5          ~0W/4L
17     r #3                                      -5.0          ~0W/5L
18     r #1                                      -6.0          ~0W/6L
======================================================================

So, as you can see, Ruby came out to be the best relative language for defining a chess engine. When viewing the code that was generated, it looks as though the simplest degree of minmax’ing did the trick and some of the engines never advanced beyond the random move generator (as indicated by RANDOM standing in fifth place).

Before speculating too much on what occurred there, the result of the second bullet point (in terms of getting to the point of beating Stockfish) are as follows:

PHP: 3 wins
Python: 3 wins
Ruby: 2 wins
R: 0 wins

The numbers above are with the caveat that these “wins” were when Stockfish failed to provide a legal move in reponse. So, rather than saying any of these implementations turned out to beat the famous engine, it might be more appropriate to conclude that these engines were able to stump Stockfish at different points.

As with any project, there are numerous places to work on and improve should I continue delving into this project such as the error handling but - if we’re to extrapolate some things from this experiment as it’s turned out:

Sometimes the Bogosort level complexity can yield an advantage
Perhaps adding more granularity to code validation could have given PHP a better chance
The “data science” languages Python or R do not magically stand up against options like Ruby

The code can also be found here -> https://github.com/yevbar/chess-colliseum/

Contents

Context

Planning

Results