Exploring ProgramBench


May 22, 2026

Prelude

Meta recently published a benchmark named ProgramBench with a simple goal: replicate a project without internet access (more details in their blog post). Unlike benchmarks that port code from one language to another, this one asks for a rewrite without access to the original source code, and it comes at a funny time. If you extend the idea that engineers will be replaced in a couple of years, software companies will begin to look more like hedge funds: just as models perform trades instead of people, agents will write code instead of people.

Hypothesis

Rather than tackle all the problems, which may require a combination of techniques, we focused on the ones related to programming languages, such as interpreters and compilers. LLMs are good at coding because code is just text, so shouldn’t a programming language theory (PLT) toolkit help an agent across the finish line?

The original benchmark prohibits using the internet since an agent could cheat and clone the source code from GitHub. Unfortunately, this also prevents the agent from installing packages; nobody’s expected engineers to write software from scratch since the days of Bell Labs! To preserve the original intent of not letting the agent find an easy solution off the web, this experiment sticks to one language and toolkit.

Approach

We built a toolkit in OCaml, which has a long history of being used for problems related to programming languages. The toolkit was given to mini-swe-agent, the same agent Meta used when publishing the original scores, as well as to our own harness, which is intended for greenfield development.

Interestingly, mini-swe-agent performed much better than our harness. A note on that follows the results table.

Results

The table below compares our runs against the scores from the original publication, which used Sonnet 4.6 with mini-swe-agent. veldt is the name of the OCaml PLT toolkit, and the zagent captain is the harness we used to build sterling, our open source OpenAPI-to-SDK generator.

Note: Both veldt and zagent were run with Sonnet 4 (claude-sonnet-4-20250514) rather than Sonnet 4.6.

Task	Sonnet 4.6 (original)	veldt	zagent captain
jqlang/jq	1.0%	55.6%	9.6%
lua/lua	34.7%	46.7%	12.1%
luajit/luajit	71.5%	44.0%	14.8%
tree-sitter/tree-sitter	37.2%	35.9%	26.5%
paradigmxyz/solar	42.9%	33.7%	24.9%
parcel-bundler/lightningcss	49.9%	27.1%	9.6%
tinycc/tinycc	9.3%	4.9%	3.4%
typst/typst	0.0%	8.0%	4.6%
bellard/quickjs	0.0%	0.8%	0.7%
php/php-src	0.0%	0.6%	2.3%
Average	24.5%	25.7%	10.8%
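
As a rough cross-check, the column averages and the per-task comparison can be recomputed from the rows above. This is a small Python sketch and not part of the original experiment; the variable and function names are ours. The means it prints may differ slightly from the Average row, which may have been computed from unrounded scores.

```python
# Scores copied from the results table, in percent:
# (Sonnet 4.6 original, veldt, zagent captain)
scores = {
    "jqlang/jq": (1.0, 55.6, 9.6),
    "lua/lua": (34.7, 46.7, 12.1),
    "luajit/luajit": (71.5, 44.0, 14.8),
    "tree-sitter/tree-sitter": (37.2, 35.9, 26.5),
    "paradigmxyz/solar": (42.9, 33.7, 24.9),
    "parcel-bundler/lightningcss": (49.9, 27.1, 9.6),
    "tinycc/tinycc": (9.3, 4.9, 3.4),
    "typst/typst": (0.0, 8.0, 4.6),
    "bellard/quickjs": (0.0, 0.8, 0.7),
    "php/php-src": (0.0, 0.6, 2.3),
}
columns = ("Sonnet 4.6 (original)", "veldt", "zagent captain")


def column_means(rows):
    """Unweighted mean of each column across all tasks."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows.values()) / n for i in range(len(columns)))


def veldt_wins(rows):
    """Tasks where the veldt run scored higher than the original run."""
    return sorted(task for task, (orig, veldt, _) in rows.items() if veldt > orig)


means = column_means(scores)
for name, mean in zip(columns, means):
    print(f"{name}: {mean:.2f}%")
print("veldt ahead of original on:", ", ".join(veldt_wins(scores)))
```

Splitting the comparison out per task makes it easier to see that the similar averages hide very different per-task profiles.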

Note 2: veldt was developed with knowledge of the sorts of abstractions that would be useful in these tasks, so the jq and lua results are likely overfit: the toolkit has more methods available for parsing. The real validation is that technically different problems, such as typst, also show benefits.
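
To give a sense of what a parsing abstraction can look like, here is a minimal parser-combinator sketch. It is purely illustrative and is not veldt's API, which is written in OCaml and is not shown in this post; every name below (`char`, `take_while1`, `seq`, `many`, `fmap`, `parse_path`) is hypothetical, and the example uses Python only for brevity. It parses a jq-like field path such as `.foo.bar`.

```python
from typing import Any, Callable, List, Optional, Tuple

# A parser maps (text, position) to (value, new_position), or None on failure.
Result = Optional[Tuple[Any, int]]
Parser = Callable[[str, int], Result]


def char(expected: str) -> Parser:
    """Match one exact character."""
    def parse(text: str, pos: int) -> Result:
        if pos < len(text) and text[pos] == expected:
            return expected, pos + 1
        return None
    return parse


def take_while1(pred: Callable[[str], bool]) -> Parser:
    """Match one or more characters satisfying pred."""
    def parse(text: str, pos: int) -> Result:
        end = pos
        while end < len(text) and pred(text[end]):
            end += 1
        return (text[pos:end], end) if end > pos else None
    return parse


def seq(*parsers: Parser) -> Parser:
    """Run parsers in order; fail if any of them fails."""
    def parse(text: str, pos: int) -> Result:
        values: List[Any] = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def many(parser: Parser) -> Parser:
    """Apply parser zero or more times, stopping at the first failure."""
    def parse(text: str, pos: int) -> Result:
        values: List[Any] = []
        while True:
            result = parser(text, pos)
            # Stop on failure, or if the parser made no progress.
            if result is None or result[1] == pos:
                return values, pos
            value, pos = result
            values.append(value)
    return parse


def fmap(parser: Parser, fn: Callable[[Any], Any]) -> Parser:
    """Transform the value of a successful parse."""
    def parse(text: str, pos: int) -> Result:
        result = parser(text, pos)
        return None if result is None else (fn(result[0]), result[1])
    return parse


# Grammar for a jq-like field path: path := ("." identifier)*
identifier = take_while1(lambda c: c.isalnum() or c == "_")
field = fmap(seq(char("."), identifier), lambda parts: parts[1])
path = many(field)


def parse_path(text: str) -> List[str]:
    """Parse a whole path such as '.foo.bar' into its field names."""
    fields, end = path(text, 0)
    if end != len(text):
        raise ValueError(f"unexpected input at position {end}")
    return fields


print(parse_path(".foo.bar"))  # ['foo', 'bar']
```

With building blocks like these already available, an agent only has to write the three grammar lines rather than a hand-rolled scanner, which is the kind of head start a toolkit can give on parsing-heavy tasks.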

zagent’s failures stood out, considering it is intended for greenfield work. My suspicion is that it’s the same problem as internet data containing only final outputs and not the reasoning behind them. Were an engineer to approach these problems from scratch (again, not a new problem), they might start by deconstructing the requirements and the rationales behind design decisions. As an analogy, chimpanzees can memorize numbers, but we do not expect them to abstract that into addition; coding agents can implement software, but we’re not expecting them to explain the business requirements.

Conclusion

We’re not yet at the point where software engineering from first principles is solved. That said, simply giving coding agents access to relevant tools makes them more productive, especially when those tools provide abstractions and services related to the problem being solved. Otherwise, it’s comparable to making a web app in Assembly instead of Python, or sending SMS messages without Twilio, or using Kubernetes in your stack without being able to explain why.

However, this does not mean agents can’t improve your productivity at a small or large scale. If you’re interested in having your own fleets of coding agents working for you, then we’ve got the platform for you. Head over to our docs to learn more and get started!