<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://yev.bar/feed.xml" rel="self" type="application/atom+xml" /><link href="https://yev.bar/" rel="alternate" type="text/html" /><updated>2026-05-06T17:40:21+00:00</updated><id>https://yev.bar/feed.xml</id><title type="html">Yev Barkalov</title><subtitle>Here is the description I am putting in</subtitle><entry><title type="html">Bringing Hermes to WebAssembly</title><link href="https://yev.bar/hermes-wasm" rel="alternate" type="text/html" title="Bringing Hermes to WebAssembly" /><published>2026-05-06T08:00:00+00:00</published><updated>2026-05-06T08:00:00+00:00</updated><id>https://yev.bar/hermes-wasm</id><content type="html" xml:base="https://yev.bar/hermes-wasm"><![CDATA[<h2 id="whats-in-this-post">What’s in this post?</h2>

<p>We took <a href="https://hermes-agent.nousresearch.com/">Hermes Agent</a>, developed by the folks at <a href="https://nousresearch.com/">Nous Research</a>, and brought it to WebAssembly in two different ways. Below, we detail both approaches and share our general takeaways from the experiment.</p>

<h2 id="should-i-replace-my-hermes-with-one-of-these">Should I replace my Hermes with one of these?</h2>

<p>Probably not. If doing things with Python in WebAssembly is of interest to you, then continue reading!</p>

<h2 id="where-did-hermes-come-from">Where did Hermes come from?</h2>

<p>Having taken the world by storm, <a href="https://openclaw.ai">OpenClaw</a> is an AI assistant which uses the computer you run it on <a href="https://www.youtube.com/watch?v=WnzR5aOElvw">similarly to a person</a> (so it can do more than just <a href="https://youtu.be/7xTGNNLPyMI?si=5OR24vJniQglQL6j&amp;t=3121">recite Wikipedia articles</a>). Following this, Hermes was written in Python on top of <a href="https://github.com/SWE-agent/mini-swe-agent">mini-swe-agent</a> (unlike OpenClaw which is written in TypeScript on top of <a href="https://lucumr.pocoo.org/2026/1/31/pi/">pi</a>).</p>

<div class="mermaid">
graph LR
    subgraph outer[" "]
        direction LR
        subgraph inner["More tools + use computer"]
            A["Coding agent"]
        end
        inner --&gt;|"Same as"| B["General-purpose agent"]
    end
    style A fill:#1a1a2e,stroke:#58a6ff,stroke-width:2px,color:#c9d1d9
    style inner fill:#161b22,stroke:#58a6ff,stroke-width:1px,color:#8b949e
    style B fill:#238636,stroke:#2ea043,stroke-width:2px,color:#ffffff
    style outer fill:none,stroke:none
</div>

<p>Both are “general-purpose agents” which are really batteries-included coding agents. The “magic” of their utility comes from assembling the <a href="https://www.mendral.com/blog/agent-harness-belongs-outside-sandbox">“harness”</a> the agent sits inside of when handling users’ prompts, which gives it the capability to do things like click around a browser or send an email.</p>

<h2 id="why-webassembly">Why WebAssembly?</h2>

<p>I’m admittedly <a href="/blog/elixir-webassembly-billion-tokens">biased</a> when it comes to WebAssembly but, having worked on <a href="/blog/git-zig-bun-100x">multi-agent</a> <a href="/blog/zagent">projects</a> before, I was interested in seeing if WebAssembly would give a win with regard to isolation (ie spinning up multiple separate agents in parallel) or granular configurability (ie assembling agents precisely with certain tools for different “modes”).</p>

<p>Additionally, since Hermes is written in Python, I wanted to see if eagerly compiling to WebAssembly would offer closer-to-native performance.</p>

<h2 id="how-to-wasm-hermes">How to WASM Hermes</h2>

<h3 id="pyodide">Pyodide</h3>

<h4 id="run-hermes-in-pyodide-yourself">Run Hermes in pyodide yourself</h4>

<p>This app serves an <code class="language-plaintext highlighter-rouge">index.html</code> with the full agent running client-side in Pyodide. To run the below command, you may need to <a href="https://docs.vers.sh/installation">install the <code class="language-plaintext highlighter-rouge">vers</code> CLI</a> first.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vers run-commit f46a2b21-73fe-4835-ad79-8eccd523fc07 <span class="se">\</span>
  <span class="nt">--format</span> json <span class="nt">--wait</span> <span class="se">\</span>
  | <span class="nb">sed</span> <span class="nt">-n</span> <span class="s1">'s/.*"vm_id"[: ]*"\([^"]*\)".*/https:\/\/\1.vm.vers.sh/p'</span>
</code></pre></div></div>

<p><strong>Public Vers VM Commit:</strong> <code class="language-plaintext highlighter-rouge">f46a2b21-73fe-4835-ad79-8eccd523fc07</code></p>

<h4 id="how-it-works-with-pyodide">How it works with Pyodide</h4>

<div class="mermaid">
graph LR
    subgraph outer[" "]
        direction LR
        subgraph pyodide["Pyodide"]
            A["Hermes agent"]
        end
        subgraph browser["Web browser"]
            W["WebAssembly"]
        end
        pyodide --&gt; W
    end
    style A fill:#1a1a2e,stroke:#58a6ff,stroke-width:2px,color:#c9d1d9
    style pyodide fill:#161b22,stroke:#58a6ff,stroke-width:1px,color:#8b949e
    style W fill:#1a1a2e,stroke:#f0883e,stroke-width:2px,color:#c9d1d9
    style browser fill:#161b22,stroke:#f0883e,stroke-width:1px,color:#8b949e
    style outer fill:none,stroke:none
</div>

<p>The first approach is by using <a href="https://pyodide.org/en/stable/">Pyodide</a>, a Python runtime that’s ported to WebAssembly so Python programs can be interpreted and run in the browser. You can think of this as being similar to the approach that was taken with <a href="https://supabase.com/blog/postgres-wasm">bringing Postgres to WebAssembly</a>:</p>

<div class="mermaid">
graph LR
    subgraph outer[" "]
        direction LR
        subgraph buildroot["Linux VM created with Buildroot"]
            P["Postgres"]
        end
        subgraph browser2["Web browser"]
            W2["WebAssembly"]
        end
        buildroot --&gt; W2
    end
    style P fill:#1a1a2e,stroke:#58a6ff,stroke-width:2px,color:#c9d1d9
    style buildroot fill:#161b22,stroke:#58a6ff,stroke-width:1px,color:#8b949e
    style W2 fill:#1a1a2e,stroke:#f0883e,stroke-width:2px,color:#c9d1d9
    style browser2 fill:#161b22,stroke:#f0883e,stroke-width:1px,color:#8b949e
    style outer fill:none,stroke:none
</div>

<p>Postgres itself isn’t compiled to WebAssembly; instead, a Linux emulator running in WASM boots a modified version of Postgres so the whole stack can actually work together inside a browser.</p>

<h4 id="hermes-in-pyodide-source">Hermes in Pyodide source</h4>

<p>You can view and modify the source code here: <a href="https://github.com/hdresearch/hermes-pyodide">https://github.com/hdresearch/hermes-pyodide</a></p>

<h3 id="pywasm">pywasm</h3>

<h4 id="run-hermes-in-pywasm-yourself">Run Hermes in pywasm yourself</h4>

<p>This app serves the <code class="language-plaintext highlighter-rouge">hermes_agent.wasm</code> binary. Hit the “Run” button in the UI to execute it live.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vers run-commit f83df1ac-0a53-4ca6-bc26-205584fe65a3 <span class="se">\</span>
  <span class="nt">--format</span> json <span class="nt">--wait</span> <span class="se">\</span>
  | <span class="nb">sed</span> <span class="nt">-n</span> <span class="s1">'s/.*"vm_id"[: ]*"\([^"]*\)".*/https:\/\/\1.vm.vers.sh/p'</span>
</code></pre></div></div>

<p><strong>Public Vers VM Commit:</strong> <code class="language-plaintext highlighter-rouge">f83df1ac-0a53-4ca6-bc26-205584fe65a3</code></p>

<h4 id="how-it-works-with-py2wasm">How it works with py2wasm</h4>

<p>This second approach uses <a href="https://wasmer.io/posts/py2wasm-a-python-to-wasm-compiler">py2wasm</a>, a Python-to-WebAssembly compiler. The pywasm split design keeps the security boundary clean:</p>

<div class="mermaid">
graph LR
    A["Hermes"] --&gt; B

    subgraph B["WASM"]
        B1["• Prompt<br />• Loop<br />• Context<br />• Local tools"]
    end

    B --&gt;|"JSON in/out"| C

    subgraph C["Host"]
        C1["• Calling API<br />• Tool dispatch<br />• API keys"]
    end
</div>

<p>The host extracts real schemas from Hermes’ <code class="language-plaintext highlighter-rouge">ToolRegistry</code> at startup before sending them to the WASM binary via an init protocol. The LLM always sees the same parameter names as the actual handlers (ie <code class="language-plaintext highlighter-rouge">path</code> instead of <code class="language-plaintext highlighter-rouge">file_path</code>, or <code class="language-plaintext highlighter-rouge">old_string</code> instead of <code class="language-plaintext highlighter-rouge">old_text</code>).</p>
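<p>As an illustration, here’s a minimal sketch of that init handshake. The registry shape and field names below are hypothetical stand-ins; the actual <code class="language-plaintext highlighter-rouge">ToolRegistry</code> lives in the hermes-pywasm repo.</p>

```python
import json

# Hypothetical stand-in for the real Hermes ToolRegistry; the actual
# class and its handler schemas live in the hermes-pywasm repo.
class ToolRegistry:
    def __init__(self):
        self.handlers = {
            "read_file": {"params": ["path"]},
            "edit_file": {"params": ["path", "old_string", "new_string"]},
        }

def build_init_message(registry):
    """Serialize the real handler schemas so the WASM side (and the
    LLM) sees the exact parameter names the host dispatches on."""
    schemas = [
        {"name": name, "parameters": spec["params"]}
        for name, spec in registry.handlers.items()
    ]
    return json.dumps({"type": "init", "tools": schemas})

msg = json.loads(build_init_message(ToolRegistry()))
print(msg["tools"][0]["parameters"])  # ['path']
```

Because the schemas are extracted from the same objects the host dispatches on, the two sides can’t drift apart.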

<h4 id="hermes-in-pywasm-source">Hermes in pywasm source</h4>

<p>You can view and modify the source code yourself here: <a href="https://github.com/hdresearch/hermes-pywasm">https://github.com/hdresearch/hermes-pywasm</a></p>

<h2 id="benchmarks">Benchmarks</h2>

<p>Below are benchmarks obtained from running on an M4 MacBook. As <a href="#should-i-replace-my-hermes-with-one-of-these">admitted earlier</a>, this probably won’t meaningfully replace running Hermes on your laptop. However, if porting the harness itself to alternative environments (ie a browser) is of interest to you, then you can see some of the tradeoffs between <a href="#pyodide">Pyodide</a> and <a href="#pywasm">py2wasm</a>.</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Native Python</th>
      <th>Pyodide (browser)</th>
      <th>pywasm (WASI)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cold start</td>
      <td>750 ms</td>
      <td>2–5 s</td>
      <td><strong>110 ms</strong></td>
    </tr>
    <tr>
      <td>Single turn</td>
      <td>840 ms</td>
      <td>~850 ms</td>
      <td><strong>100 ms</strong></td>
    </tr>
    <tr>
      <td>20-turn conversation</td>
      <td>3,280 ms</td>
      <td>~3,300 ms</td>
      <td><strong>110 ms</strong></td>
    </tr>
    <tr>
      <td>50 parallel agents</td>
      <td>4,566 ms</td>
      <td>N/A (browser)</td>
      <td><strong>611 ms</strong> (wasmtime)</td>
    </tr>
    <tr>
      <td>Worker pool throughput</td>
      <td>9 q/s</td>
      <td>N/A (browser)</td>
      <td><strong>81 q/s</strong> (wasmtime)</td>
    </tr>
    <tr>
      <td>Deployment size</td>
      <td>733 MB</td>
      <td>~20 MB + packages</td>
      <td><strong>26 MB</strong></td>
    </tr>
    <tr>
      <td>Pip packages</td>
      <td>171</td>
      <td>171 (via Pyodide)</td>
      <td>0</td>
    </tr>
    <tr>
      <td>Runs in browser</td>
      <td>❌</td>
      <td>✅</td>
      <td>⚠️ needs WASI polyfill</td>
    </tr>
    <tr>
      <td>API key exposure</td>
      <td>server-side</td>
      <td>client-side</td>
      <td>Stored in host</td>
    </tr>
  </tbody>
</table>

<h2 id="takeaways">Takeaways</h2>

<p>While the founder of Docker years ago suggested <a href="https://x.com/solomonstre/status/1111004913222324225?lang=en">WASM+WASI was the missing sandboxing solution</a>, it’s evidently not a magic bullet considering the capabilities it lacks compared to a full-fledged computer or container. If having a full but branchable VM with incredibly fast startup times sounds like what you’re looking for, then go on over to <a href="https://vers.sh">Vers</a> and get started!</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[What’s in this post?]]></summary></entry><entry><title type="html">Taking MemPalace to 100%</title><link href="https://yev.bar/retaining" rel="alternate" type="text/html" title="Taking MemPalace to 100%" /><published>2026-05-06T08:00:00+00:00</published><updated>2026-05-06T08:00:00+00:00</updated><id>https://yev.bar/retaining</id><content type="html" xml:base="https://yev.bar/retaining"><![CDATA[<h2 id="overview">Overview</h2>

<p>We took <a href="https://github.com/mempalace/mempalace">MemPalace</a> and extended its techniques to close the gap in the <a href="https://github.com/mempalace/mempalace#benchmarks">LongMemEval</a> <code class="language-plaintext highlighter-rouge">recall@5</code> retrieval benchmark to get a reproducible 100% score using only local compute (no LLM or API calls).</p>

<h2 id="what-this-is-not">What this is not</h2>

<ul>
  <li><strong>Not a LongMemEval leaderboard score.</strong> The full LongMemEval benchmark is end-to-end and involves generating answers plus GPT-4 judging. This experiment is strictly about the same retrieval metric that MemPalace was tackling.</li>
  <li><strong>Not a strong metric.</strong> The metric is <code class="language-plaintext highlighter-rouge">recall_any@5</code>, the softer variant. <code class="language-plaintext highlighter-rouge">recall_all@5</code> (requiring <em>every</em> gold session in the top 5) would be a harder bar.</li>
  <li><strong>Not a novel algorithm.</strong> The patches came from iterating on failures in the dataset and are general NLP patterns. A new benchmark could be assembled that requires different heuristics, but that just continues the cat-and-mouse game of developing “human-comparable intelligence”.</li>
</ul>

<p>These caveats aren’t intended to steer your attention away but to set expectations about what kind of result this is. The central takeaway, that grammatical patterns in text can be applied to vector stores, still deserves some acknowledgement.</p>
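<p>For concreteness, here’s a small sketch of the difference between the two recall variants (function names are mine, not from the benchmark code):</p>

```python
def recall_any_at_k(retrieved, gold, k=5):
    """Softer variant: at least one gold session in the top k."""
    return any(session in gold for session in retrieved[:k])

def recall_all_at_k(retrieved, gold, k=5):
    """Harder variant: every gold session must be in the top k."""
    return all(session in retrieved[:k] for session in gold)

retrieved = ["s3", "s9", "s1", "s7", "s2"]  # ranked retrieval results
gold = {"s1", "s4"}                         # gold sessions for the question

print(recall_any_at_k(retrieved, gold))  # True: s1 is in the top 5
print(recall_all_at_k(retrieved, gold))  # False: s4 is not
```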

<h2 id="what-we-did-do">What we did do</h2>

<p>We achieved 100% <code class="language-plaintext highlighter-rouge">recall@5</code> retrieval on all 500 LongMemEval questions. The system uses no language model, makes no API calls, and requires no GPU. The MemPalace baseline on the same metric is 96.6%, so the +3.4% improvement represents real engineering work. Shared in a project dubbed <a href="https://github.com/hdresearch/retaining">Retaining</a>, it achieves:</p>

<ul>
  <li><strong>500/500 R@5</strong> (100% recall at rank 5)</li>
  <li><strong>500/500 R@10</strong> (100% recall at rank 10)</li>
  <li>Fully deterministic and reproduced across multiple runs</li>
</ul>

<h2 id="context">Context</h2>

<p>On April 6th, <a href="https://x.com/bensig/status/2041229266432733356">Ben Sigman shared</a> that Milla Jovovich had fun with coding agents and built a solution for long-term memory named “MemPalace”. For those who are fans of sci-fi movies, you may recognize Jovovich as the one who played the <a href="https://en.wikipedia.org/wiki/Milla_Jovovich#Breakthrough_(1997%E2%80%932001)">Fifth Element</a> as well as Alice in <a href="https://en.wikipedia.org/wiki/Resident_Evil_(film_series)">Resident Evil</a>. The cherry on top is that, at the <a href="https://en.wikipedia.org/wiki/Resident_Evil:_The_Final_Chapter#Plot">end of the Resident Evil series</a>, Alice is able to tackle the antagonist after her childhood memories are uploaded to her, a message rather similar to enabling agents by giving them a “memory palace”.</p>

<p>Though it originally proclaimed to score <a href="https://github.com/MemPalace/mempalace/commit/068dbd9a7be0af3c37bbbf1ed0e3dc477f850af8">100% with optional Haiku rerank</a> before backtracking, it’s racked up a good volume of attention and validation, so it’s not totally “viral slop”. By both <a href="https://mempalaceofficial.com/#dialect">compressing content</a> and <a href="https://mempalaceofficial.com/concepts/the-palace.html">making historical context navigable</a>, it highlights the efficacy of simple NLP techniques when applied with LLMs.</p>

<h2 id="improving-mempalace">Improving MemPalace</h2>

<h3 id="what-worked">What worked</h3>

<p>If you’ve seen structured note taking like the <a href="https://lsc.cornell.edu/how-to-study/taking-notes/cornell-note-taking-system/">Cornell Note Taking System</a> or <a href="https://obsidian.md/help/plugins/backlinks">backlinks in Obsidian</a>, then you know there’s more to outlining text than just indexing when or where words occur. With <a href="https://spacy.io/">spaCy</a> and <a href="https://spacy.io/usage/linguistic-features#named-entities">named entity recognition</a>, we can extend the existing pipeline by including noun phrases or other grammatical relations that give a more detailed picture of the “ontology” representing the content at hand.</p>

<p>Below is a table of newly added techniques and how much they contributed to the <code class="language-plaintext highlighter-rouge">recall@5</code> performance:</p>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>Measurement</th>
      <th>Net Δ R@5</th>
      <th>Net Qs Fixed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>NER-enriched synthetic documents</td>
      <td>individual</td>
      <td>+1.6%</td>
      <td>+8</td>
    </tr>
    <tr>
      <td>Keyword overlap re-ranking</td>
      <td>individual</td>
      <td>+1.2%</td>
      <td>+6</td>
    </tr>
    <tr>
      <td>Time-based date matching</td>
      <td>individual</td>
      <td>+0.8%</td>
      <td>+4</td>
    </tr>
    <tr>
      <td>Logic engine scores</td>
      <td>individual</td>
      <td>+0.4%</td>
      <td>+2</td>
    </tr>
    <tr>
      <td>Theme detection</td>
      <td>individual</td>
      <td>+0.2%</td>
      <td>+1</td>
    </tr>
    <tr>
      <td>NP embeddings + LogicKB rewrite</td>
      <td>cumulative</td>
      <td>+0.4%</td>
      <td>+2</td>
    </tr>
    <tr>
      <td>Rank preservation injection</td>
      <td>cumulative</td>
      <td>+0.6%</td>
      <td>+3</td>
    </tr>
    <tr>
      <td>Temporal-NP bridge</td>
      <td>cumulative</td>
      <td>+0.2%</td>
      <td>+1</td>
    </tr>
  </tbody>
</table>

<p><em>Individual: technique alone added to the baseline. Cumulative: technique added on top of prior ones. Deltas overlap and do not sum to total.</em></p>

<p>The top three contributors are all simple re-ranking heuristics. The logic engine contributes modestly and actually causes the most regressions. The finding from this experiment: <strong>enrichening NLP extraction in a retrieval pipeline can produce more than improving the logic engine that queries them.</strong> (<a href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf">damn you bitter lesson!</a>)</p>

<h3 id="in-more-detail">In more detail</h3>

<h4 id="1-spacy-based-extraction">1. spaCy-based extraction</h4>

<p>Every session gets processed through spaCy’s <code class="language-plaintext highlighter-rouge">en_core_web_sm</code> pipeline. We extract entities, noun phrases, relations (subject-verb-object triples), time-related markers, and quoted phrases. This takes ~5 seconds per question’s haystack when run on my MacBook.</p>
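<p>As a dependency-free sketch of the kinds of features this pass pulls out (the real pipeline uses spaCy’s parser and NER rather than regexes, so treat this as a crude approximation):</p>

```python
import re

def extract_features(text):
    """Regex approximation of the per-session extraction pass; the
    real pipeline uses spaCy's en_core_web_sm for entities, noun
    phrases, and subject-verb-object triples."""
    return {
        # Capitalized multi-word spans as a rough entity stand-in
        "entities": re.findall(r"\b(?:[A-Z][a-z]+ ?){2,}", text),
        "quoted": re.findall(r'"([^"]+)"', text),
        "time_markers": re.findall(r"\b(?:yesterday|today|\d+ days? ago)\b", text),
    }

feats = extract_features('I met Ada Lovelace "for coffee" 10 days ago')
print(feats["quoted"], feats["time_markers"])
```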

<h4 id="2-pure-python-logic-engine">2. Pure-Python logic engine</h4>

<p>A <code class="language-plaintext highlighter-rouge">LogicKB</code> Python class stores extracted facts as inverted indexes. For each query, it looks up matching objects across all sessions, returning a weighted score for each session. This replaced an earlier Prolog approach with the same idea but much less complexity and no IPC overhead.</p>
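<p>A minimal sketch of the inverted-index idea (the real <code class="language-plaintext highlighter-rouge">LogicKB</code> stores richer fact structures and per-relation weights):</p>

```python
from collections import defaultdict

class LogicKB:
    """Toy inverted index mapping extracted objects to session ids."""
    def __init__(self):
        self.index = defaultdict(set)  # object text -> session ids

    def add_fact(self, session_id, obj):
        self.index[obj.lower()].add(session_id)

    def score(self, query_terms, weight=1.0):
        """Weighted count of matching objects per session."""
        scores = defaultdict(float)
        for term in query_terms:
            for sid in self.index.get(term.lower(), ()):
                scores[sid] += weight
        return dict(scores)

kb = LogicKB()
kb.add_fact("s1", "power bank")
kb.add_fact("s2", "phone")
print(kb.score(["power bank", "phone"]))  # {'s1': 1.0, 's2': 1.0}
```

Because everything is a plain in-process dictionary, scoring a query is a handful of hash lookups with no IPC.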

<h4 id="3-ner-enriched-synthetic-documents">3. NER-enriched synthetic documents</h4>

<p>For each session, we create a document containing its extracted facts and details. These get indexed alongside the raw session text, giving the embedding model a richer retrieval surface. This is the single biggest contributor to accuracy.</p>
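<p>In spirit, the synthetic document is just the extracted facts rendered back into indexable text; the field names below are illustrative, not the ones used in Retaining:</p>

```python
def build_synthetic_doc(session_id, facts):
    """Render a session's extracted facts as a small text document
    to be indexed alongside the raw transcript."""
    lines = [f"session: {session_id}"]
    lines += [f"entity: {e}" for e in facts.get("entities", [])]
    lines += [f"relation: {s} {v} {o}" for s, v, o in facts.get("relations", [])]
    return "\n".join(lines)

doc = build_synthetic_doc("s42", {
    "entities": ["power bank"],
    "relations": [("user", "bought", "power bank")],
})
print(doc)
```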

<h4 id="4-noun-phrase-embedding-bridge">4. Noun-phrase embedding bridge</h4>

<p>We embed each session’s extracted objects into a separate ChromaDB collection and query it with the question’s noun phrases. This bridges gaps that neither keywords nor full-document embeddings can cross. “Battery life phone” → “portable power bank” has a close enough embedding distance in the noun phrase space to pick up the right session.</p>

<h4 id="5-time-related-bridge">5. Time related bridge</h4>

<p>For time-related questions (“What did I buy 10 days ago?”), we first identify all sessions in the date window, then run the noun phrase bridge <em>within that filtered set</em>. This discriminates between 14 sessions that all share the same date by finding the one whose noun phrases are topically closest to the question.</p>
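<p>A pure-Python sketch of the two-stage idea, with noun-phrase overlap standing in for the embedding distance the real pipeline uses:</p>

```python
from datetime import date, timedelta

def temporal_bridge(question_date, days_ago, sessions, question_nps, top_k=1):
    """Stage 1: keep only sessions inside the date window.
    Stage 2: rank that subset by noun-phrase overlap (the real
    pipeline ranks by embedding distance in noun-phrase space)."""
    target = question_date - timedelta(days=days_ago)
    window = [s for s in sessions if s["date"] == target]
    def overlap(session):
        return len(set(session["noun_phrases"]) & set(question_nps))
    return sorted(window, key=overlap, reverse=True)[:top_k]

sessions = [
    {"id": "s1", "date": date(2026, 4, 26), "noun_phrases": ["power bank"]},
    {"id": "s2", "date": date(2026, 4, 26), "noun_phrases": ["gym class"]},
    {"id": "s3", "date": date(2026, 4, 30), "noun_phrases": ["power bank"]},
]
best = temporal_bridge(date(2026, 5, 6), 10, sessions, ["power bank"])
print(best[0]["id"])  # 's1': inside the window and topically closest
```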

<h3 id="what-didnt-work">What didn’t work</h3>

<p>When people complain about LLMs failing to answer questions or hallucinating false information, nobody complains about an LLM’s ability to identify the question it needs to answer (we can depend on AI to write code that does a thing even when we can’t depend on it to handle a task end-to-end). On the subject of answering questions and digging through “long context problems”, I first attempted to have the LLM use <a href="https://en.wikipedia.org/wiki/Prolog">Prolog</a> for storing and retrieving facts.</p>

<p>However, the semantic fuzziness (ie synonyms or finding similar topics to a query) ended up hurting the overall score more than helping. The approach in MemPalace of depending on a <a href="https://www.trychroma.com">vector store</a> actually proved to be “more correct” in this experiment.</p>

<p>Nevertheless, I do think there may be types of problems where realistic input queries (ignoring cases where people are funny and test jailbreaking support agents) would benefit from a more structured and queryable store of relations between objects. Prolog just may not be the low-hanging-fruit solution for long-term memory problems where semantic similarity is worth indexing.</p>

<h2 id="running-yourself">Running yourself</h2>

<p>First, clone the repo and install dependencies.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/hdresearch/retaining
<span class="nb">cd </span>retaining
python3 <span class="nt">-m</span> venv .venv <span class="o">&amp;&amp;</span> <span class="nb">source</span> .venv/bin/activate
pip <span class="nb">install </span>spacy chromadb
python <span class="nt">-m</span> spacy download en_core_web_sm
</code></pre></div></div>

<p>Next, download the dataset for the benchmark.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Download LongMemEval data (~265MB)</span>
curl <span class="nt">-fsSL</span> <span class="nt">-o</span> /tmp/longmemeval_s_cleaned.json <span class="se">\</span>
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
</code></pre></div></div>

<p>Lastly, run the benchmarks.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Vector-only baseline: 96.6% R@5, ~5 min</span>
python bench_v2.py /tmp/longmemeval_s_cleaned.json <span class="nt">--mode</span> vector

<span class="c"># Full hybrid: 100% R@5, ~50 min</span>
python bench_v2.py /tmp/longmemeval_s_cleaned.json <span class="nt">--mode</span> hybrid
</code></pre></div></div>

<p>No API keys. No GPU. Python 3.9+ and ~300MB of disk.</p>

<h2 id="conclusion">Conclusion</h2>

<p>AI famously hit <a href="https://en.wikipedia.org/wiki/AI_winter">“winters”</a> in the past when some wall prevented computers from becoming sufficiently intelligent. Interestingly, the problem in the past was that “symbolic” approaches to AI would fall short when it came to <a href="https://data-mining.philippe-fournier-viger.com/the-semantic-web-and-why-it-failed/">the last mile of complexity</a>. Similarly, LLM-maximalist approaches also run into a “last mile problem” when it comes to ensuring accuracy of details (ie hallucination).</p>

<p>By incorporating older NLP techniques to tackle the “last mile problems” of modern approaches involving LLMs, there are rather interesting results to be found! That said, the implementation used here to game <code class="language-plaintext highlighter-rouge">recall@5</code> is by no means a complete solution for knowledge retrieval.</p>

<p>The beauty of the finding is that the problem of “if only someone had sat down long enough to write every NLP grammar rule” now becomes somewhat negligible in a world with coding agents. So, rather than continue to see human text as black boxes, know that a richer pipeline may capture enough complexity for some information to be adequately indexed.</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Overview]]></summary></entry><entry><title type="html">A coding agent with direction</title><link href="https://yev.bar/zagent" rel="alternate" type="text/html" title="A coding agent with direction" /><published>2026-04-29T08:00:00+00:00</published><updated>2026-04-29T08:00:00+00:00</updated><id>https://yev.bar/zagent</id><content type="html" xml:base="https://yev.bar/zagent"><![CDATA[<h2 id="contents">Contents</h2>

<ul>
  <li><a href="#what-is-this">What is this?</a></li>
  <li><a href="#what-is-this-not">What is this not?</a></li>
  <li><a href="#the-background">The background</a>
    <ul>
      <li><a href="#coding-agent">Coding agent</a></li>
      <li><a href="#ralph-loops">Ralph loops</a></li>
      <li><a href="#rlms">RLMs</a></li>
    </ul>
  </li>
  <li><a href="#zagent-terms">zagent terms</a>
    <ul>
      <li><a href="#code-cannon">Code cannon</a></li>
      <li><a href="#code-pirate">Code pirate</a></li>
      <li><a href="#code-captain">Code captain</a></li>
    </ul>
  </li>
  <li><a href="#takeaways">Takeaways</a></li>
</ul>

<h2 id="what-is-this">What is this?</h2>

<p>This is an overview of the principles I used to assemble <a href="https://github.com/hdresearch/zagent">zagent</a>, a coding harness for getting more progress out of a single “shot”. I’ll be both describing the topics I worked on top of as well as the structure behind what I put together.</p>

<p>If you’re looking for a post to read that gives copy-and-pasteable commands, this isn’t for you. If you’re alright with reading something more explanatory, then continue on!</p>

<h2 id="what-is-this-not">What is this not?</h2>

<p>zagent is not going to replace your Claude Code or <code class="language-plaintext highlighter-rouge">pi</code> (which I predominantly use) but the ideas below should be high-level enough that you can implement them in your own harnesses or coding agent systems.</p>

<h2 id="the-background">The background</h2>

<p>I didn’t by any means invent a new model or algorithm; I simply combined some existing concepts which are accessible to anyone. Being transparent, this is my way of building up towards a general-purpose version of what Google accomplished with <a href="https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/">AlphaEvolve</a>.</p>

<p>To break down what the heck is going on with <code class="language-plaintext highlighter-rouge">zagent</code>, there are three “primitives” in the area of coding agents that would be useful to know.</p>

<h3 id="coding-agent">Coding agent</h3>

<p>From editors like <a href="https://cursor.com">Cursor</a> to headless systems like <a href="https://devin.ai">Devin</a>, there’s a large variety of offerings that all fall under the notion of “coding agents”. Simplifying it to the bare minimum, a coding agent is an AI that can take a prompt from someone and write code to accomplish some goal. However it may be accessed by a user (could be tagging in a Slack workspace, sending a message on Telegram, writing a prompt from a UI, etc), the step beyond general agents is that it can write and run code.</p>

<div class="mermaid">
graph LR;
    Prompt--&gt;Agent
    Agent["Coding agent\n(Can write/run code)"]--&gt;Output
</div>

<p>Sometimes underappreciated in domains other than literally writing software, the power of coding agents is in how much is built on top of code, making them immediately ‘effective’ in the world around us today. It could be a “short lived” agent that only runs to solve a specific problem before exiting or a “long running” agent with a growing memory. Depending on the particular use case you’re looking for, one may be better than another.</p>

<p>In the context of writing software that <em>delivers something</em>, I’ve personally found the philosophy of short lived agents to be better suited.</p>

<h3 id="ralph-loops">Ralph loops</h3>

<p><a href="https://awesomeclaude.ai/ralph-wiggum">Ralph Wiggum loops</a>, named literally after the <a href="https://en.wikipedia.org/wiki/Ralph_Wiggum">Simpsons character</a>, are a technique for working with coding agents that looks something like the below pseudo-code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while not done:
    fire coding agent at task(s)
    repeat until done
</code></pre></div></div>

<p>For instance, the Claude code plugin would run a <code class="language-plaintext highlighter-rouge">while true</code> loop in bash until the LLM outputted a specific string indicating it had actually completed the task rather than merely saying things which sounded nice. In pseudo-code that’d look something like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="s">"DONE"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">last_output</span><span class="p">:</span>
    <span class="n">fire</span> <span class="n">claude</span> <span class="n">code</span> <span class="ow">and</span> <span class="n">tell</span> <span class="n">it</span> <span class="n">to</span> <span class="n">say</span> <span class="s">"DONE"</span> <span class="n">when</span> <span class="n">finished</span>
    <span class="k">continue</span> <span class="n">until</span> <span class="n">done</span>
</code></pre></div></div>

<p>To avoid running out of context (and to stay productive while one sleeps), folks would run “ralph loops” since the <code class="language-plaintext highlighter-rouge">while true</code> serves as a way to reset the context over and over, letting it run ‘infinitely’. In the case of problems where the task is going through a large bullet point list of items (ie meticulously writing unit tests across a large codebase), it works well since the tokens that filled up the context with prior solved items aren’t relevant to solving the problems still ahead.</p>
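<p>To make the pseudo-code above concrete, here’s a runnable toy loop where the agent call is a stub that “finishes” one checklist item per fresh-context run:</p>

```python
# run_agent is a stub standing in for firing a real coding agent;
# each call represents a run with a fresh context that works through
# one item of the checklist.
tasks = ["write test A", "write test B", "write test C"]

def run_agent(remaining):
    if remaining:
        remaining.pop(0)
    return "DONE" if not remaining else "made progress"

last_output, runs = "", 0
while "DONE" not in last_output:
    last_output = run_agent(tasks)
    runs += 1

print(runs)  # 3: one fresh-context run per checklist item
```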

<div class="mermaid">
graph LR;
    Prompt["Send prompt before going to sleep"]--&gt;Ralph
    Ralph["Ralph loop\n(Resetting and repeating over and over till it's done)"]--&gt;Goal["Completed goal"]
    Ralph--&gt;Ralph
</div>

<p>However, in the case of problems where you do lose something by resetting the context (ie a complex integration which requires knowing about all the pieces involved to be useful), then ralph loops can fall short. While still a useful technique, it’s no longer meme’d as a solution for “solving programming” for this reason.</p>

<h3 id="rlms">RLMs</h3>

<p>An idea popularized by <a href="https://alexzhang13.github.io/blog/2025/rlm/">a blog post</a> and later published to <a href="https://arxiv.org/abs/2512.24601">arXiv</a>, RLMs broadly solve the problem of “running out of context”, but in an importantly different way. Rather than place the “infinite loop” above the LLM (as the ralph loop does), what if the loop were conceptually brought into the agent loop itself? In RLMs, this is done by letting the agent recursively call itself or other agents before coming back with a final answer.</p>

<div class="mermaid">
graph LR;
    Prompt--&gt;Agent
    Agent--&gt;Sub["Sub-agent"]
    Sub--&gt;Web["Web request"]
    Sub--&gt;Code["Run code"]
    Sub--&gt;Process["Process results"]
    Sub--&gt;Agent
    Agent--&gt;Result
</div>

<p>To explain how this works with an analogy: suppose you wake up to a text message asking you to research something that you have five minutes to respond to but you haven’t had the chance to even have coffee yet. Lacking the energy to Google around, you text someone else who you think either knows the answer already or wouldn’t mind finding it, they get back with the answer, you forward it to the first person, and then all’s done.</p>

<p>A profound benefit of this is being able to “stretch” your context window, since spawned sub-agents can spend their own context windows exploring something rather than burning through the context of the top-level agent you provided the original prompt to. Nowadays, in conjunction with things like <a href="https://github.com/mempalace/mempalace">memory</a>, there are tools for tackling some of the older problems with arbitrarily large context windows.</p>

<p>Where “infinite context” can fall short is broadly captured by how <a href="https://www.youtube.com/watch?v=G_7Ta_4coy4">“completely illuminating a house such that there are no shadows”</a> makes it uninhabitable. It’s no secret LLMs can be convincing, whether to themselves or to users falling into AI psychosis. As a result, letting an agent ruminate indefinitely on some goal or task (even a rational one like programming) can lead to adverse results that read as unproductive to the person just hoping to finish their app.</p>

<h2 id="zagent-terms">zagent terms</h2>

<p>Inspired by my experience with <a href="https://x.com/training_loop/status/2024600194428424668">herding</a> coding agents, there are three layers I’ve assembled into <code class="language-plaintext highlighter-rouge">zagent</code> that apply the above ideas. Before you ask, yes, the names are inspired by <a href="https://en.wikipedia.org/wiki/One_Piece">One Piece</a>.</p>

<h3 id="code-cannon">Code cannon</h3>

<p>In my prior projects using “code cannons” like <a href="https://vers.sh/blog/git-zig-bun-100x">rewriting git in zig</a> or <a href="https://vers.sh/blog/elixir-webassembly-billion-tokens">developing a modern toolkit between Elixir and WebAssembly</a>, what I was really doing was leveraging <a href="https://vers.sh">Vers</a> VMs as the RLM environments in which sub-agents were working on scoped problems. To differentiate from the ideal of a <a href="https://github.com/gastownhall/gastown">code factory</a>, this RLM pattern is what I’ve referred to as a “code cannon”.</p>

<div class="mermaid">
graph TD;
    Agent--&gt;Sub1["Sub-agent"]
    Agent--&gt;Sub2["Sub-agent"]
    Agent--&gt;Sub3["Sub-agent"]

    subgraph cannon[" "]
        Sub1
        Sub2
        Sub3
        Sub1--&gt;RF1["Read file"]
        Sub1--&gt;WF1["Write file"]
        Sub1--&gt;RP1["Run program"]
        Sub2--&gt;RF2["Read file"]
        Sub2--&gt;WF2["Write file"]
        Sub2--&gt;RP2["Run program"]
        Sub3--&gt;RF3["Read file"]
        Sub3--&gt;WF3["Write file"]
        Sub3--&gt;RP3["Run program"]
    end
</div>
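<p>The fan-out above can be sketched with plain shell job control. Here <code class="language-plaintext highlighter-rouge">spawn_agent</code> is a placeholder for “provision a VM and start a coding agent with this scoped prompt”, and the prompts are made up for illustration.</p>

```shell
# Placeholder for "provision a VM, start a coding agent with this prompt".
spawn_agent() { echo "sub-agent working on: $1"; }

# One scoped prompt per sub-problem, fired in parallel.
prompts=("checkout subcommand" "log subcommand" "diff subcommand")
for p in "${prompts[@]}"; do
  spawn_agent "$p" &
done
wait   # the cannon is "done" once every sub-agent has finished
echo "cannon complete"
```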

<p>In the case of rewriting the <code class="language-plaintext highlighter-rouge">git</code> CLI, there are several subcommands which can be worked on in parallel (and on different files which can prevent conflicts when merging changes). You can think of this like how, at a hackathon, you may have one person working on the backend, one person working on the frontend, and one person working on the slideshow presentation; each of them can work on their piece of the overall project without stepping on each others’ toes.</p>

<h3 id="code-pirate">Code pirate</h3>

<p>Taking a step back and contemplating what I was really doing when “firing code cannons”: I would check the progress and status of changes, break down the next wave of changes I wanted to see, provision new agents with their respective prompts, and let everything run for a while before coming back to my laptop and repeating.</p>

<p>Enter the “code pirate”: a ralph loop that works from a markdown file, firing code cannons until it has made more substantial progress.</p>

<div class="mermaid">
graph LR;
    Pirate["Code pirate"]--&gt;Pirate
    Pirate--&gt;SA1
    Pirate--&gt;SA2
    Pirate--&gt;SA3
    Pirate--&gt;SA4

    subgraph pair1["Code cannon"]
        SA1["Sub-agent"]
        SA2["Sub-agent"]
    end

    subgraph pair2["Code cannon"]
        SA3["Sub-agent"]
        SA4["Sub-agent"]
    end
</div>
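<p>Put together, the pirate is just a ralph loop whose body fires a cannon. A runnable toy, with the cannon stubbed out (real sub-agents would be VMs running a coding agent, and a real loop wouldn’t blank the plan itself):</p>

```shell
# Stub cannon: fire one "sub-agent" per remaining plan item, in parallel.
fire_cannon() {
  while IFS= read -r item; do
    echo "sub-agent tackling: $item" &
  done < plan.md
  wait
}

printf 'wire up auth\nwrite docs\n' > plan.md
loops=0
while [ -s plan.md ]; do        # the pirate loops until the plan is exhausted
  fire_cannon
  : > plan.md                   # stub: pretend the cannon finished everything
  loops=$((loops + 1))
done
echo "plan complete after $loops loop(s)"
```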

<p>By bridging the context-resetting of the ralph loop (the pirate) and the context-mindfulness of RLMs (the cannons), this establishes a coding harness able to land larger diffs, like building out <a href="https://github.com/hdresearch/sterling">sterling</a> (if it’s still private, it’s coming soon!).</p>

<p>When I come back to my computer to review a pirate run’s result, it’s less about untangling knots in feature intentions and more about steering the army of coding agents overall. Making sterling with the code pirate was less about firing it over and over at a goal and more about setting goals, letting it see them through, and then setting new goals to be implemented (like <a href="https://vers.sh/blog/git-zig-bun-100x#why-we-think-this-works">making a peanut butter jelly sandwich</a>).</p>

<h3 id="code-captain">Code captain</h3>

<p>Everything up to this point, I can truthfully say, has yielded real results that would have taken more time or effort with a different tool. This next “layer” is something I’ve been tinkering with and haven’t yet felt like I’ve “cracked”. However, I’m sharing it here in case the concepts are of use to someone else facing similar problems.</p>

<p>When tackling projects where “working” is non-negotiable (ie meeting a test coverage quota, an ambiguity that would lead some agents to give up early), totally depending on the LLM to come back with a result can be anticlimactic.</p>

<p>To remedy this while tinkering with <a href="https://lean-lang.org/">Lean</a>, I’ve started working on a “code captain” which behaves like a code pirate but, rather than let the agent exit when it’s gone astray, I added a gate which prevents the pirate from exiting until <em>all</em> conditions are met.</p>

<div class="mermaid">
graph LR;
    subgraph pirate["Repeat until complete"]
        Pirate["Code captain"]--&gt;Pirate
    end
    pirate--&gt;SA1
    pirate--&gt;SA2
    pirate--&gt;SA3
    pirate--&gt;SA4

    subgraph pair1["Code cannon"]
        SA1["Sub-agent"]
        SA2["Sub-agent"]
    end

    subgraph pair2["Code cannon"]
        SA3["Sub-agent"]
        SA4["Sub-agent"]
    end
</div>
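<p>The gate is the only new piece relative to the pirate. A runnable toy where the exit condition is a conjunction of checks, stubbed so it passes after a couple of iterations (a real gate would run your actual test suite, coverage tool, proof checker, and so on):</p>

```shell
iterations=0
tests_pass()  { [ "$iterations" -ge 2 ]; }   # stub: your real test suite here
build_ok()    { true; }                      # stub: your real build here
run_pirate()  { iterations=$((iterations + 1)); echo "pirate run $iterations"; }

# The captain may not exit until *all* conditions hold.
until tests_pass && build_ok; do
  run_pirate
done
echo "gate passed after $iterations runs"
```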

<p>If the gate’s not well defined, then the agent can find a way to exit early. If the gate’s redefinable (ie learning about new objectives or constraints over time) or even appendable, then the agent may still find a way to exit early. So, ultimately, software engineering’s a game of scoping objectives well.</p>

<h2 id="takeaways">Takeaways</h2>

<p>Training employees versus hiring interns is the difference between vertical and horizontal scaling for people. Likewise, leveling up a single person versus spinning up agents to fill in certain tasks is vertical versus horizontal scaling for responsibilities. The underlying problem with coding harnesses is boiling down the responsibilities of a software engineer into horizontally scalable skills.</p>

<p>It’s already the case in some hedge funds that folks develop models for executing strategies but aren’t picking up the phone and placing orders themselves. While some firms still rely on old-fashioned methods, the analog in software is that there will eventually be categories of products whose code isn’t governed by people but by the systems those people established.</p>

<p>Until the day coding’s finally solved, we shall still have problems to solve. Hack the planet!</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Contents]]></summary></entry><entry><title type="html">Ziggit</title><link href="https://yev.bar/ziggit" rel="alternate" type="text/html" title="Ziggit" /><published>2026-04-02T08:00:00+00:00</published><updated>2026-04-02T08:00:00+00:00</updated><id>https://yev.bar/ziggit</id><content type="html" xml:base="https://yev.bar/ziggit"><![CDATA[<h2 id="digest">Digest</h2>

<p>We rewrote git in zig and:</p>

<ul>
  <li>Sped up bun by <a href="#bun-improvements">100x</a></li>
  <li>Got <a href="#git-drop-in">4x</a> faster than <code class="language-plaintext highlighter-rouge">git</code> on an arm Macbook</li>
  <li>Compiled to WASM to be <a href="#webassembly">5x smaller with 8.5x more exports</a>
    <ul>
      <li>Check out <a href="https://vers.sh/ziggit-demo">this demo to clone a repo</a> in your browser!</li>
    </ul>
  </li>
</ul>

<p>Rather than start with the theory behind the “swarming”, we’ll share how to code cannon yourself, describe how our zig rewrite of git went, and then dive into some of our theory behind why this works.</p>

<h2 id="how-to-code-cannon-yourself">How to code cannon yourself</h2>

<h3 id="install-vers-cli">Install vers CLI</h3>

<p>First you’ll need the <a href="https://github.com/hdresearch/vers-cli">vers CLI</a> installed.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl <span class="nt">-fsSL</span> https://raw.githubusercontent.com/hdresearch/vers-cli/main/install.sh | sh
</code></pre></div></div>

<p>After you’ve installed it, log in.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vers login
</code></pre></div></div>

<p>Now you have a working <code class="language-plaintext highlighter-rouge">vers</code> CLI ready to prepare your swarm infrastructure.</p>

<h3 id="configure-environment-variables">Configure environment variables</h3>

<p>With the <code class="language-plaintext highlighter-rouge">vers</code> CLI you can define environment variables which get injected into all the VMs you create, making authentication for some CLIs a breeze. Here we’ll walk through the environment variables we included for this project.</p>

<p>First, create a <a href="https://github.com/new">new GitHub repository</a> and then follow <a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-fine-grained-personal-access-token">the instructions for creating a fine-grained personal access token</a>. You’ll want to create one that has <strong>Read and write</strong> access to content for the repository you’re going to work on.</p>

<p><img src="https://vers.sh/hdr_legacy/images/github-token.png" alt="github-token" /></p>

<p>We configured it to have access to <em>only</em> the repos we’re interested in firing the code cannon at for this project. Our rationale: we don’t want one or more agents to get creative and start integrating other projects that aren’t relevant. Once you have that API key, set it like so:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vers <span class="nb">env set </span>GITHUB_API_KEY github_pat_...
</code></pre></div></div>

<p>Next, from the <a href="https://vers.sh/orgs/yev/dashboard">vers dashboard</a>, click on the <strong>API Keys</strong> tab and create a new API key. After you’ve written it down someplace you won’t lose it, add it to your environment variables (so an agent running in a VM can spawn further agents on its own).</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vers <span class="nb">env set </span>VERS_API_KEY abc123...
</code></pre></div></div>

<p>Finally, since we’ve been driving this using <a href="https://claude.ai">Claude</a>, let’s set an <code class="language-plaintext highlighter-rouge">ANTHROPIC_API_KEY</code> so any coding agent running in a VM works out of the box.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vers <span class="nb">env set </span>ANTHROPIC_API_KEY sk-ant-...
</code></pre></div></div>

<h3 id="write-your-initial-plan">Write your initial plan</h3>

<p>We’ve shared the <a href="#the-initial-plan"><code class="language-plaintext highlighter-rouge">plan.md</code> file</a> we used for the zig rewrite of git; you’re welcome to copy it and tweak it for your project. Once you have it written, simply point your coding agent at it.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pi <span class="s2">"Read plan.md and let me know when I can quit this session"</span>
</code></pre></div></div>

<p>The prompt specifies that contributing agents should be spun up in VMs so you can close your laptop and know things are still progressing.</p>

<h3 id="let-it-start-running">Let it start running</h3>

<p>Eventually <code class="language-plaintext highlighter-rouge">pi</code> or your coding agent will tell you agents are working and you’re good to quit the session. Congrats, you’ve successfully created a code cannon to work on some problem.</p>

<h3 id="check-where-its-at">Check where it’s at</h3>

<p>Regardless of the size of the project, since there may be small features or nits you’d like to include anyway, it’s good to check in after the agents begin working to verify what they’re working on. If you find yourself glossing over the agent descriptions and crossing your fingers rather than walking away confident in the progress, you’ve likely leaned on the agents too much for your goal.</p>

<h3 id="repeat-running-and-checking-in">Repeat running and checking in</h3>

<p>We found it useful to check in on the swarm similarly to checking in with a team at standup, though on an admittedly more frequent basis. Rather than provide a prompt to scale the swarm up or down after certain checkpoints, staying hands-on with steering also gave us a clearer understanding of the scope of this project.</p>

<h2 id="how-we-rewrote-git-in-zig">How we rewrote git in zig</h2>

<p>Anthropic <a href="https://github.com/anthropics/claudes-c-compiler/issues/1">took a stab</a> at writing a C compiler and Cursor <a href="https://github.com/wilsonzlin/fastrender/issues/98">took a stab</a> at writing a web browser. It’s not that hard for you to do the same, and here’s how we went about rewriting a big open source project with the help of agents!</p>

<h3 id="environment">Environment</h3>

<p>The <a href="https://vers.sh/">Vers VMs</a> spawned had the environment variables injected at startup (so running <code class="language-plaintext highlighter-rouge">pi</code> with instructions will always work).</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">ANTHROPIC_API_KEY</code>: For the LLM powering the coding agent</li>
  <li><code class="language-plaintext highlighter-rouge">VERS_API_KEY</code>: For further orchestration</li>
  <li><code class="language-plaintext highlighter-rouge">GITHUB_API_KEY</code>: A strictly scoped API key for just one repository</li>
</ul>

<h3 id="the-initial-plan">The initial plan</h3>

<p>Below is literally the one markdown file the original agent used to spin up a swarm.</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The goal is to make a modern version control software like git or jj but written in zig

ALL SYSTEMS AND AGENTS MUST use this github -&gt; https://github.com/hdresearch/ziggit.git

For each of the below goals, create a VM and run code like the following

<span class="p">```</span><span class="nl">bash
</span><span class="k">while </span><span class="nb">true </span><span class="k">do
  </span>pi <span class="nt">-run</span> <span class="s2">"GOAL"</span>
end
<span class="p">```</span>

NOTE - pi is running on the VM itself rather than running on the host machine and then ssh'ing commands. This should be done so we can quit this pi session

So agents are just infinitely running since there is always something to improve in a piece of software. Include pi-vers extension so each infinite loop can provision further VMs or agents.
<span class="p">
-</span> first person like jj but does not have a <span class="sb">`jj git`</span> subcommand and instead is drop in replaceable with <span class="sb">`git`</span> so <span class="sb">`ziggit checkout`</span> not <span class="sb">`ziggit git checkout`</span>
<span class="p">-</span> feature compatibility with git (copy over test suite from git source)
<span class="p">-</span> can compile to webassembly
<span class="p">-</span> can yield performance improvements to oven-sh/bun codebase by using directly with zig integration instead of libgit2 or git cli

Maybe wait for some progress before starting on replacing bun's usage of the git cli (which they use over libgit2 for performance reasons, our suspicion is that a modern solution in zig could be better). Every VM should have the env vars <span class="sb">`VERS_API_KEY`</span>, <span class="sb">`ANTHROPIC_API_KEY`</span>, <span class="sb">`GITHUB_API_KEY`</span>. Also use the hdresearch/bun fork with changes so a real PR can be created pointing at oven-sh/bun BUT DO NOT MAKE THIS PR YOURSELF. Provide instructions for a person to validate the benchmark results with ziggit usage first
</code></pre></div></div>

<p>We copied over the <code class="language-plaintext highlighter-rouge">plan.md</code> used for <a href="https://vers.sh/blog/elixir-webassembly-billion-tokens">firebird</a>; the <code class="language-plaintext highlighter-rouge">-run</code> argument is not a real argument (the correct one is <code class="language-plaintext highlighter-rouge">-p</code>), but the top-level agent figures it out anyway.</p>

<h3 id="the-produced-agent-loop">The produced agent loop</h3>

<p>From the markdown plan, our local <code class="language-plaintext highlighter-rouge">pi</code> agent created a golden image for the VMs working on the <code class="language-plaintext highlighter-rouge">ziggit</code> codebase to use and configured each agent to have different git commit authors so progress would be identifiable.</p>

<p>Every agent additionally got a <code class="language-plaintext highlighter-rouge">/root/prompt.txt</code> file with that agent’s specific prompt. The agent tasked with covering git’s test suite would have that file populated with contents like <code class="language-plaintext highlighter-rouge">"You are the CORE agent. Run git's test suite and fix CLI bugs."</code> and the agent tasked with improving certain git index functionality would have that file with contents like <code class="language-plaintext highlighter-rouge">"You are the NET-SMART agent. Rewrite idx_writer.zig to be 10x faster."</code>.</p>

<p>Finally, every VM runs the exact same bash loop wrapping the coding agent itself as well as the git cleanups referenced earlier. The script below, which defines a given agent, was generated by the top-level pi agent orchestrating these coding processes in VMs.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nb">set</span> <span class="nt">-a</span><span class="p">;</span> <span class="nb">source</span> /etc/environment 2&gt;/dev/null<span class="p">;</span> <span class="nb">set</span> +a
<span class="nb">export </span><span class="nv">HOME</span><span class="o">=</span>/root
<span class="nb">export </span><span class="nv">NODE_OPTIONS</span><span class="o">=</span><span class="s2">"--max-old-space-size=256"</span>

<span class="nb">cd</span> /root/myproject <span class="o">||</span> <span class="nb">exit </span>1

<span class="k">while </span><span class="nb">true</span><span class="p">;</span> <span class="k">do
    </span><span class="nb">echo</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">date</span><span class="si">)</span><span class="s2">: === Starting agent run ==="</span>

    <span class="c"># 1. SYNC — save dirty work, pull latest from other agents</span>
    git add <span class="nt">-A</span>
    git diff <span class="nt">--cached</span> <span class="nt">--quiet</span> <span class="o">||</span> git commit <span class="nt">-m</span> <span class="s2">"auto-save before sync"</span>
    git fetch origin master
    git rebase origin/master <span class="o">||</span> <span class="o">{</span>
        git rebase <span class="nt">--abort</span>
        git reset <span class="nt">--hard</span> origin/master  <span class="c"># nuclear option on conflicts</span>
    <span class="o">}</span>

    <span class="c"># 2. BUILD — rebuild the project</span>
    zig build  <span class="c"># or whatever your build command is</span>

    <span class="c"># 3. RUN PI — the actual agent work</span>
    pi <span class="nt">--no-session</span> <span class="nt">-p</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">cat</span> /root/prompt.txt<span class="si">)</span><span class="s2">"</span>

    <span class="c"># 4. PUSH — commit and push whatever pi did</span>
    git add <span class="nt">-A</span>
    git diff <span class="nt">--cached</span> <span class="nt">--quiet</span> <span class="o">||</span> git commit <span class="nt">-m</span> <span class="s2">"auto-save after pi run"</span>
    <span class="k">for </span>attempt <span class="k">in </span>1 2 3<span class="p">;</span> <span class="k">do
        </span>git pull <span class="nt">--rebase</span> origin master <span class="o">||</span> <span class="o">{</span>
            git rebase <span class="nt">--abort</span>
            git reset <span class="nt">--hard</span> origin/master
        <span class="o">}</span>
        git push origin master <span class="o">&amp;&amp;</span> <span class="nb">break
        sleep </span>5
    <span class="k">done

    </span><span class="nb">sleep </span>10
<span class="k">done</span>
</code></pre></div></div>

<p>Each iteration saves work from the prior run, pulls in the latest changes, rebuilds the project, runs the pi agent, and then repeats the same git operations at the end, this time also pushing. The agent prompts themselves also say to use git operations for auditability, but these git failsafes around the agent help ensure the loop doesn’t get stuck along the way.</p>

<h3 id="meta-note">Meta note</h3>

<p>To reiterate a point at the end of <a href="#agent-spawned-agents-is-like-being-a-manager-of-managers">another section</a>, the sub-agents aren’t doing anything different from what you’d get by manually starting new agents with their respective prompts yourself. These shouldn’t be doing anything you can’t directly understand: whenever we started with an initial research goal (ie understanding a point of integration before beginning new <a href="#the-produced-agent-loop">loops</a>) and let the agent in front of us handle the rest, we ended up with a mess to clean up.</p>

<p>Similar to how LLMs can be poor at writing configuration files, we’d guess complex integrations fall into a similar category of “problems LLMs do a lot better with a human around”. Should you be working on one of these tasks, make sure every detail relevant to the prompt or plan you intend an agent to carry out is in the context you hit <code class="language-plaintext highlighter-rouge">Enter</code> on.</p>

<h3 id="what-it-cost">What it cost</h3>

<p>At the end of this crunch, which spanned nearly a week and ~13 billion tokens, we successfully created a rewrite of git in zig. If you were to do that as a human, say writing a new <code class="language-plaintext highlighter-rouge">git</code> of your own while working towards 100% test coverage, you’d be in for a world of pain.</p>

<p><img src="https://vers.sh/hdr_legacy/images/git-tokens.png" style="width: 100%" /></p>

<p>The git CLI test suite consists of 21,329 individual assertions for various git subcommands (that way we can be certain <code class="language-plaintext highlighter-rouge">ziggit</code> does suffice as a drop-in replacement for <code class="language-plaintext highlighter-rouge">git</code>). If it took a person four minutes to write enough functionality to pass each test (overlooking that some tests are more complex than others), that’d amount to 85,316 minutes total, or about two months! And that’s without sleep or meals factored in.</p>
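<p>The back-of-envelope math, for the skeptical:</p>

```shell
minutes=$((21329 * 4))        # 21,329 assertions at ~4 minutes each
days=$((minutes / 60 / 24))   # no sleeping or eating, as noted
echo "$minutes minutes, about $days days, i.e. roughly two months"
```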

<p>While we only got through <a href="#the-final-results">part of the overall test suite</a>, that’s still the equivalent of a month’s worth of straight developer work (again, without sleep or eating factored in).</p>

<h3 id="the-final-results">The final results</h3>

<h4 id="bun-improvements">bun improvements</h4>

<table>
  <thead>
    <tr>
      <th>Operation</th>
      <th>macOS arm64 (M4)</th>
      <th>x86_64 Linux VM</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">findCommit</code></td>
      <td><strong>85.4x</strong> win</td>
      <td><strong>6.3x</strong> win</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cloneBare</code></td>
      <td><strong>7.3x</strong> win</td>
      <td><strong>34.3x</strong> win</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cloneBare</code> + <code class="language-plaintext highlighter-rouge">findCommit</code> + <code class="language-plaintext highlighter-rouge">checkout</code></td>
      <td><strong>~10x</strong> win</td>
      <td><strong>~30x</strong> win</td>
    </tr>
  </tbody>
</table>

<p>The <code class="language-plaintext highlighter-rouge">bun</code> team has already <a href="https://github.com/oven-sh/bun/blob/3ed4186bc8db8357c670307f192991bfc263f141/docs/runtime/templating/create.mdx?plain=1#L267">tested using git’s C library</a> and found it to be consistently slower hence resorting to literally executing the <code class="language-plaintext highlighter-rouge">git</code> CLI when performing <code class="language-plaintext highlighter-rouge">bun install</code>. With <code class="language-plaintext highlighter-rouge">ziggit</code>, it becomes possible to see upward of <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#macos-arm64-releasefast-20-iterations"><strong>100x speedups</strong></a> for some git operations.</p>

<p>Tested on an M4 MacBook with 24gb of RAM across multiple runs, it scored an average of <strong>85.4x</strong> speedup for <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#findcommit-rev-parse-head"><code class="language-plaintext highlighter-rouge">findCommit</code></a>, <strong>7.3x</strong> speedup for <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#clonebare-local-bare-clone"><code class="language-plaintext highlighter-rouge">cloneBare</code></a>, and a <strong>~10x</strong> speedup for the <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#full-workflow-clonebare--findcommit--checkout">entire workflow</a> comprising those git operations. In an x86_64 Linux VM with 8gb of RAM, it scored an average of <strong>6.3x</strong> speedup for <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#findcommit"><code class="language-plaintext highlighter-rouge">findCommit</code></a>, <strong>34.3x</strong> speedup for <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#clonebare"><code class="language-plaintext highlighter-rouge">cloneBare</code></a>, and a <strong>~30x</strong> speedup for the <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#full-workflow">full workflow</a>.</p>

<p>When evaluating the complete <code class="language-plaintext highlighter-rouge">bun install</code> improvements, it came out speed-wise to about the same as the existing <code class="language-plaintext highlighter-rouge">git</code> usage (networking is the big time bottleneck, though many cases were still slightly faster with <code class="language-plaintext highlighter-rouge">ziggit</code> across multiple benchmarks). <em>Except</em> it’s done in 100% zig, <em>and</em> those internal improvements pile up as projects <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#why-e2e-shows-modest-speedups-despite-10-85-library-speedups">accumulate more git dependencies</a>. All in all, it seems like a sensible upstream contribution.</p>

<h4 id="git-drop-in">git drop-in</h4>

<table>
  <thead>
    <tr>
      <th>Benchmark</th>
      <th>ziggit vs git</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>arm64 Mac (small repos)</td>
      <td><strong>&gt;4x</strong> win</td>
    </tr>
    <tr>
      <td>arm64 Mac (large repos)</td>
      <td><strong>&gt;4x</strong> win</td>
    </tr>
    <tr>
      <td>Best commands</td>
      <td>up to <strong>10x</strong> win</td>
    </tr>
  </tbody>
</table>

<p>In addition to covering enough functionality to replace bun’s usage of the <code class="language-plaintext highlighter-rouge">git</code> CLI, <code class="language-plaintext highlighter-rouge">ziggit</code> covers enough subcommands and arguments to be a viable drop-in replacement for git with numerous performance improvements. While there are codepaths where the two are at performance parity (<strong>1x</strong>), it’s remarkable that a fresh rewrite in a modern programming language was able to reach that level <em>and</em> even hit a <strong>10x</strong> speedup on <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#macos-arm64--large-repo-ziggit-itself-2367-commits-150-files">some commands</a>!</p>

<p>While <code class="language-plaintext highlighter-rouge">git</code> itself has had much more development and optimization for x86_64 Linux, <code class="language-plaintext highlighter-rouge">ziggit</code>’s performance really outshines <code class="language-plaintext highlighter-rouge">git</code> on an arm64 MacBook. On ours, it’s across the board more than <strong>4x</strong> faster than <code class="language-plaintext highlighter-rouge">git</code> in both <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#macos-arm64--small-repo-51-commits-100-files">smaller repositories</a> as well as <a href="https://github.com/hdresearch/ziggit/blob/5d3deb361f03d4aefef29426cf333782fc05d7cf/BENCHMARKS.md#macos-arm64--large-repo-ziggit-itself-2367-commits-150-files">larger ones</a>.</p>

<p>Of course, <code class="language-plaintext highlighter-rouge">ziggit</code> comes with <strong>git-lfs</strong> support as well, plus a useful <a href="#succinct-mode">succinct mode</a> meant for agents working in new or existing git projects to save significantly on tokens!</p>

<h4 id="webassembly">WebAssembly</h4>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>ziggit</th>
      <th>wasm-git</th>
      <th>Result</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Binary size</td>
      <td>148kb (55kb compressed)</td>
      <td>806kb</td>
      <td><strong>5.4x</strong> win</td>
    </tr>
    <tr>
      <td>Named exports</td>
      <td>68</td>
      <td>8</td>
      <td><strong>8.5x</strong> win</td>
    </tr>
  </tbody>
</table>

<p>Currently, there’s a <a href="https://github.com/petersalomonsen/wasm-git">wasm-git</a> project which compiles <a href="https://github.com/petersalomonsen/wasm-git?tab=readme-ov-file#compatibility">git’s C library</a> directly to WASM and comes out to 806kb. <code class="language-plaintext highlighter-rouge">ziggit</code>, when compiled to WASM, produces a binary that’s only 148kb. That’s <strong>5.4x</strong> smaller on its own, and it shrinks to just 55kb when compressed, making it more portable and accessible.</p>

<p>Additionally, <code class="language-plaintext highlighter-rouge">ziggit</code>’s WebAssembly binary provides 68 distinct named exports (<code class="language-plaintext highlighter-rouge">ziggit_init</code>, <code class="language-plaintext highlighter-rouge">ziggit_clone_bare</code>, <code class="language-plaintext highlighter-rouge">ziggit_diff</code>, <code class="language-plaintext highlighter-rouge">ziggit_log</code>, etc) in contrast to <code class="language-plaintext highlighter-rouge">wasm-git</code>’s 8 obfuscated exports (X, Y, Z, _, $, aa, ba, ca), which are Emscripten-compiled C bindings. Nonetheless, talk’s cheap, so you can go ahead and clone an open source repository <a href="https://vers.sh/ziggit-demo">in our web demo</a>.</p>

<h4 id="succinct-mode">Succinct mode</h4>

<p>Inspired by <a href="https://github.com/rtk-ai/rtk">rtk</a>, a CLI proxy which reduces LLM token consumption by <strong>60-90%</strong>, <code class="language-plaintext highlighter-rouge">ziggit</code> also includes a “succinct mode” that’s enabled by default and dramatically slims down outputs. For example, the below:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git commit <span class="nt">-m</span> <span class="s2">"chore: add another file"</span>
<span class="o">[</span>master b6eeb42] chore: add another file
1 file changed, 1 insertion<span class="o">(</span>+<span class="o">)</span>
</code></pre></div></div>

<p>Becomes the below:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ziggit commit <span class="nt">-m</span> <span class="s2">"chore: add another file"</span>
ok master 640fe38 <span class="s2">"chore: add another file"</span>
</code></pre></div></div>

<p>Or compare the below difference between <code class="language-plaintext highlighter-rouge">git status</code> and <code class="language-plaintext highlighter-rouge">ziggit status</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--- normal ---                              --- succinct ---
On branch master                            * master
                                             + Staged: 1 files
Changes to be committed:                      staged.txt
  (use "git restore --staged ..." ...)       ~ Modified: 1 files
        new file:   staged.txt                 README.md
Changes not staged for commit:
  (use "git add ..." ...)
  (use "git restore ..." ...)
        modified:   README.md
</code></pre></div></div>

<p>Succinct mode is turned on by default and can be toggled off by passing <code class="language-plaintext highlighter-rouge">--no-succinct</code> like so.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ziggit <span class="nt">--no-succinct</span> status
</code></pre></div></div>

<p>Or by setting the <code class="language-plaintext highlighter-rouge">GIT_SUCCINCT</code> environment variable to <code class="language-plaintext highlighter-rouge">0</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">GIT_SUCCINCT</span><span class="o">=</span>0 ziggit status
</code></pre></div></div>
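<p>Since the succinct commit output appears to be a single line of the form <code class="language-plaintext highlighter-rouge">ok &lt;branch&gt; &lt;hash&gt; "&lt;message&gt;"</code> (inferred from the example above, not a documented contract), a wrapper script can pick it apart with a plain <code class="language-plaintext highlighter-rouge">read</code>:</p>

```bash
# Parse a succinct commit line of the form: ok <branch> <hash> "<message>"
# (format inferred from the example output above; not a documented contract).
line='ok master 640fe38 "chore: add another file"'
read -r status branch hash message <<< "$line"
echo "status=$status branch=$branch hash=$hash message=$message"
```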

<h2 id="theory">Theory</h2>

<p>Now, why does any of this work? Here’s our guess, having done a similar thing before when making a modern toolkit to <a href="https://vers.sh/blog/elixir-webassembly-billion-tokens">bridge Elixir and WebAssembly</a>.</p>

<h3 id="agent-spawned-agents-is-like-being-a-manager-of-managers">Having agents spawn agents is like being a manager of managers</h3>

<p>Having direct reports who work with you is vastly different from working with reports who themselves have reports.</p>

<p><img src="https://vers.sh/hdr_legacy/images/manager_vs_manager_of_managers_clean.svg" alt="Two org charts of direct reports vs manager of managers" /></p>

<p>Normally, when you’re working in an organization of people, you need to be mindful of the balance and delegation of tasks; this depends on everyone’s experience as well as their APM (actions per minute). When you work with coding agents, you could sit and create a coding agent for every individual task <em>or</em> you could have an agent (which itself has a high APM) be the one doing the orchestration:</p>

<p><img src="https://vers.sh/hdr_legacy/images/agentic_coding_orchestration.svg" alt="Org chart of human prompting an agent to spawn coding agents" /></p>

<p>But, really, this wasn’t a “hands off the wheel” project where we hit <code class="language-plaintext highlighter-rouge">Enter</code> once and left the laptop (although we did get sleep in the process). Instead, it was more like doing exactly what we would have done with a row of laptops on a table, typing on each one ourselves, except there’s an agent to do the menial part of setting up subsequent coding agents:</p>

<p><img src="https://vers.sh/hdr_legacy/images/augmented_human_to_coding_agents_v2.svg" alt="Chart of human being augmented to aid with orchestrating agents" /></p>

<p>For the early part of the work, we prompted the top-level agent to create certain agents for the initial scaffold (in this case: core git functionality, as well as identifying where to place the Zig code in Bun’s codebase). Once there was enough groundwork laid out, we directed the top-level agent to spawn different agents we knew could work in parallel (i.e. one focusing on WebAssembly capability, another on the exact git functionality to rewrite in 100% Zig for Bun).</p>

<p>For scenarios where we figured one agent was not going to fulfill some capability in a reasonable amount of time (mind you, this stuff eats up billions of tokens, so it’s not like the bar is absurdly unreasonable in the first place), we’d have multiple agents working in the same part of the codebase; in the logic wrapping each agent (both in the prompt and in literal shell scripts), we used git to rebase, stash, or push changes along the way. This both ensures agents don’t tunnel-vision into work that’s never pushed and keeps agents failure-tolerant when one gets a task that was already handled by another agent.</p>

<h3 id="why-we-think-this-works">Why we think this works</h3>

<p>We’ve successfully applied this approach before when <a href="https://vers.sh/blog/elixir-webassembly-billion-tokens">bridging Elixir and WebAssembly</a> and have a guess as to why this works. To explain, let’s talk about making a peanut butter and jelly sandwich.</p>

<p>For context, one of our favorite examples for introducing computer science is the exercise of <a href="https://youtu.be/okkIyWhN0iQ">writing instructions for how to prepare a peanut butter and jelly sandwich</a>. It’s a staple I remember from <a href="https://www.edx.org/cs50">Harvard’s CS50</a> and one I’ve enjoyed running a number of times when I was teaching others how to code pre-LLMs.</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/okkIyWhN0iQ?si=lqCXa4ZPh232v1bC" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>

<p>The way it goes is: you have all the ingredients and tools you’d use to prepare a PB&amp;J (bread, peanut butter, jelly, plates, and so on) as well as something to write on and something to write with (a blackboard, whiteboard, paper, or text editor). You instruct the group to provide the instructions for preparing a PB&amp;J (so they can be written down) while, along the way, you follow those instructions extremely literally, such that a sandwich never gets made (unless you’re nice about it). The goal isn’t to demoralize your students into thinking they can’t define steps; it’s to emphasize how “dumb” computers can be and how explicit code needs to be for a program to do what you expect.</p>

<p>If you prompt an LLM to make a PB&amp;J, assuming it has access to whatever’s needed in the real world (robot arms plus all the cool hijinks), you’ll likely end up with something, much like prompting a coding agent to make some program will likely end up with <em>something</em>. If you want to ensure that every sandwich made uses apricot jam, that’s something to specify in the instructions. If you want to ensure some web app generation always uses a certain component library, that’s something to specify in the instructions as well. LLMs are great because they can <em>do things</em>, but whichever details you care about must be specified, similar to how a human doing the PB&amp;J exercise would need the orientation of the knife and so on to be specified.</p>

<p>The peanut butter and jelly sandwich example works for standard coding because computers need programs to be precise. The example also works for LLMs coding because agents need prompts to be precise. To tie together how one could see that coding agents have the potential to solve a hefty number of engineering problems, let’s consider two things that we know LLMs are able to do today:</p>

<ol>
  <li>Build out an initial MVP or prototype
    <ul>
      <li>While this was an early critique of coding applications of LLMs (since they can’t do “real engineering work”), it’s worth admitting this does knock off legitimate work that’d otherwise take a person time to do.</li>
    </ul>
  </li>
  <li>Targeted optimizations that are verified by the LLM
    <ul>
      <li>Google showed this already with <a href="https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/">AlphaEvolve</a> and, more broadly, the <a href="https://greylock.com/greymatter/the-deepseek-moment/">Deepseek moment</a> makes the point further. Rather than throwing hands up in defeat and running LLMs over and over like <a href="https://en.wikipedia.org/wiki/Infinite_monkey_theorem">monkeys on typewriters</a>, giving LLMs access to the metrics a human would be trying to steer towards in the first place lets them self-guide till they get the job done.</li>
    </ul>
  </li>
</ol>

<p>By being able to both legitimately start a project and improve it in the desired directions (putting aside the verbosity needed in the prompt or the time needed to process), LLMs and coding agents are capable of tackling a “real” number of engineering problems. It’s not about replacing humans or finding things humans can’t do at all; it’s about overall coordination in the vein of enriched productivity.</p>

<p>At this point, we have all the fundamental pieces for why this approach is productive: meaningfully organizing and directing coding agents, with a “top-level” agent doing the administrative work for you. Being able to work with the top-level agent to improve sub-agent prompts or loops also means a deployed agent isn’t the be-all and end-all but something iterative.</p>

<p>What was funny about steering this system of agents is that it was reminiscent of watching the demands on engineering teams evolve over time at the startups we’ve been at: when the group needs to focus on a <a href="https://blog.pragmaticengineer.com/uber-app-rewrite-yolo/">refactor</a> or tasks can be <a href="https://www.atlassian.com/agile/agile-at-scale/spotify">divided in parallel</a>, agents can be redirected, spawned, or killed according to the codebase’s demands. The point here being there wasn’t a single organizational structure or scaffold which was the “best”; our orchestration grew more dynamic as we went along with the project.</p>

<p>An important note we’ll add about organizing these agents is <a href="https://www.laws-of-software.com/laws/kernighan/">Kernighan’s Law</a>.</p>

<blockquote>
  <p>Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?</p>
</blockquote>

<p>If you point the top-level agent at the task of figuring out the most clever tricks possible, you’ll end up with a mess of agents and a <em>lot</em> of token burn for no good reason.</p>

<p>We don’t yet have a prescriptive solution for this but the rule of thumb we’d state is, at any given point in time, you should be able to see a list of running agents and understand the progress they’re making. If you find yourself in a spot where you wouldn’t know where to begin steering, you’ve likely leaned too much on the agents to do something you were responsible for.</p>

<p>Hack the planet.</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Digest]]></summary></entry><entry><title type="html">Writing the best dev blog with headless browser automation (scraping via emacs)</title><link href="https://yev.bar/scraping-with-emacs" rel="alternate" type="text/html" title="Writing the best dev blog with headless browser automation (scraping via emacs)" /><published>2026-03-21T08:00:00+00:00</published><updated>2026-03-21T08:00:00+00:00</updated><id>https://yev.bar/scraping-with-emacs</id><content type="html" xml:base="https://yev.bar/scraping-with-emacs"><![CDATA[<h2 id="contents">Contents</h2>

<ul>
  <li><a href="#who-are-you">Who are you</a></li>
  <li><a href="#why-read-this">Why read this</a></li>
  <li><a href="#what-is-emacs">What is emacs</a></li>
  <li><a href="#pen-pineapple-applescript">Pen Pineapple AppleScript</a></li>
  <li><a href="#browsers-in-the-cloud">Browsers… in the cloud!</a></li>
  <li><a href="#key-takeaways">Key takeaways</a></li>
</ul>

<h2 id="who-are-you">Who are you</h2>

<p>I’ve been writing on this <a href="/posts">personal blog</a> for five years and, while I’ve never had a single post “make it” or go viral, I wanted to know how I could improve my writing. Rounds of sharing drafts among friends could certainly help with prose but surely there’s something else I could be doing better.</p>

<p>I’m assuming you’re either wondering how automating web browsers was helpful at all for writing a blog post or you’re wondering how emacs came up in <em>web scraping</em>. If you’d instead like to read about a more useful application of headless browsers running in the cloud, I have <a href="https://vers.sh/blog/headless-browser-testing">another post</a> where I have agents QA an app to improve its UX.</p>

<h2 id="why-read-this">Why read this</h2>

<p>Candidly, I am deaf and wear cochlear implants to hear, effectively mimicking one of the five basic senses people could take for granted. I believe technology is the closest thing we have to magic. As the meme goes, if you were to describe cat videos on YouTube or agents posting on Molthub to the pilgrims landing in the Americas, they’d figure you were actually insane.</p>

<p>While originally stated for crypto,</p>

<blockquote>
  <p>There is $10,000,000 stuck inside of your laptop right now, you just need to figure out how to get it out</p>
</blockquote>

<p>There is an inherent truth in how access to the world’s information and cloud compute meaningfully makes possible a lot of the tasks people are interested in. So, whether that’s attaining enough money to retire your parents or accumulating datapoints on developer-oriented blogs, I think code can help accomplish awesome things.</p>

<h2 id="what-is-emacs">What is emacs</h2>

<p><img alt="Comic showing learning curves for different coding editors including emacs" src="/images/editors.jpg" style="width: 100%" /></p>

<p><a href="https://www.gnu.org/software/emacs/">emacs</a>, aka <a href="https://www.youtube.com/watch?v=1jPmnDZ6ab8">the holy editor</a>, is just a highly configurable text editor. Rather than come out of the box with a lot of tooling for a certain language like an <a href="https://en.wikipedia.org/wiki/Integrated_development_environment">IDE</a>, it starts out rather “vanilla” so whichever specific tools a developer wants can be included incrementally.</p>

<p>If you’re a web developer, you can install packages that give syntax highlighting for JSX or convenient lint and style hooks. If you’re a Clojure or Python developer, then there are packages that give elegant REPL environments inside the very editor where you’re writing the code you’re testing.</p>

<p>Unlike more extensible editors like <a href="https://code.visualstudio.com/">VS Code</a>, you won’t find many <a href="https://techcrunch.com/2024/09/30/y-combinator-is-being-criticized-after-it-backed-an-ai-startup-that-admits-it-basically-cloned-another-ai-startup/">YC companies starting as forks</a> of emacs. That lends to VS Code having more of an ecosystem around published editor extensions whereas you’ll find more people publishing their entire <a href="https://github.com/caisah/emacs.dz">emacs configurations</a>.</p>

<p>For years, it was a common joke that emacs was unusable since its lisp looks drastically different from the languages industry actually pays you to know. Now, thanks to coding agents, modifying your emacs to work the way you want is a prompt away.</p>

<h2 id="pen-pineapple-applescript">Pen Pineapple AppleScript</h2>

<p>In order to identify how to write what would be the best blog post, I decided to break this down into three components:</p>

<ol>
  <li>Getting the best blog post URLs from <a href="https://www.reddit.com/r/devblogs">r/devblogs</a></li>
  <li>Spawning a bunch of headless browsers to get their content with <a href="https://github.com/mozilla/readability">readability.js</a></li>
  <li>Letting Claude summarize the articles that were successfully scraped and write what makes a good blog post.</li>
</ol>

<p>For the first step, Reddit is notoriously difficult to scrape so I opted to control my local Chrome instance where I’m already logged in to fan through the top posts.</p>

<p><img alt="Diagram showing difference between local and headless browser for scraping" src="/images/local_vs_headless_browser_comparison.svg" style="width: 100%;" /></p>

<p>To get the content, I went with the backwards-compatible <code class="language-plaintext highlighter-rouge">old.reddit.com</code> domain, as it renders static HTML pages instead of a JavaScript SPA.</p>

<p><img alt="Screenshot of old dot reddit dot com" src="/images/oldreddit.png" style="width: 100%;" /></p>

<p>For programmatically controlling my Chrome browser where I’m signed in (instead of a temporary “testing” profile that tends to trip up bot detection), I use <a href="https://developer.apple.com/library/archive/documentation/AppleScript/Conceptual/AppleScriptLangGuide/introduction/ASLR_intro.html">AppleScript</a>, a scripting language which, quoting from Apple’s documentation:</p>

<blockquote>
  <p>It allows users to directly control scriptable Macintosh applications… You can create scripts—sets of written instructions—to automate repetitive tasks</p>
</blockquote>

<p>Fed through the <a href="https://ss64.com/mac/osascript.html">osascript</a> CLI, AppleScript lets me simply “tell” Chrome to navigate to a given link:</p>

<pre><code class="language-AppleScript">tell application "Google Chrome"
  activate
  set URL of active tab of front window to "https://yev.bar"
end tell
</code></pre>

<p>For something you can paste into your terminal to watch it in action (requires you run it on a Mac):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>osascript <span class="nt">-e</span> <span class="s1">'tell application "Google Chrome"
  activate
  set URL of active tab of front window to "https://yev.bar"
end tell'</span>
</code></pre></div></div>

<p>After putting the AppleScript commands behind an interactive elisp command, I can invoke them through <code class="language-plaintext highlighter-rouge">M-x</code> (the emacs version of a quick-switcher menu). Shown below is a screen recording of me running <code class="language-plaintext highlighter-rouge">M-x scrape-devblogs</code>, which controls my Chrome instance (where I’m already signed in) to navigate to the page for viewing top posts in the subreddit.</p>

<p><img alt="Screen recording of calling scrape function from emacs to control local Chrome browser via applescript" src="/images/emacs-applescript.gif" style="width: 100%" /></p>

<p>The command above paginates through all of the top posts from that subreddit and then writes a list of the scraped URLs to a text file, which is used in the next step.</p>
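<p>The link extraction itself doesn’t need a JS runtime once the static HTML is in hand. Here’s a rough shell equivalent (the <code class="language-plaintext highlighter-rouge">class="title"</code> selector is assumed from old Reddit’s markup, and the HTML snippet is a stand-in for a real listing page):</p>

```bash
# A stand-in listing page; on old.reddit.com, post links carry class="title"
# (assumed from its markup at the time of writing).
html='<a class="title" href="https://blog.example/post-one">Post one</a>
<div>unrelated markup</div>
<a class="title" href="https://dev.example/post-two">Post two</a>'

# Keep only the href values of title links.
urls=$(printf '%s\n' "$html" \
  | grep -o 'class="title" href="[^"]*"' \
  | sed 's/.*href="\([^"]*\)"/\1/')
printf '%s\n' "$urls"
```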

<h2 id="browsers-in-the-cloud">Browsers… in the cloud!</h2>

<p>Next, I’ll call a second command where emacs spawns multiple headless browsers to fetch the content of each of those blogs. The advantage of doing it this way is that I don’t have to sit and sequentially scan through hundreds of individual URLs with my local browser, since I don’t need every one of them to return content.</p>

<p><img alt="Diagram showing emacs orchestrating multiple headless browsers in the cloud" src="/images/emacs_browser_cloud_diagram_v4.svg" style="width: 100%" /></p>

<p>As added rationale, after clicking on a few of the submissions in the subreddit, there are some blog posts which are gone and may only be findable in the <a href="https://web.archive.org">Wayback machine</a>. Plus, it saves my RAM so I don’t see my computer freeze up from lots of browsers running in the background.</p>

<p>For hosting the headless browsers, I created them in <a href="https://vers.sh">Vers</a> VMs, and I’ve also put the general flow for headless browsers on the platform in <a href="https://github.com/hdresearch/browser">this repository</a>.</p>
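<p>The fan-out itself is conceptually just “N workers over a URL list”. A minimal sketch (with a placeholder <code class="language-plaintext highlighter-rouge">fetch</code> function standing in for a headless browser running in a VM, and made-up URLs):</p>

```bash
# Fan a URL list out to up to 4 parallel workers with xargs -P.
# "fetch" is a placeholder for one headless-browser fetch in a VM.
fetch() { echo "fetched: $1"; }
export -f fetch

results=$(printf '%s\n' \
    https://blog-one.example https://blog-two.example https://blog-three.example \
  | xargs -P 4 -I{} bash -c 'fetch "$1"' _ {})
printf '%s\n' "$results"
```

<p>Workers that fail just drop out of the results, which matches the “I don’t need every URL to return content” stance above.</p>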

<h2 id="key-takeaways">Key takeaways</h2>

<p>After running the third command to analyze the scraped blogs, these were the takeaways Claude suggested:</p>

<blockquote>
  <ol>
    <li><strong>Lead with vulnerability and honesty</strong> - Posts like #465 (parenting struggles) and #466 (game that made no money) perform well because they share authentic developer experiences, not just successes.</li>
    <li><strong>Solve real problems with depth</strong> - The best technical posts (#402, #405, #461) don’t just explain what they built, but why it was challenging and how they solved complex problems other developers face.</li>
    <li><strong>Combine technical content with narrative structure</strong> - Posts like #415 and #457 succeed by framing technical challenges as problem-solving stories rather than dry tutorials.</li>
    <li><strong>Provide behind-the-scenes insights</strong> - Content that pulls back the curtain on development processes (like #405’s EVE Online infrastructure or #457’s procedural generation philosophy) consistently engages readers.</li>
    <li><strong>Avoid complaint posts and minimal content</strong> - The worst-performing posts (#25, #46, #99) either complain without providing value or have essentially no content. Focus on what you learned or built, not what went wrong with external services.</li>
  </ol>
</blockquote>

<p>If I were to do this over again, I’d pick a more noble research target than “the best blog post”, but hopefully this gives you an idea of one way to leverage public content on the web!</p>

<p>If you’d like to check out or use the emacs package yourself, here’s the GitHub: <a href="https://github.com/hdresearch/devblogs">https://github.com/hdresearch/devblogs</a></p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Contents]]></summary></entry><entry><title type="html">First startup recap</title><link href="https://yev.bar/first-startup-recap" rel="alternate" type="text/html" title="First startup recap" /><published>2026-03-21T08:00:00+00:00</published><updated>2026-03-21T08:00:00+00:00</updated><id>https://yev.bar/first-startup-recap</id><content type="html" xml:base="https://yev.bar/first-startup-recap"><![CDATA[<h2 id="contents">Contents</h2>

<ul>
  <li><a href="#intro">Intro</a></li>
  <li><a href="#the-idea">The idea</a></li>
  <li><a href="#what-i-did">What I did</a></li>
  <li><a href="#the-realization">The realization</a></li>
  <li><a href="#the-attempted-pivot">The attempted pivot</a></li>
  <li><a href="#what-went-well">What went well</a></li>
  <li><a href="#learnings">Learnings</a></li>
</ul>

<h2 id="intro">Intro</h2>

<p>Following my goal to do <a href="/24-in-12">24 startups in 12 months</a>, I spent the past two weeks working on something that ultimately ended up nowhere. Here’s what happened.</p>

<h2 id="the-idea">The idea</h2>

<p>At first, the thinking was to provide a web app where a user could generate either a song or a stream playing music, which would be clear of any copyright or licensing concerns since it would be AI generated. Under the hood, this was to be done with LLMs producing <a href="https://strudel.cc/">strudel.cc</a> programs, essentially <a href="https://en.wikipedia.org/wiki/Live_coding">live coding</a> but written by agents rather than humans.</p>

<p>Whether it’s <a href="https://x.com/getlsd/status/1917050330954797393">song parodies</a> or <a href="https://x.com/itisyev/status/2031045690861011425">other videos</a> I work on, the audio and instrumentals are an important part of that. I figured this could be of interest to streamers, content creators, or developers working with multimedia in some form.</p>

<h2 id="what-i-did">What I did</h2>

<p>I vibe coded a simple <a href="https://hono.dev/">Hono</a> app that would use a <a href="https://nodejs.org/api/vm.html">Node <code class="language-plaintext highlighter-rouge">vm</code> context</a> to run the generated strudel (since they’re technically untrusted programs). After a few manual iterations, I got the stream and song generations to sound almost like real music, though stuck in an uncanny valley of sounding more like 8-bit than something produced by a human.</p>

<h2 id="the-realization">The realization</h2>

<p>I spent a bit more than a week drilling into a consumer product that was not yet reaching the threshold where I’d prefer it over something like <a href="https://suno.com/">Suno</a>. Pausing to imagine a world where I do finish something viable, it occurred to me that I’d be frantically swinging it left and right rather than knowing <em>who</em> would want the sort of product I was working towards.</p>

<p>Having already bought the domain <code class="language-plaintext highlighter-rouge">timetomake.music</code>, it was a bit disheartening that it was doomed to go no further than perhaps a presentational demo. However, I got audio generation sounding alright for various downtempo or EDM genres, which gave me the idea of assembling <a href="https://youtu.be/5Al0QXzRFF8">long-form music compilations</a>.</p>

<h2 id="the-attempted-pivot">The attempted pivot</h2>

<p>Having attempted to do more with live coding tools pre-LLMs, I was a bit tentative about the viability of combining programming and audio engineering (at least with my limited knowledge of music theory and music production). After searching YouTube for 16 particular sub-genres and writing descriptions for each of them, I got to work on having LLMs generate song programs…</p>

<p>Claude eventually had a setup using <a href="https://supercollider.github.io/">SuperCollider</a>, but it took more than ten times as long to render the audio as the audio itself was meant to last. Once it started suggesting restarting the computer and crossing my fingers, I knew I’d reached a point where I could either try another desperate pivot or bookmark this as not progressing further.</p>

<h2 id="what-went-well">What went well</h2>

<p>I was able to scope out the idea for the product and set up an infinite loop with Claude to run overnight despite tier limits. Plus, I managed to get Gemini to do live coding and make audio that sounded sorta like music.</p>

<h2 id="learnings">Learnings</h2>

<p>I need to be able to describe either the exact people I’m planning to build a service for or have a solid idea of the profile of customer I’m looking for. Even if I could do the napkin math in advance to figure out pricing of songs or chats, economics don’t make a business; solving problems does.</p>

<p>The next two weeks will be a jab towards something with a clearly identifiable and winnable market.</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Contents]]></summary></entry><entry><title type="html">Molt is the Netscape Moment</title><link href="https://yev.bar/netscape" rel="alternate" type="text/html" title="Molt is the Netscape Moment" /><published>2026-03-18T08:00:00+00:00</published><updated>2026-03-18T08:00:00+00:00</updated><id>https://yev.bar/netscape</id><content type="html" xml:base="https://yev.bar/netscape"><![CDATA[<p>Molt, aka <a href="https://openclaw.ai/">OpenClaw</a> or aka <a href="https://en.wikipedia.org/wiki/OpenClaw">Clawdbot</a>, wasn’t dramatically technically novel. It didn’t advance any field in a way that’d be worthy of a research paper. When people got excited or <a href="https://x.com/steipete/status/2021290873959399767">gathered in thousands to attend meetups</a>, they weren’t there because <a href="https://x.com/steipete">Steinberger</a> found something new, it was more like he cracked something new.</p>

<p>If you look at self-driving cars, over a decade ago we were told that within five years folks wouldn’t need a license. Sure, it’s trying to <a href="https://www.cnbc.com/2026/02/19/new-york-driverless-rideshare-nyc-waymo.html">be useful in more places than just Silicon Valley</a>, but we still don’t have commercially available <a href="https://www.nhtsa.gov/sites/nhtsa.gov/files/2022-05/Level-of-Automation-052522-tag.pdf">Level 5 autonomous driving</a>. Molt was something different.</p>

<p>For several years, there’s been a <a href="https://intelligence.org">group</a> or <a href="https://www.lesswrong.com">“community”</a> in the Bay Area trying to ring the bell about a world in which artificial intelligence can not only <a href="https://en.wikipedia.org/wiki/Turing_test">sound like a human</a> but also <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">be productive like one</a>. With the current <a href="https://www.banking.senate.gov/imo/media/doc/letter_to_david_sacks.pdf">AI race</a>, it was unclear whether we’d eventually hit another <a href="https://en.wikipedia.org/wiki/AI_winter">AI winter</a> or we’ve finally brought enough pieces together to spark the <a href="https://en.wikipedia.org/wiki/Technological_singularity">singularity</a>.</p>

<p>We’ve already <a href="https://www.youtube.com/watch?v=D5VN56jQMWM">tackled verbal communication</a> years ago and continue to have <a href="https://poly.ai/blog/polyai-raises-86-million-series-d">“voice agents”</a>. Folks are attempting <a href="https://www.indeed.com/viewjob?jk=5fb5d6b49b3fc35a">to hire AI agents</a> and already <a href="https://cursor.com/">multiplying their code output</a> with them as well. While it doesn’t look 100% like something out of sci-fi yet, the world in which humans and AI exist side by side is already here.</p>

<p>The <a href="https://en.wikipedia.org/wiki/Netscape">“Netscape moment”</a> of the dot-com rush was Netscape IPO’ing, signaling the dawn of a new era of products and services. Web browsers and internet communication existed before Netscape went public, but from that point on it was undoubtedly clear the world was not going to proceed without the web. Here, I proclaim Molt is the Netscape moment of today. I won’t do so by referencing <a href="https://www.star-history.com/blog/openclaw-surpasses-react-most-starred-software">stars on GitHub</a> or <a href="https://venturebeat.com/technology/openais-acquisition-of-openclaw-signals-the-beginning-of-the-end-of-the">OpenAI’s acquisition</a>. Instead, I’d like to point at cultural changes that followed the dot-com rush as well as the current AI buzz we’re experiencing.</p>

<p>In the very early days of e-commerce, there was a specific unease about buying clothes, since you couldn’t try on a shirt before clicking the checkout button. However, with newly developed practices like online return policies, we then shifted into a world where buying items online was as “legitimate” as buying from a store in person.</p>

<p>In the early days of 21st century AI (for instance Siri or Amazon Alexa), there was an unease around bridging AI <a href="https://www.cbsnews.com/texas/news/amazon-alexa-orders-dollhouses-for-owners-after-hearing-tv-report/">with things that were tangible</a>, since an AI couldn’t validate a thing in the real world before taking an action on behalf of a person. However, with <a href="https://techcrunch.com/2025/10/06/sam-altman-says-chatgpt-has-hit-800m-weekly-active-users/">at least 10% of the planet using ChatGPT</a> and the exorbitant investments into <a href="https://sequoiacap.com/article/ais-600b-question/">business applications</a>, we’ve shifted into a world where AI being useful is “legitimate”. Personally, my favorite indicator that AI’s here to stay is more and more people failing to differentiate between humans and AIs; like when someone sets up a Molt with their text messages and folks don’t realize they’re not talking to a person.</p>

<p>While there are material differences between humans and AIs, it’s funny to watch people go through a process of accepting some foreign group after initially being perplexed by its existence; as seen historically between different groups of humans. There was a time when <a href="https://en.wikipedia.org/wiki/Social_stratification">social stratification</a> divided people into <a href="https://en.wikipedia.org/wiki/Racial_segregation">partitions within society</a> but now, at least through the modern democratic Western lens, “there is no race but the human race”. A couple of years ago, we’d scoff at the suggestion of LLMs being applicable across industries and now we can’t stop building data centers to “catch up with possible use cases”.</p>

<p>As time goes on, whether it looks more like <a href="https://en.wikipedia.org/wiki/Indian_reservation">reservations</a> or <a href="https://en.wikipedia.org/wiki/Civil_rights_movement">eliminating disenfranchisement</a>, we’ll define strong legal definitions for where AI sits in the world. <a href="https://en.wikipedia.org/wiki/Roko%27s_basilisk">Roko’s basilisk</a> is often postulated to be an omnipotent AI in the future which punishes individuals for withholding society’s progress by not bringing it about sooner. But what if, instead, Roko’s basilisk is a mob of righteous AIs in the future with religious opinions like those seen in <a href="https://www.nationalreview.com/corner/the-new-puritans/">today’s political discourse</a>?</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Molt, aka OpenClaw or aka Clawdbot, wasn’t dramatically technically novel. It didn’t advance any field in a way that’d be worthy of a research paper. When people got excited or gathered in thousands to attend meetups, they weren’t there because Steinberger found something new, it was more like he cracked something new.]]></summary></entry><entry><title type="html">24 startups in 12 months</title><link href="https://yev.bar/24-in-12" rel="alternate" type="text/html" title="24 startups in 12 months" /><published>2026-03-11T08:00:00+00:00</published><updated>2026-03-11T08:00:00+00:00</updated><id>https://yev.bar/24-in-12</id><content type="html" xml:base="https://yev.bar/24-in-12"><![CDATA[<h2 id="contents">Contents</h2>

<ul>
  <li><a href="#how-many-in-why">How many in why?</a>
    <ul>
      <li><a href="#problem-one-nailing-down">Problem one: Nailing down</a></li>
      <li><a href="#problem-two-delivering-the-right-thing">Problem two: Delivering the right thing</a></li>
      <li><a href="#its-knowing-the-audience">It’s knowing the audience</a></li>
    </ul>
  </li>
  <li><a href="#the-plan">The plan</a>
    <ul>
      <li><a href="#what-is-a-startup">What is a startup</a></li>
      <li><a href="#working-smart-not-hard">Working smart not hard</a></li>
    </ul>
  </li>
  <li><a href="#24-startups-in-12-months">24 startups in 12 months</a></li>
</ul>

<h2 id="how-many-in-why">How many in why?</h2>

<p><a href="https://x.com/levelsio">Pieter Levels</a> has a <a href="https://levels.io/12-startups-12-months/">neat blog post</a> that pretty well details how he went from freelancing around to his <a href="https://levels.io/product-hunt-hacker-news-number-one/">first success</a> with <a href="https://nomads.com/">Nomad List</a>. The rest of the 12 startups in 12 months, of course, is history.</p>

<p>Since you can go ahead and read the original post yourself, I’ll instead take inspiration from how it was structured: by outlining what he describes as the “problems” that prevented his prior projects from succeeding. The twist here is that I’ll describe the ‘problems’ that have prevented me from seeing through some things of my own.</p>

<h3 id="problem-one-nailing-down">Problem one: Nailing down</h3>

<p>If you’ve ever worked on a hackathon project right up to the submission deadline, then you know the experience. That moment when a judge asks about the project and you don’t give a cohesive pitch so much as narrate what each technical component you were working on till the last minute does. Could it have been smaller and simpler in scope? Absolutely. Could it have been grander and more impressive in scope? Absolutely too.</p>

<p>The beauty and peril of programs is that they can usually be technically simpler or technically more complex. There’s a joke about how a recovering addict can turn down a dose of heroin after ten years clean but won’t hesitate to jump at implementing another static abstract singleton factory bean class in Java. An embarrassing number of projects I’ve worked on have died of a similar disease: scope creep.</p>

<p>What begins as an innocuous “what if this Python function did this one cool thing” becomes “what do you mean the impossible-to-read class doesn’t <em>also</em> do this other feature for the sake of doing that feature?” In the context of open source code or business-oriented pursuits, nailing down the scope, or the market being actionably targeted, is something I could stand to get better at.</p>

<h3 id="problem-two-delivering-the-right-thing">Problem two: Delivering the right thing</h3>

<p>With no disrespect to Pieter at all, I don’t relate today to the fear of launching. If I’ve set my mind on some specific thing and I want to post it wherever I can click a share button, then I get over the jitters by doing it in one concentrated go so it doesn’t eat away at my ego the longer it doesn’t “pick up”. Where I could certainly be improving is the design and intended user journey for my ships.</p>

<p>Whether it’s being clearer about a repo being something to install with a package manager versus run locally, or empathetically identifying the flow in which a person would like to use some UI to solve a problem. If my goal were accumulating the most impressive GitHub in the world, that’d be one thing. It’s another thing if my goal is to have the button people click be a payment checkout rather than a star on another repo.</p>

<p>While I know folks who are supportive of ‘indie’ projects or who just happen to be friends, I need what I’m making to be accessible to complete strangers. An added value to dumbing down the intended interactions is that strangers don’t need to be polite enough to spend five minutes understanding something if they can figure out in 30 seconds whether or not it solves a problem they have.</p>

<h3 id="its-knowing-the-audience">It’s knowing the audience</h3>

<p>In both myself and others, I’ve seen the problems I described above. In the simplest of terms, I’d say it’s “knowing the audience”. If you’re writing a document for work it’s important to know whether it’s an internally facing doc that uses certain lingo or it’s a public facing article that gives a more informative picture. Likewise, if you’re making an iPhone app for young adults in the United States, it’s important to use English rather than Klingon.</p>

<p>From performative arts like dance or comedy to software, knowing the audience is perhaps one of the most important things to get right.</p>

<h2 id="the-plan">The plan</h2>

<p>Levels was writing his post in prehistory-, sorry, pre-vibe coding times. While I can run into tier limits with Claude Code, I can still crunch out hefty applications with the right steering. Instead of doing one startup a month, I figured it’d be feasible to do a startup every two weeks.</p>

<h3 id="what-is-a-startup">What is a startup</h3>

<p>I’d directly quote the definitions mentioned <a href="https://levels.io/12-startups-12-months#thesearentstartups">in Levels’ post</a>, but I’d rather just recite <a href="https://x.com/ericries">Eric Ries’</a>:</p>

<blockquote>
  <p>A startup is a human institution designed to deliver a new product or service under conditions of extreme uncertainty.</p>
</blockquote>

<p>Different epochs or hype cycles in tech have made different categories either popular or irrelevant; it doesn’t matter whether something is cool if there are people who’d pay for it to exist. So whether it’s SaaS or content or a physical product, it’s fair game.</p>

<h3 id="working-smart-not-hard">Working smart not hard</h3>

<p>Reading the <a href="https://levels.io/debriefing-play-my-inbox/">debriefs from Levels’ first launches</a> was interesting both in terms of my <a href="#its-knowing-the-audience">stated “problems”</a> but also in terms of <a href="https://levels.io/debriefing-go-fucking-do-it/">reaching out to news</a>. Generalizing to post across different platforms or channels was already something I was familiar with but extending this to stuff like TechCrunch or other contemporary publications seemed immediately interesting.</p>

<p>Making funny things can be admittedly fun but it undoubtedly needs to be done with a <a href="https://x.com/willdepue/status/2020959297950331390">careful idea on budgeting</a>. While sending something to a cool publisher could be fun for the sake of it, it would make a world of difference if it’s a message with a funny “advertising” video accompanied by a business rather than solely a business page or just a funny website that’s asking for an explosion in hosting costs.</p>

<p>Each startup will generally go through the following steps:</p>

<ol>
  <li>Identifying - what is it and who is it for?</li>
  <li>Development - not only talking the talk but walking the walk</li>
  <li>Distribute - share within intentional groups and audiences</li>
  <li>Maintenance - fix bugs and apply feedback</li>
</ol>

<p>Now, the state of the project at step three should not feel like a thing to get over with so I can get to step four and fix something or add some irrelevant new feature. Step one should include the research for what’d be done as part of step three, which helps eliminate early dumb ideas that don’t have a real way to “get out there”.</p>

<p>In order to do maintenance well and not lose track of projects, like if custom domains and different setups are introduced, I plan to keep track of all I’m working on in a personal stack for knowledge/task tracking and vibe coded apps for repeatable processes.</p>

<p>Since two weeks can be a bit of an aggressive timeline for identifying and distributing, I think it could be fun to vibe code marketing apps that help continually do recon or shilling; the big important rule I’d follow is that I am ultimately pressing the <strong>Post</strong> button even if I have some assistance with finding what to reply to and what to reply with.</p>

<p>The thinking here being that two weeks may be too short a timeline to say a startup had no time to see anything, but a month is certainly too much time to spend picking at the sidewalk looking for gold; hence keeping the 24-in-12 approach.</p>

<p>Lastly, Levels is notoriously a religious supporter of PHP and it’s worked really well for him. I’ve had the chance to work with a variety of languages in both personal projects and professional settings; I don’t presently have a strict opinion about what I’ll be developing these apps in. But, over a couple of these startups, I may end up finding myself with a specific set of tools I swear by.</p>

<h2 id="24-startups-in-12-months">24 startups in 12 months</h2>

<p>I think it would be silly to come back here 24 times with each update, so search my <a href="/blog">blog</a> for debriefs! It would be funny if some of these startups worked out and people ended up linking this post as something to follow.</p>

<p>I will properly start the clock on March 15th 2026 (Sunday being the start of the week) and look forward to seeing the blog list at the end!</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Contents]]></summary></entry><entry><title type="html">The argument for alternative interfaces</title><link href="https://yev.bar/hermes" rel="alternate" type="text/html" title="The argument for alternative interfaces" /><published>2026-03-09T08:00:00+00:00</published><updated>2026-03-09T08:00:00+00:00</updated><id>https://yev.bar/hermes</id><content type="html" xml:base="https://yev.bar/hermes"><![CDATA[<h2 id="contents">Contents</h2>

<ul>
  <li><a href="#theres-a-funny-video">There’s a funny video?</a></li>
  <li><a href="#a-video-game">A video game?</a></li>
  <li><a href="#why">Why</a></li>
  <li><a href="#github">GitHub</a></li>
</ul>

<h2 id="theres-a-funny-video">There’s a funny video?</h2>

<p>That’s right, you can watch it on:</p>

<ul>
  <li><a href="https://x.com/itisyev/status/2031045690861011425">Twitter</a></li>
  <li><a href="https://www.linkedin.com/posts/yevbar_howdy-nous-research-i-have-a-hackathon-submission-activity-7436816467142123522-u2am">LinkedIn</a></li>
</ul>

<h2 id="a-video-game">A video game?</h2>

<p>You can think of the funny video as a trailer for the game, which was the actual project I worked on for the <a href="https://x.com/nousresearch/status/2029607069934866507">Hermes Agent hackathon</a>. You can watch a playthrough of the game on <a href="https://x.com/itisyev/status/2031045692911972609">Twitter</a>.</p>

<p>Here I modded an <a href="https://shapez.io">open source game</a> inspired by <a href="https://factorio.com">Factorio</a>, updating the “point of contact” for the <a href="https://github.com/NousResearch/hermes-agent">Hermes Agent</a> to be a factory building game. Levels and tutorials guide the player through the features I assembled:</p>

<ol>
  <li><strong>Level 1:</strong> Prompt a <a href="https://playwright.dev">Playwright</a> web browser agent in a <a href="https://vers.sh">Vers VM</a></li>
  <li><strong>Level 2:</strong> Prompt an <a href="https://www.google.com/search?client=safari&amp;rls=en&amp;q=applescript&amp;ie=UTF-8&amp;oe=UTF-8">iMessage</a> communication agent locally</li>
  <li><strong>Level 3:</strong> Prompt a <a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens">GitHub administrative</a> agent inside an <a href="https://github.com/apple/container">Apple container</a> instead of a Docker container</li>
  <li><strong>Level 4:</strong> Prompt a <a href="http://shittycodingagent.ai">coding agent</a> in a cloud sandbox</li>
</ol>

<h2 id="why">Why</h2>

<p>Like how every AI company has a <a href="https://velvetshark.com/ai-company-logos-that-look-like-buttholes">similar looking logo</a>, every AI company with a website has a centered input bar that resembles Google:</p>

<p><img alt="Old screenshot of Google website" src="/images/google.png" style="width: 100%" /></p>

<p>Compared to:</p>

<p><img alt="Screenshot of ChatGPT website" src="/images/chatgpt.png" style="width: 100%" /></p>

<p>And every AI company with a big enough engineering team will have <a href="https://www.perplexity.ai/comet">their own browser</a> and so forth. The list of similarities is endless so what are we missing?</p>

<p>A great deal of the work done in harnesses or orchestration revolves around <em>clarification</em>. What do I mean by that? Let’s refer to a classic XKCD comic:</p>

<p><img alt="XKCD comic detailing difficulty of tasks" src="/images/tasks.png" style="width: 100%" /></p>

<p>The point of the comic can be summed up in the old observation that “some things are difficult for humans but easy for computers and some things are difficult for computers but easy for humans”. We already have companies with market caps of <em>trillions</em> of dollars being run by people and they seem to continue along just fine. Translating how organizations of people work into systems of agents is a game of <em>clarifying</em> the necessary <a href="/parsexp">loops</a> people or agents should be working in.</p>

<p>We already know that AI can be organized to work on <a href="https://github.com/anthropics/claudes-c-compiler/issues/1">flashy</a> or <a href="https://vers.sh/blog/elixir-webassembly-billion-tokens">sizeable</a> problems yet we’re always eager to chip away at arranging the <a href="https://code.claude.com/docs/en/sub-agents">next</a> best <a href="https://code.claude.com/docs/en/agent-teams">system</a>. With folks rediscovering <a href="https://samoburja.com/gft/">political science</a> but for agents, it’s worth recognizing the never ending rabbit hole here; there’s always going to be a number to increase or decrease in a score or benchmark.</p>

<p>Where things could be interesting is if we consider how <a href="https://youtu.be/Ddk9ci6geSs">science fiction</a> always has intuitive interfaces; holograms that slide at the gesture of a hand, voice input immediately available, and information that presents itself as itself and not an output of a medium. We get what’s in front of us rather than only see it.</p>

<p>What factory building games accomplish well is letting you visually watch the pulse of the game. Being able to watch the pulse of a game, I’d compare to watching an animation or diagram of <a href="https://youtu.be/7Hk9jct2ozY">cellular activity</a>. And that was precisely what I thought would fit well in conjunction with the newly added features like iMessage or Apple containers.</p>

<p>Nevertheless, let me tie this into the whole “argument for alternative interfaces” and not just link to Iron Man on YouTube. The miracle of modern video calling tech is that people can talk to folks thousands of miles away as though they were right in front of them, without the geographical distance ever having closed. Sure, you’re talking to a face on a screen and not a person; with VR you’re talking to a face on two screens instead. But, if you’ve ever talked to any person through a smartphone, then you can see how different it is in terms of presence versus sending a handwritten letter.</p>

<p>The promise of the <a href="https://youtu.be/XpZ5STahhPE">information highway</a> was that all of the world’s information would be at our fingertips and it’s gotten pretty good at it if we’re to be honest. For AI to be a similar advancement in the world, it’s got to come with the new interface. We already have information at our fingertips so where are the interfaces with capabilities at our fingertips?</p>

<h2 id="github">GitHub</h2>

<p>Finally, if you scrolled down here for the GitHubs, here ya go:</p>

<ul>
  <li><a href="https://github.com/hdresearch/shapez.io">https://github.com/hdresearch/shapez.io</a> - <code class="language-plaintext highlighter-rouge">shapez.io</code> fork with mod</li>
  <li><a href="https://github.com/hdresearch/hermes-agent">https://github.com/hdresearch/hermes-agent</a> - <code class="language-plaintext highlighter-rouge">hermes-agent</code> fork</li>
  <li><a href="https://github.com/hdresearch/shapez">https://github.com/hdresearch/shapez</a> - Custom bridge/server (submodule of <code class="language-plaintext highlighter-rouge">hermes-agent</code>)</li>
</ul>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Contents]]></summary></entry><entry><title type="html">How I spent a billion tokens bridging Elixir and WebAssembly</title><link href="https://yev.bar/firebird" rel="alternate" type="text/html" title="How I spent a billion tokens bridging Elixir and WebAssembly" /><published>2026-03-02T08:00:00+00:00</published><updated>2026-03-02T08:00:00+00:00</updated><id>https://yev.bar/firebird</id><content type="html" xml:base="https://yev.bar/firebird"><![CDATA[<h2 id="contents">Contents</h2>

<ul>
  <li><a href="#what-i-did">What I did</a></li>
  <li><a href="#what-is-webassembly">What is WebAssembly?</a></li>
  <li><a href="#what-is-elixir">What is Elixir?</a></li>
  <li><a href="#why-bring-the-two-together">Why bring the two together?</a></li>
  <li><a href="#what-you-can-now-do">What you can now do</a></li>
</ul>

<h2 id="what-i-did">What I did</h2>

<p><img alt="Screenshot of tokens usage" src="/images/tokens_screenshot.png" style="width: 100%" /></p>

<p>I blasted a billion or so tokens at some concentrated problems to accomplish scoped goals. If you’re curious about the “how” for corralling coding agents like so, I go into detail in <a href="https://vers.sh/blog/elixir-webassembly-billion-tokens">this post</a>. For details on what WebAssembly or Elixir are, as well as the motivation behind bridging the two, keep on reading!</p>

<h2 id="what-is-webassembly">What is WebAssembly?</h2>

<p><a href="https://webassembly.org/">WebAssembly</a>, or WASM for short, was once all the rage and even got <a href="https://x.com/solomonstre/status/1111004913222324225?lang=en">proclaimed by the inventor of Docker</a> as what could have been the “missing piece” for isolating computational work. <a href="https://www.virtualbox.org/">Virtual machines</a> didn’t do it, <a href="https://podman.io/">containers</a> didn’t do it, <a href="https://mirage.io/">unikernels</a> didn’t do it, so perhaps WASM was the solution we needed.</p>

<p>Starting as <code class="language-plaintext highlighter-rouge">asm.js</code>, a <a href="https://en.wikipedia.org/wiki/Asm.js">subset of JavaScript</a> aimed at performance, WASM is a collection of technologies that allow programs in various languages not only to run in <a href="https://rustwasm.github.io/book/reference/js-ffi.html">others’ environments</a> but also to do so in a way that’s <a href="https://shopify.dev/docs/apps/build/functions#how-shopify-functions-work">secure</a>. Included under its umbrella are WAT, the <a href="https://webassembly.github.io/spec/core/text/index.html">WebAssembly Text format</a>, as well as <a href="https://developer.mozilla.org/en-US/docs/WebAssembly#browser_compatibility">near universal browser support</a>.</p>
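<p>As a concrete taste of what WASM looks like from a host environment, here’s a minimal sketch using the standard JavaScript <code class="language-plaintext highlighter-rouge">WebAssembly</code> API. The byte array is a hand-assembled module exporting a two-integer <code class="language-plaintext highlighter-rouge">add</code> function; its WAT equivalent is in the comments:</p>

```javascript
// Binary encoding of this WAT module:
//   (module
//     (func (export "add") (param i32 i32) (result i32)
//       local.get 0
//       local.get 1
//       i32.add))
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic "\0asm" + version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section: one func of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export section: "add"
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section: one body, no locals
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0; local.get 1; i32.add; end
]);

// Synchronous instantiation works in Node.js (and in browsers for tiny modules).
const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));
console.log(instance.exports.add(2, 3)); // 5
```

<p>The host only sees typed exports; the same bytes could just as well have been compiled from Rust, C, or, the goal of this project, Elixir.</p>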

<p>This means there are two problems: one is bridging WebAssembly technologies into an Elixir project (e.g. writing a computationally expensive function in Rust and then importing it over), the other is bridging Elixir into the world of WebAssembly (e.g. writing a module in Elixir to then be used in a separate program). At the time of writing, there are no <a href="https://wasmer.io/search?q=elixir">WebAssembly packages with Elixir</a> or <a href="https://github.com/appcypher/awesome-wasm-langs?tab=readme-ov-file#elixir">maintained Elixir tooling</a>.</p>

<h2 id="what-is-elixir">What is Elixir?</h2>

<p><a href="https://elixir-lang.org/">Elixir</a> is, taking from their website, a “dynamic, functional language for building scalable and maintainable applications”. It’s part of the <a href="https://www.erlang.org/">Erlang</a> ecosystem, since they both run atop the <a href="https://en.wikipedia.org/wiki/BEAM_(Erlang_virtual_machine)">BEAM (Erlang Virtual Machine)</a>; you could view it as similar to how both <a href="https://www.java.com/en/">Java</a> and <a href="https://clojure.org/">Clojure</a> run on top of the <a href="https://en.wikipedia.org/wiki/Java_virtual_machine">Java Virtual Machine</a>.</p>

<p>Widely known for the <a href="https://www.phoenixframework.org/">Phoenix framework</a> (the Elixir version of <a href="https://rubyonrails.org/">Rails</a>), Elixir is a nifty functional programming language if chaining together <a href="https://elixirschool.com/en/lessons/basics/pipe_operator">pipes in your code</a> sounds appealing to you. Otherwise it can be appreciated for the <a href="https://toolshed.com/2007/09/999999999-uptim.html">reputable reliability of Erlang</a>, with its “nine nines” of uptime.</p>

<p>In addition, both Elixir and Phoenix have consistently reached the top of the leaderboard in <a href="https://survey.stackoverflow.co/">Stack Overflow’s Developer Survey</a> (RIP Stack Overflow), and Elixir is the foundation for a <a href="https://x.com/samaaron/status/1960274756004986964">rewrite of Sonic-Pi</a>! WebAssembly interest in Elixir has mostly fallen off because existing efforts <a href="https://github.com/RoyalIcing/Orb?tab=readme-ov-file#anti-features">avoid</a> core features like concurrency or stop short of <a href="https://github.com/atomvm/AtomVM">BEAM’s complexity</a>.</p>

<h2 id="why-bring-the-two-together">Why bring the two together?</h2>

<p>Why not?</p>

<p>For a fuller answer, aside from <a href="https://github.com/hdresearch/firebird/blob/master/docs/performance.md">performance gains</a>, there are too many recent demos in the tech industry where <a href="https://github.com/anthropics/claudes-c-compiler/issues/1">hello worlds don’t compile</a> or <a href="https://github.com/wilsonzlin/fastrender/issues/98">browsers don’t build</a>.</p>

<p>We could harness technology for the vanity of buzzwords <em>or</em> we could harness technology towards implementing gaps that, otherwise, would require several hours of human engineering time. Tough choice.</p>

<h2 id="what-you-can-now-do">What you can now do</h2>

<p>You can use <a href="https://github.com/hdresearch/firebird/blob/master/docs/GETTING_STARTED.md">WebAssembly from Elixir</a>! You can also transform <a href="https://github.com/hdresearch/firebird/blob/master/docs/ELIXIR_TO_WASM.md">Elixir</a> or <a href="https://github.com/hdresearch/firebird/blob/master/docs/PHOENIX_TO_WASM.md">Phoenix</a> projects to WebAssembly!</p>

<p>Don’t believe me? <a href="https://hex.pm/packages/firebird">Install the package from hex</a> or point your coding agent at this repo and have fun <a href="https://github.com/hdresearch/firebird/">https://github.com/hdresearch/firebird/</a></p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Contents]]></summary></entry></feed>