Today’s post moves away from the STFU project (an AI-powered satirical diary generator for billionaires) and instead focuses on some of my attempts to make Claude Code more efficient. I have been thinking about how to do this for a while; my main goal is to find ways to help it move away from working purely on text and instead operate in more abstract terms. During the development of my project wizzard (definitely a Terry Pratchett reference; in short: smart, extensible refactoring), I thought about setting up a smaller project where I build a Copy/Paste MCP, called boarder, and try to test it out.

The idea

I have been observing Claude Code and checking how it works for some time now. Throughout that time, the one thing that always bothered me was how it handles refactoring. The process is fairly simple:

  • Read code
  • Write code
  • Delete code
  • And so on

This process especially bothers me on larger codebases, where I sometimes see Claude hallucinating, especially with larger methods. It doesn’t really want to read the method, so instead it writes what the method is supposed to do. This ends up creating cruft code, additional cycles to fix it, and issues with tests.

So the concept is simple: what if we give the model a tool that can do a Copy/Cut and then a Paste/Pop? That way it can operate with concepts after reading the code instead of needing to write it out again. It should limit the need for output tokens by letting the model work with surgical precision: cut here, paste there, referencing lines, never needing to see the code again.
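
To make this a bit more concrete, here is a rough sketch of the kind of cut/paste primitive I have in mind. The names, signatures and the complete lack of error handling are mine for illustration; this is not boarder’s actual API.

```rust
use std::collections::HashMap;
use std::fs;

/// Illustrative clipboard: cut a line range out of a file into a named slot,
/// then paste it elsewhere without the model ever re-emitting the code.
#[derive(Default)]
struct Clipboard {
    slots: HashMap<String, Vec<String>>,
}

impl Clipboard {
    /// Remove lines `start..=end` (1-based) from `path` and stash them under `slot`.
    fn cut(&mut self, path: &str, start: usize, end: usize, slot: &str) -> std::io::Result<()> {
        let mut lines: Vec<String> =
            fs::read_to_string(path)?.lines().map(String::from).collect();
        let taken: Vec<String> = lines.drain(start - 1..end).collect();
        self.slots.insert(slot.to_string(), taken);
        fs::write(path, lines.join("\n"))
    }

    /// Insert the stashed lines before line `at` (1-based) in `path`.
    fn paste(&mut self, path: &str, at: usize, slot: &str) -> std::io::Result<()> {
        let mut lines: Vec<String> =
            fs::read_to_string(path)?.lines().map(String::from).collect();
        let buffered = self.slots.remove(slot).unwrap_or_default();
        let _removed: Vec<String> = lines.splice(at - 1..at - 1, buffered).collect();
        fs::write(path, lines.join("\n"))
    }
}
```

The interesting property is that the model only ever names line ranges and slots; the cut text itself never has to pass through its output tokens again.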

This could all be me applying human logic to a statistical model, which is not necessarily a winning proposition. Why would the model even use these tools? There isn’t anything native for it in them; it doesn’t have any experience of copy and paste, does it? Still, it’s worth an experiment and a bit of learning, isn’t it?

Implementing it - boarder is born

I already had a solution for manipulating files built as part of my wizzard project, so the easiest part was extracting it and making that project depend on the library provided by boarder. Like many other things these days, it was a Claude project; in my experience Rust projects work well without much oversight, and for quick local tools this approach works like a charm. So Rust it is. The project is published at boarder and is open source, so anybody can tinker with it. It currently clocks in at ~9,300 production lines of code and ~3,100 test lines.

Lines of code aren’t everything though; boarder also has (see the sketch after this list):

  • sophisticated buffer management
  • LRU eviction
  • 9 MCP tools
  • UTF-8 aware positioning
  • metadata-only responses, specifically to minimize token usage
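
The buffer management in the real thing is more involved than I can show here, but the general shape of an LRU-evicting buffer store looks roughly like this (type and field names are made up for illustration, not boarder’s actual code):

```rust
use std::collections::VecDeque;

/// Toy LRU store for clipboard buffers: capacity-bound, most recently used
/// entries at the back, eviction from the front.
struct BufferStore {
    capacity: usize,
    entries: VecDeque<(String, String)>, // (buffer id, contents)
}

impl BufferStore {
    fn put(&mut self, id: String, contents: String) {
        // Replace any existing buffer with the same id, then mark it as fresh.
        self.entries.retain(|(k, _)| *k != id);
        self.entries.push_back((id, contents));
        // Evict the least recently used buffer once over capacity.
        while self.entries.len() > self.capacity {
            self.entries.pop_front();
        }
    }

    fn get(&mut self, id: &str) -> Option<String> {
        let pos = self.entries.iter().position(|(k, _)| k == id)?;
        // Touch: move the hit to the back so it is evicted last.
        let entry = self.entries.remove(pos)?;
        self.entries.push_back(entry.clone());
        Some(entry.1)
    }
}
```

The actual tools also respond with metadata only (buffer id, line counts) rather than echoing the buffered text back, which is where the token savings are supposed to come from.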

I tried it with a local project, but haven’t really done as much with it as I would have wanted yet. I was wondering, though, whether there is a way to test this: some way to verify whether the presence of the MCP makes a difference to Claude Code running the same task. I could do it by hand, wait, and analyze the data from each case, but why do that when I can just have Claude write another tool for it? I did a quick proof of concept in Bash, but that gets unmaintainable fast.

Building the benchmarking tool - mcp-coder-bench

I wanted to be able to test Claude Code on the CLI without any risk, but still running YOLO (--dangerously-skip-permissions), without risking coming back to an unresponsive machine because Claude decided to rm -rf ** or something similar. I knew from the start that this would have to run in a container, with the folder it is working on mounted read/write, and with that folder isolated from everything else.
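
Conceptually, the sandbox boils down to something like the sketch below: a disposable container with only the benchmark folder mounted read/write. The image name is a placeholder, and the real orchestration in mcp-coder-bench does considerably more than this.

```rust
use std::process::{Command, ExitStatus};

/// Rough shape of one sandboxed run: a disposable container, the workspace as
/// the only writable mount, and Claude Code running in non-interactive mode.
/// The image name is a placeholder, not the one the benchmark actually uses.
fn run_case(workspace: &str, prompt: &str) -> std::io::Result<ExitStatus> {
    Command::new("docker")
        .arg("run")
        .arg("--rm") // throw the container away when the run finishes
        .arg("-v")
        .arg(format!("{workspace}:/work:rw")) // mount only the benchmark folder
        .args(["-w", "/work"])
        .arg("claude-bench-image") // placeholder image with Claude Code installed
        .args(["claude", "-p", prompt, "--dangerously-skip-permissions"])
        .status()
}
```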

Again, I am kind of obsessed with Rust, definitely my favorite language, but it also fits the job: systems engineering, building images, and multithreaded execution (being able to run multiple cases at the same time). A great fit indeed.

So I had Claude take the initial Bash approach and translate it into Rust, then worked on improving it over time to get to a functional solution, the one you can find at mcp-coder-bench. This one has ~5.6k lines of production code and ~3.5k of test code, with about 75% coverage. It provides a way to run the benchmark with configurable parallelism and a configurable number of runs per scenario, and it calculates the results into usable metrics as output.
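
The configuration surface is roughly what you would expect, something along these lines (assuming serde for parsing; the field names are illustrative rather than mcp-coder-bench’s actual schema):

```rust
use serde::Deserialize;

/// Illustrative benchmark configuration: which scenarios to run, how often,
/// and how many containers may run at the same time.
#[derive(Debug, Deserialize)]
struct BenchConfig {
    /// Number of runs executed concurrently, one container each.
    parallelism: usize,
    /// Repetitions per scenario, so the averages and std devs mean something.
    runs_per_scenario: u32,
    scenarios: Vec<Scenario>,
}

#[derive(Debug, Deserialize)]
struct Scenario {
    name: String,        // e.g. "baseline" or "with-boarder"
    mcp_enabled: bool,   // whether boarder is wired into Claude Code for this run
    prompt_file: String, // the task handed to the agent
}
```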

Again, we cannot measure everything in lines of code: mcp-coder-bench implements container orchestration, parallel execution, confidence intervals, and multiple report formats.

Defining the benchmark

So now I am ready: I have an MCP to test and a system to test it with; I only need a project to benchmark against. Since I have access to a coding agent that can spit out a project in 10 minutes flat, why not dream one up? And while we are at it, let’s make something that should showcase the usefulness of boarder, eh?

Not even 10 minutes later, we have a little project: a whole Express server in a single monolithic file. Not beautiful, but it should work, no?

  Task: Split an ~700-line Express.js monolith (src/app.js) into 11 modular files: config, stores, utils, middleware, and route handlers.

The whole setup lives in the boarder repository. This is an initial attempt; I might need to revise it at some point.

Rinse and repeat to get it working

The usual issues after setting up a project, the same problem as always: it doesn’t work out of the box. A bit of testing later (30 minutes, I reckon), I get my first outputs. Then my second ones. Then I commit and run 5 cycles to get better information, aiming for at least some statistical analysis. What is the outcome?

Benchmark Results

All the data from the benchmark is in the archive (sha256: 24c9308b770b23f94d97312e48d12ca5512e191cdb4cfdde81f8de8198e41238). This includes all the raw output from the model, the whole codebase, and everything in between.

Runs

The following table shows the details of each run, baseline and with boarder active.

| Scenario | Run | Tokens | Cost | Time | MCP | MCP Tools |
|---|---|---|---|---|---|---|
| baseline | 1 | 62.1K | $1.05 | 5m 5s | No | - |
| baseline | 2 | 70.9K | $1.08 | 5m 27s | No | - |
| baseline | 3 | 69.7K | $1.23 | 5m 21s | No | - |
| baseline | 4 | 78.8K | $1.48 | 7m 19s | No | - |
| baseline | 5 | 72.9K | $1.26 | 6m 40s | No | - |
| with-boarder | 1 | 83.4K | $1.42 | 6m 50s | Yes | move_to_file |
| with-boarder | 2 | 81.1K | $1.40 | 5m 40s | No | - |
| with-boarder | 3 | 84.9K | $1.50 | 6m 49s | Yes | cut_from_… |
| with-boarder | 4 | 78.4K | $1.23 | 5m 37s | Yes | cut_from_… |
| with-boarder | 5 | 77.3K | $1.15 | 5m 37s | Yes | cut_from_… |

Executive Summary

Key Findings:

  • Lowest Token Usage: baseline (70.9K tokens)
  • Lowest Cost: baseline ($1.22)
  • Fastest Execution: baseline (5m 58s)
  • Scenarios Tested: 2
  • Total Runs: 10

Token Usage Comparison

Tokens by Scenario
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
baseline     ████████████████████ 70.9K
with-boarder ███████████████████████ 81.0K (+14%)

Cost Comparison

Cost by Scenario (mUSD)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
baseline     █████████████████████ 1.2K
with-boarder ███████████████████████ 1.3K (+10%)

Summary

| Scenario | Runs | Avg Tokens | Std Dev | Avg Cost | Avg Time | Success Rate |
|---|---|---|---|---|---|---|
| baseline | 5 | 70.9K | ±5.4K | $1.22 | 5m 58s | 100% |
| with-boarder | 5 | 81.0K | ±2.9K | $1.34 | 6m 7s | 100% |

Ehm, what? Well, there it is: the experiment failed, in a spectacular manner one might say, but there is more to the story.

Statistically speaking we are nowhere close to significance, with n=5 and these standard deviations:

  • Baseline cost: $1.22 ± $0.17
  • With-boarder cost: $1.34 ± $0.14

The difference ($0.12) is less than one standard deviation.
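
As a quick sanity check, a back-of-the-envelope Welch’s t-test on the summary numbers (treating the ± values as sample standard deviations, which is my assumption) lands at t ≈ 1.2 on roughly 8 degrees of freedom, a two-tailed p somewhere around 0.25, nowhere near significance:

```rust
// Back-of-the-envelope Welch's t-test on the summary-table numbers.
// Assumes the ± values are sample standard deviations of n = 5 runs each.
fn main() {
    let (mean_base, sd_base, n_base) = (1.22_f64, 0.17_f64, 5.0_f64);
    let (mean_mcp, sd_mcp, n_mcp) = (1.34_f64, 0.14_f64, 5.0_f64);

    // Standard error of the difference between the two means.
    let var_base = sd_base.powi(2) / n_base;
    let var_mcp = sd_mcp.powi(2) / n_mcp;
    let t = (mean_mcp - mean_base) / (var_base + var_mcp).sqrt();

    // Welch–Satterthwaite approximation of the degrees of freedom.
    let df = (var_base + var_mcp).powi(2)
        / (var_base.powi(2) / (n_base - 1.0) + var_mcp.powi(2) / (n_mcp - 1.0));

    // Prints roughly: t = 1.22, df = 7.7
    println!("t = {t:.2}, df = {df:.1}");
}
```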

Looking at the individual data, we see that one of the runs doesn’t use the MCP at all, and the rest seem to use it sporadically or inefficiently. Run 4, though, comes close to the baseline cost average. Is this a glimmer of hope? Can this actually work with a little more investment?

So what gives?

  • The MCP itself adds context, which increases cost and makes it less viable: its tool definitions come to around 8,500 tokens that need to be cached in the initial call, which raises the cost before any work happens.
  • Size of the task: we are only breaking down ~700 lines. There may not be enough meat on these bones, so the model just falls back on the solutions it already knows.
  • Is it hard for the model to identify actual components? How should it know to take lines 30–50? That doesn’t mean anything, does it? Should there be a way for it to move a block of code 🤔?

This isn’t the end of the story, it is just the start. Next up: seeing whether a larger codebase triggers more usage of the MCP, minimizing the token cost, and optimizing the actual content of the MCP to make sure that Claude will want to use it.