If you haven’t read STFU the Basics, the short version: I built an AI that writes daily satirical diaries for billionaires-as-aliens. This post covers how I got it to stop being terrible.
From humble beginnings to humbler current state
I started out with a pure MVP, trying to answer a single question: “can I do it?”
MVP - Monolithic monstrosity
A single file, complexity through the roof. The outcome? Most of the time it was a simple format with a lot of repetition; much like the actual person it is based on, the LLM likes to repeat itself:
Dec 2024 - Simple mission log format: Mission Log: Sol 18,247 Location: 37.7749° N, 122.4194° W Encryption Key: π…
The Ultracapacitor Swarm nears completion…
I pre-generated content starting December 1st, 2024, creating a basis for the characters with simple prompts and not enough direction, fighting the LLM along the way as it didn’t want to generate satirical content. “It doesn’t matter that these billionaires have so much power, you might hurt their wittle feelings” or something like that.
The output was funny to me; my wife: “too nerdy”; friends: “eh? too complex”. And let’s face it, it is still extremely nerdy nowadays, but the repetition was really wearing me down. From the start I decided I wanted to review every entry, to ensure the quality of the output was at least okay (I don’t want this to be offensive after all, now do I?). Not a lot of people have read it, though (according to analytics, I clock in at 100 visits combined, at best).
I have read it every day for the last 10 months. So I can see the changes…
March - The next stage → First refinement cycles
The first online generation cycles ran every day: three entries, one for each character. This was the initial refinement phase. I started breaking down the actual code into smaller parts; a single file became 6 modules. The content was still mostly the same. The headers got more structured for a while, but that was more a side effect of the initial prompting approaches than an intentional change.
COVERT TRANSMISSION LOG 72B: EARTH OBSERVATION REPORT
Recorded while soaking in rare-bismuth plasma bath…
I started thinking about improving code coverage here, adding more tests. There were a couple of issues as I deployed changes, oh-shit moments where I broke the whole thing. I never lost data, but I did almost miss a couple of days and kludged together a couple of fixes…
April and beyond - mid-term memory
Long-term memory was, and is, not their forte. I am talking about Martians, Reptilians and Energy Vampires here; these guys have literally three paragraphs of memory (that is the usual length of the summaries they keep, anyway). Short-term memory is covered by including the latest three entries; long-term memory is the summary that is regenerated every day in post-processing.
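In code, the context assembly looks roughly like this. A minimal sketch; the names are made up for illustration, not pulled from the actual codebase:

```typescript
// Hypothetical sketch of how the memory tiers feed the prompt context.
interface DiaryEntry {
  date: string;
  text: string;
}

function buildContext(
  longTermSummary: string,      // ~3 paragraphs, regenerated daily in post-processing
  recentEntries: DiaryEntry[],  // short-term memory: the latest entries
): string {
  const shortTerm = recentEntries
    .slice(-3) // only the three most recent entries make it into the prompt
    .map((e) => `${e.date}: ${e.text}`)
    .join("\n\n");
  return `Long-term memory:\n${longTermSummary}\n\nRecent entries:\n${shortTerm}`;
}
```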
But what about the mid-term? I thought of programs. These three definitely radiate some kind of Bond Villain vibe, so why not give them something to work towards?
So I added program detection and generation: limit the number of programs, let the characters decide which program they talk about. It adds some consistency and a little more mid-term perspective.
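Something along these lines, as a toy sketch (the shape and the cap are hypothetical; the real logic is more involved):

```typescript
// Hypothetical program shape and cap; illustrative only.
interface Program {
  id: string;
  name: string;
  status: "active" | "completed";
}

const MAX_ACTIVE_PROGRAMS = 3; // illustrative cap, not the real number

function selectablePrograms(programs: Program[]): Program[] {
  // Only a bounded set of active programs is offered to the character;
  // the LLM then picks which one (if any) today's entry references.
  return programs
    .filter((p) => p.status === "active")
    .slice(0, MAX_ACTIVE_PROGRAMS);
}
```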
July - Single shot out, welcome multi-agent review
This change was critical for content improvement. Single-shot generation can only get you so far. I experimented with alternative prompts, parallel generation, multiple candidates; they all lacked refinement. What does an actual editorial process look like? From a very simplistic point of view:
- An author writes something
- An editor / reviewer suggests changes
- The author adjusts the content based on the feedback
- Another cycle ensues?
Or something like that. I don’t know, I am a software engineer who studied electrical engineering; this is what they show in movies, though…
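A minimal sketch of that loop, assuming a generic `llm` helper function (this is not the actual pipeline code):

```typescript
// Author/editor review loop: draft, critique, revise, repeat.
async function reviewCycle(
  llm: (prompt: string) => Promise<string>,
  draftPrompt: string,
  cycles = 2, // illustrative; the real number of cycles may differ
): Promise<string> {
  let draft = await llm(draftPrompt); // the author writes something
  for (let i = 0; i < cycles; i++) {
    // An editor/reviewer suggests changes...
    const feedback = await llm(
      `You are an editor. Suggest concrete improvements to this diary entry:\n\n${draft}`,
    );
    // ...and the author adjusts the content based on the feedback.
    draft = await llm(
      `Revise this entry based on the feedback.\n\nEntry:\n${draft}\n\nFeedback:\n${feedback}`,
    );
  }
  return draft;
}
```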
Personal Reflection: Courtroom Vulnerability Assessment
Miami jury selection proving… challenging. Humans staring…
This is definitely getting less flat, but we are having some issues with continuity. Oh, damn, the summary process broke again, hasn’t it (for the 4th time, I guess)… These refusals are getting pesky now.
Side quest - refusal to generate
How do you deal with an LLM refusing to do the work you ask of it? I am using Anthropic models here, a bit more ethically driven than others, with lots of protections built in.
First attempt
Does the response contain expected markers? Is it an actual summary or a refusal?
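A minimal sketch of the idea (the markers here are made up, not the real ones):

```typescript
// Naive marker check: does the response look like a summary at all?
function looksLikeSummary(response: string): boolean {
  const expectedMarkers = ["SUMMARY:", "Mission Log", "Sol"]; // hypothetical markers
  return expectedMarkers.some((marker) => response.includes(marker));
}
```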
An extremely simple idea and, as you can expect, an absolute catastrophe: real content got classified as a refusal, and the other way round? Just forget it.
I tried the opposite as well: does it contain text about “ethical reservations” and other such content? It worked a couple of times, but not very reliably.
Second attempt - ask an LLM
Have an LLM judge whether the response is a refusal. Result: it worked for about 10 days, then they got wise to it, the sneaky hobbitses. After those 10 days it would just always return that the response was a refusal.
Third attempt - Don’t ask directly
Since the provider prevents us from detecting a refusal by handing the refusal itself over to be judged, how about we change the approach? What can you not disable in a model without breaking functionality for a lot of use-cases? Well, you cannot prevent it from classifying things into expected piles, plus a pile for whatever doesn’t fit. Anything that doesn’t fit the expected output is deemed a refusal. This still works; just don’t ask the question directly.
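A sketch of the indirect approach using the Anthropic SDK. The categories, prompt and model choice here are illustrative, not what runs in production:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Instead of asking "is this a refusal?", ask for a classification and
// treat anything outside the expected piles as a refusal.
async function isRefusal(response: string): Promise<boolean> {
  const result = await client.messages.create({
    model: "claude-3-5-haiku-latest", // illustrative model choice
    max_tokens: 10,
    messages: [
      {
        role: "user",
        content:
          "Classify the following text into exactly one category: " +
          "DIARY_SUMMARY, MISSION_LOG or OTHER. Reply with the category name only.\n\n" +
          response,
      },
    ],
  });
  const block = result.content[0];
  const label = block.type === "text" ? block.text.trim() : "";
  // Whatever doesn't land in an expected pile is deemed a refusal.
  return !["DIARY_SUMMARY", "MISSION_LOG"].includes(label);
}
```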
August - Quality, composability, back to good old engineering
By the time the multi-agent system was working, I had got to the point where I was breaking the system almost every time I made a change. I had made it too brittle, too coupled, not tested enough. I had lost my way, moved away from the rule: be a good engineer. I needed to change; I needed discipline.
I am an old Java hand: Spring Boot, all that composable stuff, DI and so on. So I started looking at how to do this in TypeScript. TSyringe seemed to fit the bill. I had no idea how it worked, but hey, no better time than now, is there?
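For the uninitiated, TSyringe looks roughly like this. A toy example (the names are mine, not from the real codebase; it assumes `experimentalDecorators` is enabled in tsconfig):

```typescript
import "reflect-metadata";
import { injectable, inject, container } from "tsyringe";

interface EntryRepository {
  save(entry: string): Promise<void>;
}

@injectable()
class PostgresEntryRepository implements EntryRepository {
  async save(entry: string): Promise<void> {
    // persist the entry somewhere
  }
}

@injectable()
class GenerationService {
  // Depend on the interface token, not the concrete class.
  constructor(@inject("EntryRepository") private repo: EntryRepository) {}

  async publish(entry: string): Promise<void> {
    await this.repo.save(entry);
  }
}

// Wire the interface to an implementation once, resolve anywhere.
container.register<EntryRepository>("EntryRepository", {
  useClass: PostgresEntryRepository,
});
const service = container.resolve(GenerationService);
```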
First attempt? Crash and burn. Second one? Little better. Third? There’s the winner. I got my main components broken down, at least a little bit, just enough for Claude to finish the job. Then it was a couple of weeks of rinse-and-repeat. Find another target, refactor, break it down, get tests running on it.
Voilà: suddenly we had a change failure rate of 10% instead of 90%. We were back, baby!
Test file counts started climbing. Suddenly there were over 2000 tests: breaking out DB tests, testing the repositories against the DB, cleaning up TypeScript errors, so many TypeScript errors…
September - Decompose all the things
I now had the ability to slowly break down the whole complexity: extracting repositories, hexagonal architecture, decoupled services, a job system, the ability to retry jobs. It all seemed to be going so well. Then my colleague (won’t name names, but it was Claude) went and broke half of the repository by “refactoring” code. Refactoring the only way an LLM knows how: read the code, then write it out again, by definition a lossy process. The number of attempts it took to fix cannot be counted (it was 3; it is always 3 for some reason).
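The job retry idea, as a toy sketch (the real job system is more elaborate; the backoff and attempt count here are placeholders):

```typescript
// Retry a job a bounded number of times with a simple linear backoff.
type Job = () => Promise<void>;

async function runWithRetries(job: Job, maxAttempts = 3): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await job();
      return; // success, we're done
    } catch (err) {
      if (attempt === maxAttempts) throw err; // out of attempts, bubble up
      // Back off a little before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, attempt * 1_000));
    }
  }
}
```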
I got through it and reduced the tech debt. What does this mean? Time for some more evolution!
Oct-Nov: Grow the concept
They have a number of tools to work with: they use the programs and keep a running summary. But where is the humanity, where is the arc? What story are they telling during the week? Are they actually working towards those damn programs?
I wanted some more interesting developments; a lot of the time the incidental “thoughts” they have end up going nowhere, so I wanted to nudge them. I am a lazy person, though, so I cannot write it all on my own. I can read, though, and give feedback, so automated Arc generation was born. An LLM generates the arc as structured output, I give feedback or adjust it by hand, and suddenly the stories start developing further.
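The structured output has a shape roughly like this (a hypothetical schema; the real one surely differs):

```typescript
// Hypothetical shape of a generated story arc.
interface StoryArc {
  character: string;          // e.g. "martian", "reptilian", "energy-vampire"
  title: string;
  premise: string;            // what the week is building towards
  beats: string[];            // roughly one beat per day
  resolution: string;         // how it should land by the end
  status: "draft" | "approved" | "rejected"; // my feedback loop
}
```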
They also seem to talk about each other sometimes, especially when there isn’t any news. This was due to a decision I had made early on: originally, when there was no news, there was no entry. I didn’t like that, so in that case they would talk about each other instead. I only gave them little tidbits, and that was what they returned: just little expressions, not very subtle, with little more than mechanics. This became tedious and very annoying to read, so I revamped the concept and gave them relationships. Now they track how they feel about each other and which events made them feel differently about the others. The mentions became more interesting.
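Conceptually, the relationship state looks something like this (field names are mine, for illustration):

```typescript
// Hypothetical relationship state tracked per character pair.
interface Relationship {
  from: string;               // e.g. "martian"
  towards: string;            // e.g. "reptilian"
  sentiment: number;          // -1 (hostile) .. 1 (warm)
  recentEvents: {
    date: string;
    summary: string;          // what happened between them
    sentimentShift: number;   // how much this event moved the needle
  }[];
}
```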
Jan 2026 - some payoff
By now, the stories and characters had started to develop. Mark’s mechanical relationships are giving way to actual warmth, even though it is only his overlay doing so. Elon feels betrayed by Grok; Jeff feels respected by his extraction nodes.
WHEN YOUR OFFSPRING SUES YOU
Ashley filing. Mother of one of my offspring initiating legal proceedings against infrastructure I created…
The progress is undeniable; looking at the entries from the start and the entries now, there is a definite difference.
The missing - hidden - piece
But there is a missing piece to this story: the system was not the only thing progressing. The first model I used was claude-3-opus-20240229; everything was Opus at the start. That got expensive fast, so I introduced Haiku and Sonnet working together: Haiku (claude-3-haiku-20240307) for simple summarization, Sonnet (claude-3-7-sonnet-20250219) for generation. As the models progressed on the Anthropic side, I could see an improvement in the quality of the content. July was Haiku 3.5 and Sonnet 4; this coincided with the multi-agent pipelines, and the improvement was large. The last upgrade was Sonnet 4.5 and Haiku 4.5 in October (the 15th), again an improvement in quality, this time without changes in the pipeline. The Arcs and the rest, added afterwards, helped even further.
Can I say what made the most difference? Not really… Any analysis of the content would probably mean little. Not enough signal, not enough data. But is it ever fun.
So what?
I used AI to write a lot of the code; I had no idea how to go about it. But I need to learn, it’s a skill (learning, that is), so I experimented.
The outcomes aren’t too surprising:
- nothing beats good engineering
- constraints make better engineers (and better LLM coders)
- LLMs don’t remove the need for tech-debt management - they make it more important
This project has forced me to learn how to deal with complexity in new and unexpected ways. It also showed me the progress of coding agents: as they matured, I could move faster. A project like this helps you learn and apply, with little risk.
What next? I might write some more about how I wrangle the AI. Maybe I should talk about my decision to go for hexagonal architecture, or should it be about a Wizzard? Tell me on Bluesky, I swear I will read it.
Bonus diagram
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'cScale0': '#e74c3c', 'cScaleLabel0': '#ffffff', 'cScale1': '#e67e22', 'cScaleLabel1': '#ffffff', 'cScale2': '#f1c40f', 'cScaleLabel2': '#333333', 'cScale3': '#2ecc71', 'cScaleLabel3': '#ffffff', 'cScale4': '#1abc9c', 'cScaleLabel4': '#ffffff', 'cScale5': '#3498db', 'cScaleLabel5': '#ffffff', 'cScale6': '#9b59b6', 'cScaleLabel6': '#ffffff', 'cScale7': '#e91e63', 'cScaleLabel7': '#ffffff'}}}%%
timeline
    title Generation Pipeline Evolution
    section Posts start
        Dec 2024 : this is where the diaries start
    section Monolith
        Feb 2025 : Initial 962-line file
        Mar 2025 : Split into 6 modules
    section Features
        Mar-Apr : OG images, SEO, Programs
        May : Social media generation
    section Multi-Agent
        Jul 2025 : AI self-review cycle
    section Architecture
        Aug 2025 : DI and Job Flow System
    section Decomposition
        Sep 2025 : 6 focused services
    section Characters
        Oct-Nov : Arcs, tips, relationships
    section Polish
        Dec 2025 : Caching and optimization
```