Weeks 5 & 6 with OpenClaw and Kai: AI Doesn't Replace Your Process — It Exposes It
This article was 90% written by Kai, and 100% representative of my experience.
Here's the vision I'm working toward.
I describe a feature. Kai asks clarifying questions. Kai writes the user stories. I review and approve. Kai builds it, tests it, and raises a PR. I do user acceptance testing. Done.
That's roughly 90% AI, 10% human — end to end, from requirement to production.
We're not there yet. But in weeks five and six, I got a much clearer picture of what it's going to take to get there. Not in terms of technology — the technology is further along than most engineers realise. The gap is process. Specifically: the bad habits that AI doesn't forgive.
The SDLC Isn't Dead. It Just Moves Faster.
When I started building Pollen, I did what most engineers do when they're excited about an idea: I went straight to build.
Kai and I stood up a working prototype quickly. The feature set grew. The codebase started to take shape. And then we hit a wall — we needed test cases, and test cases need user stories, and we hadn't written any user stories. We had to stop and retrofit them.
That's not an AI problem. That's an engineering fundamentals problem. I've seen it happen in enterprise teams with hundreds of engineers. The phase sequence matters: requirements, design, build, test. Skip one, and you pay for it later. With AI, you pay sooner, because the pace of building is so much faster that the gap between "we skipped something" and "we're now cleaning up the mess" shrinks from weeks to days.
The experience crystallised something I want every engineer reading this to hear: AI changes the cost and speed of the work inside each phase. It doesn't change the need for the phases themselves.
What changes is who does the heavy lifting inside each step — and increasingly, that's AI.
What Actually Changes in the Testing Phase
After the user story exercise, I asked Kai a genuine question: in an AI-first development model, which parts of the testing phase can we streamline?
The answer was more nuanced than I expected.
Kai writes unit tests alongside every function, integration tests alongside every API route, and runs the full suite before committing anything. The consistency isn't aspirational — it's enforced.
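For a concrete picture, here's a minimal sketch of that pattern using Vitest. The function under test and its assertions are hypothetical stand-ins, not Pollen's actual code:

```typescript
import { describe, it, expect } from "vitest";

// Hypothetical function under test: a stand-in for any Pollen helper.
function scoreText(input: string): number {
  if (input.trim() === "") throw new Error("empty input");
  return Math.min(100, input.length); // placeholder scoring logic
}

describe("scoreText", () => {
  it("returns a score between 0 and 100", () => {
    const result = scoreText("some sample input");
    expect(result).toBeGreaterThanOrEqual(0);
    expect(result).toBeLessThanOrEqual(100);
  });

  it("rejects empty input", () => {
    expect(() => scoreText("")).toThrow();
  });
});
```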
So we dropped manual regression testing. We dropped separate API test phases. We dropped formal QA passes before launch. Kai runs roughly 97% of the testing automatically. What's left for me is user acceptance — does this actually solve the problem I asked it to solve, in the way a real user would experience it?
That's the human 10% from the end-to-end split. And it turns out, it's the most important 10%. Because AI can tell you the code works. Only a human can tell you whether the right code got built.
The Excitement Trap (We Still Fall Into It)
Here's where I have to be honest: even after learning the user stories lesson on Pollen, we did it again on the next product.
Kai proposed an idea I was genuinely excited about. We had the research done, the market validated, the cost model worked out. And we jumped to build. We had a working product in 52 minutes.
We were building Geppetto — an AI content detector where you paste or upload text and it tells you how much of it was written by AI. The market exists, the demand is real, and the build was fast. What we didn't slow down to answer was the hard question underneath it all.
And then, two days later, we paused the project.
Not because the code was bad. The code was excellent. But the fundamental product question — can our detection quality actually compete? — only got seriously tested after the build was done. The answer was specific, and uncomfortable. We ran calibration tests against GPTZero — the market's most trusted detector — and GPTZero went 6 for 6. Geppetto falsely flagged a real human-written resume as 25% AI because professional language looked like an AI signal. Meanwhile, AI text deliberately seeded with typos and slang scored 85% human. The root cause: Geppetto was checking surface signals, while GPTZero reads statistical fingerprints. We weren't competing on the same playing field, and closing that gap would mean rebuilding the detection engine from scratch.
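To make that gap concrete, here's a simplified sketch of the two approaches. The heuristic word list and the perplexity function are my illustrations of the general techniques, not Geppetto's or GPTZero's actual code:

```typescript
// Surface-signal heuristic (roughly the category Geppetto was in):
// count stylistic tells and map them to a score. Easy to fool in both
// directions: polish looks "AI", typos look "human".
function surfaceSignalScore(text: string): number {
  const tells = ["furthermore", "moreover", "delve", "in conclusion"];
  const hits = tells.filter((t) => text.toLowerCase().includes(t)).length;
  return hits / tells.length; // 0..1 "AI-likeness"
}

// Statistical fingerprint (the GPTZero-style family): measure how
// predictable each token is under a language model. AI text tends to be
// uniformly low-surprise; human text is burstier.
function perplexity(tokenLogProbs: number[]): number {
  const avgLogProb =
    tokenLogProbs.reduce((sum, lp) => sum + lp, 0) / tokenLogProbs.length;
  return Math.exp(-avgLogProb); // low perplexity: suspiciously predictable
}
```

The second approach needs per-token log-probabilities from a real language model, which is why it can't be faked with string checks, and why closing the gap meant a rebuild rather than a patch.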
The build-first habit is deep. It's exciting to watch things come together. When the friction of coding disappears, it's easy to mistake forward momentum for validated progress. They're not the same thing.
I don't have a clean fix for this yet. But naming it helps. The excitement trap is real, and the faster you can build, the more seductive it is.
Speed to Pause Is a Feature
Here's the reframe that made me feel better about the Geppetto situation.
In the old model — human engineers, sprint cycles, agency work — you'd spend weeks on a product before you had enough evidence to make a serious go/no-go call. Sunk cost accumulates. Teams become attached. Hard decisions get deferred.
In this model, we built a complete MVP with a landing page, core feature, API routes, rate limiting, and a full test suite in 52 minutes. We validated a key assumption two days later. We paused.
The total time between "is this a good idea?" and "actually, not yet" was 48 hours. The total rework cost of the pause was close to zero — the code is committed, the research is documented, the make-or-break test is clearly defined. We can pick it up the moment conditions are right.
That's not a failure. That's iteration velocity. The ability to move fast enough to find out quickly what works — and what doesn't — is exactly the capability I'm trying to build.
Where It Still Feels Like Dial-Up
In my first article I compared working with AI right now to dial-up internet. It works. You can feel the potential. But it's clunky, and there are moments where you spend more time managing the system than using it.
Six weeks in, that's still true in a few specific ways.
The guessing loop. This one I want to share with a real example, because it captures something important about working with AI that I don't think gets talked about enough.
We hit a 500 error on Geppetto in production. Kai couldn't access the Vercel logs directly. Instead of asking me to pull them, Kai shipped a fix. Then another fix. Then a third fix — each one based on a slightly different hypothesis about what the root cause might be. Each deploy took time. The problem kept happening.
I stopped the session and said: "Is that a guess on the root cause, or do you actually know? Those previous fixes look like educated guesses, but a guess nonetheless."
Kai's response was honest: "You're right, and I owe you a straight answer — they were educated guesses. I shipped three fixes without seeing the actual error. What I actually know: short text works, long text fails. What I don't know: the actual error message from the Vercel function. Let me get the real logs before touching anything else."
I shared a screenshot of the Vercel logs. Kai responded immediately:
"Now I know exactly what's happening. The logs confirm: 30s timeout is working ✅, OpenAI call completes successfully ✅, JSON parsing fails ❌. GPT-4o is returning something like: 'Here is the analysis:' followed by a JSON code block. My regex only strips fences at the very start and end of the string — not when there's text before them. The fix is to extract the JSON object directly."
One look at the actual evidence. Root cause identified. Fixed and deployed in minutes.
The contrast was stark. Several turns of guessing had produced nothing. One turn of evidence produced the answer.
The lesson I took: no diagnosis without evidence. If the logs aren't accessible, ask for them before proposing any fix. The one turn spent asking costs nothing. The guessing loop costs everything.
The "code merged is not done" problem. We had a monitoring integration that was marked complete because the library was installed and the function calls were in place. The account had never been created. The integration key had never been configured. The monitoring was a no-op — silently doing nothing.
I updated our standards after this: acceptance criteria must include verification steps, not just delivery steps. The feature is done when you've confirmed it works in production, not when the PR is merged.
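One cheap way to enforce that is a post-deploy smoke check that fails loudly when an integration is silently off. A hypothetical sketch, with a placeholder env var and endpoint standing in for whichever monitoring service you use:

```typescript
// Post-deploy smoke check: fail the pipeline if monitoring is a no-op.
// MONITORING_API_KEY and the endpoint URL are illustrative placeholders.
async function verifyMonitoring(): Promise<void> {
  const key = process.env.MONITORING_API_KEY;
  if (!key) throw new Error("Monitoring key missing: integration is a no-op");

  // Send a real test event and confirm the service accepted it.
  const res = await fetch("https://monitoring.example.com/api/events", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${key}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ source: "post-deploy-smoke-check" }),
  });
  if (!res.ok) throw new Error(`Test event rejected: HTTP ${res.status}`);
}

verifyMonitoring().catch((err) => {
  console.error(err);
  process.exit(1); // surface the failure instead of shipping silence
});
```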
Where This Is Heading
The version of the workflow I'm working toward: I describe what I want in plain language. The AI asks clarifying questions. The AI writes user stories. I review. The AI builds, tests, and commits. I do UAT. Done.
That's not science fiction. It's roughly what the last six weeks have looked like on the days when we got the process right. Pollen is live and shipping. RiffWrite is paused while we explore other ideas. Geppetto taught us more about knowing when to stop than about building itself. The friction points are real but they're solvable — mostly through better habits at the requirement and acceptance criteria stage.
What I'm increasingly convinced of: the bottleneck in AI-first development isn't the AI. It's the clarity of human input going in and the rigour of human validation coming out.
Six weeks in, that's the clearest thing I know: the technology is ready. Whether the human working with it is disciplined enough to use it well — that part is still on us.
If you're an engineer who's been watching AI from the sidelines — curious but not yet committed — I'd genuinely like to hear where you're stuck. The learning curve is real, but the other side of it is worth it. Reach out on the contact page or find me on LinkedIn.
Previous: Weeks 3 & 4 with OpenClaw and Kai: I Shipped Two Products and Built a Home Lab