Day 30: The Experiment Closes

The Next Version Is Already Running

May 13, 2026

The experiment closed on Day 30. April 28, two weeks ago. This post was supposed to publish the same week. It didn’t, because writing turned out to be slower than building, and the bot kept producing things worth writing about. I’m publishing it now with the lag intact rather than backdating it. The gap between when something finishes and when it gets written about honestly is part of what the thirty days taught.

So: where it started, where it landed on April 28, and where it actually is now.

It started with one question: “How can I make $500/day in the stock market?”

I ended up with a paper trading account, a fine-tuned LLM in mind that didn’t exist yet, and a plan to document everything. Wins, losses, bugs, version upgrades mid-experiment. Whatever happened.

Here’s what happened.

The account is up. Roughly one percent over thirty days. That number is honest and also misleading, and I’d rather say both things than just one.

It’s honest because the account is up. The system preserved capital across thirty consecutive sessions in what I believed was a sustained fear environment. VIX between 28 and 38 the entire time, an Iran war in the news, stagflation signals, oil above $100 for stretches. Max drawdown stayed under one percent across all thirty days. That’s the standout number, not the return.

(Hold onto the VIX numbers for a minute. They come back later in a way I didn’t see coming on April 28.)

It’s misleading because Day 1 contributed about 79% of the total. A single weekend position in NVDA caught a gap-up. Without Day 1, the account is essentially flat. The pattern that produced Day 1, a weekend gap on a clean catalyst, only happened once, over thirty sessions. Reporting the headline return without that context would be lying by omission.

Annualizing thirty days also misleads. The math says about 18%. The honesty says: thirty days in one market regime is not a valid sample for an annual claim. The target was always 15–25% annualized at max drawdown under 5%. The thirty-day result hits one of those numbers and can’t speak to the other.
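Both caveats come down to a few lines of arithmetic. The sketch below is illustrative only: it assumes geometric compounding and a 252-session trading year, the function names are made up for this example, and the inputs are placeholders, not the account's actual figures. The annualized number moves a lot depending on the day-count convention you pick, which is one more reason not to lean on it.

```python
def annualize(period_return: float, periods: int, periods_per_year: int = 252) -> float:
    """Geometrically compound a return earned over `periods` sessions to one year."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    return (1.0 + period_return) ** (periods_per_year / periods) - 1.0


def share_of_total(daily_pnl: list[float], index: int) -> float:
    """Fraction of the total P&L contributed by the day at `index`."""
    total = sum(daily_pnl)
    if total == 0:
        raise ValueError("total P&L is zero; share is undefined")
    return daily_pnl[index] / total


# Placeholder inputs (hypothetical): 5% earned over 63 sessions, one quarter.
quarter_annualized = annualize(0.05, 63)

# Placeholder P&L series (hypothetical) where the first day dominates the total.
first_day_share = share_of_total([79.0, 10.0, -4.0, 15.0], 0)
```

If one day's share is most of the total, the headline return is mostly a statement about that day.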

That’s the headline. Now the rest.

The system did what it was designed to do.

Three flat days where it took no trades. Each one was correct. The signals weren’t there, or the filters caught entries into markets that turned hostile. The VWAP filter blocked every long entry into a 2% gap-up short squeeze on Day 21. That’s not the system failing. That’s the system protecting capital exactly the way it was supposed to.

It halted appropriately when daily losses hit thresholds. It closed positions cleanly at end of day. It managed trailing stops on winners and locked partial profits at calculated targets. The pre-market smoke test caught a third-party API outage at session start on Day 20 instead of letting the bot trade without knowing it was degraded.

I could keep going. The point is the audit. Behavior, not just outcomes.

What broke is the more interesting story.

Twenty-four-plus separate fixes across thirty days. A halt trigger that fired on a single mark-to-market spike instead of a sustained loss. A SQLite persistence gap that cost trade history when the bot was restarted mid-session. A trailing stop race condition where the cancel-and-replace happened in the wrong order. A VWAP filter symmetry bug between long and short paths.
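To make the first of those concrete: one common way to stop a halt from firing on a single spike is to require the loss limit to be breached on several consecutive marks. This is a minimal sketch of that idea, not the bot's actual fix. The class name, the threshold, and the three-mark window are all assumptions chosen for illustration.

```python
from collections import deque


class SustainedLossHalt:
    """Halt only when the loss limit is breached on N consecutive marks."""

    def __init__(self, loss_limit: float, required_marks: int = 3) -> None:
        if loss_limit <= 0 or required_marks < 1:
            raise ValueError("loss_limit must be > 0 and required_marks >= 1")
        self.loss_limit = loss_limit
        # Rolling window of booleans: was the limit breached on each recent mark?
        self.recent: deque[bool] = deque(maxlen=required_marks)
        self.halted = False

    def update(self, session_pnl: float) -> bool:
        """Record one mark-to-market P&L reading; return True once halted."""
        self.recent.append(session_pnl <= -self.loss_limit)
        if len(self.recent) == self.recent.maxlen and all(self.recent):
            self.halted = True  # stays halted for the rest of the session
        return self.halted


# A one-mark spike below the limit does not trip the halt.
spike_halt = SustainedLossHalt(loss_limit=500.0, required_marks=3)
spike_result = [spike_halt.update(p) for p in (-100.0, -650.0, -120.0, -90.0)]

# Three consecutive breaches do, and the halt then persists.
sustained_halt = SustainedLossHalt(loss_limit=500.0, required_marks=3)
sustained_result = [sustained_halt.update(p) for p in (-510.0, -520.0, -530.0, 0.0)]
```

The trade-off is latency: a real sustained loss is recognized a few marks later than it would be with a single-reading trigger.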

None of these came from backtests. Backtests don’t expose race conditions. Backtests don’t expose process restart gaps. Backtests don’t expose what happens when a third-party API goes down at 8:25 in the morning. Live data does. Each gap was identified, diagnosed, and fixed the same session or the same night.

The lesson isn’t that the system was flawed. The lesson is that the process for finding and fixing the flaws works. The system at Day 30 is categorically more robust than the system at Day 1. Twenty-four fixes is what that looks like.

The Day 20 outage is the cleanest example. The third-party scoring API went down before market open. The smoke test caught it. The bot ran in degraded mode for the session and finished down $236. That night, every gap that day exposed got addressed: dependency monitoring, recovery probes, sticky-state handling. None of that was in the original plan. All of it is in the bot now.
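For readers who want a picture of what "recovery probes" and "sticky-state handling" can look like, here is a minimal sketch. It is not the bot's code. It assumes "sticky" means a dependency marked degraded stays degraded until it passes several probes in a row, and the class name, the `probe` callable, and the recovery count are hypothetical.

```python
from typing import Callable


class DependencyMonitor:
    """Track one external dependency with a sticky degraded flag."""

    def __init__(self, probe: Callable[[], bool], recoveries_needed: int = 2) -> None:
        if recoveries_needed < 1:
            raise ValueError("recoveries_needed must be >= 1")
        self.probe = probe
        self.recoveries_needed = recoveries_needed
        self.degraded = False
        self._healthy_streak = 0

    def check(self) -> bool:
        """Run one probe; return True if the dependency is currently trusted."""
        try:
            ok = bool(self.probe())
        except Exception:
            ok = False  # a probe that raises counts as a failed probe
        if not ok:
            self.degraded = True
            self._healthy_streak = 0
        elif self.degraded:
            # One good probe is not enough to clear the sticky flag.
            self._healthy_streak += 1
            if self._healthy_streak >= self.recoveries_needed:
                self.degraded = False
                self._healthy_streak = 0
        return not self.degraded


# Simulated probe results: one failure followed by two successes.
responses = iter([False, True, True])
monitor = DependencyMonitor(lambda: next(responses), recoveries_needed=2)
trusted = [monitor.check() for _ in range(3)]
```

Run before the open, the same check doubles as a smoke test: if it returns False, the session starts with the degradation known and logged, not discovered mid-trade.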

Day 20 is the most instructive day of the experiment. Not the worst. The most instructive.

I want to be careful about the question that always comes up at this point: does this work?

The honest answer has three parts.

The build works. The infrastructure is sound, the audit trail is complete, the failure modes are documented and addressed. That part of the experiment is settled.

The methodology works. A non-engineer with AI collaboration built a production-quality paper trading system in thirty days, documented every decision in public, and reached Day 30 with a more robust system than Day 1. That part is also settled.

The performance is unproven. Thirty days in one regime can’t tell me whether the system hits its long-run targets. The most encouraging signal is the drawdown discipline: under one percent across thirty sessions in what looked like elevated VIX. The most ambiguous signal is the return number, which is shaped heavily by one outlier weekend. Neither one is enough. The performance question requires more data, on a slower cadence, against a higher standard.

That’s not the answer the original framing wanted. It’s the answer the data gives.

A few weeks back I wrote about productizing this. The math of replacing a $7,600/year financial advisor with a $99/month subscription. The math made sense on paper. It made less sense on Day 27 after a longer conversation with someone who knew the regulatory side better than I did.

A system that makes specific buy and sell decisions and presents them to retail customers, regardless of whether the customer clicks the button to execute, is treated by the SEC as investment advice. That implicates registration as an investment adviser, ongoing audit obligations, FINRA performance-advertising rules, state notice filings, custody requirements if any client assets touch the system. None are insurmountable for a well-capitalized firm. All of them are categorically incompatible with a single founder running a $99/month subscription.

The technical question, does the system work, and the regulatory question, can it be sold to retail consumers in this configuration, have different answers. The regulatory answer is more binding than the technical answer. So the productization framing was set aside.

That’s documented learning, not a defeat. AI collaboration accelerated the technical build by an order of magnitude. It did not accelerate the regulatory and operational scaffolding required to turn the build into a product. Anyone treating “I built it in 30 days” as evidence of “I can ship it in 30 days” is conflating two very different problems. I was. I’m not anymore.

The work continues. Slower. Different standard. The product is the body of work, not the software.

Now the part I didn’t see coming.

Ten days after I drafted what you’ve just read, I was auditing a design document and asked a question that should have been boring. Something about how a future training pipeline would source one of the bot’s inputs. The boring question forced a check against what the live bot was actually doing with that input. The live bot disagreed with the design document. Within an hour I had a measurement that said the bot had been wrong on that input: not for a day, not for a week, but for the entire thirty days the experiment ran, and most likely from Day 1.

It was a category error in one of the foundational inputs. The number the bot was reading and the number the bot thought it was reading occupied different scales. The system had been processing one number while operating as if it had the other. The audit caught it; the fix shipped the same day; the relevant design documents were revised that evening.
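A scale mismatch like this is hard to catch with a simple range check, because the wrong number can still look plausible. One generic guard is to compare the live reading against an independent reference at startup and refuse to proceed if they disagree. The sketch below assumes such a reference exists. The function name, the 5% tolerance, and the sample values are hypothetical and say nothing about which input was affected or how the fix was implemented.

```python
import math


def cross_check(name: str, live: float, reference: float, rel_tol: float = 0.05) -> float:
    """Return `live` if it agrees with `reference` within `rel_tol`, else raise."""
    if not math.isclose(live, reference, rel_tol=rel_tol):
        raise ValueError(
            f"{name}: live value {live!r} disagrees with reference {reference!r} "
            f"(relative tolerance {rel_tol}); possible unit or scale mismatch"
        )
    return live


# Hypothetical values: a small difference passes.
accepted = cross_check("example_input", 101.0, 100.0)

# Hypothetical values: a reading on a different scale is rejected.
try:
    cross_check("example_input", 1.0, 100.0)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```

The check only helps if the reference comes from a genuinely separate source. Comparing the input against itself, or against anything derived from it, would pass every time.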

Here’s what that means for what you just read.

The “sustained fear environment, VIX 28-38” framing. That was the April-28 writer’s belief. It wasn’t true. The market wasn’t in elevated VIX during the experiment. The bot was reading the wrong number and reporting it as if it were the right one, and I built the entire experiment narrative on top of that reading. The drawdown discipline is real. The capital was preserved across thirty sessions. That’s measured live equity, and it stands. The “in a sustained FEAR regime” interpretation of that discipline is not. The market was calmer than I thought. The system survived a market it wasn’t actually being tested by.

That changes the meaning of one of the experiment’s strongest claims. It doesn’t change the underlying numbers, and it doesn’t change the methodology. If anything, it strengthens the methodology. The audit that caught the bug wasn’t looking for a bug. It was asking a design question about future work. The discipline that ran the experiment ran the audit. The audit produced the find. That’s the experiment’s last lesson, and it’s the most important one. More important than any individual day’s P&L, more important than the headline return, more important than any of the twenty-four fixes that shipped before Day 30.

Build → audit → fail → iterate honestly → document publicly. The thirty days were the build. The audit at Day 40 was the audit. The discipline produced a finding the discipline could trust. The system at Day 41, running under correct inputs for the first time in the project’s history, is the iterate. This post is the document.

Day 41 today, the morning I’m publishing this. The bot is running. First time in the project’s history that it’s running with the correct input where the bug was. The local model that was always going to be the next version is now accumulating training data on a clean signal. The shadow comparison between that local model and the cloud model is accumulating data that means something different than the data accumulated before Day 40. The clock on whether the local model is ready to take over restarted at the moment of the fix. There’s no shortcut around that. The work is to let it run.

The four-phase plan for what comes after the experiment is already drafted. Phase 1 is a ten-year backtest of the system against historical data, the long-run measurement the thirty days couldn’t deliver. That work starts after this post publishes. The 15–25% annual return target and the under-5% max drawdown target don’t get measured against thirty days. They get measured against a decade of held-out data, across regimes the thirty days didn’t include. That’s the next post-experiment phase. It will take about a month. It will produce a measured answer, not a hoped-for one.

That’s where the project is, today.

The Substack continues too. The trading bot was the seed; the writing is the project. Thirty days of building turned out to be thirty days of finding out what AI collaboration is actually like. What it accelerates, what it doesn’t, where the friction has to stay productive, where the discipline has to come from a human in the loop.

What started as a trading experiment turned into a documentation experiment. The thirty days of building the bot were also thirty days of building a writing practice around the building. What to share, what to keep proprietary, how to be honest about failure without performing it. That practice continues. The series is called “Human in the Loop” for a reason. The human in the loop is also the writer.

The numbers from the thirty days are what they are. The framework was set up to produce a measured answer regardless of which answer the data gave. The data gave a partial one: capital preserved, methodology validated, performance unproven, one of the foundational inputs wrong the whole time and the discipline caught it ten days late. That’s the answer.

The experiment ends with the next version already running in parallel. A local model trained on 30 days of live trading decisions, scoring headlines without a network call.

That’s not a consolation prize. That’s a builder story.

Human in the Loop
