Case study: NurtureBoss

One team's path from a 66 percent date-error rate to 5 percent in a single error-analysis cycle.

NurtureBoss is the canonical short case study for what a working error-analysis loop looks like in production. The team built a real-estate AI assistant; the loop closed in a single cycle; a single failure category dropped by an order of magnitude. The vignette is from Hamel's field guide and is reproduced here as the worked example the rest of the error analysis section refers back to ^[1].

The product and the problem

NurtureBoss is an AI assistant for property managers. The bot handles inbound text-message inquiries from prospective tenants: scheduling tours, answering availability questions, gathering qualification information. The team had shipped a working v1 and was getting user complaints across a handful of conversational categories. The complaints read like a list of unrelated annoyances. There was no labeled set, no judge, no way to prioritize.

The cycle

The team ran one round of open coding on one hundred logged conversations. A domain expert read each one in roughly forty seconds, wrote a one-line note where something was wrong, and tagged the trace. Clustering produced a handful of failure categories; frequency times severity produced a clear top three ^[1].
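The frequency-times-severity ranking can be sketched in a few lines. This is a minimal illustration, not the team's actual tooling: the tag names, the counts, and the 1-to-3 severity scale are all invented for the example, and `rank_failure_modes` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical open-coding output: one tag per problematic trace.
# Counts are invented for illustration, not taken from the case study.
tags = (
    ["unkept_promise"] * 14
    + ["missed_handoff"] * 12
    + ["date_format"] * 7
    + ["tone"] * 6
    + ["wrong_unit_info"] * 4
)

# Hypothetical severity weights set by the domain expert
# (1 = annoyance, 3 = lost conversion).
severity = {
    "unkept_promise": 2,
    "missed_handoff": 2,
    "date_format": 3,
    "tone": 1,
    "wrong_unit_info": 2,
}


def rank_failure_modes(tags, severity, top_n=3):
    """Return the top_n (mode, count, score) rows, where score = count * severity."""
    counts = Counter(tags)
    scored = [(mode, n, n * severity[mode]) for mode, n in counts.items()]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:top_n]


for mode, count, score in rank_failure_modes(tags, severity):
    print(f"{mode}: count={count} score={score}")
```

The severity weight is what lets a less frequent but costly category, such as a failed booking, outrank a frequent annoyance.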

Three categories accounted for sixty percent of all observed problems. The largest was conversational-flow promises the bot did not keep ("I'll check on that and get back to you" with no follow-through). The second was misinterpretation of handoff requests (user says "transfer me to a human" and the bot keeps responding). The third was a tight cluster of date-format errors: the model produced dates the downstream calendar API could not parse, so a meaningful fraction of scheduled tours never made it onto the calendar.

The fix and the result

The date-format errors were both frequent and high-severity: a date the calendar cannot parse means the tour is never booked, and the tour is the entire conversion event. The fix was a targeted prompt change that specified the date format the calendar expected, plus a small post-processing pass that validated date strings before calling the calendar tool ^[1].
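A post-processing pass of this kind might look like the sketch below. The source does not say which format the calendar expected, so the ISO-style `CALENDAR_FORMAT` is an assumption, and `parse_tour_time`, `schedule_tour`, and the `book` callback are hypothetical names standing in for the real calendar tool.

```python
from datetime import datetime
from typing import Callable, Optional

# Assumption: the calendar API expects timestamps like "2024-06-03T14:30".
# The real format is not given in the case study.
CALENDAR_FORMAT = "%Y-%m-%dT%H:%M"


def parse_tour_time(raw: str) -> Optional[datetime]:
    """Return a datetime if raw matches the expected format, else None."""
    try:
        return datetime.strptime(raw.strip(), CALENDAR_FORMAT)
    except ValueError:
        return None


def schedule_tour(raw_time: str, book: Callable[[str], None]) -> str:
    """Validate the model's date string before it reaches the calendar tool."""
    parsed = parse_tour_time(raw_time)
    if parsed is None:
        # Do not call the calendar; the caller can re-prompt the model instead.
        return "needs_retry"
    # Re-emit in the canonical format so the calendar never sees a variant.
    book(parsed.strftime(CALENDAR_FORMAT))
    return "booked"
```

Returning a status instead of raising lets the conversation layer decide whether to ask the model to restate the time or to hand off to a human.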

By the next cycle, the date-format failure rate had dropped from roughly sixty-six percent of date-related interactions to roughly five percent. The team did not change the model or switch vendors. They read traces, named the failure mode specifically, and made one targeted change ^[2].

Why the cycle worked

Three properties of the loop generalize.

Specificity. The label was not "hallucination" or "formatting issue". It was "model produces dates the calendar cannot parse". The fix was unambiguous because the label was unambiguous.

Domain expert in the lead, not the engineer. The property manager who knew which conversations counted as successful tours was the one reading traces. The engineer built the viewer; the product side owned the rubric ^[1].

The dataset became a substrate. The labeled hundred traces became the seed for a stage-2 judge on each top category. Future regressions get caught at offline eval before merge, not in a user complaint three weeks later ^[3].
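One way to picture the labeled set acting as a pre-merge gate is the sketch below. It is an assumption-heavy illustration: the trace fields, the `promise_judge` stand-in (a plain rule rather than an LLM judge), and the ten percent threshold are all hypothetical.

```python
from typing import Callable, Iterable


def failure_rate(traces: Iterable[dict], judge: Callable[[dict], bool]) -> float:
    """Fraction of traces that the judge flags as failing."""
    traces = list(traces)
    if not traces:
        raise ValueError("regression set is empty")
    return sum(1 for t in traces if judge(t)) / len(traces)


def gate(traces, judge, max_failure_rate: float = 0.10) -> bool:
    """Return True if the build may merge, False if it should be blocked."""
    return failure_rate(traces, judge) <= max_failure_rate


def promise_judge(trace: dict) -> bool:
    """Stand-in judge: flag a trace where the bot promised a follow-up and never sent one."""
    return trace["promised_follow_up"] and not trace["followed_up"]


# Hypothetical regression set; a real one would be the labeled traces.
example_traces = [
    {"promised_follow_up": True, "followed_up": False},
    {"promised_follow_up": True, "followed_up": True},
    {"promised_follow_up": False, "followed_up": False},
    {"promised_follow_up": False, "followed_up": False},
]

if not gate(example_traces, promise_judge):
    print("blocked: failure rate above threshold")
```

In CI, the boolean from `gate` would typically become the exit code of the eval job, so a regression fails the build instead of shipping.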

What the vignette does not say

The case is a clean win because the bug had a clean root cause. Not every cycle produces an order-of-magnitude drop. What it shows is the lower bound of effort that can produce a real result: one expert, one afternoon, one hundred traces, three categories, one targeted fix. Teams without an eval program are not failing because the work is hard; they are failing because it has not been put on a calendar.

The team's job was not finished after one cycle. The top failure mode shifts; cycle one's second-largest becomes cycle two's largest. The discipline is the cadence, not the heroics.

How to run your version

Pull one hundred traces from the segment where complaints are loudest. Have a domain expert write one-line notes. Cluster into four to seven candidate failure modes. Pick the top three by frequency times severity. Ship one targeted fix per mode. Re-pull, re-count. Promote the labeled set into a stage-2 offline eval so the next regression blocks at CI rather than reaching a user ^[2]. First cycle: one person-week. After that: ten to twenty traces and an hour per week, plus a hundred-trace rebaseline once a quarter.
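The re-pull, re-count step amounts to comparing per-mode failure rates across two cycles. The sketch below assumes each cycle is summarized as failures over relevant interactions; the counts are invented so that they reproduce the rates reported in the case study (roughly 66 percent and 5 percent), and `compare_cycles` is a hypothetical helper.

```python
def rate(failures: int, total: int) -> float:
    """Failure rate among the interactions where the mode could occur."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failures / total


def compare_cycles(before: dict, after: dict) -> dict:
    """Map each mode to (rate_before, rate_after, relative_reduction).

    before/after map a mode name to (failures, relevant_interactions).
    Only modes present in both cycles are compared.
    """
    report = {}
    for mode in before:
        if mode not in after:
            continue
        b = rate(*before[mode])
        a = rate(*after[mode])
        reduction = (b - a) / b if b > 0 else 0.0
        report[mode] = (b, a, reduction)
    return report


# Invented counts chosen to match the reported rates, not real data.
cycle_one = {"date_format": (33, 50)}
cycle_two = {"date_format": (3, 60)}

for mode, (b, a, red) in compare_cycles(cycle_one, cycle_two).items():
    print(f"{mode}: {b:.0%} -> {a:.0%} ({red:.0%} relative reduction)")
```

Using the relevant interactions as the denominator, rather than all traces, keeps the rate comparable when the mix of conversations shifts between pulls.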

That stage-2 judge is where the labeled traces start paying off at scale, and building one that actually agrees with your domain expert is the subject of LLM-as-Judge.