Rethinking What “It Works” Means

The biggest lesson from transitioning into AI as a PM: “working” is no longer binary.

In traditional software, PRDs define requirements, engineering writes test cases, and you validate against clear pass/fail criteria. That works when “does it work?” has a yes or no answer.

AI doesn’t behave that way. It can pass predefined test cases and still fail in production because inputs are unpredictable and “correct” is often a spectrum, not a checkbox.

That’s when I realized evals aren’t the AI version of QA, something you bolt on after building the product. They’re the foundation: the thing you design first, because without them you genuinely don’t know what you’re building toward. Without evals, you’re essentially flying blind with a well-prompted prototype.

Working on conversational AI made this painfully concrete for me. A chatbot can retrieve the right data, call the right API, and still deliver a response that’s useless to the user. A voice agent can have great pick-up rates but fail because the conversation breaks at turn three. Aggregate metrics tell you something is wrong. Evals help you understand where and why.

That meant learning to evaluate at every layer of the conversation: was the right intent captured? Were the right tools called? Was the response actually useful in context? And doing this not once at deployment, but continuously, on real production interactions.
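The layered checks above can be sketched in a few lines. This is a minimal illustration, not a real eval framework: the `Turn` fields and expected/predicted labels are hypothetical names for whatever your logging actually captures.

```python
# Sketch: scoring one logged conversation turn at three layers,
# so a failure localizes to intent, tool use, or the response itself.
from dataclasses import dataclass

@dataclass
class Turn:
    expected_intent: str
    predicted_intent: str
    expected_tools: list
    called_tools: list
    response_useful: bool  # e.g. a human label or an LLM-judge verdict

def eval_turn(turn: Turn) -> dict:
    """Return a pass/fail verdict per layer of the conversation."""
    return {
        "intent": turn.predicted_intent == turn.expected_intent,
        "tools": set(turn.called_tools) == set(turn.expected_tools),
        "response": turn.response_useful,
    }

# A turn can get intent and tools right yet still fail on usefulness:
turn = Turn("refund_status", "refund_status",
            ["lookup_order"], ["lookup_order"], False)
print(eval_turn(turn))  # {'intent': True, 'tools': True, 'response': False}
```

Run over a stream of production turns, per-layer pass rates tell you where the system breaks, which aggregate metrics alone can’t.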

I also learned that off-the-shelf evals give you a starting point, but they rarely capture what matters in your specific product context. Defining what “good” looks like for your users, edge cases, and failure modes is where most of the actual product thinking now lives.

Some of this can be measured with fixed metrics. A lot can’t, which is where approaches like LLM-as-a-judge help add qualitative judgment. And then there’s model drift, quietly degrading performance over time, making continuous monitoring non-negotiable.
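For the qualitative side, an LLM-as-a-judge check is mostly a rubric plus strict parsing of the verdict. A minimal sketch, where `call_llm` stands in for whichever model client you use (it is not a real API), and the rubric wording is illustrative:

```python
# Sketch of an LLM-as-a-judge check: the rubric and the strict
# verdict parsing are the substance; the model call is a placeholder.
JUDGE_PROMPT = """You are grading a support-bot reply.
Question: {question}
Reply: {reply}
Rubric: the reply must directly answer the question and must not
state information that was absent from the retrieved data.
Answer with exactly PASS or FAIL, then one sentence of reasoning."""

def build_judge_prompt(question: str, reply: str) -> str:
    return JUDGE_PROMPT.format(question=question, reply=reply)

def parse_verdict(judge_output: str) -> bool:
    # Keep parsing strict: anything that isn't a clear PASS counts as FAIL.
    return judge_output.strip().upper().startswith("PASS")

# Usage (model call elided):
#   verdict = parse_verdict(call_llm(build_judge_prompt(q, r)))
print(parse_verdict("PASS - the reply answers the question."))  # True
print(parse_verdict("FAIL: hallucinated an order number."))     # False
```

Tracking judge verdicts over time is also one way to surface drift: a slow decline in pass rate on a fixed sample set is a signal worth alerting on.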

All of this fundamentally changed what I deliver to engineering. It’s no longer a one-time product spec. It’s an ongoing system of defining, measuring, and refining what “working” actually means.

One thing building AI products has made clear: teams that define quality upfront don’t just ship better products. They move faster, because everyone is aligned on what they’re optimizing for and how to iterate toward it.
