Observability

Years ago, before “AI” meant chatbots and before anyone said the word “eval” …

I reread The Phoenix Project last month. I do this every year or two — it’s one of those books …

Say you’ve built an LLM judge and you want to know if it’s any good. The obvious move is …

Ask most people what an eval is for and you’ll get some version of “testing”. You …

Last time I pulled apart the eval prompt: the role, criteria, rubric and examples you write to tell …

When people decide to use an LLM as a judge, the prompt they reach for first is almost always some …

The moment you start tracing an AI app, three words turn up everywhere: span, trace, session. People …

There’s a version of building an AI app that goes like this. You build the thing, you get it …

Someone asked me last week how you actually run an eval on an AI app. I gave the honest answer, …

Here’s how most people build an eval. They open a file, write an LLM judge prompt that says …