Senior PM (Societe Generale, CFA) shipping eval-first AI products end to end — from grounded RAG to multi-repo anomaly detection across the Nasdaq-100.
AI that can't cite its source is a liability. I build with provenance first, then put an eval suite around it so quality is a number, not a vibe.
Every release is a question to the market. I design for fast feedback loops — not for perfect launches.
I write code, ship prototypes, and build tooling. Engineers respect PMs who understand what they're asking for.
Eval-first financial anomaly detection across the Nasdaq-100 — where the signal isn't the anomaly, it's the gap between the numbers and the story management tells about them.
ProblemAnalysts can't manually catch when a company's numbers diverge from its own history and the narrative glosses over it.
InsightThe signal isn't the anomaly — it's the gap between the numbers and the story management tells about them.
TradeoffScoped divergence to top anomalies (not all 1,371 filings) to ship a trustworthy demo over a boil-the-ocean one.
ResultCONTRADICTS / NEUTRAL / CORROBORATES labels with cited MD&A passages, behind an LLM-as-judge eval suite.
Each one a rung on the same ladder — every link below is live.
A grounded "Decision Coach" over 700+ verbatim PM regrets from Lenny's Podcast. Precision 33%→90% via prompt iteration; confidence threshold tuned to refuse rather than guess (top-1 relevance 30%→80%).
Résumé + JD → an AI-scored candidate-fit radar, with auth, batch processing, an analytics dashboard, and a feedback loop. Kept deliberately raw — the "before" of the arc.
An open-source AI system that reads your goals + backlog and surfaces what to work on next. The meta-tool that runs this whole portfolio — building the system that builds the products.
Each product taught the next. Read top to bottom: the capability ladder from shipping to systems to agents.
Learned to ship → learned to measure → learned to build systems → learned to make them agentic — in public.
The capabilities a senior AI PM is expected to own — and where each shipped product demonstrates them.
| Product | Grounding / RAG | Evals | Agentic | Telemetry | Shipped live |
|---|---|---|---|---|---|
| RedInk | |||||
| PM Confessional | |||||
| Strategic Fit Canvas | |||||
| Product Management OS |
A few from the feed — each one a real decision from building these tools, in the open.
Numerical rubrics smooth over the failures you most need to see. Moving RedInk's eval suite to binary PASS/FAIL (per Hamel Husain's method) surfaced a routing bug — the model pointing analysts to the wrong filing artifact — and cut the failure rate 41% in one iteration, with no prompt change.
Three independent anomaly signals — including teaching the system to notice what management chose not to address. ALERT fires only when all three agree. Set one false positive and an analyst forgives you; three and they stop opening your alerts. Precision is a trust problem before it's a technical one.
PM Confessional's search took 11 seconds. Dropped it to under 0.1s by skipping the expensive rerank when internal confidence is already high, falling back to Gemini Flash-Lite when it isn't, and caching embeddings. The fix was judgment about when to spend the call, not a faster model.
My manager's review of an exec-summary agent I built over our org's goals, repos, and sprints. Tightening the instructions didn't fix it. Feeding it five summaries he'd actually written did — one pass later it learned the audience cares about clients onboarded and CSAT, not plumbing. Few-shot beat instruction-tuning.
The principles that shape every product decision I make.
Features are solutions in search of problems. I spend more time understanding why users behave the way they do than cataloguing what they ask for. The best product insights live one question deeper — "why does that matter to you?" is where the real brief is.
For AI products especially, "it feels better" isn't shippable. I put an eval harness around the thing — precision, TPR/TNR, an LLM-as-judge with a golden set — so I can tell whether a prompt change helped, regressed, or just moved the demo. Metrics inform; judgment decides; but I refuse to fly blind.
Momentum beats perfection. I bias toward shipping a prototype that answers the key risk question over writing a detailed spec for a product no one has validated — then I iterate fast on real signal. Scoping RedInk to the top anomalies instead of all 1,371 filings was exactly this call.
I'm looking for senior AI PM roles where grounding, evals, and shipping matter. If that's the bar you're hiring for, let's talk.