Sheldon Gomes — Senior PM building eval-first AI products

Featured Project

RedInk

Eval-first financial anomaly detection across the Nasdaq-100 — where the signal isn't the anomaly, it's the gap between the numbers and the story management tells about them.

Narrative-divergence labels · cited MD&A

● CONTRADICTS Revenue +12% YoY, yet MD&A omits the margin compression the cash-flow statement implies.

● NEUTRAL Inventory build flagged, but framed as seasonal — no clear stance for or against the anomaly.

● CORROBORATES Management explicitly attributes the guidance cut to the same demand softness the data shows.

Live · Eval-first

Narrative-divergence detection

Problem: Analysts can't manually catch when a company's numbers diverge from its own history and the narrative glosses over it.

Insight: The signal isn't the anomaly; it's the gap between the numbers and the story management tells about them.

Tradeoff: Scoped divergence to top anomalies (not all 1,371 filings) to ship a trustworthy demo over a boil-the-ocean one.

Result: CONTRADICTS / NEUTRAL / CORROBORATES labels with cited MD&A passages, behind an LLM-as-judge eval suite.

Python · RAG · LLM-as-judge · GCP · Telemetry

Open app ↗

The architecture — built across 4 repos

qqq-anomaly-lab → scoring-pipeline → qqq-eval-suite → redink-ui

1,900+

Filings scored

Nasdaq-100 universe

96.2%

Code-eval pass

Automated correctness checks

3

Divergence labels

Numbers vs. management narrative

v2

In progress

Earnings-call vs. filing agent

More Work

Also Building

Each one a rung on the same ladder — every link below is live.

Grounded RAG · Live on Cloud Run

PM Confessional

A grounded "Decision Coach" over 700+ verbatim PM regrets from Lenny's Podcast. Precision 33%→90% via prompt iteration; confidence threshold tuned to refuse rather than guess (top-1 relevance 30%→80%).

0→1 Web App · Live on Cloud Run

Strategic Fit Canvas

Résumé + JD → an AI-scored candidate-fit radar, with auth, batch processing, an analytics dashboard, and a feedback loop. Kept deliberately raw — the "before" of the arc.

Open Source · Python · MCP

Product Management OS

An open-source AI system that reads your goals + backlog and surfaces what to work on next. The meta-tool that runs this whole portfolio — building the system that builds the products.

The Arc

A deliberate progression

Each product taught the next. Read top to bottom: the capability ladder from shipping to systems to agents.

0→1 · Raw

Strategic Fit Canvas

Shipping a real web app end to end — auth, uploads, deploy.

Grounding + Evals

PM Confessional

RAG with provenance; precision auditing; threshold tuning; a 56× latency fix.

Eval-first Systems

RedInk · Flagship

Multi-repo architecture; LLM-judge eval suite; narrative divergence; telemetry.

Agentic

RedInk v2 (in progress)

Multi-step call-vs-filing agent; labelling a golden set before building.

Meta / Infra

Product Management OS

Building the system that builds the products — and learning in public.

Learned to ship → learned to measure → learned to build systems → learned to make them agentic — in public.

The AI-PM Bar

What each product proves

The capabilities a senior AI PM is expected to own — and where each shipped product demonstrates them.

Product	Grounding / RAG	Evals	Agentic	Telemetry	Shipped live
RedInk
PM Confessional
Strategic Fit Canvas
Product Management OS

Legend: Demonstrated · Partial / in progress · Not the focus

Build in Public

I ship the lessons, not just the products

A few from the feed — each one a real decision from building these tools, in the open.

Evals

"A 4.9/5 became 42% FAIL the moment I switched to binary."

Numerical rubrics smooth over the failures you most need to see. Moving RedInk's eval suite to binary PASS/FAIL (per Hamel Husain's method) surfaced a routing bug — the model pointing analysts to the wrong filing artifact — and cut the failure rate 41% in one iteration, with no prompt change.

RedInk · 3,585 impressions
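
A minimal sketch of why an averaged rubric can hide what a binary verdict exposes. The cases, scores, and field names below are made up for illustration; they are not RedInk's eval data or its actual suite.

```python
from statistics import mean

# Illustrative judge outputs only; these are not RedInk's real eval results.
# Each case carries a 1-5 rubric score and a binary verdict for the same answer.
CASES = [
    {"id": "case-1", "score": 5, "passed": True},
    {"id": "case-2", "score": 5, "passed": False},  # fluent answer, wrong filing artifact
    {"id": "case-3", "score": 5, "passed": True},
    {"id": "case-4", "score": 4, "passed": False},
    {"id": "case-5", "score": 5, "passed": True},
]


def summarize(cases):
    """Return (mean rubric score, binary fail rate, ids of failing cases)."""
    avg_score = mean(c["score"] for c in cases)
    failures = [c["id"] for c in cases if not c["passed"]]
    return avg_score, len(failures) / len(cases), failures


avg_score, fail_rate, failures = summarize(CASES)
print(f"mean rubric score: {avg_score:.1f}/5")
print(f"binary fail rate:  {fail_rate:.0%}")
print("failing cases:", failures)
```

The average looks healthy because a routing mistake can still earn a high fluency score, while the binary column forces each case to be either right or wrong.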

Precision

"I built a gate, not a filter."

RedInk checks three independent anomaly signals, one of which is what management chose not to address. ALERT fires only when all three agree. Send one false positive and an analyst forgives you; send three and they stop opening your alerts. Precision is a trust problem before it's a technical one.

RedInk · 687 impressions
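
The "gate, not a filter" rule could be sketched as below. Only the omission signal is described in the post; the other two field names are hypothetical placeholders, and the real signals may be scores rather than booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    # Hypothetical names: only the omission signal is described in the post.
    numeric_anomaly: bool        # assumed: numbers deviate from the company's history
    narrative_divergence: bool   # assumed: narrative contradicts the numbers
    management_omission: bool    # management did not address the anomaly


def should_alert(signals: Signals) -> bool:
    """Gate: fire ALERT only when every signal agrees."""
    return all(
        (
            signals.numeric_anomaly,
            signals.narrative_divergence,
            signals.management_omission,
        )
    )


print(should_alert(Signals(True, True, True)))   # all three agree
print(should_alert(Signals(True, True, False)))  # one dissents, so no alert
```

Requiring agreement trades recall for precision: some real anomalies go unflagged, but each alert that does fire has three independent reasons behind it.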

Latency

"I was calling an LLM because I could — not because every query needed it."

PM Confessional's search took 11 seconds. Dropped it to under 0.1s by skipping the expensive rerank when internal confidence is already high, falling back to Gemini Flash-Lite when it isn't, and caching embeddings. The fix was judgment about when to spend the call, not a faster model.

PM Confessional · 1,084 impressions
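
One way to express the "spend the call only when needed" routing, as a sketch under assumptions. The threshold value, the `retrieve` and `rerank` callables, and the toy `embed` function are all hypothetical stand-ins, not PM Confessional's code or any real embedding API.

```python
from functools import lru_cache

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; the real value is not given in the post


@lru_cache(maxsize=4096)
def embed(text: str) -> tuple:
    """Toy stand-in for an embedding call; cached so repeat queries cost nothing."""
    return tuple(text.lower().count(ch) for ch in "aeiou")


def search(query, retrieve, rerank):
    """Return ranked docs, calling the expensive reranker only on low confidence.

    retrieve(vector) -> list of (doc, confidence) pairs, best first (hypothetical).
    rerank(query, docs) -> reordered docs, e.g. an LLM call (hypothetical).
    """
    candidates = retrieve(embed(query))
    if not candidates:
        return []
    docs = [doc for doc, _ in candidates]
    top_confidence = candidates[0][1]
    if top_confidence >= CONFIDENCE_THRESHOLD:
        return docs              # cheap path: retrieval is already confident
    return rerank(query, docs)   # expensive path: fall back to the model
```

In this shape the latency win comes from the branch, not from the model: confident queries never reach the reranker at all.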

Prompting

"Too bullshitty. Trash. Wordy vomit."

My manager's review of an exec-summary agent I built over our org's goals, repos, and sprints. Tightening the instructions didn't fix it. Feeding it five summaries he'd actually written did: one pass later it had learned that the audience cares about clients onboarded and CSAT, not plumbing. Few-shot examples beat tighter instructions.

Internal agent · build-in-public
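
A rough sketch of the few-shot approach: put real example summaries in the prompt instead of adding more instructions. The function name, prompt layout, and sample strings are assumptions for illustration, not the internal agent's actual prompt.

```python
def build_few_shot_prompt(instructions: str, examples: list, notes: str) -> str:
    """Assemble a prompt that shows the target style through real examples."""
    shots = "\n\n".join(
        f"Example summary {i}:\n{example}" for i, example in enumerate(examples, 1)
    )
    return (
        f"{instructions}\n\n"
        f"{shots}\n\n"
        f"Notes to summarize:\n{notes}\n\n"
        "Summary:"
    )


# Placeholder strings; in practice these would be summaries the reader wrote.
prompt = build_few_shot_prompt(
    instructions="Write an executive summary in the style of the examples.",
    examples=["<summary written by the reader #1>", "<summary written by the reader #2>"],
    notes="<this sprint's goals, repos, and status>",
)
print(prompt)
```

The examples carry what instructions struggle to state: length, tone, and which metrics the audience actually reads for.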

Read all posts on LinkedIn ↗

Product Philosophy

How I Think

The principles that shape every product decision I make.

01

User problems over feature requests

Features are solutions in search of problems. I spend more time understanding why users behave the way they do than cataloguing what they ask for. The best product insights live one question deeper — "why does that matter to you?" is where the real brief is.

02

Quality is a number, not a vibe

For AI products especially, "it feels better" isn't shippable. I put an eval harness around the thing — precision, TPR/TNR, an LLM-as-judge with a golden set — so I can tell whether a prompt change helped, regressed, or just moved the demo. Metrics inform; judgment decides; but I refuse to fly blind.
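
As a sketch of what those numbers mean in practice, the function below compares an LLM judge's verdicts against human golden labels. The function name and the sample labels are illustrative assumptions, not output from any of the eval suites above.

```python
def judge_agreement(golden: list, judged: list) -> dict:
    """Compare judge verdicts with golden labels (True means PASS in both lists)."""
    pairs = list(zip(golden, judged))
    tp = sum(g and j for g, j in pairs)              # both say PASS
    tn = sum((not g) and (not j) for g, j in pairs)  # both say FAIL
    fp = sum((not g) and j for g, j in pairs)        # judge passes a bad answer
    fn = sum(g and (not j) for g, j in pairs)        # judge fails a good answer

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "tpr": ratio(tp, tp + fn),  # true positive rate
        "tnr": ratio(tn, tn + fp),  # true negative rate
    }


# Made-up labels for eight cases.
golden = [True, True, True, False, False, False, False, True]
judged = [True, True, False, False, False, True, False, True]
print(judge_agreement(golden, judged))
```

Tracking TPR and TNR separately matters because a judge that passes everything scores a perfect TPR while catching no failures.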

03

Build the smallest thing that proves the point

Momentum beats perfection. I bias toward shipping a prototype that answers the key risk question over writing a detailed spec for a product no one has validated — then I iterate fast on real signal. Scoping RedInk to the top anomalies instead of all 1,371 filings was exactly this call.

Skills

What I Bring

Product

Product Strategy

Prioritisation

User Research

PRDs & Specs

Go-to-Market

AI & Evals

Evals & Measurement

Grounding / RAG

Agentic Workflows

Prompt Engineering

Build & Cloud

Google Cloud (GCP)

Claude API

MCP Protocol

Python

Get in touch

Let's build something worth trusting.

I'm looking for senior AI PM roles where grounding, evals, and shipping matter. If that's the bar you're hiring for, let's talk.

Email me ↗ · Résumé ↗ · LinkedIn ↗

From user problem to shipped product.