I help you optimize AI for cost, quality, and complexity.
Pricing and benchmark numbers come from public sources and our own runs. See what we measure ourselves vs. what we aggregate.
Decide
What are you trying to decide?
The research below is useful only when it helps answer a concrete tradeoff.
Cloud API or self-hosted?
Find when hosting your own model is worth the operational overhead, and when a managed API is still the pragmatic answer.
Small model or frontier model?
Estimate when cheaper models are good enough and when quality, ambiguity, or risk requires the stronger option.
Do you need an AI agent, or is this overengineering?
Test whether an agent loop is actually useful, or whether rules, retrieval, validation, and a few model calls solve the job with less fragility.
Judgment
The right tool depends on the job.
The goal is not to use the smallest model or the biggest one by default. The goal is to spend where it changes the outcome.
Test the lightweight route first.
Rules, cached calls, smaller models, or local tools often handle the boring majority. Upgrade only where the lighter route fails.
Optimize cost per successful outcome.
The useful metric is not token price in isolation. It is what you pay for a correct extraction, accepted answer, resolved ticket, or clean handoff.
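A minimal sketch of that metric, assuming each attempt has a fixed price and a known success rate, and that each failure carries some extra cost (rework, review, a retry). The function name, the `failure_cost` parameter, and all numbers are hypothetical and only illustrate the arithmetic.

```python
def cost_per_success(cost_per_call: float,
                     success_rate: float,
                     failure_cost: float = 0.0) -> float:
    """Expected spend per successful outcome.

    Assumes independent attempts: on average 1/p calls are needed per
    success, and (1 - p)/p of them fail, each adding failure_cost.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    expected_cost_per_call = cost_per_call + (1 - success_rate) * failure_cost
    return expected_cost_per_call / success_rate


# Hypothetical figures, not measurements.
# Looking at call price alone, the cheaper model wins:
small_raw = cost_per_success(cost_per_call=0.002, success_rate=0.80)
large_raw = cost_per_success(cost_per_call=0.010, success_rate=0.98)

# Once each failure costs 0.05 in cleanup, the ranking flips:
small_full = cost_per_success(0.002, 0.80, failure_cost=0.05)
large_full = cost_per_success(0.010, 0.98, failure_cost=0.05)
```

With these made-up inputs, the model that is cheaper per call becomes the more expensive one per correct result as soon as failures are priced in.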
Escalate ambiguous cases.
Use low-cost deterministic paths for clear work, then route uncertainty to stronger models, validation, or human review where it changes the result.
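One way to picture that routing, as a sketch under stated assumptions: a deterministic rule handles the clear cases, and anything it cannot answer unambiguously is handed to a stronger path. The `INV-` identifier format, the function names, and the `strong_model` callable are hypothetical stand-ins, not a real API.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical ID format: "INV-" followed by exactly six digits.
INVOICE_RE = re.compile(r"\bINV-\d{6}\b")


def rule_path(text: str) -> Optional[str]:
    """Return the invoice ID only when exactly one distinct ID is present."""
    matches = set(INVOICE_RE.findall(text))
    return matches.pop() if len(matches) == 1 else None


def extract_invoice_id(text: str,
                       strong_model: Callable[[str], str]) -> Tuple[str, str]:
    """Try the cheap deterministic path first; escalate if it is unsure."""
    hit = rule_path(text)
    if hit is not None:
        return hit, "rules"
    # Zero or several candidates: ambiguous, so pay for the stronger route.
    return strong_model(text), "escalated"


def fake_strong_model(text: str) -> str:
    """Placeholder for a model call or human review queue (assumption)."""
    return "NEEDS-REVIEW"


clear = extract_invoice_id("Payment for INV-123456 is due.", fake_strong_model)
ambiguous = extract_invoice_id("INV-123456 replaces INV-654321.", fake_strong_model)
```

The point of the design is that the expensive path is only invoked for inputs the cheap path explicitly declines.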
Respect privacy and operations.
Self-hosting, cloud APIs, and hybrid pipelines each have a cost. The right answer depends on data sensitivity, team speed, and maintenance reality.
Signal notes
Get the AI Tradeoff Notes
Practical notes on model choice, API costs, self-hosting, benchmarks, and when lighter systems are good enough. Sent only when there is something useful to say. No hype, no daily news sludge.
Live data
Cost vs. quality across 15 benchmarks
Pick a benchmark below. Models on the dashed line are Pareto-optimal: no other model offers better performance for less money.
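For readers who want the definition spelled out, here is a minimal sketch of how a Pareto frontier can be computed from (price, score) pairs. It is an illustration, not the code behind the chart; the model names and numbers are invented.

```python
from typing import List, Tuple

Model = Tuple[str, float, float]  # (name, price, benchmark score)


def pareto_frontier(models: List[Model]) -> List[str]:
    """Names of models not dominated on (lower price, higher score).

    Sort by price ascending (higher score first on ties), then keep a
    model only if it beats the best score seen at any lower-or-equal
    price. Exact duplicates of a kept point are dropped.
    """
    frontier = []
    best_score = float("-inf")
    for name, price, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Invented data for illustration only.
candidates = [
    ("model-a", 1.0, 60.0),
    ("model-b", 2.0, 55.0),  # pricier and worse than model-a: dominated
    ("model-c", 3.0, 80.0),
    ("model-d", 5.0, 80.0),  # same score as model-c at a higher price: dominated
]
optimal = pareto_frontier(candidates)
```

Here only `model-a` and `model-c` survive, which corresponds to the points that would sit on the dashed line.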
Original Work
Aggregation is table stakes. The useful part is running the systems.
We Benchmarked 7 OCR Models So You Don't Have To
Results from our own olmOCR-bench runs across OCR-specific models, general-purpose VLMs, and Tesseract. The main lesson: evaluation methodology changes scores more than most leaderboard readers realize.
EVMbench: Can AI Agents Hack Smart Contracts?
EVMbench measures whether AI agents can detect, patch, and exploit vulnerabilities in Ethereum smart contracts. Here's what it tests, how it's graded, and which models lead.
Claude Code is the New Cursor (and the Cycle Never Ends)
Everyone's screaming about Claude Code. A year ago, they screamed about Cursor. Before that, Copilot. The pattern is more interesting than the product—and reveals something uncomfortable about how we adopt tools.
Head-to-head
Compare models side by side
Pick any models you're evaluating and compare benchmarks, pricing, and specs in one view.
| Spec | Anthropic Claude Opus 4.6 (Thinking) | Meta Muse Spark (Thinking) | Google Gemini 3.1 Pro |
|---|---|---|---|
| Arena ELO | 1,503 | 1,489 | 1,488 |
| Input price | $5.00/1M | — | $2.00/1M |
| Context | 1.0M | — | — |
Explore
Built for comparison, not browsing noise.
Language Models
LLMs
Benchmarks, pricing, and tradeoffs that actually change a model decision.
Extraction Systems
Document AI
Document parsing, OCR, and extraction systems evaluated as workflow components, not hype objects.
Tracked Entries
Image AI
Open image generation models, local tools, and workflow questions as a growing seed hub.
Tracked Entries
Video AI
Open video generation models, timeline tools, and workflow questions as a growing seed hub.
In Progress
Robotics + AI
An emerging category for embodied intelligence, platform capability, and where the field is actually moving.
Work with me
Bring me a messy AI system decision.
The benchmarks and pricing above are the evidence layer. The actual product is judgment: knowing when to use the powerful model, when a lighter system is enough, when to use rules, when to self-host, and when not to use AI at all. I can review it, audit it, benchmark the options, prototype the better route, or build it with you.
Decision Report
An async, evidence-backed recommendation for one workflow question: cloud API or self-hosted, small model or frontier, regex or LLM.
Workflow Audit
I map your AI workflow, find where you overpay or lose quality, and recommend the reliable route across models, rules, validation, and routing.
Prototype & Build
A runnable proof of the recommended path — eval table, cost-quality comparison — and implementation support when the work compounds into reusable machinery.
First proof vertical: document & OCR workflows — measurable, with real infrastructure behind it. See services and audit options.
Signal
AI News
How Endava Redesigns Software Delivery Using ChatGPT Enterprise and Codex (openai.com)
How Wasmer Used Codex to Build a Node.js Runtime for the Edge (openai.com)
Ted Chiang Argues Against the Concept of AI Consciousness (theatlantic.com)
Google Releases Gemma 4 12B with Performance Claims Rivaling 26B Models (reddit.com)