I help you optimize AI for cost, quality, and complexity.

VoidSource is my independent AI systems lab. I benchmark models, APIs, self-hosted tools, rules, and workflow designs to find where simpler systems are enough, where powerful models are worth it, and where AI adds no value at all.

Audit my AI workflow | See the research

Price / 1M tokens (best available pricing)

Claude Opus 4.6	$5/$25	Premium
Claude Sonnet 4.6	$3/$15
Gemini 3.1 Pro	$2/$12
Gemini 3.5 Flash	$2/$9
Grok 4.20	$1/$3
Claude Haiku 4.5	$1/$5
Qwen 3.7	$0.78/$4
Gemini 3.0 Flash	$0.50/$3
K2.5	$0.40/$2
GPT-5.1	$0.25/$2
v3-0324	$0.20/$0.77
V4	$0.10/$0.20	Best value
non-thinking-2507	$0.07/$0.10
Microsoft Phi-4	$0.07/$0.14
Qwen 3 30b A3b	$0.05/$0.19
Qwen 3.5 9B	$0.04/$0.15

113 models tracked


Pricing and benchmark numbers come from public sources and our own runs. See what we measure ourselves vs. what we aggregate.

Decide

What are you trying to decide?

The research below is useful only when it helps answer a concrete tradeoff.

Deployment

Cloud API or self-hosted?

Find when hosting your own model is worth the operational overhead, and when a managed API is still the pragmatic answer.

Audit your setup

Model choice

Small model or frontier model?

Estimate when cheaper models are good enough and when quality, ambiguity, or risk requires the stronger option.

Compare cost vs. quality

System design

Do you need an AI agent, or is this overengineering?

Test whether an agent loop is actually useful, or whether rules, retrieval, validation, and a few model calls solve the job with less fragility.

Review the workflow

Judgment

The right tool depends on the job.

The goal is not to use the smallest model or the biggest one by default. The goal is to spend where it changes the outcome.

Start simple

Test the lightweight route first.

Rules, cached calls, smaller models, or local tools often handle the boring majority. Upgrade only where the lighter route fails.

Measure

Optimize cost per successful outcome.

The useful metric is not token price in isolation. It is what you pay for a correct extraction, accepted answer, resolved ticket, or clean handoff.
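A minimal sketch of that metric, assuming you know the average tokens per call and the fraction of calls that end in an accepted result. The function name, token counts, and success rates are hypothetical, and the prices assume the usual input/output dollars per 1M tokens convention.

```python
import math


def cost_per_success(input_tokens, output_tokens, price_in, price_out, success_rate):
    """Expected spend per successful outcome, in dollars.

    price_in and price_out are dollars per 1M tokens (assumed convention).
    success_rate is the fraction of calls whose result is accepted (0..1).
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    cost_per_call = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    if success_rate == 0.0:
        return math.inf  # never succeeds, so no finite spend buys an outcome
    return cost_per_call / success_rate


# Hypothetical workload: 2,000 input and 500 output tokens per call.
# The success rates are invented for illustration, not measured.
cheap = cost_per_success(2_000, 500, price_in=1.0, price_out=5.0, success_rate=0.60)
premium = cost_per_success(2_000, 500, price_in=5.0, price_out=25.0, success_rate=0.95)
print(f"cheap: ${cheap:.4f}  premium: ${premium:.4f}")
```

With these invented numbers the cheaper option still wins per outcome. If its success rate fell to 15%, the premium option would be the cheaper one, which is the point of measuring outcomes rather than tokens.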

Route

Escalate ambiguous cases.

Use low-cost deterministic paths for clear work, then route uncertainty to stronger models, validation, or human review where it changes the result.
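One way this routing could look, as a sketch under stated assumptions: the task (pulling a single ISO date out of text), the `route` function, the 0.8 confidence threshold, and both model stubs are hypothetical stand-ins, not a description of any particular system.

```python
import re
from typing import Callable, Optional, Tuple

DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def rule_path(text: str) -> Optional[str]:
    """Deterministic route: answer only when exactly one ISO date is present."""
    matches = DATE_RE.findall(text)
    if len(matches) == 1:
        return "-".join(matches[0])
    return None  # zero or several dates: ambiguous, let a model decide


def route(
    text: str,
    small_model: Callable[[str], Tuple[str, float]],
    strong_model: Callable[[str], str],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Return (answer, route_taken), escalating only when the cheaper step is unsure."""
    answer = rule_path(text)
    if answer is not None:
        return answer, "rules"
    answer, confidence = small_model(text)
    if confidence >= threshold:
        return answer, "small_model"
    return strong_model(text), "strong_model"


# Hypothetical stubs standing in for real model calls.
def unsure_small_model(text: str) -> Tuple[str, float]:
    return "2026-03-26", 0.5


def stub_strong_model(text: str) -> str:
    return "2026-04-01"


print(route("Due 2026-03-26", unsure_small_model, stub_strong_model))
print(route("Due 2026-03-26, extended to 2026-04-01", unsure_small_model, stub_strong_model))
```

Returning the route taken alongside the answer makes it easy to count how much traffic each tier absorbs before deciding whether the stronger tier is earning its cost.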

Constrain

Respect privacy and operations.

Self-hosting, cloud APIs, and hybrid pipelines each have a cost. The right answer depends on data sensitivity, team speed, and maintenance reality.

Signal notes

Get the AI Tradeoff Notes

Practical notes on model choice, API costs, self-hosting, benchmarks, and when lighter systems are good enough. Sent only when there is something useful to say. No hype, no daily news sludge.

Live data

Cost vs. quality across 15 benchmarks

Pick a benchmark below. Models on the dashed line are Pareto-optimal: no other model offers better performance for less money.
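For readers who want the definition in code, here is a small sketch of how a Pareto frontier can be computed from price and score pairs. The function name and the sample data are hypothetical; this is not the chart's actual implementation.

```python
from typing import List, Tuple

Model = Tuple[str, float, float]  # (name, price, score); higher score is better


def pareto_frontier(models: List[Model]) -> List[str]:
    """Names of models that no cheaper-or-equal model matches or beats on score.

    Sort by price ascending (ties: higher score first), then keep each model
    that beats the best score seen so far. Exact duplicates keep only the first.
    """
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier = []
    best_score = float("-inf")
    for name, _price, score in ordered:
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Invented sample data, for illustration only.
sample = [
    ("A", 5.0, 90.0),
    ("B", 2.0, 85.0),
    ("C", 3.0, 80.0),  # dominated by B: costs more, scores lower
    ("D", 0.5, 60.0),
    ("E", 1.0, 55.0),  # dominated by D
]
print(pareto_frontier(sample))
```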

158 models compared. Score scale: 50 = baseline, 100 = top 5%, 100+ = frontier. Higher coverage yields higher confidence.

Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, NVIDIA, OpenAI, Qwen, xAI, Zhipu

Browse all models | Compare API pricing | View benchmark atlas

Original Work

Aggregation is table stakes. The useful part is running the systems.

All evaluations

NEW | MAR 26, 2026 | Analysis

We Benchmarked 7 OCR Models So You Don't Have To

Results from our own olmOCR-bench runs across OCR-specific models, general-purpose VLMs, and Tesseract. The main lesson: evaluation methodology changes scores more than most leaderboard readers realize.

14 min read

FEB 23, 2026 | Reference

EVMbench: Can AI Agents Hack Smart Contracts?

EVMbench measures whether AI agents can detect, patch, and exploit vulnerabilities in Ethereum smart contracts. Here's what it tests, how it's graded, and which models lead.

7 min read

JAN 18, 2026 | Commentary

Claude Code is the New Cursor (and the Cycle Never Ends)

Everyone's screaming about Claude Code. A year ago, they screamed about Cursor. Before that, Copilot. The pattern is more interesting than the product—and reveals something uncomfortable about how we adopt tools.

11 min read

Head-to-head

Compare models side by side

Pick any models you're evaluating and compare benchmarks, pricing, and specs in one view.

Spec	Anthropic Claude Opus 4.6 (Thinking)	Meta Muse Spark (Thinking)	Google Gemini 3.1 Pro
Arena ELO	1,503	1,489	1,488
Input price	$5.00/1M	—	$2.00/1M
Context	1.0M	—	—

Compare these models | Build your own comparison

Explore

Built for comparison, not browsing noise.

— Core

113

Language Models

LLMs

Benchmarks, pricing, and tradeoffs that actually change a model decision.

Explore

— Emerging

Extraction Systems

Document AI

Document parsing, OCR, and extraction systems evaluated as workflow components, not hype objects.

Explore

— Seed

Tracked Entries

Image AI

Open image generation models, local tools, and workflow questions as a growing seed hub.

Explore

— Seed

Tracked Entries

Video AI

Open video generation models, timeline tools, and workflow questions as a growing seed hub.

Explore

— New

—

In Progress

Robotics + AI

An emerging category for embodied intelligence, platform capability, and where the field is actually moving.

Explore

Work with me

Bring me a messy AI system decision.

The benchmarks and pricing above are the evidence layer. The actual product is judgment: knowing when to use the powerful model, when a lighter system is enough, when to use rules, when to self-host, and when not to use AI at all. I can review it, audit it, benchmark the options, prototype the better route, or build it with you.

See how to work with me

Decision Report

An async, evidence-backed recommendation for one workflow question: cloud API or self-hosted, small model or frontier, regex or LLM.

Workflow Audit

I map your AI workflow, find where you overpay or lose quality, and recommend the reliable route across models, rules, validation, and routing.

Prototype & Build

A runnable proof of the recommended path — eval table, cost-quality comparison — and implementation support when the work compounds into reusable machinery.

First proof vertical: document & OCR workflows — measurable, with real infrastructure behind it. See services and audit options.

Signal

AI News

Read all

How Endava Redesigns Software Delivery Using ChatGPT Enterprise and Codex

openai.com

How Wasmer Used Codex to Build a Node.js Runtime for the Edge

openai.com

Ted Chiang Argues Against the Concept of AI Consciousness

theatlantic.com

Google Releases Gemma 4 12B with Performance Claims Rivaling 26B Models

reddit.com