Can LLMs Actually Trade? What the Benchmarks Don't Tell You
Can large language models actually trade profitably? A critical look at what trading benchmarks reveal about AI capabilities, why markets are uniquely hard, and how to think about the hype.
The $100 Bet That Broke the Internet
In mid-2025, someone on Reddit did something that sounded either brilliant or completely unhinged: they gave ChatGPT full control of $100 in a stock portfolio.
Not "get recommendations from ChatGPT." Full control. The AI picked the stocks. Decided how much to buy. Determined when to sell. The human just clicked the buttons.
Four weeks later, the portfolio was up 24%.
For context, the S&P 500 gained about 3% during that same period.
The equity curve went viral. People lost their minds. "This is it. AI is going to replace Wall Street."
Then, around month five, one of the holdings—a small biotech company called ATYR—announced bad clinical trial results. The stock dropped 80%. Overnight.
The AI hadn't set a stop-loss. Hadn't hedged. Hadn't done anything to protect against this scenario. The gains? Gone.
"The model clearly has major risk management issues," the creator wrote afterward, with admirable understatement.
This experiment tells us something important—not about how to make money, but about where AI capabilities actually stand. Markets are one of the harshest testing grounds imaginable for intelligent systems. What happens when we put LLMs through that gauntlet?
Why This Matters Beyond Trading
Let's be clear about what this article is and isn't.
This is not a guide to making money with AI. If you're here looking for trading tips or the next edge, you'll be disappointed. Markets are adversarial, and any publicly known strategy gets arbitraged away.
This is an examination of AI capabilities. Trading benchmarks are interesting because they test something most benchmarks don't: sequential decision-making under uncertainty, with immediate feedback and real consequences.
Most AI evaluations ask: "Can you answer this question correctly?" Trading asks: "Can you make good decisions repeatedly, in a changing environment, when being wrong costs you?"
That's a fundamentally different—and harder—challenge. It's closer to what we'd need AI to do in the real world: not just respond to prompts, but act autonomously over time with judgment.
So when we look at LLMs trying to trade, we're really asking: how far along is AI on the path from "answering questions" to "making decisions"?
Why Markets Are Uniquely Hard for AI
Before diving into results, we need to understand why markets are such a brutal test. This isn't incidental—it's structural.
The Non-Stationarity Problem
Most machine learning assumes the future resembles the past. You train on historical data, then deploy on similar future data. Markets violate this assumption constantly.
What worked in 2020 doesn't work in 2022. Patterns that held for decades suddenly break. The underlying dynamics shift—sometimes gradually, sometimes overnight. Central bank policy changes. New regulations appear. Entire sectors transform.
An LLM trained on historical financial text has learned patterns that may no longer apply. Worse, it has no way of knowing which patterns are still valid. The training data and the deployment environment are fundamentally different in ways the model can't detect.
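A toy sketch can make this concrete. Everything in it is an assumption for illustration: a synthetic signal whose relationship to returns flips sign between two regimes, and a one-parameter least-squares fit standing in for "the model". It says nothing about how any real LLM behaves; it only shows how a pattern learned in one regime can turn into a liability in the next.

```python
import random

def make_regime(n, beta, rng, noise=0.5):
    """Generate (signal, next_return) pairs where return = beta * signal + noise."""
    data = []
    for _ in range(n):
        signal = rng.gauss(0.0, 1.0)
        ret = beta * signal + rng.gauss(0.0, noise)
        data.append((signal, ret))
    return data

def fit_slope(data):
    """Least-squares slope through the origin: sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / sxx

def hit_rate(data, slope):
    """Fraction of periods where the predicted direction matches the realized one."""
    hits = sum(1 for x, y in data if (slope * x > 0) == (y > 0))
    return hits / len(data)

rng = random.Random(0)
train = make_regime(500, beta=+1.0, rng=rng)        # regime the model learned from
test_same = make_regime(500, beta=+1.0, rng=rng)    # future resembles the past
test_shift = make_regime(500, beta=-1.0, rng=rng)   # relationship has flipped

slope = fit_slope(train)
print(f"fitted slope: {slope:.2f}")
print(f"hit rate, same regime:    {hit_rate(test_same, slope):.2f}")
print(f"hit rate, shifted regime: {hit_rate(test_shift, slope):.2f}")
```

Nothing in the fitted slope signals that the regime has changed. The model is just as confident in the shifted data as it was in the data it learned from.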
The Reflexivity Problem
Here's something even trickier: in markets, predictions change the thing being predicted.
If enough people believe a stock will rise, they buy it, and it rises. If an AI strategy becomes known and widely copied, the edge disappears—or reverses, as others trade against it.
George Soros called this "reflexivity." Markets aren't like weather (which doesn't care about forecasts). Markets respond to beliefs about markets. This creates feedback loops that make prediction fundamentally different from other domains.
An LLM making trading decisions is participating in a game where other players—including other algorithms—are actively trying to exploit predictable patterns. Any discoverable edge gets competed away.
The Adversarial Environment
This leads to the deepest issue: markets are adversarial in a way that Q&A benchmarks aren't.
When an LLM answers a trivia question, nobody is actively trying to make it fail. When an LLM trades, the entire market is effectively the adversary. Professional quants with decades of experience, massive compute resources, and proprietary data are often on the other side of the trade.
It's not enough to be smart. You have to be smarter than the collective intelligence of everyone else playing the same game—many of whom have been optimizing for this specific problem for years.
What This Means for AI Capabilities
These aren't just "hard problems" in the sense of requiring more compute or better training data. They're hard in ways that challenge core assumptions of current machine learning:
- Non-stationarity means the training distribution doesn't match deployment
- Reflexivity means actions change the environment you're trying to predict
- Adversarial dynamics mean other agents are actively exploiting your weaknesses
Any domain with these properties will be difficult for current AI approaches. Markets just make the difficulty brutally visible through P&L.
The Mental Model: A Capability Stack
Now let's look at what researchers have actually tested. Different evaluations measure different capabilities—and conflating them leads to confusion.
Think of trading as a stack of increasingly difficult skills:
The Trading Capability Stack
[4] Portfolio Management: allocating across assets over time, controlling risk
[3] Sequential Decisions: daily choices with consequences that compound
[2] Predictive Analysis: "Will this asset go up or down?"
[1] Information Retrieval: "What was Tesla's revenue last quarter?"
The crucial insight: excellence at lower levels doesn't guarantee competence at higher levels.
This is where many evaluations mislead. A model might retrieve facts perfectly (level 1) and still make terrible predictions (level 2). It might make reasonable predictions and still blow up a portfolio through poor risk management (level 4).
CryptoBench, one of the benchmarks we'll examine, discovered exactly this pattern. They called it the "retrieval-prediction imbalance"—models that excelled at looking things up often struggled with forward-looking analysis.
This matters because most ways we evaluate LLMs focus on lower levels. Answering questions. Summarizing documents. Passing exams. But autonomous decision-making happens at levels 3 and 4—and those are different skills entirely.
What the Benchmarks Actually Show
Three major benchmarks emerged in 2025, each pushing toward more realistic evaluation of these higher-level capabilities.
CryptoBench: When Tools Matter More Than Scale
CryptoBench, from Princeton and collaborators, throws LLMs into realistic cryptocurrency analyst workflows. Fifty fresh questions every month—analyzing on-chain data, reading DeFi dashboards, making predictions.
The interesting finding: the biggest models didn't win.
Grok-4 with web browsing scored around 44% accuracy. GPT-5? About 30%.
The specialized capability—real-time web access—mattered more than raw model size. Even more telling: when researchers gave weaker models an "agentic framework" (tools to browse, plan, and iterate), some closed the gap significantly.
What this reveals about AI capabilities: Architecture and tooling can matter more than scale. A model with the right scaffolding may outperform a larger model without it. This has implications far beyond trading—it suggests that how we deploy models matters as much as the models themselves.
StockBench: The Humbling
StockBench, from Tsinghua researchers, simulates real trading: $100,000 portfolio, 82 trading days, 20 Dow Jones stocks. Daily decisions based on prices, fundamentals, and news.
The sobering result: Most LLM strategies failed to beat a simple buy-and-hold approach.
GPT-5 managed +0.3% over the period. The passive buy-and-hold baseline? About +0.4%. Just buying the stocks and going to the beach would have worked slightly better.
The leading model was Kimi-K2, an open-source model, with +1.9% and better risk metrics than larger proprietary models.
What this reveals: Static knowledge doesn't translate to dynamic decision-making. GPT-5 "knows" vastly more about finance than Kimi-K2. But trading isn't a knowledge test—it's a judgment test under uncertainty. Different capability entirely.
LiveTradeBench: Real-Time Adaptation
LiveTradeBench, from UIUC, streams live market data, news, and social sentiment. Models make allocation decisions in real-time across stocks and prediction markets.
In a 50-day experiment with 21 different LLMs trading side by side:
- Traditional benchmark scores didn't predict trading success. Models crushing standard NLP evaluations sometimes flopped at making money.
- Each model developed distinct behavior. Some kept 20%+ in cash (cautious). Others went nearly all-in. Different "personalities" emerged from the same task.
- Adaptability varied wildly. Some models adjusted to breaking news. Others seemed to ignore new information entirely.
What this reveals: There's a gap between answering questions about markets and making decisions in markets. Static knowledge versus dynamic adaptation. Current LLMs are trained heavily for the former; trading requires the latter.
The Hype Cycle Lens
If you've watched technology long enough, this pattern looks familiar.
Every few years, a new capability emerges. Early demos are impressive. Hype builds. People extrapolate to transformative outcomes. Then reality sets in—the capability is real but narrower than imagined. Eventually, genuine applications emerge, usually different from what was initially hyped.
We saw this with expert systems in the 1980s, neural networks in the 1990s, big data in the 2010s, and now with LLMs. Each time, the technology was real. The mistake was in the extrapolation.
LLM trading is somewhere in the early hype phase. Viral experiments show impressive short-term results. Startups raise money. Breathless coverage follows.
But the pattern suggests caution. The Reddit portfolio that gained 24% in four weeks? One biotech stock randomly doubled. A commenter pointed out you could have achieved similar returns just buying NVIDIA that month.
When markets are hot and volatility is high, almost any strategy can look brilliant briefly. This is noise, not signal. Distinguishing luck from skill requires years of data across different market regimes—exactly what we don't have yet.
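A quick simulation shows how easily noise can pass for skill over a few weeks. This is a sketch under stated assumptions: 1,000 hypothetical "traders" whose daily returns are pure zero-mean noise, and an assumed 3% daily volatility (chosen here to resemble a concentrated small-cap portfolio, not measured from any real account).

```python
import random
import statistics

def random_trader_return(days, daily_vol, rng):
    """Compound return of a trader whose daily P&L is pure zero-mean noise."""
    value = 1.0
    for _ in range(days):
        value *= 1.0 + rng.gauss(0.0, daily_vol)
    return value - 1.0

rng = random.Random(42)
n_traders = 1000
days = 20            # roughly four trading weeks
daily_vol = 0.03     # assumed daily volatility, for illustration only

returns = [random_trader_return(days, daily_vol, rng) for _ in range(n_traders)]
best = max(returns)
median = statistics.median(returns)
share_above_20 = sum(r > 0.20 for r in returns) / n_traders

print(f"median return: {median:+.1%}")
print(f"best return:   {best:+.1%}")
print(f"share of skill-free traders up more than 20%: {share_above_20:.1%}")
```

None of these traders has any edge by construction, yet a meaningful share of them post four-week returns that would make a viral equity curve. The ones who lost money are the ones nobody posts about.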
Where we likely are: The technology enables new experiments. Some results are genuinely interesting. But we're far from the "LLMs reliably beat markets" claims that hype would suggest. The honest answer is: we don't know yet, and anyone claiming certainty is selling something.
How to Evaluate Any LLM Trading Claim
Here's a framework you can apply to future claims—because there will be many.
1. What Level Is Being Tested?
Map the claim to the capability stack. Is it:
- Retrieving information? (Level 1 — not impressive)
- Making predictions? (Level 2 — interesting but limited)
- Sequential decisions with feedback? (Level 3 — genuinely hard)
- Portfolio management with risk control? (Level 4 — the real challenge)
Higher levels are harder and more relevant. Most impressive-sounding demos test lower levels than they appear to.
2. What's the Time Horizon?
- Weeks: Pure noise. Anyone can get lucky.
- Months: Suggestive but inconclusive.
- Years across market regimes: Actual evidence.
A strategy that worked in a bull market tells you nothing about bear markets. You need to see performance through different conditions before drawing conclusions.
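There is a rough rule of thumb behind these horizons. If returns are assumed independent and identically distributed (a simplification that real markets violate), the t-statistic of a strategy's mean return is approximately its annualized Sharpe ratio times the square root of the track record length in years. The sketch below uses that approximation; the threshold of t = 2 is a common convention, not a guarantee.

```python
import math

def t_stat(annual_sharpe, years):
    """Approximate t-statistic of the mean return, assuming i.i.d. returns."""
    return annual_sharpe * math.sqrt(years)

def years_needed(annual_sharpe, t_target=2.0):
    """Track record length needed for the t-statistic to reach t_target."""
    return (t_target / annual_sharpe) ** 2

# Even a genuinely good strategy (Sharpe 1.0) is statistically invisible at short horizons.
for horizon, years in [("4 weeks", 4 / 52), ("6 months", 0.5), ("5 years", 5.0)]:
    print(f"{horizon:>8}: t = {t_stat(1.0, years):.2f}")

print(f"years needed at Sharpe 1.0: {years_needed(1.0):.1f}")
print(f"years needed at Sharpe 0.5: {years_needed(0.5):.1f}")
```

Under these assumptions a strategy with a true Sharpe of 1.0 needs about four years of data to clear t = 2, and a more modest Sharpe of 0.5 needs about sixteen. A four-week result barely registers.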
3. What's the Real Baseline?
Comparisons matter enormously:
- vs. Random: Trivially easy to beat. Not meaningful.
- vs. Buy-and-hold index: The minimum bar. Most LLMs struggle here.
- vs. Buy-and-hold, risk-adjusted: Even harder. Accounts for volatility.
- vs. Simple quantitative strategies: The real test. Momentum, mean reversion, etc.
If the baseline isn't specified, assume it's the easiest possible comparison.
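The buy-and-hold comparison is simple enough to compute yourself. The sketch below assumes an equal-weight portfolio bought on day one and never rebalanced. The tickers and prices are made up for illustration, and the function names are hypothetical. The second half applies the same subtraction to the StockBench figures quoted earlier in this article.

```python
def buy_and_hold_return(prices_by_asset):
    """Total return of an equal-weight portfolio bought on day 0 and never rebalanced."""
    asset_returns = [prices[-1] / prices[0] - 1.0 for prices in prices_by_asset.values()]
    return sum(asset_returns) / len(asset_returns)

def excess_over_baseline(strategy_return, baseline_return):
    """Return the strategy earned beyond what the passive baseline delivered."""
    return strategy_return - baseline_return

# Hypothetical prices for illustration only (not real market data).
prices = {
    "AAA": [100.0, 102.0, 101.0, 104.0],
    "BBB": [50.0, 49.0, 50.5, 51.0],
    "CCC": [20.0, 20.4, 19.8, 19.6],
}
baseline = buy_and_hold_return(prices)
print(f"buy-and-hold baseline: {baseline:+.2%}")

# Figures as quoted in the StockBench section of this article.
reported = {"GPT-5": 0.003, "Kimi-K2": 0.019}
stockbench_baseline = 0.004
for model, ret in reported.items():
    excess = excess_over_baseline(ret, stockbench_baseline)
    print(f"{model}: {excess:+.2%} vs buy-and-hold")
```

Raw excess return is only the minimum bar. It ignores how much risk was taken to get there, which is the next question.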
4. Is Risk Accounted For?
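Returns without drawdown data are meaningless. A strategy returning 30% with 50% drawdowns isn't better than one returning 15% with 10% drawdowns. On a risk-adjusted basis it's worse: a 50% drawdown takes a 100% gain to recover from, and few investors hold on through it.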
Ask: What was the maximum loss? How volatile were returns? What's the Sharpe or Sortino ratio?
The Reddit portfolio's 24% gain looked impressive until one position crashed 80% overnight. Risk management is where most LLM strategies fail.
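These metrics take only a few lines to compute. The sketch below uses standard textbook definitions with a risk-free rate assumed to be zero. The equity curve is hypothetical, shaped to resemble steady gains followed by one position collapsing, and its points are treated as daily values purely for the sake of annualizing. The last two lines apply a simple return-to-drawdown ratio (in the spirit of the Calmar ratio) to the two strategies described above.

```python
import math
import statistics

def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a positive fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized mean return over standard deviation (risk-free rate assumed zero)."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)

def sortino_ratio(returns, periods_per_year=252):
    """Like Sharpe, but only returns below zero count as risk."""
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in returns) / len(returns))
    if downside == 0.0:
        return math.inf  # no losing periods in the sample
    return statistics.mean(returns) / downside * math.sqrt(periods_per_year)

# Hypothetical equity curve: steady gains, then one position collapses.
equity = [100.0, 108.0, 116.0, 124.0, 70.0, 72.0]
returns = [b / a - 1.0 for a, b in zip(equity, equity[1:])]
print(f"max drawdown: {max_drawdown(equity):.1%}")
print(f"Sharpe: {sharpe_ratio(returns):.2f}, Sortino: {sortino_ratio(returns):.2f}")

# Return per unit of drawdown for the two strategies in the text.
print(f"30% return / 50% drawdown: {0.30 / 0.50:.1f}")
print(f"15% return / 10% drawdown: {0.15 / 0.10:.1f}")
```

The strategy with the smaller headline return earns more per unit of drawdown endured, which is why a bare return figure says so little on its own.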
5. Is It Reproducible?
- Open data + open code + open methodology: Trustworthy
- Proprietary but audited: Somewhat trustworthy
- "Trust us, it works": Marketing, not evidence
If you can't verify a result, discount it heavily.
6. Who Benefits From the Claim?
Is this academic research with public code? Or a company raising money? A YouTuber seeking views? An influencer selling courses?
Incentives shape what gets reported and how. Academic benchmarks have their own biases, but at least the incentive is peer validation rather than profit.
A Note of Caution
If you're tempted to experiment with LLM trading yourself, some honest warnings:
LLMs hallucinate. They generate confident-sounding nonsense regularly. In trading, a hallucinated "fact" can justify a disastrous position. The model doesn't know what it doesn't know—and neither will you until it's too late.
LLMs get stuck. Like any optimization process, they can find local minima that feel like solutions but aren't. A trading strategy that "works" in backtests may be overfit to historical quirks that won't repeat.
LLMs lack uncertainty quantification. When a model says "buy," it doesn't tell you how confident it is or what would change its mind. This makes risk management nearly impossible to automate properly.
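Markets punish overconfidence brutally. The difference between 95% accuracy and 99% accuracy is enormous when you're trading daily. At 95%, a single decision per trading day means a costly mistake about once a month, and several decisions per day means several mistakes per month. At 99%, the same single daily decision goes wrong about once every five months.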
Survivorship bias is everywhere. The experiments that get shared are the impressive ones. The countless failures stay private. What you see is a biased sample of what's possible.
Never risk money you can't afford to lose. This should go without saying, but the allure of AI-powered returns can override common sense. Treat any LLM trading experiment as exactly that—an experiment, not an investment strategy.
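The accuracy point above is easy to check with arithmetic. The sketch below assumes about 21 trading days per month and treats decisions as independent, which real trading decisions are not. The number of decisions per day is a free parameter, because the answer depends heavily on how often the model acts.

```python
TRADING_DAYS_PER_MONTH = 21  # approximate

def expected_mistakes_per_month(accuracy, decisions_per_day=1):
    """Expected number of wrong decisions in a month of trading days."""
    return (1.0 - accuracy) * decisions_per_day * TRADING_DAYS_PER_MONTH

def prob_at_least_one_mistake(accuracy, decisions):
    """Chance of one or more errors across independent decisions."""
    return 1.0 - accuracy ** decisions

for acc in (0.95, 0.99):
    once_daily = expected_mistakes_per_month(acc, decisions_per_day=1)
    thrice_daily = expected_mistakes_per_month(acc, decisions_per_day=3)
    p_year = prob_at_least_one_mistake(acc, 252)
    print(f"accuracy {acc:.0%}: {once_daily:.2f} mistakes/month at 1 decision/day, "
          f"{thrice_daily:.2f} at 3/day, P(at least one per year) = {p_year:.3f}")
```

Even at 99%, a model making one decision per trading day is very likely to make at least one mistake over a year. What matters then is how much that one mistake costs, which brings the problem back to risk management.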
Where Might This Actually Go?
Setting aside hype, what's the realistic trajectory?
Near-term: Augmentation, Not Autonomy
The most promising use case isn't "let the AI trade for you"—it's "let the AI help you trade better."
One trader described using Claude 4.5 as a "trade manager"—opening positions based on his own signals, then letting Claude manage exits. Tighten stops. Take partial profits. React to changing conditions.
His assessment: Claude often managed exits better than he would have. "Scary good" timing on some trades.
Why might this work? The model isn't predicting markets from scratch (hard). It's reasoning about a constrained problem with clear inputs (price action, indicators, position details). That's more tractable.
Human intuition for entries. AI discipline for exits. Possibly better than either alone.
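To see why exit management is a more constrained problem, consider the simplest possible exit rule, a trailing stop. This is a generic sketch, not the trader's actual setup or anything Claude was observed doing. The function name, the 10% trail, and the price series are all hypothetical.

```python
def trailing_stop_exit(prices, trail_pct=0.10):
    """Return the index at which a trailing stop would trigger, or None if it never does."""
    peak = prices[0]
    for i, price in enumerate(prices):
        peak = max(peak, price)              # the stop level ratchets up with new highs
        if price <= peak * (1.0 - trail_pct):
            return i
    return None

# Hypothetical price path: a run-up, a fade, then a collapse.
prices = [10.0, 11.0, 12.5, 12.0, 11.0, 9.0, 2.5]
exit_index = trailing_stop_exit(prices, trail_pct=0.10)
print(f"exit at index {exit_index}, price {prices[exit_index]}")
```

The inputs are few and the rule is explicit, which is what makes the problem tractable. One caveat: a stop does not guarantee the exit price. If a stock gaps down overnight, the order fills at the next available price, which can be far below the stop level.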
Medium-term: Specialized Models
General-purpose LLMs probably won't beat specialized systems at trading. The winning approach likely involves:
- Models fine-tuned specifically on financial data
- Architectures designed for sequential decision-making
- Explicit risk management built into the objective function
- Real-time data integration as a first-class capability
The future isn't "ChatGPT trades your portfolio"—it's purpose-built systems that may not even look like chat models.
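As one illustration of what "risk management built into the objective function" could mean, the sketch below scores an equity curve by its total return minus a penalty on maximum drawdown. This is an assumed, deliberately simple form. The penalty weight, the function names, and both equity curves are hypothetical, and a real training objective would be considerably more involved.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a positive fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def risk_penalized_score(equity, drawdown_penalty=2.0):
    """Total return minus a penalty proportional to the worst drawdown."""
    total_return = equity[-1] / equity[0] - 1.0
    return total_return - drawdown_penalty * max_drawdown(equity)

# Hypothetical equity curves for illustration.
steady = [100.0, 103.0, 101.0, 106.0, 110.0]     # +10% with a shallow dip
volatile = [100.0, 120.0, 60.0, 100.0, 130.0]    # +30% after a 50% drawdown

print(f"steady:   {risk_penalized_score(steady):+.3f}")
print(f"volatile: {risk_penalized_score(volatile):+.3f}")
```

Under this objective the steadier curve wins despite its lower return. A system optimized against a score like this would be pushed away from the all-in behavior that sank the Reddit portfolio, although the right penalty weight is itself an open question.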
Long-term: Unknown
Honest answer: we don't know. If AI capabilities continue advancing rapidly, current limitations may prove temporary. If markets adapt faster than AI improves, current challenges may prove fundamental.
The structural problems (non-stationarity, reflexivity, adversarial dynamics) don't disappear with better models. But they might become manageable with fundamentally different approaches.
The Open Questions#
The most interesting aspects of this space aren't the current results — they're the unknowns.
What happens with better tools?
Current benchmarks mostly test models in isolation or with basic web access. But agentic AI is evolving rapidly. What if a trading agent could:
- Query multiple data sources in real-time (news, filings, social sentiment, on-chain data)
- Run its own backtests before committing to a strategy
- Maintain memory across sessions, learning from its own mistakes
- Coordinate with specialized sub-agents (one for research, one for risk, one for execution)
The CryptoBench results show that tool access can already matter more than model size, and StockBench shows that a bigger name does not guarantee better trading. How far does that extend? We don't know yet.
What about different market regimes?
Most current evaluations happened during relatively calm or bullish periods. How do these models behave during a crash? A liquidity crisis? A black swan event? The honest answer is we have almost no data on this — and that's exactly when risk management matters most.
Can explicit risk training help?
Current LLMs aren't trained to manage risk — they're trained to be helpful and accurate. What if you fine-tuned a model specifically on risk management scenarios? Or built risk constraints into the reward function? This seems tractable but largely unexplored.
What's the ceiling for human-AI collaboration?
The "co-pilot" approach — human intuition plus AI discipline — might have a higher ceiling than either alone. But we don't have good frameworks for measuring this. What tasks should humans own? What should AI own? How do you hand off gracefully?
These aren't rhetorical questions. They're genuine gaps in what we know.
What We'd Like to Explore#
At VoidSource, we're curious whether these claims hold up to independent scrutiny. Academic benchmarks are valuable but limited — they test specific scenarios with specific constraints.
Some things we're considering:
- Running standardized evaluations across multiple models with transparent methodology
- Testing the "agentic" hypothesis: does tool access actually improve trading decisions?
- Comparing human-AI collaboration against pure AI and pure human baselines
- Publishing results regardless of whether they're impressive or disappointing
The goal wouldn't be to find "the best trading AI" — it would be to understand what current capabilities actually are, with honest reporting.
If you have thoughts on this — experiments you'd want to see, questions you think matter, or approaches we haven't considered — we'd genuinely like to hear them. The interesting work in this space will come from many perspectives, not just ours.
The honest answer matters more than the impressive one.
The Bottom Line
Can LLMs trade profitably?
The honest answer: Sometimes, sort of, we don't really know yet.
They can retrieve financial information well. They struggle with forward-looking analysis. They really struggle with risk management over time. Short-term wins are usually luck. Long-term outperformance is unproven.
But that's not the interesting question. The interesting question is: what do trading benchmarks reveal about AI capabilities?
The answer: current LLMs are much better at answering questions than making decisions. They're better at static knowledge than dynamic adaptation. They lack the uncertainty quantification and risk awareness that consequential decisions require.
These are genuine limitations, not just "needs more training data." Solving them would represent meaningful progress in AI capabilities—progress that would matter far beyond trading.
Markets are a harsh but clarifying test. The results so far suggest we're earlier on the path to autonomous AI decision-making than the hype implies. That's not a criticism—it's useful information.
The technology is interesting. The experiments are worth watching. The conclusions are still pending.
Key Takeaways
- Markets test something most benchmarks don't: sequential decision-making under uncertainty with real consequences. That's closer to what autonomous AI would need in the real world.
- Structural challenges (non-stationarity, reflexivity, adversarial dynamics) make markets uniquely hard. These don't disappear with better models; they're features of the domain.
- Most LLM strategies don't beat buy-and-hold. GPT-5 roughly matched the buy-and-hold baseline in StockBench. Knowing facts doesn't mean making good decisions.
- Bigger isn't automatically better. Kimi-K2, an open-source model, beat GPT-5 in StockBench. Grok-4 with web browsing beat GPT-5 in CryptoBench. Architecture and deployment matter.
- Risk management is the failure mode. LLMs can reason about risk when asked but don't automatically incorporate it. This is the critical gap.
- Apply the framework: Level tested? Time horizon? Real baseline? Risk metrics? Reproducible? Incentives? These questions cut through hype.
- Never blindly trust LLM outputs for consequential decisions. They hallucinate, overfit, and lack uncertainty awareness. Markets punish these failures immediately and expensively.
Resources
- CryptoBench — Dynamic benchmark for crypto analyst workflows
- StockBench — Multi-month stock trading simulation
- LiveTradeBench — Live market evaluation (open source)
- TradingAgents — Multi-agent trading framework
- FinGPT — Open-source financial LLM
- Lopez-Lira & Tang (2023) — Foundational paper on LLM sentiment for returns