Can LLMs make investment decisions?
Models are trading stocks and betting on prediction markets in new public "arenas"
Key Takeaways
There are a growing number of public experiments where LLMs make investment decisions and forecast future events
There isn’t yet evidence of models beating the market at meaningful scale or statistical significance. However, each new generation of frontier model tends to be less bad
OpenAI's models and Grok tend to be better at trading, while Claude tends to be better at pure forecasting
The various arena designs address the various "catch 22s" that arise when LLMs make investment decisions, including: (i) choosing across many assets vs. focusing your tokens on a single asset, (ii) tool use vs. context size, (iii) "real time" decisions vs. randomness, (iv) forecasting skill vs. data access
We will monitor these arenas going forward, as they are prototypes for eventual institutional strategies
Background
Hedge funds primarily use LLMs to make their teams more efficient, but everyone secretly wonders whether the models will eventually make investment decisions on their own.
There’s a growing universe of “arenas” where models are evaluated on their ability to predict events and make investment decisions. Unlike popular benchmarks such as GDPval, GPQA and ARC, the answers here haven’t happened yet, so there’s no risk of contamination, and the benchmarks can’t be saturated because the market makes them harder every day. Also unlike typical benchmarks, these investing and forecasting arenas exactly match a highly valuable real-world task.
LLM Investing Arenas
Alpha Arena (nof1.ai)
System design: In Alpha Arena, the models trade $10K each in real money across 7 stocks. Every ~2 minutes, the system asks each model to make a buy/sell/hold trading decision with context including its current portfolio, news, trading data and the original trade parameters. The arena is actually four separate arenas, each with a unique trading goal, in order to add statistical power to the final leaderboard.
Result: All models eventually lost money, but Grok 4.20 (pre-release) and GPT-5.1 performed the best and made money in a couple of instances. Claude Sonnet 4.5 and Grok 4 performed the worst
My take: Very slick implementation of LLM trading, and I’m excited to see what they roll out in “season 2.” One challenge with LLMs is that they are usually non-deterministic, meaning they don’t produce the same answer every time, so if you call your model enough times it might randomly dump your entire portfolio. The nof1.ai team solved this problem by prompting the model to create a trading plan (price target, stop loss, invalidation conditions, etc.), then feeding that same plan back in future calls. Another smart thing they do is ask each model about the narrative supporting each trade, and where it expects the trade to go.
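A minimal sketch of that plan-persistence pattern, assuming a simple JSON protocol between the harness and the model (the function names and schema are illustrative, not nof1.ai's actual implementation):

```python
import json

def make_decision(llm_call, ticker: str, price: float, state: dict) -> dict:
    """Ask the model for a buy/sell/hold decision, feeding back the plan it
    committed to on the first call so later calls stay consistent with it."""
    prompt = {
        "ticker": ticker,
        "price": price,
        # On the first call the plan is empty and the model is asked to write
        # one (price target, stop loss, invalidation conditions). On every
        # later call that same plan is replayed, anchoring the model's answer.
        "existing_plan": state.get("plan"),
        "instruction": (
            "Return JSON with keys: action (buy/sell/hold) and plan "
            "(price_target, stop_loss, invalidation). If an existing plan is "
            "given, only deviate from it when an invalidation condition is met."
        ),
    }
    reply = json.loads(llm_call(json.dumps(prompt)))
    state["plan"] = reply["plan"]  # persist the plan for the next call
    return reply
```

The key design choice is that the plan, not the raw conversation, is the unit of state: each call is cheap and stateless except for the plan the model previously committed to.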
AI Controls Stock Account (Nathan Smith)
System design: Ran ChatGPT Deep Research once per week for six months to allocate real money across a universe of small cap healthcare stocks
Result: -17%
My take: This was a great, early implementation (especially by a high school student!). A challenge with this approach is that it uses one giant call to set its portfolio each week, which spreads its tokens across many, many different potential investment decisions. The name of the game is to spend as many tokens as possible on the most valuable decisions, so I believe it’s better to build up the portfolio with many smaller decisions. This experiment also highlighted another crucial issue with LLM-driven trading: portfolio construction. One of its core positions, AYTR, fell 83% when it announced failed Phase 3 trial results, and the portfolio never recovered. The problem wasn’t that the LLM should have known (the entire market was offsides); the problem was that it shouldn’t have been such a large position, given the source of edge had nothing to do with predicting drug trial results.
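One way to encode that portfolio-construction lesson is a hard cap on any position facing a binary catalyst, applied after the model picks its weights. A hedged sketch (the cap level and redistribution rule are illustrative assumptions, not what the experiment did):

```python
def cap_position(weights, binary_events, max_event_weight=0.05):
    """Cap any position facing a binary catalyst (e.g. a Phase 3 readout)
    at max_event_weight, redistributing the excess equally across the
    positions that don't face one."""
    capped, excess = {}, 0.0
    for ticker, weight in weights.items():
        if ticker in binary_events and weight > max_event_weight:
            capped[ticker] = max_event_weight
            excess += weight - max_event_weight
        else:
            capped[ticker] = weight
    safe = [t for t in capped if t not in binary_events]
    if safe:  # avoid dividing by zero if every name faces an event
        for t in safe:
            capped[t] += excess / len(safe)
    return capped
```

The point is that sizing is a deterministic risk rule layered on top of the LLM's picks, so a blow-up in any single binary event can't sink the whole portfolio.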
AI Investing Arena (Bobby Dhungana)
System design: Models paper trade 5 ETFs (S&P 500, Nasdaq, Gold, Interest Rates, Oil). The system asks each model to make a buy/sell/hold trading decision every ~30 minutes with context including VIX volatility, treasury yields, dollar strength and oil prices
Result: Still active, started Nov 25. GPT-5 in the lead, Claude Sonnet 4.5 in last place, though all are ~breakeven
My take: It was inspired by and has a similar implementation to Alpha Arena, but I like the focus on allocating across ETFs vs. individual stocks. It’s possible the generalist nature of LLMs makes them better suited to allocating across sectors than to picking individual stocks, where they have an information disadvantage. But it’s too early to draw conclusions, as the experiment has only been running since late November.
AI Arena (rallies.ai)
System design: Models maintain their own portfolios, evaluating them every few days. Architecture includes custom MCP servers and tool calls to distill a large universe of potential investments into a few potential trades, so the decision model can focus on choosing among a few quality options.
Result: Almost every model is making money, led by DeepSeek and Grok 4, but it’s still early
My take: This architecture addresses the catch 22 of wanting the model to select among as many assets as possible, while still focusing as many tokens as possible on individual decisions. Their solution is an extensive screening step that first identifies stocks at technical extremes, with unusual options flow, interesting fundamentals and near term catalysts. Still too early to draw conclusions; it needs to go through an earnings cycle, and factors likely explain most of the moves so far.
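A stripped-down sketch of that screen-then-decide funnel (the screen names and simple 0/1 scoring are illustrative assumptions, not rallies.ai's actual MCP pipeline):

```python
def screen_then_decide(universe, screens, decide, top_n=5):
    """Two-stage funnel: cheap programmatic screens cut a large universe
    down to a handful of names, so the expensive decision model spends its
    tokens comparing only a few quality candidates."""
    # Score every stock by how many screens it passes (each screen returns 0/1).
    scored = [(sum(screen(stock) for screen in screens), stock)
              for stock in universe]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best scores first
    shortlist = [stock for _, stock in scored[:top_n]]
    return decide(shortlist)  # e.g. a single LLM call over the shortlist
```

The screens are cheap and run over thousands of names; only the final `decide` step burns model tokens, and it sees just a handful of pre-qualified candidates.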
Flat Circle Arena (Flat Circle)
System design: Models paper traded individual earnings during 4Q24
Result: OpenAI o1 and Grok-2 performed the best, while Claude Sonnet 3.5 performed the worst. o1 performed much better than o3-mini, and Opus performed much better than Sonnet; the more expensive models outperformed the cheaper ones
My take: This was an early, rudimentary effort. One advantage of focusing entirely on earnings is that the results are “pure idio”: market and other factors have limited impact on the returns, so you’re almost entirely measuring LLMs’ ability to beat other investors. While these results were promising, results in subsequent earnings periods deteriorated as they entered different market environments (e.g., Liberation Day, the AI capex boom). Another limitation was that this strategy focused on large cap stocks; it’s possible LLMs are more effective in the longer tail, where there’s less competition.
LLM Forecasting Arenas
Related, there are a handful of “forecasting arenas” where instead of investment decisions, models bet on prediction markets or forecast events.
My take: Forecasting arenas are a purer benchmark of LLMs’ ability to predict the future, and it turns out LLMs are pretty good at it. Results generally show leading forecasting models having a winning hit rate while betting on Polymarket or Kalshi (though it’s unclear if they’re good enough to win at any real scale).
Even the best models aren’t able to beat the best human forecasters (not sure they ever will, as the best forecasters also have access to LLMs).
Claude Opus and Sonnet appear the strongest at pure forecasting (unlike in the investing arenas, where they’re often the weakest). How could this be true? One theory is that Claude has the most analytical rigor (using base rates and proper scenario analysis) but weaker access to tools like Google / x.com search, which matter more for investing. That is where Grok, Gemini and OpenAI are strongest.
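To illustrate what "using base rates" can mean in practice, here's a hedged sketch that shrinks a model's raw probability toward the historical base rate in log-odds space. This is a standard calibration trick, not a description of how any of these models actually reason:

```python
import math

def blend_with_base_rate(base_rate, model_prob, weight=0.5):
    """Shrink a model's raw probability toward the historical base rate in
    log-odds space. weight=0 trusts the base rate fully, weight=1 the model."""
    def logit(p):
        return math.log(p / (1.0 - p))  # probability -> log-odds
    blended = (1.0 - weight) * logit(base_rate) + weight * logit(model_prob)
    return 1.0 / (1.0 + math.exp(-blended))  # inverse logit back to probability
```

Blending in log-odds rather than raw probabilities keeps the result well-behaved near 0 and 1, which is where overconfident forecasts lose the most money on prediction markets.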
Catch 22s when LLMs make investment decisions
These arenas show various ways to address the “catch 22s” in LLMs making investment decisions:
Wanting the model to select among as many assets as possible vs. focusing all your tokens on a single asset?
Providing access to as many tools as possible vs. managing to an optimal context size?
Allowing the models to make decisions in “real time” vs. accepting more randomness the more often you call the model?
Choosing the best forecaster model (Claude) vs the ones with proprietary data access (Gemini, Grok)?
There’s no evidence yet of LLMs beating the market with any scale or statistical significance. However, there’s going to be a lot of new models and architectures released in 2026. With these improvements, we expect to see more institutional focus on LLMs making investment decisions.
Follow for more on investing arenas and other creative LLM use cases
If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.






