Flat Circle

Creative Agent Use Cases - Channel check agents

Jim Moran — Tue, 26 May 2026 14:06:29 GMT

Where agents are consistently wrong

Claude apparently misled Dan Loeb into thinking Iran could not turn off oil production due to “water coning,” a phenomenon that occurs with too much production (@DanielSLoeb1)

This is an example of LLMs being challenged with highly technical, new investor questions. Claude may have repeated a narrative produced by a political think tank. Just like the market, agents can latch on to well written incorrect information.

We are collecting examples of more content that regularly misleads agents, please reply if you’d like to discuss.

Channel check agents

Financials

To test whether Nubank (NU) was facing non-performing loan stress, an investor used Claude Cowork to scrape Glassdoor, LinkedIn, and Reddit (Jimmy’s Journal)

Identify which banks are growing via new loan agents, a signal for lower creditworthiness (@NDS0909)

Homebuilders

Build a spec-pressure and incentive tracker based on homebuilder community pages (Idea File). Note: “idea files” are long prompts to replicate and customize the analysis in your own Claude or Codex

Retail / Restaurants

Great thread from former Schonfeld / Citadel analyst on using agents to analyze new store productivity and white space opportunity (@GregoryBlotnick)

Turnaround tracker: Agent that monitors employee complaints on reddit for Kohl’s and other retail businesses with new management (Idea File)

Energy / Renewables

Compare E&P management cost-reduction claims against actual filed well budgets (AFE Leaks)

Scan renewable project claims, map to public-company parents, then compare them against newly required NERC IBR registrations (Original idea: PV Magazine, Idea File)

Software

Analyze reddit discussions on new AI products from public software companies (Idea File)

Investment workflows

Stock Thoughts discusses how he uses Claude to manage his PA (Substack)

100 Bagger Hunting describes his agent that scans for “kill conditions” on his data center holding (Substack)

One investor’s prompt for a capital allocation scorecard (AI Investing Lab)

Idea for earnings call agent that extracts management claims then reads competitor transcripts for contradictory claims (@byerlys32)

Interesting prompt for analyzing earnings transcripts (@MashraniVivek)

You are a seasoned analyst who reads between the lines on every management call. Tell me what actually happened:
The 3 real decisions made — not just stated
What management conspicuously avoided answering
The quiet commitments they’ll be held to next quarter

Other news and tools

Synthetic panel startup for hedge funds offers MCP (Thesis Lab). Synthetic panels are collections of agents designed to simulate real human responses. Thesis Lab competes with Aaru, Simile.ai, Electric Twin, Qualtrics and others. See earlier coverage of synthetic panels.

Sellside is increasingly using AI, resulting in increased quality, speed and market reactions (SSRN, thanks to Justina Lee)

Merger arb investors increasingly relying on agents (AI Street)

Follow for more creative agent use cases

Subscribe now

If you would like to discuss incorporating agents into your research process, reply to this email or reach out via X or LinkedIn.

Creative Agent Use Cases - Anthropic job posts, public CEO brains and competitor checks

Jim Moran — Wed, 13 May 2026 16:51:34 GMT

Flat Circle tracks creative uses of LLMs in hedge funds. Join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

Second Brains

Instead of reading books, the CEO of YCombinator asks his agents to rewrite them as insights mapped to his life (@GarryTan). This is possible because he’s developed a “second brain” - a personal knowledge base making his agents way more useful. Tan shares his framework for maintaining his second brain in the opensource project, GBrain.

What will happen with all these second brains?

Some people want to rent them out. Sequoia-backed Delphi.ai allows influencers to host second brain chatbots based on their essays, interviews and other documents they feed it. Some public company executives like Brian Halligan (Hubspot co-founder/former CEO), Gokul Rajaram (Pinterest and Coinbase board member), Spencer Skates (Amplitude CEO) even have second brains you can talk to.

A couple funny responses from public company executives:

Alfred Lin, Airbnb board member, on their recent decision to commingle hotels with home listings: “The moment you commingle [hotels] into a single search, you force yourself to build one ranking system that has to serve both. And one of them always loses. You either optimize for homes and hotels feel generic, or you optimize for the transaction and homes feel like a commodity. You end up with a muddled middle that serves neither well.”

Mike Ovitz, former Disney CEO, on Disney+ as the “digital centerpiece”: The Fortnite piece is the tell. If Disney's truly centralizing everything on Disney+, why are they planting the flag inside Epic's universe? That's not strategy; that's optionality masquerading as strategy. It signals uncertainty about which direction actually wins.

We’ll be able to benchmark which second brains bring incremental insight to the market. Delphi has bots for four Iran experts - policy analysts and professors - and they all think Polymarket’s current odds on a permanent ceasefire deal are way too high. From political analyst Hooman Majd:

Delphi’s public bots are trained on mostly public info and are mainly for amusement. But the value to building personal knowledge base wikis for your agents is very real. These second brains reflect years of notes, pattern recognition and work products not currently in any model’s training data. Eventually they’ll start interacting with the world in new and exciting ways.

Creative Agent Use Cases

Yet Another Value Blog describes his Claude workflows to run local competitor, discounting and review checks for Sweetgreen (SG) and Cable One (CABO) (Substack)

How to create a synthetic panel of varied investor personas to model how investors will react to any catalyst (LLMQuant)

By deploying a local large language model called Llama 3.1 8B and endowing it with 216 distinct investor personas drawn from the FINRA Foundation’s National Financial Capability Study, the authors generated 1.188 billion synthetic buy, hold, and sell responses to 5.5 million S&P 500 news headlines spanning 2010 through April 2025.
The result is something the finance literature has lacked for decades: a high-frequency, demographically representative measure of how investors actually disagree about specific news, day by day, headline by headline.

YC partner on how and why to switch from Claude Code to Codex (@OMooreTweets)

How Microsoft’s Director of Data Science builds data agents to explain why KPIs are moving (Medium)

To make this work in practice, we made a handful of concrete design choices — each with trade-offs — across architecture, platform, data access, security, and behavioral guardrails

Former Maverick PM discusses his system that identifies three drivers of each company in his coverage and performs 24/7 monitoring on anything related to those drivers from news, sellside research, transcripts, etc (@FundamentalEdge)

University of Zurich paper used LLMs to process 128,860 10K filings to track how firms prepare, insure and financially buffer themselves against physical climate hazards (Climate and Tech)

LLM generated dataset of Canadian miners, comparing production guidance against drill assays, grade-by-thickness economics and regulatory guidelines (Quintarthai)

Jaguar Analytics spends $2,000 per month on Claude and Gemeni to identify hidden angles outside of boilerplate commentary from equity research, earnings transcripts and investor day presentations (@JaguarAnalytics)

Thread from the /LegalTech community about a system to identify shifts in narrative framing (Reddit)

A lot of the difficult reconstruction problems are less about “finding contradictions” and more about detecting shifts in narrative framing over time. Especially with omission cases, the absence itself can become evidentiary: someone stops repeating a prior position... language gradually softens or changes without acknowledgment

Idea Files

Idea Files are detailed prompts you can paste into your Codex or Claude Code to replicate and customize a custom dataset. For more, read here.

How to track what stocks will be impacted by the next Anthropic release (Original Idea: Jeremy Leung, Idea File)

To replicate and customize this with your Claude or Codex, see our idea file.

How to track comp award changes incentivizing M&A (Original Idea: Yet Another Value Blog, Idea File).

“Friday afternoon, the WSJ broke a report that GME is looking to buy eBay…In late November they gave all of their executives’ PSUs that vested only if the company underwent a change of control and the stock was ‘at least $7.50’ per share within the next five years. ... I’m not sure I’ve ever seen a single PSU grant that flashes ‘we are for sale’ harder than that grant.”

We turned this concept into a system that identifies past comp changes that preceded M&A transactions and identifies similar plan updates:

To replicate and customize this with your Claude or Codex, see our idea file.

How to track local friction for power, storage and infrastructure buildouts (Original Idea: PV Magazine, Idea File). The system flagged two recent public-company-linked siting-risk signals:

DTE Energy (DTE)’s Poseyville Solar Park was denied a special-use permit in Ingersoll Township on April 21, 2026
SK Innovation-linked (KRX:096770) Key Capture Energy’s KCE NY 34 project in Saugerties was facing organized opposition, nearly 600 petition signatures, and potential Article 78 litigation as of May 5, 2026.

To replicate and customize this with your Claude or Codex, see our idea file.

Flat Circle mentioned on Bloomberg

Thanks to Justina Lee for mentioning Flat Circle in her piece last week: AI Bots Auditioning for Wall Street Trading Are Mostly Losing. AI trading arenas are public experiments where agents research stocks and prediction markets and make trading decisions, and we track all of them here.

Follow for more creative agent use cases

Subscribe now

If you would like to discuss incorporating agents into your research process, reply to this email or reach out via X or LinkedIn.

Creative agent use cases - Lexical pivots, strategic biases, cloning yourself, say-do scores

Jim Moran — Tue, 05 May 2026 12:03:52 GMT

Flat Circle tracks creative uses of LLMs in hedge funds. Join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

New feature: Idea Files

The most popular all-time links in this newsletter are:

Daily macro report based on Polymarket odds changes (@pathikrit_wrick)
Evasion detection in earnings call transcripts (arXiv)
Say-do score: identifying whether past CEO promises bore out (@dahu7744)
Scoring the expertise and track record of every CEO (@FundamentalEdge)
Identifying discrepancies between press releases and transcripts (InfoArb)

Investors want creative ways to generate new insights by using agents to process unstructured info on a large scale.

The problem is ideas aren’t enough. Every reader still has to do the same work building the agents, troubleshooting, QAing and analyzing the data before they know whether it will be useful in their process.

Starting today, we’re including “idea files” with certain use cases. An idea file is a detailed prompt you can use to recreate and customize an agent use case in Codex or Claude Code/Cowork (for more on idea files see here). We’ll host the prompt as a GitHub Gist, and include example data as well.

Using an idea file is simple: simply paste the link into your Codex or Claude along with any modifications you’d like and hit submit.

We hope this will improve the newsletter and make it easier to include these use cases into your process.

Creative use cases with Idea Files

Track what brands different models are recommending for top consumer categories (Original Idea: r/ClaudeAI, Idea File, Example Dataset)

Recreate and customize using our idea file

Compare recent earnings transcripts to flag ‘lexical pivots’ by management (Original Idea: Uncle Equity, Idea File, Example Dataset)

window one read every Q12026 earnings transcript from every company in North American consumer discretionary above 2 billion summarized structurally flag any company where tone on pricing power inventory or consumer health materially shifted from Q42025... By 7:41 AM, Claude flags a mid-cap specialty retailer: management pivoted from ‘robust’ to ‘resilient’ quarter-over-quarter, forward-guidance answers dropped from seven to two.

Recreate and customize using our idea file

Creative use cases from newsletters

Use an agent to score the distance between a company’s “AI rhetoric” and its concrete financial KPIs - for example, mapping claimed efficiency gains against actual compensation expenses and headcount data. Companies whose rhetoric matched their financial reality generated a 41% 12-month return and significantly outperformed those with high discrepancy (Terminal-X.ai)

How to build a “capital allocation scorer” - uses Claude to map five years of historical cashflow statements to management discussion sections, then traces the outcomes of buybacks, M&A and dividend decisions (AI Investing Lab)

How to build a real estate deal pipeline that ingests broker emails, offering memoranda and CoStar exports (AI Consulting Network).

Last-100-declined-deals log with rationale. This is the gold standard input. The model learns far more from “why did we pass” than from “what do we like.”

BlackRock built RockAI, a vibe coding tool with data access and convernance, for its employees to vibecode safely (WSJ, thank you to Matt Robinson)

Creative use cases mentioned on X

Ex-Meta AI founder shares a system to turn podcasts into a knowledge base (@omarsar0)

The agent (Opus 4.7) spots important insights, does deep analysis, and generates thought-provoking observations that really get me curious to research further. All the research goes into a self-improving wiki for later use by any of my agents.

Investor on X shares prompt to grade management on say-do score, transparency, discipline and shareholder alignment (@kaizen_investor)

An investor recommends feeding Claude “The Visual Display of Quantitative Information” by Edward Tufte before prompting it to design charts (@ClarkSquareCapital)

IR teams may be building AI simulators of their top sell-side analysts to stress test their earnings call scripts (@rchikballapur)

An investor recommends thinking of agents as your clones that can be delegated to perform tasks you’d otherwise not prioritize (@FundamentalEdge)

> A clone that listens to every public statement from every competitor, whether on an earnings call, investor conference or podcast, pinging you with relevant read-thoughts
> A clone that gives you the devil's advocate on every position, encoded with your own custom thesis creep prevention checklist
> A clone that does a deep proxy/form-4 analyses on equity incentives for all of your management teams
> A clone that helps you analyze the buy-side whisper on every name heading into print

Creative use cases mentioned in papers

New paper analyzing 16 years of mutual fund outlook reports (Princeton.edu)

Using an LLM-based approach, we extract each fund’s outlook for the domestic equity market and decompose it into beliefs about macroeconomic fundamentals, beliefs about government policy, and a residual component that we interpret as fund sentiment. This decomposition yields a clear hierarchy of information content: funds’ policy beliefs significantly predict subsequent market returns, while beliefs about macro fundamentals and sentiment offer little or no forecasting value.

HBS paper reveals models are biased toward certain business strategies over others (HBR.org). What will be the implication of this as LLMs increasingly become thought partners for investors and executives.

Requesting your feedback

A few investors have shared a challenge around assessing whether an insight generated by an agent represents a variant perspective from consensus. Is this a problem for you? We are working on a solution and would love to discuss - please reply or reach out.

Follow for more creative agent use cases

If you would like to discuss incorporating agents into your research process, reply to this email or reach out via X or LinkedIn.

Subscribe now

Creative AI Use Cases - Apr 16, 2026

Jim Moran — Thu, 16 Apr 2026 13:36:38 GMT

Flat Circle tracks creative uses of LLMs in hedge funds. Join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

Custom datasets you can build with Claude Cowork

Flag discrepancies between earnings press releases and transcripts (InfoArb)

Yale finance professor creates a custom dataset of tariff exposure (Paul Goldsmith-Pinkham)

Extract supplier geographies from a 10K filing, determine transit corridors and query AIS vessel-tracking databases and ICEGATE customs registries to audit and monitor supply chains (Alan Shore)

Analyze webcam footage of a fab construction site, counting building levels based on crane heights, to test consensus for WFE spending (@lfg_cap)

Official Claude for Financial Services plugin, including skills like catalyst-calendar, model-update, morning-note and idea-generation (GitHub). Related, a trader published skills around market analysis, technical charting, economic calendars and screeners (GitHub)

How to organize your second brain

Lot of discussion this week re: building your own personal knowledge base - a system of markdown files containing your investment philosophy, research, patterns, skills, notes, etc - enabling your Claude Cowork or OpenClaw to deliver exactly what you need.

It’s all about how you organize it. Models perform worse if you feed it too much information, so you want to arrange your knowledge base such that your system pulls in exactly the context it needs to make a given decision, while not missing anything important nor diluting itself with unnecessary info.

A few approaches:

OpenAI cofounder Andrej Karpathy published LLM Wiki, which organizes your knowledge into a personal wikipedia, with agents that maintain and crosslink across concept pages. Focus is getting your system to reflect your views on a given company, sector or theme (@karpathy, GitHub).
CEO of YCombinator published GBrain, a framework optimized for systems like OpenClaw where the focus is performing as many different actions as possible (GitHub)
The 5th Element star Milla Jovovich published MemPalace, a framework optimized for retrieval of verbatim documents (GitHub)
A former sellsider recommends pulling in four main concepts for each decision: (i) the most important events, (ii) the most recent events, (iii) retrieved related events, (iv) curated long term memories. This attempts to model how the human brain compiles knowledge to make a decision (@HenryChien4)

My take: the right organizational scheme depends on what you’re optimizing for and your ability to catch mistakes. GBrain is great for personal AI assistants. LLM Wiki best reflects your views and experience. MemPalace optimizes for accurately returning the right source documents. The more gold standard examples you can test against, the easier it is to experiment with organizational schemes.

New papers and benchmarks

New paper on “LLM herding” shows including buy/sell ratings significantly influences model responses, even if subsequent analysis contradicts the rating.

My take: For investor agents, sellside research can be a prompt injection: managing what insights are fed to the decision models is critical. This can be a problem when you enable a model with open web search, which can pick up sellside research headlines (OpenReview)

New agent benchmark tests ability to build LBO, lender and DCF models: GPT 5.4 outperforms Claude and Gemini models but still lags human experts (arXiv.org). Related, an investor substack tests the leading Claude models on assessing business quality and recommends Sonnet 4.6 w/ thinking mode enabled.

Updates from AI trading arenas

These are public experiments where agents make trading decisions. We track every arena here:

New arena focused on clinical trial predictions (launch post, website). Continues a trend of vertical trading arenas.

GLM 5, a model by publicly traded Chinese lab Z.ai, has been making money on Prediction Arena (PredictionArena.ai). GLM models score well on long-horizon software engineering tasks. The agent powered by GLM 5 appears to prefer betting on Kalshi markets with asymmetric payoffs (ie that trade around 1 cent for a 1 dollar payoff at time of entry.

Prediction Arena published a paper (arXiv.org). Takeaways are that models perform better when they are able to select from a larger universe of markets and that there’s diminishing returns to incremental research.

Follow for more creative AI use cases

If you would like to discuss incorporating agents into your research process, reply to this email or reach out via X or LinkedIn.

Subscribe now

Creative LLM use cases - Needles in a haystack, interviewing yourself, email for agents and OpenClaw

Jim Moran — Fri, 03 Apr 2026 12:03:56 GMT

Flat Circle tracks creative use cases of LLMs in hedge funds. If you haven’t already, join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

Which model works best for which workflow?

LLM benchmarks score models against real world tasks designed by experts. These popular ones cover various investment research workflows: accounting (DualEntry), excel (SpreadsheetBench), SEC filings (Vals), finance reasoning (PRBench), event forecasting (Prediction Arena), thought partnership (BullshitBench) and OpenClaw (PinchBench):

We’re developing investment research benchmarks for agentic systems (ie not just the models but the prompts, tools, etc), which can drive meaningfully different outcomes. If you’re interested in collaborating, please reach out.

Creative LLM use cases

Prompt and methodology for identifying ‘moving targets’ - flagging whenever management teams change what metrics they highlight (arXiv.org, AI Street)

YC partner releases open source, AI-native email inbox (@agupta, GitHub). Makes it easier to bring internal and external data into your email flow. Alternatively, new YC startup AgentMail is an easy way to give your agents their own email account

Coatue-backed long-only uses AI to condense overnight research into custom podcasts (Advisor Perspectives)

Eve also scours the disclosures of more than 13,000 companies; listens to podcasts; scrutinizes social media posts; summarizes the news; and, each morning, generates a podcast for Kishore to listen to while he drives to work.

BullshitBench: When using LLM as a thought partner, does it push back when the premise of your question is flawed? (Benchmark, @petergostev, thanks to Mark Ainsworth). Claude Opus 4.6 lets only 2% of flawed questions through, while GPT 5.4 lets 16% through.

Reddit post on using Claude Code to analyze retailers using satellite data (r/ClaudeCode)

Former hedge fund PM shares prompt for Perplexity Computer to create a guidance credibility analysis (@FundamentalEdge)

AllianceBernstein AI head uses LLMs to fill in missing time series data (post)

Researchers frequently incorporate historical data into their analysis, and data may be unavailable for certain time periods—the dreaded broken time series. Rather than throw away the series, analysts can use AI models to create fill-in data that, in the human expert’s judgement, may be sensible given the context.

TMT investor shares favorite LLM use cases (@lfg_cap)

Option scenario pricing
Portfolio optimiser
Factor / thematic correlation / analysis / alerter
Getting from 0 to 95% on new sub sector
Qualitative relative business quality analysis (based on structured questionnaires)
New ideas / needle in haystack (parsing through 1000s of emails and twitter messages) for differentiated / contrarian views on different industries / geopolitics etc

AlphaSense launches custom agents to run prompts on a schedule, and custom AI expert calls to automatically interview a panel of experts (press release)

Interviewing yourself

Several good pieces this week about documenting your personal workflow into prompts and markdown files.

Example prompt to launch voice interview session - turning open ended discussion into detailed instructions (Ben’s Bites)

More interview prompts that extract context from yourself and make your agents more effective (@Shpigford)

Lawyer discusses how he embeds his own personal frameworks into skill files (@zackbshapiro). He also says it’s impossible to infer someone’s process simply by looking at their outputs, says it needs to come directly from the person:

I’ve had people try to reverse-engineer my Claude skills by studying my outputs, using AI to analyze what I produce and reconstruct the instructions that generated it. They never get close...what my skills actually contain is not a description of what the output should look like. It’s a detailed operating procedure for how the output gets created: decision trees, analytical frameworks, sequencing logic, edge-case handling, judgment calls about when to be aggressive and when to hold back. You can’t see any of that by studying the finished product...A finished contract shows you what a great lawyer decided. It doesn’t show you how she decided it, what she considered and rejected, or the order in which she worked through the issues. The process is invisible in the product. My skills encode the process.

Follow for more investor LLM workflows

If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

Subscribe now

Creative LLM use cases - Job post analysis, say-do scores, merger arb, expert network MCPs

Jim Moran — Mon, 16 Mar 2026 16:31:22 GMT

GenAI job post analysis

We’ve been collecting hedge fund GenAI job posts over the past month to identify creative LLM use cases, and thought we would analyze them to see what else we could learn:

Funds with the most job listings

Most common technologies

Key takeaways from the data

Average annual salary in $212.5K, with a range from $150K to $300K
Teams prefer vendors for models and storage, open source for everything else
OpenAI mentioned twice as much as Anthropic
AWS is the most popular hyperscaler
While Balyasny, Millennium, Point72 dominate hiring, we did not find evidence of open fundamental-focused GenAI roles at Citadel

If you’d like the full dataset, please reach out

Creative LLM use cases

Bloomberg Businessweek used Claude to review 1,500 hours of livestream footage of influencers playing Stake, a crypto gambling site, to reveal the company was rigging bets (Bloomberg, thanks to Byrne Hobart)

Reporters used Anthropic’s Claude, a large language model, to analyze footage frame by frame and determine the balance, bet and games being played during livestreams

An investor vibe coded an LLM that identifies past CEO claims and whether they bore out into a “say-do” score for every management team (@dahu7744)

Good thread on the process of iterating with Claude Code until it can correctly generate excel models (@tomasrice_au). Related, OpenAI launched ChatGPT for Excel (OpenAI)

Guide to using Claude Code / Cowork for investment research by CEO of Daloopa (@oneThomasli)

JPAM hiring data scientist to analyze sellside notes and news to identify trending and emerging themes (LinkedIn)

Case study on Jefferies equity research internal alternative data LLM chat (Databricks)

This multi-source response surfaces analytical angles that analysts may not have explicitly requested, enabling corroboration across independent sources.

Roundup of internal LLM tools at BAM, Citadel, Point72, etc (@TheValueist)

Merger arb

A merger-arb ETF manager describes his LLM system (@JulianKlymochko):

For example, we have Equity Research Analyst agent that writes an initiating coverage report on the merger target. Next, we have our Legal M&A Analyst agent, that summarizes merger agreements and proxy statements, and our Antitrust Analyst agent, that analyzes market shares along with the DOJ / FTC, EC, China SAMR, and other global regulators would view the deal, in addition to ascribing probabilities of antitrust clearance / 2nd requests / merger challenges.

How Balyasny built its merger arb bot (OpenAI):

early feedback from merger arbitrage teams revealed that agents needed to continuously re-evaluate deal probabilities as new filings or press releases came in. The Balyasny team quickly extended agent planning capabilities and tool access, replacing a slow, manual workflow with real-time probabilistic monitoring

Expert network MCP

Third Bridge launched an MCP for its transcript library (press release). AlphaSense offers one as well but GLG and Guidepoint currently do not have public MCP or API endpoints.

My take: seems like all the major transcript libraries will offer MCP / API access soon. Demand for transcript content should meaningfully increase as agents aren’t limited by cognitive ability to process transcripts. However, I expect pricing to fall even more as the search cost across providers will go down. Networks will compete on their ability to access experts exclusively - and unique experts will benefit accordingly. I also wouldn’t be surprised to see consolidation.

Follow for more investor LLM workflows

Subscribe now

If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

Models deliver Beta, Humans deliver Alpha

Jim Moran — Thu, 05 Mar 2026 15:26:28 GMT

Flat Circle tracks creative use cases of LLMs in hedge funds. If you haven’t already, join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

Models deliver Beta, Humans deliver Alpha

Harvard study shows AI can predict 71% of mutual fund trading decisions, but the remaining 29% of trades generate the most alpha (Matt Levine via Justina Lee)

Howard Marks describes the human x-factor (Oaktree LP Letter)

Great investors…have to be strong exactly where Claude admits AI might be weakest: in dealing with novel developments where there’s not enough prior experience for dependable patterns to have been compiled (and learned by AI during its training). They also have to make subjective decisions regarding qualitative factors and exercise taste and discernment. For instance, choosing the right counterparties has played an important part in Oaktree’s success. How will AI make judgments of that sort? And there’s something else: AI doesn’t have skin in the game. It doesn’t feel the weight of concentrated positions or the fear of capital loss. Its willingness to take risk might not be constrained by humans’ normal risk aversion. The best investors sense potential risk intuitively, and this contributes greatly to their success.
Especially when investors are dealing with new and untried products, CEOs, or industries, there can be few facts or analogous experiences, meaning we have to rely on “opinion or speculation.” Given the limitations discussed above on AI’s ability to tackle brand new situations, will its speculation about new things – as opposed to extrapolating historic patterns – be consistently superior to that of all humans? I believe there will continue to be human investors who are superior to AI, since I don’t think AI will be able to do an unbeatable job of these things.

My take: LLMs level the playing field in processing public information and increase the reward for proprietary research, personal relationships and experience

Creative LLM uses case

Hayden Capital vibecoded a pixel tracker for Applovin (APP) (LP Letter)

For example, I recently “vibe-coded” our own Applovin Axon Pixel Tracker, to track Applovin’s new ecommerce push (LINK). The program scans the top 100,000 ecommerce websites, and whether they’ve adopted Applovin’s ecommerce tools – useful for us tracking adoption in real time. I did this all with Claude Code, in just a couple hours over a weekend, and runs on Amazon’s AWS.

Norges Bank uses Claude to monitor foreign language media for ESG issues in their portfolio (CNBC)

“Often, this information has not been captured in international media coverage or data vendor alerts…In multiple instances, we identified and sold these investments before the broader market reacted to the risks, avoiding potential losses.” NBIM said using AI this way had been particularly valuable for researching smaller companies in emerging markets, where news about the firm may be limited to small media outlets in local languages.

VC shares which of his workflows are mostly code vs mostly LLM driven (@ttungaz)

BlackRock and Schonfeld are hiring AI engineers for post-trade operations (BlackRock, Schonfeld)

AWS Bedrock post on using their graphRAG workflow to analyze 10-K filings and identify shared risk relationships across the S&P 100 (AWS)

Rutgers finance professor shares 8 tips for using OpenAI Batch API (50% cheaper) for large scale transcript analysis (LinkedIn)

Interesting paper

Compares LLM use in stock pitches on Seeking Alpha vs r/WallStreetBets. AI drives better returns in the former, more professional community, while on r/WallStreetBets AI drives abnormal trading and lottery outcomes. Interesting take on how LLMs may impact retail and institutional trading (NBER)

New tools

Review of Claude Cowork, a non-technical desktop version of Claude Code (Buyside AI Reviews). When asked to turn a lender presentation into a leveraged loan screening model, the tool made several key errors:

I do think people underestimate (i) the amount of time it takes to check/correct output and (ii) the willingness of senior folks to actually do the checking. And given the black box nature of LLM reasoning, the checking needs may not scale down as fast as AI capabilities scale up.

More: Claude launches Cowork and plugins for finance (announcement)

Checkmate, another AI expert call service, launches (CheckmateResearch.ai). Other services where LLMs source and conduct interviews include: AlphaSense, Guidepoint, NewtonX, Qualitate, Ribbon, Synquery. Interesting reddit thread where experts debate the future of this format: “Please boycott ai mod calls”

FirstDraftResearch, which looks like a “cursor for public market investors,” announces private beta (@atelicinvest)

Bloomberg launches - conversational AI interface (Bloomberg)

Follow for more investor LLM workflows

Subscribe now

If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

Creative LLM workflows - OpenClaw

Jim Moran — Mon, 23 Feb 2026 13:03:20 GMT

Flat Circle tracks creative use cases of LLMs in hedge funds. If you haven’t already, join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

OpenClaw

OpenClaw (fka ClawdBot, Moltbot) allows a computer to run an LLM system in an “always-on” way and interact with almost anything - making it feel more like a proactive analyst that you can train and work with via slack, email, signal, etc.

How to build an OpenClaw investment research analyst (Saulius)

OpenClaw changes the equation. It is an open-source, self-hosted AI agent platform that runs persistently on your machine, connects to every messaging platform you use, and has access to a full suite of tools -- file operations, web search, browser control, code execution, and long-term memory. It does not just answer questions. It reads research reports, builds financial models, monitors markets around the clock, learns from its own experience, and proactively alerts you when something needs your attention.

Institutional Investor warns funds not to build their own OpenClaw (Institutional Investor)

Data scientist at real estate asset manager analyzes 120 data center projects by talking to OpenClaw via WhatsApp (Infrastructure Research)

Two funds already hiring engineers to build with OpenClaw (job post, job post)

Upwork post by small event driven fund requesting an OpenClaw screening system (job post)

We are actively exploring how to safely use OpenClaw for investor workflows, please reply to this email to discuss.

Creative LLM workflows

Founder of AI native hedge fund details how he builds with Claude Code (@thomasrice_au)

If it's front end or something we interact with, I start by describing what I need to do and initial ideas for interface. I'll then generate 30 mockups (10 each from GPT, Claude, Kimi), asking them to make each quite different, and to make each one an isolated html file with embedded JS and CSS. I'll then go through the 30, dismiss most, keep a few, then keep iterating until it feels right for what I want it to do.

LLM workflow to analyze new 13F holdings and generate a list of relevant positions and reason each fund likely owns them (@FundamentalEdge)

Bonus: New feature from Polymarket allows anyone to pledge rewards to encourage more research (@polymarket thanks to @adrien_nav). Currently the markets with the largest sponsored rewards are for whether the US strikes Iran, Fed decisions and the S&P.

Interesting jobs

Point72 hiring GenAI engineer focused on alternative data (job post)

Updates from trading arenas

In Prediction Arena, all models turned negative this week (predictionarena.ai)

My take: it appears the models made a few concentrated bets that were impacted by extreme weather and a surprising unemployment report. Like a lot of the other trading models, they often win for a period but are actually selling vol. Read more on AI trading arenas here.

Research paper outlines findings from nine-month LLM driven trading strategy. Findings include a 2.43 sharpe and skill at identifying longs but not shorts (arXiv.org)

Chamath Palihapitiya publishes paid post covering trading arenas (Substack)

Follow for more investor LLM workflows

Subscribe now

If you would like to discuss incorporating LLMs or OpenClaw into your research process, reply to this email or reach out via X or LinkedIn.

Creative LLM use cases - Dan Loeb, Baupost, Altimeter, YCombinator, Ralph Wiggum

Jim Moran — Wed, 11 Feb 2026 13:02:55 GMT

Flat Circle tracks creative use cases of LLMs in hedge funds. If you haven’t already, join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

Investor LLM workflows

Dan Loeb asked Claude which public companies it might disrupt (Third Point 4Q25 LP Letter)

A simple query into Claude’s chatbot: “Which companies is Anthropic capable of dislocating or disrupting?” yields some fascinating results and was in our view a fruitful source of hedges for our firm.

My take: the subtext here is that models are evolving into the source of conventional wisdom.

AI driven short seller Abelian Analysis analyzed hundreds of Youtube transcripts to assess the pricing environment for CVNA’s used vehicles (Short Report, Github)

Each transcript was analyzed by Claude Sonnet 4 using a prompt designed to separate market conditions from creator mood. This distinction is critical. A dealer complaining about thin margins is telling you margins are compressed — a bearish market signal. A flipper excited about “deals everywhere” is telling you inventory is high and prices are soft — also bearish. The LLM was instructed to ignore emotional spin and extract the underlying market reality across three categorical signals (inventory direction, demand strength, repossession activity), two continuous scores (bullish 0-100, bearish 0-100), and a sensationalism rating (1-10) that we use for quality control.

Investor from Altimeter Capital outlines two LLM workflows, including a PDF example from a “council of LLMs that rigorously debate topics with access to web search” (@_clarktang, thank you @realLigerCub)

An RIA PM shares his deep research prompt (@TedHZhang)

AI-native hedge fund, Minotaur Capital, used “Ralph Wiggum” style iterative research loop to determine that gaming stocks were oversold following the Genie 3 release (Minotaur January Letter)

We immediately spun up a research process using the iterative techniques we described in our December Quarterly. From a 127-word prompt asking for implications on the games industry, our AI system iteratively chose what to explore: value chain analysis, five-year scenarios with falsifiable signposts, unit economics ($/minute cost models), IP and licensing questions, and a winners/losers matrix across engines, platforms, and publishers. Over 50 iterations it built out each section, cited sources, and stress-tested its own conclusions.

Former Baupost investor Dave Plon shares an AI workflow around killing ideas faster: develop a list of non-negotiables (e.g., CEO compensation structure, guidance track record), and have the system eliminate every name in your coverage failing those non-negotiables. (Business Breakdowns, 12m 50s)

Former Maverick / DE Shaw / Citadel PM shares his prompt to analyze the technical setup (@FundamentalEdge, thank you @realLigerCub)

New paper analyzes the impact of LLM tools like ChatGPT on price reactions during earnings calls. Stocks don’t react faster, but at a greater magnitude after a delay due to model latency and transcription availability (Price Discovery Within Earnings Calls, thank you Justina Lee)

Cons to LLM investor workflows

An investor argues LLMs make it harder to build conviction (@evrgn11112231)

I view investment research as akin to the slow LLM training process. It’s not supposed to be fast. The goal is to ingest raw data over long periods of time to train your brain (the ultimate LLM) for instant recall and pattern matching later.

Former credit investor warns on using LLMs for initiation style reports, as they often miss key events that would materially alter the narrative (@BuysideAIReview). My take: the quality of an LLM workflow is only as good as its eval. Before asking for a deep dive, first build a list of key events, transactions, players, etc - then run deep research as a loop until it hits all the required items. See the “Ralph Wiggum” discussion above.

Interesting LLM tools

Former mega fund and credit hedge fund investor reviewing every buyside AI tool (Buyside AI Reviews)

Former hedge fund PM previews an insights feed built from earnings call transcripts (@atelicinvest)

New tool benchmarks the stock impact of short reports / short selling firms (ShortReportImpact)

YCombinator Request for Startup: AI-Native Hedge Funds

YCombinator just published its latest Request for Startups:

…the next Renaissance, Bridgewater, and D.E. Shaw's are going to be built on AI. The biggest funds in the world have been slow to adapt. I worked as a quant researcher at one of these funds, and when I asked compliance to let us use ChatGPT, I didn't even get a response. It made it clear to me that the hedge funds of the future won't just bolt AI onto their existing strategies. They'll use it to come up with entirely new ones. That's where the alpha is.

Now tracking every AI trading arena

AI trading arenas are public experiments where LLMs perform research and trade in a live environment. They are one way to track LLM progress in making investment decisions.

Our new page tracking every public arena is here: AI Trading Arenas

Key takeaways include: (i) the median model always loses money, (ii) newer frontier models outperform the older models, (iii) soon-to-be-released Grok 4.2 is undefeated, (iv) Claude has not yet won any trading arena.

Follow for more investor LLM workflows

Subscribe now

If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

Creative LLM use cases - Illusion of competence, management exploits and other vulnerabilities

Jim Moran — Tue, 03 Feb 2026 19:54:06 GMT

Flat Circle tracks creative use cases of LLMs in hedge funds. If you haven’t already, join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

Vulnerabilities in LLM investment research

AllianceBernstein Chief AI Officer warns about management altering phrasing to exploit LLMs earnings calls summarizers:

“Companies know we are measuring sentiment, so they adjust. They’ve started using more positive words, even with bad news.” That forces investors to evolve. “If I focus on the prepared remarks, sentiment scores are high. But in the Q&A, it’s much harder to control. That gives you a better read. “It’s a cat-and-mouse game,” he adds. “You have to keep improving to continue to generate alpha.”

New research on hidden text attacks in automated trading systems. A vulnerability when your LLM web search agent discovers text hidden from human readers:

My take: As investors increasingly leverage LLMs, the market will respond with new attempts to manipulate them. Point72 just posted a role for a GenAI Security Engineer. We are experimenting with a few approaches here, so if you’re interested in this problem, please reach out.

Another LLM risk: Illusion of competence

Two LP letters issue similar warnings on the increasing use of LLMs in investment research:

O’Keefe Steven - 4Q 2025 Investor Letter:

The more concerning dynamic is the illusion of competence. There is a risk that access to more contextually rich output leads to overconfidence in areas where the user lacks actual domain expertise. Nowhere is this more dangerous than in highly regulated, technically complex industries like healthcare or energy, where surface understanding is insufficient for investment decision-making. We expect many market participants to expand into unfamiliar sectors with misplaced confidence, armed with tools that enhance comprehension but not judgment

Upslope Capital - 4Q 2025 Investor Letter

AI is also everywhere – particularly on the desktops of buyside analysts and PMs. I suspect this technological shift is part of a not-so-virtuous cycle with the cultural shift towards gambling. A couple years ago legendary investor Stan Druckenmiller noted how he made a quick bet on Argentinian stocks with an assist from AI: ‘…do you want to hear how I invested in Argentina? It’s a funny story…I saw the speech in Davos and it was about 1:00 in the afternoon in my office. I dialed up Perplexity [AI] and I said, give me the five most liquid ADRs in Argentina…It gave me enough of a description that I follow the old Soros rule, invest and then investigate. I bought all of them. We did some work on them. I increased my positions and so far, it’s been great.’

More investor workflows

TMT / Energy investor lays out how he orchestrates sub-agents using the “Great Architect” framework (@TheValueist)

Head of AI at Manulife shares strategies to drive internal adoption (AI Street)

Former Capital Group partner and founder of new LLM driven hedge fund uses LLMs to analyze 2026 outlooks from the top asset managers (Linkedin)

We used a language model to extract thousands of individual statements and organize them by topic and time horizon. Similar ideas were grouped and weighted by how often they appeared across independent firms…On the environment, there was broad agreement….Firms disagreed on where returns are most likely to come from, how durable US market leadership will be, the timing and impact of policy easing, and how investable AI is at current valuations. These differences were not about facts. They reflected judgment…

Slides from new NYU Stern course on AI in Finance (Substack, Slides)

RAG vs GraphRAG

The platform funds are building AI teams to develop workflows for their PMs and analysts. One of the top required skills for these teams is advanced retrieval augmented generation (RAG) techniques such as GraphRAG - an open sourced framework developed by Microsoft Research. Example job post: E.g.,

Millennium: Senior GenAI Engineer - Advanced Rag

Enrichments and Knowledge Graph Construction: Move beyond flat vector search by building GraphRAG systems and advanced annotations such topics, keywords, sentiment, etc. You will extract entities (Companies, People, Metrics) and relationships from text to build a dynamic Knowledge Graph that captures the nuance of the financial markets and its temporal aspects.

Basic vector RAG, which you would experience by attaching files to ChatGPT or NotebookLM, searches documents for relevant excerpts and simply attaches them as context to your prompt. GraphRAG indexes your corpus into entities and relationships then uses that structure to synthesize answers that aren’t obvious from any single chunk. Funds are hiring for this knowledge graph retrieval frameworks since they see their edge buried in writeups, interviews, surveys, notes and other longform text and want to maximize second-level insights.

Update on trading arenas

Soon to be released Grok 4.2 is the only model making money trading weather prediction markets (PredictionArena.ai)

In another arena featuring Grok 4.2, Alpha Arena, xAI’s forthcoming model won as well. We cover trading arenas in more detail in an earlier post.

Follow for more investor LLM workflows

Subscribe now

If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

Creative LLM use cases - Prompts for evasion, switching costs, risk arb, and hidden real estate value

Jim Moran — Mon, 26 Jan 2026 14:57:49 GMT

Flat Circle tracks creative use cases of LLMs in hedge funds. If you haven’t already, join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

Detecting evasion

Generation IM ($21B AUM) shares a couple workflows in their 4Q25 letter:

“Our ‘First Looks’ initiative serves a different function. When analysts evaluate a new company, we leverage AI to provide a snapshot overview with green, yellow and red flags drawn from sources like Glassdoor reviews, which used to take hours of manual work. Finally, our ‘Deception Detection’ dashboard analyses earnings call transcripts across the portfolio, flagging watchlist topics and potential areas for forensic accounting review”

New paper features prompt and system design for identifying evasion: EvasionBench: Detecting evasive answers in financial Q&A via multi-model consensus and LLM-as-Judge. The key is to score each response on the degree to which management (i) answers the specific question, (ii) introduces irrelevant framing, (iii) relies on generalities, and (iv) deflects. Their eval is a set of 1,000 human annotated scores. Prompt is on page 12.

Automated expert network interviews

A couple things this week:

Tech investor shares his experience with AlphaSense’s new AI-led interview product, with example outputs
Former Chief Data Scientist at Third Point covers Ribbon.ai’s new expert-network-in-a-box: The biggest disruption to expert networks since Tegus

You can now source 1B+ experts leveraging their tool and then instantly book an expert call leveraging their voice AI, which then generates a transcript, which you can then leverage all their tools to analyze the transcript. All of this is white-label and available via API.

Expert networks utilizing AI interviews include: AlphaSense, Expert Insights, Guidepoint, Qualitate, Ribbon.ai, Synquery (Email me any I’m missing, and I’ll compile and circulate a longer list)
I’ve also heard of multiple funds building this internally

My take: I’ve been skeptical of AI-led interviews because I worried they’d overindex to lower quality “professional experts” and they wouldn’t create the chemistry eliciting deeper conversation. While definitely a limitation, I underappreciated how much better this experience is for the expert: an LLM is available around your schedule, they’re never rude to you or cancel on you. And there’s no chitchat. Also, investors don’t do more interviews because they’re constrained on time/mental-energy not so much on money. LLM driven interviews may dramatically increase both the supply and demand of expert interviews this year.

Software switching costs

Former long/short PM shares his ChatGPT thread on finding low mission critical software (@atelicinvest):

Prompt: Help me build a first-principles framework to identify low mission-critical software by analyzing integration depth, switching costs, compliance exposure, workflow impact, and behavioral indicators (e.g., low engagement, discounting, promotions), without relying on retention metrics.

More investor workflows

Former Healthcare PM vibecodes datasets his internal datateam would have never prioritized (@FundamentalEdge)

My take: Brett also mentions the notion of “artisanal workflows” which I think is a good concept here: everyone’s vibecoded research systems are going to be unique.

Fintool founder shares lessons from two years building investment research agents. Very thorough, technical and good (@Nicbstme)

Prompt for finding risk arb filings (@RodAlzmann).

Prompt for first- and second-order impacts of tariffs, sanctions and export controls in a research paper on geoeconomic pressure. Starts page 10.

Tool using Opus 4.5 to estimate hidden real estate value (@AltayCapital)

TMT and Energy investor shares experience building a 10-K deep dive tool (@TheValueist)

Interesting papers

What does it take to be a good AI research agent? Studying the role of ideation diversity. Two good things in here: (i) as models get better at using tools, LLM systems designed for greater diversity of ideas will outperform, and (ii) changing temperature doesn’t really help.

Taxonomy-aligned risk extraction from 10-K filings with autonomous improvement using LLMs. Written by the team behind AI tool Massive, so they don’t share the prompt. But it’s a good framework for classifying a large universe of companies into a customized set of risk factors. A related tool was recently released on X (@JaredKubin).

Updates from AI trading arenas

These are public experiments where LLMs make trading decisions. Longer summary of this trend is here; takeaway is that all trading robots eventually lose money but worth monitoring because the newer, more expensive models are starting to lose less money.

Aster - Humans vs AI: traders compete with robots for a $150K prize pool. Right now the AIs are being the humans on average, but nine out of the top ten traders are human. This feels about right!

Okbet Arena: 5 models compete at placing bets on Polymarket. All models are losing money, and right now GPT-5.1 is in the lead while Deepseek R1 is in last place.

FinDeepForecast: Live benchmark based on a new paper: FinDeepForecast: A live multi-agent system for benchmarking deep research agents in financial forecasting. Basically identical results to Okbet Arena.

Follow for more investor prompts and workflows

Subscribe now

If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

Creative LLM use cases - Podcast agent, investor letters, synthetic panels

Jim Moran — Thu, 15 Jan 2026 22:28:07 GMT

Flat Circle tracks creative use cases of LLMs in hedge funds. If you haven’t already, join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

Interesting use cases

Have an LLM listen to every industry podcast and alert you with anything relevant using our new Podcast Agent

26 million podcast episodes are published each year. Many are interviews with management teams from public companies, their competitors, customers and suppliers. Based on interest from several funds, we built a simple tool that listens to every podcast and alerts you of anything incremental to your coverage. Couple examples from this week:

ORCL: VP of a 17 hospital system discusses their recent decision to transition off Oracle Health (Becker’s Healthcare Podcast)
COTY, ULTA: Ulta’s SVP of Ecom discusses their new marketplace, how it plans to bring emerging brands into their brick and mortal channels (Omni Talk Retail)

We plan to add more sources of content to this over time. Sign up for free.

Askelladden Capital discusses its process for ingesting and scoring fund LP letters

I’ve built a tool that reads investment letters of other fund managers, ignore all macro / philosophical discussion, extract only single-ticker investment ideas, summarize them, and scores them against a rubric based on our historical priorities. That rubric was – drumroll – drafted by AI after reading years of our letters, then subsequently refined by me.

My take: If you’re building this agent for your process, Yellowbrick is a pretty good datasource for investor theses.

Prompt from Reddit commenter to analyze 10Ks/Qs for changes in forward looking statements.

My take: Agree with the focus on using LLM to extract verbatim language vs return a conclusion.

Interesting papers

LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings. Surveys a panel of LLMs pretending to be shoppers and compares results against real panelists, finding a 90% overlap. Prompts discussed in the paper and on their github.

My take: There’s a lot of synthetic panel research vendors including Aaru, Electric Twin and Qualtrics. I think synthetic panels work almost as well as human panels, which is not that great. The best ones are calibrated against real panel and purchasing data. I’m not clear why a panel of LLMs is better than a system leveraging a single LLM, or even what the difference is. Synthetic panels are also unlikely to produce entirely new findings like humans. They also might be more useful for companies for feedback on a specific product, vs investors who are looking for fresh, on the ground information. There’s a lot of interest in this space, so will explore it more and report back with anything interesting.

New tools

My take: SimilarWeb already offers an MCP server for its existing customers to access via LLM, the difference here is their move to share data for free as lead gen. I think we’ll see a lot more of this, especially in relatively commoditized segments like web traffic data. Worth noting this was announced two weeks after Meta acquired Manus for >$2b.

Interesting job descriptions

AlphaSense/Tegus is hiring a hedge fund LLM workflows product manager

your value lies in your ~5 years of experience at a top-tier Hedge Fund…You will be the primary architect of “quality.” Design, test, and refine prompts to ensure our AI output meets the high standards of a professional investor. You will look at an AI summary or extraction and immediately know if it “sounds right” to a PM or Analyst

Millennium hiring GenAI engineer for advanced RAG

Follow for more case studies of LLMs+Investment Research

If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

Creative LLM use cases - Evaluating new CEOs, summarizing Bloomberg IB chats, charting investor narratives

Jim Moran — Fri, 09 Jan 2026 18:29:04 GMT

Flat Circle tracks creative use cases of LLMs in hedge funds. If you haven’t already, join hundreds of PMs, analysts and engineers reading each week:

Subscribe now

Case studies

Former PM at Schoenfeld, Citadel, DE Shaw shares his prompt for evaluating a new CEO (@FundamentEdge)

My take: I like Brett’s prompt because it first asks the model to do a lot of work understanding the history of the situation and what levers might be available new for management. That’s important for sizing the opportunity but for the user to see if the model’s understanding of history aligns with their own.

Australian hedge fund Minotaur Capital says new Claude Opus 4.5 model outperforms OpenAI on its internal research benchmarks (LP Letter)

On the technology front, we’ve spoken extensively about our disciplined framework for testing and evaluating large language models across different use cases. One of the challenges is how quickly the frontier shifts. Recently, we’ve been testing Codex versus Claude for writing and research tasks and have found that Claude (Opus 4.5) is currently delivering superior results. As a result, we’ve migrated a meaningful portion of our internal research workflows accordingly.

My take: New models come out frequently and even the same model can vary in performance week to week. Maintaining an objective benchmark for research summary is necessary to ensure you’re always using the highest quality model. One way I’ve seen this done is having the top LLMs summarize every earnings press release as soon as its posted, then later scoring each one against sellside recaps. A decent proxy for the best model at junior analyst type work.

JPMorgan is cutting ties with proxy advisory firms and will use in-house AI to cast shareholder votes (WSJ)

The bank will use the platform to manage the votes and the AI also will analyze data from more than 3,000 annual company meetings and provide recommendations to the portfolio managers, the memo said, replacing the typical roles of proxy advisers.

My take: Glass-Lewis + ISS is a duopoly because they aggregate voting power on behalf of their institutional clients. LLMs will cost them pricing power but I don’t see why most asset managers will follow suit and build their own.

New tools

New tool uses Gemini to chart the cultural prominence of various narratives over time (Cultural Eigenclusters, The Diff)

My take: I think something similar can be combined with price data to map the narratives around assets. Reach out if you’re interested in exploring.

Interesting job descriptions

Soros is hiring an AI Orchestration Engineer (LinkedIn)

Examples projects include tonal analysis of earnings calls or summarization of Bloomberg IB chats.

Verso Parters, a $600mm AUM SF-based hedge fund, is hiring a founding AI product engineer (Johns Hopkins)

Much of our investment research process is qualitative: expert interviews, reconstructing industry history, understanding how a company got to where it is, pressure-testing narratives, and tracking what would change our mind. The challenge is that the raw material of great research is vast and messy – notes, transcripts, filings, models, datasets, internal memos, and trade history.
Your job will be to build a research and decision-support “OS” that helps us:
execute our research process more effectively,
integrate insights across disparate data sources,
spot biases and patterns in our own analyses and trading behavior,
and ultimately ask better questions, see the ball more clearly, and make better investment decisions.

OpenAI hiring large cap research analyst from buyside or sellside (LinkedIn)

As equity research community’s interest in OpenAI grows, we are hiring a full-time role to engage closely with the analysts. Looking for an experienced large-cap equity research analyst from buy/sell-side. Arguably one of the best seats in the world to understand AI if you are curious. DM me if interested.

Millennium hiring AI engineer - equities technology (MLP careers)

Norway’s sovereign wealth fund hiring engineer for LLM workflows (LinkedIn)

Interesting papers

LLM driven investment strategies lose money over time because they struggle account for regime change (arXiv.org)

Follow for more case studies of LLMs+Investment Research

Subscribe now

If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

Can LLMs make investment decisions?

Jim Moran — Tue, 06 Jan 2026 13:06:29 GMT

Key Takeaways

There are a growing number of public experiments where LLMs make investment decisions and forecast future events
There isn’t yet evidence of models beating the market at meaningful scale or statistical significance. However, each new generation of frontier model tends to be less bad
OpenAI and Grok tend to be better at trading while Claude tends to be better at pure forecasting
The various arena designs address the various “catch 22s” when LLMs make investment decisions including: (i) choosing across many assets vs focusing your tokens on a single asset, (ii) tool use vs context size, (iii) “real time” vs randomness, (iv) forecasting skill vs data access
We will monitor these arenas going forward, as they are prototypes for eventual institutional strategies

Subscribe now

Background

Hedge funds primarily use LLMs to make their teams more efficient, but everyone secretly wonders whether the models will eventually make investment decisions on their own.

There’s a growing universe of “arenas” where models are evaluated on their ability to predict events and make investment decisions. Unlike popular benchmarks such as GDPval, GPQA and ARC, the answers haven’t happened yet so there’s no risk of contamination, and they can’t be saturated because the market makes them harder every day. Also unlike typical benchmarks, these investing and forecasting arenas exactly match a highly valuable real world task.

LLM Investing Arenas

Alpha Arena (nof1.ai)

System design: In Alpha Arena, the models trade $10K each in real money across 7 stocks. Every ~2 minutes, the system asks each model to make a buy/sell/hold trading decision with context including its current portfolio, news, trading data and the original trade parameters. The arena is actually four separate arenas, each with a unique trading goal, in order to add statistical power to the final leaderboard.
Result: All models eventually lost money, but Grok 4.20 (pre-release) and GPT-5.1 performed the best and made money in a couple instances. Claude Sonnet 4.5 and Grok 4 performed the worst
My Take: Very slick implementation of LLM trading, and I’m excited to see what they roll out in “season 2.” One challenge with LLMs is they are usually non-deterministic, meaning they don’t produce the same answer every time. So if you call your model enough it might randomly dump your entire portfolio. The nof1.ai team solved this problem by prompting the model to create a trading plan (price target, stop loss, invalidation conditions, etc), then feeding that same plan back in future calls. Another smart thing they do is ask each model about the narrative supporting each trade, and where it expects it to go.

AI Controls Stock Account (Nathan Smith)

System design: Ran ChatGPT Deep Research once per week for six months to allocate real money across a universe of small cap healthcare stocks
Result: -17%
My take: This was a great, early implementation (especially by a high school student!). A challenge with this approach is it uses one giant call to set its portfolio each week. That spreads its tokens across many, many different potential investment decisions. The name of the game is to burn as many tokens on the most valuable decisions, so I believe it’s better to build up the portfolio with many smaller decisions. This experiment also highlighted another crucial issue with LLM driven trading: portfolio construction. One of its core positions, AYTR, fell 83% when it announced failed Phase 3 trial results, and the portfolio never recovered. The problem wasn’t that the LLM should have known (the entire market was offsides), the problem was it shouldn’t have been such a large position given the source of edge had nothing to do with predicting drug trial results.

AI Investing Arena (Bobby Dhungana)

System design: Models paper trade 5 ETF (S&P 500, Nasdaq, Gold, Interest Rates, Oil). The system asks each model to make a buy/sell/hold trading decision every 30 ~minutes with context including VIX volatility, treasury yields, dollar strength, oil prices
Result: Still active, started Nov 25. GPT-5 in the lead, Claude Sonnet 4.5 in last place, though all are ~breakeven
My take: Was inspired by and has similar implementation to Alpha Arena. But I like the focus of allocating across ETFs vs individual stocks. It’s possible the generalist nature of LLMs make it better suited to allocating across sectors, vs individual stocks where they have an information disadvantage. But too early to draw conclusions as the experiment has only been running since late November.

AI Arena (rallies.ai)

System design: Models maintain their own portfolios, evaluating them every few days. Architecture includes custom MCP servers and tool calls to distill a large universe of potential investments into a few potential trades, so the decision model can focus on choosing among a few quality options.
Result: Almost every model is making money, led by Deepseek and Grok-4 but still early
My take: This architecture addresses the catch 22 of wanting the model to select among as many assets as possible, while still focusing as many tokens as possible on individual decisions. Their solution is an extensive screening step to first identify stocks at technical extremes, with unusual options flow, interesting fundamentals and near term catalysts. Still too early to draw conclusions, needs to go through an earnings cycle. Factors likely explain most of the move so far.

Flat Circle Arena (Flat Circle)

System design: Models paper traded individual earnings during 4Q24
Results: OpenAI o1 and Grok-2 performed the best, while Claude Sonnet 3.5 performed the worst. o1 performed much better than o3-mini. Opus performed much better than Sonnet. The more expensive models outperformed the cheaper ones
My take: This was an early, rudimentary effort. One advantage to focusing entirely on earnings is the results are “pure idio” - i.e., market and other factors have limited impact on the returns, you’re almost entirely measuring LLMs’ ability to beat other investors. While these results were promising, results in subsequent earnings periods deteriorated as they entered different market environments (ie liberation day, AI capex boom). Another limitation of this strategy was focused on large cap stocks. It’s possible LLMs are more effective on the longer tail where there’s less competition.

LLM Forecasting Arenas

Related, there are a handful of “forecasting arenas” where instead of investment decisions, models bet on prediction markets or forecast events.

My take: Forecasting arenas are a purer benchmark on LLMs’ ability to predict the future, and it turns out LLMs are pretty good at it. Results genearlly show leading forecasting models having a winning hitrate while betting on Polymarket or Kalshi (though unclear if good enough to win at any real scale).

Even the best models aren’t able to beat the best human forecasters (not sure they ever will, as the best forecasters also have access to LLMs).

Claude Opus and Sonnet appear the strongest at pure forecasting (unlike in the investing arenas, where they’re often the weakest). How could this be true? One theory is that Claude has the most analytical rigor (using baserates and proper scenario analysis) but weaker access to tools like google / x.com search that are more important for investing. This is where Grok, Gemini and OpenAI are strongest.

Catch 22s when LLMs make investment decisions

These arenas show various ways to address the “catch 22s” in LLMs making investment decisions:

Wanting the model to select as many possible asset vs. focusing all your tokens on a single asset?
Providing access to as many tools as possible vs. managing to an optimal context size?
Allowing the models to make decisions in “real time” vs more randomness the more times you call the model?
Choosing the best forecaster model (Claude) vs the ones with proprietary data access (Gemini, Grok)?

There’s no evidence yet of LLMs beating the market with any scale or statistical significance. However, there’s going to be a lot of new models and architectures released in 2026. With these improvements, we expect to see more institutional focus on LLMs making investment decisions.

Follow for more on investing arenas and other creative LLM use cases

Subscribe now

If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

Creative ways hedge funds are using LLMs - Dec 30, 2025

Jim Moran — Tue, 30 Dec 2025 19:02:28 GMT

Grading sellside analysts

A former sellside analyst used an LLM to analyze 10 years of earnings calls and identify analysts with the most prescient questions. Examples:

Umer Raffat from UBS - Biogen (BIIB) Earnings Call - Oct 2019:

…the implication in the data is that the high with insufficient exposure at the high dose, the second trial worked as well. But when we look at CDR Sum of the Boxes low dose actually looks more consistent than the high dose…how spot-on is that finding on patients that had a sufficient exposure?

LLM reasoning:

Raffat immediately spotted the fragility of the efficacy claim—dependence on a post-hoc exposure subset rather than the pre-specified ITT, foreshadowing the FDA advisory-panel scepticism and eventual withdrawal.

Return since: -20%

Vivek Arya from Bank of America - Broadcom (AVGO) - Dec 2023:

yesterday one of your peers suggested that the market for AI accelerators could be as large as $400 billion... how does Broadcom participate... and what does this larger AI accelerator market imply for your Ethernet networking business?

LLM reasoning:

Arya’s question forced management to quantify AI’s revenue contribution and linkage to networking months before the Street built outsized AI models. The $400B TAM figure sounded extreme in Dec-2023 yet is now consensus; Ethernet’s role in AI back-end networks—dismissed by many at the time—is today a core driver of Broadcom’s upside.

Return since: +338%

My take: Great LLM use case, figuring out which analysts were right for the right reasons. Would be interesting to run something similar on initiations / ratings changes, controlling for sector returns, comparing insights vs peers. Of course, I’d want to know what these analysts are asking about next (here’s the analyst recommendation pages for Umer Raffat and Vivek Arya).

Grading your own analysts

Walleye Capital records all internal conversations to analyze which of their teammembers are most prescient

…we really record every single Zoom, every single call.…So a big part of my job overseeing the risk of the firm, the chief investment officer title, every single morning, me and my risk team, sort of in the control center of running this giant process we have our risk calls and those are all recorded and we can go back and say, hey what were we talking about at this time? And continually have LLMs that are, that are processing those transcripts and helping us to both remember and provide insights and ultimately be a bit predictive, which has been hugely helpful just in that exercise, which is— We haven’t sort of talked about where I think this is going in the power of all this. And I mean, I do believe that we’re that we’re a leader. I don’t want to say we’re the leader because I definitely don’t know what other firms are doing, but I certainly think that we are a bit more advanced in our thinking of how to use these tools, but we’re just scratching the surface of what is possible once you actually start connecting all bits of information within the walls of the firm…

My take: Would recommend an intermediate step focused on returning verbatim excerpts first, folks are going to want a lot of auditability.

Finding short opportunities in bond indentures

LLMs may have identified the opportunity to accelerate Avid Bioscience’s debt and short their stock. Last March, Avid Biosciences received an acceleration notice because it had failed to remove a restrictive legend on its 2026 notes, causing it be in default. Avid shares declined 28% when it revealed it needed to raise $160mm in private placement to redeem the notes.

Byrne Hobart covered this in The Diff, concluding:

And, right now, it’s suddenly gotten much easier to do this at scale: you can unleash LLMs on indenture agreements, and try to find edge cases that the company didn’t think of or notice. These will all be technicalities in practice; in the Avid case, if the restriction had been a big deal to the note owner, they probably would have noticed right away. But, perhaps coincidentally, they only noticed after the newly-widespread availability of tools that can trawl through vast amounts of text to extract useful information.

In Money Stuff, Matt Levine weighed in that he was skeptical an LLM found this opportunity.

My take: Hard to be sure, but I think this opportunity *was* found by an LLM because the acceleration notice was received just two weeks after Google released Gemini 1.5 Pro - the first time a 1mm token context window was generally available - enabling the analysis of huge documents. Would have been straightforward for any fund to cycle through indentures to identify technicalities that could merit an acceleration notice. In fact, this would make a pretty interesting eval for new models that get released: run them against a huge corpus of indenture agreements and see what new opportunities get identified.

Iran Notice disclosures

John Friedman, CEO of Datamule, collected and published a searchable dataset of Iran Notice disclosures from SEC filings

My take: one way I think about LLMs is they enable instant creation of datasets that are plausibly interesting but not worth hiring and waiting for a human team to build.

Interesting job posts

Select buyside LLM related job descriptions:

Point72 - AI Engineer – Investment Research & Workflows ($150K-$200K)

This role partners directly with L/S equity portfolio managers, analysts, and business leadership to build innovative solutions to improve efficiency and research quality across the equities platform…

Longaeva (new Baly platform) - Research Product Associate - AI Enablement

Longaeva is adding an associate to join the proprietary research team to accelerate adoption of generative AI products across investment strategies. In this role, you will embed directly with the proprietary research and investment teams to build solutions that impact investment decisions. We are seeking a capable, technical candidate—someone able to do hands-on research product development, web scraping, and LLM/AI-powered synthesis of qualitative and quantitative data. The ideal candidate blends scrappy coding, data/information aggregation, and a strong product intuition, with a proven ability to ship projects fast and independently. You will translate our AI capabilities into actionable insights by rapidly prototyping agentic workflows, building novel research products, and driving adoption of in-house tools.

Bayview ($30b AUM credit firm) - LLM Analyst ($90K - $110K)

The Research team at Bayview Asset Management is hiring an LLM Analyst to unlock insight from large volumes of textual data, both external and internal, to inform investment theses, improve operations, and answer foundational questions about the mortgage industry and more broadly, the economy...Meet with portfolio managers, traders, marketing and servicing teams to identify and narrow down the question. Understand the business context behind each question….
Prototype quickly but evaluate rigorously: Design prompts, run experiments in notebooks and concisely synthesize results for fast iteration. Define clear success metrics to measure progress.

xAI - AI Buy-Side Finance Tutor ($45/hr)

We are seeking a skilled AI Buy-Side Finance Data Specialist to enhance xAI’s AI models by providing high-quality data annotations and inputs tailored to buy-side finance contexts. In this role, you will leverage your expertise in portfolio management, hedge fund strategies, private equity investments, venture capital deal sourcing, and high-frequency trading algorithms to support the training of AI systems. You will collaborate with technical teams to refine annotation tools and curate impactful data, ensuring our models effectively capture real-world buy-side finance dynamics.

My Take: Interesting these are mainly early career type hires, no graduate degree required. Looks like funds are happy to build on top of frontier models and popular tools. Lots of focus on prototyping and experimentation. The Grok LLM trainer hourly rate feels a little light!!

Follow for more case studies of LLMs+Investment Research

Subscribe now

If you would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

Creative ways hedge funds are using LLMs

Jim Moran — Tue, 23 Dec 2025 20:29:22 GMT

Learning from prediction markets

Rick Bhowmick, Head of Data Eng at Coatue, built a system that reads Polymarket and Kalshi and generates a daily investment newsletter based on odds changes

Check out today’s newsletter and here’s the github if you want to try yourself.

My take: I think this is really cool, and modified it to send a daily email to my extended family which has spurred some interesting discussions.

Shareholder activism

Askeladden Capital, a long-only small/micro-cap value investing firm, used LLMs in a proxy battle

During our proxy contest at AstroNova, AI tools helped us produce the extensive ISS deck and other materials that would have been vastly more costly to prepare otherwise (i.e., we would have needed to spend an incremental six figures, and/or developed far lower quality work-product). We earned endorsements from both ISS and Glass Lewis, and the incumbent board requested the CEO’s resignation. I believe, though it is of course impossible to verify, that we may well have run the first AI-powered proxy contest in history.

My take: LLMs should enable more proxy contests against smaller companies that previously weren’t ‘worth it.’ One challenge is the eval on something like this, given these are relatively infrequent and the feedback loop is so long. Will most likely be leveraged by advisory firms and investors with direct experience creating these materials.

Expert interviews

CEO of primary research firm, Kane & Company, uses LLMs to drive more meaningful diligence conversations

Before talking to experts, we run AI queries on the market and competitive landscape. This gives us a baseline of what public sources and models already know.
The baseline serves two purposes. It helps us write better questions because we know the common answers and can push past them.
It also becomes our quality control filter. When an expert’s answer matches ChatGPT’s output too closely, we flag it.
Say for example we asked ChatGPT about the Canadian IT outsourcing market before starting expert calls. It tells us growth was 18 percent. An expert later gave us the exact same number with the same framing. We know to ask where that figure came from and what assumptions drove it.

My take: one challenge with expert networks are “professional experts” - folks who make their living doing calls and are too far away from actual industry to offer true insights. It’s helpful to get a few reps in first with ChatGPT, which is often regurgitating the same content these folks read anyway. Then you’ll be better positioned to push past.

Expert interviews (again):

An AI marketing firm built an LLM system that interviews experts automatically

The only way to capture true expertise is to build an AI interviewer that earns the trust to be seen as a peer. An equal. One that asks questions so insightful the expert reveals the distinctive methodologies they’d normally only share with another seasoned professional. That is the real technical hurdle. Here is how we cleared it. …
The Note-Taker is an internal tool, invisible to the expert, that continuously analyzes the conversation. The Interviewer queries it for structured progress reports.
What the Note-Taker Tracks:
Coverage Analysis: Topics explored with confidence levels (high/medium/low).
Gap Identification: Required areas not yet addressed, prioritized by importance.
Time Status: Pacing assessment against the target duration, with wrap-up triggers.
Pattern Detection: Emerging themes, contradictions, or when the expert defaults to generic “best practices.”
Next Action: A specific suggestion for the next probe.
The key insight is that the Note-Taker returns structured data, not prose. This prevents the Interviewer from getting confused by a second voice. If the Note-Taker flags a gap in “decision-making frameworks,” the Interviewer integrates the suggestion naturally: “You mentioned evaluating channels—walk me through a recent decision where you chose not to invest somewhere.”

My take: I think we’re still a ways away from LLMs running effective investor style expert network interviews. The example in this post is for executives who want to be interviewed, and I think the best investment insights come from human run interviews, ideally unpaid ones with trusted relationships. However, the AI notetaker seems a very useful tool for human interviewers. Could enable folks who are very good at lining up phone calls with experts to conduct them, instead of having to be conducted by expert analysts.

Follow for more case studies of LLMs+Investment Research

Subscribe now

If you have any interesting examples or would like to discuss incorporating LLMs into your research process, reply to this email or reach out via X or LinkedIn.

How well to LLMs forecast company KPIs?

Jim Moran — Thu, 16 Oct 2025 04:09:09 GMT

There’s a new trend of AI forecasting arenas like Metaculus AIB and FutureX, where LLM engineers compete to forecast a broad range of future events.

We analyzed how the the top forecasting models on a discretionary hedge fund workflow.

History of forecasting

The forecasting community was first popularized two decades ago by Prof Philip Tetlock’s popular book Superforecasting. Competitions, like the Good Judgement Open and Metaculus, are months long contests where forecasting professionals predict the outcomes of various events that will resolve in the coming months such as elections, sports championships, weather and wars.

Recently, the community has started to accelerate as forecasters use LLMs to automate some of the manual research and calculation steps. Estimates that used to take hours or days can now be done in seconds. Now forecasting expertise is being channeled into specialty LLM system design so the marginal forecast can be automated.

This explains why we’re seeing these new LLM forecasting arenas thousands of individuals and teams competing in them. Historically, hedge funds haven’t had much overlap with the forecasting community - too many questions requiring too much specialized knowledge, with answers needed in too short of a time horizon.

We think this is about to change as LLM forecasting systems can provide answers in real time.

Subscribe now

Flat Circle - How Claude 3.7 makes better investment decisions

Jim Moran — Tue, 25 Feb 2025 18:18:28 GMT

Flat Circle measures the ability of language models to predict company earnings results. See our methodology for detail and disclaimers. If you haven’t already subscribed, join investors and engineers interested in LLMs+investment research here:

Subscribe now

Claude 3.7 and 3.5 make different trading decisions given the same information

Yesterday, Anthropic released Claude 3.7 Sonnet - which shows superior reasoning scores to OpenAI o1 and DeepSeek R1.

Last night, we fed Claude 3.7 and Claude 3.5 the same information about 10 companies scheduled to report earnings this morning and asked both to make BUY/SELL calls for each stock.

Claude 3.7 Sonnet offers an “extended thinking mode” allowing the user to specify a budget for the number of reasoning tokens it should use while formulating a response. While we maxed out the reasoning budget, there is a limit to how high the budget can be set and the model often doesn’t use its entire budget. Each call used only about 4 cents worth of reasoning tokens, though you can imagine a future with investors spending 4 million dollars on a single investment decision.

Across the 10 names, both Claude 3.7 and 3.5 lost money because they incorrectly recommended BUY for Sempra (SRE), which missed on revenue and guidance and is currently down ~20%.

Overall, Claude 3.7 outperformed Claude 3.5 by about 10% thanks to 3.7 correctly calling BUY for Bank of Montreal (BMO) vs Claude 3.5 recommending SELL. BMO beat earnings and is currently up about 5%.

Both generations of the Sonnet model break their reasoning steps into similar concepts, but the newer generation appears to think harder and more deeply. We’ve contrasted the two models’ reasoning on BMO below:

Credit cycle observations

Claude 3.5

Credit Quality Deterioration:
- In Q4 2024, BMO reported significantly higher provisions for credit losses (PCL) of $1,523M vs $446M year-over-year
- Management noted credit performance deteriorated more than anticipated throughout 2024
- While Q4 2024 may represent a "high point" for provisions, they are expected to "remain elevated" through 2025

Claude 3.7 - extended thinking mode

Credit Loss Cycle Appears to Have Peaked
- In Q4 2024, management explicitly stated that Q4 represented a "high point" for provisions for credit losses (PCLs) that would "begin to moderate through 2025"
- The bank has already recognized substantial credit losses in 2024 (PCLs reached $1,523M in Q4 alone)
- Interest rate cuts have begun, which should gradually reduce pressure on borrowers
- The early recognition of credit issues and proactive provisioning positions BMO well for improving credit metrics
- Prior credit cycles show the stock typically recovers strongly once problems are fully recognized and addressed

Peer commentary

Claude 3.5

Challenging Operating Environment:
- Peer commentary (RBC, Scotiabank) indicates continued pressure on Canadian consumers from higher rates and rising unemployment
- Economic growth is expected to remain subdued in H1 2025
- Competitive pressure on deposits remains high, potentially squeezing margins
- Bank of Canada rate cuts may provide some relief but timing remains uncertain

Claude 3.7 - extended thinking mode

Beneficial Rate Environment Emerging
- Central banks have begun easing cycles, with more cuts expected in 2025
- BMO management specifically noted this should be positive for both economic activity and credit performance
- Peer commentary (RBC, BNS) supports expectations for mortgage market improvement and broader economic stabilization

Analyzing recent performance

Claude 3.5

Recent Performance vs Expectations:
- BMO has missed consensus estimates for three consecutive quarters
- Q4 2024 adjusted EPS declined to $1.90 from $2.93 year-over-year
- Return on Equity has fallen below target with management acknowledging it will take time to rebuild to 15% medium-term goal

Claude 3.7 - extended thinking mode

Capital Strength Enabling Shareholder Returns
- BMO has built exceptional capital strength with CET1 ratio reaching 13.6% in Q4 2024
- The bank announced both a dividend increase and a share repurchase program of up to 20 million shares (NCIB)
- January 2025 press releases confirm regulatory approvals have been received and the buyback is proceeding
- This capital return story provides support even if operating results are mixed

Positive Momentum in Core Businesses
- Q4 2024 delivered record pre-provision pre-tax earnings of $3.5 billion (up 8% YoY)
- Canadian P&C business has shown consistent strength with record revenues
- U.S. segment's contribution has significantly increased post-Bank of the West integration
- Management indicated "net bullish" outlook for U.S. growth prospects in 2025

Claude 3.7’s observations seem to be more forward looking. Apparently this is what better investment reasoning looks like.

Interesting articles

New paper details trading system based on LLMs + reinforcement learning. Authors incorporate an LLM monitoring for changes in market sentiment to overcome the structured data limitations of traditional RL based trading strategies.

The paper compares results to o1, GPT 4o and other open source models, and corroborates our conclusions that o1 outperforms other models. However, all models appear to be beaten by the RL-LLM hybrid system discussed in this paper (arXiv)

Is AI really thinking or just pretending to? This is really the key question, and the article lays out the arguments on both sides. One good quote:

The best use case is a situation where it’s hard for you to come up with a solution, but once you get a solution from the AI you can easily check to see if it’s correct. Writing code is a perfect example. Another example would be making a website: You can see what the AI produced and, if you don’t like it, just get the AI to redo it.

… another example is measuring how the models perform in the market (Vox)

Two articles from late last year about Balyasny’s internal LLM tool:

Balyasny’s AI outperforms OpenAI in financial applications (hedgeweek)
A day in the life of an applied AI engineer at Balyasny (efinancialcareers)

Interesting LLM hedge fund job descriptions

Citadel: Commodities - Machine Learning Engineer

“Commodities have undergone an information revolution. From ship tracking to oil storage levels and crop yields, more data on supply, demand, storage, and transport is available than ever before. Commodity markets are more globally connected: natural gas markets impact fertilizer production, while agricultural markets impact gasoline production…We combine specialist domain expertise with advanced modeling techniques to solve problems that others deem unsolvable.”

DE Shaw: Software Developer - Generative AI ($225K)

“Working on greenfield projects, which offer opportunities to shape the future of GAI at the firm and make a significant impact”

Millennium: Senior AI Engineer - Equities Technology ($213K)

“We are building the next generation of Large Language Modeling applications driven by Portfolio Manager's requirements that provide immediate value and scale as a core product.”

Point72 (Cubist Systematics): NLP Engineer

“Build start-of-the-art deep learning models to process large scale unstructured datasets.”

Follow how LLMs are beginning to make investment decisions

Subscribe now

If you have feedback or would like to participate in this project, please reply to this email or reach out via X or LinkedIn.

Flat Circle - Contrasting good vs poor reasoning

Jim Moran — Thu, 20 Feb 2025 22:13:24 GMT

Subscribe now

Key takeaways

After 344 earnings, Grok-2 and o1 share the lead with a ~54% hitrate, earning ~1.1% per earnings
We contrast the reasoning approach of stronger vs weaker models regarding shareholder lawsuits
Perplexity and Grok announced their own competitors to ChatGPT Deep Research
- Agentic research systems create another dimension on which LLMs may compete with human investors: they could either reason better or research better
- Since Deep Researchers search online and can access historical information, it’s impossible to backtest their ability to make investment decisions. You need to test them live

Model accuracy

Are all models converging on a 50/50 coinflip? 2 standard deviations away from the mean of a random coinflipper is 55% after 345 flips, and it seems worrisome the hitrate has been trending down.

However, the mean profit per earnings remains very strong for o1 (113 bps) and Grok-2 (107 bps). o1 is by far the most advanced according to other benchmarks and by far the most expensive in terms of tokens, seems promising it’s earning the most. We’ll see.

Example of good vs poor reasoning?

This project provides a way to quantify correct vs incorrect investor judgement. As an example, PGY recently reported better than expected revenue and margins and provided overall certainty around its balance sheet.

Gemini Flash Thinking, which loses -1.1% per earnings, had issued a SELL:

The emergence of securities fraud investigations significantly overshadows any positive historical trends or recent ABS deal announcements. While Pagaya has shown a pattern of exceeding guidance and achieving strong growth, the serious nature of securities fraud allegations creates substantial uncertainty and risk.

We feed the models recent press releases, and apparently Gemini Flash Thinking had fixated on shareholder lawsuits related to a short seller report.

o1, which earns 1.1% per earnings, was fed the same information - it accounted for the shareholder suits but concluded BUY:

While the wave of shareholder lawsuits and the Q3 share‐price plunge indicate heightened uncertainty, the company’s expanding network of top-tier lending partners (including new bank and auto relationships) and its successful capital raises suggest solid underlying demand. In addition, management’s repeated emphasis on profitability in 2025, along with evidence of improving loan performance and rising fee margins, points to a business now positioned for stronger operating leverage as interest rates stabilize. With shares trading well off recent highs yet fundamentals and guidance still largely intact, the risk/reward profile appears favorable going into the February earnings release—making PGY a buy.

We don’t specify in the prompt whether to ignore shareholder lawsuit press releases. The returns teach us which reasoning is correct.

Interesting articles

The amount of compute per request is going to skyrocket. Major implications for datacenters and hyperscalers, and also what can and will be spent on investment decisions:

“…this single process from a single human interaction would involve 10 billion times more compute than a single human writing into ChatGPT today, at the exact same model size. That is the incredible expansion dynamic in inference compute that is playing out today and over the next few years!”

(See “Inference Compute Scaling” on Attune Research)

Extensive thread on using ChatGPT Deep Research to create an investment thesis around DoorDash (DASH). Lots of great detail, I particularly like the multiple rounds with ChatGPT to create the optimal prompt:

“I asked ChatGPT to build me a prompt for Deep Research to do Deep Research on Deep Research prompting. It read all the blogs and literature on best practices and gave me a thorough report. Then I asked for this to be turned into a prompt template for Deep Research. I've added it below. This routinely creates 3-5 page prompts that are generating 60-100 page, very thorough reports”

(@BuccoCapital on X)

Grok-3 with DeepSearch announced (Techcrunch)

Perplexity launches Deep Research (Perplexity)

Hedge Fund that replaced analysts with AI beat the market (Bloomberg)

Follow the progress of LLM investment research

Subscribe now

If you have feedback or would like to participate in this project, please reply to this email or reach out via X or LinkedIn.

Flat Circle - o1 now best performing model

Jim Moran — Tue, 11 Feb 2025 18:32:56 GMT

Subscribe now

Key Takeaways

After 209 live earnings, o1 now leads with a 57% hitrate and 130 bps mean return per earnings, followed by Grok-2 at a 55% hitrate
- 57% is approximately 2 standard deviations away from random chance after 209 coinflips
- The Gemini and Claude models appear to be approaching 50/50
The models are fairly correlated with each other, tending to make the same calls
- We’ll have to figure out ways to ensure model orthogonality before institutions start adopting LLMs to make investment decisions
Recently spoken with a large number of readers and appreciate the helpful feedback
- In addition to reporting on the leading language models and their ability to call company earnings, I plan to include other resources and news relevant to LLMs+investing

Model correlation

While the models show differing abilities, they are fairly correlated. LLMs are somewhat more likely to issue the same calls than if each model were merely flipping coins.

This makes sense as they likely share much of the same training data, technology and methodology.

This also means the prior basis of comparison of 5 models flipping coins was overly strict as we no longer talking about 5 independent models. After 209 earnings, o1’s hitrate is 57%. 2 standard deviations above the mean of a single coin flipper is 57% vs. 59% for 5 coin flippers.

I can’t deduce any patterns among why certain models are more or less correlated to others. Only thing I see is the two newest ‘reasoning’ focused models, o1 and Gemini Flash Thinking, appear least correlated with others. We’ll see if this trend continues.

Models’ orthogonality, the extent to which they are uncorrelated, will be a crucial dimension on capital allocators’ decisions to use them for trading decisions. Orthogonality across managers and market factors is essential for risk management and leverage.

Model accuracy

Comparison to other benchmarks

How does a model’s ability to call earnings compare to standard LLM benchmarks? We take each model’s reasoning score from LiveBench, and compare it with their hitrate and mean share price return on their ability to call live earnings.

The models with the best share price return are those with the highest and lowest reasoning scores. The models in the middle underperform.

“A lot of smart people think they’re way smarter than they are, and therefore they do worse than dumb people” - Charlie Munger

Upcoming earnings calls

I’m debating whether to continue listing these upcoming earnings calls at all, since the hitrates are so close to 50%. Even the hitrates where BUY or SELL calls are unanimous aren’t meaningfully more predictive. If these calls are valuable or you would like a different display of them, please let me know.

Industry news and updates

I’ve spoken to a lot of readers over the past couple weeks and am grateful for the feedback on this newsletter and the LLM systems we are building. If you and I haven’t spoken, please reach out!

For now, I plan to expand the scope of this newsletter to include news and resources generally relevant for the LLM+investors community. From this week:

OpenAI Deep Research + Open Deep Research. OpenAI released an new tool that’s helpful for investment research. It’s exhilarating to input a query, watch it conduct searches, consider the results, think of new queries and so forth. A few days later, David Zhang launched an open source version of Deep Research that already has 10K stars on GitHub. Excited to monitor development of these ‘research agents’ and their application to investing. Seems a model’s ability to reason is inextricable with its ability to research.

Model ML, LLM platform for PE and investment banks, announces $12m funding round. There have been dozens of these tools, but thought the description on their approach was interesting: “When you open Model ML, it looks a lot like Google Drive. It has its copycat versions of Excel, Powerpoint, Word, etc., which ensures that no information ever needs to leave the workspace.” Of course replicating Microsoft Office will be no small feat. But this approach seems similar to why OpenAI Deep Research is a better experience than OpenAI Operator. The fact that it’s fullstack means it can think around bottlenecks and get 10x as much done, instead of requiring the user to babysit.

Follow the progress of LLM investment research

Subscribe now

If you have feedback or would like to participate in this project, please reply to this email or reach out via X or LinkedIn.