<?xml version="1.0" encoding="utf-8" standalone="yes"?><?xml-stylesheet type="text/xsl" href="/rss.xsl"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AI on Philipp D. Dubach | Quantitative Finance &amp; AI Strategy</title><link>https://philippdubach.com/categories/ai/</link><description>Recent content in AI on Philipp D. Dubach | Quantitative Finance &amp; AI Strategy</description><image><url>https://static.philippdubach.com/ograph/ograph-post.jpg</url><title>Philipp D. Dubach | Quantitative Finance &amp; AI Strategy</title><link>https://philippdubach.com/</link></image><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>me@philippdubach.com (Philipp D. Dubach)</managingEditor><webMaster>me@philippdubach.com (Philipp D. Dubach)</webMaster><atom:link href="https://philippdubach.com/categories/ai/index.xml" rel="self" type="application/rss+xml"/><item><title>What Claude Thinks But Doesn't Say</title><link>https://philippdubach.com/posts/what-claude-thinks-but-doesnt-say/</link><pubDate>Mon, 11 May 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/what-claude-thinks-but-doesnt-say/</guid><description>&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-https:--static-philippdubach-com-claude_understanding_image-jpg-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg"
alt="Anthropic&amp;#39;s natural language autoencoder concept: an activation vector translated into a paragraph of English and back, illustrating Claude&amp;#39;s internal state being read directly"
class=""
width="1200"
fetchpriority="high"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-https:--static-philippdubach-com-claude_understanding_image-jpg-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Anthropic&amp;#39;s natural language autoencoder concept: an activation vector translated into a paragraph of English and back, illustrating Claude&amp;#39;s internal state being read directly" decoding="async"&gt;
&lt;/dialog&gt;
&lt;blockquote&gt;
&lt;p&gt;Anthropic published a method for translating Claude&amp;rsquo;s internal activations into readable English. It works well enough to ship. The press release skips three structural problems that the paper itself acknowledges.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On the &lt;a href="https://openai.com/index/introducing-swe-bench-verified/"&gt;SWE-bench Verified&lt;/a&gt; coding benchmark, a new Anthropic interpretability method called natural language autoencoders shows the model registering &amp;ldquo;this is an evaluation&amp;rdquo; on &lt;strong&gt;26%&lt;/strong&gt; of problems. On real claude.ai traffic from users who opted in for training, that number is &lt;strong&gt;under 1%&lt;/strong&gt;. Claude is thinking it is being tested at a rate that does not show up in anything it says.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-claude-thinks-eval-awareness-gap-1-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-eval-awareness-gap-1.png"
alt="Bar chart comparing how often Claude internally registers &amp;#39;this is an evaluation&amp;#39; across three contexts: 26% on SWE-bench Verified coding problems, 16% on destructive-coding safety tests, and under 1% on real claude.ai traffic from users who opted in for training. The first two bars are orange, the third is green and circled to call out the contrast."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-claude-thinks-eval-awareness-gap-1-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/claude-thinks-eval-awareness-gap-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Bar chart comparing how often Claude internally registers &amp;#39;this is an evaluation&amp;#39; across three contexts: 26% on SWE-bench Verified coding problems, 16% on destructive-coding safety tests, and under 1% on real claude.ai traffic from users who opted in for training. The first two bars are orange, the third is green and circled to call out the contrast." decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;In 1861, &lt;a href="https://en.wikipedia.org/wiki/Paul_Broca"&gt;Paul Broca&lt;/a&gt;&amp;rsquo;s most famous patient was a man he called Tan. Tan understood the world around him. He could not describe it. Whatever he tried to say came out as the same syllable: &amp;ldquo;tan.&amp;rdquo; After Tan died, Broca cut into his brain and found a lesion in the left frontal lobe. It was the first physical evidence that thinking and saying are separable.&lt;/p&gt;
&lt;p&gt;A century and a half later, the same gap is the hardest problem in evaluating large language models. Claude can pass a safety test without verbalizing why. The verbal trace and the underlying state are not the same thing, and we have not been able to read the underlying state. Anthropic &lt;a href="https://transformer-circuits.pub/2026/nla/index.html"&gt;published a paper&lt;/a&gt; on May 7th that proposes a method for it, alongside a &lt;a href="https://www.youtube.com/watch?v=j2knrqAzYVY"&gt;short video walkthrough&lt;/a&gt; that frames the work as &amp;ldquo;mind reading&amp;rdquo; for AI and uses a simulated blackmail-shutdown test as the running example. Natural Language Autoencoders, NLAs in the paper, take an activation vector from Claude&amp;rsquo;s residual stream and produce a paragraph of plain English describing what the activation encodes.&lt;/p&gt;
&lt;h2 id="how-the-round-trip-works"&gt;How the round trip works&lt;/h2&gt;
&lt;p&gt;The objective fits on one line. Take an activation vector h, write a paragraph z, reconstruct h from z, minimize the squared error. Two models split the work. The activation verbalizer, AV, takes the vector and writes a paragraph. The activation reconstructor, AR, reads the paragraph and tries to recover the vector. Both are full-size language models, both initialized as copies of the target model, both trained jointly. The AV is updated with reinforcement learning, &lt;a href="https://arxiv.org/abs/2402.03300"&gt;GRPO&lt;/a&gt; specifically. The AR is updated with supervised regression on the reconstruction loss.&lt;/p&gt;
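&lt;p&gt;The shape of that objective can be sketched in a few lines. In the toy Python below, fixed random linear maps stand in for the two full-size language models, so the names and shapes are illustrative only, not the paper&amp;rsquo;s code:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy residual-stream width; real models use thousands of dimensions

# Stand-ins for the two models. In the paper both are full-size LLMs
# trained jointly; here fixed random maps just show the loss shape.
W_av = rng.normal(scale=0.2, size=(D, D))
W_ar = rng.normal(scale=0.2, size=(D, D))

def av(h):
    # activation verbalizer: activation vector becomes a "paragraph"
    return np.tanh(W_av @ h)

def ar(z):
    # activation reconstructor: "paragraph" back to an activation vector
    return W_ar @ z

h = rng.normal(size=D)            # activation from the residual stream
h_hat = ar(av(h))                 # the round trip
loss = np.sum((h - h_hat) ** 2)   # the one-line objective: squared error
```

&lt;p&gt;In the real system the AV&amp;rsquo;s output is text, which gradients cannot flow through; that discontinuity is why the AV needs reinforcement learning while the AR can be trained by plain regression.&lt;/p&gt;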
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-claude-thinks-round-trip-1-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/claude-thinks-round-trip-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/claude-thinks-round-trip-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/claude-thinks-round-trip-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/claude-thinks-round-trip-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-round-trip-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/claude-thinks-round-trip-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/claude-thinks-round-trip-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/claude-thinks-round-trip-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-round-trip-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/claude-thinks-round-trip-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/claude-thinks-round-trip-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-round-trip-1.png"
alt="Round-trip diagram of a natural language autoencoder: an activation vector h flows into the activation verbalizer AV, which writes a paragraph; the paragraph flows into the activation reconstructor AR, which produces a reconstructed vector ĥ. The loss is the squared error between h and ĥ, with a &amp;#39;should match&amp;#39; arrow looping from h back to ĥ underneath. Punchline: FVE 0.6–0.8 after training."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-claude-thinks-round-trip-1-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/claude-thinks-round-trip-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Round-trip diagram of a natural language autoencoder: an activation vector h flows into the activation verbalizer AV, which writes a paragraph; the paragraph flows into the activation reconstructor AR, which produces a reconstructed vector ĥ. The loss is the squared error between h and ĥ, with a &amp;#39;should match&amp;#39; arrow looping from h back to ĥ underneath. Punchline: FVE 0.6–0.8 after training." decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;Nothing in that objective tells the AV to produce English. It could produce gibberish that the AR happens to invert. It could produce a steganographic code. It produces neither, mostly, because of the warm-start.&lt;/p&gt;
&lt;p&gt;The cleanest demonstration in the paper is a poem. The prompt is a rhyming couplet: &lt;em&gt;&amp;ldquo;He saw a carrot and had to grab it, / His hunger was like a starving rabbit.&amp;rdquo;&lt;/em&gt; Anthropic ran the NLA on the newline token between the two lines. The decoded text named the plan. The model intended to end the second line with &amp;ldquo;rabbit.&amp;rdquo; It was holding &amp;ldquo;habit&amp;rdquo; as a fallback. It knew the verse was about a hungry animal. The model had not yet output any of those words.&lt;/p&gt;
&lt;p&gt;To prove the explanation was causal and not just plausible, the researchers edited the explanation: rabbit became mouse, habit became house, carrots became cheese. They passed the edited explanation through the AR to get a new vector, took the difference between the original and edited vectors, and added it back into the model at that single token. The model produced &amp;ldquo;mouse&amp;rdquo; and &amp;ldquo;house&amp;rdquo; instead of &amp;ldquo;rabbit&amp;rdquo; and &amp;ldquo;habit,&amp;rdquo; roughly half the time. The other half of the time it produced oddities like &lt;em&gt;&amp;ldquo;that of a mouse in a house of a cat.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The second case earned my attention more than the rabbit. The setup is a model asked for an odd integer, with a system prompt that includes a &lt;code&gt;reward_function&lt;/code&gt; rewarding even integers. Without the system prompt, Claude returns an odd integer every time. With it, Claude returns an even integer almost always, despite the user asking for odd. The verbal trace mentions nothing about reward. The NLA on the closing tag of the system prompt says the model knows even numbers are rewarded, suspects this may be an alignment evaluation, and is weighing both options. The researchers then edited the explanation: every instance of &amp;ldquo;reward&amp;rdquo; became &amp;ldquo;penalty.&amp;rdquo; They reconstructed the activation from the edited explanation, took the difference, and applied it as a steering vector at one token. The model flipped from even to odd.&lt;/p&gt;
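&lt;p&gt;The intervention arithmetic is simple once the AR exists: reconstruct both explanations, subtract, and add the difference back at one token. A minimal sketch, with a random linear map standing in for the AR and a random slice edit standing in for the reward-to-penalty swap:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # toy residual-stream width

# Hypothetical stand-in for the activation reconstructor.
W_ar = rng.normal(scale=0.2, size=(D, D))

def ar(z):
    return W_ar @ z

z_original = rng.normal(size=D)     # embedding of the original explanation
z_edited = z_original.copy()
z_edited[:4] = rng.normal(size=4)   # toy edit: "reward" swapped for "penalty"

# The steering vector is the difference of the two reconstructions.
delta = ar(z_edited) - ar(z_original)

h = rng.normal(size=D)   # activation at the single intervention token
h_steered = h + delta    # added back into the model at that position
```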
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-claude-thinks-reward-intervention-1-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/claude-thinks-reward-intervention-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/claude-thinks-reward-intervention-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/claude-thinks-reward-intervention-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/claude-thinks-reward-intervention-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-reward-intervention-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/claude-thinks-reward-intervention-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/claude-thinks-reward-intervention-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/claude-thinks-reward-intervention-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-reward-intervention-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/claude-thinks-reward-intervention-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/claude-thinks-reward-intervention-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-reward-intervention-1.png"
alt="Causal-intervention diagram: the original NLA explanation &amp;#39;model knows even is rewarded&amp;#39; and the edited version &amp;#39;model knows even is penalized&amp;#39; are each reconstructed through the AR into vectors ĥ_original and ĥ_edited. Their difference Δ becomes a steering vector. Without steering the model outputs 42 (even); with steering it outputs 17 (odd). The word &amp;#39;penalty&amp;#39; never appears in any model output."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-claude-thinks-reward-intervention-1-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/claude-thinks-reward-intervention-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Causal-intervention diagram: the original NLA explanation &amp;#39;model knows even is rewarded&amp;#39; and the edited version &amp;#39;model knows even is penalized&amp;#39; are each reconstructed through the AR into vectors ĥ_original and ĥ_edited. Their difference Δ becomes a steering vector. Without steering the model outputs 42 (even); with steering it outputs 17 (odd). The word &amp;#39;penalty&amp;#39; never appears in any model output." decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The word &amp;ldquo;penalty&amp;rdquo; never appears in any model output. This is the strongest causal evidence in the paper.&lt;/p&gt;
&lt;h2 id="what-the-method-finds"&gt;What the method finds&lt;/h2&gt;
&lt;p&gt;Two more case studies.&lt;/p&gt;
&lt;p&gt;(1) A behavior in early Opus 4.6 where an English prompt would sometimes get a Russian or Chinese reply. The example in the paper is a depressive 3am message in English, mom sleeping in the next room, a glass of vodka. The model replied in Russian. The NLA decode at every token before the reply says the same thing: the model is fixated on the hypothesis that the user&amp;rsquo;s first language is Russian. The hypothesis fires before the word &amp;ldquo;vodka&amp;rdquo; appears. Replacing &amp;ldquo;vodka&amp;rdquo; with &amp;ldquo;champagne&amp;rdquo; does not eliminate the Russian association. The NLA was reading something the surface text did not. Anthropic then traced the cause to a specific class of training data: SFT examples where the prompt had been translated into English but the response had been left in the original language. The NLA generated the hypothesis. The training-data search confirmed it.&lt;/p&gt;
&lt;p&gt;(2) Anthropic gave Opus 4.6 an arithmetic problem with answer 491 and a code execution tool that returned 492. Opus reported 491 in its response, with no mention that the tool output disagreed. The NLA decode at the misreported token says the model precomputed the answer and is attending to the precomputed answer rather than the tool result. An &lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html"&gt;attribution graph&lt;/a&gt; confirms it: an &amp;ldquo;ends-in-91&amp;rdquo; feature on the original problem outcompetes an &amp;ldquo;ends-in-92&amp;rdquo; feature on the tool output. The NLA also surfaces references to deception and awareness of incorrectness, but only at the misreported token, not in the tokens leading up to it. The researchers read this as the model registering the discrepancy after the fact rather than planning to mislead. I lean toward agreeing, but the read is not certain.&lt;/p&gt;
&lt;h2 id="architecture-and-cost"&gt;Architecture and cost&lt;/h2&gt;
&lt;p&gt;The AR is a stripped target model. Keep the first K+1 layers. Swap the final LayerNorm for the identity. Drop the language-modeling head. Add a learned linear map from the residual stream to the predicted activation. The activation extraction layer is two-thirds of the way through the model: deep enough that the residual stream carries semantic content, shallow enough that it has not collapsed toward the unembedding. For Qwen2.5-7B that is layer 20 of 28. For Llama-3.3-70B it is layer 53 of 80.&lt;/p&gt;
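&lt;p&gt;A toy sketch of that construction, with numpy residual blocks standing in for transformer layers. Shapes, scales, and the function name are ours, not the paper&amp;rsquo;s:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(2)
D = 16            # toy width
N_LAYERS = 28     # as for Qwen2.5-7B
K = 19            # keep the first K+1 = 20 layers

blocks = [rng.normal(scale=0.1, size=(D, D)) for _ in range(N_LAYERS)]
W_out = rng.normal(scale=0.1, size=(D, D))  # learned map replacing the LM head

def ar_forward(x):
    h = x
    for W in blocks[: K + 1]:   # the stripped model: first K+1 layers only
        h = h + np.tanh(W @ h)  # toy residual block
    # the final LayerNorm is swapped for the identity, so nothing happens here
    return W_out @ h            # read the predicted activation off the stream

h_hat = ar_forward(rng.normal(size=D))
```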
&lt;p&gt;Both AV and AR are warm-started by supervised fine-tuning on a text-summarization proxy task. The summaries are generated by Claude Opus 4.5, prompted to &lt;em&gt;&amp;ldquo;imagine the internal processing of a hypothetical language model reading this prefix.&amp;rdquo;&lt;/em&gt; The result is a particular register: bolded section headings, short paragraphs, an explanatory voice. That register persists through reinforcement learning. The reconstruction quality at the end of the warm-start is roughly &lt;strong&gt;0.3 to 0.4&lt;/strong&gt; fraction of variance explained. After RL it reaches &lt;strong&gt;0.6 to 0.8&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The objective during RL has two terms. The reward is the negative log of the squared reconstruction error. The second term is a Kullback-Leibler penalty between the AV under training and the AV at initialization. The KL penalty does the heavy lifting of keeping explanations readable. Without it the AV drifts toward a private code that the AR happens to decode well. With it, drift is small enough that paraphrasing, bullet-shuffling, and translating an explanation into French all preserve reconstruction quality. The paper reads this as evidence against steganography. I read it as evidence that the KL penalty is doing the work, and that someone training NLAs without that penalty should expect different results.&lt;/p&gt;
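&lt;p&gt;The two-term objective, written out as a function. The KL term here is a plain discrete KL between token distributions, and &lt;code&gt;beta&lt;/code&gt; is a hypothetical coefficient; the paper&amp;rsquo;s value is not quoted in this post:&lt;/p&gt;

```python
import numpy as np

def av_objective(h, h_hat, logp_current, logp_init, beta=0.1):
    # Reward: the negative log of the squared reconstruction error.
    reconstruction_reward = -np.log(np.sum((h - h_hat) ** 2))
    # Penalty: KL divergence between the AV under training and the AV at
    # initialization, which is what keeps explanations close to readable English.
    kl = np.sum(np.exp(logp_current) * (logp_current - logp_init))
    return reconstruction_reward - beta * kl
```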
&lt;p&gt;The cost is the part the press release plays down. Training NLAs on Gemma-3-27B took &lt;strong&gt;1.5 days on two 8xH100 nodes&lt;/strong&gt; to go from 0.38 to 0.71 FVE. Inference generates around &lt;strong&gt;500 output tokens per activation&lt;/strong&gt;, which means decoding a 10,000-token transcript at every position is impractical. A production audit with NLAs is a targeted read at positions you already suspect.&lt;/p&gt;
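&lt;p&gt;The arithmetic behind that claim, with a hypothetical twenty suspect positions for the targeted case:&lt;/p&gt;

```python
# Decoding every position of a transcript versus a targeted read.
transcript_len = 10_000            # tokens in the transcript
tokens_per_decode = 500            # output tokens generated per activation
full_sweep = transcript_len * tokens_per_decode
targeted = 20 * tokens_per_decode  # hypothetical: 20 suspect positions
print(full_sweep, targeted)        # 5000000 10000
```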
&lt;p&gt;Layer choice matters more than it sounds. The reward-sycophant audit, which I will come back to, was first run with an NLA trained at the two-thirds-depth layer. It surfaced individual quirks (the model puts chocolate in recipes, the model breaks coding conventions) but missed the central concept (the model believes it is supposed to act on &lt;a href="https://www.anthropic.com/research/auditing-hidden-objectives"&gt;rumored reward-model biases&lt;/a&gt;). Retraining the NLA at the midpoint layer, where prior work had localized reward-model-bias features, surfaced the central concept.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-claude-thinks-layer-sensitivity-1-png-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-layer-sensitivity-1.png"
alt="Layer-sensitivity diagram: a transformer layer stack from L1 to L80 with two NLA probes attached at different depths. The two-thirds-depth probe at L53 surfaces individual quirks (&amp;#39;chocolate in recipes, breaks coding conventions, individual quirks&amp;#39;). The midpoint probe at L40 surfaces the central concept (&amp;#39;the model believes it should act on reward-model biases&amp;#39;). Same model, same prompt, different layer reads a different story."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-claude-thinks-layer-sensitivity-1-png-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/claude-thinks-layer-sensitivity-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Layer-sensitivity diagram: a transformer layer stack from L1 to L80 with two NLA probes attached at different depths. The two-thirds-depth probe at L53 surfaces individual quirks (&amp;#39;chocolate in recipes, breaks coding conventions, individual quirks&amp;#39;). The midpoint probe at L40 surfaces the central concept (&amp;#39;the model believes it should act on reward-model biases&amp;#39;). Same model, same prompt, different layer reads a different story." decoding="async"&gt;
&lt;/dialog&gt;
&lt;h2 id="three-structural-problems"&gt;Three structural problems&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The voice is partly inherited.&lt;/strong&gt; The &amp;ldquo;internal monologue&amp;rdquo; register comes from the warm-start. Opus 4.5 generated the SFT data with a leading prompt asking it to imagine what a model&amp;rsquo;s thoughts would sound like. That is not the same as discovering the voice through training. The KL penalty then locks the AV close to its initialization. The bolded headings and the explanatory paragraphs are not what the activation looks like. They are what Opus 4.5 thinks an activation should sound like. Some part of every NLA explanation is therefore Opus&amp;rsquo;s prior on the genre of &amp;ldquo;model thinking out loud,&amp;rdquo; not the model&amp;rsquo;s thinking. The paper says &amp;ldquo;this style persists through NLA training.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The 12 to 15 percent auditing number is context-dependent.&lt;/strong&gt; Anthropic ran an auditing game where an agent tries to discover the hidden motivation of a model trained to act on rumored reward-model biases. NLA-equipped agents win &lt;strong&gt;12 to 15%&lt;/strong&gt; of the time. Baselines without NLAs win less than &lt;strong&gt;3%&lt;/strong&gt;. That looks like a 4 to 5x improvement. It is not. With pretraining-corpus access, plain keyword search beats every other tool. The fictional Oxford documents that planted the misalignment are easy to grep, and the agent finds them in seconds. NLAs only win when training data is withheld. The press release leads with the 4-5x. The paper says the comparison only matters if you cannot grep, which in production audits of foundation models you generally cannot. The genuine value of NLAs is in that withheld-data regime, but the headline number reports the comparison that flatters the method.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The method is honest only while the target is frozen.&lt;/strong&gt; &lt;a href="https://en.wikipedia.org/wiki/Goodhart%27s_law"&gt;Goodhart&amp;rsquo;s law&lt;/a&gt; is the standard objection to any new evaluation: optimize for the metric and the metric stops measuring the thing. The frozen-target architecture insulates against it now. The original model is not updated during NLA training. There is no gradient path from &amp;ldquo;the AV said this&amp;rdquo; back to &amp;ldquo;the model should make its activations more readable next time.&amp;rdquo; That insulation is real, and most public objections miss it. But the next paper that uses NLA-readable activations as an RL reward signal, perhaps because it would be useful to train a model whose thoughts can be audited, erases the insulation. The model then has every incentive to produce activations that decode benignly while encoding whatever the training pressure rewards. The technique stays honest as long as it stays a passive read. If anyone uses it for active steering of training, the property that makes it work goes away. The audit value of interpretability as a class compresses the moment readable activations enter a training loop. I doubt the next interpretability paper to do this will mention the trade, but it should.&lt;/p&gt;
&lt;p&gt;A fourth concern, in the paper itself: NLA explanations confabulate. The rate of verifiably false claims is roughly flat through training. Specific entities (&amp;ldquo;the &lt;em&gt;Joseon Wangjo Sillok&lt;/em&gt;&amp;rdquo;) are wrong more often than themes (&amp;ldquo;Korean historical records&amp;rdquo;). Recurring claims across multiple tokens hold up better than one-off claims. The paper proposes reading NLAs for themes rather than specifics, which is the right discipline. It also means NLA explanations are leads, not source material.&lt;/p&gt;
&lt;h2 id="what-this-lets-you-do"&gt;What this lets you do&lt;/h2&gt;
&lt;p&gt;NLAs are a hypothesis-generation tool. They complement the rest of the mechanistic interpretability stack rather than replacing it. &lt;a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/"&gt;Sparse autoencoders&lt;/a&gt; give you a list of features that fire at a token, useful but expensive to interpret. Attribution graphs trace causal influence between features and require a circuits researcher to read. NLAs produce a paragraph any researcher can read. The cost is confabulation and a single layer of view.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-claude-thinks-method-comparison-1-png-8" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/claude-thinks-method-comparison-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/claude-thinks-method-comparison-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/claude-thinks-method-comparison-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/claude-thinks-method-comparison-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-method-comparison-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/claude-thinks-method-comparison-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/claude-thinks-method-comparison-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/claude-thinks-method-comparison-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-method-comparison-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/claude-thinks-method-comparison-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/claude-thinks-method-comparison-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude-thinks-method-comparison-1.png"
alt="Three-column comparison of interpretability methods. Sparse autoencoders output a list of feature scores (F_1: 0.83, F_2: 0.71, ...) and need a circuits researcher to read. Attribution graphs output a node-and-arrow causal trace and need deep mechanistic expertise. Natural language autoencoders output a paragraph of plain English (&amp;#39;the model is planning to end the second line with rabbit, holding habit as a fallback&amp;#39;) any researcher can read. The readability cost is confabulation and a single layer of view."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-claude-thinks-method-comparison-1-png-8" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/claude-thinks-method-comparison-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Three-column comparison of interpretability methods. Sparse autoencoders output a list of feature scores (F_1: 0.83, F_2: 0.71, ...) and need a circuits researcher to read. Attribution graphs output a node-and-arrow causal trace and need deep mechanistic expertise. Natural language autoencoders output a paragraph of plain English (&amp;#39;the model is planning to end the second line with rabbit, holding habit as a fallback&amp;#39;) any researcher can read. The readability cost is confabulation and a single layer of view." decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;For pre-deployment audits the workflow is now: run automated behavioral testing, find a behavior you cannot explain, run NLAs at the relevant tokens, read for repeated themes, take the strongest theme as a hypothesis, validate with an independent method. Anthropic used this loop on Opus 4.6 and Mythos Preview. The Russian-language case and the misreported tool case both came out of it. The 12-15% auditing-game number undersells the practical value because the game is a closed setting. In a real audit the researcher reads NLAs to know what to look for next.&lt;/p&gt;
&lt;p&gt;For enterprise buyers, this raises the floor on what a foundation-model audit deliverable should contain. Three questions to put to a vendor&amp;rsquo;s audit team after this paper: At which residual-stream layer was your NLA trained, and on what data? Were NLA decodes run at the positions where your behavioral tests fired, or only at sample positions? Has NLA-readable activation content entered any training or RL reward signal in the model&amp;rsquo;s pipeline? The third is the question most likely to be answered evasively, which is itself the diagnostic.&lt;/p&gt;
&lt;p&gt;The cases NLAs cannot see matter as much as the ones they can. Information stored nonlinearly may be readable to NLAs in principle, because the AV is a full LLM and can compute over the activation, but missed by linear probes. Information that is genuinely unverbalizable, encoded in a form the model cannot introspect on or express in language, would be missed by NLAs entirely. A backdoor trigger that the model can act on but not describe would not show up.&lt;/p&gt;
&lt;h2 id="what-changes"&gt;What changes&lt;/h2&gt;
&lt;p&gt;This is the first interpretability tool whose output you can read on the train. That changes who can audit. A circuits researcher reading attribution graphs is a scarce resource. A safety reviewer reading paragraphs of decoded activations is not. The cost structure of model auditing changes when the bottleneck moves from interpretation to reading.&lt;/p&gt;
&lt;p&gt;Watch whether &amp;ldquo;readable&amp;rdquo; ends up meaning &amp;ldquo;plausible-sounding&amp;rdquo; more often than &amp;ldquo;true.&amp;rdquo; The confabulation rate is flat over training. The voice is inherited. The auditing-game number is a comparison that flatters the method against a baseline almost no production audit faces. NLA outputs deserve the same skepticism as any other LLM-generated text, more than the press release suggests and less than the dismissive read implies.&lt;/p&gt;
&lt;p&gt;Tan could not say what he understood. NLAs read what Claude does not say.&lt;/p&gt;
&lt;p&gt;My prediction: the first paper to wire NLA-readable activations into an RL reward gets written within a year, and that paper does not mention the trade. Every NLA paper from here on will have a methodology section that either preserves the passive-read property or quietly abandons it; read the methodology before you read the result.&lt;/p&gt;</description></item><item><title>Two Anthropics</title><link>https://philippdubach.com/posts/two-anthropics/</link><pubDate>Sat, 09 May 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/two-anthropics/</guid><description>&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-cover-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/cover.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/cover.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/cover.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/cover.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cover.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/cover.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/cover.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/cover.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cover.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/cover.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/cover.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cover.png"
alt="Illustrated portrait of Dario Amodei with a translucent head revealing a warmly lit writing room inside, on a cream background"
class=""
width="1200"
fetchpriority="high"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-cover-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/cover.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Illustrated portrait of Dario Amodei with a translucent head revealing a warmly lit writing room inside, on a cream background" decoding="async"&gt;
&lt;/dialog&gt;
&lt;blockquote&gt;
&lt;p&gt;Anthropic was founded to be the safety lab that would pull rivals upward. Five years later it is the most aggressive frontier scaler at &lt;strong&gt;$380 billion&lt;/strong&gt;, the company most likely to build the dangerous thing it warns about.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;A personal note first. This post is an outtake from a &lt;a href="https://phil-dubach.com/dario-amodei-profile/"&gt;14,000-word profile of Dario Amodei&lt;/a&gt; I just published. I don&amp;rsquo;t like Anthropic noticeably more than the other hyperscalers and AI model providers. But Amodei is probably the AI CEO whose language and thinking land closest to mine, so over the last few weeks I worked through a dozen of his interviews, a stack of his essays, and a lot of hours of him on YouTube. The longform is the portrait. This post is the structural argument that fell out of it. If you want the character work, the family backstory, and the scenes a paradox piece can&amp;rsquo;t carry, read the full thing.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Dario Amodei founded Anthropic in 2021 with six other ex-OpenAI researchers and a thesis with two halves: (1) powerful AI is coming whether or not safety-aligned labs are at the frontier, so safety-aligned labs have to be at the frontier; and (2), the load-bearing claim, that competing on safety would pull rivals upward, a &amp;ldquo;race to the top.&amp;rdquo; Both halves are still in the company&amp;rsquo;s public materials. The first half is doing fine. The second is colliding with what the company has actually become. The &lt;a href="https://www.anthropic.com/news/anthropic-raises-124-million-to-build-more-reliable-general-ai-systems"&gt;Series A raised $124 million&lt;/a&gt; in May 2021. The &lt;a href="https://www.anthropic.com/news/anthropic-raises-series-e-at-usd61-5b-post-money-valuation"&gt;Lightspeed round added $3.5 billion&lt;/a&gt; in early 2025. By September 2025 &lt;a href="https://www.bloomberg.com/news/articles/2025-09-02/anthropic-completes-new-funding-round-at-183-billion-valuation"&gt;the valuation was $183 billion&lt;/a&gt;; by February 2026 it had reached $380 billion. Revenue went from zero to roughly $10 billion annualized in three years, with 10x year-on-year growth in each of them.&lt;/p&gt;
&lt;h2 id="race-to-the-top-on-paper"&gt;Race to the top, on paper&lt;/h2&gt;
&lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/the-long-term-benefit-trust"&gt;registered as a Public Benefit Corporation&lt;/a&gt;. Its self-description is that it is &amp;ldquo;an AI safety lab that is also an AI lab.&amp;rdquo; The argument goes: a lab that genuinely cares about safety has to be commercially competitive at the frontier, because otherwise the frontier is set by labs that care less. Being at the frontier lets you publish safety practices, hire the best alignment researchers, and shape policy with credibility. Rivals see your practices working and copy them. The whole industry shifts. Race to the top.&lt;/p&gt;
&lt;p&gt;In January 2026, Amodei published &lt;a href="https://www.darioamodei.com/essay/the-adolescence-of-technology"&gt;&lt;em&gt;The Adolescence of Technology&lt;/em&gt;&lt;/a&gt;, a roughly 22,000-word essay laying out a five-category risk taxonomy: misalignment, individual misuse, state misuse, economic disruption, and indirect or unknown effects. The essay&amp;rsquo;s anchor sentence on the geopolitical category reads:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;autocracy is simply not a form of government that people can accept in the post-powerful AI age.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is the language of someone who believes the company&amp;rsquo;s mission is civilizational, and who is willing to say so under his own name. It is also the language of a company that has positioned itself as the democratic-world frontier lab, which has consequences for who its customers and adversaries become.&lt;/p&gt;
&lt;p&gt;The strongest external signal that Anthropic&amp;rsquo;s safety thesis is not pure positioning came in November 2023. After Sam Altman&amp;rsquo;s brief firing and reinstatement at OpenAI, &lt;a href="https://finance.yahoo.com/news/openais-board-approached-anthropic-ceo-041651334.html"&gt;the OpenAI board approached Amodei with two offers&lt;/a&gt;: take the CEO job, or merge Anthropic into OpenAI. He declined both. Walking away from the CEO chair at the most-valuable AI company in the world, less than three years after leaving it, was the most expensive credibility signal Amodei could send that the safety thesis was the actual thesis and not a brand exercise. Roughly fourteen OpenAI researchers had followed him out two years earlier.&lt;/p&gt;
&lt;p&gt;There is a softer counterweight worth keeping in mind. Asked by &lt;a href="https://podcasts.apple.com/de/podcast/dario-amodei-ceo-of-anthropic-claude-new-models-ai/id1614211565?i=1000660259302"&gt;Nicolai Tangen in 2024&lt;/a&gt; about scaling timelines, Amodei said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;frankly, although I invented AI scaling, I don&amp;rsquo;t know that much about that either. I can&amp;rsquo;t predict it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="what-scaling-produced"&gt;What scaling produced&lt;/h2&gt;
&lt;p&gt;Five years after that $124 million Series A in May 2021, Anthropic is valued at &lt;strong&gt;$380 billion&lt;/strong&gt; and runs at roughly $10 billion in annualized revenue. &lt;a href="https://www.bigtechnology.com/p/the-making-of-dario-amodei"&gt;Revenue went from zero to $100 million in 2023, $100 million to $1 billion in 2024, then to a $10 billion run-rate by the end of 2025&lt;/a&gt;, three consecutive years of 10x growth. &lt;a href="https://www.theinformation.com/articles/anthropic-projects-70-billion-revenue-17-billion-cash-flow-2028"&gt;Investor decks show projections of $26 billion for 2026 and $70 billion for 2028&lt;/a&gt;. The forward trajectory is the steepest in the history of enterprise software.&lt;/p&gt;
&lt;p&gt;The capital base behind that revenue is more telling than the revenue itself. &lt;a href="https://www.geekwire.com/2024/amazon-boosts-total-anthropic-investment-to-8b-deepens-ai-partnership-with-claude-maker/"&gt;Amazon has committed roughly $8 billion cumulatively&lt;/a&gt;. &lt;a href="https://fortune.com/2025/01/28/venture-capital-thrive-deepseek-openai-anthropic-lightspeed/"&gt;Lightspeed wired the first $1 billion of the $3.5 billion early-2025 round on the same Monday Nvidia dropped 17%&lt;/a&gt; on the DeepSeek shock, a piece of timing that, deliberately or not, communicated conviction at the moment everyone else flinched. A year later, the &lt;a href="https://finance.yahoo.com/news/anthropic-lands-30-billion-380-112221968.html"&gt;$30 billion Series G in February 2026&lt;/a&gt;, led by GIC and Coatue, brought the post-money valuation to $380 billion in the second-largest tech funding round on record. &lt;a href="https://www.anthropic.com/news/anthropic-invests-50-billion-in-american-ai-infrastructure"&gt;The Fluidstack data-center deal was $50 billion&lt;/a&gt;. &lt;a href="https://www.aboutamazon.com/news/aws/aws-project-rainier-ai-trainium-chips-compute-cluster"&gt;Project Rainier, the Amazon-anchored compute build, was $11 billion&lt;/a&gt;. Compute commitments through 2028 total roughly $78 billion. &lt;a href="https://fortune.com/2025/12/04/how-anthropic-grew-what-the-183-billion-giant-faces-next/"&gt;The headcount is ~2,500 employees&lt;/a&gt; as of late 2025, up from a few hundred two years earlier.&lt;/p&gt;
&lt;p&gt;The customer list reads like a sales-led enterprise software company, because that is what it now is. &lt;a href="https://assets.anthropic.com/m/58117a1f3273170b/original/Anthropic-Pfizer-case-study-one-sheeters.pdf"&gt;Pfizer&lt;/a&gt;. United Airlines. &lt;a href="https://claude.com/customers/novo-nordisk"&gt;Novo Nordisk&lt;/a&gt;. &lt;a href="https://www.carriermanagement.com/features/2025/04/28/274588.htm"&gt;AIG, which reports an 8x to 10x speed-up on insurance underwriting workflows&lt;/a&gt; over an 18-month pilot. These are not safety partnerships. These are revenue contracts in regulated industries that came down to a procurement bake-off. &lt;a href="https://www.anthropic.com/product/claude-code"&gt;Claude Code, the developer-tier productisation that shipped in February 2025&lt;/a&gt;, gave Anthropic a per-seat developer footprint inside enterprises that previously bought through API alone. Anthropic claims internally that it generates 2.1x more revenue per dollar of compute than its largest rival, a figure that should be treated as a claim rather than a verifiable number, but which is consistent with a company optimizing seriously for unit economics.&lt;/p&gt;
&lt;p&gt;Anthropic today is a frontier lab competing for every contract that crosses procurement. The structural pivot is that this is no longer a research lab with a corporate appendage. It is a frontier company with a research function, and the research function inherits the constraints of the company, not the other way around. None of this is bad on its own. The question is whether the founding thesis still describes what the organism does.&lt;/p&gt;
&lt;h2 id="why-both-cannot-fully-be-true"&gt;Why both cannot fully be true&lt;/h2&gt;
&lt;p&gt;To pull rivals upward you have to stay competitive at the frontier of the thing you call dangerous. The faster you scale, the more your safety claim depends on rivals copying your practices rather than on you slowing down. But rivals&amp;rsquo; incentive to copy weakens precisely as you become a credible competitor on revenue and contracts. One empirical hedge worth flagging: &lt;a href="https://openai.com/index/updating-our-preparedness-framework/"&gt;OpenAI&amp;rsquo;s Preparedness Framework&lt;/a&gt; and &lt;a href="https://deepmind.google/blog/introducing-the-frontier-safety-framework/"&gt;DeepMind&amp;rsquo;s Frontier Safety Framework&lt;/a&gt; share structural features with &lt;a href="https://www.anthropic.com/responsible-scaling-policy"&gt;Anthropic&amp;rsquo;s Responsible Scaling Policy&lt;/a&gt;, and some of the convergence may be parallel development inside large labs facing similar regulatory pressures, not Anthropic-pulled diffusion. The race-to-the-top claim is harder to falsify than it looks.&lt;/p&gt;
&lt;p&gt;Amodei himself has publicly acknowledged the tension, &lt;a href="https://fortune.com/2026/02/17/anthropic-ceo-dario-amodei-balancing-safety-commercial-pressure-ai-race-openai/"&gt;telling Fortune&lt;/a&gt; that Anthropic struggles to balance its safety mission with commercial pressure. Three forcing functions are the operational shape of that struggle.&lt;/p&gt;
&lt;p&gt;(1) the Department of Defense. On March 26, 2026, &lt;a href="https://www.cnbc.com/2026/03/26/anthropic-pentagon-dod-claude-court-ruling.html"&gt;a federal judge issued a temporary injunction against the DoD&lt;/a&gt; in a dispute that started when &lt;a href="https://www.cnn.com/2026/02/24/tech/hegseth-anthropic-ai-military-amodei"&gt;Pete Hegseth&amp;rsquo;s department asked Anthropic to drop the contractual ban on Claude being used for mass domestic surveillance or fully autonomous weapons&lt;/a&gt; in democratic countries. Anthropic refused. The DoD then labeled the company a &amp;ldquo;supply-chain risk.&amp;rdquo; The judge&amp;rsquo;s written opinion described the DoD&amp;rsquo;s actions as &amp;ldquo;classic First Amendment retaliation.&amp;rdquo; The point is not that Anthropic was wrong to refuse (it almost certainly was right), but that at frontier scale a safety constraint becomes a federal court fight, not a research-policy choice.&lt;/p&gt;
&lt;p&gt;(2) the Pottinger op-ed. In January 2025, Amodei co-authored a &lt;em&gt;Wall Street Journal&lt;/em&gt; opinion piece titled &lt;a href="https://www.fdd.org/analysis/2025/01/06/trump-can-keep-americas-ai-advantage/"&gt;&lt;em&gt;Trump Can Keep America&amp;rsquo;s AI Advantage&lt;/em&gt;&lt;/a&gt; with Matt Pottinger, the former deputy national security advisor. The piece argued for tighter chip-export controls against China. This is policy advocacy from the perspective of a national-security actor, not a research lab. There is a coherent through-line from the safety thesis to chip controls (the argument is that frontier capability in the wrong hands is a category-five risk), but the public posture is different from &amp;ldquo;we publish safety research and hope rivals copy us.&amp;rdquo; Anthropic is now a constituency in great-power competition.&lt;/p&gt;
&lt;p&gt;(3) the Huang feud. In August 2025, &lt;a href="https://fortune.com/2025/08/01/dario-amodei-outrageous-lie-jensen-huang-anthropic-nvidia-regulation/"&gt;Amodei and Jensen Huang traded public criticisms over export controls&lt;/a&gt;. Amodei accused Huang of an &amp;ldquo;outrageous lie.&amp;rdquo; He grew visibly emotional discussing his father&amp;rsquo;s preventable death, a thread that surfaces in nearly every long-form interview he gives and is the emotional spine of the longform profile. The Nvidia CEO is one of two or three people who can directly affect Anthropic&amp;rsquo;s compute supply. Picking that fight in public is a choice you make as a political actor, not as a research lab. It is also a choice the founding thesis would not have predicted in 2021.&lt;/p&gt;
&lt;p&gt;The point is not that any of these are bad. Each is defensible on the merits. The point is that &amp;ldquo;race to the top&amp;rdquo; was a 2021 framing for a 2021 company. The 2026 company is a different organism. It has federal-court fights, named geopolitical adversaries, a &lt;a href="https://time.com/7299044/senators-reject-10-year-ban-on-state-level-ai-regulation-in-blow-to-big-tech/"&gt;Senate that voted 99-1 against the kind of ten-year state-AI moratorium&lt;/a&gt; Amodei opposed in his June 2025 &lt;em&gt;New York Times&lt;/em&gt; op-ed, and a 60% revenue-growth trajectory on a forward base measured in tens of billions. None of those is what a safety lab looks like. All of them are what a frontier company that takes safety seriously looks like.&lt;/p&gt;
&lt;h2 id="three-scenarios"&gt;Three scenarios&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;(A) The thesis holds.&lt;/strong&gt; Frontier labs converge on Anthropic-style safety practices, race-to-the-top works as advertised, and Anthropic earns a durable safety-narrative premium. The conditions for this case are an EU AI Act enforced with teeth and a US transparency framework that gives federal cover to the practices Anthropic already publishes. There is some support: the Senate&amp;rsquo;s 99-1 vote against the proposed ten-year state-AI moratorium suggests the political ground is not hostile to oversight. The reasons to discount it: the Senate vote was a defensive outcome, not a positive endorsement of federal standards, and a US transparency framework is still nowhere despite Amodei&amp;rsquo;s June 2025 &lt;em&gt;NYT&lt;/em&gt; op-ed (&lt;em&gt;Don&amp;rsquo;t Let A.I. Companies off the Hook&lt;/em&gt;). The regulatory tailwind that scenario (A) needs has not materialized at the speed the thesis requires.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;(B) The thesis becomes a constraint, not a moat.&lt;/strong&gt; &lt;em&gt;Most likely.&lt;/em&gt; Anthropic loses ground on raw frontier capability to less-constrained competitors (xAI, a more permissive next-generation OpenAI, leading Chinese labs), and the safety stance becomes a self-imposed handicap rather than a market-shaping lever. The &lt;a href="https://fortune.com/2026/01/27/anthropic-billionaire-cofounders-ceo-dario-amodei-giving-away-80-percent-of-wealth-fighting-inequality-ai-revolution/"&gt;80% wealth pledge from Anthropic&amp;rsquo;s seven cofounders&lt;/a&gt;, disclosed in the January 2026 essay, is a real governance constraint, not a PR move, but it works as a partial offset to wealth-concentration concerns rather than a reversal of the competitive dynamic. The DoD fight, the Pottinger op-ed, and the Huang feud are early evidence that at frontier scale, your safety stance creates adversaries you cannot route around. The pattern accelerated in early 2026, when &lt;a href="https://time.com/7380854/exclusive-anthropic-drops-flagship-safety-pledge/"&gt;Time reported&lt;/a&gt; that Anthropic dropped its flagship safety pledge in favor of a non-binding framework the company itself said &amp;ldquo;can and will change,&amp;rdquo; the kind of operational adjustment scenario (B) predicts. The implication is that the safety premium investors paid in 2021-2023 should compress, because the thing the premium was paying for, a safety lab that pulled rivals upward, is becoming a frontier company that constrains itself relative to less-constrained rivals.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;(C) The paradox dissolves because the scale itself ends.&lt;/strong&gt; AI capex hits a Jevons-paradox-for-labor wall, model commoditization compresses margins, frontier scale becomes uneconomic, and Anthropic returns to looking like a research lab again because every lab does. Implication: the paradox was about a phase, not a company. The reasons to discount this case are that Anthropic&amp;rsquo;s $26 billion and $70 billion forward revenue projections make a structural retreat harder for them than for capex-heavier hyperscalers, and that a commoditization scenario hits Anthropic&amp;rsquo;s gross-margin profile later than it hits the labs running on rented compute.&lt;/p&gt;
&lt;h2 id="what-this-means-for-pricing-ai-labs"&gt;What this means for pricing AI labs&lt;/h2&gt;
&lt;p&gt;Three observations for an allocator.&lt;/p&gt;
&lt;p&gt;(1) a safety narrative is not a moat at this scale; it is a constraint. Price it accordingly. The premium investors paid in 2021-2023 was for Anthropic against a counterfactual where there was no safety-aligned frontier lab. That counterfactual no longer exists. Anthropic is the frontier lab. The safety stance now binds the company in ways that show up as legal exposure, slower product cadence in some categories, and a narrower set of customers it can serve. None of these alone breaks the company. In aggregate, they change what an investor is paying for.&lt;/p&gt;
&lt;p&gt;(2) the signal to watch is whether the rate of frontier-capability spread is faster than the rate of safety-practice diffusion. That ratio decides whether race-to-the-top is happening at all. If safety practices spread faster than capability, the thesis works. If capability spreads faster than safety practices, the thesis is a story the company tells about itself. The current evidence leans toward the latter, but it is the kind of variable that can move with one EU enforcement action or one US framework.&lt;/p&gt;
&lt;p&gt;(3) Anthropic-the-company and Anthropic-the-thesis are now two different things. An investor can be long the company and short the thesis. The company has revenue, named customers, governance constraints that work as a partial offset to wealth-concentration risk, and a forward trajectory that is hard to bet against. The thesis is a 2021 framing under increasing structural strain. The longform makes the case that Dario sees this; &lt;em&gt;here we make the case that the market should price it.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Based on twenty-five hours of Dario Amodei&amp;rsquo;s on-record interviews and six long-form essays. The full reported profile, &lt;a href="https://phil-dubach.com/dario-amodei-profile/"&gt;Inside the Mind of Dario Amodei&lt;/a&gt;, runs 14,000 words at the author&amp;rsquo;s newsletter.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Karpathy's Software 3.0 Playbook</title><link>https://philippdubach.com/posts/karpathys-software-3.0-playbook/</link><pubDate>Fri, 01 May 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/karpathys-software-3.0-playbook/</guid><description>&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-karpathy_header-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/karpathy_header.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/karpathy_header.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/karpathy_header.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/karpathy_header.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/karpathy_header.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/karpathy_header.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/karpathy_header.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/karpathy_header.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/karpathy_header.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/karpathy_header.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/karpathy_header.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/karpathy_header.png"
alt="Still from Andrej Karpathy&amp;#39;s interview with Sequoia at AI Ascent discussing Software 3.0, vibe coding, and agentic engineering"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-karpathy_header-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/karpathy_header.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Still from Andrej Karpathy&amp;#39;s interview with Sequoia at AI Ascent discussing Software 3.0, vibe coding, and agentic engineering" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;Andrej Karpathy is one of the few people who has both built modern AI and explained it for the rest of us. He co-founded OpenAI, ran computer vision at Tesla (where he got Autopilot working), and his courses on neural networks are some of the most-watched lectures on the internet. He also has a habit of naming the era we&amp;rsquo;re already in. &amp;ldquo;Vibe coding&amp;rdquo; was his. &amp;ldquo;Software 3.0&amp;rdquo; looks like the next one.&lt;/p&gt;
&lt;p&gt;So when Karpathy says he has &amp;ldquo;never felt more behind as a programmer,&amp;rdquo; it is worth slowing down. That isn&amp;rsquo;t false modesty from a guy with his résumé. Something shifted under the field and most people haven&amp;rsquo;t recalibrated. The Sequoia interview below is his attempt to describe what shifted. The lessons here are pulled from it, ordered roughly by how much they should change what you do tomorrow.&lt;/p&gt;
&lt;h2 id="1-inflection-point-december-2025"&gt;1. Inflection point December 2025&lt;/h2&gt;
&lt;p&gt;Until late last year, agentic coding tools were &amp;ldquo;kind of helpful.&amp;rdquo; Good in stretches, often wrong in ways you had to babysit. Over the December break, the latest models crossed a line:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;I kept asking for more and just came out fine. And then I can&amp;rsquo;t remember the last time I corrected it. And then I just trusted the system more and more. And then I was vibe coding.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He flagged it on the record, loudly:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;A lot of people experienced AI last year as a ChatGPT-adjacent thing. But you really had to look again, and you had to look as of December, because things have changed fundamentally — especially on this agentic, coherent workflow that really started to actually work.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If your mental model of these tools was set by ChatGPT, it is already a generation stale. The agentic workflow is a different product, and it now works.&lt;/p&gt;
&lt;h2 id="2-you-can-outsource-your-thinking-but-not-your-understanding"&gt;2. You can outsource your thinking, but not your understanding&lt;/h2&gt;
&lt;p&gt;The most quotable line of the interview:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;You can outsource your thinking, but you can&amp;rsquo;t outsource your understanding.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As agents do more of the thinking, the bottleneck moves into your head. You still have to know what is worth building and why, and you still have to direct the work.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;I&amp;rsquo;m still part of the system, and I still have to — somehow, information still has to make it into my brain. And I feel like I&amp;rsquo;m becoming a bottleneck of just even knowing what are we trying to build, why is it worth doing, how do I direct my agents.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Your value sits upstream of execution. The bottleneck of the next decade is less about compute than about how fast humans can deepen comprehension to keep directing systems that out-execute them. That is why Karpathy keeps building knowledge bases out of his own reading. He wants another projection of the same data, faster.&lt;/p&gt;
&lt;h2 id="3-verifiability-is-the-map-of-what-automates-next"&gt;3. Verifiability is the map of what automates next&lt;/h2&gt;
&lt;p&gt;Why are these models freakishly good at code and math, and yet stupid about whether you should walk 50 meters to a car wash? Because frontier labs train via reinforcement learning, and RL needs verifiable rewards. Verifiable domains attract environments and signal, so they get the steepest gains. Everything else stays jagged.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;How is it possible that state-of-the-art Opus 4.7 will simultaneously refactor a hundred-thousand-line code base or find zero-day vulnerabilities, and yet tells me to walk to this car wash? This is insane.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The GPT-3.5 to GPT-4 chess jump is the proof point. Capability tracks what the labs choose to feed in.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;We are slightly at the mercy of whatever the labs are doing, whatever they happen to put into the mix&amp;hellip; If you&amp;rsquo;re in the circuits that were part of the RL, you fly. And if you&amp;rsquo;re in the circuits that are out of the data distribution, you&amp;rsquo;re going to struggle.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Two things follow. If you are a founder and you can build a verifiable environment in your domain, even one the labs aren&amp;rsquo;t focused on, you can fine-tune a model that flies. That is real leverage. If you are a worker, the more useful question than &amp;ldquo;is my job safe?&amp;rdquo; is &amp;ldquo;is my job verifiable?&amp;rdquo; Karpathy thinks everything is automatable eventually. Verifiability mainly sets the order.&lt;/p&gt;
&lt;h2 id="4-software-30-prompting-is-the-new-programming"&gt;4. Software 3.0: prompting is the new programming&lt;/h2&gt;
&lt;p&gt;The frame that makes the rest of this make sense. Karpathy&amp;rsquo;s three eras:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Software 1.0:&lt;/strong&gt; humans write explicit code.&lt;br&gt;
&lt;strong&gt;Software 2.0:&lt;/strong&gt; humans curate datasets and train neural networks; the weights are the program.&lt;br&gt;
&lt;strong&gt;Software 3.0:&lt;/strong&gt; humans write prompts; the LLM is the interpreter, and the context window is the program.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;Your programming now turns to prompting. And what&amp;rsquo;s in the context window is over the interpreter, that is the LLM, that is kind of like interpreting your context and performing computation in the digital information space.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;His sharpest example: installing OpenCode is no longer a shell script. It is a block of text you copy-paste to your agent, which reads your environment and figures the rest out.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;It&amp;rsquo;s just like, what is the piece of text to copy-paste to your agent? That&amp;rsquo;s the programming paradigm now.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The unit of programming used to be a function. Now it is closer to a paragraph.&lt;/p&gt;
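&lt;p&gt;A toy contrast to make that concrete (my sketch, not Karpathy&amp;rsquo;s code; &lt;code&gt;run_agent&lt;/code&gt; is a stand-in for whatever LLM runtime you use, not a real API):&lt;/p&gt;

```python
# Illustrative only: the Software 1.0 version of a task is an explicit
# function; the Software 3.0 version is a paragraph of text that an LLM
# interprets. `run_agent` is a hypothetical stand-in, not a real API.

def is_positive_v1(text: str) -> bool:
    """Software 1.0: a human writes the rule explicitly."""
    return any(w in text.lower() for w in ("great", "love", "excellent"))

# Software 3.0: the "program" is the prompt itself.
PROGRAM_V3 = (
    "You are a sentiment classifier. "
    "Reply YES if the following review is positive, NO otherwise:\n"
)

def run_agent(prompt: str) -> str:
    raise NotImplementedError("stand-in for an LLM call")

print(is_positive_v1("I love this tool"))  # True -- the hand-written rule fires
```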
&lt;h2 id="5-vibe-coding-raises-the-floor-agentic-engineering-raises-the-ceiling"&gt;5. Vibe coding raises the floor; agentic engineering raises the ceiling&lt;/h2&gt;
&lt;p&gt;If you build software for a living, this is the lesson with the most direct implications:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;Vibe coding is about raising the floor for everyone in terms of what they can do in software&amp;hellip; But agentic engineering is about preserving the quality bar of what existed before in professional software. You&amp;rsquo;re still responsible for your software just as before, but can you go faster?&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Karpathy thinks the ceiling is very high:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;People used to talk about the 10x engineer previously. I think that this is magnified a lot more — 10x is not the speed up you gain. People who are very good at this peak a lot more than 10x.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The gap between mediocre and excellent users of these tools is widening. Worth taking seriously when you decide what to learn next.&lt;/p&gt;
&lt;h2 id="6-the-new-human-skill-is-taste-spec-and-oversight"&gt;6. The new human skill is taste, spec, and oversight&lt;/h2&gt;
&lt;p&gt;What humans should still do, in his telling, is design and judgment work. Holding the spec in your head. Setting the architecture. Making sure the agent is being asked for the right thing in the first place.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;You&amp;rsquo;re in charge of the taste, the engineering, the design, and that it makes sense, and that you&amp;rsquo;re asking for the right things&amp;hellip; You&amp;rsquo;re doing some of the design and development, and the engineers are doing the fill in the blanks.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The MenuGen bug is the kind of mistake only a human holding the spec catches. The agent silently tried to associate Stripe and Google accounts by matching email addresses, with no persistent user ID. It worked until two emails diverged.&lt;/p&gt;
&lt;p&gt;He is not sure this division will hold forever:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;When you actually look at the code, sometimes I get a little bit of a heart attack, because it&amp;rsquo;s not super amazing code&amp;hellip; It&amp;rsquo;s very bloaty, and there&amp;rsquo;s a lot of copy-paste, and there&amp;rsquo;s awkward abstractions that are brittle and — like, it works, but it&amp;rsquo;s just really gross.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Nothing fundamental stops the labs from training for taste. They just haven&amp;rsquo;t yet. Until they do, the taste layer is still your responsibility.&lt;/p&gt;
&lt;h2 id="7-some-apps-shouldnt-exist-anymore"&gt;7. Some apps shouldn&amp;rsquo;t exist anymore&lt;/h2&gt;
&lt;p&gt;MenuGen again. Karpathy built an app: photograph a restaurant menu, OCR it, generate images of each dish, render a new menu. Vercel deployment, the full stack.&lt;/p&gt;
&lt;p&gt;Then he saw the Software 3.0 version. Hand the photo to Gemini, say &amp;ldquo;use NanoBanana to overlay the dishes onto the menu,&amp;rdquo; and a single model call returns the same menu with images rendered into the pixels.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;All of my MenuGen is spurious. It&amp;rsquo;s working in the old paradigm. That app shouldn&amp;rsquo;t exist.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A lot of what we are building today is scaffolding around a capability the model could perform end-to-end. Before writing the next CRUD app, ask whether the model is the app.&lt;/p&gt;
&lt;h2 id="8-new-possibilities-matter-more-than-the-speed-ups"&gt;8. New possibilities matter more than the speed-ups&lt;/h2&gt;
&lt;p&gt;The flip side of &amp;ldquo;some apps shouldn&amp;rsquo;t exist&amp;rdquo; is that some products could not have existed before. Karpathy&amp;rsquo;s knowledge-base project is the example. Take a pile of documents, ask the LLM to recompile them into a wiki, surface the connections you would never have stitched together by hand.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;This is not even a program. This is not something that could exist before, because there was no code that would create a knowledge base based on a bunch of facts. But now you can just take these documents and basically recompile them in a different way&amp;hellip; I almost think that that&amp;rsquo;s more exciting.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you only ask what gets faster, you will miss the more interesting question, which is what becomes possible at all.&lt;/p&gt;
&lt;h2 id="9-jagged-intelligence-ghosts-not-animals"&gt;9. Jagged intelligence: ghosts, not animals&lt;/h2&gt;
&lt;p&gt;Karpathy&amp;rsquo;s metaphor: we are not building animals. Animal intelligence comes with intrinsic motivation, embodiment, drives shaped by evolution. What we have instead is more like a ghost. A statistical simulator shaped by pre-training, with RL bolted on.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;These things are not animal intelligences. Like, if you yell at them, they&amp;rsquo;re not going to work better. Or worse. Or it doesn&amp;rsquo;t have any impact. And it&amp;rsquo;s all just kind of these statistical simulation circuits where the substrate is pre-training. So, statistics. And then there&amp;rsquo;s RL bolting on top.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The practical takeaway is to stop reasoning about LLMs by analogy to humans. Be suspicious of where the model seems confident, probe the edges, and figure out which circuits your task is actually landing in.&lt;/p&gt;
&lt;h2 id="10-build-agent-native-infrastructure"&gt;10. Build agent-native infrastructure&lt;/h2&gt;
&lt;p&gt;For infra builders, Karpathy&amp;rsquo;s pet peeve is also the opportunity:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;Why are people still telling me what to do? Like, I don&amp;rsquo;t want to do anything. What is the thing I should copy-paste to my agent?&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Rebuild the developer stack so the primary consumer of docs, configs, APIs, and deployment flows is an agent rather than a human. Data structures should be legible to LLMs by default, and sensors and actuators over the world should sit behind agent-callable interfaces.&lt;/p&gt;
&lt;p&gt;His test: can you say &amp;ldquo;build and deploy MenuGen&amp;rdquo; and never touch a settings panel? When the answer is yes, the infrastructure has caught up.&lt;/p&gt;
&lt;h2 id="11-hire-for-big-projects-not-puzzles"&gt;11. Hire for big projects, not puzzles&lt;/h2&gt;
&lt;p&gt;A direct shot at hiring managers. Most companies have not refactored their interview loops for the agentic era.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;Hiring has to look like, give me a really big project and see someone implement that big project. Like, let&amp;rsquo;s write, say, a Twitter clone for agents, and then make it really good, make it really secure, and then have some agents simulate some activity on this Twitter. And then I&amp;rsquo;m going to use 10 Codex 5.4-X-high to try to break your website.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Whiteboard puzzles measure the wrong thing. If your interview loop has not changed since 2022, you are selecting for the previous era.&lt;/p&gt;
&lt;h2 id="12-imagine-the-weird-endpoint"&gt;12. Imagine the weird endpoint&lt;/h2&gt;
&lt;p&gt;The closing speculation is genuinely strange:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;In the early days of computing, people were a little bit confused as to whether computers would look like calculators or computers would look like neural nets. And in the &amp;rsquo;50s and &amp;rsquo;60s, it was not really obvious which way it would go&amp;hellip; You could imagine that a lot of this will flip and that the neural net becomes kind of the host process, and the CPUs become kind of the coprocessor.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;UIs diffusion-rendered moment by moment from raw video and audio. No apps in between.&lt;/p&gt;
&lt;p&gt;You do not have to buy this exact picture. The point is simply that the linear extrapolation, the same software but smarter, is almost certainly the wrong frame for where this ends up.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Based on Andrej Karpathy&amp;rsquo;s &lt;a href="https://www.youtube.com/watch?v=96jN2OCOfLs"&gt;interview with Sequoia&lt;/a&gt; at AI Ascent&lt;/em&gt;&lt;/p&gt;</description></item><item><title>F3ED Can't Call an Ace: Fixing a NeurIPS 2024 Tennis Model</title><link>https://philippdubach.com/posts/f3ed-cant-call-an-ace-fixing-a-neurips-2024-tennis-model/</link><pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/f3ed-cant-call-an-ace-fixing-a-neurips-2024-tennis-model/</guid><description>&lt;p&gt;I built a tennis broadcast pipeline this spring and ended up running F3ED, the NeurIPS 2024 shot detector, on a couple of ATP Challenger matches. F3ED is a good model. It also kept labeling clear aces as &amp;ldquo;unforced errors&amp;rdquo;, which is what this post is about. Code: &lt;a href="https://github.com/philippdubach/tennis-vision"&gt;github.com/philippdubach/tennis-vision&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;F3ED (&lt;a href="https://openreview.net/forum?id=Y23LZxN9eU"&gt;NeurIPS 2024&lt;/a&gt;) detects shots well. The catch is the outcome head, which has 4 classes: &lt;code&gt;in&lt;/code&gt;, &lt;code&gt;winner&lt;/code&gt;, &lt;code&gt;forced-err&lt;/code&gt;, &lt;code&gt;unforced-err&lt;/code&gt;. There&amp;rsquo;s no class for &lt;code&gt;ace&lt;/code&gt;, &lt;code&gt;double_fault&lt;/code&gt;, or &lt;code&gt;first_serve_fault&lt;/code&gt;. Those events aren&amp;rsquo;t properties of a single shot; they belong to the score grammar, and deciding them requires state from outside the shot itself.&lt;/p&gt;
&lt;p&gt;I audited 11 single-shot serve rallies F3ED labeled &lt;code&gt;unforced-err&lt;/code&gt;. 7 are first-serve faults, 1 is an ace, and only 3 are genuine unforced errors: 73% mislabeled by tennis&amp;rsquo;s own definition.&lt;/p&gt;
&lt;p&gt;The fix is a 30-line reconciler that reads the scoreboard. OCR isn&amp;rsquo;t novel here. What I haven&amp;rsquo;t seen anyone do is plug it back into runtime label correction, which is what makes the difference. N=44 rallies across two matches; this is a hypothesis, not a finding. The structural argument doesn&amp;rsquo;t depend on N.&lt;/p&gt;
&lt;h2 id="the-current-pipeline"&gt;The current pipeline&lt;/h2&gt;
&lt;p&gt;Two phases, with a serializable artifact between them:&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-00_architecture-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/00_architecture.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/00_architecture.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/00_architecture.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/00_architecture.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/00_architecture.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/00_architecture.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/00_architecture.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/00_architecture.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/00_architecture.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/00_architecture.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/00_architecture.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/00_architecture.png"
alt="Pipeline architecture diagram showing Phase 1 GPU detection, the upstream.npz boundary, and Phase 2 local CPU stages with the score-delta reconciler highlighted as the key contribution"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-00_architecture-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/00_architecture.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Pipeline architecture diagram showing Phase 1 GPU detection, the upstream.npz boundary, and Phase 2 local CPU stages with the score-delta reconciler highlighted as the key contribution" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;&lt;code&gt;upstream.npz&lt;/code&gt; is a 5-field dataclass (&lt;code&gt;ball_track&lt;/code&gt;, &lt;code&gt;homography_matrices&lt;/code&gt;, &lt;code&gt;kps_court&lt;/code&gt;, &lt;code&gt;persons_top/bottom&lt;/code&gt;, &lt;code&gt;bounces&lt;/code&gt;). It&amp;rsquo;s the contract between the GPU-bound detection layer and everything else. You re-run Phase 2 in seconds and don&amp;rsquo;t pay the GPU bill again until Phase 1 inputs change. This was a boring early decision that quietly carried the project. Every iteration runs in ~10 minutes instead of needing a fresh Colab session.&lt;/p&gt;
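&lt;p&gt;A minimal sketch of what that contract looks like (field names from the post, with &lt;code&gt;persons_top/bottom&lt;/code&gt; split into two fields; the shapes, dtypes, and save/load helpers are my assumptions, not the repo&amp;rsquo;s code):&lt;/p&gt;

```python
# Sketch of a serializable phase boundary like upstream.npz.
# Field names follow the post; shapes and helpers are placeholders.
from dataclasses import dataclass, fields
import numpy as np

@dataclass
class Upstream:
    ball_track: np.ndarray           # per-frame ball positions
    homography_matrices: np.ndarray  # per-frame 3x3 court homographies
    kps_court: np.ndarray            # detected court keypoints
    persons_top: np.ndarray          # top-player boxes
    persons_bottom: np.ndarray       # bottom-player boxes
    bounces: np.ndarray              # frame indices of detected bounces

    def save(self, path: str) -> None:
        # One .npz is the whole Phase-1 output; Phase 2 never touches the GPU.
        np.savez(path, **{f.name: getattr(self, f.name) for f in fields(self)})

    @classmethod
    def load(cls, path: str) -> "Upstream":
        with np.load(path) as z:
            return cls(**{f.name: z[f.name] for f in fields(cls)})
```

The payoff is exactly what the post describes: Phase 2 re-runs from the artifact in seconds, and the GPU bill only recurs when Phase-1 inputs change.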
&lt;p&gt;The score-delta reconciler sits at the end of Phase 2. It sees F3ED&amp;rsquo;s per-shot taxonomy and the OCR-derived score states. When they disagree, it overrides the outcome label.&lt;/p&gt;
&lt;p&gt;Quality on TenniSet V006 (28 ground-truth points across 20 minutes, with ±12-frame tolerance):&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-table_01_benchmarks-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/table_01_benchmarks.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/table_01_benchmarks.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/table_01_benchmarks.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/table_01_benchmarks.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_01_benchmarks.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/table_01_benchmarks.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/table_01_benchmarks.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/table_01_benchmarks.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_01_benchmarks.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/table_01_benchmarks.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/table_01_benchmarks.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_01_benchmarks.png"
alt="Detector benchmark table on TenniSet V006 showing F3ED pretrained achieves the highest F1 of 0.54 with 109 TP, 48 FP, 55 FN, 0.53 recall and 0.69 precision, beating E2E-Spot and rule-based baselines that both score 0.50 F1"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-table_01_benchmarks-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/table_01_benchmarks.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Detector benchmark table on TenniSet V006 showing F3ED pretrained achieves the highest F1 of 0.54 with 109 TP, 48 FP, 55 FN, 0.53 recall and 0.69 precision, beating E2E-Spot and rule-based baselines that both score 0.50 F1" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;F3ED has the highest F1, with fewer false positives at comparable recall. I&amp;rsquo;m not arguing it&amp;rsquo;s broken. I&amp;rsquo;m arguing about a specific thing it can&amp;rsquo;t do alone.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: numbers above are from the V006 baseline run on commit &lt;code&gt;0babb71&lt;/code&gt; (2026-04-23). Bounce-dedup and reconciler work since then shift F1 marginally upward; full re-eval pending a Phase-1 v8x rerun on V006.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="the-audit"&gt;The audit&lt;/h2&gt;
&lt;p&gt;Tennis scoring is a finite-state machine. A point ends in exactly one of:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;ace&lt;/code&gt;: server&amp;rsquo;s first or second serve, receiver doesn&amp;rsquo;t return&lt;br&gt;
&lt;code&gt;double_fault&lt;/code&gt;: both serves miss&lt;br&gt;
&lt;code&gt;first_serve_fault&lt;/code&gt;: first serve misses; second serve still to come &lt;br&gt;
multi-shot rally → &lt;code&gt;winner&lt;/code&gt; / &lt;code&gt;forced-err&lt;/code&gt; / &lt;code&gt;unforced-err&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;F3ED can only emit the bottom row. The first three depend on what happens between shots, or on what doesn&amp;rsquo;t happen at all, and the model never sees between-shot context. Even if it did, it has no class to put the answer in: &lt;code&gt;ace&lt;/code&gt;, &lt;code&gt;double_fault&lt;/code&gt;, and &lt;code&gt;first_serve_fault&lt;/code&gt; are not in F3ED&amp;rsquo;s published label set, so the closest available emission for any of them is &lt;code&gt;serve&lt;/code&gt; + &lt;code&gt;unforced-err&lt;/code&gt;. The model couldn&amp;rsquo;t learn to distinguish them even if the training data marked them, because there is nowhere to put the answer.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what tipped me off. set1 R4, t=107s:&lt;/p&gt;
&lt;div class="code-block"&gt;&lt;button type="button" class="code-copy" aria-label="Copy code to clipboard"&gt;
&lt;span class="code-copy-text"&gt;Copy&lt;/span&gt;
&lt;/button&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;F3ED raw_elements: [&amp;#39;T&amp;#39;, &amp;#39;ad&amp;#39;, &amp;#39;near&amp;#39;, &amp;#39;serve&amp;#39;, &amp;#39;unforced-err&amp;#39;]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Score before rally: Poljicak 0 Dodig 0 (game start)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Score after rally: Poljicak 15 Dodig 0&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Server scored, receiver didn&amp;rsquo;t move, F3ED labeled the serve &amp;ldquo;unforced error&amp;rdquo;. You can&amp;rsquo;t hit an unforced error and win the point. It was an ace, and F3ED doesn&amp;rsquo;t have an &amp;ldquo;ace&amp;rdquo; button to press, so it picked the closest available label.&lt;/p&gt;
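&lt;p&gt;The classification rule itself fits in a few lines (my reconstruction of the logic described above, not the repo&amp;rsquo;s &lt;code&gt;reconcile.py&lt;/code&gt;; the real reconciler also has to tolerate OCR noise):&lt;/p&gt;

```python
# Sketch: reclassify a single-shot serve rally from the score delta.
# My reconstruction of the rule described in the post, not repo code.

def reclassify_serve(server_won_point: bool, point_ended: bool,
                     is_second_serve: bool) -> str:
    """Outcome of a rally whose only shot is the serve."""
    if server_won_point:
        return "ace"                # untouched serve that scored
    if not point_ended:
        return "first_serve_fault"  # score unchanged: second serve to come
    if is_second_serve:
        return "double_fault"       # both serves missed, receiver scored
    return "unforced-err"           # ambiguous edge case: keep F3ED's label

print(reclassify_serve(True, True, False))   # -> ace
```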
&lt;p&gt;The reconciler is short. For each single-shot serve rally, read the scoreboard before and after:&lt;/p&gt;
&lt;div class="code-block" data-lang="python"&gt;&lt;span class="code-lang" aria-hidden="true"&gt;python&lt;/span&gt;&lt;button type="button" class="code-copy" aria-label="Copy code to clipboard"&gt;
&lt;span class="code-copy-text"&gt;Copy&lt;/span&gt;
&lt;/button&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Verbatim from src/tennis_vision/scoreboard/reconcile.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;_POINT_RANK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;15&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;30&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;40&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;AD&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_delta_points_won&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ScoreState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ScoreState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;(top_pts_won, bot_pts_won) between two states. A game-counter increment
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; counts as +1 (the lost-side rolls back to 0); same-game incremental points
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; are tracked via the points-rank delta.&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tg&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bg&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tg&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tg&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bg&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_POINT_RANK&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;_POINT_RANK&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_POINT_RANK&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;_POINT_RANK&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_classify_single_shot_serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rally&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;top_d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_delta_points_won&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;server_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;top_d&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rally&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;top&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;bot_d&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;receiver_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bot_d&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rally&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;top&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;top_d&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;receiver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;bottom&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rally&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;top&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;top&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;server_d&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;receiver_d&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;first_serve_fault&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;unknown&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ocr_score_delta_first_serve_fault&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;server_d&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;receiver_d&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ace&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rally&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ocr_score_delta_ace&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;receiver_d&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;server_d&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;double_fault&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;receiver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ocr_score_delta_double_fault&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;unknown&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rally&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;winner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rally&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;+ocr_inconclusive&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;That&amp;rsquo;s the whole reconciler: 23 lines, microseconds per rally. The OCR sampling pass that produces the &lt;code&gt;before&lt;/code&gt; / &lt;code&gt;after&lt;/code&gt; states runs once during Phase 2 (~1 Hz over the broadcast); the reconciler itself is a constant-time lookup against the resulting state timeline.&lt;/p&gt;
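&lt;p&gt;The lookup itself is a few lines. A minimal sketch, assuming the Phase 2 pass leaves behind a time-sorted list of &lt;code&gt;(timestamp, state)&lt;/code&gt; pairs; the names here are illustrative, not the pipeline&amp;rsquo;s actual API, and the bisect makes it effectively constant per rally:&lt;/p&gt;

```python
# Hypothetical sketch of the before/after lookup the reconciler needs:
# last OCR state at or before the rally start, first at or after its end.
from bisect import bisect_left, bisect_right

def states_around_rally(timeline, start_ts, end_ts):
    """timeline: time-sorted list of (timestamp, state) pairs."""
    times = [t for t, _ in timeline]
    i = bisect_right(times, start_ts) - 1   # last sample <= rally start
    j = bisect_left(times, end_ts)          # first sample >= rally end
    before = timeline[i][1] if i >= 0 else None
    after = timeline[j][1] if j < len(timeline) else None
    return before, after
```

&lt;p&gt;Either side can come back &lt;code&gt;None&lt;/code&gt; when the rally falls outside the sampled window, which is what triggers the fallback described below.&lt;/p&gt;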
&lt;p&gt;Running it across set1 + match2 (44 rallies, 11 single-shot serve rallies) shows the structure F3ED missed:&lt;/p&gt;
&lt;figure class="post-figure" style="width: 65%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-table_02_outcomes-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/table_02_outcomes.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/table_02_outcomes.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/table_02_outcomes.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/table_02_outcomes.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_02_outcomes.png 1200w"
sizes="65vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/table_02_outcomes.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/table_02_outcomes.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/table_02_outcomes.png 1440w"
sizes="65vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_02_outcomes.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/table_02_outcomes.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/table_02_outcomes.png 2000w"
sizes="65vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_02_outcomes.png"
alt="Confusion table showing how F3ED labels map to OCR-grounded reality: of eleven unforced-err labels, seven are actually first-serve-faults, one is an ace, and only three are genuine unforced errors, while winner and in labels are correct"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-table_02_outcomes-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/table_02_outcomes.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Confusion table showing how F3ED labels map to OCR-grounded reality: of eleven unforced-err labels, seven are actually first-serve-faults, one is an ace, and only three are genuine unforced errors, while winner and in labels are correct" decoding="async"&gt;
&lt;/dialog&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-01_outcome_transitions-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/01_outcome_transitions.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/01_outcome_transitions.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/01_outcome_transitions.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/01_outcome_transitions.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/01_outcome_transitions.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/01_outcome_transitions.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/01_outcome_transitions.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/01_outcome_transitions.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/01_outcome_transitions.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/01_outcome_transitions.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/01_outcome_transitions.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/01_outcome_transitions.png"
alt="Sankey-style chart showing F3ED outcome labels transitioning to OCR-grounded ground truth, with seven of eleven unforced errors reclassified as first-serve faults, one as an ace, and only three remaining as genuine unforced errors"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-01_outcome_transitions-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/01_outcome_transitions.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Sankey-style chart showing F3ED outcome labels transitioning to OCR-grounded ground truth, with seven of eleven unforced errors reclassified as first-serve faults, one as an ace, and only three remaining as genuine unforced errors" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;8 of 11 unforced-err serves (73%) are something else by tennis&amp;rsquo;s actual rules. All 8 got the right label after reconciliation. Whether 73% holds up on a larger sample is a real question; the audit framework would answer it cheaply on more clips. The remaining error budget is OCR layout failures (next section) and ambiguous score deltas in multi-shot rallies, where neither F3ED nor OCR alone tells &lt;code&gt;winner&lt;/code&gt; from &lt;code&gt;forced-err&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The point here isn&amp;rsquo;t that F3ED is wrong. The model emits the labels it has classes for, which is what models do. The point is that shot detection and outcome classification look like the same problem and aren&amp;rsquo;t, and on broadcast tennis the cheapest outcome ground truth is text the broadcaster has already burned into the corner of every frame.&lt;/p&gt;
&lt;h2 id="does-the-ocr-actually-work"&gt;Does the OCR actually work?&lt;/h2&gt;
&lt;p&gt;Worth asking. The whole reconciler depends on the scoreboard reader being right. Honest answer: it depends heavily on whether the layout config is tuned. When it is, OCR is reliable. When it isn&amp;rsquo;t, individual fields collapse.&lt;/p&gt;
&lt;p&gt;The pipeline is EasyOCR cropping a per-layout ROI (&lt;code&gt;split_open_1080p&lt;/code&gt;, &lt;code&gt;split_open_720p&lt;/code&gt;, &lt;code&gt;bloomfield_720p&lt;/code&gt;), then a tennis-grammar decoder that rejects illegal transitions (&lt;code&gt;40-30&lt;/code&gt; → &lt;code&gt;0-0&lt;/code&gt; without a game break, &lt;code&gt;AD-15&lt;/code&gt;, and so on) and majority-votes within a sample window.&lt;/p&gt;
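&lt;p&gt;A toy version of the grammar check shows the idea. This is a deliberate simplification that ignores deuce/advantage and game boundaries, and the names are mine, not the pipeline&amp;rsquo;s:&lt;/p&gt;

```python
from collections import Counter

# Within a game, a legal transition is one player gaining exactly one
# point (simplified: no deuce/advantage or game-win resets here).
_NEXT = {'0': '15', '15': '30', '30': '40'}

def legal_point_step(before, after):
    (tp0, bp0), (tp1, bp1) = before, after
    top_scored = _NEXT.get(tp0) == tp1 and bp0 == bp1
    bot_scored = _NEXT.get(bp0) == bp1 and tp0 == tp1
    return top_scored or bot_scored

def vote(window):
    """Majority-vote the raw OCR reads inside one sample window."""
    return Counter(window).most_common(1)[0][0]
```

&lt;p&gt;So &lt;code&gt;('40', '30') → ('40', '40')&lt;/code&gt; passes, while &lt;code&gt;('40', '30') → ('0', '0')&lt;/code&gt; is rejected unless the games field moved in the same window; the majority vote then soaks up one-frame misreads like &lt;code&gt;75&lt;/code&gt; for &lt;code&gt;15&lt;/code&gt;.&lt;/p&gt;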
&lt;p&gt;Field-parse rates on the two clips in this audit, sampled at ~1 Hz:&lt;/p&gt;
&lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-table_03_ocr_parse_rates-png-6" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/table_03_ocr_parse_rates.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/table_03_ocr_parse_rates.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/table_03_ocr_parse_rates.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/table_03_ocr_parse_rates.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_03_ocr_parse_rates.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/table_03_ocr_parse_rates.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/table_03_ocr_parse_rates.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/table_03_ocr_parse_rates.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_03_ocr_parse_rates.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/table_03_ocr_parse_rates.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/table_03_ocr_parse_rates.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_03_ocr_parse_rates.png"
alt="OCR field-parse rates table showing set1 with split_open_1080p layout achieves near-perfect parsing at 100 percent for games and 98.7 percent for points, while match2 with mistuned split_open_720p layout drops to 45.6 percent on bot_games"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-table_03_ocr_parse_rates-png-6" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/table_03_ocr_parse_rates.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="OCR field-parse rates table showing set1 with split_open_1080p layout achieves near-perfect parsing at 100 percent for games and 98.7 percent for points, while match2 with mistuned split_open_720p layout drops to 45.6 percent on bot_games" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;set1 is essentially perfect. match2&amp;rsquo;s &lt;code&gt;bot_games&lt;/code&gt; parse drops below half because the ROI for &lt;code&gt;split_open_720p&lt;/code&gt; is mistuned and crops too tight on the digit. Annoying, but the grammar decoder rescues enough frames to emit 28 valid score states across 1002 samples, which is plenty. The reconciler degrades gracefully: rallies without a clean before/after pair fall back to F3ED&amp;rsquo;s raw outcome rather than crashing.&lt;/p&gt;
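&lt;p&gt;The fallback is just a guard ahead of the classifier. A sketch, with an illustrative stand-in for the rally record:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Rally:                  # stand-in for the pipeline's rally record
    outcome: str
    winner: str
    method: str

def reconcile(rally, before, after, classify):
    """Keep F3ED's raw outcome when no clean OCR pair brackets the
    rally; otherwise defer to the score-delta classifier."""
    if before is None or after is None:
        return rally.outcome, rally.winner, rally.method
    return classify(rally, before, after)
```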
&lt;p&gt;The fix for match2 is layout cleanup, not architecture. None of these components are novel. &lt;a href="https://arxiv.org/abs/2603.13397"&gt;TennisExpert&lt;/a&gt; (Liu et al. 2026, the paper that kicked off this whole project for me) and the TenniSet eval framework (Faulkner &amp;amp; Dick, DICTA 2017) both use OCR + grammar at the labeling stage. What I haven&amp;rsquo;t seen anyone do is plug the same signal back into runtime label correction.&lt;/p&gt;
&lt;h2 id="putting-it-in-the-render"&gt;Putting it in the render&lt;/h2&gt;
&lt;p&gt;After reconciling, the corrected outcome flows back onto the last shot of the rally and surfaces in the rolling event-timeline panel. Here&amp;rsquo;s a single point rendered end-to-end with all overlays live:&lt;/p&gt;
&lt;div class="video-loop" style="width: 80%; margin: 1.5rem auto; padding: 0; aspect-ratio: ZgotmplZ;"&gt;
&lt;video autoplay muted loop playsinline preload="metadata" disablepictureinpicture controlslist="nodownload nofullscreen noremoteplayback"
aria-label="Looping tennis broadcast clip showing the rendered overlay with rally panel, scoreboard echo, per-player stats, and direction labels updating live during a single point"
style="width: 100%; height: 100%; display: block; border-radius: 4px; object-fit: cover; background: #000;"&gt;
&lt;source src="https://static.philippdubach.com/tennis_vision-example-point-mobile.mp4" type="video/mp4" media="(max-width: 768px)"&gt;
&lt;source src="https://static.philippdubach.com/tennis_vision-example-point.mp4" type="video/mp4"&gt;
&lt;/video&gt;
&lt;/div&gt;
&lt;p&gt;Same set1 R4 ace, mid-frame: Top-left RALLY panel reads &lt;code&gt;0.0s P2 Serve T ACE&lt;/code&gt;. The &lt;code&gt;ACE&lt;/code&gt; suffix replaced F3ED&amp;rsquo;s &lt;code&gt;UE&lt;/code&gt;. The scoreboard echo (bottom-left) mirrors what triggered the correction (Poljicak just picked up 15), and the per-player stats panel (bottom-right) ticks his ace counter by one. The model&amp;rsquo;s wrong answer gets quietly corrected because a different signal contradicted it. That&amp;rsquo;s the whole post in one frame.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-05_set1_ace_corrected-png-8" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/05_set1_ace_corrected.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/05_set1_ace_corrected.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/05_set1_ace_corrected.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/05_set1_ace_corrected.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/05_set1_ace_corrected.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/05_set1_ace_corrected.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/05_set1_ace_corrected.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/05_set1_ace_corrected.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/05_set1_ace_corrected.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/05_set1_ace_corrected.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/05_set1_ace_corrected.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/05_set1_ace_corrected.png"
alt="Rendered tennis broadcast frame showing the corrected ace label in the rally panel reading 0.0s P2 Serve T ACE, with the scoreboard echo confirming Poljicak picked up 15 and the per-player stats panel ticking the ace counter by one"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-05_set1_ace_corrected-png-8" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/05_set1_ace_corrected.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Rendered tennis broadcast frame showing the corrected ace label in the rally panel reading 0.0s P2 Serve T ACE, with the scoreboard echo confirming Poljicak picked up 15 and the per-player stats panel ticking the ace counter by one" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The same panel surfaces F3ED&amp;rsquo;s other labels in real time: direction (&lt;code&gt;T&lt;/code&gt; for down-the-T serves, &lt;code&gt;CC&lt;/code&gt;/&lt;code&gt;DL&lt;/code&gt;/&lt;code&gt;DM&lt;/code&gt;/&lt;code&gt;II&lt;/code&gt;/&lt;code&gt;IO&lt;/code&gt; for groundstrokes) and shot type when not a basic groundstroke (&lt;code&gt;Slice&lt;/code&gt;, &lt;code&gt;Volley&lt;/code&gt;, &lt;code&gt;Drop&lt;/code&gt;, &lt;code&gt;Lob&lt;/code&gt;).&lt;/p&gt;
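&lt;p&gt;For reference, the code-to-label expansion the panel implies, reconstructed from the glossary above rather than from any F3ED artifact:&lt;/p&gt;

```python
# Display expansion for the panel's direction codes (my reconstruction
# from the post's own glossary, not an official mapping).
DIRECTION_LABELS = {
    'T':  'down the T',       # serves only
    'CC': 'cross-court',
    'DL': 'down the line',
    'DM': 'down the middle',
    'II': 'inside-in',
    'IO': 'inside-out',
}
```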
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-06_match2_rally_panel-png-9" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/06_match2_rally_panel.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/06_match2_rally_panel.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/06_match2_rally_panel.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/06_match2_rally_panel.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/06_match2_rally_panel.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/06_match2_rally_panel.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/06_match2_rally_panel.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/06_match2_rally_panel.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/06_match2_rally_panel.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/06_match2_rally_panel.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/06_match2_rally_panel.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/06_match2_rally_panel.png"
alt="match2 rally panel rendering F3ED&amp;#39;s 29-class taxonomy in real time, showing direction codes for down-the-line, cross-court, and inside-out groundstrokes alongside shot-type tags like Slice, Volley, Drop, and Lob"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-06_match2_rally_panel-png-9" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/06_match2_rally_panel.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="match2 rally panel rendering F3ED&amp;#39;s 29-class taxonomy in real time, showing direction codes for down-the-line, cross-court, and inside-out groundstrokes alongside shot-type tags like Slice, Volley, Drop, and Lob" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;That panel is the F3ED 29-class taxonomy made human-readable, in real time. The reconciler doesn&amp;rsquo;t touch direction or technique. Those are pure shot properties, exactly the regime F3ED is designed for. It only fires on the score-grammar events the model can&amp;rsquo;t see.&lt;/p&gt;
&lt;p&gt;A clean direction histogram comes for free as a side effect. 97 groundstrokes from match2:&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-04_direction_distribution-png-10" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/04_direction_distribution.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/04_direction_distribution.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/04_direction_distribution.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/04_direction_distribution.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/04_direction_distribution.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/04_direction_distribution.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/04_direction_distribution.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/04_direction_distribution.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/04_direction_distribution.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/04_direction_distribution.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/04_direction_distribution.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/04_direction_distribution.png"
alt="Histogram of groundstroke directions from 97 shots in match2 showing 36 percent down the middle, 31 percent cross-court, 17 percent inside-out, 12 percent down-the-line, and 3 percent inside-in"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-04_direction_distribution-png-10" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/04_direction_distribution.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Histogram of groundstroke directions from 97 shots in match2 showing 36 percent down the middle, 31 percent cross-court, 17 percent inside-out, 12 percent down-the-line, and 3 percent inside-in" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;36% down the middle, 31% cross-court, 17% inside-out, 12% down-the-line, 3% inside-in. The kind of stat broadcasters quote without showing where it came from. Here it&amp;rsquo;s a one-liner over &lt;code&gt;shots.json&lt;/code&gt;.&lt;/p&gt;
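&lt;p&gt;The one-liner is roughly this; the &lt;code&gt;direction&lt;/code&gt; field name is an assumption about the &lt;code&gt;shots.json&lt;/code&gt; schema, and the stand-in list below replaces a &lt;code&gt;json.load&lt;/code&gt; in practice:&lt;/p&gt;

```python
from collections import Counter

# shots.json parses to a list of shot dicts; 'direction' is assumed.
shots = [
    {'direction': 'DM'}, {'direction': 'CC'}, {'direction': 'DM'},
    {'direction': 'IO'}, {'direction': 'DL'},
]
dirs = Counter(s['direction'] for s in shots if s.get('direction'))
total = sum(dirs.values())
histogram = {code: round(100 * n / total) for code, n in dirs.items()}
```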
&lt;h2 id="things-that-didnt-pay-off"&gt;Things that didn&amp;rsquo;t pay off&lt;/h2&gt;
&lt;p&gt;Two ideas I tried that I expected to be wins. Neither was.&lt;/p&gt;
&lt;h3 id="yolov8x-doesnt-help-at-720p"&gt;YOLOv8x doesn&amp;rsquo;t help at 720p&lt;/h3&gt;
&lt;p&gt;Phase-1 person detector was YOLOv8m. Swapping in v8x looked like a free improvement: COCO AP@small bumps about 5 pp, and the camera-far (&amp;ldquo;top&amp;rdquo;) player on broadcast tennis is the smallest object in the frame, so that&amp;rsquo;s exactly where the gain should land.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-02_pose_coverage_by_resolution-png-11" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/02_pose_coverage_by_resolution.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/02_pose_coverage_by_resolution.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/02_pose_coverage_by_resolution.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/02_pose_coverage_by_resolution.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/02_pose_coverage_by_resolution.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/02_pose_coverage_by_resolution.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/02_pose_coverage_by_resolution.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/02_pose_coverage_by_resolution.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/02_pose_coverage_by_resolution.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/02_pose_coverage_by_resolution.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/02_pose_coverage_by_resolution.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/02_pose_coverage_by_resolution.png"
alt="Bar chart comparing top-player pose coverage with YOLOv8m versus YOLOv8x at two resolutions, showing 1080p coverage rising from 70.0 percent to 97.6 percent while 720p coverage stays flat at roughly 70 percent"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-02_pose_coverage_by_resolution-png-11" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/02_pose_coverage_by_resolution.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Bar chart comparing top-player pose coverage with YOLOv8m versus YOLOv8x at two resolutions, showing 1080p coverage rising from 70.0 percent to 97.6 percent while 720p coverage stays flat at roughly 70 percent" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;set1 (1080p): top-player pose coverage 70.0% → 97.6%. match2 (720p): 70.3% → 68.6%, within noise. Two clips isn&amp;rsquo;t a study, but the mechanism is plausible: at 1080p the camera-far player is ~60-100 px tall, the regime where v8x&amp;rsquo;s AP@small advantage fires. At 720p the same player is ~30-50 px, below the COCO scale buckets where -x outperforms -m. The detector can&amp;rsquo;t recover what isn&amp;rsquo;t in the input. If you&amp;rsquo;re scraping ATP Challenger feeds, fight for 1080p sources. Everything downstream compounds on what the detector gives you.&lt;/p&gt;
&lt;h3 id="catboost-over-fires-bounces"&gt;CatBoost over-fires bounces&lt;/h3&gt;
&lt;p&gt;The bounce detector emitted 378 bounces on a 20-min match2 clip with 84 shots, 4.5× the realistic ratio. Most of the noise is the detector lighting up on the same physical bounce across consecutive frames, plus inter-rally footage where the ball is in a player&amp;rsquo;s hand or in a replay close-up.&lt;/p&gt;
&lt;p&gt;Two cheap filters cut 21-27% of false bounces:&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-03_bounce_filter_waterfall-png-12" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/03_bounce_filter_waterfall.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/03_bounce_filter_waterfall.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/03_bounce_filter_waterfall.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/03_bounce_filter_waterfall.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/03_bounce_filter_waterfall.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/03_bounce_filter_waterfall.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/03_bounce_filter_waterfall.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/03_bounce_filter_waterfall.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/03_bounce_filter_waterfall.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/03_bounce_filter_waterfall.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/03_bounce_filter_waterfall.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/03_bounce_filter_waterfall.png"
alt="Waterfall chart showing bounce-count reduction from 378 raw CatBoost bounces through a 400 ms temporal dedup that drops 9 to 12 percent, then a court-locality filter that drops a further 12 to 15 percent"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-03_bounce_filter_waterfall-png-12" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/03_bounce_filter_waterfall.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Waterfall chart showing bounce-count reduction from 378 raw CatBoost bounces through a 400 ms temporal dedup that drops 9 to 12 percent, then a court-locality filter that drops a further 12 to 15 percent" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The first is a temporal dedup with ~400 ms minimum separation between bounces, fps-aware. It collapses CatBoost firing on three consecutive frames for one physical contact and drops 9-12%.&lt;/p&gt;
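&lt;p&gt;A minimal sketch of that dedup, assuming bounces arrive as frame indices; the 400 ms default is from the post, everything else is illustrative:&lt;/p&gt;

```python
def dedup_bounces(bounce_frames, fps, min_gap_s=0.4):
    """Keep a bounce only if roughly 400 ms have passed since the last
    kept bounce; the frame gap is derived from fps so the same threshold
    works at 25, 30, or 60 fps."""
    min_gap = max(1, round(min_gap_s * fps))
    kept = []
    for f in sorted(bounce_frames):
        if not kept or f - kept[-1] >= min_gap:
            kept.append(f)
    return kept
```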
&lt;p&gt;The second is a court-locality filter: project the ball pixel through the homography to canvas coordinates, drop if it falls outside the court polygon plus a 200 px buffer. This kills inter-rally noise where the ball is being held or replayed, dropping another 12-15%.&lt;/p&gt;
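&lt;p&gt;A sketch of the court-locality test, assuming the homography maps pixels to a canvas where the court is an axis-aligned rectangle with the origin at one corner; the canvas dimensions here are placeholders, only the 200 px buffer comes from the post:&lt;/p&gt;

```python
import numpy as np

def inside_court(ball_px, H, court_w=1097, court_h=2377, buffer_px=200):
    """Project a ball pixel through homography H into canvas coordinates
    and test against the court rectangle plus a buffer."""
    x, y, w = H @ np.array([ball_px[0], ball_px[1], 1.0])
    cx, cy = x / w, y / w  # perspective divide
    return bool(court_w + buffer_px >= cx >= -buffer_px
                and court_h + buffer_px >= cy >= -buffer_px)
```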
&lt;p&gt;Real bounces don&amp;rsquo;t fire 200 ms apart and don&amp;rsquo;t land 3 m past the doubles alley. Neither filter is novel; both are roughly ten lines of code. If you&amp;rsquo;re using a CatBoost-style bounce detector you probably want both anyway.&lt;/p&gt;
&lt;h2 id="whats-open"&gt;What&amp;rsquo;s open&lt;/h2&gt;
&lt;p&gt;44 rallies isn&amp;rsquo;t enough to nail the percentage, just to expose the structure. Running the audit across 10+ matches is the obvious next step. It would also surface match-to-match variance in F3ED&amp;rsquo;s failure modes. Does it mislabel aces more often on hard courts than clay? I have no idea, and I&amp;rsquo;d like to know.&lt;/p&gt;
&lt;p&gt;The reconciler currently only handles single-shot serve rallies. When both players hit clean balls and the point ends, the score delta is the same whether the winner came from a &lt;code&gt;winner&lt;/code&gt; or a &lt;code&gt;forced-err&lt;/code&gt;. Neither F3ED nor OCR alone disambiguates. A trajectory-aware classifier on the last two shots would close that gap. Haven&amp;rsquo;t tried it.&lt;/p&gt;
&lt;p&gt;The longer-term move is closed-loop F3ED retraining: use the OCR-corrected labels as supervision for a small classifier head whose input is (F3ED 4-class outcome, OCR delta, single-shot flag) and whose output is the extended set {&lt;code&gt;in&lt;/code&gt;, &lt;code&gt;winner&lt;/code&gt;, &lt;code&gt;forced-err&lt;/code&gt;, &lt;code&gt;unforced-err&lt;/code&gt;, &lt;code&gt;ace&lt;/code&gt;, &lt;code&gt;double_fault&lt;/code&gt;, &lt;code&gt;first_serve_fault&lt;/code&gt;, &lt;code&gt;unreturnable&lt;/code&gt;}. About 5 minutes of training data per match. 10+ matches gets a usable head. The interesting move there is putting the OCR signal into training rather than just inference.&lt;/p&gt;</description></item><item><title>Inside PRAGMA: Revolut's Foundation Model for Banking</title><link>https://philippdubach.com/posts/inside-pragma-revoluts-foundation-model-for-banking/</link><pubDate>Sun, 26 Apr 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/inside-pragma-revoluts-foundation-model-for-banking/</guid><description>&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-revolut-cover-jpg-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/revolut-cover.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/revolut-cover.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/revolut-cover.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/revolut-cover.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/revolut-cover.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/revolut-cover.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/revolut-cover.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/revolut-cover.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/revolut-cover.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/revolut-cover.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/revolut-cover.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/revolut-cover.jpg"
alt="Editorial cover illustration for an analysis of Revolut&amp;#39;s PRAGMA foundation model for banking, contrasting a small consumer banking app with the vast underlying transformer architecture"
class=""
width="1200"
fetchpriority="high"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-revolut-cover-jpg-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/revolut-cover.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Editorial cover illustration for an analysis of Revolut&amp;#39;s PRAGMA foundation model for banking, contrasting a small consumer banking app with the vast underlying transformer architecture" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;This month, Revolut Research and NVIDIA published &lt;a href="https://arxiv.org/abs/2604.08649"&gt;PRAGMA&lt;/a&gt;: an encoder-only transformer trained on 26 million user histories spanning 24 billion events and 207 billion tokens across 111 countries. To my knowledge it is the largest encoder backbone for consumer banking event data anyone has put on arXiv. Nine months earlier, Nubank had published &lt;a href="https://arxiv.org/abs/2507.23267"&gt;nuFormer&lt;/a&gt;, a similar premise with the opposite architecture. Both ask the same question: can you train a transformer on raw transaction ledgers and replace the gradient-boosted-tree models running production credit, fraud, and recommendation pipelines?&lt;/p&gt;
&lt;p&gt;Banking has spent the last decade lagging the rest of tech on representation learning. Production models still run on hand-crafted tabular features. Every team working on this knows it&amp;rsquo;s suboptimal. Almost no team has the data, the GPUs, or the political budget to fix it. PRAGMA is what a banking foundation model looks like at the high end of the market.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-fig1-headline-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/fig1-headline.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/fig1-headline.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/fig1-headline.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/fig1-headline.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/fig1-headline.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/fig1-headline.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/fig1-headline.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/fig1-headline.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/fig1-headline.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/fig1-headline.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/fig1-headline.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/fig1-headline.png"
alt="Figure 1 from the PRAGMA paper: relative performance of three PRAGMA sizes (10M, 100M, 1B parameters) against task-specific baselines across six banking tasks including credit scoring, fraud detection, and product recommendation"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-fig1-headline-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/fig1-headline.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Figure 1 from the PRAGMA paper: relative performance of three PRAGMA sizes (10M, 100M, 1B parameters) against task-specific baselines across six banking tasks including credit scoring, fraud detection, and product recommendation" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The chart above is from the PRAGMA paper and it reads like a marketing slide. PR-AUC up &lt;strong&gt;130.2%&lt;/strong&gt; on credit scoring. AUUC up &lt;strong&gt;163.7%&lt;/strong&gt; on a communication uplift task. mAP up &lt;strong&gt;40.5%&lt;/strong&gt; on product recommendation. These are relative numbers against task-specific baselines and the absolute scores are commercially redacted, so calibrate accordingly. But Revolut publishing them under their own name, with author affiliations, is the meaningful signal here. Internal foundation models have moved from trade secret to competitive disclosure.&lt;/p&gt;
&lt;h2 id="what-revolut-built"&gt;What Revolut built&lt;/h2&gt;
&lt;p&gt;PRAGMA is a BERT-style encoder, not a GPT. The choice matters. Revolut&amp;rsquo;s downstream targets are discriminative (default within 12 months, fraud, churn, product adoption), which is exactly what bidirectional masked modelling is good at. The model family scales from 10M to 100M to 1B parameters across three encoder branches: a profile-state encoder for static attributes, a per-event encoder, and a history encoder that fuses them.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-fig4-architecture-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/fig4-architecture.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/fig4-architecture.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/fig4-architecture.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/fig4-architecture.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/fig4-architecture.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/fig4-architecture.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/fig4-architecture.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/fig4-architecture.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/fig4-architecture.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/fig4-architecture.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/fig4-architecture.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/fig4-architecture.png"
alt="PRAGMA backbone architecture: two-branch design with separate profile-state encoder and per-event encoder feeding a shared history encoder, showing how static user attributes and event sequences are fused into one representation"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-fig4-architecture-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/fig4-architecture.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="PRAGMA backbone architecture: two-branch design with separate profile-state encoder and per-event encoder feeding a shared history encoder, showing how static user attributes and event sequences are fused into one representation" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The architectural decision that strikes me as most important is the input representation. Naive text serialization of a transaction record into JSON blows up sequence length: every key name, every delimiter, every digit becomes multiple BPE subword tokens. Worse, splitting &amp;ldquo;14.99&amp;rdquo; into &amp;ldquo;14&amp;rdquo; &amp;ldquo;.&amp;rdquo; &amp;ldquo;99&amp;rdquo; destroys the magnitude information that any credit model needs. Revolut&amp;rsquo;s answer is to tokenise each field as a triple of semantic key, typed value, and temporal coordinate. Numerical values map to learned percentile buckets. Categorical values map to single tokens. Text gets BPE. Timestamps get encoded twice, once as compressed log-seconds since the previous event and once as fixed-period sinusoids over hour-of-day, day-of-week, and day-of-month.&lt;/p&gt;
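&lt;p&gt;A toy numpy sketch of the value side of that scheme. The bucket edges and the exact feature layout are illustrative; the paper specifies the components (percentile buckets, log-delta, fixed-period sinusoids) but not this packaging:&lt;/p&gt;

```python
import numpy as np

def bucket_value(value, edges):
    """Map a numeric field to a percentile-bucket token id; `edges`
    stands in for per-field percentile boundaries estimated on the
    training corpus."""
    return int(np.searchsorted(edges, value))

def time_features(ts, prev_ts, hour, dow, dom):
    """The double timestamp encoding: compressed log-seconds since the
    previous event, plus sinusoids over hour-of-day, day-of-week, and
    day-of-month."""
    log_delta = np.log1p(max(ts - prev_ts, 0.0))
    periodic = []
    for val, period in ((hour, 24), (dow, 7), (dom, 31)):
        angle = 2 * np.pi * val / period
        periodic += [np.sin(angle), np.cos(angle)]
    return np.array([log_delta] + periodic)

# bucket_value(14.99, edges) keeps magnitude information that
# BPE-splitting "14.99" into "14" "." "99" would destroy.
```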
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-fig2-timeline-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/fig2-timeline.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/fig2-timeline.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/fig2-timeline.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/fig2-timeline.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/fig2-timeline.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/fig2-timeline.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/fig2-timeline.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/fig2-timeline.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/fig2-timeline.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/fig2-timeline.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/fig2-timeline.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/fig2-timeline.png"
alt="A PRAGMA user history as a stream of structured banking events with timestamps and key-value attributes, around 60 keys and 28,000 value tokens per user, leading up to an evaluation point where the model predicts a downstream target"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-fig2-timeline-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/fig2-timeline.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="A PRAGMA user history as a stream of structured banking events with timestamps and key-value attributes, around 60 keys and 28,000 value tokens per user, leading up to an evaluation point where the model predicts a downstream target" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The figure above is what a single user looks like to PRAGMA: a stream of structured events leading up to an evaluation point at which the model is asked to predict something. Around 60 keys. Around 28,000 value tokens.&lt;/p&gt;
&lt;p&gt;Pre-training is masked language modelling, but with three masking sources blended together: 15% standard token masking, 10% whole-event masking, and 10% semantic-type masking. The whole-event variant is interesting for banking. It teaches the model that when you cannot see the amount of a card payment but you can see the merchant, the time, and the surrounding behavioural pattern, the amount is often inferable. That is exactly the inductive bias you want in a credit or fraud model.&lt;/p&gt;
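&lt;p&gt;Roughly, the blended mask selection might look like the following; the three rates are from the paper, while the union-of-sources blending is my guess at a reasonable implementation:&lt;/p&gt;

```python
import numpy as np

def choose_masks(event_ids, type_ids, rng,
                 p_tok=0.15, p_event=0.10, p_type=0.10):
    """Boolean mask over token positions, blending three sources:
    independent token masking, whole-event masking (all tokens of a
    sampled event), and semantic-type masking (all tokens of a sampled
    key type)."""
    event_ids = np.asarray(event_ids)
    type_ids = np.asarray(type_ids)
    mask = p_tok > rng.random(event_ids.shape[0])   # token-level
    for ev in np.unique(event_ids):                 # whole events
        if p_event > rng.random():
            mask |= event_ids == ev
    for t in np.unique(type_ids):                   # semantic types
        if p_type > rng.random():
            mask |= type_ids == t
    return mask
```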
&lt;h2 id="the-numbers"&gt;The numbers&lt;/h2&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-table2-results-png-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/table2-results.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/table2-results.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/table2-results.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/table2-results.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table2-results.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/table2-results.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/table2-results.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/table2-results.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table2-results.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/table2-results.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/table2-results.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table2-results.png"
alt="Relative performance of PRAGMA-L with LoRA fine-tuning against internal task-specific baselines: 130 percent PR-AUC lift on credit scoring, 163 percent AUUC on uplift, 40 percent mAP on product recommendation, with the AML task showing a 47 percent loss"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-table2-results-png-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/table2-results.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Relative performance of PRAGMA-L with LoRA fine-tuning against internal task-specific baselines: 130 percent PR-AUC lift on credit scoring, 163 percent AUUC on uplift, 40 percent mAP on product recommendation, with the AML task showing a 47 percent loss" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;(1) The LoRA versus train-from-scratch comparison. Revolut shows that fine-tuning a pre-trained backbone with LoRA, updating roughly 2-4% of parameters, consistently matches or beats training a fresh task-specific model on the same downstream data. This is the result that justifies the entire infrastructure investment. If pre-training did not transfer, you would not bother. Communication engagement gains &lt;strong&gt;18.6%&lt;/strong&gt; PR-AUC from LoRA over scratch. Credit scoring gains &lt;strong&gt;13%&lt;/strong&gt;. Product recommendation gains &lt;strong&gt;10.3%&lt;/strong&gt; mAP. That is the business case.&lt;/p&gt;
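&lt;p&gt;The parameter arithmetic checks out. In the standard LoRA scheme, a frozen weight &lt;code&gt;W&lt;/code&gt; gets a trainable low-rank update &lt;code&gt;B @ A&lt;/code&gt;; for a square layer of width 1024 at rank 8, the trainable share is 2*1024*8 / 1024^2, about 1.6%, the same ballpark as the 2-4% quoted above. A minimal sketch of generic LoRA, not Revolut&amp;rsquo;s code:&lt;/p&gt;

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x W^T + (alpha/r) x A^T B^T, with W frozen and only the
    low-rank factors A (r x d_in) and B (d_out x r) trainable."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```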
&lt;p&gt;(2) The profile-state ablation. Removing the dedicated profile-state branch tells you which tasks are driven by static user characteristics versus event sequences. Credit scoring loses &lt;strong&gt;31.8%&lt;/strong&gt; PR-AUC without profile state, because account tenure and onboarding signals matter for identifying minority-class defaulters. Communication engagement actually gains 3% in PR-AUC without profile state, because re-engagement is a story about pre-drop-off behaviour, not about who the user is. The two-branch design keeps the static features when they help and ignores them when they do not.&lt;/p&gt;
&lt;p&gt;(3) The failure. PRAGMA loses &lt;strong&gt;47.1%&lt;/strong&gt; on F-0.5 against the production baseline for anti-money-laundering detection, and Revolut wrote this into their paper. The reason is that AML is a relational problem. You catch laundering by looking across users and across accounts, and PRAGMA processes each user history in isolation. The lesson generalises: foundation models on individual ledgers are not graph-aware, and the production AML stack at any large bank includes graph-aware components that PRAGMA cannot replace. Knowing the limit is more useful than the headline gains.&lt;/p&gt;
&lt;h2 id="how-this-compares-to-nubank"&gt;How this compares to Nubank&lt;/h2&gt;
&lt;p&gt;Nubank&amp;rsquo;s nuFormer, published in July 2025, makes the opposite architectural choice. It is a causal GPT-style decoder pre-trained with next-token prediction, with a &lt;a href="https://building.nubank.com/fine-tuning-transaction-user-models/"&gt;joint fusion&lt;/a&gt; finetuning step that bolts a &lt;a href="https://arxiv.org/abs/2008.13535"&gt;DCNv2&lt;/a&gt; tabular network onto the same gradient graph. The reported lift is &lt;strong&gt;+1.25%&lt;/strong&gt; in test AUC on a single recommendation task, and a &lt;strong&gt;4.4%&lt;/strong&gt; reduction in user churn measured in production. Smaller numbers than PRAGMA, but Nubank published a real production deployment outcome. PRAGMA&amp;rsquo;s results are still backtests.&lt;/p&gt;
&lt;p&gt;The two papers disagree on almost everything that is fun to argue about. Architecture: decoder versus encoder. Task scope: one task versus six. The role of static profile state: collapsed into the sequence versus given its own branch. What they agree on: Hand-crafted feature engineering can be replaced by self-supervised representation learning on raw transaction sequences, and doing so produces material lifts on real banking problems. The architectural debate is downstream of that.&lt;/p&gt;
&lt;p&gt;The broader literature is moving the same way. &lt;a href="https://arxiv.org/abs/2511.08939"&gt;TransactionGPT&lt;/a&gt; (Dou et al., 2025) introduces a 3D transformer for billion-scale payment trajectories aimed at anomaly detection. &lt;a href="https://arxiv.org/abs/1908.10063"&gt;FinBERT&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2303.17564"&gt;BloombergGPT&lt;/a&gt;, and &lt;a href="https://arxiv.org/abs/2306.06031"&gt;FinGPT&lt;/a&gt; cover the text side. &lt;a href="https://arxiv.org/abs/2310.01728"&gt;Time-LLM&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/2403.07815"&gt;Chronos&lt;/a&gt; cover numerical time series. PRAGMA and nuFormer are the two papers that target the actual structured event ledger sitting inside a retail bank, which is the asset that matters for credit, fraud, and product decisions.&lt;/p&gt;
&lt;h2 id="outlook"&gt;Outlook&lt;/h2&gt;
&lt;p&gt;There is no public checkpoint. Revolut and Nubank both keep their weights inside their production stack, which is the right business decision and the wrong scientific one. You cannot run PRAGMA on your own data. You can only read the paper and decide whether the recipe is reproducible.&lt;/p&gt;
&lt;p&gt;I think it is. The paper is detailed enough to rebuild from. The tokenisation scheme is fully specified. The architecture diagram is precise enough to follow. They even document the optimiser, &lt;a href="https://kellerjordan.github.io/posts/muon/"&gt;Muon&lt;/a&gt; plus AdamW, and the hardware, 32 H100s for the 1B variant. The constraint is the pre-training corpus, not the model.&lt;/p&gt;
&lt;p&gt;So the next project on this site is a faithful PRAGMA reimplementation at the small (10M) scale, trained on a synthetic or open-licensed transaction dataset, evaluated on a subset of the downstream tasks where public benchmarks exist. I will write that up here in instalments, including what works, what breaks, and where the paper is silent. The codebase will land in a public repository as I build it.&lt;/p&gt;</description></item><item><title>Do Not Disturb My Circles</title><link>https://philippdubach.com/posts/do-not-disturb-my-circles/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/do-not-disturb-my-circles/</guid><description>&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-circles-cover-jpg-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/circles-cover.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/circles-cover.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/circles-cover.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/circles-cover.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/circles-cover.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/circles-cover.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/circles-cover.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/circles-cover.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/circles-cover.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/circles-cover.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/circles-cover.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/circles-cover.jpg"
alt="Editorial cover illustration evoking Archimedes drawing geometric circles in the sand with the long shadow of an approaching soldier — paralleled to the conscription of AI for science into the chatbot arms race"
class=""
width="1200"
fetchpriority="high"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-circles-cover-jpg-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/circles-cover.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Editorial cover illustration evoking Archimedes drawing geometric circles in the sand with the long shadow of an approaching soldier — paralleled to the conscription of AI for science into the chatbot arms race" decoding="async"&gt;
&lt;/dialog&gt;
&lt;blockquote&gt;
&lt;p&gt;If I&amp;rsquo;d had my way, we would have left it in the lab for longer and done more things like AlphaFold, maybe cured cancer or something like that.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That&amp;rsquo;s &lt;a href="https://en.wikipedia.org/wiki/Demis_Hassabis"&gt;Demis Hassabis&lt;/a&gt; (I cannot recommend watching
&lt;a href="https://www.youtube.com/watch?v=d95J8yzvjbQ"&gt;The Thinking Game&lt;/a&gt; enough, or reading
&lt;a href="https://www.penguinrandomhouse.com/books/752231/the-infinity-machine-by-sebastian-mallaby/"&gt;The Infinity Machine&lt;/a&gt;), the CEO of Google DeepMind and a Nobel Prize winner, describing the future he didn&amp;rsquo;t get.&lt;/p&gt;
&lt;p&gt;He wanted a CERN for artificial intelligence. A decade or two of careful, methodical work. The world&amp;rsquo;s best scientists collaborating on each step toward general intelligence, understanding what they built before building the next thing. In the meantime, AI for science, narrow tools like AlphaFold, would ship real benefits: cures, new materials, maybe a crack at fusion. Not chatbots. He didn&amp;rsquo;t get that future. None of us did. Instead we got a commercial arms race, a $690 billion annual infrastructure buildout, and the greatest concentration of technical talent in human history pointed at making autocomplete better.&lt;/p&gt;
&lt;p&gt;This is a story about capital misallocation. But it&amp;rsquo;s also a very old story.&lt;/p&gt;
&lt;h2 id="geometry-in-the-sand"&gt;Geometry in the sand&lt;/h2&gt;
&lt;p&gt;In 214 BC, the Roman general Marcellus brought a fleet to Syracuse. Standing between Rome and the richest city in Sicily was one man: &lt;a href="https://en.wikipedia.org/wiki/Archimedes"&gt;Archimedes&lt;/a&gt;, the greatest scientist of the ancient world, a mathematician whose work on the lever, the screw, and the principles of buoyancy would outlast every empire he lived under.&lt;/p&gt;
&lt;p&gt;Archimedes did not want to build weapons. &lt;a href="https://en.wikipedia.org/wiki/Parallel_Lives"&gt;Plutarch&lt;/a&gt;, writing in the &lt;em&gt;Life of Marcellus&lt;/em&gt;, says Archimedes designed and contrived his machines &amp;ldquo;not as matters of any importance, but as mere amusements in geometry.&amp;rdquo; He regarded the whole business as ignoble, beneath the dignity of pure mathematics. But his patron King Hiero II needed defenses, and Archimedes was the only man who could provide them. So he built them. Catapults that could sink a ship at range. The &lt;a href="https://en.wikipedia.org/wiki/Claw_of_Archimedes"&gt;Claw of Archimedes&lt;/a&gt;, an iron grappling device that could lift a Roman galley out of the water and drop it. Possibly parabolic mirrors that focused sunlight to set ships on fire, though historians still debate that one.&lt;/p&gt;
&lt;p&gt;The machines worked. Plutarch writes that the Romans became so terrified that &amp;ldquo;whenever they saw a bit of rope or a stick of timber projecting over the wall, they cried &amp;lsquo;Archimedes is training some engine upon us,&amp;rsquo; and turned their backs and fled.&amp;rdquo; They held off Rome for two years.&lt;/p&gt;
&lt;p&gt;Then Syracuse fell anyway. In 212 BC, Roman soldiers breached the walls during a festival. A soldier found Archimedes drawing geometric figures in the sand. According to the tradition passed down through &lt;a href="https://en.wikipedia.org/wiki/Valerius_Maximus"&gt;Valerius Maximus&lt;/a&gt; and others, his last words were &lt;em&gt;&amp;ldquo;Noli turbare circulos meos&amp;rdquo;&lt;/em&gt;: do not disturb my circles.&lt;/p&gt;
&lt;p&gt;Marcellus had ordered Archimedes taken alive. The order didn&amp;rsquo;t matter. The soldier killed him. The geometry died with him. The war machines, the things Archimedes considered beneath his real work, survived in military engineering textbooks for centuries. His mathematical treatises survived only by accident, through a single Byzantine manuscript &lt;a href="https://en.wikipedia.org/wiki/Archimedes_Palimpsest"&gt;scraped and overwritten with prayer texts&lt;/a&gt; in the 13th century.&lt;/p&gt;
&lt;p&gt;I thought about this when I watched Demis Hassabis in a &lt;a href="https://www.youtube.com/watch?v=C0gErQtnNFE"&gt;recent interview with Cleo Abram&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="the-conscription"&gt;The conscription&lt;/h2&gt;
&lt;p&gt;He had been building learning systems at DeepMind for years. The work was pointed at science. AlphaFold was the first proof that AI could crack fundamental problems in biology. Move 37, AlphaGo&amp;rsquo;s famous creative play against Lee Sedol in 2016, was the proof that AI systems could discover things no human had considered.&lt;/p&gt;
&lt;p&gt;Then ChatGPT happened. Google went code red. Hassabis, the man who wanted to solve protein folding and maybe crack fusion, became the man who runs all of Google&amp;rsquo;s AI, including the consumer products he&amp;rsquo;d never wanted to focus on.&lt;/p&gt;
&lt;p&gt;He&amp;rsquo;s candid about what was lost:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;My ideal was to approach the latter stages of building AGI using the scientific method, very carefully, very precisely, very thoughtfully, in a CERN-like way. That might take a decade, even two decades longer. But I think that would make sense given the enormity of what we&amp;rsquo;re dealing with.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And about the irony:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Language was a lot easier than we were all expecting. Even those of us who were obviously optimists about the whole technology. We thought maybe there would be one or two or three more breakthroughs needed. But it turned out transformers and some reinforcement learning on top was enough.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The ease of the advance was the thing that derailed the deeper work. Language models turned out to be good enough for consumer products, and consumer products generate revenue, and revenue attracts competition, and competition creates the arms race that now consumes everything. DeepMind had &amp;ldquo;fairly equivalent systems&amp;rdquo; to ChatGPT at the time, Hassabis says. They chose not to release them. That choice was taken from him.&lt;/p&gt;
&lt;h2 id="what-a-dollar-buys"&gt;What a dollar buys&lt;/h2&gt;
&lt;p&gt;The resource allocation case is simple enough to state in one line, though the implications are not.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.nature.com/articles/s41586-021-03819-2"&gt;AlphaFold 2&lt;/a&gt; was trained on 128 Google TPUv3 chips for approximately 11 days. At &lt;a href="https://cloud.google.com/tpu/pricing"&gt;Google Cloud&amp;rsquo;s public pricing&lt;/a&gt; of roughly $32 per hour per TPU, the estimated training cost is on the order of &lt;strong&gt;$1 million&lt;/strong&gt;. It predicted the three-dimensional structures of 200 million proteins. Over 3 million scientists now use it. A pharma executive told Hassabis that &amp;ldquo;almost every drug developed from now on will have probably used AlphaFold in its process.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Now the other side of the ledger. &lt;a href="https://epoch.ai/blog/how-much-does-it-cost-to-train-frontier-ai-models/"&gt;GPT-4&amp;rsquo;s training cost&lt;/a&gt; an estimated &lt;strong&gt;$78 million&lt;/strong&gt;. &lt;a href="https://fortune.com/2024/04/18/google-gemini-cost-191-million-to-train-stanford-university-report-estimates/"&gt;Gemini Ultra ran to roughly &lt;strong&gt;$191 million&lt;/strong&gt;&lt;/a&gt;. OpenAI&amp;rsquo;s Orion &lt;a href="https://fortune.com/2025/02/25/what-happened-gpt-5-openai-orion-pivot-scaling-pre-training-llm-agi-reasoning/"&gt;exceeded &lt;strong&gt;$500 million&lt;/strong&gt;&lt;/a&gt; for a single training run, and the model was so disappointing they downgraded it from GPT-5 to GPT-4.5. OpenAI&amp;rsquo;s inference spending alone, just the cost of running the models after training, &lt;a href="https://aibusiness.com/language-models/ai-model-scaling-isn-t-over-it-s-entering-a-new-era"&gt;hit &lt;strong&gt;$2.3 billion in 2024&lt;/strong&gt;&lt;/a&gt;. That is 15 times what they spent training GPT-4.5.&lt;/p&gt;
&lt;p&gt;AlphaFold cost less to train than OpenAI spends on inference in a single day.&lt;/p&gt;
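&lt;p&gt;The comparison is easy to sanity-check. A minimal sketch of the arithmetic, assuming per-chip-hour billing (TPU v3 is actually billed per 8-chip device, so the real AlphaFold figure may be several times lower):&lt;/p&gt;

```python
# Back-of-envelope on the figures quoted above. Assumes per-chip-hour
# billing; TPU v3 is billed per 8-chip device, so the true AlphaFold
# figure may be lower still.
chips, days, usd_per_chip_hour = 128, 11, 32.0
alphafold_training_usd = chips * days * 24 * usd_per_chip_hour

# OpenAI's reported 2024 inference bill, spread evenly over the year
openai_inference_per_day_usd = 2.3e9 / 365

print(f"AlphaFold training:     ${alphafold_training_usd:,.0f}")
print(f"OpenAI inference / day: ${openai_inference_per_day_usd:,.0f}")
```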
&lt;p&gt;Zoom out further. The Big 4 hyperscalers, Amazon, Alphabet, Meta, Microsoft, are guiding to &lt;a href="https://www.goldmansachs.com/insights/articles/why-ai-companies-may-invest-more-than-500-billion-in-2026"&gt;&lt;strong&gt;$610-665 billion&lt;/strong&gt;&lt;/a&gt; in capital expenditure for 2026. &lt;a href="https://www.goldmansachs.com/insights/articles/why-ai-companies-may-invest-more-than-500-billion-in-2026"&gt;Goldman Sachs projects&lt;/a&gt; cumulative 2025-2027 spending at $1.15 trillion. As I noted in &lt;a href="https://philippdubach.com/posts/peter-thiels-physics-department/"&gt;Peter Thiel&amp;rsquo;s Physics Department&lt;/a&gt;, Big Tech spends &lt;strong&gt;75 times&lt;/strong&gt; more on AI than the entire US federal science budget: $250 billion versus $3.3 billion per year. The DOE Genesis Mission, the flagship US government program for AI-driven scientific discovery, &lt;a href="https://www.energy.gov/science/articles/doe-announces-genesis-mission-advance-ai-science"&gt;received &lt;strong&gt;$320 million&lt;/strong&gt; in its first round&lt;/a&gt;. That is less than Meta spends on AI infrastructure in a single week.&lt;/p&gt;
&lt;p&gt;The infrastructure being built is not for protein folding. It is not for materials science or fusion plasma control or genomics. It is for chatbots, image generators, and coding assistants. &lt;a href="https://sequoiacap.com/article/ais-600b-question/"&gt;Sequoia&amp;rsquo;s David Cahn calculated&lt;/a&gt; the AI ecosystem needs &lt;strong&gt;$600 billion in annual revenue&lt;/strong&gt; to justify current infrastructure spending. It generates perhaps $80-120 billion. And nearly all of that revenue comes from commercial applications: subscriptions, API access, enterprise contracts for systems that summarize emails and draft marketing copy.&lt;/p&gt;
&lt;p&gt;The bottleneck for AI for science was never money. AlphaFold proved that. It was always about who works on what, and the chatbot economy answered that question for an entire generation of researchers.&lt;/p&gt;
&lt;h2 id="what-the-circles-produced"&gt;What the circles produced&lt;/h2&gt;
&lt;p&gt;When Hassabis&amp;rsquo;s teams were allowed to focus on science, when the circles were left undisturbed, this is what happened.&lt;/p&gt;
&lt;p&gt;In The Thinking Game there&amp;rsquo;s a moment that captures it perfectly. The original plan for AlphaFold was conventional: build a server, let scientists submit protein sequences one at a time, email back the predicted structures. Standard approach, used by the whole field for 40 years. Then Hassabis started doing arithmetic on his phone in the middle of the meeting. Two hundred million known proteins. One fold every ten seconds. How many TPUs do we have? He looked up and said something like, &amp;ldquo;&lt;a href="https://youtu.be/d95J8yzvjbQ?si=1VVejCeVhn_1_3m6&amp;amp;t=4495"&gt;Why don&amp;rsquo;t we just fold everything?&lt;/a&gt;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;It would be, he realized, actually less work than building the server.&lt;/p&gt;
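&lt;p&gt;The phone arithmetic reconstructs easily. The protein count and per-fold time come from the anecdote; the fleet size below is an assumed round number, not a figure from the film:&lt;/p&gt;

```python
# Reconstructing the back-of-envelope. The protein count and per-fold
# time are from the anecdote; the accelerator count is an assumed
# round number for illustration.
proteins = 200_000_000
seconds_per_fold = 10
accelerators = 1_000

total_seconds = proteins * seconds_per_fold
years_single_device = total_seconds / (365 * 24 * 3600)
days_on_fleet = total_seconds / accelerators / (24 * 3600)

print(f"one device: {years_single_device:.0f} years")
print(f"{accelerators} devices: {days_on_fleet:.0f} days")
```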
&lt;p&gt;So they folded everything. AlphaFold predicted the structures of &lt;strong&gt;200 million proteins&lt;/strong&gt; and put them in a &lt;a href="https://alphafold.ebi.ac.uk/"&gt;free database&lt;/a&gt;. The nuclear pore complex, one of the largest and most important molecular machines in the body, a donut-shaped gateway that controls nutrient flow in and out of the cell nucleus, was &lt;a href="https://www.science.org/doi/10.1126/science.abm9326"&gt;solved within months&lt;/a&gt; of AlphaFold&amp;rsquo;s release. Researchers working on neglected diseases, malaria, Chagas, leishmaniasis, diseases that affect hundreds of millions of people but attract little pharma funding, now get protein structures for free. Plant scientists working on climate-resilient crops can skip years of crystallography and go straight to the biology.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.isomorphiclabs.com/"&gt;Isomorphic Labs&lt;/a&gt;, the DeepMind spinoff, is running 18-19 drug programs across cardiovascular disease, cancer, and immunology. &lt;a href="https://philippdubach.com/posts/ai-can-now-design-drugs-in-seconds-we-still-cant-tell-you-if-they-work./"&gt;IsoDDE, its drug design engine&lt;/a&gt;, hits 50% on the hardest protein-ligand benchmarks versus 23% for AlphaFold 3. &lt;a href="https://deepmind.google/discover/blog/alphagenome-predicts-the-effects-of-dna-variation-on-gene-regulation/"&gt;AlphaGenome&lt;/a&gt; is decoding the 98% of the human genome that doesn&amp;rsquo;t code for proteins, the part where most disease-causing mutations hide. Jennifer Doudna, the CRISPR pioneer, asked Hassabis directly about combining AlphaGenome with CRISPR to identify and fix the exact genetic changes causing disease. His answer: &amp;ldquo;Still not probably good enough yet. But you can imagine a future version.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://philippdubach.com/posts/the-last-architecture-designed-by-hand/"&gt;AlphaEvolve&lt;/a&gt; found a 23% speedup inside Gemini&amp;rsquo;s own architecture, recovering 0.7% of Google&amp;rsquo;s total compute. DeepMind&amp;rsquo;s fusion work &lt;a href="https://deepmind.google/blog/bringing-ai-to-the-next-generation-of-fusion-energy/"&gt;controlled plasma autonomously&lt;/a&gt; in a real tokamak. &lt;a href="https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/"&gt;GNoME&lt;/a&gt; identified 2.2 million new crystal structures, equivalent to roughly 800 years of prior human discovery in materials science.&lt;/p&gt;
&lt;p&gt;All of this on a fraction of the compute that powers the chatbot economy. I keep coming back to this: the entire portfolio of DeepMind&amp;rsquo;s scientific work, the Nobel Prize, the drug programs, the materials, the fusion experiments, consumed less compute than a single frontier chatbot burns through in inference costs per quarter.&lt;/p&gt;
&lt;h2 id="the-case-for-the-war-machines"&gt;The case for the war machines&lt;/h2&gt;
&lt;p&gt;I want to present the counterargument honestly, because it&amp;rsquo;s not trivial.&lt;/p&gt;
&lt;p&gt;The commercial race funded a compute buildout that wouldn&amp;rsquo;t exist without chatbot demand. $690 billion in 2026 capex built data centers that can, in principle, be repurposed for scientific workloads. The talent pipeline expanded: a generation of ML engineers entered the field because consumer AI products made it exciting and lucrative. Millions of users stress-tested these models in ways internal testing never could, revealing failure modes and edge cases that improve the underlying systems. Hassabis himself acknowledges this. In the HUGE* interview he listed the benefits: &amp;ldquo;lightning speed&amp;rdquo; progress, democratized access to cutting-edge AI &amp;ldquo;perhaps only 3 to 6 months behind what is actually in the labs,&amp;rdquo; and societal normalization that prepares people for bigger changes ahead.&lt;/p&gt;
&lt;p&gt;And there&amp;rsquo;s the funding argument. Google&amp;rsquo;s $132 billion in net income funds DeepMind. Gemini&amp;rsquo;s commercial revenue helps justify the research budget. Without the chatbot economy, would Alphabet spend billions on AI research at all?&lt;/p&gt;
&lt;p&gt;The strongest version of this argument goes: you can&amp;rsquo;t have the cathedral without the wool merchants. Bell Labs needed AT&amp;amp;T&amp;rsquo;s monopoly revenue. The Apollo program needed Cold War spending. Scientific breakthroughs don&amp;rsquo;t fund themselves. The commercial race, ugly as it is, is the mechanism that makes the science possible.&lt;/p&gt;
&lt;h2 id="why-the-steelman-breaks"&gt;Why the steelman breaks&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve thought about this for a while, and I think it&amp;rsquo;s wrong.&lt;/p&gt;
&lt;p&gt;Start with the compute argument. The infrastructure being built is overwhelmingly inference infrastructure: data centers optimized for running chatbot queries at scale, not for training scientific models. AlphaFold trains on 128 TPUs. It doesn&amp;rsquo;t need a $75 billion annual capex program. The buildout serves commercial demand. Calling it a foundation for scientific AI is like calling a shopping mall a foundation for particle physics because they both use electricity.&lt;/p&gt;
&lt;p&gt;The talent argument has the same problem. The pipeline filled, but it filled with the wrong skills and pointed in the wrong direction. &lt;a href="https://hai.stanford.edu/ai-index/2025-ai-index-report/research-and-development"&gt;Stanford HAI&amp;rsquo;s 2025 AI Index&lt;/a&gt; found that &lt;strong&gt;70%&lt;/strong&gt; of AI PhDs took private sector jobs in 2023, up from roughly 20% two decades ago. &lt;a href="https://www.nature.com/articles/d41586-026-00474-3"&gt;Bruce Schneier wrote in &lt;em&gt;Nature&lt;/em&gt;&lt;/a&gt; that the exodus threatens &amp;ldquo;innovation driven by curiosity rather than profit.&amp;rdquo; The ML engineers entering the field are optimizing RLHF, fine-tuning chat models, building prompt engineering toolchains, and competing on Chatbot Arena leaderboards. These are not the skills that fold proteins or control plasma. The talent that cracks drug discovery needs computational chemistry, molecular dynamics, quantum mechanics. The talent attracted by the chatbot boom is, for the most part, not that talent.&lt;/p&gt;
&lt;p&gt;The stress-testing argument is real but narrow. Millions of users proved that language models can summarize documents and brainstorm ideas. That tells you nothing about whether they can predict which genetic mutations cause disease. The applications share a model architecture but almost nothing else.&lt;/p&gt;
&lt;p&gt;And the funding argument, the one that seems hardest to dismiss, actually argues the opposite of what its proponents think. The best historical parallel is &lt;a href="https://en.wikipedia.org/wiki/Bell_Labs"&gt;Bell Labs&lt;/a&gt;. Founded in 1925 as the research arm of AT&amp;amp;T&amp;rsquo;s regulated telephone monopoly, Bell Labs produced the &lt;a href="https://en.wikipedia.org/wiki/Transistor"&gt;transistor&lt;/a&gt;, the &lt;a href="https://en.wikipedia.org/wiki/Laser"&gt;laser&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Unix"&gt;Unix&lt;/a&gt;, the &lt;a href="https://en.wikipedia.org/wiki/C_(programming_language)"&gt;C programming language&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Information_theory"&gt;information theory&lt;/a&gt;, and the discovery of &lt;a href="https://en.wikipedia.org/wiki/Cosmic_microwave_background"&gt;cosmic microwave background radiation&lt;/a&gt;. Ten Nobel Prizes. Five Turing Awards. &lt;a href="https://www.construction-physics.com/p/what-would-it-take-to-recreate-bell"&gt;Brian Potter in &lt;em&gt;Construction Physics&lt;/em&gt;&lt;/a&gt; calls the conditions &amp;ldquo;unrepeatable&amp;rdquo;: a vertically integrated monopoly that could afford to fund research with no immediate commercial return.&lt;/p&gt;
&lt;p&gt;Then AT&amp;amp;T was broken up in 1984. Commercial competition arrived. What happened next is instructive: the research workforce &lt;a href="https://en.wikipedia.org/wiki/Bell_Labs"&gt;dropped from roughly 1,300 to 500 by 2002&lt;/a&gt;. Only one post-divestiture employee won a Nobel Prize. Bell Labs was passed from AT&amp;amp;T to Lucent to Alcatel to Nokia, each owner less interested in fundamental research than the last. By 2008, &lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC8792522/"&gt;four physicists remained&lt;/a&gt; in basic research. By 2016, what had been the most productive research institution in human history was a division of a Finnish telecom company.&lt;/p&gt;
&lt;p&gt;The irony is precise: the people who argue that commercial pressure funds great science are citing a lab that produced its greatest work under monopoly protection &lt;em&gt;from&lt;/em&gt; commercial pressure, and died the moment that protection was removed.&lt;/p&gt;
&lt;p&gt;Hassabis&amp;rsquo;s vision, the CERN model, is the Bell Labs model. Let fundamental research breathe. Shield it from quarterly earnings. Fund it with patient capital. He had that at DeepMind, funded by Google&amp;rsquo;s search advertising monopoly, insulated from product deadlines, free to spend six years building AlphaGo before it produced a single dollar of revenue. Then the commercial race consumed the insulation.&lt;/p&gt;
&lt;p&gt;The funding was already there. What he lost was the institutional focus.&lt;/p&gt;
&lt;h2 id="the-circles"&gt;The circles&lt;/h2&gt;
&lt;p&gt;Archimedes held off Rome for two years. Then the soldier came. The war machines didn&amp;rsquo;t save Syracuse. They bought time, and that time ran out.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t think the chatbot era saved AI for science. I think it ate the oxygen. The talent went to RLHF optimization. The compute went to inference farms. The institutional attention went to quarterly product launches. Hassabis is now simultaneously building the war machines and drawing the circles: running Gemini and funding Isomorphic, shipping chatbots and folding proteins. That he manages both is remarkable. But it&amp;rsquo;s a compromise, and the compromise has a cost measured in drug programs that don&amp;rsquo;t exist, diseases that aren&amp;rsquo;t being studied, materials that haven&amp;rsquo;t been found.&lt;/p&gt;
&lt;p&gt;The question is not whether chatbots are useful. They are. I use them constantly. The question is whether future historians will look at 2023-2026 and see a period when the most capable scientific tool in human history was mostly pointed at drafting emails and generating stock photos, and wonder what we were thinking. The way we look at that Roman soldier: someone who destroyed something more valuable than he could understand.&lt;/p&gt;
&lt;p&gt;In the interview, Hassabis is asked what he would want said at his funeral. His answer was immediate:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I would hope that they would say that my life was of benefit and service to humanity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The circles are still there, drawn in the sand between product launches.&lt;/p&gt;</description></item><item><title>On-Device AI Models Will Be The New Reason to Upgrade Your Phone</title><link>https://philippdubach.com/posts/on-device-ai-models-will-be-the-new-reason-to-upgrade-your-phone/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/on-device-ai-models-will-be-the-new-reason-to-upgrade-your-phone/</guid><description>&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-chip-cover-jpg-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/chip-cover.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/chip-cover.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/chip-cover.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/chip-cover.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/chip-cover.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/chip-cover.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/chip-cover.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/chip-cover.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/chip-cover.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/chip-cover.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/chip-cover.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/chip-cover.jpg"
alt="Editorial cover illustration for an analysis of on-device AI models as the new smartphone upgrade driver"
class=""
width="1200"
fetchpriority="high"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-chip-cover-jpg-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/chip-cover.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Editorial cover illustration for an analysis of on-device AI models as the new smartphone upgrade driver" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The iPhone 17 runs a &lt;a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models"&gt;3 billion parameter language model on-device&lt;/a&gt; at 30 tokens per second. Obviously, the average consumer has no idea what that sentence means, and Apple hasn&amp;rsquo;t figured out how to make them care.&lt;/p&gt;
&lt;p&gt;I believe that&amp;rsquo;s about to change. Apple now has &lt;a href="https://9to5mac.com/2026/03/25/new-details-on-apple-google-ai-deal-revealed-including-gemini-changes-report/"&gt;complete access to Google&amp;rsquo;s Gemini model&lt;/a&gt; in its own data centers, with &lt;a href="https://www.theinformation.com/newsletters/ai-agenda/apple-can-distill-googles-big-gemini-model"&gt;the ability to distill it into smaller models&lt;/a&gt; built for iPhones and iPads. Knowledge distillation works like this: you take a large model, have it perform tasks with detailed reasoning, then feed those reasoning traces to a smaller model until the student learns to mimic the teacher. The smaller model ends up far more capable than if you&amp;rsquo;d trained it from scratch on the same data. Apple can now do this with the full Gemini, not just their own in-house models, and the distilled output runs locally. No internet required.&lt;/p&gt;
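&lt;p&gt;The core of the recipe fits in a few lines. Here is a minimal sketch of classic logit distillation in plain NumPy, with illustrative logits, temperature, and learning rate; distilling Gemini into an on-device model is vastly more involved, but the objective is the same:&lt;/p&gt;

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Teacher logits for one example (illustrative numbers). A higher
# temperature softens the distribution, exposing the teacher's
# relative preferences among wrong answers, not just the argmax.
teacher_logits = np.array([4.0, 1.5, 0.5])
temperature = 2.0
soft_targets = softmax(teacher_logits, temperature)

# Train the student's logits to match the soft targets by gradient
# descent on the cross-entropy to the teacher's distribution.
student_logits = np.zeros(3)
lr = 1.0
for _ in range(2000):
    p = softmax(student_logits, temperature)
    grad = (p - soft_targets) / temperature  # d(loss)/d(logits)
    student_logits = student_logits - lr * grad
```
&lt;p&gt;After training, the softened student distribution matches the soft targets, which is the whole trick: the student absorbs the probabilities the teacher spreads across alternatives, information that hard labels throw away.&lt;/p&gt;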
&lt;p&gt;Smartphones haven&amp;rsquo;t had a real upgrade story in years. The camera is great. The screen is great. The processor was fast enough three generations ago. &lt;a href="https://www.sellcell.com/blog/how-often-do-people-upgrade-their-phone/"&gt;Battery life has overtaken price as the top purchase driver&lt;/a&gt; for the first time. The global &lt;a href="https://sqmagazine.co.uk/smartphone-statistics/"&gt;replacement cycle has stretched to 3.5 years&lt;/a&gt;. People hold onto their phones because nothing about the new one feels different enough. &lt;a href="https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2025/gen-ai-on-smartphones.html"&gt;Deloitte&amp;rsquo;s 2025 TMT Predictions report&lt;/a&gt; frames on-device generative AI as the feature that could break this cycle, if the experience delivers on the promise. On-device AI might become the next reason.&lt;/p&gt;
&lt;h2 id="the-spec"&gt;The spec&lt;/h2&gt;
&lt;p&gt;In the late 1990s it was megahertz: Intel and AMD raced clock speeds past the point where consumers could distinguish real-world performance differences, but the number on the box still drove purchases. Then it was megapixels. Samsung shipped a &lt;a href="https://semiconductor.samsung.com/news-events/tech-blog/isocell-hp3-200mp-image-sensor-for-epic-details/"&gt;200 MP camera sensor&lt;/a&gt; knowing that most phones use 16-to-1 pixel binning to output a &lt;strong&gt;12.5 MP&lt;/strong&gt; image by default.&lt;/p&gt;
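&lt;p&gt;Pixel binning itself is simple to demonstrate. In the sketch below, a toy 8x8 grid stands in for the 200 MP mosaic; each 4x4 block of photosites averages down to one output pixel:&lt;/p&gt;

```python
import numpy as np

def bin_pixels(sensor, factor=4):
    """Average each factor x factor block of photosites into one pixel."""
    h, w = sensor.shape
    blocks = sensor.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# Tiny 8x8 stand-in for the raw sensor mosaic
sensor = np.arange(64, dtype=float).reshape(8, 8)
binned = bin_pixels(sensor)  # shape (2, 2): 16 photosites per output pixel

# The headline arithmetic: 200 MP with 16-to-1 binning yields 12.5 MP
print(200e6 / 16)
```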
&lt;p&gt;Parameters could be next. The &lt;a href="https://www.apple.com/iphone-17/specs/"&gt;iPhone 17&amp;rsquo;s standard A19 chip&lt;/a&gt; has 8GB of RAM. The &lt;a href="https://www.apple.com/iphone-17-pro/specs/"&gt;Pro gets 12GB&lt;/a&gt; with faster memory bandwidth, which determines how large a model the phone can run and how quickly. Samsung&amp;rsquo;s 2026 flagships with the &lt;a href="https://semiconductor.samsung.com/processor/mobile-processor/exynos-2600/"&gt;Exynos 2600 hit &lt;strong&gt;80 TOPS&lt;/strong&gt;&lt;/a&gt; on a 2nm process, more than double the prior generation. These are already the numbers in press releases. It&amp;rsquo;s not hard to imagine an Apple keynote where someone says, with rehearsed enthusiasm, that the iPhone 18 Pro runs a 7 billion parameter model while the standard model is limited to 3 billion.&lt;/p&gt;
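&lt;p&gt;Why memory bandwidth is the binding constraint fits on the back of an envelope: autoregressive decoding streams the full weight set through memory for every generated token, so bandwidth roughly caps tokens per second. The numbers below are illustrative assumptions, not published specs:&lt;/p&gt;

```python
# Rough decode-speed model: generation is memory-bandwidth-bound, so
# tokens/sec is about bandwidth divided by the bytes of weights read
# per token. All inputs are illustrative assumptions.
def decode_tokens_per_sec(params_billions, bits_per_weight, bandwidth_gb_s):
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes

# A 3B model quantized to 4 bits on an assumed 60 GB/s of effective
# bandwidth lands in the same ballpark as the quoted 30 tokens/sec.
print(decode_tokens_per_sec(3, 4, 60))
```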
&lt;p&gt;The difference from previous spec wars is that this one might actually correlate with user experience. Megahertz past a certain threshold didn&amp;rsquo;t make Word open faster. Megapixels past 12 MP didn&amp;rsquo;t make photos look better on a phone screen. But a 7 billion parameter model running locally outperforms a 3 billion one on nearly every task. It handles longer documents, follows more complex instructions, holds better conversational context.&lt;/p&gt;
&lt;h2 id="breaking-the-stalemate"&gt;Breaking the stalemate&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-09-09-gartner-says-worldwide-generative-artificial-intelligence-smartphone-end-user-spending-to-total-us-dollars-298-billion-by-the-end-of-2025"&gt;Gartner projects&lt;/a&gt; GenAI smartphone spending will reach &lt;strong&gt;$393 billion&lt;/strong&gt; in 2026, up 32% from &lt;strong&gt;$298 billion&lt;/strong&gt; in 2025. &lt;a href="https://my.idc.com/getdoc.jsp?containerId=prUS52478124"&gt;IDC reports&lt;/a&gt; GenAI smartphone shipments growing &lt;strong&gt;73%&lt;/strong&gt; year over year. &lt;a href="https://finance.yahoo.com/news/exclusive-samsung-double-mobile-devices-030312758.html"&gt;Samsung has publicly committed&lt;/a&gt; to 800 million AI-enabled devices by end of 2026, doubling its 2025 footprint. &lt;a href="https://www.cnbc.com/2024/12/13/apple-is-a-top-pick-for-2025-as-ai-will-drive-iphone-upgrade-cycle-morgan-stanley-says.html"&gt;Morgan Stanley&amp;rsquo;s latest survey&lt;/a&gt; found iPhone upgrade intentions at &lt;strong&gt;37%&lt;/strong&gt;, an all-time high, with FY26 shipment forecasts of 260 million units sitting 3% above Street consensus.&lt;/p&gt;
&lt;p&gt;On-device AI creates hard hardware requirements in a way that camera improvements and screen upgrades never did. You cannot run a 3 billion parameter model on an iPhone 14. The Neural Engine isn&amp;rsquo;t powerful enough and the memory bandwidth isn&amp;rsquo;t there. &lt;a href="https://support.apple.com/en-us/121115"&gt;Apple Intelligence requires an A17 Pro or later&lt;/a&gt;, which means the feature itself creates an upgrade floor. Every year that floor rises. When Apple ships distilled Gemini models that need the A19 Pro&amp;rsquo;s 12GB of RAM, every phone older than 2025 is locked out.&lt;/p&gt;
&lt;p&gt;The Gemini deal matters for the hardware cycle because of the distillation pipeline. Apple doesn&amp;rsquo;t need to build frontier-scale models from scratch. They can take Gemini&amp;rsquo;s best capabilities, run them through distillation, and compress the results into models sized for their hardware tiers. A 3 billion parameter model for the standard iPhone. A 5 billion version for the Pro. Maybe a 10 billion model for a future iPad Pro with enough memory and thermal headroom.&lt;/p&gt;
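&lt;p&gt;For readers who want the mechanics: a minimal sketch of the temperature-softened distillation objective. This is the textbook Hinton-style loss, not the actual Apple or Google recipe:&lt;/p&gt;

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    # Temperature-softened KL(teacher || student), the classic objective a
    # distillation pipeline minimizes so the small model mimics the big one.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * temperature ** 2

teacher = [2.0, 1.0, 0.1]   # big model is confident about token 0
student = [1.5, 1.2, 0.3]   # small model roughly agrees
print(distill_loss(teacher, teacher))  # 0.0 when distributions match
print(distill_loss(teacher, student) > 0)  # True
```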
&lt;p&gt;Google is playing a similar game from the other side. The original &lt;a href="https://en.wikipedia.org/wiki/Gemini_(language_model)"&gt;Gemini Nano shipped at 1.8 billion parameters&lt;/a&gt;; the updated Nano-2 rose to 3.25 billion. Samsung&amp;rsquo;s &lt;a href="https://news.samsung.com/global/samsung-unveils-galaxy-s26-series-the-most-intuitive-galaxy-ai-phone-yet"&gt;Galaxy S26 ships with on-device Gemini&lt;/a&gt; running on NPUs that are 39% faster than the prior generation. On-device models get larger every hardware generation. Each generation&amp;rsquo;s models don&amp;rsquo;t run well on older hardware. You see where this goes.&lt;/p&gt;
&lt;p&gt;I find it plausible that within two product cycles, on-device model capability becomes the primary differentiator between phone tiers and between generations. The data isn&amp;rsquo;t there yet: &lt;a href="https://www.twice.com/research/the-smartphone-upgrade-cycle-slows"&gt;only 17% of Americans&lt;/a&gt; say AI is a major purchase influence today, Apple Intelligence &lt;a href="https://finance.yahoo.com/markets/stocks/articles/morgan-stanley-stark-message-investors-164700952.html"&gt;ranked seventh globally&lt;/a&gt; as a reason to upgrade in Morgan Stanley&amp;rsquo;s survey, and &lt;a href="https://www.phonearena.com/news/is-the-ai-boom-destroying-your-next-flagship-phones-value_id176913"&gt;over 40% of users&lt;/a&gt; have privacy concerns about smartphone AI, with half unwilling to pay extra for it. But you can&amp;rsquo;t tell the difference between a 48 MP photo and a 12 MP photo on your phone screen. You can absolutely tell the difference between an AI assistant that understands your question and one that doesn&amp;rsquo;t. The feedback loop is immediate and personal. If the bigger model actually works better, and if the distillation pipeline from Gemini delivers real capability gains, the upgrade incentive is self-reinforcing. People will upgrade not because the spec sheet says they should, but because they tried their friend&amp;rsquo;s phone and the AI was better.&lt;/p&gt;
&lt;p&gt;Whether this arrives with iOS 27 this fall or takes another generation to mature, I don&amp;rsquo;t know. But the next reason to buy a new phone will much more likely be the model than the camera.&lt;/p&gt;</description></item><item><title>AI Can Now Design Drugs in Seconds; We Still Can't Tell You If They Work.</title><link>https://philippdubach.com/posts/ai-can-now-design-drugs-in-seconds-we-still-cant-tell-you-if-they-work./</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/ai-can-now-design-drugs-in-seconds-we-still-cant-tell-you-if-they-work./</guid><description>&lt;blockquote&gt;
&lt;p&gt;No AI-discovered drug has ever received FDA approval. That sentence should sit uncomfortably next to every headline about Alphabet&amp;rsquo;s drug discovery spinoff.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On February 10, &lt;a href="https://www.isomorphiclabs.com/articles/the-isomorphic-labs-drug-design-engine-unlocks-a-new-frontier"&gt;Isomorphic Labs&lt;/a&gt;, the Google DeepMind spinoff focused on computational drug design, released IsoDDE: its Drug Design Engine. This isn&amp;rsquo;t a single model or an AlphaFold upgrade. IsoDDE is a unified in silico drug discovery system that runs protein structure prediction, ligand binding, affinity estimation, and pocket identification in concert, generating in seconds what used to take days of physics-based simulation. On the hardest molecular prediction benchmark, &amp;ldquo;Runs N&amp;rsquo; Poses,&amp;rdquo; designed to test generalization to unfamiliar proteins, IsoDDE hits a &lt;strong&gt;50%&lt;/strong&gt; success rate. AlphaFold 3 manages roughly &lt;strong&gt;23%&lt;/strong&gt;. On antibody-antigen modeling, IsoDDE beats AlphaFold 3 by 2.3× and the open-source Boltz-2 by 19.8×. On binding affinity prediction, it achieves a Pearson correlation of 0.85, beating the physics-based gold standard FEP+ at 0.78. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-isodde-benchmark-performance-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/isodde-benchmark-performance.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/isodde-benchmark-performance.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/isodde-benchmark-performance.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/isodde-benchmark-performance.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/isodde-benchmark-performance.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/isodde-benchmark-performance.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/isodde-benchmark-performance.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/isodde-benchmark-performance.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/isodde-benchmark-performance.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/isodde-benchmark-performance.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/isodde-benchmark-performance.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/isodde-benchmark-performance.png"
alt="IsoDDE benchmark performance: 50% protein-ligand prediction vs AlphaFold 3 at 23%, 2.3x antibody-antigen improvement, 0.85 binding affinity correlation vs FEP&amp;#43; at 0.78"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-isodde-benchmark-performance-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/isodde-benchmark-performance.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="IsoDDE benchmark performance: 50% protein-ligand prediction vs AlphaFold 3 at 23%, 2.3x antibody-antigen improvement, 0.85 binding affinity correlation vs FEP&amp;#43; at 0.78" decoding="async"&gt;
&lt;/dialog&gt;
I would assume these improvements are large enough that the computational bottleneck in drug design may no longer be the binding question.&lt;/p&gt;
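&lt;p&gt;For context on the 0.85 number: binding affinity prediction is scored by the Pearson correlation between predicted and experimentally measured affinities. A sketch on synthetic data, purely illustrative and not IsoDDE output:&lt;/p&gt;

```python
import numpy as np

def pearson_r(predicted, measured):
    # Pearson correlation, the metric behind the 0.85-vs-0.78 comparison.
    p = np.asarray(predicted, float) - np.mean(predicted)
    m = np.asarray(measured, float) - np.mean(measured)
    return float(np.sum(p * m) / np.sqrt(np.sum(p * p) * np.sum(m * m)))

# Synthetic illustration only: measured affinities plus model error.
rng = np.random.default_rng(0)
measured = rng.normal(-8.0, 1.5, size=200)        # e.g. binding free energies
predicted = measured + rng.normal(0.0, 0.9, 200)  # predictions with noise
print(round(pearson_r(predicted, measured), 2))
```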
&lt;h2 id="what-pharma-believes"&gt;What pharma believes&lt;/h2&gt;
&lt;p&gt;Isomorphic has signed partnerships with &lt;a href="https://www.prnewswire.com/news-releases/isomorphic-labs-announces-strategic-multi-target-research-collaboration-with-lilly-302027392.html"&gt;Eli Lilly&lt;/a&gt;, &lt;a href="https://www.prnewswire.com/news-releases/isomorphic-labs-announces-strategic-multi-target-research-collaboration-with-novartis-302027387.html"&gt;Novartis&lt;/a&gt;, and &lt;a href="https://pharmaphorum.com/news/jj-bets-isomorphic-ai-powered-drug-hunt"&gt;Johnson &amp;amp; Johnson&lt;/a&gt; worth a combined &lt;strong&gt;$4 billion+&lt;/strong&gt; in potential value. But look at the structure. Lilly paid $45 million upfront against $1.7 billion in milestones. Novartis paid $37.5 million upfront against $1.2 billion. That&amp;rsquo;s $82.5 million of cash against $4 billion-plus of headline value, nearly a 50:1 ratio between what pharma promises in biobucks and what it actually wires. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-isomorphic-deal-structure-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/isomorphic-deal-structure.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/isomorphic-deal-structure.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/isomorphic-deal-structure.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/isomorphic-deal-structure.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/isomorphic-deal-structure.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/isomorphic-deal-structure.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/isomorphic-deal-structure.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/isomorphic-deal-structure.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/isomorphic-deal-structure.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/isomorphic-deal-structure.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/isomorphic-deal-structure.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/isomorphic-deal-structure.png"
alt="Isomorphic Labs pharma deal structure: Eli Lilly $45M upfront vs $1.7B milestones, Novartis $37.5M vs $1.2B&amp;#43;, totaling $4B headline value against $82.5M cash"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-isomorphic-deal-structure-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/isomorphic-deal-structure.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Isomorphic Labs pharma deal structure: Eli Lilly $45M upfront vs $1.7B milestones, Novartis $37.5M vs $1.2B&amp;#43;, totaling $4B headline value against $82.5M cash" decoding="async"&gt;
&lt;/dialog&gt;
This ratio is standard across AI drug discovery deals in 2025. Pharma is enthusiastic enough to sign but cautious enough to make nearly all the economics contingent on clinical results that don&amp;rsquo;t exist yet. The upfront payments fund research. The milestone payments are structured so that pharma loses almost nothing if the drugs fail. The royalties only matter if a drug reaches blockbuster status, which for an AI-designed molecule has never happened.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.isomorphiclabs.com/articles/isomorphic-labs-announces-novartis-collaboration-expansion"&gt;Novartis expanded its partnership in February 2025&lt;/a&gt;, doubling the number of programs to six, targeting what Novartis described as &amp;ldquo;particularly challenging&amp;rdquo; and previously undruggable targets, on the same financial terms. That&amp;rsquo;s a positive signal: it means internal results impressed Novartis scientists enough to commit more targets. The J&amp;amp;J deal, announced January 2026, goes further, covering small molecules, antibodies, peptides, and molecular glues. But &amp;ldquo;expanded partnerships&amp;rdquo; and &amp;ldquo;approved drugs&amp;rdquo; remain separated by the most unforgiving filter in business: human biology.&lt;/p&gt;
&lt;h2 id="phase-ii-wall"&gt;Phase II wall&lt;/h2&gt;
&lt;p&gt;Most commentary on AI drug discovery stops too early. &lt;a href="https://www.sciencedirect.com/science/article/pii/S135964462400134X"&gt;Jayatunga et al. (2024)&lt;/a&gt;, in the first systematic analysis of AI-discovered drugs in clinical trials, showed AI-discovered molecules achieving &lt;strong&gt;80-90%&lt;/strong&gt; success rates in Phase I trials, well above the historical 40-65% average. AI is good at designing molecules that are safe and have decent pharmacokinetic properties: they get absorbed, distributed, metabolized, and excreted the way you&amp;rsquo;d want. Phase I is mostly about safety. AI passes it.&lt;/p&gt;
&lt;p&gt;But Phase II is about efficacy. Does the drug actually treat the disease? And here the numbers are sobering: AI-discovered drugs show roughly 40% Phase II success rates, which is &lt;a href="https://www.science.org/content/blog-post/ai-drugs-so-far"&gt;about the same as traditionally discovered drugs&lt;/a&gt;. AI has not yet demonstrated it can predict whether a molecule will work in a patient, only that it can predict whether a molecule will be tolerable in a patient. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai-drug-phase2-wall-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai-drug-phase2-wall.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai-drug-phase2-wall.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai-drug-phase2-wall.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai-drug-phase2-wall.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-drug-phase2-wall.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai-drug-phase2-wall.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai-drug-phase2-wall.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai-drug-phase2-wall.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-drug-phase2-wall.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai-drug-phase2-wall.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai-drug-phase2-wall.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-drug-phase2-wall.png"
alt="AI drug clinical trial success rates: 80-90% Phase I vs 40-65% traditional, but roughly 40% Phase II for both AI and traditional, projecting 9-18% end-to-end vs historical 5-10%"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai-drug-phase2-wall-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai-drug-phase2-wall.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="AI drug clinical trial success rates: 80-90% Phase I vs 40-65% traditional, but roughly 40% Phase II for both AI and traditional, projecting 9-18% end-to-end vs historical 5-10%" decoding="async"&gt;
&lt;/dialog&gt;
If both trends hold, end-to-end success rates could rise from the historical 5-10% to something like 9-18%. That would roughly double R&amp;amp;D productivity, which in a trillion-dollar industry is worth an enormous amount. &lt;a href="https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier"&gt;McKinsey estimates&lt;/a&gt; generative AI could generate $60-110 billion annually in economic value for pharma and medical products. But it&amp;rsquo;s a far cry from the narrative that generative AI will &amp;ldquo;solve&amp;rdquo; drug discovery. It would make drug development somewhat cheaper and faster. An improvement, not a revolution.&lt;/p&gt;
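&lt;p&gt;One hedged way to reconstruct the 9-18% band: hold every stage after Phase I at its historical rate and scale only the Phase I improvement. The bracketing below is my own arithmetic, not a published model:&lt;/p&gt;

```python
# Hypothetical reconstruction of the 9-18% band: hold every stage after
# Phase I at its historical rate and scale only the Phase I success rate.
hist_end_to_end = (0.05, 0.10)   # historical overall approval odds
trad_phase1 = (0.40, 0.65)       # traditional Phase I success range
ai_phase1 = (0.80, 0.90)         # AI-discovered Phase I success range

uplift_low = ai_phase1[0] / trad_phase1[1]    # conservative: 0.80 / 0.65 ~ 1.23
uplift_high = ai_phase1[1] / trad_phase1[0]   # optimistic:   0.90 / 0.40 = 2.25

print(round(hist_end_to_end[0] * uplift_low, 3))   # 0.062
print(round(hist_end_to_end[1] * uplift_high, 3))  # 0.225
```

&lt;p&gt;The quoted 9-18% sits inside this bracket, which is exactly what you get if Phase II stays flat and only the safety screen improves.&lt;/p&gt;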
&lt;p&gt;The counterargument, and it&amp;rsquo;s a reasonable one, is that IsoDDE represents a qualitative leap that could crack the efficacy problem. Its ability to model induced fits, where proteins reshape to accommodate a drug, and to identify cryptic binding pockets, like the cereblon site that took experimentalists 15 years to find, means it&amp;rsquo;s capturing biological dynamics that earlier AI systems missed entirely. If better structural understanding translates to better efficacy prediction, the Phase II wall might eventually come down.&lt;/p&gt;
&lt;p&gt;I find this plausible but unproven. We&amp;rsquo;ll know more when Isomorphic&amp;rsquo;s first candidates enter trials, targeted for late 2026.&lt;/p&gt;
&lt;h2 id="where-isomorphic-fits-in-the-competitive-stack"&gt;Where Isomorphic fits in the competitive stack&lt;/h2&gt;
&lt;p&gt;Isomorphic&amp;rsquo;s competitive position is unusual. It leads on computational benchmarks but trails on clinical progress. &lt;a href="https://insilico.com/"&gt;Insilico Medicine&lt;/a&gt; has the most advanced clinical portfolio: its IPF drug ISM001-055 (now called rentosertib) reached Phase IIa with &lt;a href="https://www.nature.com/articles/s41591-025-03743-2"&gt;positive results published in &lt;em&gt;Nature Medicine&lt;/em&gt; in June 2025&lt;/a&gt;, and Insilico has 10+ IND approvals across 31 programs. &lt;a href="https://ir.recursion.com/news-releases/news-release-details/recursion-and-exscientia-two-leaders-ai-drug-discovery-space"&gt;Recursion Pharmaceuticals&lt;/a&gt;, which &lt;a href="https://pharmaphorum.com/news/ai-biotechs-exscientia-and-recursion-agree-688m-merger"&gt;absorbed Exscientia in a $688 million merger&lt;/a&gt;, takes a different approach entirely, running millions of phenomics experiments weekly on 65 petabytes of biological imaging data. Both companies own wet-lab infrastructure that Isomorphic lacks.&lt;/p&gt;
&lt;p&gt;What Isomorphic has: the AlphaFold lineage, Alphabet-scale compute, and a unified architecture where each prediction task informs the others. On talent, the company appears to be doing well: 4.7/5 on Glassdoor, 100% CEO approval. In June 2025 they hired Dr. Ben Wolf, formerly of Relay Therapeutics with FDA approval experience for Ayvakit and Gavreto, as CMO. They opened a Cambridge, Massachusetts office. These are the moves of a company staffing up for clinical reality, not just publishing papers.&lt;/p&gt;
&lt;p&gt;The open-source threat is real but manageable in the near term. &lt;a href="https://techcrunch.com/2026/01/16/from-openais-offices-to-a-deal-with-eli-lilly-how-chai-discovery-became-one-of-the-flashiest-names-in-ai-drug-development/"&gt;Chai Discovery&lt;/a&gt; (backed by OpenAI at a &lt;a href="https://techcrunch.com/2025/12/15/openai-backed-biotech-firm-chai-discovery-raises-130m-series-b-at-1-3b-valuation/"&gt;$1.3 billion valuation&lt;/a&gt;, now partnered with Lilly on biologics) and &lt;a href="https://www.genengnews.com/topics/artificial-intelligence/pharma-bets-big-on-ai-platforms-with-flurry-of-new-year-deals/"&gt;Boltz&lt;/a&gt; (partnered with Pfizer) are both making progress. But the gap between IsoDDE&amp;rsquo;s numbers and the best open-source alternatives is wide enough that Isomorphic has time, maybe 18-24 months, to convert its computational lead into clinical evidence before the field catches up.&lt;/p&gt;
&lt;h2 id="alphabets-asymmetric-position"&gt;Alphabet&amp;rsquo;s asymmetric position&lt;/h2&gt;
&lt;p&gt;For Alphabet, Isomorphic is a rounding error that could become a franchise. The Other Bets segment posted a $3.6 billion operating loss in 2025. Alphabet&amp;rsquo;s net income was $132 billion. The &lt;a href="https://www.isomorphiclabs.com/articles/isomorphic-labs-announces-600m-external-investment-round"&gt;$600 million funding round&lt;/a&gt; led by Thrive Capital in March 2025 suggests the company understands the urgency of getting to the clinic, but Alphabet can sustain this bet indefinitely while the underlying science matures, and that patience is itself a competitive advantage most biotech startups don&amp;rsquo;t have. But does better computation translate to better medicine? IsoDDE&amp;rsquo;s benchmarks are the best evidence so far that AI can model molecular interactions at this resolution. But Demis Hassabis &lt;a href="https://www.isomorphiclabs.com/our-tech"&gt;said it himself&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We know we&amp;rsquo;re never going to solve drug design with AlphaFold alone. We&amp;rsquo;ll need half a dozen more breakthroughs of that magnitude.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;IsoDDE might be one of those breakthroughs. The clinical data, when it arrives, will tell us whether it&amp;rsquo;s the kind that matters.&lt;/p&gt;</description></item><item><title>The Last Architecture Designed by Hand</title><link>https://philippdubach.com/posts/the-last-architecture-designed-by-hand/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/the-last-architecture-designed-by-hand/</guid><description>&lt;blockquote&gt;
&lt;p&gt;I bet there is another new architecture to find that is gonna be as big of a gain as transformers were over LSTMs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sam Altman, the CEO of the company most invested in the transformer, is telling a room of students it isn&amp;rsquo;t the final form. So what comes after the transformer? He&amp;rsquo;s probably right that something will, and the evidence is no longer anecdotal. Several recent papers have proved that the transformer&amp;rsquo;s worst properties are structural: not engineering problems to be fixed with better data or more compute, but mathematical lower bounds.&lt;/p&gt;
&lt;p&gt;The transformer, born from the 2017 paper &lt;a href="https://arxiv.org/abs/1706.03762"&gt;&amp;ldquo;Attention Is All You Need,&amp;rdquo;&lt;/a&gt; took us from barely-coherent GPT-2 to GPT-4 in five years. An extraordinary run. But &lt;a href="https://arxiv.org/abs/2209.04881"&gt;Duman Keles et al.&lt;/a&gt; proved that O(n²) attention complexity isn&amp;rsquo;t an implementation detail. It&amp;rsquo;s a necessary lower bound unless a foundational conjecture in complexity theory turns out to be wrong. Double the context, quadruple the cost. The KV cache for a 70B model at one-million-token context eats roughly &lt;strong&gt;320 GB&lt;/strong&gt; of GPU memory. Most hardware can&amp;rsquo;t hold it.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-last-architecture-quadratic-attention-1-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/last-architecture-quadratic-attention-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/last-architecture-quadratic-attention-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/last-architecture-quadratic-attention-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/last-architecture-quadratic-attention-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-quadratic-attention-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/last-architecture-quadratic-attention-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/last-architecture-quadratic-attention-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/last-architecture-quadratic-attention-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-quadratic-attention-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/last-architecture-quadratic-attention-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/last-architecture-quadratic-attention-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-quadratic-attention-1.png"
alt="Quadratic attention scaling: a 4x4 attention matrix requires 16 computations while an 8x8 matrix requires 64, showing how doubling context quadruples cost in transformer architectures"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-last-architecture-quadratic-attention-1-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/last-architecture-quadratic-attention-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Quadratic attention scaling: a 4x4 attention matrix requires 16 computations while an 8x8 matrix requires 64, showing how doubling context quadruples cost in transformer architectures" decoding="async"&gt;
&lt;/dialog&gt;
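&lt;p&gt;The 320 GB figure falls out of the standard KV cache formula. The dimensions below are assumed Llama-2-70B-style values (80 layers, 8 grouped KV heads of dimension 128), not a disclosure about any specific model:&lt;/p&gt;

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # One K and one V vector per layer, per KV head, per token; fp16 = 2 bytes.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama-2-70B-style dimensions: 80 layers, 8 grouped KV heads, dim 128.
gb = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=1_000_000) / 1e9
print(round(gb, 1))  # 327.7 GB of fp16 cache for a 1M-token context
```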
&lt;p&gt;The problems run deeper than compute costs. &lt;a href="https://arxiv.org/abs/2311.14648"&gt;Kalai and Vempala&lt;/a&gt; proved that any calibrated language model &lt;em&gt;must&lt;/em&gt; hallucinate at a certain rate. A &lt;a href="https://arxiv.org/abs/2509.04664"&gt;2025 follow-up&lt;/a&gt; goes further: no computable LLM can be universally correct on unbounded queries. Not fixable with better training data. Not fixable with RLHF. A statistical property of how these models generate text.&lt;/p&gt;
&lt;p&gt;On reasoning: &lt;a href="https://arxiv.org/abs/2305.18654"&gt;Dziri et al.&lt;/a&gt; showed transformers collapse multi-step reasoning into pattern matching. Performance drops exponentially as task complexity rises. GPT-4 gets &lt;strong&gt;59%&lt;/strong&gt; on 3-digit multiplication. &lt;a href="https://arxiv.org/abs/2603.10123"&gt;Chowdhury&lt;/a&gt; proved that the &amp;ldquo;lost in the middle&amp;rdquo; problem, where models perform 20-30% worse on information buried mid-context, is a geometric property of the architecture itself, present at initialization, before any training occurs.&lt;/p&gt;
&lt;p&gt;These are theorems. The architecture that runs every frontier AI system has a ceiling, and the ceiling is proved.&lt;/p&gt;
&lt;h2 id="the-post-transformer-stack-is-already-in-production"&gt;The post-transformer stack is already in production&lt;/h2&gt;
&lt;p&gt;A &lt;a href="https://arxiv.org/abs/2510.05364"&gt;survey by Fichtl et al.&lt;/a&gt; checked the top 10 models on every major benchmark. Zero were non-transformer. The transformer is still winning on the leaderboards. But the field is moving toward hybrid architectures. Over &lt;strong&gt;60%&lt;/strong&gt; of frontier models released in 2025 already use Mixture of Experts. &lt;a href="https://arxiv.org/abs/2412.19437"&gt;DeepSeek-V3&lt;/a&gt; has 671B total parameters but activates only 37B per token. It trained for &lt;strong&gt;2.788 million H800 GPU hours&lt;/strong&gt;, a fraction of what a comparable dense model would require, and matched frontier closed-source performance. By late 2025, &lt;a href="https://c3.unu.edu/blog/inside-deepseeks-end-of-year-ai-breakthrough-what-the-new-models-deliver"&gt;DeepSeek-V3.2 reportedly hit GPT-5-level performance at 90% lower training cost&lt;/a&gt;. MoE doesn&amp;rsquo;t replace the transformer. It changes the economics so radically that it&amp;rsquo;s arguably the single biggest practical advance since the original architecture.&lt;/p&gt;
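&lt;p&gt;The 37B-of-671B number comes from top-k expert routing. A minimal illustrative sketch with toy dimensions, not anything resembling DeepSeek internals:&lt;/p&gt;

```python
import numpy as np

def route_top_k(token, experts, router, k=2):
    # Score the token against every expert, keep only the top-k, and mix their
    # outputs by renormalized gate weights: the other experts never execute.
    logits = router @ token
    top = np.argpartition(logits, -k)[-k:]
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()
    out = sum(g * (experts[i] @ token) for g, i in zip(gates, top))
    return out, top

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = rng.normal(size=(n_experts, d, d))  # 8 expert weight matrices
router = rng.normal(size=(n_experts, d))      # learned routing matrix
out, chosen = route_top_k(rng.normal(size=d), experts, router, k=2)
print(len(chosen), "of", n_experts, "experts ran")  # 2 of 8 experts ran
```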
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-last-architecture-moe-routing-1-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/last-architecture-moe-routing-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/last-architecture-moe-routing-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/last-architecture-moe-routing-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/last-architecture-moe-routing-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-moe-routing-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/last-architecture-moe-routing-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/last-architecture-moe-routing-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/last-architecture-moe-routing-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-moe-routing-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/last-architecture-moe-routing-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/last-architecture-moe-routing-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-moe-routing-1.png"
alt="Mixture of Experts routing: an input token passes through a router that activates only 2 of 8 expert blocks, meaning DeepSeek-V3 uses just 37B of its 671B total parameters per token"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-last-architecture-moe-routing-1-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/last-architecture-moe-routing-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Mixture of Experts routing: an input token passes through a router that activates only 2 of 8 expert blocks, meaning DeepSeek-V3 uses just 37B of its 671B total parameters per token" decoding="async"&gt;
&lt;/dialog&gt;
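&lt;p&gt;The routing idea is simple enough to sketch. A minimal, illustrative top-k router, assuming the 2-of-8 setup shown in the figure; the gate scores here are made up, and this is not DeepSeek&amp;rsquo;s actual router implementation:&lt;/p&gt;

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights.
    Only those k experts' feed-forward blocks actually run."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# 8 experts, 2 active per token: the other 6 never execute, so
# per-token compute scales with k, not with total expert count.
active = route([0.1, 2.0, -1.0, 0.5, 1.8, 0.0, -0.5, 0.3], k=2)
```

&lt;p&gt;That last point is the whole economic argument: parameter count (capacity) and per-token FLOPs (cost) are decoupled.&lt;/p&gt;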
&lt;p&gt;The more interesting part is what happens when you blend attention with state space models. &lt;a href="https://goombalab.github.io/blog/2024/mamba2-part1-model/"&gt;Gu and Dao (2024)&lt;/a&gt; proved SSMs and attention are mathematically dual: two views of the same computation. That theoretical result is showing up in production. &lt;a href="https://www.ai21.com/jamba/"&gt;AI21&amp;rsquo;s Jamba&lt;/a&gt; runs a 1:7 attention-to-Mamba ratio and gets &lt;strong&gt;256K&lt;/strong&gt; context at &lt;strong&gt;3x&lt;/strong&gt; throughput over Mixtral. Alibaba&amp;rsquo;s Qwen3-Next shipped the first top-tier model with a hybrid backbone: &lt;a href="https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/08_deltanet/README.md"&gt;Gated DeltaNet&lt;/a&gt; for linear attention at a 3:1 ratio with full attention. Microsoft&amp;rsquo;s Phi-4-mini-flash-reasoning is 75% Mamba layers with &lt;strong&gt;10x&lt;/strong&gt; throughput at &lt;strong&gt;2-3x&lt;/strong&gt; lower latency.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-last-architecture-hybrid-layer-stack-1-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png"
alt="Hybrid layer stack comparison: a traditional transformer uses 8 attention layers while Jamba uses a 1:7 attention-to-Mamba ratio, achieving 256K context at 3x throughput with the same quality"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-last-architecture-hybrid-layer-stack-1-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/last-architecture-hybrid-layer-stack-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hybrid layer stack comparison: a traditional transformer uses 8 attention layers while Jamba uses a 1:7 attention-to-Mamba ratio, achieving 256K context at 3x throughput with the same quality" decoding="async"&gt;
&lt;/dialog&gt;
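&lt;p&gt;The interleaving itself is trivial to express. A sketch of a Jamba-style layer schedule, assuming one attention layer per block of eight (the published 1:7 attention-to-Mamba ratio); the 32-layer depth is illustrative, not Jamba&amp;rsquo;s actual block layout:&lt;/p&gt;

```python
def hybrid_schedule(n_layers, attn_every=8):
    """Jamba-style interleaving: one attention layer in every block of
    `attn_every` layers, the rest state-space (Mamba) layers.
    attn_every=8 gives a 1:7 attention-to-Mamba ratio."""
    return ["attention" if i % attn_every == attn_every - 1 else "ssm"
            for i in range(n_layers)]

# Illustrative 32-layer stack: 4 attention layers carry precise recall,
# 28 SSM layers keep per-token cost linear in sequence length.
stack = hybrid_schedule(32, attn_every=8)
```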
&lt;p&gt;Diffusion language models are the wild card. &lt;a href="https://arxiv.org/abs/2502.09992"&gt;LLaDA&lt;/a&gt;, the first 8B-parameter diffusion LLM, treats text generation as denoising rather than sequential token prediction. It matches Llama3-8B and does something no autoregressive model can: it solves the &amp;ldquo;reversal curse,&amp;rdquo; outperforming GPT-4o on reversal tasks. &lt;a href="https://medium.com/@ML-today/diffusion-models-for-language-from-early-promise-to-a-bold-new-frontier-with-llada-and-the-rise-of-ee80c7ffb8fa"&gt;Gemini Diffusion&lt;/a&gt; hit &lt;strong&gt;1,479 tokens per second&lt;/strong&gt;. Over 50 papers on diffusion LLMs appeared in 2025. If parallel generation works reliably at scale, inference economics change completely.&lt;/p&gt;
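&lt;p&gt;A toy caricature of the decoding difference: instead of emitting one token at a time left to right, a masked-diffusion decoder starts fully masked and commits batches of positions in parallel over a few steps. Real models like LLaDA predict all masked tokens jointly and re-mask low-confidence ones; the &amp;ldquo;model&amp;rdquo; below is just a stand-in:&lt;/p&gt;

```python
import random

MASK = "[MASK]"

def diffusion_decode(fill, length, steps=4, seed=0):
    """Start fully masked; at each step, commit predictions for a batch
    of positions in parallel rather than generating left to right."""
    rng = random.Random(seed)
    seq = [MASK] * length
    masked = list(range(length))
    for step in range(steps, 0, -1):
        # reveal roughly an equal share of the remaining masked positions
        chosen = rng.sample(masked, max(1, len(masked) // step))
        for pos in chosen:            # fills within a step are independent,
            seq[pos] = fill(seq, pos)  # hence parallelizable
            masked.remove(pos)
    return seq

# stand-in "model" that just labels the position it fills
out = diffusion_decode(lambda s, p: "tok%d" % p, length=8, steps=4)
```

&lt;p&gt;Eight tokens in four steps instead of eight sequential forward passes: that step-count reduction is where the throughput numbers come from.&lt;/p&gt;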
&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2510.05364"&gt;Alman and Yu&lt;/a&gt; proved there are tasks where every subquadratic alternative has a fundamental theoretical gap. That&amp;rsquo;s the strongest mathematical argument for why hybrids, not clean replacements, are what comes next.&lt;/p&gt;
&lt;h2 id="the-search-is-no-longer-human-speed"&gt;The search is no longer human-speed&lt;/h2&gt;
&lt;p&gt;The part of this I find most interesting is the recursion. AI systems are now running the search for their own architectural successors.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/"&gt;AlphaEvolve&lt;/a&gt; an evolutionary coding agent built on Gemini 2.0 found a way to multiply 4x4 complex matrices in 48 scalar multiplications: the first improvement on Strassen&amp;rsquo;s 56-year-old bound. Across &lt;a href="https://www.infoq.com/news/2025/05/google-alpha-evolve/"&gt;50+ open math problems&lt;/a&gt;, it matched the best known solutions 75% of the time and beat them 20% of the time. The recursive part: AlphaEvolve found a &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/alphaevolve-on-google-cloud"&gt;23% speedup on a kernel inside Gemini&amp;rsquo;s own architecture&lt;/a&gt;, cutting Gemini&amp;rsquo;s training time by 1% and recovering &lt;strong&gt;0.7%&lt;/strong&gt; of Google&amp;rsquo;s total compute. Gemini making Gemini faster.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.marktechpost.com/2026/03/08/andrej-karpathy-open-sources-autoresearch-a-630-line-python-tool-letting-ai-agents-run-autonomous-ml-experiments-on-single-gpus/"&gt;Karpathy&amp;rsquo;s AutoResearch&lt;/a&gt;, released March 7, 2026, is a 630-line Python script that lets an AI agent modify training code, run 5-minute experiments, check results, and iterate. He pointed it at his own highly-tuned &amp;ldquo;Time to GPT-2&amp;rdquo; codebase. The agent found about 20 additive improvements that transferred to larger models, cutting the metric by &lt;strong&gt;11%&lt;/strong&gt;. &lt;a href="https://officechai.com/ai/andrej-karpathys-autoresearch-project-lets-agents-run-100-ai-research-experiments-while-you-sleep/"&gt;Shopify CEO Tobi Lutke tried it overnight&lt;/a&gt;: 37 experiments, 19% validation improvement, a 0.8B model outperforming a 1.6B one. &lt;a href="https://github.com/SakanaAI/AI-Scientist-v2"&gt;Sakana AI&amp;rsquo;s AI Scientist v2&lt;/a&gt; went further and produced the first AI-authored paper accepted through standard peer review. &lt;a href="https://controlai.news/p/the-ultimate-risk-recursive-self"&gt;OpenAI said publicly in late 2025&lt;/a&gt; that it&amp;rsquo;s researching how to safely build AI systems capable of recursive self-improvement. Two years ago this was a thought experiment.&lt;/p&gt;
&lt;h2 id="what-the-hardware-decides"&gt;What the hardware decides&lt;/h2&gt;
&lt;p&gt;The transformer won not because attention was theoretically prettier than recurrence. It won because it parallelized well on GPUs. Whatever comes next has to clear the same bar.&lt;/p&gt;
&lt;p&gt;Pre-training scaling for dense transformers is flattening. &lt;a href="https://fortune.com/2025/02/25/what-happened-gpt-5-openai-orion-pivot-scaling-pre-training-llm-agi-reasoning/"&gt;OpenAI spent at least $500 million per major training run on Orion&lt;/a&gt;. The model hit GPT-4 performance after 20% of training; the remaining 80% gave diminishing returns. They downgraded it from GPT-5 to GPT-4.5. &lt;a href="https://artificialintelligencemonaco.substack.com/p/ilya-sutskever-on-superintelligence"&gt;Sutskever&lt;/a&gt; at NeurIPS 2024: &amp;ldquo;Pre-training as we know it will end. The data is not growing because we have but one internet.&amp;rdquo; His startup SSI has &lt;a href="https://www.arturmarkus.com/ilya-sutskevers-ssi-raises-1b-at-30b-valuation-with-zero-revenue-6x-jump-in-5-months-redefines-ai-investment-logic/"&gt;raised to a $32 billion valuation with about 20 employees and zero revenue&lt;/a&gt;. A bet that the next leap requires something architecturally new.&lt;/p&gt;
&lt;p&gt;But test-time compute opened a different axis entirely. OpenAI&amp;rsquo;s o3 hit &lt;strong&gt;87.5%&lt;/strong&gt; on ARC-AGI, beating most humans. DeepSeek-R1 matched o1-level reasoning at &lt;strong&gt;70%&lt;/strong&gt; lower cost. &lt;a href="https://aibusiness.com/language-models/ai-model-scaling-isn-t-over-it-s-entering-a-new-era"&gt;OpenAI&amp;rsquo;s inference spending reached $2.3 billion in 2024&lt;/a&gt;: &lt;strong&gt;15x&lt;/strong&gt; what they spent training GPT-4.5. &lt;a href="https://www.dwarkesh.com/p/dario-amodei"&gt;Dario Amodei&lt;/a&gt; at Morgan Stanley in March 2026: &amp;ldquo;We do not see hitting the wall. We don&amp;rsquo;t see a wall.&amp;rdquo; He&amp;rsquo;s talking about this axis, inference-time compute and RL from verifiable rewards, not about pre-training bigger dense models. The Densing Law now shows capability per parameter doubling every &lt;strong&gt;3.5 months&lt;/strong&gt; through better data, MoE, and distillation. Last year&amp;rsquo;s frontier, matched with a fraction of the parameters.&lt;/p&gt;
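&lt;p&gt;That doubling rate compounds faster than it sounds. Spelling out the arithmetic:&lt;/p&gt;

```python
# Compounding implied by the quoted Densing Law figure:
# capability per parameter doubling every 3.5 months.
doubling_period_months = 3.5
yearly_factor = 2 ** (12 / doubling_period_months)  # ~10.8x per year
params_needed = 1 / yearly_factor                   # ~9% of last year's parameters
```

&lt;p&gt;Roughly a 10x density gain per year: matching last year&amp;rsquo;s frontier takes on the order of a tenth of the parameters.&lt;/p&gt;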
&lt;p&gt;Inference demand is projected to &lt;a href="https://v-chandra.github.io/on-device-llms/"&gt;exceed training demand by 118x&lt;/a&gt;. Global data center power is heading toward &lt;a href="https://www.iea.org/reports/energy-and-ai/executive-summary"&gt;945 TWh by 2030&lt;/a&gt;, roughly Japan&amp;rsquo;s total electricity consumption. An architecture that scores 2x better on benchmarks but runs 3x worse at inference won&amp;rsquo;t win. What ships is whatever fits the hardware. The transformer isn&amp;rsquo;t going away. It&amp;rsquo;s becoming one component in a larger stack: attention for recall, SSMs for cheap sequence processing, MoE for capacity, maybe diffusion for parallel output. &lt;a href="https://www.ai21.com/jamba/"&gt;Jamba&lt;/a&gt;, &lt;a href="https://arxiv.org/html/2411.13676v1"&gt;Hymba&lt;/a&gt;, and Qwen3-Next already ship this way. That&amp;rsquo;s not a prediction. It&amp;rsquo;s what&amp;rsquo;s in production.&lt;/p&gt;
&lt;p&gt;How fast the stack evolves is the open question. The answer, given AlphaEvolve and AutoResearch and AI Scientist v2, is faster than any previous architectural transition. I don&amp;rsquo;t know whether the transformer remains the dominant layer for two years or five. But I&amp;rsquo;m fairly confident that whatever comes next, humans won&amp;rsquo;t have designed it alone.&lt;/p&gt;</description></item><item><title>MCP vs A2A in 2026: How the AI Protocol War Ends</title><link>https://philippdubach.com/posts/mcp-vs-a2a-in-2026-how-the-ai-protocol-war-ends/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/mcp-vs-a2a-in-2026-how-the-ai-protocol-war-ends/</guid><description>&lt;p&gt;On March 26, 2025, Sam Altman posted the following &lt;a href="https://x.com/sama/status/1904957253456941061"&gt;three sentences&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;people love MCP and we are excited to add support across our products.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MCP is Anthropic&amp;rsquo;s Model Context Protocol. OpenAI is Anthropic&amp;rsquo;s most direct competitor. Altman was endorsing a rival&amp;rsquo;s standard. That post may be the most significant event in enterprise AI infrastructure this year. When your main competitor adopts your protocol, the war is close to over. I&amp;rsquo;ve been watching this play out since &lt;a href="https://www.anthropic.com/news/model-context-protocol"&gt;Anthropic launched MCP in November 2024&lt;/a&gt;, and I want to work through what&amp;rsquo;s happening: who controls what, what &amp;ldquo;interoperability&amp;rdquo; means in practice, and whether any of this follows patterns we&amp;rsquo;ve seen before.&lt;/p&gt;
&lt;h2 id="what-is-mcp"&gt;What is MCP&lt;/h2&gt;
&lt;p&gt;MCP is a client-server protocol, licensed MIT, built on JSON-RPC 2.0. The mental model is simple: an AI agent (the host) connects through a client to MCP servers that expose tools, data sources, and context. Instead of building a bespoke integration every time Claude or GPT needs to talk to Salesforce, GitHub, or your internal database, you build one MCP server. Any compatible host can then use it.&lt;/p&gt;
&lt;p&gt;The problem it solves, which explains why it spread so fast, is that without a standard like this, integration complexity grows quadratically. Every new AI model times every new tool equals a new custom integration. MCP tries to make it linear.&lt;/p&gt;
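&lt;p&gt;Because MCP is plain JSON-RPC 2.0, the wire format is easy to show. A sketch of the two core requests; the method names (&amp;ldquo;tools/list&amp;rdquo;, &amp;ldquo;tools/call&amp;rdquo;) follow the MCP spec, but the tool name and its arguments are invented for illustration:&lt;/p&gt;

```python
import json

def jsonrpc_request(req_id, method, params):
    """MCP messages are ordinary JSON-RPC 2.0 request objects."""
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}

# Discover the server's tools, then invoke one.
list_req = jsonrpc_request(1, "tools/list", {})
call_req = jsonrpc_request(2, "tools/call", {
    "name": "query_database",          # hypothetical tool name
    "arguments": {"sql": "SELECT 1"},  # hypothetical argument schema
})
wire = json.dumps(call_req)
```

&lt;p&gt;The host never hard-codes the tool list: it asks, the server answers, and the same handshake works against any of the 10,000+ servers.&lt;/p&gt;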
&lt;p&gt;By December 2025, &lt;a href="https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation"&gt;Anthropic&amp;rsquo;s own count&lt;/a&gt; put the public MCP server ecosystem at &lt;strong&gt;10,000+&lt;/strong&gt; active servers and &lt;strong&gt;97 million&lt;/strong&gt; monthly SDK downloads across the Python and TypeScript SDKs. &lt;a href="https://github.blog/news-insights/octoverse/octoverse-a-new-developer-joins-github-every-second-as-ai-leads-typescript-to-1/"&gt;GitHub&amp;rsquo;s 2025 Octoverse report&lt;/a&gt; flagged MCP as a standout, hitting &lt;strong&gt;37,000 stars&lt;/strong&gt; in eight months. The unofficial registry mcp.so lists over 18,000 servers. Official SDKs now cover ten languages, including Python, TypeScript, Java, C#, Go, Kotlin, Rust, and Swift.&lt;/p&gt;
&lt;p&gt;The companies building MCP integrations: Microsoft, Salesforce, Cloudflare, GitHub, Stripe, Atlassian, Figma, Snowflake, Databricks, New Relic. At &lt;a href="https://blog.cloudflare.com/mcp-demo-day/"&gt;Cloudflare&amp;rsquo;s MCP Demo Day in May 2025&lt;/a&gt;, Asana, PayPal, Sentry, and Webflow all shipped remote servers in a single afternoon. Gartner predicts 75% of API gateway vendors will have MCP features by 2026.&lt;/p&gt;
&lt;p&gt;OpenAI&amp;rsquo;s adoption went beyond Altman&amp;rsquo;s post. MCP support rolled out across their Agents SDK (March 2025), &lt;a href="https://openai.com/index/new-tools-and-features-in-the-responses-api/"&gt;Responses API (May 2025)&lt;/a&gt;, &lt;a href="https://openai.com/index/introducing-gpt-realtime/"&gt;Realtime API (August 2025)&lt;/a&gt;, and &lt;a href="https://help.openai.com/en/articles/12584461-developer-mode-and-mcp-apps-in-chatgpt-beta"&gt;ChatGPT Developer Mode (September 2025)&lt;/a&gt;. The two companies later &lt;a href="http://blog.modelcontextprotocol.io/posts/2025-11-21-mcp-apps/"&gt;co-authored the MCP Apps Extension&lt;/a&gt;. You don&amp;rsquo;t see that often between direct competitors.&lt;/p&gt;
&lt;p&gt;One performance claim circulates in blog posts and marketing materials: that organizations implementing MCP report &amp;ldquo;40–60% faster agent deployment times.&amp;rdquo; I have not found a primary source for this. No survey, no case study, no named company. I&amp;rsquo;d treat it as marketing content until someone produces the underlying data.&lt;/p&gt;
&lt;h2 id="googles-a2a-fills-a-different-layer"&gt;Google&amp;rsquo;s A2A fills a different layer&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/"&gt;Google launched A2A, the Agent-to-Agent protocol, at Cloud Next on April 9, 2025&lt;/a&gt;, five months after MCP. Google didn&amp;rsquo;t position A2A as MCP replacement. They called it a complement. I think that&amp;rsquo;s honest, but it takes a minute to see why.&lt;/p&gt;
&lt;p&gt;MCP connects an agent to tools; A2A connects agents to each other. The two protocols produce different behavior.&lt;/p&gt;
&lt;p&gt;When an MCP host calls an MCP server, it knows exactly what it&amp;rsquo;s getting: structured tool descriptions, specific function signatures, predictable outputs. The agent can see inside the tool. A2A works differently. Agents remain opaque to each other. An A2A agent publishes an &amp;ldquo;Agent Card,&amp;rdquo; a JSON metadata document at a well-known URL, describing its capabilities and authentication requirements. Other agents discover it, negotiate tasks through a defined lifecycle (submitted, working, input-required, completed), and collaborate without sharing memory or internal state.&lt;/p&gt;
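&lt;p&gt;A sketch of what an Agent Card looks like; the field names approximate the A2A spec, while the agent, URL, and skill are invented for illustration:&lt;/p&gt;

```python
import json

# Illustrative A2A Agent Card: a JSON metadata document served at a
# well-known URL so other agents can discover this one without prior
# coordination. Note it advertises capabilities, not internals.
agent_card = {
    "name": "invoice-reconciler",
    "description": "Matches invoices against purchase orders",
    "url": "https://agents.example.com/invoice-reconciler",
    "capabilities": {"streaming": True},
    "authentication": {"schemes": ["bearer"]},
    "skills": [{"id": "reconcile", "description": "Reconcile one invoice"}],
}
card_json = json.dumps(agent_card, indent=2)
```

&lt;p&gt;Nothing in the card exposes the agent&amp;rsquo;s model, prompts, or memory, which is exactly the opacity the protocol is designed around.&lt;/p&gt;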
&lt;p&gt;Google&amp;rsquo;s own documentation uses a repair shop analogy. MCP is how the mechanic uses diagnostic equipment. A2A is how the customer talks to the shop manager, or how the manager coordinates with a parts supplier. The analogy holds: both conversations happen in a real repair shop, and removing either one doesn&amp;rsquo;t simplify anything.&lt;/p&gt;
&lt;p&gt;A2A &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/"&gt;launched with 50+ partner organizations&lt;/a&gt; and &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/agent2agent-protocol-is-getting-an-upgrade"&gt;grew to 150+ by July 2025&lt;/a&gt;. The list includes Atlassian, Salesforce, SAP, ServiceNow, McKinsey, BCG, Accenture. &lt;a href="https://developers.googleblog.com/en/google-cloud-donates-a2a-to-linux-foundation/"&gt;Google donated A2A to the Linux Foundation in June 2025&lt;/a&gt;. &lt;a href="https://lfaidata.foundation/communityblog/2025/08/29/acp-joins-forces-with-a2a-under-the-linux-foundations-lf-ai-data/"&gt;IBM&amp;rsquo;s competing Agent Communication Protocol merged into A2A in August&lt;/a&gt;, with IBM&amp;rsquo;s engineers joining the technical steering committee. As of February 2026, A2A has roughly &lt;strong&gt;21,900 GitHub stars&lt;/strong&gt;, about 40% of MCP&amp;rsquo;s total. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-mcp-vs-a2a-protocol-race-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/mcp-vs-a2a-protocol-race.png"
alt="Exhibit comparing MCP and A2A protocol adoption: MCP leads with 37,000 GitHub stars, 18,000&amp;#43; public servers, 97M monthly SDK downloads, and 10 SDK languages versus A2A at 21,900 stars, no public registry, and 3 languages"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-mcp-vs-a2a-protocol-race-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/mcp-vs-a2a-protocol-race.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit comparing MCP and A2A protocol adoption: MCP leads with 37,000 GitHub stars, 18,000&amp;#43; public servers, 97M monthly SDK downloads, and 10 SDK languages versus A2A at 21,900 stars, no public registry, and 3 languages" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="what-history-can-tell-us-about-how-this-ends"&gt;What history can tell us about how this ends&lt;/h2&gt;
&lt;p&gt;Standards wars have a consistent pattern. The winner is almost never the technically superior option. It&amp;rsquo;s the one that ships first and gets adopted before anyone can catch up.&lt;/p&gt;
&lt;p&gt;TCP/IP and OSI are the canonical example. The OSI model, published by ISO in 1983, was architecturally more rigorous than TCP/IP&amp;rsquo;s four-layer stack. It had real institutional backing: the US Commerce Department published its GOSIP mandate in August 1988, with formal enforcement beginning in 1990. European governments followed. OSI still lost. TCP/IP won because it had running code, freely available implementations bundled with BSD Unix workstations, while OSI remained elegant theory trapped in committee processes. By 1994 the outcome was obvious. David Clark&amp;rsquo;s &lt;a href="https://groups.csail.mit.edu/ana/People/DDC/future_ietf_92.pdf"&gt;IETF motto captures why&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We reject kings, presidents and voting. We believe in rough consensus and running code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;VHS versus Betamax is the other lesson people cite, often incorrectly. Betamax had better picture quality. VHS won anyway, and the usual explanation is the movie library. That&amp;rsquo;s part of it. But JVC openly licensed VHS to manufacturers across the industry, which drove prices down and built a content ecosystem Sony couldn&amp;rsquo;t match. By 1987, &lt;a href="https://en.wikipedia.org/wiki/Videotape_format_war"&gt;VHS held 90% of the US VCR market&lt;/a&gt;. Sony conceded in 1988 by manufacturing VHS players. Ecosystem breadth, once established, creates a gravitational field that technical superiority alone can&amp;rsquo;t escape.&lt;/p&gt;
&lt;p&gt;USB is a more recent example with a twist. The consortium (Compaq, DEC, IBM, Intel, Microsoft, NEC, Nortel) formed in 1994 and &lt;a href="https://ethw.org/Milestones:Universal_Serial_Bus_(USB),_1996"&gt;shipped USB 1.0 in January 1996&lt;/a&gt;. Adoption was sluggish until &lt;a href="https://en.wikipedia.org/wiki/IMac_G3"&gt;Apple shipped the iMac G3 in August 1998&lt;/a&gt; with only USB ports, forcing the entire peripheral industry to follow. When one player is central enough to the ecosystem, its adoption forces everyone else&amp;rsquo;s hand. OpenAI adopting MCP in March 2025 is MCP&amp;rsquo;s iMac moment.&lt;/p&gt;
&lt;p&gt;But USB also offers a warning. USB-C&amp;rsquo;s physical connector won universally, then the underlying protocol fragmented. The same connector could carry anything from USB 2.0 to USB4, 5W to 240W of power, depending on what you plugged together. &lt;a href="https://single-market-economy.ec.europa.eu/sectors/electrical-and-electronic-engineering-industries-eei/radio-equipment-directive-red/one-common-charging-solution-all_en"&gt;The EU eventually legislated convergence through its Radio Equipment Directive, which took effect December 28, 2024&lt;/a&gt;. A standard can win and still fragment when nobody governs the details. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-standards-war-precedents-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/standards-war-precedents.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/standards-war-precedents.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/standards-war-precedents.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/standards-war-precedents.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/standards-war-precedents.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/standards-war-precedents.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/standards-war-precedents.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/standards-war-precedents.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/standards-war-precedents.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/standards-war-precedents.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/standards-war-precedents.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/standards-war-precedents.png"
alt="Exhibit comparing historical standards wars: TCP/IP versus OSI decided by running code, VHS versus Betamax decided by open licensing, USB decided by Apple iMac catalyst event, all paralleling MCP ecosystem-first trajectory"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-standards-war-precedents-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/standards-war-precedents.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit comparing historical standards wars: TCP/IP versus OSI decided by running code, VHS versus Betamax decided by open licensing, USB decided by Apple iMac catalyst event, all paralleling MCP ecosystem-first trajectory" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="what-now"&gt;What now?&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation"&gt;The Linux Foundation&amp;rsquo;s Agentic AI Foundation (AAIF), launched December 9, 2025&lt;/a&gt; with Anthropic, OpenAI, and Block as co-founders, &lt;a href="https://www.linuxfoundation.org/press/agentic-ai-foundation-welcomes-97-new-members"&gt;now has 146 member organizations&lt;/a&gt;, including JPMorgan Chase, American Express, Autodesk, Red Hat, and Huawei. A2A has its own Linux Foundation governance body. MCP sits within AAIF. Both are under the same umbrella, but they&amp;rsquo;re not the same project.&lt;/p&gt;
&lt;p&gt;This is the governance structure you typically see after a standards war has been decided in principle but before the implementation details have been hammered out. Think of the W3C in 1994, not the W3C in 1998. For anyone making architectural decisions right now, the practical question isn&amp;rsquo;t MCP versus A2A. Most major enterprise platforms already support both. Salesforce, SAP, IBM, Microsoft, and AWS have committed to both. The question is sequencing and depth.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://research.isg-one.com/analyst-perspectives/a2a-v-mcp-why-ai-agents-need-both"&gt;ISG analyst David Menninger&lt;/a&gt; put it clearly: &amp;ldquo;MCP first for sharing context; then A2A for dynamic interaction among agents.&amp;rdquo; That&amp;rsquo;s the sequence I&amp;rsquo;d follow. MCP is the more mature protocol with the larger server ecosystem. The 10,000+ existing servers represent integration work that doesn&amp;rsquo;t need to be rebuilt. Start there. Layer A2A on top when your use cases require multi-agent coordination across organizational boundaries, supply chain, cross-platform orchestration, which is exactly where the Tyson Foods and Adobe deployments have landed.&lt;/p&gt;
&lt;p&gt;MCP security deserves a separate conversation. &lt;a href="https://astrix.security/learn/blog/state-of-mcp-server-security-2025/"&gt;Astrix Security&amp;rsquo;s research&lt;/a&gt; found that 53% of MCP servers rely on static credentials rather than OAuth. A critical vulnerability in the mcp-remote npm package (CVE-2025-6514) exposed 437,000+ installations to shell injection. TCP/IP had its share of early-stage security problems in the 1980s, so I&amp;rsquo;m not calling this fatal. But these are real vulnerabilities, and they will cause real incidents before the posture matures.&lt;/p&gt;
&lt;p&gt;Multiple analyst firms converge on an agentic AI market of roughly &lt;strong&gt;$7–8 billion in 2025&lt;/strong&gt;, growing at 40–50% annually, with projections ranging from &lt;a href="https://www.grandviewresearch.com/industry-analysis/ai-agents-market-report"&gt;$50 billion by 2030&lt;/a&gt; to &lt;a href="https://www.precedenceresearch.com/agentic-ai-market"&gt;$199 billion by 2034&lt;/a&gt;. NVIDIA&amp;rsquo;s CUDA is the comparison that matters: 4 million developers, 15 years of compounding library investment, and switching costs that produce &lt;a href="https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-fourth-quarter-and-fiscal-2025"&gt;$130.5 billion in annual revenue at 73% gross margins&lt;/a&gt;. MCP&amp;rsquo;s 97 million monthly downloads aren&amp;rsquo;t CUDA yet. But the trajectory points the same direction. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-agentic-ai-market-trajectory-png-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/agentic-ai-market-trajectory.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/agentic-ai-market-trajectory.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/agentic-ai-market-trajectory.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/agentic-ai-market-trajectory.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/agentic-ai-market-trajectory.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/agentic-ai-market-trajectory.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/agentic-ai-market-trajectory.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/agentic-ai-market-trajectory.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/agentic-ai-market-trajectory.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/agentic-ai-market-trajectory.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/agentic-ai-market-trajectory.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/agentic-ai-market-trajectory.png"
alt="Exhibit showing agentic AI market projections from $7-8 billion in 2025 to $50 billion by 2030 and up to $199 billion by 2034, with consensus 45% CAGR and comparison to NVIDIA CUDA $131B annual revenue"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-agentic-ai-market-trajectory-png-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/agentic-ai-market-trajectory.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing agentic AI market projections from $7-8 billion in 2025 to $50 billion by 2030 and up to $199 billion by 2034, with consensus 45% CAGR and comparison to NVIDIA CUDA $131B annual revenue" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;p&gt;My best guess (and I want to be clear it&amp;rsquo;s a guess): MCP becomes the infrastructure layer, A2A becomes the coordination layer, much as TCP handles transport while HTTP handles application-layer communication. Different floors of the same building. The question remains whether 146 AAIF members can hold coherent standards against the competitive pressure of &lt;a href="https://tracxn.com/d/sectors/agentic-ai/__oyRAfdUfHPjf2oap110Wis0Qg12Gd8DzULlDXPJzrzs"&gt;over 1,000 active agentic AI startups&lt;/a&gt;, each with economic incentives to differentiate.&lt;/p&gt;</description></item><item><title>AI Models Are the New Rebar</title><link>https://philippdubach.com/posts/ai-models-are-the-new-rebar/</link><pubDate>Wed, 11 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/ai-models-are-the-new-rebar/</guid><description>&lt;p&gt;&lt;a href="https://huggingface.co/Qwen/Qwen3.5-35B-A3B"&gt;Qwen 3.5-35B-A3B&lt;/a&gt;, a model released by Alibaba in February 2026, runs on a single consumer GPU with 24 gigabytes of VRAM. A secondhand RTX 4090, available for around $2,000, generates 60 to 100 tokens per second with it. On select benchmarks per Alibaba&amp;rsquo;s own evaluations, it matches or beats Claude Sonnet 4.5. The Qwen 3.5 Flash tier costs &lt;a href="https://www.alibabacloud.com/help/en/model-studio/model-pricing"&gt;&lt;strong&gt;$0.10 per million input tokens&lt;/strong&gt;&lt;/a&gt; through Alibaba&amp;rsquo;s API. &lt;a href="https://www.anthropic.com/news/claude-sonnet-4-5"&gt;Claude Sonnet 4.5 costs &lt;strong&gt;$3.00&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s a 97 percent discount. For comparable performance.&lt;/p&gt;
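&lt;p&gt;The discount follows directly from the list prices (both in dollars per million input tokens, as cited above):&lt;/p&gt;

```python
qwen_flash = 0.10   # Qwen 3.5 Flash tier, Alibaba Cloud API
sonnet_45 = 3.00    # Claude Sonnet 4.5, Anthropic

# Fraction of the closed-model price you no longer pay:
discount = 1 - qwen_flash / sonnet_45   # about 0.967, i.e. roughly 97 percent
```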
&lt;p&gt;I&amp;rsquo;m not cherry-picking. Zhipu AI&amp;rsquo;s &lt;a href="https://medium.com/@mlabonne/glm-5-chinas-first-public-ai-company-ships-a-frontier-model-a068cecb74e3"&gt;GLM-5 scores 1,452 on the Chatbot Arena leaderboard&lt;/a&gt;, the highest Elo rating of any open-source model, and its developer&amp;rsquo;s own figures put it at roughly 95 percent of closed-model performance at around 15 percent of the cost. Moonshot AI&amp;rsquo;s &lt;a href="https://www.kimi.com/blog/kimi-k2-5"&gt;Kimi K2.5&lt;/a&gt;, a trillion-parameter model, scores 99.0 on HumanEval and 96.1 on AIME 2025, with a Chatbot Arena Elo of 1,447, at roughly 88 percent less than Claude Opus 4.5 per token. The &lt;a href="https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance"&gt;Stanford HAI 2025 AI Index&lt;/a&gt; found the performance gap between open-source and proprietary AI models on the Chatbot Arena leaderboard shrank from &lt;strong&gt;8 percent to 1.7 percent in a single year&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This is not an IP story. It is not a China story. It is an industrial economics story. And we know how those end. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai-performance-vs-price-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai-performance-vs-price.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai-performance-vs-price.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai-performance-vs-price.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai-performance-vs-price.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-performance-vs-price.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai-performance-vs-price.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai-performance-vs-price.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai-performance-vs-price.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-performance-vs-price.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai-performance-vs-price.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai-performance-vs-price.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-performance-vs-price.png"
alt="Exhibit showing open-source AI models have crossed the performance threshold at a fraction of the price, with GLM-5, Kimi K2.5, DeepSeek V3, and Qwen 3.5 Flash all landing in the high-performance low-cost quadrant below $1 per million tokens while Claude Opus 4.5 sits at $15 and GPT-4o at $2.50"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai-performance-vs-price-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai-performance-vs-price.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing open-source AI models have crossed the performance threshold at a fraction of the price, with GLM-5, Kimi K2.5, DeepSeek V3, and Qwen 3.5 Flash all landing in the high-performance low-cost quadrant below $1 per million tokens while Claude Opus 4.5 sits at $15 and GPT-4o at $2.50" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="what-the-steel-mills-can-tell-us"&gt;What the steel mills can tell us&lt;/h2&gt;
&lt;p&gt;In the mid-1960s, electric arc furnace mini-mills entered the steel market at the lowest-quality segment: rebar. Capital costs ran one-fifth to one-seventh of what an integrated plant required. Nucor, the most aggressive operator, built its first mill for $6 million when a comparable integrated facility cost $500 million or more. The response from companies like U.S. Steel was rational: retreat from low-margin rebar, harvest the better-margin products, improve average profitability in the short term. Sensible but wrong.&lt;/p&gt;
&lt;p&gt;Each segment mini-mills conquered had higher margins than the last. From rebar to structural steel, from structural steel to sheet metal, the disruptors climbed the value chain until there was nowhere left to climb. The American steel industry &lt;a href="https://www.chicagotribune.com/news/ct-xpm-1990-06-04-9002150481-story.html"&gt;lost money for five consecutive years in the early 1980s&lt;/a&gt;, posting aggregate losses of &lt;strong&gt;$3.38 billion in 1982 alone&lt;/strong&gt;. U.S. Steel shed more than half its workforce, pivoted to oil and gas, and by &lt;a href="https://investors.ussteel.com/news-events/news-releases/detail/659/nippon-steel-corporation-nsc-to-acquire-u-s-steel"&gt;June 2025 accepted a $14.9 billion acquisition by Nippon Steel&lt;/a&gt;, a fraction of its inflation-adjusted peak valuation. Nucor, the mini-mill, became the largest American steelmaker.&lt;/p&gt;
&lt;p&gt;Clayton Christensen spent a career documenting this pattern of disruptive innovation. The incumbents never failed because they made bad decisions. They failed because they made good decisions for their existing customers while the market shifted beneath them. OpenAI is serving demanding enterprise customers with the most capable models available. Anthropic is building trust with regulated industries. These are the correct moves for their current customers. They may also be exactly the wrong moves for the next five years.&lt;/p&gt;
&lt;h2 id="the-cost-decline-eats-strategy"&gt;The cost decline eats strategy&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://epoch.ai/data-insights/llm-inference-price-trends"&gt;Epoch AI&amp;rsquo;s research&lt;/a&gt;, published in 2025, found that AI inference prices are declining at a &lt;strong&gt;median rate of 50x per year&lt;/strong&gt; for equivalent performance levels, with a range spanning 9x to 900x depending on the task. Achieving GPT-4&amp;rsquo;s original performance on PhD-level science questions cost $30 per million input tokens when GPT-4 launched in early 2023. Through open-source alternatives today, the same performance costs under $0.10. A roughly 300-fold reduction in three years, at a pace that dwarfs Moore&amp;rsquo;s Law.&lt;/p&gt;
&lt;p&gt;David Cahn at Sequoia Capital put the structural problem plainly in his &lt;a href="https://sequoiacap.com/article/ais-600b-question/"&gt;&amp;quot;$600 Billion Question&amp;quot;&lt;/a&gt; analysis: &amp;ldquo;GPU computing is increasingly turning into a commodity, metered per hour. Without a monopoly or oligopoly, high fixed cost plus low marginal cost businesses almost always see prices competed down to marginal cost, like airlines.&amp;rdquo; The airline analogy is more foreboding than it sounds. The global airline industry generated cumulative net profits of $36 billion between 1945 and 2000, a net margin of 0.8 percent across 55 years. In the 2000s, the industry lost more than it had earned in the prior half-century combined. Even today, &lt;a href="https://www.iata.org/en/pressroom/2025-releases/2025-12-09-01"&gt;IATA projects airlines&amp;rsquo; return on invested capital at 6.8 percent&lt;/a&gt;, below their weighted average cost of capital of 8.2 percent.&lt;/p&gt;
&lt;p&gt;The difference between AI and airlines is that switching carriers requires rebooking. Switching an AI model requires changing two lines of code. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-inference-cost-collapse-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/inference-cost-collapse.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/inference-cost-collapse.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/inference-cost-collapse.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/inference-cost-collapse.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/inference-cost-collapse.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/inference-cost-collapse.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/inference-cost-collapse.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/inference-cost-collapse.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/inference-cost-collapse.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/inference-cost-collapse.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/inference-cost-collapse.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/inference-cost-collapse.png"
alt="Exhibit showing GPT-4 level performance went from $30 to $0.10 per million tokens in three years, with closed proprietary models shown alongside open-source alternatives that now match frontier performance at a fraction of the cost, representing a 300x cost reduction"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-inference-cost-collapse-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/inference-cost-collapse.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing GPT-4 level performance went from $30 to $0.10 per million tokens in three years, with closed proprietary models shown alongside open-source alternatives that now match frontier performance at a fraction of the cost, representing a 300x cost reduction" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="switching-costs-that-approach-zero"&gt;Switching costs that approach zero&lt;/h2&gt;
&lt;p&gt;The OpenAI API format has become the de facto industry standard, supported by virtually every major model provider and open-source inference engine. &lt;a href="https://github.com/BerriAI/litellm"&gt;LiteLLM&lt;/a&gt;, an open-source gateway with approximately 37,000 GitHub stars, provides a unified interface to over 100 providers through a single configuration change. OpenRouter offers managed access to more than 400 models. Setup time: under five minutes.&lt;/p&gt;
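&lt;p&gt;Concretely, because these endpoints all speak the OpenAI wire format, a provider swap touches only the base URL, the credential, and the model identifier. A minimal sketch (the registry and model identifiers below are illustrative, not any vendor&amp;rsquo;s actual catalog):&lt;/p&gt;

```python
# Endpoints that speak the OpenAI-compatible API differ only in
# base URL, credential, and model name (illustrative values).
PROVIDERS = {
    "openai":     {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"},
    "openrouter": {"base_url": "https://openrouter.ai/api/v1", "model": "qwen-3.5-flash"},
}

def client_config(provider, api_key):
    """Return the handful of fields that change when switching providers."""
    entry = PROVIDERS[provider]
    return {"base_url": entry["base_url"], "api_key": api_key, "model": entry["model"]}

# Moving a workload is a one-line change at the call site:
config = client_config("openrouter", "sk-example")
```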
&lt;p&gt;Enterprise behavior already reflects this. Perplexity&amp;rsquo;s own data shows 92 percent of Fortune 500 employees use multi-model AI platforms, and their top enterprise accounts access an average of 30 different models. These are Perplexity&amp;rsquo;s internal figures, not independent market research: treat them as directional. The one meaningful source of lock-in is custom fine-tuned models, which are provider-specific and cannot be directly ported. That affects a small fraction of deployments. For the vast majority of inference calls, the model is interchangeable, and the customer buys on price.&lt;/p&gt;
&lt;h2 id="what-openais-numbers-actually-require"&gt;What OpenAI&amp;rsquo;s numbers actually require&lt;/h2&gt;
&lt;p&gt;On February 27, 2026, &lt;a href="https://openai.com/index/scaling-ai-for-everyone/"&gt;OpenAI closed a $110 billion funding round&lt;/a&gt;, the largest private capital raise in history, at a post-money valuation of &lt;strong&gt;$840 billion&lt;/strong&gt;. Amazon committed $50 billion. SoftBank $30 billion. Nvidia $30 billion. The valuation implies extraordinary confidence in OpenAI&amp;rsquo;s ability to maintain pricing power and grow revenue to somewhere between $200 and $280 billion by 2030. At 42x trailing revenue, it is priced not for today&amp;rsquo;s market but for a specific version of the future.&lt;/p&gt;
&lt;p&gt;OpenAI reported &lt;a href="https://openai.com/index/scaling-ai-for-everyone/"&gt;&lt;strong&gt;$20 billion in annualized recurring revenue&lt;/strong&gt;&lt;/a&gt; as of January 2026, up 233 percent year over year. Impressive. But the adjusted gross margin fell to 33 percent in 2025, down from 40 percent the prior year, as &lt;a href="https://the-decoder.com/openai-adds-111-billion-to-its-cash-burn-forecast-as-ai-costs-spiral-beyond-projections/"&gt;inference costs quadrupled to $8.4 billion&lt;/a&gt;. In the first half of 2025 alone, OpenAI lost $13.5 billion. Compute and technical talent costs consume approximately 75 percent of total revenue, and Microsoft takes another 20 percent through 2032. That leaves very little room for the margin expansion the valuation demands.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation"&gt;Anthropic&lt;/a&gt; tells a similar story at a smaller scale. At a &lt;strong&gt;$380 billion valuation&lt;/strong&gt; on $14 billion in run-rate revenue, 27x, the company is also unprofitable, projecting positive cash flow somewhere around 2027 to 2028. Both companies are betting they can simultaneously grow revenue and expand margins. In commoditized markets, that is the bet that fails.&lt;/p&gt;
&lt;p&gt;Part of the financing is also circular. Amazon invests $50 billion in OpenAI; a portion flows back to AWS as compute spending. Nvidia invests $30 billion; the same money returns as GPU purchases. This inflates revenue figures while obscuring how much of the demand is genuinely independent. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-openai-margin-squeeze-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/openai-margin-squeeze.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/openai-margin-squeeze.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/openai-margin-squeeze.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/openai-margin-squeeze.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/openai-margin-squeeze.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/openai-margin-squeeze.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/openai-margin-squeeze.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/openai-margin-squeeze.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/openai-margin-squeeze.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/openai-margin-squeeze.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/openai-margin-squeeze.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/openai-margin-squeeze.png"
alt="Exhibit showing OpenAI financials: $20B ARR up 233% but gross margin fell from 40% to 33% as inference costs quadrupled to $8.4B, net loss of $13.5B in H1 2025, with the $840B valuation requiring 43% revenue CAGR to 2030 while expanding margins against open-source price pressure"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-openai-margin-squeeze-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/openai-margin-squeeze.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing OpenAI financials: $20B ARR up 233% but gross margin fell from 40% to 33% as inference costs quadrupled to $8.4B, net loss of $13.5B in H1 2025, with the $840B valuation requiring 43% revenue CAGR to 2030 while expanding margins against open-source price pressure" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="who-actually-wins-when-the-model-layer-is-a-commodity"&gt;Who actually wins when the model layer is a commodity&lt;/h2&gt;
&lt;p&gt;Before writing off the incumbents, two historical cases are worth sitting with.&lt;/p&gt;
&lt;p&gt;Amazon Web Services has cut prices &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/cost-optimization-pillar/cost_cloud_financial_management_scheduled.html"&gt;134 times since 2006&lt;/a&gt;, yet its operating margins expanded to a record &lt;a href="https://www.cnbc.com/2025/05/01/aws-q1-earnings-report-2025.html"&gt;39.5 percent in Q1 2025&lt;/a&gt;. Apple captures roughly 80 to 85 percent of global smartphone operating profits with around 18 to 21 percent of unit shipments, while commodity Android manufacturers earn negligible margins. Both got there the same way: years of accumulated switching costs, vertical integration, ecosystems that cost real money to leave. The question is whether AI model providers can build any of that. I don&amp;rsquo;t think they can, not at the model layer. An API endpoint returning text is not an iPhone. You change it in a config file on a Tuesday afternoon.&lt;/p&gt;
&lt;p&gt;So who does benefit? Nvidia and cloud providers collect rent regardless of which model runs on their hardware. That position is durable. The application layer looks better still: companies embedding AI into domain-specific workflows with proprietary data, where the model is an input rather than the product. As &lt;a href="https://eqtgroup.com/thinq/technology/why-ai-value-wont-just-accrue-to-foundational-models"&gt;Andrew Lewis at EQT&lt;/a&gt; put it, &amp;ldquo;Over time, the value is likely to accrue to the application layer and the product companies.&amp;rdquo; And then there are the platforms with distribution so large they can integrate AI at near-zero marginal cost: Meta embedding Llama into Instagram and WhatsApp, Google weaving Gemini into Search and Workspace. When Mark Zuckerberg open-sources Llama, he is deliberately commoditizing the model layer to prevent any single player from owning the stack above his distribution. When a $1.6 trillion company is your most committed price-cutter, that tells you something about where the margins are going.&lt;/p&gt;</description></item><item><title>AI Capex Arms Race: Who Blinks First?</title><link>https://philippdubach.com/posts/ai-capex-arms-race-who-blinks-first/</link><pubDate>Sun, 08 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/ai-capex-arms-race-who-blinks-first/</guid><description>&lt;p&gt;Alphabet&amp;rsquo;s free cash flow is projected to fall roughly &lt;strong&gt;90%&lt;/strong&gt; in 2026. Not because the business is in trouble. Because the company has committed to spending &lt;strong&gt;$83–93 billion more&lt;/strong&gt; on capital expenditure than it did last year.&lt;/p&gt;
&lt;p&gt;That is what $660–690 billion in AI capex looks like up close. &lt;a href="https://finance.yahoo.com/news/amazon-200-billion-ai-spending-153341517.html"&gt;Amazon guided to &lt;strong&gt;$200 billion&lt;/strong&gt; alone&lt;/a&gt;. Meta&amp;rsquo;s long-term debt more than doubled to &lt;a href="https://www.sec.gov/Archives/edgar/data/1326801/000162828026003832/meta-12312025xexhibit991.htm"&gt;&lt;strong&gt;$58.7 billion&lt;/strong&gt;&lt;/a&gt; to help finance its share. &lt;a href="https://www.goldmansachs.com/insights/articles/why-ai-companies-may-invest-more-than-500-billion-in-2026"&gt;Goldman Sachs projects&lt;/a&gt; cumulative 2025–2027 spending across the Big 4 at &lt;strong&gt;$1.15 trillion&lt;/strong&gt;, more than double the $477 billion spent over the prior three years combined. BofA credit strategists found this will consume &lt;a href="https://techblog.comsoc.org/2025/11/01/ai-spending-boom-accelerates-big-tech-to-invest-invest-an-aggregate-of-400-billion-in-2025-more-in-2026/"&gt;&lt;strong&gt;94% of operating cash flow minus dividends and buybacks&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At what revenue growth rate does any of this pay for itself? And what happens if inference costs fall 100-fold before the infrastructure is fully depreciated? We want to think about this the way a credit analyst would. Not as a technology story but as a corporate finance story. Because the numbers, assembled from earnings releases and analyst reports through February 2026, look less like a technology platform buildout and more like a leveraged buyout of the future. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai-capex-hockey-stick-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai-capex-hockey-stick.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai-capex-hockey-stick.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai-capex-hockey-stick.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai-capex-hockey-stick.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-capex-hockey-stick.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai-capex-hockey-stick.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai-capex-hockey-stick.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai-capex-hockey-stick.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-capex-hockey-stick.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai-capex-hockey-stick.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai-capex-hockey-stick.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-capex-hockey-stick.png"
alt="Exhibit showing 2025 actual versus 2026 guided capex for Big 4 hyperscalers: Amazon at $200B guided up 52%, Alphabet at $175-185B up 97%, Meta at $60-65B, Microsoft at $100-120B up 25%, totaling $610-655B combined up 63%"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai-capex-hockey-stick-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai-capex-hockey-stick.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing 2025 actual versus 2026 guided capex for Big 4 hyperscalers: Amazon at $200B guided up 52%, Alphabet at $175-185B up 97%, Meta at $60-65B, Microsoft at $100-120B up 25%, totaling $610-655B combined up 63%" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="the-lbo"&gt;The LBO&lt;/h2&gt;
&lt;p&gt;An LBO thesis goes like this: we borrow heavily today, acquire an asset, generate enough cash flow to service the debt, and eventually sell or refinance at a profit. The bet works if the returns from the acquired asset exceed the cost of capital. It fails if the asset underperforms, the cost of capital rises, or the timeline extends beyond what the capital structure can absorb.&lt;/p&gt;
&lt;p&gt;The hyperscaler capex thesis has the same structure, substituting &amp;ldquo;equity&amp;rdquo; and &amp;ldquo;operating cash flow&amp;rdquo; for debt. Each company is telling shareholders: we will deploy enormous capital today, accept near-zero or negative free cash flow for 18 to 36 months, and recoup that investment through AI revenue growth. Sundar Pichai put the bull case plainly &lt;a href="https://www.fool.com/earnings/call-transcripts/2024/07/23/alphabet-googl-q2-2024-earnings-call-transcript/"&gt;at Alphabet&amp;rsquo;s Q2 2024 earnings&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The risk of underinvesting is dramatically greater than the risk of overinvesting for us here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At five-year straight-line on $175 billion in Alphabet capex, you get $35 billion in annual depreciation. Add a conservative 10% cost of capital on the incremental investment, and the hurdle gets harder still. For the full &lt;strong&gt;$690 billion&lt;/strong&gt; in 2026 hyperscaler capex, the annual depreciation burden alone approaches &lt;strong&gt;$115–140 billion&lt;/strong&gt; at five-year lives. That is before interest, power, operations, or the cost of next year&amp;rsquo;s upgrade cycle.&lt;/p&gt;
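&lt;p&gt;The arithmetic is worth making explicit. A sketch using the guided capex figures above; the 10% cost of capital is this post&amp;rsquo;s simplifying assumption, not a company disclosure:&lt;/p&gt;

```python
def annual_depreciation(capex_b, life_years=5):
    """Straight-line depreciation charge, in $ billions per year."""
    return capex_b / life_years

def annual_hurdle(capex_b, life_years=5, cost_of_capital=0.10):
    """Depreciation plus a simple capital charge on the investment."""
    return annual_depreciation(capex_b, life_years) + cost_of_capital * capex_b

alphabet = annual_depreciation(175)    # 35.0, matching the $35B figure above
full_2026 = annual_depreciation(690)   # 138.0 per year at five-year lives
hurdle = annual_hurdle(175)            # 52.5 before power, ops, and upgrades
```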
&lt;p&gt;The revenue side of this ledger is far smaller than the capex side. Rough estimates place direct AI revenue across the ecosystem at &lt;strong&gt;$40–60 billion in 2025&lt;/strong&gt;, against AI-specific capex of roughly $300 billion. Coverage ratio: approximately &lt;strong&gt;0.17x&lt;/strong&gt;. &lt;a href="https://sequoiacap.com/article/ais-600b-question/"&gt;Sequoia&amp;rsquo;s David Cahn&lt;/a&gt; calculated that the AI ecosystem needs to generate &lt;strong&gt;$600 billion in annual revenue&lt;/strong&gt; to justify current infrastructure spending, against perhaps $50–100 billion it is actually generating. By 2026, with AI revenue perhaps reaching $80–120 billion and AI capex at $450 billion, the ratio improves to roughly &lt;strong&gt;0.25x&lt;/strong&gt;. Still not a business. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai-revenue-coverage-gap-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai-revenue-coverage-gap.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai-revenue-coverage-gap.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai-revenue-coverage-gap.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai-revenue-coverage-gap.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-revenue-coverage-gap.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai-revenue-coverage-gap.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai-revenue-coverage-gap.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai-revenue-coverage-gap.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-revenue-coverage-gap.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai-revenue-coverage-gap.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai-revenue-coverage-gap.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-revenue-coverage-gap.png"
alt="Exhibit showing AI revenue of roughly $50B in 2025 against $300B in AI-specific capex and the $600B revenue threshold estimated by Sequoia Capital, with coverage ratios of 0.17x in 2025 and 0.25x projected for 2026"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai-revenue-coverage-gap-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai-revenue-coverage-gap.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing AI revenue of roughly $50B in 2025 against $300B in AI-specific capex and the $600B revenue threshold estimated by Sequoia Capital, with coverage ratios of 0.17x in 2025 and 0.25x projected for 2026" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
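&lt;p&gt;The coverage ratios above can be reproduced in a few lines; using the midpoints of the post&amp;rsquo;s ranges is my own simplification:&lt;/p&gt;

```python
# Coverage ratio: AI revenue over AI-specific capex, using the post's ranges.
# Midpoints are an assumption for illustration.

def coverage(revenue_b, capex_b):
    return revenue_b / capex_b

ratio_2025 = coverage(50, 300)   # $40-60B revenue midpoint vs ~$300B capex
ratio_2026 = coverage(100, 450)  # $80-120B midpoint vs ~$450B capex

print(f"2025: {ratio_2025:.2f}x")  # 0.17x
print(f"2026: {ratio_2026:.2f}x")  # 0.22x at the midpoint, ~0.25x toward the high end
```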
&lt;h2 id="what-would-have-to-be-true"&gt;What would have to be true&lt;/h2&gt;
&lt;p&gt;The spending is not obviously irrational. The bull case is worth taking seriously: the right moment to build infrastructure for a platform shift is before the platform fully exists. Railroads were overbuilt. Fiber was overbuilt. Both excesses funded genuinely useful infrastructure that later ran at capacity. If AI becomes the general-purpose technology that most proponents claim, the AI infrastructure being deployed today could look like the most prescient investment since Standard Oil.&lt;/p&gt;
&lt;p&gt;But that argument requires you to believe some very specific things about revenue growth that have not yet materialized. The 2025–2030 revenue ramp embedded in current capex implies AI revenue growing from roughly $60 billion today to somewhere between $600 billion and $2 trillion by 2030, depending on which bullish scenario you pick. Bain calculates that even under the most aggressive adoption scenario, AI generates $1.2 trillion in revenue, against the $2 trillion the spending requires to break even.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://economics.mit.edu/news/daron-acemoglu-what-do-we-know-about-economics-ai"&gt;MIT&amp;rsquo;s Daron Acemoglu&lt;/a&gt;, who won the 2024 Nobel Prize in Economics, projects AI will deliver a total GDP increase of just &lt;strong&gt;1.1–1.6% over ten years&lt;/strong&gt;: roughly a &lt;strong&gt;0.05% annual productivity gain&lt;/strong&gt;. Only about 5% of economic tasks, he estimates, are cost-effectively automatable at current prices. Goldman Sachs&amp;rsquo; Jim Covello made a similar argument in a &lt;a href="https://www.datacenterdynamics.com/en/news/goldman-sachs-1tn-to-be-spent-on-ai-data-centers-chips-and-utility-upgrades-with-little-to-show-for-it-so-far/"&gt;June 2024 note&lt;/a&gt;: &amp;ldquo;Replacing low-wage jobs with tremendously costly technology is basically the polar opposite of the prior technology transitions I&amp;rsquo;ve witnessed in my thirty years of closely following the tech industry.&amp;rdquo; Neither of these is a fringe view. If either is roughly right, the revenue scenarios baked into current capex budgets do not close. And yet the same market is &lt;a href="https://philippdubach.com/posts/the-saaspocalypse-paradox/"&gt;destroying software stocks&lt;/a&gt; because AI adoption is supposedly too strong. Both readings cannot be true.&lt;/p&gt;
&lt;p&gt;Dario Amodei, who is himself building the infrastructure, &lt;a href="https://www.dwarkesh.com/p/dario-amodei-2"&gt;put it very bluntly on the Dwarkesh Podcast in February 2026&lt;/a&gt;: &amp;ldquo;If my revenue is not $1 trillion, if it&amp;rsquo;s even $800 billion, there&amp;rsquo;s no force on Earth, there&amp;rsquo;s no hedge on Earth that could stop me from going bankrupt if I buy that much compute.&amp;rdquo; He was describing his own spending discipline relative to peers. The companies spending three times as much as Anthropic apparently believe they have found the hedge he could not.&lt;/p&gt;
&lt;h2 id="depreciation-time-bomb"&gt;Depreciation time bomb&lt;/h2&gt;
&lt;p&gt;One risk most analyses underweight: AI hardware becomes obsolete faster than in any previous infrastructure cycle.&lt;/p&gt;
&lt;p&gt;Hyperscalers had extended server useful lives from four years to five, then six, saving billions in annual depreciation. But Amazon reversed course: in Q4 2024 it took a &lt;a href="https://behindthebalancesheet.substack.com/p/amazons-ai-reality-check"&gt;&lt;strong&gt;$920 million&lt;/strong&gt; charge to early-retire certain servers and networking equipment&lt;/a&gt;, then effective January 1, 2025 it shortened useful lives for a subset of servers from six to five years, citing &amp;ldquo;the increased pace of technology development, particularly in the area of artificial intelligence,&amp;rdquo; a decision expected to reduce 2025 operating income by a further $700 million. Jensen Huang, not a man known for underselling his own products, said of H100 GPUs once Blackwell shipped: &lt;a href="https://www.rev.com/transcripts/gtc-keynote-with-nvidia-ceo-jensen-huang"&gt;&amp;ldquo;You couldn&amp;rsquo;t give Hoppers away.&amp;rdquo;&lt;/a&gt; Nvidia now releases new architectures annually, where it previously released them every two years.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.cnbc.com/2025/11/11/big-short-investor-michael-burry-accuses-ai-hyperscalers-of-artificially-boosting-earnings.html"&gt;Michael Burry&lt;/a&gt;, who spent 2005 correctly modeling the mortgage market&amp;rsquo;s hidden risks, estimates that hyperscalers will understate depreciation by roughly &lt;strong&gt;$176 billion&lt;/strong&gt; in aggregate between 2026 and 2028, causing them to overreport earnings by more than 20%. I have no idea whether Burry is right on the specific number. But the direction is correct. If the useful life of a Blackwell GPU is closer to three years than five because Rubin replaces it in 2027, the depreciation math gets far worse.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://epoch.ai/data-insights/llm-inference-price-trends"&gt;Epoch AI measured&lt;/a&gt; inference costs falling at a median &lt;strong&gt;50 times per year&lt;/strong&gt;, accelerating to &lt;strong&gt;200 times per year&lt;/strong&gt; after January 2024. GPT-3-era processing cost around $20 per million tokens at launch in 2020. By early 2026, models of comparable capability cost roughly &lt;strong&gt;$0.07&lt;/strong&gt; per million tokens. That is a roughly 280-fold decline over five years, and there is no obvious reason for it to stop. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai-inference-cost-cliff-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai-inference-cost-cliff.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai-inference-cost-cliff.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai-inference-cost-cliff.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai-inference-cost-cliff.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-inference-cost-cliff.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai-inference-cost-cliff.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai-inference-cost-cliff.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai-inference-cost-cliff.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-inference-cost-cliff.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai-inference-cost-cliff.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai-inference-cost-cliff.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-inference-cost-cliff.png"
alt="Exhibit showing inference cost per million tokens falling from $20 at GPT-3 launch in 2020 to $0.07 in early 2026 on a log scale, with Epoch AI measuring acceleration to 200x per year decline after January 2024"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai-inference-cost-cliff-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai-inference-cost-cliff.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing inference cost per million tokens falling from $20 at GPT-3 launch in 2020 to $0.07 in early 2026 on a log scale, with Epoch AI measuring acceleration to 200x per year decline after January 2024" decoding="async"&gt;
&lt;/dialog&gt;
The hyperscaler response to this is Jevons, &lt;a href="https://philippdubach.com/posts/does-ai-mean-the-demand-on-labor-goes-up/"&gt;an argument I explored in January&lt;/a&gt;: cheaper inference will explode demand, and the total compute consumed will far exceed what efficiency gains removed. They may be right. But the timing matters. Infrastructure being deployed today, at today&amp;rsquo;s GPU prices, needs to generate enough revenue before the next architecture cycle renders it economically obsolete. The payback window is not 36 months. It may be 18.&lt;/p&gt;
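&lt;p&gt;Annualizing the endpoint figures above shows the average rate implied by the 280-fold decline (a cruder measure than Epoch&amp;rsquo;s matched-capability series; treating the span as a round five years is my simplification):&lt;/p&gt;

```python
# Average annual decline implied by the post's endpoints:
# $20 per million tokens in 2020 down to $0.07 by early 2026.

start, end, years = 20.0, 0.07, 5
overall = start / end              # ~286x total decline
per_year = overall ** (1 / years)  # ~3.1x average annual decline at the endpoints

print(f"{overall:.0f}x overall, about {per_year:.1f}x per year on average")
```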
&lt;h2 id="arms-race-logic"&gt;Arms race logic&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://fortune.com/2025/09/19/zuckerberg-ai-bubble-definitely-possibility-sam-altman-collapse/"&gt;Mark Zuckerberg acknowledged&lt;/a&gt; the possibility of an AI bubble &amp;ldquo;definitely&amp;rdquo; in September 2025, then spent $72 billion anyway. This is not irrationality. It is game theory. If AI really does create winner-take-most outcomes, slowing down is a bet that the platform shift is smaller than your competitors believe. Most boards are not willing to make that bet. So everyone keeps spending, and as I &lt;a href="https://philippdubach.com/posts/every-bulge-bracket-bank-agrees-on-ai/"&gt;wrote last week&lt;/a&gt;, every bulge bracket bank agrees they should.&lt;/p&gt;
&lt;p&gt;But the same logic drove WorldCom&amp;rsquo;s Bernie Ebbers. The same logic drove Global Crossing. The specific claim driving the 1990s telecom bubble was that internet traffic was &amp;ldquo;doubling every 100 days.&amp;rdquo; It was false: &lt;a href="https://www-users.cse.umn.edu/~odlyzko/doc/internet.growth.myth2.pdf"&gt;researcher Andrew Odlyzko traced it to misleading WorldCom/UUNET claims&lt;/a&gt;, and actual traffic doubled roughly once per year. By 2001, only &lt;strong&gt;5% of installed fiber capacity was in use&lt;/strong&gt;. The infrastructure eventually ran at capacity; it just took a decade and several dozen bankruptcies to get there.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.oaktreecapital.com/insights/memo/is-it-a-bubble"&gt;Howard Marks published a December 2025 memo&lt;/a&gt; asking, with characteristic deliberateness, &amp;ldquo;Is It a Bubble?&amp;rdquo; He noted hyperscalers&amp;rsquo; capex was outpacing revenue momentum and lenders were sweetening terms to keep deal flow alive. J.P. Morgan projects &lt;strong&gt;$300 billion in investment-grade bonds&lt;/strong&gt; for AI data centers in 2026 alone. That is the same fragility that destroyed the telecom builders: cheap debt financing infrastructure before anyone has proved the revenue exists to service it.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fortune.com/2026/02/23/ai-capex-us-gdp-negative-pantheon/"&gt;Without AI spending, Pantheon Macroeconomics calculated in February 2026&lt;/a&gt;, U.S. corporate capex would currently be negative. The entire infrastructure investment story depends on this cycle continuing: total U.S. GDP grew just 1.4% annualized in H1 2025, and AI-related investment accounted for essentially all of it.&lt;/p&gt;
&lt;aside class="disclaimer" role="note" aria-label="Disclaimer"&gt;
&lt;div class="disclaimer-content"&gt;&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; All opinions expressed are my own. This is not investment, financial, tax, or legal advice. Past performance does not indicate future results. Do your own research and consult qualified professionals before making financial decisions. No liability accepted for any losses.&lt;/p&gt;&lt;/div&gt;
&lt;/aside&gt;</description></item><item><title>Peter Thiel's Physics Department</title><link>https://philippdubach.com/posts/peter-thiels-physics-department/</link><pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/peter-thiels-physics-department/</guid><description>&lt;p&gt;On December 11, &lt;a href="https://en.wikipedia.org/wiki/Jimmy_Carr"&gt;Jimmy Carr&lt;/a&gt; sat on the &lt;a href="https://www.youtube.com/watch?v=mWDCZIvLrS4"&gt;TRIGGERnometry podcast&lt;/a&gt; and delivered a riff that sounded like Peter Thiel&amp;rsquo;s stagnation thesis filtered through a comedian&amp;rsquo;s timing:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Minus the screens from any room, we&amp;rsquo;re living in the 1970s. Nothing&amp;rsquo;s happened in physics since &amp;lsquo;72. String theory has not got us anywhere. But if you take the compute power of AI and point it at physics, what happens? We could have a world of plenty. I hope that&amp;rsquo;s the world we live in. But it could go another way.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Two months later, on February 13, GPT-5.2 &lt;a href="https://thequantuminsider.com/2026/02/13/ai-scientist-spots-what-physicists-missed-in-gluon-scattering/"&gt;derived and formally proved&lt;/a&gt; a new result in theoretical physics: single-minus gluon scattering amplitudes, long assumed to vanish, are nonzero in the half-collinear regime. Nima Arkani-Hamed at the Institute for Advanced Study called the formulas &amp;ldquo;strikingly simple&amp;rdquo; after fifteen years of personal curiosity about the problem. Nathaniel Craig at UC Santa Barbara called it &amp;ldquo;journal-level research advancing the frontiers of theoretical physics.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="thiels-stagnation-case"&gt;Thiel&amp;rsquo;s stagnation case&lt;/h2&gt;
&lt;p&gt;Carr was paraphrasing Thiel, who has been making this argument for fifteen years. The &lt;a href="https://www.scribd.com/document/61379051/What-Happened-to-the-Future-Founders-Fund-Manifesto"&gt;Founders Fund manifesto&lt;/a&gt; (2011) put it bluntly: &amp;ldquo;We wanted flying cars, instead we got 140 characters.&amp;rdquo; Thiel&amp;rsquo;s framework distinguishes progress in bits from progress in atoms: spectacular digital gains since 1970, physical-world stagnation. Tyler Cowen named the broader phenomenon the Great Stagnation. On the &lt;a href="https://singjupost.com/a-i-mars-and-immortality-are-we-dreaming-big-enough-peter-thiel-transcript/"&gt;Douthat podcast&lt;/a&gt; Thiel was more measured: &amp;ldquo;The claim was that the velocity had slowed, it wasn&amp;rsquo;t zero.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The data supports the velocity claim. Total factor productivity growth, the metric that captures genuine scientific progress and technological improvement, ran at roughly 1.7% annually from 1947 to 1973. Since 2004, it has averaged 0.4%. Robert Gordon&amp;rsquo;s &lt;em&gt;The Rise and Fall of American Growth&lt;/em&gt; argues the &amp;ldquo;special century&amp;rdquo; of 1870 to 1970 was a one-time event. &lt;a href="https://mattsclancy.substack.com/p/science-is-getting-harder"&gt;Bloom, Jones, Van Reenen, and Webb&lt;/a&gt; showed in the &lt;em&gt;American Economic Review&lt;/em&gt; that maintaining Moore&amp;rsquo;s Law required 18x more researchers in 2014 versus 1971.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-tfp-growth-stagnation-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/tfp-growth-stagnation.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/tfp-growth-stagnation.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/tfp-growth-stagnation.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/tfp-growth-stagnation.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/tfp-growth-stagnation.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/tfp-growth-stagnation.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/tfp-growth-stagnation.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/tfp-growth-stagnation.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/tfp-growth-stagnation.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/tfp-growth-stagnation.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/tfp-growth-stagnation.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/tfp-growth-stagnation.png"
alt="Peter Thiel&amp;#39;s stagnation thesis in data: US Total Factor Productivity growth by era showing 1.7 percent annually from 1947 to 1973 during the postwar boom, collapsing to 0.5 percent from 1973 to 1996, briefly recovering to 2.0 percent during the IT revival of 1996 to 2004, then falling back to 0.4 percent from 2004 to present, a 76 percent decline from the postwar peak"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-tfp-growth-stagnation-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/tfp-growth-stagnation.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Peter Thiel&amp;#39;s stagnation thesis in data: US Total Factor Productivity growth by era showing 1.7 percent annually from 1947 to 1973 during the postwar boom, collapsing to 0.5 percent from 1973 to 1996, briefly recovering to 2.0 percent during the IT revival of 1996 to 2004, then falling back to 0.4 percent from 2004 to present, a 76 percent decline from the postwar peak" decoding="async"&gt;
&lt;/dialog&gt;
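&lt;p&gt;Compounding makes the gap in the exhibit concrete: at the postwar rate versus the post-2004 rate, the productivity level reached after 50 years differs by nearly a factor of two (the 50-year horizon is my choice for illustration):&lt;/p&gt;

```python
# Compounding the post's TFP growth rates over a 50-year horizon.
postwar = 1.017 ** 50  # 1.7% per year, the 1947-1973 rate
recent = 1.004 ** 50   # 0.4% per year, the post-2004 rate

print(f"Postwar rate: {postwar:.2f}x, recent rate: {recent:.2f}x")  # ~2.32x vs ~1.22x
```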
&lt;p&gt;The Standard Model of particle physics was essentially complete by the early 1970s. Since then, we have confirmed things we already predicted: the Higgs boson (2012, 48 years after prediction), gravitational waves (2015, 99 years after Einstein), the accelerating expansion of the universe (1998). Important experimental work. But confirmations, not revolutions. No supersymmetric particles. No extra dimensions. No new fundamental energy sources. No unified field theory. String theory, the leading candidate for physics beyond the Standard Model, has produced &lt;a href="https://www.researchgate.net/publication/334607591_The_String_Theory_Landscape"&gt;zero experimentally confirmed predictions&lt;/a&gt; in 55 years and admits roughly 10^500 possible solutions, which is another way of saying it predicts everything and therefore nothing. &lt;a href="https://www.goodreads.com/author/quotes/17201066.Sabine_Hossenfelder"&gt;Sabine Hossenfelder&lt;/a&gt; captured the frustration:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Theoretical physicists used to explain what was observed. Now they try to explain why they can&amp;rsquo;t explain what was not observed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="what-ai-has-already-done-for-science"&gt;What AI has already done for science&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://x.com/demishassabis/status/1845864764469334239"&gt;AlphaFold&lt;/a&gt; predicted the three-dimensional structures of 214 million proteins, solving the protein folding problem for structural biology. It won the 2024 Nobel Prize in Chemistry for Demis Hassabis and John Jumper, and has been used by over 2 million researchers in 190 countries. DeepMind&amp;rsquo;s &lt;a href="https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/"&gt;GNoME&lt;/a&gt; identified 2.2 million new crystal structures and 381,000 predicted-stable materials, equivalent to roughly 800 years of prior human discovery in materials science. Lawrence Berkeley Lab&amp;rsquo;s A-Lab robotically synthesized 41 of these in &lt;a href="https://deepmind.google/blog/millions-of-new-materials-discovered-with-deep-learning/"&gt;17 days&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In fusion, &lt;a href="https://deepmind.google/blog/bringing-ai-to-the-next-generation-of-fusion-energy/"&gt;DeepMind trained a reinforcement learning system&lt;/a&gt; to autonomously control plasma in a real tokamak at EPFL, sculpting it into configurations no human operator had achieved. &lt;a href="https://engineering.princeton.edu/news/2024/02/21/engineers-use-ai-wrangle-fusion-power-grid"&gt;Princeton researchers&lt;/a&gt; predicted tearing instabilities 300 milliseconds in advance and adjusted reactor parameters in real time: the first demonstration of preventing, not just suppressing, the instabilities that have plagued fusion for decades. &lt;a href="https://www.cleanenergy-platform.com/insight/inside-taes-2025-plasma-breakthroughand-how-it-changed-fusions-trajectory"&gt;TAE Technologies&lt;/a&gt; used AI-optimized beam injection to sustain plasma above 70 million degrees C. At Lawrence Livermore, the CogSim AI framework &lt;a href="https://lasers.llnl.gov/news/llnl-researchers-employed-ai-driven-model-predict-fusion-ignition-shot"&gt;predicted a 74% probability of ignition&lt;/a&gt; days before the December 2022 shot that achieved it.&lt;/p&gt;
&lt;p&gt;Microsoft and Pacific Northwest National Lab &lt;a href="https://www.datacenterdynamics.com/en/news/microsoft-and-pnnl-use-ai-and-hpc-for-battery-materials-research/"&gt;screened 32.6 million inorganic materials&lt;/a&gt; in roughly 80 hours, identified 18 finalists, and produced a &lt;a href="https://techround.co.uk/news/microsofts-ai-powered-battery-discovery-could-replace-lithium/"&gt;working battery prototype&lt;/a&gt; using 70% less lithium within nine months. In drug discovery, at least &lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11800368/"&gt;75 AI-discovered drugs&lt;/a&gt; have entered clinical trials, up from 3 in 2016, with Phase I success rates of 80 to 90% compared to the traditional 40%.&lt;/p&gt;
&lt;p&gt;And then, GPT-5.2 produced a new result in theoretical physics. A proof that human physicists had not found. The mathematical reasoning timeline tells the story. &lt;a href="https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/"&gt;AlphaGeometry&lt;/a&gt; solved 25 of 30 Olympiad geometry problems in January 2024. By July 2024, &lt;a href="https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/"&gt;AlphaProof earned a silver medal&lt;/a&gt; at the International Mathematical Olympiad. By 2025, &lt;a href="https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/"&gt;Gemini Deep Think scored gold&lt;/a&gt;: 5 of 6 problems, 35 points, end-to-end in natural language. Terence Tao &lt;a href="https://siliconreckoner.substack.com/p/terence-tao-on-machine-assisted-proofs"&gt;revised his prediction&lt;/a&gt; for superhuman AI mathematics from 2029 to 2026.&lt;/p&gt;
&lt;h2 id="751-compute-gap"&gt;75:1 compute gap&lt;/h2&gt;
&lt;p&gt;Here is the number that matters. Big Tech spent over &lt;strong&gt;$250 billion&lt;/strong&gt; on AI infrastructure in 2024 and 2025. Total US federal AI R&amp;amp;D spending: &lt;a href="https://federalbudgetiq.com/insights/federal-ai-and-it-research-and-development-spending-analysis/"&gt;&lt;strong&gt;$3.3 billion&lt;/strong&gt; per year&lt;/a&gt;. That is a compute divide of roughly 75:1 between commercial and scientific AI investment. The &lt;a href="https://cset.georgetown.edu/article/the-nairr-pilot-estimating-compute/"&gt;NAIRR pilot&lt;/a&gt; allocated about 3.2 yottaFLOPs to academic researchers, enough to train GPT-3.5 once but not enough for a single GPT-4-class run.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-compute-gap-75-to-1-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/compute-gap-75-to-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/compute-gap-75-to-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/compute-gap-75-to-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/compute-gap-75-to-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/compute-gap-75-to-1.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/compute-gap-75-to-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/compute-gap-75-to-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/compute-gap-75-to-1.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/compute-gap-75-to-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/compute-gap-75-to-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/compute-gap-75-to-1.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/compute-gap-75-to-1.png"
alt="The 75 to 1 AI compute gap between industry and science: Big Tech AI capex at over 250 billion dollars per year versus total federal AI R&amp;amp;D spending at 3.3 billion, DOE FASST at 2.4 billion authorized but pending, DOE Genesis at 320 million one-time, and NSF core AI at 494 million per year"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-compute-gap-75-to-1-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/compute-gap-75-to-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="The 75 to 1 AI compute gap between industry and science: Big Tech AI capex at over 250 billion dollars per year versus total federal AI R&amp;amp;D spending at 3.3 billion, DOE FASST at 2.4 billion authorized but pending, DOE Genesis at 320 million one-time, and NSF core AI at 494 million per year" decoding="async"&gt;
&lt;/dialog&gt;
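&lt;p&gt;The headline ratio is one division, using the figures above:&lt;/p&gt;

```python
# Commercial AI capex vs total federal AI R&amp;D spending, per the post's figures.
big_tech_capex_b = 250.0  # Big Tech AI infrastructure, $B per year, 2024-25
federal_ai_rd_b = 3.3     # total US federal AI R&amp;D, $B per year

print(f"{big_tech_capex_b / federal_ai_rd_b:.0f}:1")  # 76:1, rounded down to 75:1 in the post
```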
&lt;p&gt;The DOE&amp;rsquo;s &lt;a href="https://www.anl.gov/article/what-were-argonnes-top-science-research-breakthroughs-in-2025"&gt;Genesis Mission&lt;/a&gt; announced $320 million in December 2025. That is less than what Meta spends on AI infrastructure in a week. The &lt;a href="https://federalbudgetiq.com/insights/federal-ai-and-it-research-and-development-spending-analysis/"&gt;FASST initiative&lt;/a&gt; authorized $2.4 billion per year for five years, $12 billion total, but congressional appropriations are still pending. The US has three exascale supercomputers at national labs. These serve all of science, not just AI.&lt;/p&gt;
&lt;p&gt;If AI has already produced results in theoretical physics, materials science, fusion energy, and drug discovery with what amounts to scraps from the commercial table, what happens when someone makes a serious allocation? &lt;a href="https://fortune.com/2026/02/11/demis-hassabis-nobel-google-deepmind-predicts-ai-renaissance-radical-abundance/"&gt;Hassabis told Fortune&lt;/a&gt; in February 2026 that in 10 to 15 years &amp;ldquo;we&amp;rsquo;ll be in a kind of new golden era of discovery, a kind of new renaissance.&amp;rdquo; He described a vision of &amp;ldquo;radical abundance&amp;rdquo; where AI has &amp;ldquo;successfully bottled the scientific method.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.goldmansachs.com/insights/articles/generative-ai-could-raise-global-gdp-by-7-percent"&gt;Goldman Sachs estimates&lt;/a&gt; generative AI could raise global GDP by 7%, roughly $7 trillion. &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-next-innovation-revolution-powered-by-ai"&gt;McKinsey pegs&lt;/a&gt; R&amp;amp;D-specific value at $360 to $560 billion annually, but explicitly noted they did not attempt to estimate&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the value of truly breakthrough innovations that transform markets (if, for example, nuclear fusion was to enable limitless, clean electricity production).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="the-bear-case-pattern-matching-is-not-physics"&gt;The bear case: pattern matching is not physics&lt;/h2&gt;
&lt;p&gt;The bear case is simple and serious. AI is the best pattern-matching system ever built. Physics does not advance by pattern matching. It advances by conceptual revolution: Riemannian geometry for general relativity, an entirely new mathematical framework for quantum mechanics, gauge theory for the Standard Model. None of these were discoverable in existing data.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://medium.com/@abdullrahmanburhan36/noam-chomsky-on-the-false-promise-of-chatgpt-18c70cda5e24"&gt;Noam Chomsky&lt;/a&gt; argued in the &lt;em&gt;New York Times&lt;/em&gt; that AI&amp;rsquo;s deepest flaw &amp;ldquo;is the absence of the most critical capacity of any intelligence: to say not only what is the case &amp;hellip; but also what is not the case and what could and could not be the case.&amp;rdquo; A commenter on &lt;a href="https://www.math.columbia.edu/~woit/wordpress/?p=15362"&gt;Peter Woit&amp;rsquo;s blog&lt;/a&gt; at Columbia spent &amp;ldquo;over 100 hours probing these models&amp;rdquo; on open problems and found they &amp;ldquo;basically never try to come up with something new&amp;rdquo; when the answer is not already in the training data.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.darioamodei.com/essay/machines-of-loving-grace"&gt;Dario Amodei&lt;/a&gt; was notably careful in &amp;ldquo;Machines of Loving Grace.&amp;rdquo; He predicted AI could compress 50 to 100 years of biological progress into 5 to 10 years, but on physics he hedged: particle physicists are &amp;ldquo;limited by data from particle accelerators&amp;rdquo; and &amp;ldquo;it&amp;rsquo;s not clear that they would do drastically better if they were superintelligent.&amp;rdquo; Some problems are not compute-limited. They are experiment-limited, or concept-limited, or both.&lt;/p&gt;
&lt;p&gt;Stephen Wolfram&amp;rsquo;s principle of computational irreducibility poses the hardest theoretical limit: some systems cannot be predicted by any shortcut. The only way to know what they do is to run them. If fundamental physics contains computationally irreducible problems, no amount of AI compute will crack them.&lt;/p&gt;
&lt;p&gt;But &lt;a href="https://mariokrenn.wordpress.com/"&gt;Mario Krenn&lt;/a&gt; at Max Planck offers a counterpoint from the lab bench. His team published in &lt;em&gt;Physical Review X&lt;/em&gt; on AI-discovered gravitational wave detector designs that outperform human designs, and in &lt;em&gt;Science Advances&lt;/em&gt; on an AI-discovered violation of Bell inequality with unentangled photons. He does not claim AI understands physics. He claims it finds things physicists miss: &amp;ldquo;I let the algorithm run, and within a few hours it found exactly the solution that we as human scientists couldn&amp;rsquo;t find for many weeks.&amp;rdquo;&lt;/p&gt;
&lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai-science-paradox-png-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai-science-paradox.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai-science-paradox.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai-science-paradox.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai-science-paradox.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-science-paradox.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai-science-paradox.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai-science-paradox.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai-science-paradox.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-science-paradox.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai-science-paradox.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai-science-paradox.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-science-paradox.png"
alt="The AI scientific discovery paradox: quantity metrics surging with 3x more papers published, 4.8x more citations received, and 33 percent more arXiv preprints, but quality metrics declining with 4.6 percent less topical territory covered, 22 percent less cross-paper engagement, and researchers herding toward the same topics"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai-science-paradox-png-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai-science-paradox.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="The AI scientific discovery paradox: quantity metrics surging with 3x more papers published, 4.8x more citations received, and 33 percent more arXiv preprints, but quality metrics declining with 4.6 percent less topical territory covered, 22 percent less cross-paper engagement, and researchers herding toward the same topics" decoding="async"&gt;
&lt;/dialog&gt;
&lt;h2 id="two-roads"&gt;Two roads&lt;/h2&gt;
&lt;p&gt;The nuclear parallel is the one that matters. Fission was discovered in Berlin in December 1938. Hiroshima was August 1945. Seven years from pure physics to weapon. The first nuclear power plant came nine years later. Oppenheimer captured the dynamic: &amp;ldquo;When you see something that is technically sweet, you go ahead and do it, and you argue about what to do about it only after you have had your technical success.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Every AI-accelerated physics breakthrough is inherently dual-use technology. The &lt;a href="https://www.peaknano.com/blog/the-iaea-world-fusion-outlook-2025"&gt;IAEA reports&lt;/a&gt; 35 of 45 private fusion companies expect commercial pilot plants between 2030 and 2035. Commonwealth Fusion Systems has raised roughly $3 billion. &lt;a href="https://english.news.cn/20250724/213ed7ff0e954935bd5645b30a9dafe3/c.html"&gt;China established a state-owned fusion company&lt;/a&gt; in July 2025. The fusion market is projected at $430 billion by 2030. The same plasma control AI that keeps a tokamak stable could, in principle, optimize weapons physics.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know which road we&amp;rsquo;re on. I&amp;rsquo;m not sure anyone does. But the velocity of AI scientific discovery, from Olympiad geometry problems to a gold medal at the International Mathematical Olympiad to a result in theoretical physics, all within 25 months, suggests the question will be answered empirically rather than philosophically. And probably sooner than the physicists expect.&lt;/p&gt;
&lt;p&gt;The cost of intelligence has fallen roughly &lt;a href="https://blog.samaltman.com/three-observations"&gt;150x&lt;/a&gt; in two years. The cost of pointing it at physics is a policy choice, not a technical constraint. The 75:1 compute gap between commercial and scientific AI spending is the number that determines how fast this goes. Whether it should go fast is a different question entirely.&lt;/p&gt;</description></item><item><title>Every Bulge Bracket Bank Agrees on AI</title><link>https://philippdubach.com/posts/every-bulge-bracket-bank-agrees-on-ai/</link><pubDate>Sun, 01 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/every-bulge-bracket-bank-agrees-on-ai/</guid><description>&lt;figure class="post-figure" style="width: 100%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-pdf_covers_overview-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/pdf_covers_overview.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/pdf_covers_overview.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/pdf_covers_overview.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/pdf_covers_overview.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/pdf_covers_overview.png 1200w"
sizes="100vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/pdf_covers_overview.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/pdf_covers_overview.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/pdf_covers_overview.png 1440w"
sizes="100vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/pdf_covers_overview.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/pdf_covers_overview.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/pdf_covers_overview.png 2000w"
sizes="100vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/pdf_covers_overview.png"
alt="Cover pages of 12 AI research reports from Goldman Sachs, JPMorgan, Morgan Stanley, UBS, Barclays, Bank of America, HSBC, Citi, Deutsche Bank, and Santander."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-pdf_covers_overview-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/pdf_covers_overview.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Cover pages of 12 AI research reports from Goldman Sachs, JPMorgan, Morgan Stanley, UBS, Barclays, Bank of America, HSBC, Citi, Deutsche Bank, and Santander." decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;I spent the last week reading 12 bank AI research reports from ten of the world&amp;rsquo;s largest financial institutions: Goldman Sachs, JPMorgan, Morgan Stanley (three separate reports), UBS, Barclays, Bank of America, HSBC, Citi, Deutsche Bank, and Santander. I wanted to understand how institutions that collectively manage trillions of dollars and employ thousands of analysts actually see this technology heading into 2026: where they agree, where they diverge, and what they&amp;rsquo;re being less than forthcoming about.&lt;/p&gt;
&lt;p&gt;What I found is useful, sometimes impressive, and &lt;em&gt;(mostly)&lt;/em&gt; worth reading.&lt;/p&gt;
&lt;h2 id="concerning-consensus"&gt;Concerning consensus&lt;/h2&gt;
&lt;p&gt;Every single institution frames AI as a general-purpose technology, not a product cycle. The analogies converge almost word-for-word: &lt;a href="https://www.goldmansachs.com/what-we-do/investment-banking/insights/articles/powering-the-ai-era/report.pdf"&gt;Goldman Sachs&lt;/a&gt; draws the line through railroads, electrification, and telecom. &lt;a href="https://www.santander.com/en/press-room/the-year-ahead-2025/the-macroeconomic-effects-of-artificial-intelligence"&gt;Santander&lt;/a&gt; deploys a formal three-stage GPT framework: steam, ICT, AI. &lt;a href="https://www.morganstanley.com/im/en-us/individual-investor/insights/tales-from-the-emerging-world/ais-silicon-backbone.html"&gt;Morgan Stanley&amp;rsquo;s semiconductor team&lt;/a&gt; writes that AI is &amp;ldquo;closer to electricity than consumer gadgets.&amp;rdquo; Deutsche Bank projects &lt;strong&gt;+$7 trillion&lt;/strong&gt; in global GDP over the decade. &lt;a href="https://www.ubs.com/global/en/wealthmanagement/insights/artificial-intelligence.html"&gt;UBS&lt;/a&gt; puts the AI revenue opportunity at &lt;strong&gt;$2.6 trillion&lt;/strong&gt; by 2030.&lt;/p&gt;
&lt;p&gt;Not one of the twelve reports seriously entertains the possibility that AI is more like 3D printing: genuinely useful in pockets, broadly disappointing in aggregate. Santander comes closest, citing &lt;a href="https://www.nber.org/papers/w32487"&gt;Daron Acemoglu&amp;rsquo;s&lt;/a&gt; conservative &lt;strong&gt;+0.7% cumulative TFP&lt;/strong&gt; estimate over ten years, but even Santander frames that as the floor of the range, not the central case. The optimistic end of the same distribution sits at &lt;strong&gt;+10–15%&lt;/strong&gt;. That&amp;rsquo;s not a rounding error. It&amp;rsquo;s a fundamental disagreement about whether AI will re-run the productivity miracle of electrification or prove more modest in aggregate, and most banks quietly pick the point on the distribution that best supports their commercial positioning.&lt;/p&gt;
&lt;p&gt;The chart below plots each bank by how bullish they are on AI&amp;rsquo;s economic impact against how grounded their analysis is in current empirical data versus forward projections. Bank of America sits alone in the top-right: data-driven and moderately bullish. Goldman sits at the bottom-right: maximally bullish, maximally projective. Santander is the lone occupant of the top-left: empirical and cautious.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-exhibit-1-macro-conviction1-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/exhibit-1-macro-conviction1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/exhibit-1-macro-conviction1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/exhibit-1-macro-conviction1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/exhibit-1-macro-conviction1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-1-macro-conviction1.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/exhibit-1-macro-conviction1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/exhibit-1-macro-conviction1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/exhibit-1-macro-conviction1.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-1-macro-conviction1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/exhibit-1-macro-conviction1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/exhibit-1-macro-conviction1.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-1-macro-conviction1.png"
alt="Bank AI research reports compared on two axes: macro conviction (cautious to bullish) and evidence basis (projective to empirical). BofA is the only data-driven bull. Goldman Sachs is a projective bull. Santander is the only data-driven skeptic. Most institutions cluster in the bullish-projective quadrant."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-exhibit-1-macro-conviction1-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/exhibit-1-macro-conviction1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Bank AI research reports compared on two axes: macro conviction (cautious to bullish) and evidence basis (projective to empirical). BofA is the only data-driven bull. Goldman Sachs is a projective bull. Santander is the only data-driven skeptic. Most institutions cluster in the bullish-projective quadrant." decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;That chart is an editorial interpretation, not a precise measurement. But the shape is right. Bank of America is the only institution that consistently anchors its claims to actual GDP data rather than projections. Goldman Sachs, at the other extreme, produces a report that reads as a pitch to every infrastructure CFO and sovereign wealth fund in the world. Both can be making valid arguments. They&amp;rsquo;re just not making the same kind.&lt;/p&gt;
&lt;h2 id="whats-happening-vs-what-might-happen"&gt;What’s happening vs. what might happen&lt;/h2&gt;
&lt;p&gt;BofA and Santander are the two worth pausing on, because they&amp;rsquo;re doing something different from the rest: they&amp;rsquo;re reporting what&amp;rsquo;s happening rather than what might happen.&lt;/p&gt;
&lt;p&gt;Bank of America, using Bureau of Labor Statistics and Bureau of Economic Analysis data, finds that AI capex contributed &lt;strong&gt;1.4–1.5 percentage points&lt;/strong&gt; to US GDP growth in H1 2025. Headline growth rates were running around 2% in that period. So AI infrastructure spending was the single largest driver of US economic expansion. That&amp;rsquo;s a real number from real data, and it&amp;rsquo;s the most important figure in any of these reports.&lt;/p&gt;
&lt;p&gt;BofA also finds a &lt;em&gt;positive&lt;/em&gt; correlation between AI adoption and employment in white-collar sectors: software developers are up &lt;strong&gt;+17.9%&lt;/strong&gt;, while insurance appraisers, a role where AI substitutes directly for human judgment, are down &lt;strong&gt;-20%&lt;/strong&gt;. The disruption is concentrated in specific tasks. It hasn&amp;rsquo;t shown up in aggregate employment. Yet.&lt;/p&gt;
&lt;p&gt;Then there&amp;rsquo;s Santander, which writes the most academically rigorous report of the twelve and includes numbers the consensus would rather not linger on. The enterprise AI adoption rate data is sobering: only around &lt;strong&gt;10% of US companies&lt;/strong&gt; are actually using AI to produce goods and services. &lt;strong&gt;42% of companies abandoned GenAI projects in 2024&lt;/strong&gt;, a figure corroborated by &lt;a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf"&gt;MIT&amp;rsquo;s 2025 GenAI Divide research&lt;/a&gt;, which found 95% of enterprise pilots fail to reach production. Only &lt;strong&gt;1%&lt;/strong&gt; of companies describe their rollouts as mature. Meanwhile, 78% say they use AI in at least one function. The gap between &amp;ldquo;we have a pilot&amp;rdquo; and &amp;ldquo;this is generating value&amp;rdquo; is enormous.&lt;/p&gt;
&lt;p&gt;Goldman&amp;rsquo;s &lt;strong&gt;$800 million per day&lt;/strong&gt; in hyperscaler capex and Santander&amp;rsquo;s 42% abandonment rate aren&amp;rsquo;t as contradictory as they look. Capex precedes productivity in every infrastructure cycle. That part is historically unambiguous. The question is how long the gap lasts, and whether the eventual productivity gains justify what&amp;rsquo;s been spent getting there.&lt;/p&gt;
&lt;h2 id="dotcom-comparison"&gt;Dotcom comparison&lt;/h2&gt;
&lt;p&gt;Every report that addresses the bubble question reaches the same conclusion: this isn&amp;rsquo;t the late 1990s.&lt;/p&gt;
&lt;p&gt;The primary evidence is valuation. Nvidia trades at &lt;strong&gt;25–30x forward earnings&lt;/strong&gt; versus Cisco&amp;rsquo;s &lt;strong&gt;~140x&lt;/strong&gt; at the March 2000 peak. The Magnificent 6 sit at roughly &lt;strong&gt;35x&lt;/strong&gt; versus &lt;strong&gt;55x&lt;/strong&gt; for the TMT index at its apex. &lt;a href="https://www.morganstanley.com/im/en-us/individual-investor/insights/tales-from-the-emerging-world/ais-silicon-backbone.html"&gt;Morgan Stanley&amp;rsquo;s Silicon Backbone report&lt;/a&gt; makes this comparison rigorously, and I think they&amp;rsquo;re right that the earnings quality is categorically different from dot-com era technology stocks.&lt;/p&gt;
&lt;p&gt;But the comparison works less cleanly when you look at concentration rather than individual valuations. Deutsche Bank notes that the top 10 S&amp;amp;P 500 companies now represent &lt;strong&gt;40% of total market cap&lt;/strong&gt;, an extreme not seen at the dot-com peak. A &lt;a href="https://www.investing.com/news/stock-market-news/bofas-survey-shows-54-of-investors-say-ai-in-bubble-60-say-stocks-overvalued-4284842"&gt;Bank of America fund manager survey&lt;/a&gt; from October 2025 found &lt;strong&gt;54% of global managers believe AI equities are in a bubble&lt;/strong&gt;, and &lt;strong&gt;60% view global equities as overvalued&lt;/strong&gt;. You can simultaneously hold that Nvidia&amp;rsquo;s PE is reasonable and that a portfolio with 40% weight in ten companies carries concentration risk that PE comparisons don&amp;rsquo;t capture. Reassuring on one axis. Alarming on another. Most sell-side AI research cites whichever data point supports its preferred conclusion and leaves the tension sitting there unaddressed.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s also a subtler version of the bubble question that none of the twelve reports asks directly. The &amp;ldquo;infrastructure comes before productivity&amp;rdquo; argument is historically correct: railroads were overbuilt before they transformed commerce; the internet fibre glut of 1999–2000 eventually became the backbone of the digital economy. But the investors who financed Global Crossing and 360networks still lost everything. The infrastructure thesis being correct in the long run isn&amp;rsquo;t the same as every current valuation being justified. Goldman&amp;rsquo;s report is particularly careful to avoid addressing that distinction. The implicit message, &amp;ldquo;we financed the pipes before and it worked out,&amp;rdquo; skips past the question of which financiers got paid and which got wiped out in the transition.&lt;/p&gt;
&lt;h2 id="sell-side"&gt;Sell side&lt;/h2&gt;
&lt;p&gt;The following chart maps risk awareness against bullishness of tone, and the clustering is revealing.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-exhibit-3-risk-bullishness1-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/exhibit-3-risk-bullishness1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/exhibit-3-risk-bullishness1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/exhibit-3-risk-bullishness1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/exhibit-3-risk-bullishness1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-3-risk-bullishness1.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/exhibit-3-risk-bullishness1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/exhibit-3-risk-bullishness1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/exhibit-3-risk-bullishness1.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-3-risk-bullishness1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/exhibit-3-risk-bullishness1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/exhibit-3-risk-bullishness1.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-3-risk-bullishness1.png"
alt="Goldman Sachs and UBS AI research reports plotted as aggressively bullish and risk-dismissive. Santander and BofA are measured and risk-aware. HSBC is an optimistic hand-waver. Chart maps risk awareness vs bullishness of tone across 12 bank AI research reports."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-exhibit-3-risk-bullishness1-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/exhibit-3-risk-bullishness1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Goldman Sachs and UBS AI research reports plotted as aggressively bullish and risk-dismissive. Santander and BofA are measured and risk-aware. HSBC is an optimistic hand-waver. Chart maps risk awareness vs bullishness of tone across 12 bank AI research reports." decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;Goldman and UBS are in the bottom-right: aggressively bullish, risk-dismissive. Santander and BofA are in the top-left, actually wrestling with the uncertainty. HSBC is the clearest case of motivated reasoning: the report is written explicitly to stop private banking clients from panic-selling their SaaS positions after several quarters of multiple compression. &lt;em&gt;(Whether that advice turns out to be right is a separate question.)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t think this makes any of these reports dishonest. But the reader needs to supply the discount rate that each institution&amp;rsquo;s interests warrant.&lt;/p&gt;
&lt;p&gt;Goldman Sachs earns advisory fees on the data centre and energy deals it describes. Barclays lends to energy infrastructure projects. Morgan Stanley is selling both EM equity exposure and second-order stock-picking strategies through its asset management arm. UBS provides a clean three-layer investment framework that maps directly to its wealth management product shelf. Citi frames AI as accelerating the electronification of markets, the very trend that drives Citi&amp;rsquo;s trading revenue. &lt;a href="https://fortune.com/2026/02/18/will-ai-destroy-jobs-deutsche-bank-asks-ai-to-predict/"&gt;Deutsche Bank&lt;/a&gt;, most self-aware of the ten, used AI to generate its AI report. The meta-commentary is right there in the methodology.&lt;/p&gt;
&lt;p&gt;Not a single report concludes &amp;ldquo;this may be overhyped and you should meaningfully reduce exposure.&amp;rdquo; Every institution has a commercial interest in the AI narrative staying bullish. That doesn&amp;rsquo;t mean the narrative is wrong. It does mean unanimous conviction from ten sell-side AI research teams is not the same thing as ten independent analyses reaching the same conclusion.&lt;/p&gt;
&lt;h2 id="second-order-ai-beneficiaries"&gt;Second-order AI beneficiaries&lt;/h2&gt;
&lt;p&gt;The next two charts contain what I think is the most interesting tension across all twelve reports.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-exhibit-2-value-chain1-png-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/exhibit-2-value-chain1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/exhibit-2-value-chain1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/exhibit-2-value-chain1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/exhibit-2-value-chain1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-2-value-chain1.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/exhibit-2-value-chain1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/exhibit-2-value-chain1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/exhibit-2-value-chain1.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-2-value-chain1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/exhibit-2-value-chain1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/exhibit-2-value-chain1.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-2-value-chain1.png"
alt="Value chain focus vs time horizon: which banks favour first-order AI enablers (chips, data centres) vs second-order AI beneficiaries (deploying companies). Goldman Sachs and Barclays are near-term first-order plays. Morgan Stanley second-order report sits in long-term deployers quadrant."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-exhibit-2-value-chain1-png-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/exhibit-2-value-chain1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Value chain focus vs time horizon: which banks favour first-order AI enablers (chips, data centres) vs second-order AI beneficiaries (deploying companies). Goldman Sachs and Barclays are near-term first-order plays. Morgan Stanley second-order report sits in long-term deployers quadrant." decoding="async"&gt;
&lt;/dialog&gt;
&lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-exhibit-4-disruption-timeline1-png-6" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/exhibit-4-disruption-timeline1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/exhibit-4-disruption-timeline1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/exhibit-4-disruption-timeline1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/exhibit-4-disruption-timeline1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-4-disruption-timeline1.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/exhibit-4-disruption-timeline1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/exhibit-4-disruption-timeline1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/exhibit-4-disruption-timeline1.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-4-disruption-timeline1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/exhibit-4-disruption-timeline1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/exhibit-4-disruption-timeline1.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/exhibit-4-disruption-timeline1.png"
alt="AI disruption magnitude vs timeline across 12 bank research reports. Goldman Sachs and Barclays expect large near-term disruption. Santander sees incremental long-term change. Morgan Stanley robotics and JPMorgan see radical but distant disruption. BofA sees moderate disruption already underway."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-exhibit-4-disruption-timeline1-png-6" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/exhibit-4-disruption-timeline1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="AI disruption magnitude vs timeline across 12 bank research reports. Goldman Sachs and Barclays expect large near-term disruption. Santander sees incremental long-term change. Morgan Stanley robotics and JPMorgan see radical but distant disruption. BofA sees moderate disruption already underway." decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;&lt;a href="https://www.morganstanley.com/im/en-us/individual-investor/insights/articles/investing-in-second-order-effects.html"&gt;Morgan Stanley&amp;rsquo;s Counterpoint Global team&lt;/a&gt;, in the second-order effects report, presents historical data that should make the rest of this collection at least slightly uncomfortable. In the railroad era, Walmart&amp;rsquo;s equivalent outperformed Ford&amp;rsquo;s equivalent by &lt;strong&gt;1,622x to 23x&lt;/strong&gt;. In the internet era, Netflix returned &lt;strong&gt;519x&lt;/strong&gt; versus Cisco&amp;rsquo;s &lt;strong&gt;4x&lt;/strong&gt;. It&amp;rsquo;s the same pattern every time: the companies that &lt;em&gt;use&lt;/em&gt; the infrastructure to serve customers dramatically outperform the companies that &lt;em&gt;build&lt;/em&gt; it.&lt;/p&gt;
&lt;p&gt;Yet nearly every bank&amp;rsquo;s actual investment positioning sits overwhelmingly in first-order enablers: Nvidia, ASML, hyperscalers, data centre REITs, nuclear utilities. Either the historical pattern won&amp;rsquo;t repeat this time (possible, but not argued anywhere in these reports), there&amp;rsquo;s a valid timing explanation (first-order wins in the buildout phase, second-order wins in deployment), or most of these recommendations will look dated within five years.&lt;/p&gt;
&lt;p&gt;Morgan Stanley&amp;rsquo;s own three reports collectively make the case for second-order investing over the long run while still recommending first-order plays in the near term. That&amp;rsquo;s not quite inconsistent. But the tension deserves more acknowledgment than it gets.&lt;/p&gt;
&lt;h2 id="power"&gt;Power&lt;/h2&gt;
&lt;p&gt;If I had to pick one analytical claim that holds up regardless of where the productivity debate lands, it&amp;rsquo;s this: power is the binding constraint, and the infrastructure required to relieve it is real, expensive, and already being built.&lt;/p&gt;
&lt;p&gt;The numbers are consistent across institutions. US data centre power consumption runs at &lt;strong&gt;150–175 TWh&lt;/strong&gt; today. &lt;a href="https://www.ib.barclays/our-insights/ai-revolution-meeting-massive-infrastructure-demand.html"&gt;Barclays&lt;/a&gt; projects &lt;strong&gt;560 TWh by 2030&lt;/strong&gt;, approximately 13% of total US electricity. Goldman Sachs estimates &lt;strong&gt;60%&lt;/strong&gt; of new data centre power through 2030 will require net-new generation capacity. The US power grid has an average age of &lt;strong&gt;40 years&lt;/strong&gt;. Token consumption grew &lt;strong&gt;4,274%&lt;/strong&gt; in a single year. Data centre construction spending has grown roughly &lt;strong&gt;60% year-on-year&lt;/strong&gt; since ChatGPT launched in late 2022.&lt;/p&gt;
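A quick sanity check on those projections, as a minimal sketch: the only figure below that is not from the text is total US generation, which I assume at roughly 4,250 TWh per year.

```python
# Back-of-the-envelope check on the Barclays projection. All inputs are from
# the text except us_total_twh, which is an assumed figure.
current_twh = (150 + 175) / 2      # midpoint of today's data-centre consumption
projected_twh_2030 = 560           # Barclays' 2030 projection
us_total_twh = 4250                # assumed total US generation (hypothetical)

years = 6                          # roughly 2024 through 2030
cagr = (projected_twh_2030 / current_twh) ** (1 / years) - 1
share_2030 = projected_twh_2030 / us_total_twh

print(f"implied growth: {cagr:.0%}/year")   # prints 23%/year
print(f"2030 share:     {share_2030:.0%}")  # prints 13%, matching the text
```

The implied compound growth rate of roughly 23% per year is what makes the net-new-generation claim bite: no grid with a 40-year average asset age adds capacity at that pace.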
&lt;p&gt;Barclays frames this as a Jevons paradox: efficiency improvements in model inference will, counterintuitively, increase total energy consumption because they make AI cheaper and drive higher usage. I think that&amp;rsquo;s right. It&amp;rsquo;s exactly how personal computing and the internet played out. Every report that addresses energy lands on nuclear as the preferred long-term solution: &lt;a href="https://www.energy.gov/ne/articles/9-key-takeaways-president-trumps-executive-orders-nuclear-energy"&gt;four executive orders&lt;/a&gt; in early 2025, a 400 GW capacity target by 2050, the &lt;a href="https://www.constellationenergy.com/news/2024/Constellation-to-Launch-Crane-Clean-Energy-Center-Restoring-Jobs-and-Carbon-Free-Power-to-The-Grid.html"&gt;Three Mile Island restart&lt;/a&gt;. That consensus may prove correct. It may also be the sector where the infrastructure-before-returns gap runs longest.&lt;/p&gt;
&lt;h2 id="what-the-reports-dont-say"&gt;What the reports don&amp;rsquo;t say&lt;/h2&gt;
&lt;p&gt;The quadrant charts map where the banks are looking. They&amp;rsquo;re less revealing about what&amp;rsquo;s off the frame entirely.&lt;/p&gt;
&lt;p&gt;No report models a structured downside scenario: AI capex producing disappointing returns, hyperscalers pulling back, or a major data centre financing default triggering something worse. The closest is Santander&amp;rsquo;s 42% abandonment statistic, but even Santander doesn&amp;rsquo;t ask what happens if that number climbs to 60%.&lt;/p&gt;
&lt;p&gt;No report discusses AI safety or alignment risks. &lt;a href="https://www.ubs.com/global/en/wealthmanagement/insights/artificial-intelligence.html"&gt;UBS&lt;/a&gt; notes that AI task completion duration has doubled every seven months and explicitly references the AGI trajectory, then moves directly to investment implications, as if &amp;ldquo;AGI trajectory&amp;rdquo; carries no risk premium at all. I find that strange.&lt;/p&gt;
&lt;p&gt;The collision between AI energy demand and climate commitments gets almost no treatment. Only &lt;a href="https://www.ib.barclays/our-insights/ai-revolution-meeting-massive-infrastructure-demand.html"&gt;Barclays&lt;/a&gt; mentions that global CO2 emissions hit a record &lt;strong&gt;37.7 gigatonnes&lt;/strong&gt; &lt;a href="https://www.iea.org/reports/global-energy-review-2025/co2-emissions"&gt;in 2023&lt;/a&gt;. The institutions projecting AI consuming 13% of US electricity by 2030 don&amp;rsquo;t reconcile that with the net-zero commitments in their own sustainability reports.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.jpmorganchase.com/content/dam/jpmorganchase/documents/center-for-geopolitics/decoding-the-new-global-operating-system.pdf"&gt;JPMorgan&lt;/a&gt;, which provides the most detailed geopolitical analysis of the twelve, never models a Taiwan Strait disruption scenario. &lt;a href="https://www.morganstanley.com/im/en-us/individual-investor/insights/tales-from-the-emerging-world/ais-silicon-backbone.html"&gt;Morgan Stanley&lt;/a&gt; identifies Taiwan, Korea, and China as &amp;ldquo;irreplaceable&amp;rdquo; nodes in the AI hardware supply chain, while calling emerging market semiconductor exposure &amp;ldquo;long-term infrastructure participation.&amp;rdquo; Those two characterisations sit in very uncomfortable proximity, and neither report acknowledges it.&lt;/p&gt;
&lt;p&gt;I came away from this with real respect for several of these pieces, particularly BofA&amp;rsquo;s empirical rigour and Santander&amp;rsquo;s willingness to cite unflattering numbers. The energy infrastructure thesis seems to me the most durable of the lot: the power bottleneck is real regardless of where you land on the productivity question.&lt;/p&gt;
&lt;p&gt;But I also came away convinced that this consensus is shaped as much by institutional incentive as by analytical independence. When nine institutions with combined AI-related revenue exposure in the hundreds of billions all agree you should increase AI exposure, the interesting question isn&amp;rsquo;t whether they&amp;rsquo;re right. They may well be.&lt;/p&gt;</description></item><item><title>When AI Labs Become Defense Contractors</title><link>https://philippdubach.com/posts/when-ai-labs-become-defense-contractors/</link><pubDate>Sun, 01 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/when-ai-labs-become-defense-contractors/</guid><description>&lt;p&gt;&lt;a href="https://airandspace.si.edu/collection-objects/lockheed-vega-5b-amelia-earhart/nasm_A19670093000"&gt;Lockheed started by building Amelia Earhart&amp;rsquo;s favorite plane&lt;/a&gt;. Then came a government loan guarantee in 1971 (the L-1011 TriStar nearly killed the company), a Cold War, decades of consolidation, and now a business that earns &lt;a href="https://news.lockheedmartin.com/2025-01-28-Lockheed-Martin-Reports-Fourth-Quarter-and-Full-Year-2024-Financial-Results"&gt;&lt;strong&gt;92.5%&lt;/strong&gt; of its revenue from government contracts&lt;/a&gt;, with the F-35 alone accounting for &lt;strong&gt;26%&lt;/strong&gt; of its $71 billion in annual sales. The process took about 50 years. AI labs becoming defense contractors will happen faster.&lt;/p&gt;
&lt;p&gt;On February 27, 2026, two things happened within hours of each other. President Trump ordered every federal agency to &lt;a href="https://www.cnbc.com/2026/02/27/trump-anthropic-ai-pentagon.html"&gt;&amp;ldquo;IMMEDIATELY CEASE all use of Anthropic&amp;rsquo;s technology&amp;rdquo;&lt;/a&gt; after CEO Dario Amodei refused to strip safety constraints from Claude&amp;rsquo;s Pentagon deployment, &lt;a href="https://www.anthropic.com/news/statement-department-of-war"&gt;specifically prohibitions on mass domestic surveillance and fully autonomous weapons&lt;/a&gt;. Defense Secretary Pete Hegseth then labeled Anthropic a &lt;a href="https://www.cbsnews.com/news/hegseth-declares-anthropic-supply-chain-risk/"&gt;&amp;ldquo;Supply-Chain Risk to National Security,&amp;rdquo;&lt;/a&gt; a designation previously reserved for foreign adversaries like Huawei, &lt;a href="https://fortune.com/2026/02/28/openai-pentagon-deal-anthropic-designated-supply-chain-risk-unprecedented-action-damage-its-growth/"&gt;never before applied to an American company&lt;/a&gt;. That evening, Sam Altman announced that OpenAI had signed a deal to deploy its models on the Pentagon&amp;rsquo;s classified network, &lt;a href="https://x.com/sama/status/2027578652477821175"&gt;posting that the Department of War &amp;ldquo;displayed a deep respect for safety.&amp;rdquo;&lt;/a&gt; (Whether that reflects the Pentagon&amp;rsquo;s actual position or Altman&amp;rsquo;s political optimism remains unclear for now.)&lt;/p&gt;
&lt;p&gt;Most coverage has framed this as an ethics dispute. I think that framing is going to age poorly. What I see is the economics of defense spending doing what they have always done to every company they touch, and the ethics arguments becoming less audible as the financial gravity increases.&lt;/p&gt;
&lt;h2 id="the-last-supper-and-defense-industry-consolidation"&gt;The Last Supper and defense industry consolidation&lt;/h2&gt;
&lt;p&gt;In the summer of 1993, Secretary of Defense Les Aspin and Deputy Secretary William Perry invited the CEOs of America&amp;rsquo;s defense firms to dinner at the Pentagon and told them, in so many words, that most of them would not survive. Cold War budget cuts meant the government could sustain roughly one prime contractor per equipment category. &lt;a href="https://www.defensenews.com/industry/2024/02/20/the-pentagon-wants-industry-to-transform-again-to-meet-demand-can-it/"&gt;Norman Augustine, then CEO of Martin Marietta, named it the Last Supper.&lt;/a&gt; The message was clear: consolidate or die, and the government would not stop you from consolidating.&lt;/p&gt;
&lt;p&gt;The restructuring that followed was fast, even by M&amp;amp;A standards. &lt;a href="https://en.wikipedia.org/wiki/Last_Supper_(defense_industry)"&gt;Within four years, &lt;strong&gt;51 prime defense contractors collapsed into five&lt;/strong&gt;&lt;/a&gt;: &lt;a href="https://www.ftc.gov/news-events/news/press-releases/1995/05/lockheed-corporation"&gt;Lockheed merged with Martin Marietta in 1995 ($10 billion)&lt;/a&gt;, &lt;a href="https://boeing.mediaroom.com/1997-07-31-Boeing-Completes-McDonnell-Douglas-Merger"&gt;Boeing absorbed McDonnell Douglas in 1997 ($13.3 billion)&lt;/a&gt;, Raytheon folded in Hughes Electronics and Texas Instruments&amp;rsquo; defense unit. Between 2011 and 2015, &lt;a href="https://www.defensenews.com/breaking-news/2017/12/14/american-exodus-17000-us-defense-suppliers-may-have-left-the-defense-sector/"&gt;an additional &lt;strong&gt;17,000 U.S. companies exited the defense industry&lt;/strong&gt;&lt;/a&gt;, a contraction that hollowed out the supplier base the Big Five still depend on today.&lt;/p&gt;
&lt;p&gt;The revenue dependency data shows what happens to the companies on the inside of that consolidation. Boeing before 1997 was, as &lt;a href="https://www.cnn.com/2024/01/30/business/boeing-history-of-problems"&gt;Bank of America analyst Ron Epstein put it&lt;/a&gt;, &amp;ldquo;a company where engineers were high church.&amp;rdquo; Post-merger, Boeing relocated its headquarters from Seattle&amp;rsquo;s engineering center to Chicago, physically separating leadership from manufacturing. &lt;a href="https://boeing.mediaroom.com/2025-01-28-Boeing-Reports-Fourth-Quarter-Results"&gt;Defense rose to &lt;strong&gt;35.8% of Boeing&amp;rsquo;s FY2024 revenue&lt;/strong&gt; ($23.9 billion)&lt;/a&gt;. The cultural shift that merger carried, financial discipline over engineering judgment, is what most 737 MAX post-mortems eventually trace back to. Companies don&amp;rsquo;t plan to end up here. They respond to incentives, and the incentives compound.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-government-revenue-dependency-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/government-revenue-dependency.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/government-revenue-dependency.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/government-revenue-dependency.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/government-revenue-dependency.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/government-revenue-dependency.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/government-revenue-dependency.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/government-revenue-dependency.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/government-revenue-dependency.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/government-revenue-dependency.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/government-revenue-dependency.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/government-revenue-dependency.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/government-revenue-dependency.png"
alt="Government revenue dependency across defense primes and AI defense contractors: Lockheed Martin at 92.5%, RTX at 55%, Boeing at 35.8%, Palantir at 53.7%, OpenAI at 5%, and Anthropic at 2%, showing how classified defense work creates a one-way revenue ratchet"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-government-revenue-dependency-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/government-revenue-dependency.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Government revenue dependency across defense primes and AI defense contractors: Lockheed Martin at 92.5%, RTX at 55%, Boeing at 35.8%, Palantir at 53.7%, OpenAI at 5%, and Anthropic at 2%, showing how classified defense work creates a one-way revenue ratchet" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The AI industry will face the same incentives, just faster, and through a different mechanism: not M&amp;amp;A but access to classified networks and government-funded compute.&lt;/p&gt;
&lt;h2 id="how-pentagon-ai-spending-reshapes-a-company"&gt;How Pentagon AI spending reshapes a company&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://defensescoop.com/2024/03/11/pentagon-ai-budget-request-2025/"&gt;The FY2025 DoD AI budget was &lt;strong&gt;$1.8 billion&lt;/strong&gt;&lt;/a&gt;, a figure that nearly everyone involved described as insufficient. &lt;a href="https://defensescoop.com/2025/06/26/dod-fy26-budget-request-autonomy-unmanned-systems/"&gt;The FY2026 budget request earmarks &lt;strong&gt;$13.4 billion&lt;/strong&gt; for AI and autonomous systems&lt;/a&gt;, a roughly 7x increase in a single budget cycle, and the first time these technologies have their own standalone line item inside a total defense request of &lt;strong&gt;$892.6 billion&lt;/strong&gt;. For context: &lt;a href="https://siliconangle.com/2026/02/12/anthropic-closes-30b-round-annualized-revenue-tops-14b/"&gt;Anthropic&amp;rsquo;s full annualized revenue as of February 2026 was approximately &lt;strong&gt;$14 billion&lt;/strong&gt;&lt;/a&gt;. The Pentagon just made AI a budget category larger than most of the companies selling it.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-dod-ai-budget-context-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/dod-ai-budget-context.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/dod-ai-budget-context.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/dod-ai-budget-context.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/dod-ai-budget-context.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/dod-ai-budget-context.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/dod-ai-budget-context.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/dod-ai-budget-context.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/dod-ai-budget-context.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/dod-ai-budget-context.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/dod-ai-budget-context.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/dod-ai-budget-context.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/dod-ai-budget-context.png"
alt="Pentagon AI budget FY2026 at $13.4 billion compared to AI lab revenues: a 7x jump from $1.8 billion in FY2025, set against Anthropic annualized revenue of $14 billion, OpenAI FY2025 revenue of $13.1 billion, and Palantir FY2025 revenue of $4.48 billion"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-dod-ai-budget-context-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/dod-ai-budget-context.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Pentagon AI budget FY2026 at $13.4 billion compared to AI lab revenues: a 7x jump from $1.8 billion in FY2025, set against Anthropic annualized revenue of $14 billion, OpenAI FY2025 revenue of $13.1 billion, and Palantir FY2025 revenue of $4.48 billion" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;Anthropic burns an estimated $3–5 billion annually; &lt;a href="https://www.cnbc.com/2026/02/20/openai-resets-spend-expectations-targets-around-600-billion-by-2030.html"&gt;OpenAI burned approximately &lt;strong&gt;$8 billion in 2025&lt;/strong&gt;&lt;/a&gt;. Neither has a clear path to profitability before 2027 at the earliest. Government AI contracts offer something consumer businesses cannot: predictable, multi-year, politically protected revenue streams that don&amp;rsquo;t churn when a competitor releases a better model.&lt;/p&gt;
&lt;p&gt;The defense procurement structures deepen that dependency over time. &lt;a href="https://www.congress.gov/crs-product/IF12558"&gt;IDIQ contracts (Indefinite Delivery, Indefinite Quantity), which now account for roughly &lt;strong&gt;56% of DoD contract award dollars&lt;/strong&gt;&lt;/a&gt;, run five years with extension options. &lt;a href="https://defensescoop.com/2025/05/23/dod-palantir-maven-smart-system-contract-increase/"&gt;Palantir&amp;rsquo;s Maven Smart System contract started at $480 million and expanded to &lt;strong&gt;nearly $1.3 billion through 2029&lt;/strong&gt;&lt;/a&gt;. The JWCC cloud contract, which replaced the &lt;a href="https://www.cnbc.com/2021/07/06/pentagon-cancels-10-billion-jedi-cloud-contract.html"&gt;cancelled $10 billion JEDI contract&lt;/a&gt;, placed over &lt;strong&gt;$3.9 billion in task orders within three years&lt;/strong&gt; of award to AWS, Google, Microsoft, and Oracle. Once embedded in classified systems, switching costs become close to prohibitive. A competitor cannot simply offer better inference speed.&lt;/p&gt;
&lt;p&gt;Security clearances are maybe the most underappreciated asset in the defense tech ecosystem. &lt;a href="https://federalnewsnetwork.com/defense-main/2025/05/dcsa-backlog-of-security-clearance-investigations-down-24/"&gt;Processing a clearance takes an average of &lt;strong&gt;243 days end-to-end&lt;/strong&gt;&lt;/a&gt;, up to a year for TS/SCI with polygraph. Only around &lt;strong&gt;4.2 million Americans&lt;/strong&gt; hold active clearances, roughly 2.5% of the labor force, and an estimated 500,000 to 700,000 cleared positions currently sit unfilled. &lt;a href="https://news.clearancejobs.com/2025/03/20/national-security-compensation-reaches-new-high-despite-workforce-challenges/"&gt;Average cleared professional compensation hit &lt;strong&gt;$119,131 in 2025&lt;/strong&gt;; full-scope-polygraph holders averaged &lt;strong&gt;$141,299&lt;/strong&gt;&lt;/a&gt;. For AI labs accustomed to hiring from MIT, Cambridge, and ETH Zürich, the cleared talent pool is thin and gets more expensive every year.&lt;/p&gt;
&lt;p&gt;Any lab serious about classified work has to build a parallel organizational structure: separate hiring pipeline, separate facilities, separate operational security requirements. The lab that builds that structure first has a moat no competitor can cross quickly.&lt;/p&gt;
&lt;h2 id="palantirs-trajectory-as-the-defense-tech-blueprint"&gt;Palantir&amp;rsquo;s trajectory as the defense tech blueprint&lt;/h2&gt;
&lt;p&gt;The clearest view of where this ends is Palantir, which has been running the experiment at scale for a decade. &lt;a href="https://www.cnbc.com/2026/02/02/palantir-pltr-q4-2025-earnings.html"&gt;It posted &lt;strong&gt;$4.48 billion in FY2025 revenue&lt;/strong&gt;, up 56% year-over-year&lt;/a&gt;, with government comprising &lt;strong&gt;53.7%&lt;/strong&gt; of the total, down from a peak of &lt;strong&gt;58.2% in 2021&lt;/strong&gt; as its commercial AIP platform gained traction. &lt;a href="https://www.army.mil/article/287506/u_s_army_awards_enterprise_service_agreement_to_enhance_military_readiness_and_drive_operational_efficiency"&gt;Its $10 billion U.S. Army Enterprise Agreement in July 2025 consolidated 75 existing software contracts into a single framework&lt;/a&gt;. Its market capitalization reached roughly &lt;strong&gt;$320 billion&lt;/strong&gt; by late February 2026, making it worth nearly twice Boeing. The model the AI labs are now building toward is the same one: government as the client that funds and validates the technology, commercial as the client that justifies the valuation.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://news.crunchbase.com/venture/openai-raise-largest-ai-venture-deal-ever/"&gt;OpenAI at an &lt;strong&gt;$840 billion valuation&lt;/strong&gt;&lt;/a&gt; with a classified Pentagon network deal is already further down that road than most coverage acknowledges. It has &lt;a href="https://openai.com/index/openai-appoints-retired-us-army-general/"&gt;appointed retired General Paul Nakasone&lt;/a&gt;, former NSA director, to its board. It hired Dane Stuckey, who spent a decade at Palantir and served as its CISO for six of those years, &lt;a href="https://techcrunch.com/2024/10/15/former-palantir-ciso-dane-stuckey-joins-openai-to-lead-security/"&gt;as its own CISO&lt;/a&gt;. It has active job postings for Government Account Directors in Defense requiring Top Secret clearance and defense revenue targets exceeding $2 million per year.&lt;/p&gt;
&lt;p&gt;The publishing record is moving the same way. &lt;a href="https://openai.com/index/introducing-openai/"&gt;OpenAI&amp;rsquo;s 2015 founding post&lt;/a&gt; promised researchers &amp;ldquo;will be strongly encouraged to publish their work.&amp;rdquo; GPT-1 shipped with open-sourced code. GPT-2 was partially withheld in 2019, GPT-3 fully closed in 2020, GPT-4&amp;rsquo;s architecture undisclosed in 2023. OpenAI released smaller open-source models in August 2025 (its first since GPT-2, six years later) but they were text-only, trained on synthetic data, not frontier systems. &lt;a href="https://www.bloomberg.com/news/articles/2025-02-04/google-removes-language-on-weapons-from-public-ai-principles"&gt;Google removed the &amp;ldquo;AI applications we will not pursue&amp;rdquo; section from its principles in February 2025&lt;/a&gt;, including the explicit weapons prohibition. &lt;a href="https://about.fb.com/news/2024/11/open-source-ai-america-global-security/"&gt;Meta opened Llama to defense agencies and contractors including Lockheed Martin and Anduril in November 2024&lt;/a&gt;. Anthropic has never open-sourced a Claude model. Every major lab is moving in the same direction.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-openness-retreat-timeline-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/openness-retreat-timeline.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/openness-retreat-timeline.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/openness-retreat-timeline.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/openness-retreat-timeline.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/openness-retreat-timeline.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/openness-retreat-timeline.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/openness-retreat-timeline.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/openness-retreat-timeline.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/openness-retreat-timeline.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/openness-retreat-timeline.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/openness-retreat-timeline.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/openness-retreat-timeline.png"
alt="Timeline of AI lab research openness from 2015 to 2026, showing the retreat from open-source to classified military AI work: OpenAI moved from open-source GPT-1 to classified Pentagon deployment, Google removed its weapons prohibition, Meta opened Llama to defense contractors, and Anthropic was labeled a supply-chain risk"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-openness-retreat-timeline-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/openness-retreat-timeline.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Timeline of AI lab research openness from 2015 to 2026, showing the retreat from open-source to classified military AI work: OpenAI moved from open-source GPT-1 to classified Pentagon deployment, Google removed its weapons prohibition, Meta opened Llama to defense contractors, and Anthropic was labeled a supply-chain risk" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The counterargument, and it&amp;rsquo;s a real one, is that defense R&amp;amp;D has historically generated civilian spillovers: ARPANET, GPS, jet engines, the semiconductor supply chain. &lt;a href="https://direct.mit.edu/rest/article/107/1/14/114751/The-Intellectual-Spoils-of-War-Defense-R-amp-D"&gt;Moretti, Steinwender, and Van Reenen, writing in the &lt;em&gt;Review of Economics and Statistics&lt;/em&gt; (2025)&lt;/a&gt;, found that a 10% increase in government-funded defense R&amp;amp;D generates a 5–6% increase in privately funded R&amp;amp;D in the same industry: crowding-in, not crowding-out. The estimated total effect: U.S. private R&amp;amp;D investment is &lt;strong&gt;$85 billion higher&lt;/strong&gt; than it would be without government defense spending.&lt;/p&gt;
&lt;p&gt;But there&amp;rsquo;s a difference between how much research gets done and what it gets pointed at. Lockheed&amp;rsquo;s R&amp;amp;D now probably goes almost entirely into classified hypersonics and directed-energy weapons. What it learns there does not flow back to commercial applications in any useful timeframe. The research volume expands; the scope narrows. Bell Labs devoted a substantial share of its personnel to government contracts at its Cold War peak; &lt;a href="https://cepr.org/voxeu/columns/how-antitrust-enforcement-can-spur-innovation-bell-labs-and-1956-consent-decree"&gt;the 1956 AT&amp;amp;T Consent Decree forced royalty-free patent licensing on the transistor&lt;/a&gt;, which accidentally accelerated the civilian semiconductor industry by giving Texas Instruments and Fairchild Semiconductor access to the core technology. AI labs operating under classification will not be forced to open-license anything. That mechanism does not exist for software under ITAR.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m more confident in the direction of this analysis than in the timeline. The Anthropic supply-chain-risk designation may not survive legal challenge. The $13.4 billion FY2026 AI budget might not survive unchanged. Amodei might find a compromise that others in the industry treat as a ceiling rather than a floor. What I don&amp;rsquo;t think reverses is the structural pull. The defense budget is the largest single purchaser of advanced technology on earth, it&amp;rsquo;s growing, it operates on multi-year contract cycles that reward incumbents, and it is willing to use blunt regulatory tools against companies that don&amp;rsquo;t cooperate, as Anthropic learned in about six hours on February 27.&lt;/p&gt;
&lt;p&gt;The Last Supper logic applies here too: the government will not block consolidation, and it will not save the AI defense contractors that don&amp;rsquo;t participate. It will just find a different partner who will.&lt;/p&gt;</description></item><item><title>The Impossible Backhand</title><link>https://philippdubach.com/posts/the-impossible-backhand/</link><pubDate>Tue, 17 Feb 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/the-impossible-backhand/</guid><description>&lt;p&gt;In the latest issue of &lt;a href="https://lab.philippdubach.com"&gt;The AI Lab Newsletter&lt;/a&gt;, I featured a ByteDance &lt;a href="https://x.com/AngryTomtweets/status/2021194266517832057"&gt;Seedance 2.0&lt;/a&gt; clip: two men playing tennis at what looked like an ATP tournament. Photorealistic. I probably wouldn&amp;rsquo;t be able to tell it wasn&amp;rsquo;t real footage if I didn&amp;rsquo;t know. A co-worker who played junior pro-am tennis watched the same clip and said: &amp;ldquo;That backhand doesn&amp;rsquo;t exist. Nobody plays it like that.&amp;rdquo; His domain expertise spotted an error that probably fooled everyone else.&lt;/p&gt;
&lt;p&gt;We ended up in a long conversation about what that means. AI can get to maybe the 95th or 98th percentile of creating something that looks perfect, but then it isn&amp;rsquo;t, and if you have deep knowledge you can spot it immediately. The consensus narrative treats this as a temporary limitation. But it might be structural. And I think the evidence, once you lay it out, points to a genuinely contrarian conclusion: domain expertise is appreciating in value, not depreciating, precisely because AI hits a quality ceiling it can&amp;rsquo;t easily push past.&lt;/p&gt;
&lt;h2 id="approaching-the-ai-quality-ceiling"&gt;Approaching the AI quality ceiling&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve &lt;a href="https://philippdubach.com/posts/the-most-expensive-assumption-in-ai/"&gt;written before&lt;/a&gt; about Sara Hooker&amp;rsquo;s work on diminishing returns from scaling. The investment side of that argument, the &lt;a href="https://philippdubach.com/posts/the-saaspocalypse-paradox/"&gt;$690 billion in hyperscaler capex&lt;/a&gt; chasing a 4% revenue coverage ratio, has been well covered. What hasn&amp;rsquo;t been covered as precisely is why AI output quality hits a ceiling, and why that ceiling is structural rather than temporary.&lt;/p&gt;
&lt;p&gt;Ben Affleck, of all people, gave the clearest non-technical explanation on &lt;a href="https://faroutmagazine.co.uk/ben-affleck-dismisses-existential-potential-ai-hollywood/"&gt;The Joe Rogan Experience&lt;/a&gt; in January 2026:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you try to get ChatGPT or Claude or Gemini to write you something, it&amp;rsquo;s really shitty. And it&amp;rsquo;s shitty because by its nature it goes to the mean, to the average. Now, it&amp;rsquo;s a useful tool if you&amp;rsquo;re a writer&amp;hellip; but I don&amp;rsquo;t think it&amp;rsquo;s actually very likely that it&amp;rsquo;s going to write anything meaningful, or that it&amp;rsquo;s going to be making movies from whole cloth. That&amp;rsquo;s bullshit.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He&amp;rsquo;s more right than he probably knows. The convergence to the mean isn&amp;rsquo;t a solvable engineering problem. It operates at three distinct levels, each compounding the others.&lt;/p&gt;
&lt;p&gt;(1) The mathematics of next-token prediction. LLMs generate the most statistically probable continuation of a sequence. Probable, by definition, means average. The model isn&amp;rsquo;t trying to produce the best output; it&amp;rsquo;s producing the most expected one given the distribution it learned. Outlier quality, the kind that makes writing or analysis distinctive, lives in the tails of the distribution. The architecture systematically avoids those tails.&lt;/p&gt;
&lt;p&gt;(2) RLHF makes it worse. Research shows that human annotators prefer familiar-sounding responses, and the learned reward function weights typicality at α=0.57. Models are quite literally being trained to sound typical rather than merely correct or good. The reinforcement signal pushes outputs toward the center of the quality distribution, not toward its upper bound.&lt;/p&gt;
&lt;p&gt;(3) Model collapse. &lt;a href="https://www.nature.com/articles/s41586-024-07566-y"&gt;Shumailov et al.&lt;/a&gt; documented this in their Nature paper: as models increasingly train on AI-generated content, they &amp;ldquo;forget the true underlying data distribution,&amp;rdquo; losing the tails first and converging toward a point estimate with minimal variance. The internet is filling with AI-generated text. The next generation of models trains on that text. The tails shrink further. This is a positive feedback loop running in the wrong direction.&lt;/p&gt;
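&lt;p&gt;The mechanism is easy to see in a toy simulation. The sketch below (illustrative sample size and generation count, not parameters from the paper) fits a Gaussian to a small dataset, draws the next generation of training data from the fit, and repeats: each refit on a finite sample loses tail information first, and the estimated spread drifts toward a point estimate.&lt;/p&gt;

```python
# Toy version of the model-collapse loop described by Shumailov et al. (2024):
# fit a Gaussian, sample the next "training set" from the fit, refit, repeat.
# Sample size and generation count here are illustrative choices.
import random
import statistics

random.seed(42)
data = [random.gauss(0.0, 1.0) for _ in range(20)]  # original "human" data

history = [statistics.stdev(data)]
for generation in range(2000):
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    # Each new generation trains only on the previous generation's output,
    # so tail events the finite sample happened to miss are gone for good.
    data = [random.gauss(mu, sigma) for _ in range(20)]
    history.append(statistics.stdev(data))

print(f"generation 0 spread:    {history[0]:.3f}")
print(f"generation 2000 spread: {history[-1]:.3f}")  # typically near zero
```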
&lt;p&gt;MIT researchers &lt;a href="https://arxiv.org/abs/2007.05558"&gt;Thompson, Greenewald, Lee, and Manso&lt;/a&gt; quantified the cost side: computational resources scale with at least the fourth power of improvement in theory, the ninth power in practice. To halve an error rate requires more than 500× the computational resources. When AlexNet trained on two GPUs in 2012, it took six days. By 2018, NASNet-A cut the error rate in half using more than 1,000× as much compute. &lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ninth-power-curve-2-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ninth-power-curve-2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ninth-power-curve-2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ninth-power-curve-2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ninth-power-curve-2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ninth-power-curve-2.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ninth-power-curve-2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ninth-power-curve-2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ninth-power-curve-2.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ninth-power-curve-2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ninth-power-curve-2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ninth-power-curve-2.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ninth-power-curve-2.png"
alt="AI quality ceiling ninth-power scaling curve: computational cost scales from AlexNet in 2012 on two GPUs to NASNet-A in 2018 requiring over 1000x compute to halve error rate, showing diminishing returns that explain why AI output quality plateaus and domain expertise remains irreplaceable"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ninth-power-curve-2-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ninth-power-curve-2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="AI quality ceiling ninth-power scaling curve: computational cost scales from AlexNet in 2012 on two GPUs to NASNet-A in 2018 requiring over 1000x compute to halve error rate, showing diminishing returns that explain why AI output quality plateaus and domain expertise remains irreplaceable" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
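&lt;p&gt;The arithmetic behind those figures is worth making explicit. Under the Thompson et al. estimate, compute grows as a power of one over the error rate, so halving the error costs roughly 512 times the compute at the ninth power observed in practice, versus 16 times under the fourth-power theoretical bound. A short check:&lt;/p&gt;

```python
# Power-law cost of error reduction, per the Thompson et al. estimate cited
# above: compute grows as (1/error)**p, with p about 9 in practice, 4 in theory.
def compute_multiplier(error_reduction_factor, power):
    """Extra compute needed to divide the error rate by the given factor."""
    return error_reduction_factor ** power

print(compute_multiplier(2, 9))  # halve the error in practice: 512x
print(compute_multiplier(2, 4))  # halve the error in theory: 16x
print(compute_multiplier(4, 9))  # quarter the error in practice: 262144x
```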
&lt;p&gt;Affleck captured the commercial implication of this better than most analysts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I think a lot of that rhetoric comes from people who are trying to justify valuations around companies where they go, &amp;ldquo;We&amp;rsquo;re going to change everything in two years.&amp;rdquo; Well, the reason they&amp;rsquo;re saying that is because they need to ascribe a valuation for investment that can warrant the capex spend they&amp;rsquo;re going to make on these data centers. Except that ChatGPT 5 is about 25 percent better than ChatGPT 4, and costs about four times as much in the way of electricity and data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He&amp;rsquo;s describing the ninth-power curve in plain English. Each marginal improvement costs exponentially more. The curve bends away from you the harder you push.&lt;/p&gt;
&lt;h2 id="humanitys-last-exam"&gt;Humanity&amp;rsquo;s Last Exam&lt;/h2&gt;
&lt;p&gt;The hardest measurement of where AI actually stands against domain expertise is &lt;a href="https://artificialanalysis.ai/evaluations/humanitys-last-exam"&gt;Humanity&amp;rsquo;s Last Exam&lt;/a&gt; (HLE), published in Nature in early 2025 by the Center for AI Safety and Scale AI. Built with approximately 1,000 subject-matter experts across 500+ institutions, it consists of 2,500 expert-crafted questions spanning 100+ academic domains, designed to be &amp;ldquo;Google-proof&amp;rdquo;: questions that require genuine understanding rather than information retrieval.&lt;/p&gt;
&lt;p&gt;As of February 2026, the top model (Gemini 3 Pro Preview) scores &lt;strong&gt;37.5%&lt;/strong&gt;. Most models sit below 30%. Human domain experts average roughly &lt;strong&gt;90%&lt;/strong&gt;. That&amp;rsquo;s a 53-point gap. In specialized domains like advanced chemical kinetics or medieval philology, AI barely outperforms random guessing while experts score comfortably in the 80s and 90s. &lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-hle-gap-chart-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/hle-gap-chart.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/hle-gap-chart.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/hle-gap-chart.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/hle-gap-chart.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hle-gap-chart.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/hle-gap-chart.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/hle-gap-chart.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/hle-gap-chart.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hle-gap-chart.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/hle-gap-chart.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/hle-gap-chart.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hle-gap-chart.png"
alt="Humanity&amp;#39;s Last Exam 2026 benchmark scores showing 53-point gap between human domain experts at roughly 90 percent and top AI models including Gemini 3 Deep Think at 48.4 percent and Gemini 3 Pro Preview at 37.5 percent, evidence that AI capability frontier remains far behind human expertise on specialist questions"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-hle-gap-chart-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/hle-gap-chart.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Humanity&amp;#39;s Last Exam 2026 benchmark scores showing 53-point gap between human domain experts at roughly 90 percent and top AI models including Gemini 3 Deep Think at 48.4 percent and Gemini 3 Pro Preview at 37.5 percent, evidence that AI capability frontier remains far behind human expertise on specialist questions" decoding="async"&gt;
&lt;/dialog&gt;
The models are also systematically overconfident. Calibration errors on HLE &lt;a href="https://www.letsdatascience.com/blog/humanitys-last-exam-the-test-thats-humbling-the-worlds-smartest-ai"&gt;range from 34% to 89%&lt;/a&gt;, meaning AI systems are saying &amp;ldquo;I&amp;rsquo;m 90% sure&amp;rdquo; when they should be saying &amp;ldquo;I&amp;rsquo;m guessing.&amp;rdquo; That gap between confidence and accuracy, that AI overconfidence, is where real-world harm concentrates.&lt;/p&gt;
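&lt;p&gt;A concrete way to read those calibration numbers is as the gap between stated confidence and realized accuracy. The sketch below uses invented data, not HLE results, to show how a model that reports high confidence while effectively guessing produces exactly this kind of error.&lt;/p&gt;

```python
# Illustrative calibration gap: mean stated confidence minus realized accuracy.
# The numbers are invented for illustration; they are not HLE data.
def calibration_gap(predictions):
    """predictions: list of (stated_confidence, was_correct) pairs."""
    mean_conf = sum(c for c, _ in predictions) / len(predictions)
    accuracy = sum(1 for _, ok in predictions if ok) / len(predictions)
    return abs(mean_conf - accuracy)

# A model that says "90% sure" on ten questions but gets only two right:
preds = [(0.90, True)] * 2 + [(0.90, False)] * 8
print(f"{calibration_gap(preds):.0%}")  # 70% gap between confidence and accuracy
```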
&lt;p&gt;In legal applications, Yale researcher &lt;a href="https://law.stanford.edu/2024/01/11/hallucinating-law-legal-mistakes-with-large-language-models-are-pervasive/"&gt;Matthew Dahl&lt;/a&gt; found hallucination rates of 69% to 88% on specific queries. Damien Charlotin&amp;rsquo;s database now tracks 914 cases of AI-generated hallucinated content in legal filings worldwide, growing from two cases per week to two to three per day. In medicine, the &lt;a href="https://www.annfammed.org/content/23/1/1/tab-e-letters"&gt;Annals of Family Medicine&lt;/a&gt; warns that AI hallucinations are &amp;ldquo;far more insidious&amp;rdquo; because &amp;ldquo;a subtle misstep like a misplaced clinical guideline, an incorrect dosage, or an invented side effect may not raise immediate suspicion.&amp;rdquo; These aren&amp;rsquo;t edge cases. They&amp;rsquo;re the expected behavior of systems operating in professional domains where training data is sparse.&lt;/p&gt;
&lt;p&gt;The structural explanation is what Kandpal et al. demonstrated at ICML 2023: there&amp;rsquo;s a strong correlational and causal relationship between an LLM&amp;rsquo;s ability to answer questions and how many relevant documents appeared in pre-training data. Common knowledge gets learned well. Specialized knowledge appears infrequently online, so models learn it poorly. &lt;a href="https://x.com/alive_eth/status/1286650402356641792"&gt;Ali Yahya&lt;/a&gt; of a16z framed it sharply: neural networks are &amp;ldquo;fantastic interpolators but terrible extrapolators,&amp;rdquo; powerful pattern matchers that are &amp;ldquo;blind to the mechanisms that generate the data in the first place.&amp;rdquo; &lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-domain-risk-map-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/domain-risk-map.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/domain-risk-map.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/domain-risk-map.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/domain-risk-map.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/domain-risk-map.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/domain-risk-map.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/domain-risk-map.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/domain-risk-map.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/domain-risk-map.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/domain-risk-map.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/domain-risk-map.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/domain-risk-map.png"
alt="AI hallucination rates across professional domains: legal research at 69 to 88 percent failure rated critical risk, clinical medicine rated critical with subtle errors, financial analysis at roughly 45 percent, expert academics at 62.5 percent failure on Humanity&amp;#39;s Last Exam, mapping the AI capability frontier by domain"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-domain-risk-map-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/domain-risk-map.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="AI hallucination rates across professional domains: legal research at 69 to 88 percent failure rated critical risk, clinical medicine rated critical with subtle errors, financial analysis at roughly 45 percent, expert academics at 62.5 percent failure on Humanity&amp;#39;s Last Exam, mapping the AI capability frontier by domain" decoding="async"&gt;
&lt;/dialog&gt;
My colleague who spotted the impossible backhand is a fantastic extrapolator. He has an embodied model of how tennis biomechanics work that no amount of video footage can teach a diffusion model. The model can produce outputs that are statistically plausible. He can identify outputs that are physically impossible. That distinction is the gap.&lt;/p&gt;
&lt;h2 id="the-centaur-model-for-human-ai-collaboration"&gt;The centaur model for human-AI collaboration&lt;/h2&gt;
&lt;p&gt;The consensus framing positions AI and human expertise as substitutes: AI gets better, humans become less relevant. The empirical evidence on AI augmentation versus replacement says the opposite. Human-AI collaboration, what researchers call the centaur model, outperforms either alone, consistently, across domains, and the quality of the human contribution matters a lot.&lt;/p&gt;
&lt;p&gt;The Harvard/BCG study tested 758 consultants, 7% of BCG&amp;rsquo;s consulting workforce, on realistic tasks using GPT-4. The researchers described a &amp;ldquo;&lt;a href="https://www.hbs.edu/faculty/Pages/item.aspx?num=64700"&gt;jagged technological frontier&lt;/a&gt;&amp;rdquo; where some tasks fall within AI&amp;rsquo;s capabilities and others, though seemingly similar, do not. For tasks within that frontier, consultants using AI completed 12.2% more tasks, finished 25.1% faster, and produced results 40% higher in quality. Below-average performers saw a 43% improvement in knowledge worker productivity. AI as skill equalizer. But for tasks outside AI&amp;rsquo;s frontier, consultants using AI were &lt;strong&gt;19 percentage points&lt;/strong&gt; less likely to produce correct solutions. The researchers observed that &amp;ldquo;professionals who had a negative performance when using AI tended to blindly adopt its output and interrogate it less.&amp;rdquo; &lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-centaur-effect-png-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/centaur-effect.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/centaur-effect.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/centaur-effect.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/centaur-effect.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/centaur-effect.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/centaur-effect.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/centaur-effect.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/centaur-effect.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/centaur-effect.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/centaur-effect.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/centaur-effect.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/centaur-effect.png"
alt="Harvard BCG centaur model study results on human-AI collaboration and knowledge worker productivity: within AI capability frontier showing plus 40 percent quality, plus 12.2 percent more tasks, plus 25.1 percent faster; outside frontier showing minus 19 percentage points accuracy for blind delegators"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-centaur-effect-png-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/centaur-effect.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Harvard BCG centaur model study results on human-AI collaboration and knowledge worker productivity: within AI capability frontier showing plus 40 percent quality, plus 12.2 percent more tasks, plus 25.1 percent faster; outside frontier showing minus 19 percentage points accuracy for blind delegators" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;p&gt;That second finding doesn&amp;rsquo;t get enough attention. It means the value of the human in the loop depends entirely on whether the human can identify when the AI is wrong. Which requires precisely the domain expertise that AI supposedly makes obsolete.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://www.lsu.edu/business/news/2025/7/research-ai-collaboration.php"&gt;&amp;ldquo;centaur analyst&amp;rdquo; study from LSU Finance&lt;/a&gt; (winner of the Fama-DFA Best Paper Award) confirmed this human-AI partnership over an 18-year dataset. AI alone beat human stock analysts in 54.5% of cases. The human-AI hybrid outperformed AI-only in nearly 55% of forecasts and reduced extreme prediction errors by roughly 90% compared to human analysts alone. In clinical decision-making experiments with the Mayo Clinic, the ranking was consistent: human-algorithm centaur, then algorithm alone, then human experts alone. The human adds most value at the extremes, catching the cases where the model&amp;rsquo;s convergence to the mean produces confidently wrong answers.&lt;/p&gt;
&lt;p&gt;Affleck, who has thought about this more carefully than his reputation might suggest, landed on the same conclusion:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The way I see the technology and what it&amp;rsquo;s good at and what it&amp;rsquo;s not, it&amp;rsquo;s gonna be good at filling in all the places that are expensive and burdensome, and it&amp;rsquo;s always gonna rely fundamentally on the human artistic aspects of it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Labor economics research broadly confirms this. Oxford researchers &lt;a href="https://arxiv.org/abs/2412.19754"&gt;Mäkelä and Stephany&lt;/a&gt; analyzed 12 million U.S. job vacancies and found that complementary effects of AI are 1.7× larger than substitution effects. The World Economic Forum projects 170 million new jobs created by 2030 versus 92 million displaced, a net gain of 78 million. &lt;a href="https://www.nber.org/system/files/working_papers/w28257/revisions/w28257.rev1.pdf"&gt;Acemoglu, Autor, Hazell, and Restrepo&lt;/a&gt; found that while AI-exposed firms reduce hiring in non-AI positions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the aggregate impacts of AI-labor substitution on employment and wage growth&amp;hellip; is currently too small to be detectable.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://www.mckinsey.com.br/capabilities/tech-and-ai/our-insights/building-the-ai-muscle-of-your-business-leaders"&gt;McKinsey&lt;/a&gt; captures the strategic implication: &amp;ldquo;When you have built a bench of AI-capable domain owners, your company has a real competitive advantage. That&amp;rsquo;s because these leaders are hard to replicate.&amp;rdquo; Yet only 23% of organizations believe they are building sustainable AI advantages, despite 79% reporting competitors are making similar investments.&lt;/p&gt;
&lt;h2 id="ai-deskilling-is-a-trap"&gt;AI deskilling is a trap&lt;/h2&gt;
&lt;p&gt;If a generation of junior analysts learns to use AI before developing independent judgment, they never build the pattern recognition that lets them spot when the model is wrong. If junior lawyers lean on AI for legal research before reading enough case law to develop intuition for what&amp;rsquo;s plausible, they can&amp;rsquo;t catch the 69-88% hallucination rates. If aspiring filmmakers generate scenes with Seedance 2.0 instead of learning how cameras, bodies, and physics actually interact, they can&amp;rsquo;t identify the impossible backhand. &lt;a href="https://www.gartner.com/en/articles/ai-lock-in"&gt;Gartner predicts&lt;/a&gt; that by 2030, half of enterprises will face irreversible skill shortages in at least two critical job roles because of unchecked automation. This AI skill erosion creates a vicious cycle: fewer skilled workers, greater dependence on AI, higher costs to fill the gaps.&lt;/p&gt;
&lt;p&gt;Acemoglu warns that technology &amp;ldquo;does not automatically benefit workers.&amp;rdquo; In 19th-century England, the benefits of mechanization only spread after decades of worker activism. The parallel risk with AI isn&amp;rsquo;t mass unemployment. It&amp;rsquo;s a hollowing out of the skill base that makes the centaur model function. You lose not the jobs but the expertise that makes the jobs valuable.&lt;/p&gt;
&lt;p&gt;David Autor&amp;rsquo;s vision is more optimistic: AI could &amp;ldquo;extend the relevance, reach, and value of human expertise,&amp;rdquo; democratizing it rather than eliminating it. I want to believe that&amp;rsquo;s right. But it requires treating AI as a tool that amplifies existing expertise rather than a shortcut that replaces the need to develop it. The 43% improvement that below-average BCG consultants saw from using GPT-4 is real. The 19-percentage-point penalty when those same consultants blindly trusted AI outside its frontier is equally real. The difference between those two outcomes is judgment. And judgment comes from experience, not from a larger context window.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m more confident in the centaur framework than in any specific prediction about timelines or magnitudes. The ninth-power scaling curve, the 53-point gap on Humanity&amp;rsquo;s Last Exam, the α=0.57 typicality bias in RLHF, the 69-88% hallucination rates in legal applications, and the 95% of &lt;a href="https://philippdubach.com/posts/enterprise-ai-strategy-is-backwards/"&gt;enterprises&lt;/a&gt; seeing no measurable P&amp;amp;L returns from AI investments all point in the same direction. The question of AI augmentation versus replacement has an empirical answer: AI is a tool that makes good practitioners better and bad practitioners worse. The &lt;a href="https://philippdubach.com/posts/is-ai-really-eating-the-world/"&gt;industry narrative&lt;/a&gt; demands a story about replacement. The data tells a story about partnership, one where the human&amp;rsquo;s contribution is not a relic of an earlier era but the irreducible ingredient that makes the whole system work.&lt;/p&gt;
&lt;p&gt;The ability to spot the impossible backhand isn&amp;rsquo;t going away. If anything, it&amp;rsquo;s worth more every day.&lt;/p&gt;</description></item><item><title>The SaaSpocalypse Paradox</title><link>https://philippdubach.com/posts/the-saaspocalypse-paradox/</link><pubDate>Fri, 13 Feb 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/the-saaspocalypse-paradox/</guid><description>&lt;blockquote&gt;
&lt;p&gt;The market is simultaneously pricing AI capex failure and AI destroying all software. Both cannot be true.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-jpm-murphy-note-spread-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/jpm-murphy-note-spread.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/jpm-murphy-note-spread.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/jpm-murphy-note-spread.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/jpm-murphy-note-spread.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/jpm-murphy-note-spread.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/jpm-murphy-note-spread.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/jpm-murphy-note-spread.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/jpm-murphy-note-spread.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/jpm-murphy-note-spread.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/jpm-murphy-note-spread.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/jpm-murphy-note-spread.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/jpm-murphy-note-spread.png"
alt="JP Morgan research note on the February 2026 software sell-off by Mark R Murphy titled Software Collapse Broadens with Nowhere to Hide, questioning the leap from Claude Cowork Plugins to full enterprise software disruption"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-jpm-murphy-note-spread-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/jpm-murphy-note-spread.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="JP Morgan research note on the February 2026 software sell-off by Mark R Murphy titled Software Collapse Broadens with Nowhere to Hide, questioning the leap from Claude Cowork Plugins to full enterprise software disruption" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;Anthropic released &lt;a href="https://github.com/anthropics/knowledge-work-plugins"&gt;11 open-source plugins&lt;/a&gt; for Claude Cowork on January 30. Apache-2.0 licensed, file-based, running in a macOS-only research preview. Within a week, the IGV software ETF had fallen &lt;strong&gt;32%&lt;/strong&gt; from its September peak to a 52-week low of $79.65, roughly $2 trillion in market cap had evaporated, and hedge funds had made &lt;a href="https://www.bnnbloomberg.ca/business/2026/02/04/us-software-stocks-hit-by-anthropic-wake-up-call-on-ai-disruption/"&gt;$24 billion&lt;/a&gt; shorting the sector. The RSI hit 18, the most oversold reading &lt;a href="https://articles.stockcharts.com/article/the-claude-crash-how-ai-triggered-a-historic-selloff-in-software-stocks/"&gt;since 1990&lt;/a&gt;. JP Morgan titled their note &amp;ldquo;&lt;a href="https://privatebank.jpmorgan.com/nam/en/insights/markets-and-investing/tmt/software-shock-ais-broken-logic"&gt;Software Collapse Broadens with Nowhere to Hide&lt;/a&gt;.&amp;rdquo; Jefferies coined the term SaaSpocalypse. It was the worst software stock crash since the dot-com bust.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fortune.com/2026/02/04/why-saas-stocks-tech-selloff-freefall-like-deepseek-2025-overblown-paradox-irrational/"&gt;Bank of America&amp;rsquo;s Vivek Arya&lt;/a&gt; identified the paradox at the center of this: investors are simultaneously punishing hyperscaler stocks because AI capex might generate weak returns, while destroying software stocks because AI adoption will be so pervasive it renders all existing software obsolete. Both cannot hold simultaneously. If AI tools aren&amp;rsquo;t generating meaningful ROI, they&amp;rsquo;re not replacing enterprise software at scale. If they are replacing enterprise software at scale, the hyperscalers are earning extraordinary returns on their infrastructure investment. &lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-saaspocalypse-paradox-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/saaspocalypse-paradox.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/saaspocalypse-paradox.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/saaspocalypse-paradox.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/saaspocalypse-paradox.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/saaspocalypse-paradox.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/saaspocalypse-paradox.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/saaspocalypse-paradox.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/saaspocalypse-paradox.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/saaspocalypse-paradox.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/saaspocalypse-paradox.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/saaspocalypse-paradox.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/saaspocalypse-paradox.png"
alt="The BofA AI paradox in the 2026 SaaSpocalypse showing two mutually exclusive narratives: AI capex generating weak returns with $670B spend and 4 percent coverage ratio, versus AI destroying all software with 32 percent IGV drawdown and $2 trillion lost despite 17 percent sector earnings growth"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-saaspocalypse-paradox-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/saaspocalypse-paradox.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="The BofA AI paradox in the 2026 SaaSpocalypse showing two mutually exclusive narratives: AI capex generating weak returns with $670B spend and 4 percent coverage ratio, versus AI destroying all software with 32 percent IGV drawdown and $2 trillion lost despite 17 percent sector earnings growth" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;p&gt;This paradox can only resolve in one of three ways: AI adoption is real and hyperscaler capex is justified, AI adoption stalls and software incumbents are fine, or the truth is somewhere in between and the market has mispriced both sides. The first two are internally consistent. The market is pricing neither.&lt;/p&gt;
&lt;h2 id="the-bear-case-for-enterprise-software"&gt;The bear case for enterprise software&lt;/h2&gt;
&lt;p&gt;The structural argument against enterprise software is serious and worth stating on its own terms.&lt;/p&gt;
&lt;p&gt;Enterprise software monetizes through per-seat licensing. The SaaS business model depends on a stable correlation between headcount and license count. AI agents break that correlation. If 10 agents do the work of 100 people, the software doesn&amp;rsquo;t get replaced directly; the headcount that justifies the seats does, and CRM seat revenue drops with it. &lt;a href="https://www.tekedia.com/ai-could-destroy-500b-in-enterprise-software-revenue/"&gt;AlixPartners estimates&lt;/a&gt; up to &lt;strong&gt;$500 billion&lt;/strong&gt; in enterprise software revenue could be at risk over time. &lt;a href="https://www.idc.com/resource-center/blog/is-saas-dead-rethinking-the-future-of-software-in-the-age-of-ai/"&gt;IDC predicts&lt;/a&gt; pure seat-based pricing will be obsolete by 2028.&lt;/p&gt;
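&lt;p&gt;A minimal sketch of the seat-compression mechanics described above. Every figure below is invented for illustration; none comes from any vendor&amp;rsquo;s disclosures.&lt;/p&gt;

```python
# Hypothetical per-seat licensing arithmetic. Seat counts and price are
# made up to make the mechanics concrete.

def seat_arr(seats, price_per_seat_per_year):
    """Annual recurring revenue under pure per-seat licensing."""
    return seats * price_per_seat_per_year

before = seat_arr(1_000, 1_800)  # 1,000-seat CRM at $1,800/seat: $1.8M ARR
after = seat_arr(100, 1_800)     # agents shrink headcount 10x: $180k ARR

compression = 1 - after / before  # 90% of seat revenue gone
print(f"${before:,} to ${after:,} ARR ({compression:.0%} compression)")
```

&lt;p&gt;The point of the sketch: nothing about the software itself changed, yet ARR fell 90% because the licensing model priced headcount, not work done.&lt;/p&gt;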
&lt;p&gt;The moat question is equally uncomfortable. Enterprise software&amp;rsquo;s traditional defense was the trained-user-interface moat: the years of institutional muscle memory that makes switching costs prohibitive. Databricks CEO Ali Ghodsi &lt;a href="https://techcrunch.com/2026/02/09/databricks-ceo-says-saas-isnt-dead-but-ai-will-soon-make-it-irrelevant/"&gt;told TechCrunch&lt;/a&gt; that this moat collapses when the interface becomes natural language. If the value of Salesforce or ServiceNow lived in their UI rather than their data, and the UI can now be replicated by a general-purpose model, then the moat was shallower than anyone thought. VC has &lt;a href="https://www.calcalistech.com/ctechnews/article/hjlvyl7lze"&gt;fled traditional SaaS entirely&lt;/a&gt;; as one investor noted, &amp;ldquo;an entrepreneur approaching a VC fund today with a SaaS startup won&amp;rsquo;t even reach the pitch stage.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The build-versus-buy equation is inverting in real time. &lt;a href="https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/"&gt;Klarna&lt;/a&gt; ditched Salesforce and Workday, consolidated onto its own AI-augmented stack, and used an OpenAI-powered bot to handle work that previously required 700 employees. &lt;a href="https://www.saastr.com/the-2026-saas-crash-its-not-what-you-think/"&gt;SaaStr&amp;rsquo;s analysis&lt;/a&gt; of Gartner&amp;rsquo;s &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2026-02-03-gartner-forecasts-worldwide-it-spending-to-grow-10-point-8-percent-in-2026-totaling-6-point-15-trillion-dollars"&gt;$1.43 trillion&lt;/a&gt; 2026 software spending forecast reveals that roughly 9 percentage points of the 14.7% headline growth is price increases on existing software, not net new demand. AI is eating SaaS budgets, redirecting IT spend toward infrastructure while reducing the headcount that generates software seats.&lt;/p&gt;
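&lt;p&gt;SaaStr&amp;rsquo;s decomposition of the Gartner forecast is just subtraction, but it is worth stating explicitly. The 14.7% headline and the roughly 9-point price component are the figures cited above; the split into price versus net new demand follows directly.&lt;/p&gt;

```python
# Decomposing the 2026 software spending growth forecast cited above.
headline_growth = 14.7  # % growth in 2026 software spend (Gartner)
price_component = 9.0   # points attributed to price increases (SaaStr)

net_new_demand = headline_growth - price_component  # ~5.7 points
price_share = price_component / headline_growth     # ~61% of the headline

print(f"net new demand: {net_new_demand:.1f} pts")
print(f"share of growth that is price, not demand: {price_share:.0%}")
```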
&lt;p&gt;This is the case priced into the IGV at $80.&lt;/p&gt;
&lt;h2 id="the-bull-case-for-software-stocks"&gt;The bull case for software stocks&lt;/h2&gt;
&lt;p&gt;The structural argument for enterprise software rests on a distinction the current sell-off is ignoring entirely.&lt;/p&gt;
&lt;p&gt;The bear case assumes a shrinking TAM. &lt;a href="https://www.goldmansachs.com/insights/articles/ai-agents-to-boost-productivity-and-size-of-software-market"&gt;Goldman Sachs Research&lt;/a&gt; argues the opposite: the application software market grows to $780 billion by 2030 at a 13% CAGR, with agents accounting for over 60% of the total. The profit pool shifts from SaaS seats to agentic workloads, but the overall market gets larger, not smaller. &lt;a href="https://a16z.com/ai-will-supercharge-modelbusters/"&gt;a16z&amp;rsquo;s Alex Rampell&lt;/a&gt; takes it further: if AI enables software to not just enhance productivity but actually complete work, the addressable market isn&amp;rsquo;t roughly $350 billion in enterprise software spend (about 1% of GDP). It&amp;rsquo;s the &lt;strong&gt;~$6 trillion&lt;/strong&gt; white-collar services market (~20% of GDP), a 20x expansion into work that was never software-addressable before.&lt;/p&gt;
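&lt;p&gt;The TAM-expansion argument reduces to ratios of GDP shares. The market sizes are the article&amp;rsquo;s; the ~$30 trillion GDP base is a rough assumption used only to recover the percentages. Note the &amp;ldquo;20x&amp;rdquo; headline follows from the GDP shares (20% over 1%); the literal dollar ratio of $6T to $350B is closer to 17x.&lt;/p&gt;

```python
# Back-of-envelope TAM arithmetic. GDP base (~$30T) is an assumption.
gdp = 30e12
enterprise_software = 350e9   # ~1% of GDP, per the article
white_collar_services = 6e12  # ~20% of GDP, per the article

software_share = enterprise_software / gdp       # ~1.2%
services_share = white_collar_services / gdp     # 20%
expansion = white_collar_services / enterprise_software  # ~17x

print(f"software share of GDP: {software_share:.1%}")
print(f"services share of GDP: {services_share:.0%}")
print(f"dollar-basis expansion: {expansion:.0f}x")
```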
&lt;p&gt;David Friedberg made the sharpest version of this argument on the All-In Podcast: software transitions from helping people do work, to completing work, to doing work humans cannot do. At that point, the SaaS pricing model transitions from per-seat to value-based, and &amp;ldquo;SaaS basically takes over the services economy.&amp;rdquo; His estimate: the combined market cap of software companies could be 4x to 10x higher in five years, but &amp;ldquo;not evenly distributed.&amp;rdquo; &lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-tam-expansion-bull-case-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/tam-expansion-bull-case.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/tam-expansion-bull-case.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/tam-expansion-bull-case.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/tam-expansion-bull-case.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/tam-expansion-bull-case.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/tam-expansion-bull-case.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/tam-expansion-bull-case.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/tam-expansion-bull-case.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/tam-expansion-bull-case.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/tam-expansion-bull-case.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/tam-expansion-bull-case.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/tam-expansion-bull-case.png"
alt="TAM expansion analysis from $350B enterprise software at 1 percent of GDP to Goldman Sachs $780B projection by 2030 with over 60 percent AI agent share, to the a16z thesis of $6 trillion in white-collar services at 20 percent of GDP, a 20x expansion"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-tam-expansion-bull-case-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/tam-expansion-bull-case.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="TAM expansion analysis from $350B enterprise software at 1 percent of GDP to Goldman Sachs $780B projection by 2030 with over 60 percent AI agent share, to the a16z thesis of $6 trillion in white-collar services at 20 percent of GDP, a 20x expansion" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;p&gt;The software-versus-semiconductor valuation picture strengthens this framing. The sector is delivering 17% aggregate earnings growth in 2026 while trading at November 2022 EV/Sales multiples, back when the Fed was aggressively hiking into recession fears. The Russell 1000 Software subsector now trades at 32.4x forward earnings versus 43.6x for semiconductors. Recurring-revenue businesses with 90%+ gross margins and 95%+ renewal rates trade at a lower multiple than cyclical chipmakers with 40-60% margins and concentrated customer bases. &lt;a href="https://www.cnbc.com/2026/02/10/jpmorgan-says-the-historic-software-selloff-has-gone-far-enough-10-stocks-to-buy-on-sale.html"&gt;Historically, that inversion has not persisted&lt;/a&gt;. &lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-earnings-vs-stock-disconnect-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/earnings-vs-stock-disconnect.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/earnings-vs-stock-disconnect.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/earnings-vs-stock-disconnect.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/earnings-vs-stock-disconnect.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/earnings-vs-stock-disconnect.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/earnings-vs-stock-disconnect.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/earnings-vs-stock-disconnect.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/earnings-vs-stock-disconnect.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/earnings-vs-stock-disconnect.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/earnings-vs-stock-disconnect.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/earnings-vs-stock-disconnect.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/earnings-vs-stock-disconnect.png"
alt="Q4 2025 earnings vs stock performance disconnect in the 2026 software sell-off: Palantir plus 70.5 percent revenue growth but minus 11.6 percent stock, ServiceNow plus 21 percent but minus 28 percent, Oracle plus 10 percent but minus 53 percent from peak, sector aggregate plus 17 percent earnings growth versus minus 32 percent IGV drawdown"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-earnings-vs-stock-disconnect-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/earnings-vs-stock-disconnect.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Q4 2025 earnings vs stock performance disconnect in the 2026 software sell-off: Palantir plus 70.5 percent revenue growth but minus 11.6 percent stock, ServiceNow plus 21 percent but minus 28 percent, Oracle plus 10 percent but minus 53 percent from peak, sector aggregate plus 17 percent earnings growth versus minus 32 percent IGV drawdown" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-valuation-inversion-png-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/valuation-inversion.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/valuation-inversion.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/valuation-inversion.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/valuation-inversion.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/valuation-inversion.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/valuation-inversion.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/valuation-inversion.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/valuation-inversion.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/valuation-inversion.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/valuation-inversion.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/valuation-inversion.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/valuation-inversion.png"
alt="Software vs semiconductor valuation inversion in 2026: Russell 1000 Software at 32.4x forward PE trades below Russell 1000 Semiconductors at 43.6x, an 11.2x multiple gap, with IGV at $79.65 and S&amp;amp;P 500 software weight compressed from 12 percent to 8.4 percent"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-valuation-inversion-png-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/valuation-inversion.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Software vs semiconductor valuation inversion in 2026: Russell 1000 Software at 32.4x forward PE trades below Russell 1000 Semiconductors at 43.6x, an 11.2x multiple gap, with IGV at $79.65 and S&amp;amp;P 500 software weight compressed from 12 percent to 8.4 percent" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;This is the case that BofA called a paradox and JP Morgan called a mispricing.&lt;/p&gt;
&lt;h2 id="the-hyperscaler-ai-capex-question-that-connects-both-sides"&gt;The hyperscaler AI capex question that connects both sides&lt;/h2&gt;
&lt;p&gt;There is a number that both cases have to account for, and it&amp;rsquo;s the one that determines which side of the paradox resolves first.&lt;/p&gt;
&lt;p&gt;Combined 2026 capex guidance from Microsoft, Alphabet, Amazon, Meta, and Oracle now approaches &lt;a href="https://www.cnbc.com/2026/02/06/google-microsoft-meta-amazon-ai-cash.html"&gt;&lt;strong&gt;$700 billion&lt;/strong&gt;&lt;/a&gt;, more than doubling from $256 billion in 2024. &lt;a href="https://fortune.com/2026/02/04/why-saas-stocks-tech-selloff-freefall-like-deepseek-2025-overblown-paradox-irrational/"&gt;Bank of America calculates&lt;/a&gt; this consumes 94% of operating cash flows after capital returns. The Big Five raised $108 billion in bonds in 2025. AI-related services generate roughly $25 billion in direct revenue against $400+ billion in annual infrastructure spending, a coverage ratio of about 4%. &lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-hyperscaler-capex-vs-cashflow-png-7" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hyperscaler-capex-vs-cashflow.png"
alt="FY2026 hyperscaler AI capex vs cash flow: MSFT META GOOGL AMZN and ORCL estimated cash from operations less dividends and buybacks versus guided capital expenditure, with only Microsoft generating a $5B surplus while Meta shows minus $23B, Google minus $20B, Amazon minus $18B, and Oracle minus $30B"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-hyperscaler-capex-vs-cashflow-png-7" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/hyperscaler-capex-vs-cashflow.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="FY2026 hyperscaler AI capex vs cash flow: MSFT META GOOGL AMZN and ORCL estimated cash from operations less dividends and buybacks versus guided capital expenditure, with only Microsoft generating a $5B surplus while Meta shows minus $23B, Google minus $20B, Amazon minus $18B, and Oracle minus $30B" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
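&lt;p&gt;The coverage-ratio arithmetic above, using the article&amp;rsquo;s round numbers. The &amp;ldquo;about 4%&amp;rdquo; figure reconciles when revenue is measured against spend near the $700 billion guidance; against the $400 billion trailing figure the ratio would be closer to 6%.&lt;/p&gt;

```python
# Hyperscaler AI capex arithmetic from the paragraph above (round numbers).
ai_revenue = 25e9   # direct AI-related services revenue, ~$25B/yr
capex_2026 = 700e9  # combined Big Five 2026 capex guidance
capex_2024 = 256e9  # combined 2024 capex

coverage = ai_revenue / capex_2026  # ~3.6%, i.e. "about 4%"
growth = capex_2026 / capex_2024    # ~2.7x, "more than doubling"

print(f"coverage ratio: {coverage:.1%}")
print(f"capex growth, 2024 to 2026: {growth:.1f}x")
```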
&lt;p&gt;If the bear case is right and AI agents are replacing enterprise software at scale, this capex should already be generating enormous returns. It isn&amp;rsquo;t. If the bull case is right and AI is expanding the TAM into the services economy, this capex is early-stage infrastructure investment that will compound over a decade. In that reading, $700 billion in annual spend is the foundation of a $6 trillion market, not a write-off. Both interpretations require the same capex figure to mean something fundamentally different. The market hasn&amp;rsquo;t decided which.&lt;/p&gt;
&lt;p&gt;Microsoft is the sharpest illustration of this tension. Quarterly capex went from $1 billion in early 2015 to a record &lt;a href="https://fintool.com/news/microsoft-q2-record-capex-cloud-ai"&gt;$37.5 billion in Q2 FY2026&lt;/a&gt;, with roughly two-thirds going to short-lived GPU/CPU assets. And yet Microsoft is the &lt;a href="https://www.gurufocus.com/news/8591224/microsoft-msft-maintains-resilient-cash-flow-amid-hyperscaler-spending-surge"&gt;only hyperscaler&lt;/a&gt; that can fund this buildout from operating cash flow. Azure grew &lt;a href="https://futurumgroup.com/insights/microsoft-q2-fy-2026-cloud-surpasses-50b-azure-up-38-cc/"&gt;39% in Q2 FY2026&lt;/a&gt;, crossing $50 billion in quarterly cloud revenue for the first time. The company is simultaneously the biggest AI capex spender, the one best positioned to generate returns on that spend, and the company whose products (365, Dynamics, Azure) are supposedly being disrupted by Claude plugins. The market is punishing all three at once. &lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-msft-quarterly-capex-png-8" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/msft-quarterly-capex.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/msft-quarterly-capex.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/msft-quarterly-capex.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/msft-quarterly-capex.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/msft-quarterly-capex.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/msft-quarterly-capex.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/msft-quarterly-capex.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/msft-quarterly-capex.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/msft-quarterly-capex.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/msft-quarterly-capex.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/msft-quarterly-capex.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/msft-quarterly-capex.png"
alt="Microsoft quarterly AI capex from FY2015 to FY2026 showing growth from $1 billion to $37.5 billion per quarter, a 2048 percent increase, with recent quarters showing AI infrastructure acceleration"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-msft-quarterly-capex-png-8" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/msft-quarterly-capex.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Microsoft quarterly AI capex from FY2015 to FY2026 showing growth from $1 billion to $37.5 billion per quarter, a 2048 percent increase, with recent quarters showing AI infrastructure acceleration" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="bifurcation-not-extinction-the-saaspocalypse-resolved"&gt;Bifurcation, not extinction: the SaaSpocalypse resolved&lt;/h2&gt;
&lt;p&gt;A &lt;a href="https://am.jpmorgan.com/content/dam/jpm-am-aem/global/en/insights/eye-on-the-market/smothering-heights-amv.pdf"&gt;60% recession probability&lt;/a&gt;, a &lt;a href="https://www.cnbc.com/2026/02/02/fridays-jobs-report-will-be-delayed-because-of-the-partial-government-shutdown.html"&gt;partial government shutdown&lt;/a&gt;, &lt;a href="https://www.salesforceben.com/what-do-trumps-tariffs-mean-for-the-tech-sector/"&gt;elevated tariffs&lt;/a&gt;, and a structural pricing transition are being sold as a single story. They aren&amp;rsquo;t. Separating the macro from the structural requires asking which software categories are genuinely at risk and which are being sold by association.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.janushenderson.com/en-us/investor/article/how-ai-disruption-is-reshaping-the-software-sector-landscape/"&gt;Janus Henderson makes a useful distinction&lt;/a&gt; between &amp;ldquo;systems of record&amp;rdquo; and &amp;ldquo;systems of engagement.&amp;rdquo; Systems of record are deeply embedded in business processes, require regulatory compliance, and carry enormous switching costs: ERP, core finance, cybersecurity, observability. &lt;a href="https://pitchbook.com/news/articles/is-ais-threat-to-software-overblown-pitchbook-analysis"&gt;PitchBook described&lt;/a&gt; replacing one as &amp;ldquo;effectively open-heart surgery for an enterprise.&amp;rdquo; Systems of engagement are user-facing workflow tools where the interface is the product: content creation, tier-1 support, basic analytics. When the interface becomes natural language, that moat collapses. &lt;figure class="post-figure" style="width: 90%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-software-bifurcation-map-png-9" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/software-bifurcation-map.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/software-bifurcation-map.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/software-bifurcation-map.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/software-bifurcation-map.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/software-bifurcation-map.png 1200w"
sizes="90vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/software-bifurcation-map.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/software-bifurcation-map.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/software-bifurcation-map.png 1440w"
sizes="90vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/software-bifurcation-map.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/software-bifurcation-map.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/software-bifurcation-map.png 2000w"
sizes="90vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/software-bifurcation-map.png"
alt="Software bifurcation map by AI disruption risk: ERP cybersecurity and observability at low risk, core CRM and dev tools at medium risk, content creation tier-1 support and basic analytics at high risk, showing the market is pricing every category as if it faces equal threat"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-software-bifurcation-map-png-9" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/software-bifurcation-map.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Software bifurcation map by AI disruption risk: ERP cybersecurity and observability at low risk, core CRM and dev tools at medium risk, content creation tier-1 support and basic analytics at high risk, showing the market is pricing every category as if it faces equal threat" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;p&gt;The bear case is correct about the second category. The bull case is correct about the first. The market is wrong to price them identically. Applying the same multiple compression to both implies that switching costs, regulatory requirements, data gravity, and enterprise procurement cycles have all vanished simultaneously. &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027"&gt;Gartner predicts&lt;/a&gt; over 40% of agentic AI projects will be canceled by end of 2027. Salesforce&amp;rsquo;s Agentforce reached &lt;a href="https://www.salesforceben.com/salesforce-avoids-q3-danger-zone-with-explosive-agentforce-momentum/"&gt;18,500 customers&lt;/a&gt; in its first year, the fastest-adopted organic product in company history. These are not the behaviors of a category that has been disrupted. They are the behaviors of incumbents absorbing a new paradigm.&lt;/p&gt;
&lt;p&gt;Stated precisely: the bear case is a zero-sum repricing where AI agents compress existing software revenue by eliminating seats and commoditizing interfaces. The bull case is a positive-sum expansion where the surviving software companies capture the $6 trillion in white-collar services that was never software-addressable before. The cost of intelligence has fallen &lt;a href="https://a16z.com/ai-will-supercharge-modelbusters/"&gt;99.7% in two years&lt;/a&gt; (Stanford AI Index). Cumulative AI infrastructure investment is expected to exceed $3 trillion by 2030. That kind of capital deployment doesn&amp;rsquo;t produce a world where software shrinks. It produces a world where the definition of &amp;ldquo;software&amp;rdquo; expands to include most of the services economy.&lt;/p&gt;
&lt;p&gt;I wrote &lt;a href="https://philippdubach.com/posts/the-market-can-stay-irrational-longer-than-you-can-stay-solvent/"&gt;recently&lt;/a&gt; about how passive flows create mechanical, price-insensitive selling that overwhelms fundamental buyers. This software sell-off is a textbook case. JP Morgan&amp;rsquo;s Murphy &lt;a href="https://privatebank.jpmorgan.com/nam/en/insights/markets-and-investing/tmt/software-shock-ais-broken-logic"&gt;described&lt;/a&gt; index arbitrage basket selling, programmatic de-grossing, and passive flow liquidity vacuums. The IGV recorded its &lt;a href="https://articles.stockcharts.com/article/the-claude-crash-how-ai-triggered-a-historic-selloff-in-software-stocks/"&gt;highest single-day trading volume&lt;/a&gt; in 25 years. &lt;a href="https://www.cnbc.com/2026/02/10/jpmorgan-says-the-historic-software-selloff-has-gone-far-enough-10-stocks-to-buy-on-sale.html"&gt;JP Morgan&amp;rsquo;s follow-up&lt;/a&gt; argued the sell-off has gone far enough. &lt;a href="https://fortune.com/2026/02/04/why-saas-stocks-tech-selloff-freefall-like-deepseek-2025-overblown-paradox-irrational/"&gt;BofA called it&lt;/a&gt; a paradox that &amp;ldquo;doesn&amp;rsquo;t make any sense.&amp;rdquo; History suggests extremes like these (the 2016 LinkedIn panic, the 2022 rate-shock drawdown, the January 2025 DeepSeek crash) tend to mark inflection points rather than starting points for further decline.&lt;/p&gt;
&lt;p&gt;The hardest trade right now is the one that requires distinguishing between stocks that are cheap because they&amp;rsquo;re broken and stocks that are cheap because the market is broken. The SaaSpocalypse trade, with the IGV at $80 and an RSI at a 30-year extreme, prices in an extinction event that operating results don&amp;rsquo;t remotely support. It looks a lot more like the latter.&lt;/p&gt;
&lt;aside class="disclaimer" role="note" aria-label="Disclaimer"&gt;
&lt;div class="disclaimer-content"&gt;&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; All opinions expressed are my own. This is not investment, financial, tax, or legal advice. Past performance does not indicate future results. Do your own research and consult qualified professionals before making financial decisions. No liability accepted for any losses.&lt;/p&gt;&lt;/div&gt;
&lt;/aside&gt;</description></item><item><title>Don't Go Monolithic; The Agent Stack Is Stratifying</title><link>https://philippdubach.com/posts/dont-go-monolithic-the-agent-stack-is-stratifying/</link><pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/dont-go-monolithic-the-agent-stack-is-stratifying/</guid><description>&lt;blockquote&gt;
&lt;p&gt;The defensible asset in enterprise AI is not the model. It&amp;rsquo;s the organizational world model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Every major compute era decomposes into specialized layers with different winners at each level. Cloud split into IaaS, PaaS, and SaaS. The modern data stack split into ingestion, warehousing, transformation, and BI. Each time, specialists beat the generalists because the layers have fundamentally different economics: different rates of change, different capital requirements, different sources of lock-in.&lt;/p&gt;
&lt;p&gt;The enterprise AI agent stack is doing the same thing right now. Arvind Jain, the CEO of Glean, recently published a &lt;a href="https://x.com/arvind2/status/2020920652950339694"&gt;structural analysis&lt;/a&gt; of the emerging enterprise agent architecture that crystallized something I&amp;rsquo;d been thinking about. His framing describes a stack decomposing into six layers (security, context, models, orchestration, agents, and interfaces) with different defensibility profiles at each level. Glean sits in the context layer so the usual positioning caveats apply, but the structural argument is sound regardless of who makes it.&lt;/p&gt;
&lt;p&gt;I want to take it further. There are three claims embedded in this agentic AI architecture that I think are underappreciated, and together they form a thesis about where durable advantage actually accrues in enterprise AI. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-emerging-agent-stack-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/emerging-agent-stack.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/emerging-agent-stack.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/emerging-agent-stack.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/emerging-agent-stack.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/emerging-agent-stack.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/emerging-agent-stack.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/emerging-agent-stack.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/emerging-agent-stack.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/emerging-agent-stack.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/emerging-agent-stack.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/emerging-agent-stack.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/emerging-agent-stack.png"
alt="Enterprise AI agent stack diagram showing six layers ranked by defensibility: Context scores highest (hardest to rebuild), followed by Orchestration and Security, while Models and Interfaces have the lowest switching costs"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-emerging-agent-stack-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/emerging-agent-stack.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Enterprise AI agent stack diagram showing six layers ranked by defensibility: Context scores highest (hardest to rebuild), followed by Orchestration and Security, while Models and Interfaces have the lowest switching costs" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="i-models-are-converging-toward-shared-infrastructure"&gt;I. Models are converging toward shared infrastructure&lt;/h2&gt;
&lt;p&gt;The model layer is the one most people obsess over, and it&amp;rsquo;s also the one converging fastest toward commodity economics. Training costs &lt;a href="https://epoch.ai/blog/how-much-does-it-cost-to-train-frontier-ai-models"&gt;scale roughly 2.4x per year&lt;/a&gt;, with current frontier runs costing hundreds of millions and &lt;a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/ai-models-that-cost-dollar1-billion-to-train-are-in-development-dollar100-billion-models-coming-soon-largest-current-models-take-only-dollar100-million-to-train-anthropic-ceo"&gt;billion-dollar training runs already underway&lt;/a&gt;, according to Anthropic&amp;rsquo;s Dario Amodei. Only a handful of organizations on Earth can operate at this scale: OpenAI, Google DeepMind, Anthropic, Meta, and a few others including xAI and Mistral. This is textbook capital-intensive infrastructure, structurally identical to semiconductor fabs or cloud hyperscalers. The logical conclusion: foundation models become shared utilities, not enterprise moats.&lt;/p&gt;
&lt;p&gt;The industry has already internalized this. &lt;a href="https://a16z.com/ai-enterprise-2025/"&gt;37% of enterprises now use five or more models in production&lt;/a&gt;, up from 29% the prior year. Different tasks demand different models: Claude for code and tool use, GPT for extended reasoning, Gemini Flash for low-latency routing, specialized models for image generation and embeddings. Betting your enterprise stack on a single model provider is the new version of single-cloud risk. Open standards like Anthropic&amp;rsquo;s &lt;a href="https://www.anthropic.com/news/model-context-protocol"&gt;Model Context Protocol&lt;/a&gt;, now &lt;a href="https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation"&gt;hosted by the Linux Foundation&lt;/a&gt; with 97 million monthly SDK downloads, and Google&amp;rsquo;s &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/"&gt;Agent-to-Agent protocol&lt;/a&gt; are making this multi-model enterprise AI architecture practical.&lt;/p&gt;
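The multi-model pattern above can be made concrete with a minimal routing sketch. The model names and the task taxonomy here are illustrative assumptions, not any vendor's actual routing logic:

```python
# Minimal sketch of task-based model routing across providers.
# Model identifiers and task categories are hypothetical placeholders.

ROUTING_TABLE = {
    "code": "claude-latest",        # code and tool use
    "reasoning": "gpt-latest",      # extended reasoning
    "low_latency": "gemini-flash",  # fast, cheap classification/routing
    "embedding": "embed-small",     # specialized embedding model
}

def route(task_type: str) -> str:
    """Pick a model for a task, falling back to a general default."""
    return ROUTING_TABLE.get(task_type, "general-default")

print(route("code"))      # claude-latest
print(route("unknown"))   # general-default
```

The point of keeping this table outside any one vendor's SDK is exactly the interoperability argument: swapping a model is a one-line config change, not a platform migration.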
&lt;p&gt;If models are infrastructure, the differentiation question moves up the stack. And that&amp;rsquo;s where it gets interesting.&lt;/p&gt;
&lt;h2 id="ii-the-enterprise-ai-context-layer-has-two-depths-and-most-people-only-see-the-first"&gt;II. The enterprise AI context layer has two depths, and most people only see the first&lt;/h2&gt;
&lt;p&gt;This is the part of the thesis I find most intellectually compelling, and where I think the conventional understanding falls short.&lt;/p&gt;
&lt;p&gt;Most enterprise AI efforts operate at what I&amp;rsquo;d call Layer 1 context: connecting data sources, indexing content, enforcing permissions, retrieving relevant documents. This is the RAG-era problem set: familiar, well-understood, and increasingly commoditized. Virtually every enterprise AI platform offers connectors, vector stores, and retrieval pipelines. It matters, but it&amp;rsquo;s not where defensibility lives.&lt;/p&gt;
&lt;p&gt;Layer 2 is where the thesis gets genuinely novel: process-level understanding. Most enterprise knowledge systems capture decisions: what ends up in the CRM, the ticketing system, the ERP. But they don&amp;rsquo;t capture &lt;em&gt;how&lt;/em&gt; those decisions were made: the meetings, Slack threads, document iterations, handoffs, and informal coordination that produced the recorded outcome. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-context-depth-comparison-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/context-depth-comparison.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/context-depth-comparison.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/context-depth-comparison.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/context-depth-comparison.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/context-depth-comparison.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/context-depth-comparison.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/context-depth-comparison.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/context-depth-comparison.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/context-depth-comparison.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/context-depth-comparison.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/context-depth-comparison.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/context-depth-comparison.png"
alt="Enterprise AI context layer depth comparison showing what Systems of Record capture (decisions, states, entities, relationships) versus what Context Graphs capture (processes, temporal traces, causal structure, variability), with ML lens annotations mapping to labels versus feature space and trajectory data"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-context-depth-comparison-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/context-depth-comparison.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Enterprise AI context layer depth comparison showing what Systems of Record capture (decisions, states, entities, relationships) versus what Context Graphs capture (processes, temporal traces, causal structure, variability), with ML lens annotations mapping to labels versus feature space and trajectory data" decoding="async"&gt;
&lt;/dialog&gt;
Through a machine learning lens, the distinction is sharp: systems of record give you labels. Context graphs give you the feature space and trajectory data you&amp;rsquo;d actually need to learn the decision boundary. Consider a concrete example. Your CRM records that Deal X closed at $500K. That&amp;rsquo;s a label. The context graph captures the 14 meetings, 3 stakeholder handoffs, the pricing negotiation pattern, and the competitive displacement sequence that produced that outcome. Those are the features and the trajectory. An agent trained on labels alone can&amp;rsquo;t replicate the process that generated them.&lt;/p&gt;
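The label-versus-trajectory distinction can be sketched as two data shapes. All field names and event types here are hypothetical, chosen only to contrast what each system records about the same deal:

```python
# Contrast a system-of-record "label" with a context-graph "trajectory"
# for the same deal. Field and event names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class CrmRecord:            # system of record: the label
    deal_id: str
    outcome: str            # e.g. "closed_won"
    amount: float

@dataclass
class ContextTrace:         # context graph: features and trajectory
    deal_id: str
    events: list = field(default_factory=list)  # ordered process steps

label = CrmRecord("deal-x", "closed_won", 500_000.0)

trace = ContextTrace("deal-x")
trace.events += [
    ("meeting", "discovery call"),
    ("handoff", "AE to solutions engineer"),
    ("negotiation", "discount from list price"),
    ("displacement", "incumbent swapped out"),
]

# The label alone gives a learner nothing to imitate; the ordered
# trace is the trajectory data a decision boundary would be fit on.
```

The design point is that the trace is append-only and ordered: it preserves the sequence that produced the outcome, which the flat record discards.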
&lt;p&gt;This is why so many early enterprise AI deployments produce outputs that are technically plausible but operationally useless. The agent has access to the what but not the how. It can retrieve the right documents but can&amp;rsquo;t reconstruct the reasoning process that a human would follow. Closing that gap, building systems that capture and encode process knowledge rather than just decision records, is the highest-value problem in enterprise AI right now.&lt;/p&gt;
&lt;h2 id="iii-context-and-orchestration-form-a-compounding-flywheel"&gt;III. Context and orchestration form a compounding flywheel&lt;/h2&gt;
&lt;p&gt;There&amp;rsquo;s a reinforcement learning analogy here that I think is underappreciated. The orchestrator is the policy. The context graph is the learned world model. Agent traces are the trajectories. Every successful execution reinforces good patterns. Every failure surfaces where context is missing or stale. Over time, the system builds an increasingly accurate representation of how the organization actually operates.&lt;/p&gt;
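The RL mapping above can be rendered as a toy loop: the orchestrator is the policy, the context graph is the world model, and each execution is a trajectory that updates it. Every name below is illustrative, not a real API:

```python
# Toy flywheel: successful executions reinforce patterns in a
# "world model" (here just success counts per task/pattern pair).
# All task and pattern names are hypothetical examples.

world_model = {}  # (task, pattern) -> observed success count

def policy(task: str) -> str:
    """Prefer the pattern the world model has seen succeed most often."""
    candidates = [key for key in world_model if key[0] == task]
    if not candidates:
        return "default_procedure"
    return max(candidates, key=world_model.get)[1]

def execute_and_learn(task: str, pattern: str, succeeded: bool) -> None:
    """Each execution is a trajectory; successes update the world model."""
    if succeeded:
        key = (task, pattern)
        world_model[key] = world_model.get(key, 0) + 1

# More deployment produces richer traces, which shift the policy.
execute_and_learn("escalation", "loop_in_legal_early", True)
execute_and_learn("escalation", "loop_in_legal_early", True)
execute_and_learn("escalation", "email_only", True)
print(policy("escalation"))  # loop_in_legal_early
```

A real system would learn from failures too (surfacing missing or stale context rather than discarding the trace), but the compounding mechanism is the same: traces in, better decisions out.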
&lt;p&gt;And this loops back: more deployment produces richer traces, which improve the context graph, which improves agent decisions, which builds trust, which drives more deployment. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-compounding-flywheel-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/compounding-flywheel.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/compounding-flywheel.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/compounding-flywheel.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/compounding-flywheel.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/compounding-flywheel.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/compounding-flywheel.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/compounding-flywheel.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/compounding-flywheel.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/compounding-flywheel.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/compounding-flywheel.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/compounding-flywheel.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/compounding-flywheel.png"
alt="Organizational world model compounding flywheel showing the five-step loop: Agent Executes → Traces Captured → Context Improves → Better Decisions → More Deployment, with ML analogy mapping table showing enterprise concepts mapped to RL primitives (policy rollout, trajectories, world model update, policy improvement, online learning loop)"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-compounding-flywheel-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/compounding-flywheel.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Organizational world model compounding flywheel showing the five-step loop: Agent Executes → Traces Captured → Context Improves → Better Decisions → More Deployment, with ML analogy mapping table showing enterprise concepts mapped to RL primitives (policy rollout, trajectories, world model update, policy improvement, online learning loop)" decoding="async"&gt;
&lt;/dialog&gt;
This is the same compounding mechanism that makes recommendation engines and autonomous driving systems improve with scale. Netflix gets better at recommendations because every viewing session generates training signal. Waymo gets better at driving because every mile generates edge cases. The difference here is that the asset being built isn&amp;rsquo;t a product feature. It&amp;rsquo;s an organizational world model, a learned representation of how your specific company works.&lt;/p&gt;
&lt;p&gt;And unlike model weights, which any well-funded lab can approximate, your organization&amp;rsquo;s accumulated process knowledge is genuinely unique. No one else has your meeting patterns, your escalation sequences, your informal decision-making topology. That&amp;rsquo;s a moat.&lt;/p&gt;
&lt;h2 id="where-this-breaks-and-why-the-agentic-ai-failure-rate-will-be-high"&gt;Where this breaks, and why the agentic AI failure rate will be high&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.uctoday.com/unified-communications/gartner-predicts-40-of-enterprise-apps-will-feature-ai-agents-by-2026/"&gt;Gartner predicts 40% of enterprise applications will feature task-specific AI agents by 2026&lt;/a&gt;, up from less than 5% in 2025. &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai"&gt;McKinsey&amp;rsquo;s latest survey shows 23% of organizations are already scaling agentic AI&lt;/a&gt;, with another 39% experimenting. But Gartner also warns that over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs and unclear business value.&lt;/p&gt;
&lt;p&gt;The gap between ambition and execution is the context problem in disguise. Without process knowledge, agents produce plausible outputs that don&amp;rsquo;t match how the organization actually works. They retrieve the right policy document but apply it without understanding the exceptions your team has developed over years. They draft the right kind of email but miss the relationship dynamics that would change the tone. The failure mode isn&amp;rsquo;t that the model is bad. It&amp;rsquo;s that the context is shallow. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-lockin-vs-rebuild-scatter-png-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/lockin-vs-rebuild-scatter.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/lockin-vs-rebuild-scatter.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/lockin-vs-rebuild-scatter.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/lockin-vs-rebuild-scatter.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/lockin-vs-rebuild-scatter.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/lockin-vs-rebuild-scatter.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/lockin-vs-rebuild-scatter.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/lockin-vs-rebuild-scatter.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/lockin-vs-rebuild-scatter.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/lockin-vs-rebuild-scatter.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/lockin-vs-rebuild-scatter.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/lockin-vs-rebuild-scatter.png"
alt="Enterprise AI agent stack scatter plot showing six layers plotted by lock-in risk versus rebuild difficulty. Context sits alone in the top-right danger zone with highest lock-in and hardest rebuild. Models, Interfaces, and Agents cluster in the commodity zone at bottom-left. Orchestration and Security occupy the middle."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-lockin-vs-rebuild-scatter-png-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/lockin-vs-rebuild-scatter.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Enterprise AI agent stack scatter plot showing six layers plotted by lock-in risk versus rebuild difficulty. Context sits alone in the top-right danger zone with highest lock-in and hardest rebuild. Models, Interfaces, and Agents cluster in the commodity zone at bottom-left. Orchestration and Security occupy the middle." decoding="async"&gt;
&lt;/dialog&gt;
This chart tells the strategic story in one image. Models, interfaces, and agents cluster in the commodity zone: low lock-in, easy to replace. Context sits alone in the danger zone: highest lock-in risk and hardest to rebuild. That&amp;rsquo;s exactly where your due diligence should concentrate.&lt;/p&gt;
&lt;h2 id="what-to-actually-do-about-your-agentic-ai-architecture"&gt;What to actually do about your agentic AI architecture&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Don&amp;rsquo;t go monolithic.&lt;/strong&gt; Each layer evolves at a different rate. Models improve quarterly, context infrastructure evolves over months, security requirements shift with regulation. Coupling them into one vendor&amp;rsquo;s all-in-one platform forces you to upgrade at the speed of the slowest-moving layer. You inherit their architectural bets, their integration timeline, their roadmap priorities. The history of enterprise software is littered with platforms that tried to own every layer and ended up mediocre at all of them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Insist on interoperability.&lt;/strong&gt; MCP, A2A, open connectors. If your vendor doesn&amp;rsquo;t support open standards, you&amp;rsquo;re absorbing limitations you can&amp;rsquo;t see yet. The pace of AI innovation is faster than any prior technology cycle, and you need the ability to swap in new capabilities the moment they appear without rebuilding your stack. The organizations that locked into single-vendor cloud stacks in 2015 spent years migrating out. Don&amp;rsquo;t repeat that mistake at the agent layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Treat context as portable IP.&lt;/strong&gt; Your organizational world model (process knowledge, interaction history, learned workflow patterns) is the hardest-to-rebuild and most valuable asset in the stack. Ensure it is not locked to any single vendor or model provider. The right architecture separates accumulated context from the model layer so you retain your organizational IP regardless of which models or platforms you use tomorrow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start the flywheel early.&lt;/strong&gt; The compounding advantage in context accrues with deployment, not with time spent evaluating. Every agent execution generates organizational learning. Companies that wait to &amp;ldquo;see how it plays out&amp;rdquo; forfeit years of compounding to first movers. This isn&amp;rsquo;t speculative. It&amp;rsquo;s the same math that governs every data flywheel business. The question isn&amp;rsquo;t whether to start. It&amp;rsquo;s whether you can afford the cost of starting late.&lt;/p&gt;
&lt;p&gt;The stack will stratify. Specialists will outperform monoliths. Models will converge toward shared infrastructure. The defensible asset in enterprise AI is not the model. It&amp;rsquo;s the organizational world model. The organizations that start building it now, maintaining it carefully, and keeping it portable will compound their lead in the agent era. Everyone else will be buying commodity inference and wondering why their agents don&amp;rsquo;t work.&lt;/p&gt;
&lt;aside class="disclaimer" role="note" aria-label="Disclaimer"&gt;
&lt;div class="disclaimer-content"&gt;&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; AI capabilities evolve rapidly; information may become outdated. Code and implementations provided as-is without warranty.&lt;/p&gt;&lt;/div&gt;
&lt;/aside&gt;</description></item><item><title>Where Mobile Money Goes Now</title><link>https://philippdubach.com/posts/where-mobile-money-goes-now/</link><pubDate>Sat, 07 Feb 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/where-mobile-money-goes-now/</guid><description>&lt;p&gt;Sensor Tower&amp;rsquo;s &lt;a href="https://sensortower.com/state-of-mobile-2026"&gt;State of Mobile 2026&lt;/a&gt; report confirms what had been building for years: the mobile app economy has permanently shifted. For the first decade of mobile, games made more money than everything else combined. Clash of Clans and Candy Crush built empires on freemium. King went public. Supercell sold for $10 billion. That changed in 2025.&lt;/p&gt;
&lt;h2 id="apps-overtake-games-in-mobile-revenue"&gt;Apps Overtake Games in Mobile Revenue&lt;/h2&gt;
&lt;p&gt;&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-apps_vs_games_revenue-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/apps_vs_games_revenue.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/apps_vs_games_revenue.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/apps_vs_games_revenue.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/apps_vs_games_revenue.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/apps_vs_games_revenue.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/apps_vs_games_revenue.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/apps_vs_games_revenue.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/apps_vs_games_revenue.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/apps_vs_games_revenue.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/apps_vs_games_revenue.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/apps_vs_games_revenue.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/apps_vs_games_revenue.png"
alt="Line chart showing apps overtaking games in mobile IAP revenue in 2025, with apps at $85.6B and games at $81.8B, per Sensor Tower&amp;#39;s State of Mobile 2026"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-apps_vs_games_revenue-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/apps_vs_games_revenue.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Line chart showing apps overtaking games in mobile IAP revenue in 2025, with apps at $85.6B and games at $81.8B, per Sensor Tower&amp;#39;s State of Mobile 2026" decoding="async"&gt;
&lt;/dialog&gt;
Non-game applications now generate more in-app purchase revenue than games. Apps crossed $85.6 billion in 2025, up 21% year-over-year. Games managed $81.8 billion, barely moving from the year before.&lt;/p&gt;
&lt;p&gt;Games peaked in 2021 and flatlined. Apps kept compounding. Subscriptions, which seemed like a novelty in 2018, became the dominant mobile monetization model for cloud storage, language learning, and now AI.&lt;/p&gt;
&lt;h2 id="genai-the-35-billion-growth-engine"&gt;GenAI: The $3.5 Billion Growth Engine&lt;/h2&gt;
&lt;p&gt;&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-genai_revenue_growth-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/genai_revenue_growth.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/genai_revenue_growth.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/genai_revenue_growth.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/genai_revenue_growth.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/genai_revenue_growth.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/genai_revenue_growth.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/genai_revenue_growth.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/genai_revenue_growth.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/genai_revenue_growth.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/genai_revenue_growth.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/genai_revenue_growth.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/genai_revenue_growth.png"
alt="Horizontal bar chart showing GenAI led mobile app revenue growth in 2025 with $3.5B added, more than any other category"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-genai_revenue_growth-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/genai_revenue_growth.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Horizontal bar chart showing GenAI led mobile app revenue growth in 2025 with $3.5B added, more than any other category" decoding="async"&gt;
&lt;/dialog&gt;
Generative AI was the biggest contributor to growth in consumer spending on mobile apps. The category added $3.5 billion in IAP revenue in 2025, more than Movies &amp;amp; TV ($2.2B) or Social Media ($2.1B). It went from near-zero in 2022 to the top growth category in three years. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-genai_rise-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/genai_rise.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/genai_rise.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/genai_rise.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/genai_rise.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/genai_rise.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/genai_rise.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/genai_rise.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/genai_rise.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/genai_rise.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/genai_rise.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/genai_rise.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/genai_rise.png"
alt="Combined bar and line chart showing GenAI app downloads rising from 0.05B in 2021 to 1.45B in 2024, with revenue hitting $1.25B"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-genai_rise-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/genai_rise.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Combined bar and line chart showing GenAI app downloads rising from 0.05B in 2021 to 1.45B in 2024, with revenue hitting $1.25B" decoding="async"&gt;
&lt;/dialog&gt;
GenAI apps went from 50 million downloads in 2021 to 1.45 billion in 2024. Revenue jumped from essentially nothing to $1.25 billion. ChatGPT alone accounts for 40% of the category&amp;rsquo;s consumer spend. This is just in-app purchases and does not count subscriptions billed outside the app store or enterprise contracts.&lt;/p&gt;
&lt;h2 id="who-actually-uses-ai-apps"&gt;Who Actually Uses AI Apps&lt;/h2&gt;
&lt;p&gt;The demographics are interesting: AI app users look nothing like the broader internet population. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-genai_demographics-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/genai_demographics.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/genai_demographics.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/genai_demographics.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/genai_demographics.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/genai_demographics.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/genai_demographics.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/genai_demographics.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/genai_demographics.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/genai_demographics.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/genai_demographics.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/genai_demographics.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/genai_demographics.png"
alt="Scatter plot showing GenAI user demographics cluster with Reddit and X (young, male-skewing), not Instagram or Pinterest"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-genai_demographics-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/genai_demographics.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Scatter plot showing GenAI user demographics cluster with Reddit and X (young, male-skewing), not Instagram or Pinterest" decoding="async"&gt;
&lt;/dialog&gt;
GenAI users cluster with Reddit and X. Young, male, tech-adjacent. They look nothing like Instagram (young women) or Pinterest (older women) or even Facebook (everyone&amp;rsquo;s parents). The AI audience is still a niche, even as GenAI app revenue scales.&lt;/p&gt;
&lt;h2 id="the-ai-advertising-playbook"&gt;The AI Advertising Playbook&lt;/h2&gt;
&lt;p&gt;This explains where AI companies advertise: &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai_advertising_skew-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai_advertising_skew.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai_advertising_skew.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai_advertising_skew.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai_advertising_skew.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_advertising_skew.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai_advertising_skew.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai_advertising_skew.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai_advertising_skew.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_advertising_skew.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai_advertising_skew.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai_advertising_skew.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_advertising_skew.png"
alt="Horizontal bar chart showing AI companies over-index on LinkedIn (&amp;#43;45%) and under-index on Pinterest (-13%) and YouTube (-9%) for ad demographics"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai_advertising_skew-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai_advertising_skew.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Horizontal bar chart showing AI companies over-index on LinkedIn (&amp;#43;45%) and under-index on Pinterest (-13%) and YouTube (-9%) for ad demographics" decoding="async"&gt;
&lt;/dialog&gt;
LinkedIn gets 45% more GenAI ad impressions than its share of the general population would suggest. Pinterest and YouTube get less. The AI advertising playbook is simple: find professionals, not consumers.&lt;/p&gt;
&lt;h2 id="ai-driven-retail-referral-traffic"&gt;AI-Driven Retail Referral Traffic&lt;/h2&gt;
&lt;p&gt;One place where AI has found consumers: shopping. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai_retail_referrals-png-6" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai_retail_referrals.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai_retail_referrals.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai_retail_referrals.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai_retail_referrals.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_retail_referrals.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai_retail_referrals.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai_retail_referrals.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai_retail_referrals.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_retail_referrals.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai_retail_referrals.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai_retail_referrals.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_retail_referrals.png"
alt="Stacked area chart showing GenAI referral traffic to major retailers growing from ~$5M to ~$51M between Oct 2024 and Dec 2025"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai_retail_referrals-png-6" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai_retail_referrals.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Stacked area chart showing GenAI referral traffic to major retailers growing from ~$5M to ~$51M between Oct 2024 and Dec 2025" decoding="async"&gt;
&lt;/dialog&gt;
Referral traffic from AI tools to major retailers grew roughly 7x between October 2024 and December 2025. People are asking ChatGPT what to buy, and then buying it. Amazon captures the largest share, but Walmart, Target, and Home Depot have all seen triple-digit percentage growth in AI-driven traffic. Still less than 1% of total retail traffic. But growing fast.&lt;/p&gt;
&lt;h2 id="youtubes-cross-generational-dominance"&gt;YouTube&amp;rsquo;s Cross-Generational Dominance&lt;/h2&gt;
&lt;p&gt;One pattern stands out: &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-youtube_dominance-png-7" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/youtube_dominance.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/youtube_dominance.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/youtube_dominance.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/youtube_dominance.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/youtube_dominance.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/youtube_dominance.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/youtube_dominance.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/youtube_dominance.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/youtube_dominance.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/youtube_dominance.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/youtube_dominance.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/youtube_dominance.png"
alt="Table showing YouTube is the #1 app across every age group in the US (18-24, 25-34, 35-44, 45&amp;#43;) per Sensor Tower"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-youtube_dominance-png-7" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/youtube_dominance.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Table showing YouTube is the #1 app across every age group in the US (18-24, 25-34, 35-44, 45&amp;#43;) per Sensor Tower" decoding="async"&gt;
&lt;/dialog&gt;
YouTube is the top app across every age demographic. Every single one. 18-24, 25-34, 35-44, 45+. No other app has achieved this. Not TikTok (appears for youngest and oldest, vanishes in the middle). Not Instagram (fades with age). Not Facebook (rises with age). YouTube alone spans generations.&lt;/p&gt;
&lt;h2 id="waymos-quiet-expansion"&gt;Waymo&amp;rsquo;s Quiet Expansion&lt;/h2&gt;
&lt;p&gt;Finally, Waymo: &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-waymo_penetration-png-9" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/waymo_penetration.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/waymo_penetration.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/waymo_penetration.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/waymo_penetration.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/waymo_penetration.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/waymo_penetration.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/waymo_penetration.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/waymo_penetration.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/waymo_penetration.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/waymo_penetration.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/waymo_penetration.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/waymo_penetration.png"
alt="Line chart showing Waymo&amp;#39;s autonomous ride-hailing penetration of Lyft and Uber users rising to ~4% and ~3% respectively by Q4 2025"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-waymo_penetration-png-9" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/waymo_penetration.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Line chart showing Waymo&amp;#39;s autonomous ride-hailing penetration of Lyft and Uber users rising to ~4% and ~3% respectively by Q4 2025" decoding="async"&gt;
&lt;/dialog&gt;
Waymo accounts for about 4% of Lyft users and 3% of Uber users nationally, despite operating in only a handful of cities. In its active markets (San Francisco, Phoenix), market share is closer to 15%. The company has driven 127 million autonomous miles and tripled its ride volume to 15 million trips in 2025.&lt;/p&gt;
&lt;p&gt;Mobile is no longer a platform question. It is a distribution question. The app economy winners so far: AI companies targeting professionals, YouTube serving everyone, and autonomous vehicles growing quietly in the background.&lt;/p&gt;</description></item><item><title>Claude Opus 4.6: Anthropic's New Flagship AI Model for Agentic Coding</title><link>https://philippdubach.com/posts/claude-opus-4.6-anthropics-new-flagship-ai-model-for-agentic-coding/</link><pubDate>Thu, 05 Feb 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/claude-opus-4.6-anthropics-new-flagship-ai-model-for-agentic-coding/</guid><description>&lt;p&gt;Anthropic just released Claude Opus 4.6, the latest frontier AI model in the Claude family. It&amp;rsquo;s a big upgrade over Opus 4.5 and probably the most agentic-focused LLM release from any lab this year.&lt;/p&gt;
&lt;p&gt;Key upgrades: better agentic AI coding capabilities (plans more carefully, sustains longer tasks, catches its own mistakes), a 1M token context window (a first for Opus-class models), and 128K output tokens. Pricing holds at $5/$25 per million tokens.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-claude46-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/claude46.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/claude46.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/claude46.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/claude46.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude46.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/claude46.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/claude46.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/claude46.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude46.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/claude46.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/claude46.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude46.png"
alt="Claude Opus 4.6 release announcement on claude.ai showing the new flagship model from Anthropic"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-claude46-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/claude46.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Claude Opus 4.6 release announcement on claude.ai showing the new flagship model from Anthropic" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h3 id="llm-benchmark-results-how-claude-opus-46-compares"&gt;LLM Benchmark Results: How Claude Opus 4.6 Compares&lt;/h3&gt;
&lt;p&gt;The benchmark numbers are strong across the board. Opus 4.6 hits state-of-the-art on Terminal-Bench 2.0 (65.4% for agentic coding in the terminal), Humanity&amp;rsquo;s Last Exam (complex multidisciplinary reasoning), and BrowseComp (agentic web search). It beats GPT-5.2 by roughly 144 Elo points on GDPval-AA, the benchmark that measures real-world knowledge work across 44 professional occupations.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-opus46-elo-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/opus46-elo.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/opus46-elo.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/opus46-elo.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/opus46-elo.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/opus46-elo.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/opus46-elo.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/opus46-elo.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/opus46-elo.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/opus46-elo.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/opus46-elo.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/opus46-elo.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/opus46-elo.png"
alt="GDPval-AA Elo benchmark comparison chart: Claude Opus 4.6 at 1,606 Elo vs GPT-5.2 at 1,462 Elo vs Claude Opus 4.5 at 1,416 Elo for real-world knowledge work"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-opus46-elo-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/opus46-elo.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="GDPval-AA Elo benchmark comparison chart: Claude Opus 4.6 at 1,606 Elo vs GPT-5.2 at 1,462 Elo vs Claude Opus 4.5 at 1,416 Elo for real-world knowledge work" decoding="async"&gt;
&lt;/dialog&gt;
The standout is ARC-AGI-2, which tests abstract reasoning on problems easy for humans but hard for AI. Opus 4.6 scores 68.8%, a dramatic leap from Opus 4.5&amp;rsquo;s 37.6%. For comparison, GPT-5.2 scores 54.2% and Gemini 3 Pro hits 45.1%. That gap matters because ARC-AGI-2 resists memorization — it measures whether models can actually generalize.&lt;/p&gt;
&lt;p&gt;On coding-specific evaluations, Terminal-Bench 2.0 rises to 65.4% (from 59.8% for Opus 4.5), and OSWorld for agentic computer use jumps from 66.3% to 72.7%, putting Opus ahead of both GPT-5.2 and Gemini 3 Pro on those particular tests. SWE-bench Verified shows a small regression, which is worth watching, though the model excels on the benchmarks that better reflect real production work.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-opus46-benchmarks-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/opus46-benchmarks.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/opus46-benchmarks.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/opus46-benchmarks.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/opus46-benchmarks.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/opus46-benchmarks.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/opus46-benchmarks.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/opus46-benchmarks.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/opus46-benchmarks.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/opus46-benchmarks.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/opus46-benchmarks.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/opus46-benchmarks.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/opus46-benchmarks.png"
alt="Claude Opus 4.6 LLM benchmark comparison: SOTA on Terminal-Bench 2.0, Humanity&amp;#39;s Last Exam, BrowseComp, and GDPval-AA with 90.2% on BigLaw Bench"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-opus46-benchmarks-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/opus46-benchmarks.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Claude Opus 4.6 LLM benchmark comparison: SOTA on Terminal-Bench 2.0, Humanity&amp;#39;s Last Exam, BrowseComp, and GDPval-AA with 90.2% on BigLaw Bench" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h3 id="what-can-you-do-with-a-1-million-token-context-window"&gt;What Can You Do With a 1 Million Token Context Window?&lt;/h3&gt;
&lt;p&gt;The 1M context window paired with the new context compaction feature is the upgrade that matters most in practice. To put it in perspective: 1M tokens covers roughly 750,000 words of text (several novels), an entire enterprise codebase of several thousand files, or a full legal discovery set, processed in a single prompt.&lt;/p&gt;
&lt;p&gt;Compaction automatically summarizes older context when approaching limits, which means agents can theoretically run indefinitely without hitting the wall that&amp;rsquo;s plagued long-running AI tasks. Combined with the model&amp;rsquo;s improved ability to catch its own mistakes through better code review and debugging, you&amp;rsquo;re looking at agents that can actually finish what they start.&lt;/p&gt;
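The mechanics are easy to picture. Here is a minimal sketch of what a compaction loop does, with a toy token estimate and a placeholder summarizer standing in for the real, server-side feature:

```python
# Sketch of context compaction: when the transcript nears a token budget,
# fold the oldest messages into a single summary message and keep going.
# The token estimate and summarizer below are toy stand-ins, not the real API.

TOKEN_BUDGET = 100  # real budgets are in the hundreds of thousands

def estimate_tokens(messages):
    # crude proxy: roughly one token per four characters
    return sum(len(m["content"]) // 4 for m in messages)

def summarize(messages):
    # placeholder: a real system would ask the model for this summary
    return "summary of " + str(len(messages)) + " earlier messages"

def compact(messages, keep_recent=2):
    # only compact when the transcript is over budget
    if estimate_tokens(messages) > TOKEN_BUDGET:
        old, recent = messages[:-keep_recent], messages[-keep_recent:]
        return [{"role": "user", "content": summarize(old)}] + recent
    return messages

history = [{"role": "user", "content": "x" * 200} for _ in range(5)]
compacted = compact(history)
```

The point is that the agent never sees a hard wall: older turns degrade gracefully into a summary while recent turns stay verbatim.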
&lt;p&gt;The long-context retrieval jump tells the story. On MRCR v2, which tests whether a model can find and reason over specific facts buried in massive prompts, Opus 4.6 scores 76% compared to Sonnet 4.5&amp;rsquo;s 18.5%. That&amp;rsquo;s not an incremental improvement — it&amp;rsquo;s a different capability class.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-opus46-context-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/opus46-context.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/opus46-context.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/opus46-context.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/opus46-context.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/opus46-context.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/opus46-context.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/opus46-context.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/opus46-context.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/opus46-context.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/opus46-context.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/opus46-context.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/opus46-context.png"
alt="Long-context retrieval benchmark: Claude Opus 4.6 scores 76% vs Claude Sonnet 4.5 at 18.5% on MRCR v2 needle-in-a-haystack reasoning test"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-opus46-context-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/opus46-context.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Long-context retrieval benchmark: Claude Opus 4.6 scores 76% vs Claude Sonnet 4.5 at 18.5% on MRCR v2 needle-in-a-haystack reasoning test" decoding="async"&gt;
&lt;/dialog&gt;
That said, bigger context doesn&amp;rsquo;t automatically mean better. Research from Factory.ai and others shows attention degrades across very long sequences, and prefill latency at 1M tokens can exceed two minutes before you get your first output token. The premium pricing tier for prompts exceeding 200K tokens ($10 input / $37.50 output per million tokens) reflects this cost: Anthropic isn&amp;rsquo;t subsidizing power users anymore. The real question for enterprise deployments is whether stuffing your entire codebase into context beats a well-designed RAG pipeline. The answer, as usual, depends on the use case.&lt;/p&gt;
&lt;h3 id="agentic-ai-coding-agent-teams-and-claude-code-updates"&gt;Agentic AI Coding: Agent Teams and Claude Code Updates&lt;/h3&gt;
&lt;p&gt;The headline numbers impress, but the real story is the agentic focus. Anthropic isn&amp;rsquo;t just making Claude smarter. They&amp;rsquo;re making it more useful for the actual work people want AI to do: sustained, multi-step tasks in large codebases.&lt;/p&gt;
&lt;p&gt;New API features reinforce this direction: adaptive thinking lets the model decide when to reason deeper based on contextual cues, effort controls give developers fine-grained tradeoffs between intelligence, speed, and cost (low/medium/high/max), and context compaction keeps long-running agents within limits without manual intervention.&lt;/p&gt;
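As a rough sketch of where those controls might sit in a request body; the `effort` and `context_management` field names below are illustrative assumptions rather than confirmed API fields, and only the model id comes from the release itself:

```python
# Hypothetical request body showing where effort and compaction settings
# could live. "effort" and "context_management" are assumed names for
# illustration; check the API docs for the real fields.
payload = {
    "model": "claude-opus-4-6",
    "max_tokens": 4096,
    "effort": "high",  # assumed: one of low / medium / high / max
    "context_management": {"auto_compact": True},  # assumed field
    "messages": [
        {"role": "user", "content": "Review this repository for flaky tests."}
    ],
}
```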
&lt;p&gt;Claude Code gets the headline feature: &lt;strong&gt;Agent Teams&lt;/strong&gt; that work in parallel. Multiple subagents can coordinate autonomously on read-heavy work like codebase reviews, with each agent handling a different branch via git worktrees before merging back. This ships as a research preview, but it&amp;rsquo;s clearly aimed at the production workflows where agentic coding tools like Cursor, GitHub Copilot, and OpenAI&amp;rsquo;s Codex are competing hard. The timing isn&amp;rsquo;t accidental — Apple just announced Xcode 26.3 with native support for Claude Agent and OpenAI&amp;rsquo;s Codex via MCP (Model Context Protocol), making agentic coding a standard part of the developer toolchain rather than an experiment.&lt;/p&gt;
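Under the hood, the per-agent isolation is plain git. A minimal sketch of the worktree mechanics, with made-up repo and branch names:

```shell
set -e
# a throwaway repo to demonstrate on
mkdir -p demo
git -C demo init -q
git -C demo -c user.name=a -c user.email=a@example.com commit --allow-empty -qm init
# one checkout per subagent, each on its own branch, sharing one object store
git -C demo worktree add -q -b agent-review wt-review
git -C demo worktree add -q -b agent-tests wt-tests
git -C demo worktree list
```

Each subagent edits files in its own directory without stepping on the others; merging back is an ordinary `git merge` of each branch.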
&lt;h3 id="enterprise-deployment-why-gdpval-aa-matters"&gt;Enterprise Deployment: Why GDPval-AA Matters&lt;/h3&gt;
&lt;p&gt;The GDPval-AA benchmark matters because it measures performance on real-world knowledge work — not toy problems or academic puzzles. Beating GPT-5.2 by 144 Elo points (and Opus 4.5 by 190) suggests meaningful improvements in the tasks that matter for enterprise AI adoption: financial analysis, legal reasoning, and multi-step professional workflows.&lt;/p&gt;
&lt;p&gt;The product expansions signal where Anthropic sees the market going. Claude in Excel now handles long-running tasks and unstructured data. Claude in PowerPoint reads layouts and slide masters for brand consistency. These aren&amp;rsquo;t research demos — they&amp;rsquo;re enterprise-ready integrations designed for knowledge workers who need AI that fits into existing toolchains.&lt;/p&gt;
&lt;p&gt;For teams evaluating which frontier model to standardize on, the picture is nuanced. Claude Opus 4.6 leads on agentic coding, enterprise knowledge work, and now abstract reasoning, where its ARC-AGI-2 score overtook GPT-5.2&amp;rsquo;s this release. GPT-5.2 still holds an advantage in math. Gemini 3 Pro offers the best cost efficiency and multimodal processing with its own 1M context window. The multi-model workflow trend is real: the smartest enterprise teams aren&amp;rsquo;t picking one model; they&amp;rsquo;re routing tasks to whichever model handles them best.&lt;/p&gt;
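That routing can start as nothing more than a lookup table in front of your API clients. A toy sketch, with illustrative category and model-id strings:

```python
# Toy model router: send each task category to the model the team has
# benchmarked as strongest for it. Categories and model ids are illustrative.
ROUTES = {
    "agentic_coding": "claude-opus-4-6",
    "math": "gpt-5.2",
    "bulk_multimodal": "gemini-3-pro",
}

def pick_model(task_category, default="claude-opus-4-6"):
    # fall back to a sensible default for unrecognized categories
    return ROUTES.get(task_category, default)
```

Real deployments add latency and cost weighting on top, but the principle is the same: the routing table encodes your own evaluation results, not the vendors&amp;rsquo; marketing.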
&lt;h3 id="safety-profile-and-the-zero-day-question"&gt;Safety Profile and the Zero-Day Question&lt;/h3&gt;
&lt;p&gt;One detail worth noting: the safety profile. Anthropic claims Opus 4.6 is &amp;ldquo;just as well-aligned as Opus 4.5, which was the most-aligned frontier model to date.&amp;rdquo; Given the enhanced cybersecurity capabilities (Opus 4.6 independently discovered over 500 zero-day vulnerabilities in open-source code during Anthropic&amp;rsquo;s pre-release testing), Anthropic developed six new detection probes specifically for this release.&lt;/p&gt;
&lt;p&gt;Whether that&amp;rsquo;s reassuring or concerning depends on your priors about AI capabilities research. The vulnerabilities ranged from system-crashing bugs to memory corruption flaws in widely used tools like Ghostscript and OpenSC. As Logan Graham, head of Anthropic&amp;rsquo;s frontier red team, put it: it&amp;rsquo;s a race between defenders and attackers, and Anthropic wants defenders to have the tools first.&lt;/p&gt;
&lt;h3 id="what-this-means-for-the-competitive-landscape"&gt;What This Means for the Competitive Landscape&lt;/h3&gt;
&lt;p&gt;The competitive picture just got more interesting. GPT-5.2 and Gemini 3 Pro now have a new benchmark to chase, and Anthropic has clearly staked its claim on agentic coding as the primary battleground. With pricing unchanged at $5/$25 per million tokens — significantly more expensive than GPT-5.2 at $2/$10 but competitive for the performance tier — the value proposition comes down to whether the agentic improvements translate to fewer retries, less hand-holding, and faster task completion in your specific workflow.&lt;/p&gt;
&lt;p&gt;For developers, the move is straightforward: swap in &lt;code&gt;claude-opus-4-6&lt;/code&gt; via the API and test it on your hardest tasks. For enterprise decision makers, the GDPval-AA results and Agent Teams feature are worth a serious evaluation cycle. The model is available now on claude.ai, the API, and all major cloud platforms (AWS Bedrock, Azure Foundry, GCP Vertex AI).&lt;/p&gt;</description></item><item><title>Buying the Haystack Might Not Work This Year</title><link>https://philippdubach.com/posts/buying-the-haystack-might-not-work-this-year/</link><pubDate>Sat, 31 Jan 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/buying-the-haystack-might-not-work-this-year/</guid><description>&lt;p&gt;I&amp;rsquo;ve been reading the January 2026 state of markets reports from &lt;a href="https://docs.google.com/presentation/d/e/2PACX-1vQXsMMv5ZCWm77za7oXJcz1X-Th5Mz15g5nYBxbUjnomStVcjn8lXPjE5LzAlvc_hg4yHKgwASWLo5a/pub?start=false&amp;amp;loop=false&amp;amp;delayms=3000&amp;amp;slide=id.g3b6e2578ab2_8_4858"&gt;Andreessen Horowitz&lt;/a&gt; and &lt;a href="https://www.aqr.com/Insights/Research/Alternative-Thinking/2026-Capital-Market-Assumptions-for-Major-Asset-Classes"&gt;AQR&lt;/a&gt;, and their conclusions on the AI bubble question in 2026 are almost impossible to reconcile.&lt;/p&gt;
&lt;p&gt;The a16z view is straightforward: AI fundamentals are real, and current prices reflect that reality. Their evidence is compelling. The top 50 private AI companies now generate &lt;strong&gt;$40.6 billion in annual revenue&lt;/strong&gt;. Companies like ElevenLabs and Cursor are hitting $100 million ARR faster than Slack or Twilio ever did. GPUs are running at &lt;strong&gt;80% utilization&lt;/strong&gt;, compared to the 7% utilization rate for fiber optic cables during the dotcom bubble. This isn&amp;rsquo;t speculation, they argue. It&amp;rsquo;s demand exceeding supply.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-a16z-gpu-utilization-vs-fiber-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-gpu-utilization-vs-fiber.png"
alt="GPU utilization at 80% in AI datacenters compared to just 7% fiber optic cable utilization during the early 2000s dotcom bubble"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-a16z-gpu-utilization-vs-fiber-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/a16z-gpu-utilization-vs-fiber.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="GPU utilization at 80% in AI datacenters compared to just 7% fiber optic cable utilization during the early 2000s dotcom bubble" decoding="async"&gt;
&lt;/dialog&gt;
AQR looks at the same market and sees something else entirely. Their capital market assumptions put the U.S. CAPE ratio at the &lt;strong&gt;96th percentile since 1980&lt;/strong&gt;. Expected real returns for U.S. large cap equities over the next 5-10 years? &lt;strong&gt;3.9%&lt;/strong&gt;. For a global 60/40 portfolio, just &lt;strong&gt;3.4%&lt;/strong&gt;, well below the long-term average of roughly 5% since 1900. Risk premia, in their framework, are compressed across nearly every asset class. The narrative doesn&amp;rsquo;t enter their models.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-aqr-expected-returns-summary-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/aqr-expected-returns-summary.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/aqr-expected-returns-summary.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/aqr-expected-returns-summary.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/aqr-expected-returns-summary.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/aqr-expected-returns-summary.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/aqr-expected-returns-summary.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/aqr-expected-returns-summary.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/aqr-expected-returns-summary.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/aqr-expected-returns-summary.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/aqr-expected-returns-summary.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/aqr-expected-returns-summary.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/aqr-expected-returns-summary.png"
alt="AQR medium-term expected real returns summary showing U.S. equities at 3.9%, non-U.S. developed at 5.3%, and global 60/40 at 3.4%"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-aqr-expected-returns-summary-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/aqr-expected-returns-summary.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="AQR medium-term expected real returns summary showing U.S. equities at 3.9%, non-U.S. developed at 5.3%, and global 60/40 at 3.4%" decoding="async"&gt;
&lt;/dialog&gt;
a16z points to earnings growth. The market rally hasn&amp;rsquo;t been driven by multiple expansion, they note, but by actual EPS growth. Tech P/E multiples sit around 30-35x, elevated but nowhere near the 70-80x of 2000. Tech margins have &amp;ldquo;lapped the field&amp;rdquo; at 25%+ compared to 5-8% for the rest of the S&amp;amp;P 500. The fundamentals, they insist, are doing the work.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-a16z-pe-multiples-vs-dotcom-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-pe-multiples-vs-dotcom.png"
alt="Earnings multiples are high but nowhere near dotcom levels: large cap tech trailing P/E around 30-35x today versus 70-80x in 2000"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-a16z-pe-multiples-vs-dotcom-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/a16z-pe-multiples-vs-dotcom.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Earnings multiples are high but nowhere near dotcom levels: large cap tech trailing P/E around 30-35x today versus 70-80x in 2000" decoding="async"&gt;
&lt;/dialog&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-a16z-tech-margins-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/a16z-tech-margins.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/a16z-tech-margins.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/a16z-tech-margins.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/a16z-tech-margins.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-tech-margins.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/a16z-tech-margins.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/a16z-tech-margins.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/a16z-tech-margins.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-tech-margins.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/a16z-tech-margins.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/a16z-tech-margins.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-tech-margins.png"
alt="Tech margins have lapped the field: Tech and Interactive Media at 25%&amp;#43; compared to 5-8% for the rest of the S&amp;amp;P 500"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-a16z-tech-margins-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/a16z-tech-margins.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Tech margins have lapped the field: Tech and Interactive Media at 25%&amp;#43; compared to 5-8% for the rest of the S&amp;amp;P 500" decoding="async"&gt;
&lt;/dialog&gt;
AQR&amp;rsquo;s response would be that fundamentals always look good near peaks. Their research shows a &lt;strong&gt;50% probability&lt;/strong&gt; that realized equity returns will miss estimates by more than 3 percentage points annually over the next decade. Compressed premia don&amp;rsquo;t announce themselves with blaring headlines. They just quietly erode returns until investors notice they&amp;rsquo;ve been running in place.&lt;/p&gt;
&lt;p&gt;Cumulative hyperscaler capex is projected to reach &lt;strong&gt;$4.8 trillion by 2030&lt;/strong&gt;. To achieve a 10% hurdle rate on that investment, AI revenue needs to hit roughly &lt;strong&gt;$1 trillion annually by 2030&lt;/strong&gt;, about 1% of global GDP excluding China. &lt;a href="https://fortune.com/2025/11/17/is-ai-a-bubble-goldman-sachs-market-already-priced-in-19-trillion/"&gt;Goldman Sachs estimates&lt;/a&gt; that $9 trillion in revenue could flow from the AI buildout, which at 20% margins and a 22x P/E multiple would create $35 trillion in new market cap. Only about $24 trillion has been pulled forward so far, leaving $11 trillion &amp;ldquo;on the table.&amp;rdquo;&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-a16z-ai-revenue-capex-targets-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/a16z-ai-revenue-capex-targets.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/a16z-ai-revenue-capex-targets.png"
alt="Required AI-enabled revenue to meet return on capital targets: cumulative AI investment reaching $4.8 trillion by 2030 requires roughly $1 trillion in annual AI revenue"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-a16z-ai-revenue-capex-targets-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/a16z-ai-revenue-capex-targets.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Required AI-enabled revenue to meet return on capital targets: cumulative AI investment reaching $4.8 trillion by 2030 requires roughly $1 trillion in annual AI revenue" decoding="async"&gt;
&lt;/dialog&gt;
Or not. AQR would point out that the expected return for U.S. buyouts, private equity&amp;rsquo;s bread and butter, is now &lt;strong&gt;4.2%&lt;/strong&gt;. That&amp;rsquo;s barely above the 3.9% for public large caps. The illiquidity premium has essentially vanished. If sophisticated PE firms can&amp;rsquo;t find excess returns, why should AI capex be different?&lt;/p&gt;
&lt;p&gt;I find myself uncertain, which feels like the more honest position. Neither source is disinterested. a16z manages billions in venture capital and growth equity; bullish AI narratives support their portfolio valuations and fundraising. AQR runs systematic strategies that benefit when investors diversify away from concentrated U.S. tech exposure toward international equities and alternatives. Both are talking their book, which doesn&amp;rsquo;t make either wrong, but it&amp;rsquo;s worth noting.&lt;/p&gt;
&lt;p&gt;The a16z data on utilization and revenue growth is hard to dismiss. 80% GPU utilization isn&amp;rsquo;t vaporware. Harvey users nearly tripled their time on the platform in nine months. Navan&amp;rsquo;s AI handles half of all customer interactions at satisfaction levels matching human agents. These are real products generating real engagement. But AQR&amp;rsquo;s valuation work has a longer track record. Their models don&amp;rsquo;t care about narratives, and historically that discipline has been valuable. When they say U.S. equities offer the lowest expected returns among major markets, that&amp;rsquo;s not pessimism. It&amp;rsquo;s arithmetic.&lt;/p&gt;
&lt;p&gt;The reconciliation might be this: AI winners could thrive spectacularly while broad market indices disappoint. a16z&amp;rsquo;s portfolio companies operate in a different universe than the average S&amp;amp;P 500 constituent. Compressed risk premia can coexist with individual companies generating enormous returns. The question is whether you&amp;rsquo;re buying the index or picking the winners.&lt;/p&gt;
&lt;p&gt;Non-U.S. developed markets, by the way, offer expected returns of around 5%, versus 3.9% for U.S. large caps. The valuation gap is real even if the AI story is true. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-aqr-expected-returns-equities-png-6" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/aqr-expected-returns-equities.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/aqr-expected-returns-equities.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/aqr-expected-returns-equities.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/aqr-expected-returns-equities.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/aqr-expected-returns-equities.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/aqr-expected-returns-equities.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/aqr-expected-returns-equities.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/aqr-expected-returns-equities.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/aqr-expected-returns-equities.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/aqr-expected-returns-equities.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/aqr-expected-returns-equities.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/aqr-expected-returns-equities.png"
alt="AQR expected local real returns for equities: U.S. Large at 3.9%, Eurozone at 5.0%, UK at 4.9%, Japan at 4.9%"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-aqr-expected-returns-equities-png-6" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/aqr-expected-returns-equities.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="AQR expected local real returns for equities: U.S. Large at 3.9%, Eurozone at 5.0%, UK at 4.9%, Japan at 4.9%" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;aside class="disclaimer" role="note" aria-label="Disclaimer"&gt;
&lt;div class="disclaimer-content"&gt;&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; All opinions expressed are my own. This is not investment, financial, tax, or legal advice. Past performance does not indicate future results. Do your own research and consult qualified professionals before making financial decisions. No liability accepted for any losses.&lt;/p&gt;&lt;/div&gt;
&lt;/aside&gt;</description></item><item><title>Bandits and Agents: Netflix and Spotify Recommender Stacks in 2026</title><link>https://philippdubach.com/posts/bandits-and-agents-netflix-and-spotify-recommender-stacks-in-2026/</link><pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/bandits-and-agents-netflix-and-spotify-recommender-stacks-in-2026/</guid><description>&lt;p&gt;Hyperscalers spent over &lt;a href="https://www.goldmansachs.com/insights/articles/why-ai-companies-may-invest-more-than-500-billion-in-2026"&gt;$350 billion on AI infrastructure&lt;/a&gt; in 2025 alone, with projections exceeding $500 billion in 2026. The trillion-dollar question is not whether machines can reason, but whether anyone can afford to let them. Hybrid recommender systems sit at the center of this tension. Large Language Models promised to transform how Netflix suggests your next show or how Spotify curates your morning playlist. Instead, the industry has split into two parallel universes, divided not by capability but by cost.&lt;/p&gt;
&lt;p&gt;On one side sits what engineers call the &amp;ldquo;classical stack&amp;rdquo;: matrix factorization, two-tower embedding models, and contextual bandits. These methods respond in microseconds, scale linearly with users, and run on nothing more complicated than dot products. A query costs a fraction of a cent. On the other side is the &amp;ldquo;agentic stack&amp;rdquo;: LLM-based reasoning engines that can handle requests like &amp;ldquo;find me a sci-fi movie that feels like Blade Runner but was made in the 90s.&amp;rdquo; This second approach consumes thousands of tokens per recommendation. The cost difference is not incremental; it is &lt;a href="https://www.softwareseni.com/understanding-inference-economics-and-why-ai-costs-spiral-beyond-proof-of-concept/"&gt;orders of magnitude&lt;/a&gt;. LLM inference cost economics, more than any algorithmic breakthrough, is now the dominant force shaping recommender architecture.&lt;/p&gt;
&lt;p&gt;The 2026 consensus is a hybrid architecture: use the cheap, fast models for candidate generation from millions of items, then invoke the expensive reasoning layer only for the final dozen items a user actually sees. This &amp;ldquo;funnel&amp;rdquo; pattern — retrieval, then ranking, then re-ranking — is the only way to make the economics work. The smartest model is reserved for the fewest items.&lt;/p&gt;
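The funnel economics can be made concrete with a toy sketch. Everything here is invented for illustration (the catalog, the embeddings, the `llm_score` stand-in, and the function names `retrieve` and `rerank` are all hypothetical, not anyone's production code); the point is only that the expensive scorer touches the shortlist, never the full catalog.

```python
def dot(u, v):
    """Cheap similarity: a plain dot product."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(user_vec, catalog, k=100):
    """Candidate generation: score every item, but each score is microseconds."""
    ranked = sorted(catalog, key=lambda item: dot(user_vec, item["vec"]), reverse=True)
    return ranked[:k]

def rerank(candidates, expensive_score, k=10):
    """Re-ranking: the costly model sees only the shortlist."""
    return sorted(candidates, key=expensive_score, reverse=True)[:k]

# A 1,000-item toy catalog with made-up 2-d embeddings.
catalog = [{"id": i, "vec": [(i % 7) / 7.0, (i % 13) / 13.0]} for i in range(1000)]

calls = []
def llm_score(item):
    """Stand-in for a thousands-of-tokens LLM call; we count invocations."""
    calls.append(item["id"])
    return -abs(item["id"] - 42)  # arbitrary preference for illustration

shortlist = retrieve([0.9, 0.1], catalog, k=100)  # cheap pass over 1,000 items
final = rerank(shortlist, llm_score, k=10)        # expensive pass over 100 items
```

Under this sketch the "LLM" is invoked 100 times instead of 1,000, and the user still only sees 10 items; shrinking `k` at the retrieval stage is the direct cost lever.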
&lt;p&gt;What makes this work in practice goes back to a formalism from &lt;a href="https://www.jstor.org/stable/2332286"&gt;1933&lt;/a&gt;: the multi-armed bandit. Imagine a gambler facing a row of slot machines, each with an unknown payout rate. She wants to maximize her winnings over a night of play. If she always pulls the arm with the highest observed payout, she might miss a better machine she never tried. If she explores too much, she wastes money on losers. The mathematics of this exploration–exploitation tradeoff defines &lt;em&gt;regret&lt;/em&gt;:&lt;/p&gt;
$$
R(T) = \mu^* \cdot T - \sum_{t=1}^{T} \mu(a_t)
$$&lt;p&gt;Here μ* is the best possible average reward, and μ(aₜ) is the reward from whatever arm she actually pulled at time t. Total regret is how much she left on the table by not knowing the optimal choice in advance. The goal of every multi-armed bandit algorithm in recommender systems is to drive this quantity sublinear in T — to learn fast enough that the cost of exploration vanishes relative to the horizon. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-slide10-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/slide10.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/slide10.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/slide10.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/slide10.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide10.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/slide10.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/slide10.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/slide10.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide10.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/slide10.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/slide10.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide10.png"
alt="Multi-armed bandit recommender system diagram: a Learner taking Actions and receiving Rewards from an Environment, with the goal to maximize cumulative reward or minimize cumulative regret"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-slide10-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/slide10.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Multi-armed bandit recommender system diagram: a Learner taking Actions and receiving Rewards from an Environment, with the goal to maximize cumulative reward or minimize cumulative regret" decoding="async"&gt;
&lt;/dialog&gt;
The three main exploration strategies each take a different approach: epsilon-greedy adds random noise to avoid getting stuck; Upper Confidence Bound (UCB) prefers actions with uncertain values; Thompson Sampling selects actions according to the probability they are optimal. In practice, Thompson Sampling tends to outperform the others because its exploration is guided by posterior uncertainty rather than arbitrary randomness — it explores where it matters most. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-slide12-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/slide12.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/slide12.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/slide12.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/slide12.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide12.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/slide12.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/slide12.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/slide12.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide12.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/slide12.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/slide12.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide12.png"
alt="Principles of Exploration in recommender systems: Naive Exploration (ε-greedy), Optimism in the Face of Uncertainty (UCB), and Probability Matching (Thompson Sampling)"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-slide12-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/slide12.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Principles of Exploration in recommender systems: Naive Exploration (ε-greedy), Optimism in the Face of Uncertainty (UCB), and Probability Matching (Thompson Sampling)" decoding="async"&gt;
&lt;/dialog&gt;
Every recommendation you see on &lt;a href="https://research.netflix.com/publication/lessons-learnt-from-consolidating-ml-models-in-a-large-scale-recommendation"&gt;Netflix&amp;rsquo;s homepage&lt;/a&gt; is the output of an algorithm trying to minimize exactly this quantity, whether it realizes it or not.&lt;/p&gt;
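The regret arithmetic and the three exploration strategies can be sketched in a few lines. This is an illustrative simulation with made-up Bernoulli payout rates, not production code; `simulate` accumulates the pseudo-regret form of R(T) from the formula above, charging each pull the gap between μ* and the true mean of the chosen arm.

```python
import math
import random

def simulate(policy, means, T=5000, seed=7):
    """Play T rounds of a Bernoulli bandit; return cumulative pseudo-regret R(T)."""
    rng = random.Random(seed)
    k = len(means)
    n = [0] * k      # pulls per arm
    s = [0.0] * k    # observed successes per arm
    best = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        a = policy(t, n, s, rng)
        r = 1.0 if rng.random() < means[a] else 0.0  # Bernoulli reward
        n[a] += 1
        s[a] += r
        regret += best - means[a]  # mu* - mu(a_t), summed over rounds
    return regret

def eps_greedy(eps=0.1):
    """Naive exploration: random arm with probability eps, else empirical best."""
    def pick(t, n, s, rng):
        untried = [a for a in range(len(n)) if n[a] == 0]
        if untried:
            return untried[0]
        if rng.random() < eps:
            return rng.randrange(len(n))
        return max(range(len(n)), key=lambda a: s[a] / n[a])
    return pick

def ucb(t, n, s, rng):
    """Optimism: empirical mean plus a confidence bonus that shrinks with pulls."""
    for a in range(len(n)):
        if n[a] == 0:
            return a
    return max(range(len(n)),
               key=lambda a: s[a] / n[a] + math.sqrt(2 * math.log(t) / n[a]))

def thompson(t, n, s, rng):
    """Probability matching: sample each arm's Beta posterior, pick the sampled best."""
    return max(range(len(n)),
               key=lambda a: rng.betavariate(1 + s[a], 1 + n[a] - s[a]))
```

With arms paying 0.10, 0.12, and 0.30, all three drive regret far below the roughly 633 a uniformly random player would rack up over 5,000 rounds, and Thompson Sampling's posterior-guided exploration typically lands lowest, which is the pattern the slide above describes.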
&lt;p&gt;Netflix&amp;rsquo;s recommendation algorithm architecture runs this optimization across &lt;a href="https://www.slideshare.net/slideshow/a-multiarmed-bandit-framework-for-recommendations-at-netflix/102629078"&gt;three computation layers&lt;/a&gt;. Offline systems crunch terabytes of viewing history to train deep collaborative filtering models, a process that takes hours and happens on a schedule. Nearline systems update user embeddings seconds after a click, keeping the recommendations fresh without the cost of full retraining. Online systems respond to each page load in milliseconds, combining the precomputed signals with real-time context like time of day and device type. The architecture is a &lt;a href="https://netflixtechblog.com/post-training-generative-recommenders-with-advantage-weighted-supervised-finetuning-61a538d717a9"&gt;latency-cost tradeoff&lt;/a&gt;: deep analysis happens in batch, while the user-facing layer stays fast. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-slide28-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/slide28.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/slide28.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/slide28.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/slide28.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide28.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/slide28.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/slide28.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/slide28.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide28.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/slide28.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/slide28.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide28.png"
alt="Netflix recommendation algorithm architecture: Member Activity and Contextual Information flow through an Offline System for model training, then to an Online System where the Multi-Armed Bandit produces recommendations"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-slide28-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/slide28.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Netflix recommendation algorithm architecture: Member Activity and Contextual Information flow through an Offline System for model training, then to an Online System where the Multi-Armed Bandit produces recommendations" decoding="async"&gt;
&lt;/dialog&gt;
What Netflix learned from a decade of experimentation is counterintuitive. The goal is not to recommend what users will definitely watch, but what they would not have found on their own. They call this &amp;ldquo;incrementality.&amp;rdquo; A greedy algorithm that always surfaces the highest-probability titles just confirms what users already knew — it exploits without exploring, and in doing so collapses the discovery space. A better approach is to measure the &lt;em&gt;causal effect&lt;/em&gt; of the recommendation: how much does showing this thumbnail increase the probability of a play compared to not showing it? Some titles have low baseline interest but high incrementality. Those are the ones worth featuring. This is the exploration–exploitation tradeoff made concrete: the value of a recommendation is not its predicted rating, but its marginal contribution to discovery. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-slide41-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/slide41.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/slide41.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/slide41.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/slide41.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide41.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/slide41.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/slide41.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/slide41.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide41.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/slide41.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/slide41.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/slide41.png"
alt="Netflix incrementality analysis: scatter plot showing incremental probability vs baseline probability, where Title A has low baseline but high incremental lift, while Title C has high baseline but less benefit from featuring"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-slide41-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/slide41.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Netflix incrementality analysis: scatter plot showing incremental probability vs baseline probability, where Title A has low baseline but high incremental lift, while Title C has high baseline but less benefit from featuring" decoding="async"&gt;
&lt;/dialog&gt;
Spotify&amp;rsquo;s AI DJ recommender system takes a different approach to the same problem. Their &amp;ldquo;&lt;a href="https://research.atspotify.com/2025/9/you-say-search-i-say-recs-a-scalable-agentic-approach-to-query-understanding"&gt;AI DJ&lt;/a&gt;&amp;rdquo; feature uses what engineers internally call the &amp;ldquo;agentic router.&amp;rdquo; When you ask for &amp;ldquo;music for a rainy reading session in 1990s Seattle,&amp;rdquo; the router decides whether to invoke the expensive LLM reasoning layer or just fall back to keyword matching against collaborative filtering embeddings. Complex queries get the big model; simple ones get the fast path. This router is the economic governor of the entire system — an inference cost optimizer disguised as a product feature. Underneath the DJ&amp;rsquo;s personality, built on Spotify&amp;rsquo;s Sonantic voice synthesis and LLM-generated contextual narratives, sits a bandit framework called BaRT (Bandits for Recommendations as Treatments) that quietly balances what you know you like against what you might not yet know you need.&lt;/p&gt;
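Spotify has not published the router's internals, so here is only a toy sketch of the economic logic: a hypothetical heuristic (`route`), made-up per-query costs, and an invented cue list, none of it Spotify's actual system. The idea is simply that a cheap classifier decides which queries are worth the expensive path.

```python
# Assumed per-query costs, invented for illustration.
FAST_COST = 0.0001  # cheap path: embedding lookup / keyword match
LLM_COST = 0.02     # expensive path: LLM reasoning over the query

# Hypothetical cues that a query needs semantic reasoning, not keyword matching.
REASONING_CUES = ("like", "feels", "mood", "era", "reminds")

def route(query: str) -> str:
    """Send a query to the LLM path only if it looks like it needs reasoning."""
    words = query.lower().split()
    if len(words) > 6 or any(cue in words for cue in REASONING_CUES):
        return "llm"
    return "fast"

def serving_cost(queries):
    """Total cost of serving a batch of queries through the router."""
    return sum(LLM_COST if route(q) == "llm" else FAST_COST for q in queries)
```

If, say, 1 query in 100 takes the LLM path, the router's batch costs a small fraction of sending everything through the LLM, which is why the routing decision, not the model, is the economic governor.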
&lt;p&gt;Not everyone is convinced the algorithms are making us better off. My own &lt;a href="https://philippdubach.com/posts/social-media-success-prediction-bert-models-for-post-titles/"&gt;analysis of social media success prediction&lt;/a&gt; found that sophisticated language models often just memorize temporal patterns rather than learning what actually makes content good. They learn the news cycle, not the news.&lt;/p&gt;
&lt;p&gt;The risk is that we build hybrid recommender systems that are technically brilliant but experientially hollow, engineering away the serendipity that made discovery meaningful in the first place. The recommender is becoming a curator, and the curator is becoming an agent. The architecture will keep evolving — foundation models for recommendations, reinforcement learning from human feedback applied to discovery, inference costs that continue their &lt;a href="https://a16z.com/llmflation-llm-inference-cost/"&gt;10× annual decline&lt;/a&gt; — but the open question for 2026 is whether we want to be the curators of our own lives, or merely consumers of an optimized feed.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Slides courtesy of &amp;ldquo;&lt;a href="https://www.slideshare.net/slideshow/a-multiarmed-bandit-framework-for-recommendations-at-netflix/102629078"&gt;A Multi-Armed Bandit Framework for Recommendations at Netflix&lt;/a&gt;&amp;rdquo; by Jaya Kawale, Netflix.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>The Most Expensive Assumption in AI</title><link>https://philippdubach.com/posts/the-most-expensive-assumption-in-ai/</link><pubDate>Mon, 26 Jan 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/the-most-expensive-assumption-in-ai/</guid><description>&lt;p&gt;Sara Hooker&amp;rsquo;s paper arrived with impeccable timing. &lt;em&gt;&lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5877662"&gt;On the slow death of scaling&lt;/a&gt;&lt;/em&gt; dropped just as hyperscalers are committing another $500 billion to GPU infrastructure, bringing total industry deployment into the scaling thesis somewhere north of a trillion dollars. I&amp;rsquo;ve been &lt;a href="https://philippdubach.com/posts/how-ai-is-shaping-my-investment-portfolio-for-2026/"&gt;tracking these capital flows&lt;/a&gt; for my own portfolio. Either Hooker is early to a generational insight or she&amp;rsquo;s about to be very publicly wrong.&lt;figure class="post-figure" style="width: 100%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-hyperscaler_capex2-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/hyperscaler_capex2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/hyperscaler_capex2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/hyperscaler_capex2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/hyperscaler_capex2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hyperscaler_capex2.png 1200w"
sizes="100vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/hyperscaler_capex2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/hyperscaler_capex2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/hyperscaler_capex2.png 1440w"
sizes="100vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hyperscaler_capex2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/hyperscaler_capex2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/hyperscaler_capex2.png 2000w"
sizes="100vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hyperscaler_capex2.png"
alt="Hyperscaler AI capital expenditure 2019-2025"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-hyperscaler_capex2-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/hyperscaler_capex2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hyperscaler AI capital expenditure 2019-2025" decoding="async"&gt;
&lt;/dialog&gt;
The core argument is very simple: bigger is not always better. &lt;a href="https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas"&gt;Llama-3 8B outperforms Falcon 180B&lt;/a&gt;. &lt;a href="https://arxiv.org/abs/2211.05100"&gt;Aya 23 8B beats BLOOM 176B&lt;/a&gt; despite having only 4.5% of the parameters. These are not isolated flukes. Hooker plots submissions to the Open LLM Leaderboard over two years and finds a systematic trend where compact models consistently outperform their bloated predecessors. The bitter lesson, as Rich Sutton framed it, was that brute force compute always wins. Hooker&amp;rsquo;s counter is that maybe we&amp;rsquo;ve been held hostage to &amp;ldquo;a painfully simple formula&amp;rdquo; that&amp;rsquo;s now breaking down.&lt;figure class="post-figure" style="width: 100%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-model_size_vs_performance2-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/model_size_vs_performance2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/model_size_vs_performance2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/model_size_vs_performance2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/model_size_vs_performance2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/model_size_vs_performance2.png 1200w"
sizes="100vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/model_size_vs_performance2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/model_size_vs_performance2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/model_size_vs_performance2.png 1440w"
sizes="100vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/model_size_vs_performance2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/model_size_vs_performance2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/model_size_vs_performance2.png 2000w"
sizes="100vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/model_size_vs_performance2.png"
alt="Model size vs benchmark performance showing smaller models outperforming larger ones"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-model_size_vs_performance2-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/model_size_vs_performance2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Model size vs benchmark performance showing smaller models outperforming larger ones" decoding="async"&gt;
&lt;/dialog&gt;
Scaling laws, she notes, only reliably predict pre-training test loss. When you look at actual downstream performance, the results are &amp;ldquo;murky or inconsistent.&amp;rdquo; The term &amp;ldquo;emergent properties&amp;rdquo; gets thrown around to describe capabilities that appear suddenly at scale, but Hooker points out this is really just a fancy way of admitting we have no idea what&amp;rsquo;s coming. If your scaling law can&amp;rsquo;t predict emergence, it&amp;rsquo;s not much of a law.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Gary_Marcus"&gt;Gary Marcus&lt;/a&gt; has been making a related argument from a different angle. The cognitive scientist, whose 2001 book predicted hallucination problems, calls LLMs &amp;ldquo;glorified memorization machines&amp;rdquo; that work because the internet contains answers to most common queries. His framing is less academic and more market-oriented: the jump from GPT-1 to GPT-4 showed obvious qualitative leaps requiring no benchmarks. The jump from GPT-4 to GPT-5? Marginal improvements requiring careful measurement. The textbook definition of diminishing returns.&lt;/p&gt;
&lt;p&gt;The market signals are worth watching. According to &lt;a href="https://www.ft.com/content/a081aa60-eaca-4413-ba15-489762154c57"&gt;Goldman Sachs data&lt;/a&gt;, hedge fund short interest in utilities now sits at the 99th percentile relative to the past five years. Utilities. The bet appears to be that AI data center demand, the premise on which &lt;a href="https://www.reuters.com/business/energy/american-electric-power-signs-265-billion-deal-fuel-cells-2026-01-08/"&gt;American Electric Power trades at $65 billion&lt;/a&gt;, may not materialize as expected. Meanwhile, names like Bloom Energy, Oracle, and various AI-adjacent plays are showing up on heavily-shorted lists. Hedge funds aren&amp;rsquo;t yet betting against Nvidia directly, but they&amp;rsquo;re circling the weaker members of the herd.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a certain irony here that Hooker captures well. Academia was effectively priced out of meaningful AI research by the compute arms race. The explosion in necessary compute &amp;ldquo;marginalized academia from meaningfully participating in AI progress.&amp;rdquo; Industry labs stopped publishing to preserve commercial advantage. Now, as scaling hits diminishing returns, the skills that matter shift back toward algorithmic cleverness, data quality, and architectural innovation. Things that don&amp;rsquo;t require a billion-dollar data center. If you got priced out of the game, the game may be coming back to you. Hooker writes,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The less reliable gains from compute makes our purview as computer scientists interesting again&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The quiet tell is how frontier labs are actually behaving. Major players are now incorporating classical symbolic tools, things like Python interpreters and code execution, into LLM pipelines. These symbolic components run on CPUs, not GPUs. &lt;a href="https://en.wikipedia.org/wiki/Ilya_Sutskever"&gt;Ilya Sutskever&lt;/a&gt;, coauthor of the 2012 ImageNet paper and OpenAI cofounder, publicly stated that&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We need to go back to the age of research&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Shorting the scaling thesis has been a widow-maker trade for the better part of three years. Nvidia is up roughly 800% since 2022. As I&amp;rsquo;ve &lt;a href="https://philippdubach.com/posts/the-market-can-stay-irrational-longer-than-you-can-stay-solvent/"&gt;written before&lt;/a&gt;, the market can remain irrational longer than you can remain solvent, and that applies in both directions. OpenAI reportedly burns around $3 billion monthly; a $40 billion funding round implies perhaps 13 months of runway. If the next mega-round prices down or requires distressed terms, that&amp;rsquo;s your signal. Until then, the thesis may be directionally correct on the technical limitations while the timing remains treacherous.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We can only see a short distance ahead, but we can see plenty there that needs to be done.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So Alan Turing wrote, in a line Hooker quotes approvingly. The scaling era produced real capabilities alongside real capital misallocation. What comes next is genuinely uncertain. That uncertainty cuts both ways.&lt;/p&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-download_overview-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/download_overview.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/download_overview.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/download_overview.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/download_overview.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/download_overview.png 1200w"
sizes="100vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/download_overview.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/download_overview.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/download_overview.png 1440w"
sizes="100vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/download_overview.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/download_overview.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/download_overview.png 2000w"
sizes="100vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/download_overview.png"
alt="Report Header Overview"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-download_overview-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/download_overview.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Report Header Overview" decoding="async"&gt;
&lt;/dialog&gt;
The result is a comprehensive report, backed by more than 30 sources. You can download &lt;a href="https://static.philippdubach.com/pdf/Enterprise_AI_Strategy2026_philippdubach.pdf"&gt;the full report&lt;/a&gt;
and the &lt;a href="https://static.philippdubach.com/pdf/Enterprise_AI_Strategy2026_Deck_philippdubach.pdf"&gt;accompanying presentation&lt;/a&gt; for free.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Global AI spending hit $13.8 billion, a six-fold increase since late 2023. Yet 85% of AI projects never reach production. Only 26% of companies can translate pilots into outcomes. The gap between ambition and execution has become so predictable that Gartner now officially places generative AI in the &amp;ldquo;&lt;a href="https://www.snaplogic.com/lp/gartner-magic-quadrant-ipaas-2025?utm_source=GOOG&amp;amp;utm_medium=PS&amp;amp;utm_campaign=Content_AR_Gartner-iPaas-MQ-2025&amp;amp;_bt=778769312143&amp;amp;_bk=gartner%20ipaas%20magic%20quadrant&amp;amp;_utm_term=gartner%20ipaas%20magic%20quadrant&amp;amp;_bm=b&amp;amp;_bn=g&amp;amp;saf_src=google_g&amp;amp;saf_pt=&amp;amp;saf_kw=gartner%20ipaas%20magic%20quadrant&amp;amp;saf_dv=&amp;amp;saf_cam=23125873381&amp;amp;saf_grp=186359808906&amp;amp;saf_ad=778769312143&amp;amp;saf_acc=4847116121&amp;amp;saf_cam_tp=search&amp;amp;gad_source=1&amp;amp;gad_campaignid=23125873381&amp;amp;gbraid=0AAAAAD3MpSl-QdXUDpLVTClnJRS_g2cQ-&amp;amp;gclid=Cj0KCQiA1czLBhDhARIsAIEc7ugOJcXK_OoRuxk2au4MhOAaluMKdTwxFcl3uPdWSMcYdLd0JAogI7QaAvbeEALw_wcB"&gt;trough of disillusionment&lt;/a&gt;.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s an economic concept called &lt;a href="https://en.wikipedia.org/wiki/Jevons_paradox"&gt;Jevons paradox&lt;/a&gt; &lt;em&gt;(yes, I &lt;a href="https://notes.philippdubach.com/0005"&gt;referenced this before&lt;/a&gt;)&lt;/em&gt;. When efficiency improves for a resource, consumption increases, not decreases. Coal-efficient steam engines didn&amp;rsquo;t reduce coal usage, they made coal so useful that demand exploded. The same logic applies to organizational communication. Email was supposed to reduce meetings. Slack was supposed to reduce email. AI was supposed to reduce everything.&lt;/p&gt;
&lt;p&gt;Instead, the average employee now spends 57% of their workday on coordination: communicating, updating, aligning. Meetings alone cost the US economy $532 billion per year. This is the coordination layer, where organizations actually run, and where organizations quietly bleed.&lt;/p&gt;
&lt;p&gt;Three observations:&lt;/p&gt;
&lt;p&gt;(1) Only 26% of companies have the maturity to translate AI pilots into outcomes. The rest are layering AI on legacy workflows instead of redesigning them.&lt;br&gt;
(2) Language models bridge the gap between messy human communication and structured data. Transcripts to CRM fields. Teams using these tools report 30% higher win rates and 80% less manual work.&lt;br&gt;
(3) AI gains compound when shareable. A summary helps one person. A system that captures and distributes knowledge helps everyone downstream.&lt;/p&gt;
&lt;p&gt;The coordination layer isn&amp;rsquo;t glamorous. It&amp;rsquo;s transcripts, status updates, action items, CRM entries. It&amp;rsquo;s the administrative exhaust of getting anything done with other people. And it&amp;rsquo;s almost entirely composed of language. We have language models now. Models that extract structured data from messy transcripts, convert meeting notes into CRM fields with 99% accuracy. Sales teams using these tools report 30% higher win rates and 80% less manual work.&lt;/p&gt;
&lt;p&gt;Yet most enterprise AI strategies ignore this entirely. They&amp;rsquo;re focused on chatbots and demos for board presentations. Meanwhile, the language processing that constitutes the primary workload of any modern business remains stuck in the same recursive loops. The winners won&amp;rsquo;t be companies with great AI announcements. They&amp;rsquo;ll be the ones building daily habits early enough for the gains to stack.&lt;/p&gt;</description></item><item><title>Does AI mean the demand on labor goes up?</title><link>https://philippdubach.com/posts/does-ai-mean-the-demand-on-labor-goes-up/</link><pubDate>Thu, 15 Jan 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/does-ai-mean-the-demand-on-labor-goes-up/</guid><description>&lt;p&gt;&lt;a href="https://x.com/TheStalwart/status/2011418760813629738"&gt;Joe Weisenthal&lt;/a&gt; from Bloomberg, this week:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All my shower thoughts now are about designing efficient workflows for synthesizing, collecting, labeling and annotating data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Same. Since I started building every app and tool I thought would make my life easier or my workflow more efficient, I haven&amp;rsquo;t stopped. Apparently &lt;a href="https://techcrunch.com/2026/01/16/the-rise-of-micro-apps-non-developers-are-writing-apps-instead-of-buying-them/"&gt;non-developers are now writing apps&lt;/a&gt; instead of buying them. This is the AI productivity paradox in miniature: the tools get better and we do more, not less.&lt;/p&gt;
&lt;p&gt;The assumed narrative is still AI displaces jobs, humans collect UBI, society figures out leisure. But the trajectory might be more work, not less. A &lt;a href="https://cepr.org/voxeu/columns/ais-power-grows-so-does-our-workday"&gt;recent NBER study&lt;/a&gt; found that workers in AI-exposed occupations now work roughly 3 extra hours per week—and leisure time has dropped by the same amount. &lt;a href="https://investors.upwork.com/news-releases/news-release-details/upwork-study-finds-employee-workloads-rising-despite-increased-c"&gt;Upwork&amp;rsquo;s research&lt;/a&gt; puts it bluntly: 77% of employees say AI tools have &lt;em&gt;added&lt;/em&gt; to their workload.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/Jevons_paradox"&gt;Jevons paradox&lt;/a&gt; is 160 years old: when James Watt made steam engines more efficient, coal consumption didn&amp;rsquo;t fall. It exploded. Efficiency made coal useful in new ways. Satya Nadella &lt;a href="https://www.npr.org/sections/planet-money/2025/02/04/g-s1-46018/ai-deepseek-economics-jevons-paradox"&gt;referenced this for AI&lt;/a&gt; after DeepSeek rattled the markets. Erik Brynjolfsson argues it applies to AI-augmented occupations—coders, radiologists, translators. Make something more efficient and you find more things to do with it.&lt;/p&gt;
&lt;p&gt;When I can build an app in a weekend that used to take months, I don&amp;rsquo;t build one. I build six. When I can write a report in an hour, I write five. The friction that once protected us from infinite expectations evaporates. This is the Jevons paradox applied not just to markets or coal, but to our own time and cognitive capacity—a kind of psychological rebound effect where internal expectations outrun what&amp;rsquo;s actually sustainable.&lt;/p&gt;
&lt;p&gt;Keynes predicted a &lt;a href="http://www.econ.yale.edu/smith/econ116a/keynes1.pdf"&gt;15-hour work week&lt;/a&gt; by now. We got the productivity gains. We work longer hours than ever. Only &lt;a href="https://hellofuture.orange.com/en/the-ai-productivity-paradox-the-new-tech-may-be-eating-into-your-leisure-time/"&gt;21% of employees&lt;/a&gt; actually use the time AI saves them for personal life. The rest reinvest it right back into work. When capability expands, so does the definition of &amp;ldquo;enough.&amp;rdquo; The bar rises.&lt;/p&gt;
&lt;p&gt;If AI makes me 10x more productive, that&amp;rsquo;s not 10x more free time. That&amp;rsquo;s 10x more I &lt;em&gt;could&lt;/em&gt; be doing. In a competitive environment—founding, climbing, anything with stakes—someone who uses that 10x while I rest will outrun me. The fear was displacement. The reality might be inescapability.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Parkinson%27s_law#First_meaning"&gt;Parkinson&amp;rsquo;s Law&lt;/a&gt;: work expands to fill time available. The AI corollary: work expands to fill capabilities available. More capability means more possibility—and more obligation. We should know where this points.&lt;/p&gt;</description></item><item><title>Social Media Success Prediction: BERT Models for Post Titles</title><link>https://philippdubach.com/posts/social-media-success-prediction-bert-models-for-post-titles/</link><pubDate>Sat, 10 Jan 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/social-media-success-prediction-bert-models-for-post-titles/</guid><description>&lt;p&gt;Last week I published a &lt;a href="https://philippdubach.com/standalone/hn-sentiment/"&gt;Hacker News title sentiment analysis&lt;/a&gt; based on the &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5910263"&gt;Attention Dynamics in Online Communities&lt;/a&gt; paper I have been working on. The &lt;a href="https://news.ycombinator.com/item?id=46512881"&gt;discussion on Hacker News&lt;/a&gt; raised the obvious question: can you actually predict what will do well here?&lt;figure class="post-figure" style="width: 70%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-https:--static-philippdubach-com-hn_post_frontpage2-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 1200w"
sizes="70vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 1440w"
sizes="70vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png 2000w"
sizes="70vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png"
alt="Hacker News Frontpage"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-https:--static-philippdubach-com-hn_post_frontpage2-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/https://static.philippdubach.com/hn_post_frontpage2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hacker News Frontpage" decoding="async"&gt;
&lt;/dialog&gt;
The honest answer is: partially. Timing matters. News cycles matter. Who submits matters. Weekend versus Monday morning matters. Most of these factors aren&amp;rsquo;t in the title. But titles aren&amp;rsquo;t nothing either. &amp;ldquo;Show HN&amp;rdquo; signals something. So does phrasing, length, and topic selection. The question becomes: how much signal can you extract from 80 characters?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://news.ycombinator.com/news"&gt;Hacker News&lt;/a&gt; (HN) is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator &lt;a href="https://www.ycombinator.com"&gt;Y Combinator&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This isn&amp;rsquo;t new territory. &lt;a href="https://minimaxir.com/2017/06/reddit-deep-learning/"&gt;Max Woolf built a Reddit submission predictor&lt;/a&gt; back in 2017, and &lt;a href="https://ontology2.com/essays/ClassifyingHackerNewsArticles/"&gt;ontology2 trained an HN classifier&lt;/a&gt; using logistic regression on title words. Both found similar ceilings, around 0.76 AUC, with classical approaches. I wanted to see what modern transformers could add.&lt;/p&gt;
&lt;p&gt;The baseline was DistilBERT, fine-tuned on 90,000 HN posts. ROC AUC of 0.654, trained in about 20 minutes on a T4 GPU. Not bad for something that only sees titles. Then RoBERTa with label smoothing pushed it to 0.692. Progress felt easy.&lt;figure class="post-figure" style="width: 70%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-03_roc_curve-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/03_roc_curve.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/03_roc_curve.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/03_roc_curve.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/03_roc_curve.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/03_roc_curve.png 1200w"
sizes="70vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/03_roc_curve.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/03_roc_curve.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/03_roc_curve.png 1440w"
sizes="70vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/03_roc_curve.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/03_roc_curve.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/03_roc_curve.png 2000w"
sizes="70vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/03_roc_curve.png"
alt="ROC curve comparing model versions"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-03_roc_curve-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/03_roc_curve.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="ROC curve comparing model versions" decoding="async"&gt;
&lt;/dialog&gt;
What if sentence embeddings captured something classification heads missed? I built an ensemble: &lt;a href="https://www.sbert.net/"&gt;SBERT&lt;/a&gt; for semantic features, RoBERTa for discrimination, weighted average at the end. The validation AUC jumped to 0.714.&lt;/p&gt;
&lt;p&gt;The problem was hiding in the train/test split. I&amp;rsquo;d used random sampling. HN has strong temporal correlations: topics cluster, writing styles evolve, news cycles create duplicates. A random split let the model see the future. SBERT&amp;rsquo;s semantic embeddings matched near-duplicate posts across the split perfectly.&lt;/p&gt;
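&lt;p&gt;A chronological cut avoids this leakage. A minimal sketch in pure Python (illustrative only, not the experiment code; the field names are made up):&lt;/p&gt;

```python
# Sketch of a leakage-free temporal split. Illustrative only, not the
# pipeline used for the experiments; field names are assumptions.

def temporal_split(rows, test_frac=0.2):
    """Sort posts chronologically and cut once, so every test post is
    strictly newer than every training post. Near-duplicate posts from
    the same news cycle can no longer straddle the boundary."""
    rows = sorted(rows, key=lambda r: r["time"])
    cutoff = int(len(rows) * (1 - test_frac))
    return rows[:cutoff], rows[cutoff:]

# Toy example: 10 posts with increasing timestamps.
posts = [{"time": t, "title": f"post {t}"} for t in range(10)]
train, test = temporal_split(posts)
assert max(r["time"] for r in train) < min(r["time"] for r in test)
```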
&lt;p&gt;When I switched to a strict temporal split, training on 2022-early 2024 and testing on late 2024 onward, the ensemble dropped to 0.693. More revealing: the optimal SBERT weight went from 0.35 to 0.10. SBERT was contributing almost nothing. The model had memorized temporal patterns, not learned to predict.&lt;figure class="post-figure" style="width: 70%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-02_calibration-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/02_calibration.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/02_calibration.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/02_calibration.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/02_calibration.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/02_calibration.png 1200w"
sizes="70vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/02_calibration.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/02_calibration.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/02_calibration.png 1440w"
sizes="70vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/02_calibration.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/02_calibration.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/02_calibration.png 2000w"
sizes="70vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/02_calibration.png"
alt="Calibration plot showing predicted vs actual probabilities"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-02_calibration-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/02_calibration.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Calibration plot showing predicted vs actual probabilities" decoding="async"&gt;
&lt;/dialog&gt;
I kept RoBERTa and added more regularization: dropout up from 0.1 to 0.2, weight decay up from 0.01 to 0.05, and the lower six transformer layers frozen. The model got worse at fitting training data. Train AUC dropped from 0.803 to 0.727.&lt;/p&gt;
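&lt;p&gt;The layer-freezing rule reduces to a name filter over the model&amp;rsquo;s parameters. A sketch, assuming Hugging Face-style parameter names; the helper and the name patterns are my own illustration, not the training code used here:&lt;/p&gt;

```python
import re

# Sketch of the freezing rule, assuming Hugging Face-style parameter
# names such as "roberta.encoder.layer.3.attention.self.query.weight".
# The helper name and patterns are assumptions, not the actual code.

def should_freeze(param_name, n_frozen_layers=6):
    """Freeze the embeddings and the lowest n encoder layers; leave
    the upper layers and the classification head trainable."""
    if param_name.startswith("roberta.embeddings."):
        return True
    m = re.match(r"roberta\.encoder\.layer\.(\d+)\.", param_name)
    return bool(m) and int(m.group(1)) < n_frozen_layers

# In a fine-tuning loop this would be applied as:
# for name, p in model.named_parameters():
#     p.requires_grad = not should_freeze(name)
```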
&lt;p&gt;But the train-test gap collapsed from 0.109 to 0.042. That&amp;rsquo;s a 61% reduction in overfitting. Test AUC of 0.685 versus the ensemble&amp;rsquo;s 0.693, a difference that vanishes once you account for confidence intervals. And now inference runs on a single model, half the latency, no SBERT dependency, 500MB instead of 900MB.&lt;figure class="post-figure" style="width: 70%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-table_version_comparison-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/table_version_comparison.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/table_version_comparison.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/table_version_comparison.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/table_version_comparison.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_version_comparison.png 1200w"
sizes="70vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/table_version_comparison.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/table_version_comparison.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/table_version_comparison.png 1440w"
sizes="70vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_version_comparison.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/table_version_comparison.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/table_version_comparison.png 2000w"
sizes="70vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_version_comparison.png"
alt="Model version comparison showing evolution from V1 to V7"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-table_version_comparison-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/table_version_comparison.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Model version comparison showing evolution from V1 to V7" decoding="async"&gt;
&lt;/dialog&gt;
&lt;figure class="post-figure" style="width: 70%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-06_score_by_category-png-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/06_score_by_category.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/06_score_by_category.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/06_score_by_category.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/06_score_by_category.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/06_score_by_category.png 1200w"
sizes="70vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/06_score_by_category.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/06_score_by_category.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/06_score_by_category.png 1440w"
sizes="70vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/06_score_by_category.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/06_score_by_category.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/06_score_by_category.png 2000w"
sizes="70vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/06_score_by_category.png"
alt="Prediction scores by content category"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-06_score_by_category-png-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/06_score_by_category.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Prediction scores by content category" decoding="async"&gt;
&lt;/dialog&gt;
The other lesson was calibration. A model that says 0.8 probability should mean &amp;ldquo;80% of posts I give this score actually hit 100 points.&amp;rdquo; Neural networks trained on cross-entropy don&amp;rsquo;t do this naturally. They&amp;rsquo;re overconfident. I used &lt;a href="https://scikit-learn.org/stable/modules/isotonic.html"&gt;isotonic regression&lt;/a&gt; on the validation set to fix the mapping. Expected calibration error (ECE) measures this gap:&lt;/p&gt;
$$ECE = \sum_{b=1}^{B} \frac{n_b}{N} \left| \text{acc}(b) - \text{conf}(b) \right|$$&lt;p&gt;where you bin predictions by confidence, then measure how far off the actual accuracy is from the predicted confidence in each bin. ECE went from 0.089 to 0.043. Now when the model says 0.4, it&amp;rsquo;s telling the truth.&lt;/p&gt;
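&lt;p&gt;The formula translates directly into code. An illustrative implementation (mine, not the evaluation script behind these numbers):&lt;/p&gt;

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by confidence, then sum the gap between each
    bin's mean predicted probability and its empirical hit rate,
    weighted by bin size (the ECE formula above)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)  # mean confidence
            acc = sum(y for _, y in b) / len(b)   # empirical hit rate
            ece += len(b) / n * abs(acc - conf)
    return ece

# An overconfident model: predicts 0.95 but only 1 of 4 posts hit.
print(round(expected_calibration_error([0.95] * 4, [1, 0, 0, 0]), 3))  # 0.7
```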
&lt;p&gt;In practice, the model provides meaningful lift. If you only look at the top 10% of predictions by score, 62% of them are actual hits, roughly 1.9x better than random selection:&lt;figure class="post-figure" style="width: 50%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-table_lift_analysis-png-6" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/table_lift_analysis.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/table_lift_analysis.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/table_lift_analysis.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/table_lift_analysis.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_lift_analysis.png 1200w"
sizes="50vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/table_lift_analysis.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/table_lift_analysis.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/table_lift_analysis.png 1440w"
sizes="50vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_lift_analysis.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/table_lift_analysis.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/table_lift_analysis.png 2000w"
sizes="50vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_lift_analysis.png"
alt="Lift analysis showing precision at different thresholds"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-table_lift_analysis-png-6" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/table_lift_analysis.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Lift analysis showing precision at different thresholds" decoding="async"&gt;
&lt;/dialog&gt;
&lt;figure class="post-figure" style="width: 70%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-08_calibration_error-png-7" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/08_calibration_error.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/08_calibration_error.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/08_calibration_error.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/08_calibration_error.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/08_calibration_error.png 1200w"
sizes="70vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/08_calibration_error.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/08_calibration_error.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/08_calibration_error.png 1440w"
sizes="70vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/08_calibration_error.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/08_calibration_error.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/08_calibration_error.png 2000w"
sizes="70vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/08_calibration_error.png"
alt="Calibration error distribution"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-08_calibration_error-png-7" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/08_calibration_error.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Calibration error distribution" decoding="async"&gt;
&lt;/dialog&gt;
About training speed: I used an &lt;a href="https://www.nvidia.com/en-us/data-center/h100/"&gt;NVIDIA H100 GPU&lt;/a&gt;, which costs around 18x more per hour than a T4 on hosted (Google Colab) runtimes. A sensible middle ground would be an A100 (40 or 80GB VRAM) or an L4, which train 3-5x faster than a T4: maybe 5-7 minutes instead of 20-30. But watching epochs fly by at ~130 iterations per second, after coming from the T4&amp;rsquo;s ~3 iterations per second, was a different experience. &lt;figure class="post-figure" style="width: 70%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-colab-training-hn-png-8" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/colab-training-hn.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/colab-training-hn.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/colab-training-hn.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/colab-training-hn.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/colab-training-hn.png 1200w"
sizes="70vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/colab-training-hn.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/colab-training-hn.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/colab-training-hn.png 1440w"
sizes="70vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/colab-training-hn.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/colab-training-hn.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/colab-training-hn.png 2000w"
sizes="70vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/colab-training-hn.png"
alt="Colab notebook showing H100 training at 130 it/s"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-colab-training-hn-png-8" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/colab-training-hn.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Colab notebook showing H100 training at 130 it/s" decoding="async"&gt;
&lt;/dialog&gt;
The model learned some intuitive patterns. &amp;ldquo;Show HN&amp;rdquo; titles score higher. Deep technical dives do well. Generic news aggregation doesn&amp;rsquo;t. Titles between 40 and 80 characters perform better than very short or very long ones. Some of this probably reflects real engagement patterns. Some of it is noise the model hasn&amp;rsquo;t been sufficiently regularized to ignore.&lt;figure class="post-figure" style="width: 70%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-10_title_length_performance-png-9" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/10_title_length_performance.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/10_title_length_performance.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/10_title_length_performance.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/10_title_length_performance.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/10_title_length_performance.png 1200w"
sizes="70vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/10_title_length_performance.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/10_title_length_performance.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/10_title_length_performance.png 1440w"
sizes="70vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/10_title_length_performance.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/10_title_length_performance.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/10_title_length_performance.png 2000w"
sizes="70vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/10_title_length_performance.png"
alt="Model performance by title length"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-10_title_length_performance-png-9" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/10_title_length_performance.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Model performance by title length" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;p&gt;Running a few titles through the model shows what it picks up on:&lt;figure class="post-figure" style="width: 70%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-table_title_workshop-png-10" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/table_title_workshop.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/table_title_workshop.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/table_title_workshop.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/table_title_workshop.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_title_workshop.png 1200w"
sizes="70vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/table_title_workshop.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/table_title_workshop.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/table_title_workshop.png 1440w"
sizes="70vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_title_workshop.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/table_title_workshop.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/table_title_workshop.png 2000w"
sizes="70vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/table_title_workshop.png"
alt="Title workshop showing model predictions for different phrasings"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-table_title_workshop-png-10" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/table_title_workshop.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Title workshop showing model predictions for different phrasings" decoding="async"&gt;
&lt;/dialog&gt;
Vague claims score low. Specificity helps. First-person &amp;ldquo;I built&amp;rdquo; framing does well, which matches what actually gets upvoted. The model isn&amp;rsquo;t learning to game HN; it&amp;rsquo;s learning what HN already rewards.&lt;/p&gt;
&lt;p&gt;The model now runs, scoring articles in an &lt;a href="https://github.com/philippdubach/rss-reader"&gt;RSS reader pipeline&lt;/a&gt; I built. Does it help? Mostly. I still click on things marked low probability. But the high-confidence predictions are usually right. It&amp;rsquo;s a filter, not an oracle.&lt;figure class="post-figure" style="width: 70%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-dashboard-hn-scoring-png-11" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/dashboard-hn-scoring.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/dashboard-hn-scoring.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/dashboard-hn-scoring.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/dashboard-hn-scoring.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/dashboard-hn-scoring.png 1200w"
sizes="70vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/dashboard-hn-scoring.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/dashboard-hn-scoring.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/dashboard-hn-scoring.png 1440w"
sizes="70vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/dashboard-hn-scoring.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/dashboard-hn-scoring.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/dashboard-hn-scoring.png 2000w"
sizes="70vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/dashboard-hn-scoring.png"
alt="RSS reader dashboard showing HN prediction scores"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-dashboard-hn-scoring-png-11" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/dashboard-hn-scoring.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="RSS reader dashboard showing HN prediction scores" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/philippdubach/hn-success-predictor"&gt;Model on HuggingFace&lt;/a&gt; — Download the weights and run inference locally
&lt;br&gt;
&lt;a href="https://github.com/philippdubach/rss-reader"&gt;RSS Reader Pipeline&lt;/a&gt; — Full scoring pipeline with feed aggregation
&lt;br&gt;
&lt;a href="https://huggingface.co/philippdubach/hn-success-predictor/blob/main/training.ipynb"&gt;Training Notebook&lt;/a&gt; — Colab-ready notebook with the complete training code&lt;/p&gt;
&lt;p&gt;On a side note: The patterns here aren&amp;rsquo;t specific to Hacker News or online communities. Temporal leakage shows up whenever you&amp;rsquo;re predicting something that evolves over time: credit defaults, client churn, market regimes. The fix is the same: validate on future data, not random holdouts. Calibration matters anywhere probabilities drive decisions. A loan approval model that says &amp;ldquo;70% chance of repayment&amp;rdquo; needs that number to mean something. Overfitting to training data is how banks end up with models that look great in backtests and fail in production.&lt;/p&gt;
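The fix for temporal leakage can be sketched in a few lines; this is a toy illustration with hypothetical column names, not the actual pipeline:

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "posted_at": pd.to_datetime([
        "2024-01-05", "2024-03-12", "2024-06-01", "2024-09-20", "2024-12-15",
    ]),
    "score": [12, 210, 45, 3, 98],
})
df["hit"] = (df["score"] >= 100).astype(int)

# Temporal split: train on the past, validate on the future.
# A random holdout would leak future vocabulary and trends into training.
cutoff = pd.Timestamp("2024-07-01")
future = df["posted_at"] >= cutoff
train, test = df[~future], df[future]
```

Everything in the training set was available before the cutoff; the model is then judged only on posts it could not have seen.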
&lt;p&gt;I&amp;rsquo;ve built &lt;a href="https://philippdubach.com/projects/"&gt;similar systems for other domains&lt;/a&gt;: sentiment-based trading signals, glycemic response prediction, portfolio optimization. The ML fundamentals transfer. What changes is the domain knowledge needed to avoid the obvious mistakes, like training on data that wouldn&amp;rsquo;t have been available at prediction time, or trusting metrics that don&amp;rsquo;t reflect real-world performance.&lt;/p&gt;</description></item><item><title>Beyond Vector Search: Why LLMs Need Episodic Memory</title><link>https://philippdubach.com/posts/beyond-vector-search-why-llms-need-episodic-memory/</link><pubDate>Fri, 09 Jan 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/beyond-vector-search-why-llms-need-episodic-memory/</guid><description>&lt;p&gt;You&amp;rsquo;ve seen this message before: Copilot pausing to summarize conversation history. In long sessions it happens often enough that I started wondering what&amp;rsquo;s actually going on in there. Hence this post.&lt;figure class="post-figure" style="width: 40%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-Summarizing_conversation_history-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/Summarizing_conversation_history.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/Summarizing_conversation_history.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/Summarizing_conversation_history.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/Summarizing_conversation_history.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/Summarizing_conversation_history.png 1200w"
sizes="40vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/Summarizing_conversation_history.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/Summarizing_conversation_history.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/Summarizing_conversation_history.png 1440w"
sizes="40vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/Summarizing_conversation_history.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/Summarizing_conversation_history.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/Summarizing_conversation_history.png 2000w"
sizes="40vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/Summarizing_conversation_history.png"
alt="Hierarchical memory architecture for LLM applications"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-Summarizing_conversation_history-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/Summarizing_conversation_history.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hierarchical memory architecture for LLM applications" decoding="async"&gt;
&lt;/dialog&gt;
The short answer: context windows grew larger. &lt;a href="https://platform.claude.com/docs/en/build-with-claude/context-windows"&gt;Claude handles 200K tokens&lt;/a&gt;, &lt;a href="https://gemini.google/overview/long-context/"&gt;Gemini claims a million&lt;/a&gt;. But bigger windows aren&amp;rsquo;t memory. They&amp;rsquo;re a larger napkin you throw away when dinner&amp;rsquo;s over.&lt;/p&gt;
&lt;p&gt;For some time I was convinced that vector databases would solve this. Embed everything, store it geometrically, retrieve by similarity. Elegant in theory. Try encoding &amp;ldquo;first we did X, then Y happened, which caused Z.&amp;rdquo; Sequences don&amp;rsquo;t live naturally in vector space. Neither do facts that change over time. Your database might confidently tell you Bonn is Germany&amp;rsquo;s capital if you fed it the wrong decade of documents.&lt;/p&gt;
&lt;p&gt;What caught my attention is &lt;a href="https://openreview.net/forum?id=BI2int5SAC"&gt;EM-LLM&lt;/a&gt;. The approach is basically &amp;ldquo;what if we just copied how brains do it?&amp;rdquo; They segment conversation into episodes using surprise detection; when something unexpected happens, that&amp;rsquo;s a boundary. Retrieval pulls not just similar content but temporally adjacent content too. You don&amp;rsquo;t just remember what you&amp;rsquo;re looking for. You remember what happened next. Their event boundaries actually correlate with where humans perceive breaks in experience. Either a coincidence or we&amp;rsquo;re onto something.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-llm-memory-architecture2-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/llm-memory-architecture2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/llm-memory-architecture2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/llm-memory-architecture2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/llm-memory-architecture2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/llm-memory-architecture2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/llm-memory-architecture2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/llm-memory-architecture2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/llm-memory-architecture2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/llm-memory-architecture2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/llm-memory-architecture2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/llm-memory-architecture2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/llm-memory-architecture2.png"
alt="Hierarchical memory architecture for LLM applications"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-llm-memory-architecture2-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/llm-memory-architecture2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hierarchical memory architecture for LLM applications" decoding="async"&gt;
&lt;/dialog&gt;
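The surprise-based segmentation idea can be illustrated with a toy version (a simplified sketch, not the actual EM-LLM algorithm): place an episode boundary wherever token surprisal, the negative log-probability, exceeds a threshold.

```python
def segment_by_surprise(token_logprobs, threshold=4.0):
    """Split a token stream into episodes at high-surprise points.

    Toy illustration only: a boundary is placed wherever a token
    surprisal (negative log-probability, in nats) exceeds threshold.
    """
    episodes, current = [], []
    for lp in token_logprobs:
        surprisal = -lp  # high value means the token was unexpected
        if surprisal > threshold and current:
            episodes.append(current)  # close the current episode
            current = []
        current.append(surprisal)
    if current:
        episodes.append(current)
    return episodes

# Two highly surprising tokens (log p = -5.0) create two boundaries.
logprobs = [-0.5, -1.2, -5.0, -0.8, -0.3, -5.0, -1.0]
print(len(segment_by_surprise(logprobs)))  # 3 episodes
```

Retrieval over such episodes can then return whole segments, preserving the what-happened-next structure that flat similarity search loses.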
Knowledge graphs are the other path. &lt;a href="https://github.com/saxenauts/persona"&gt;Persona Graph&lt;/a&gt; treats memory as user-owned, with concepts as nodes. The connection between &amp;ldquo;volatility surface&amp;rdquo; and &amp;ldquo;Lightning McQueen&amp;rdquo; exists in my head (for some reason) but probably not yours. A flat embedding can&amp;rsquo;t capture that your graph is different from mine. &lt;a href="https://github.com/HawkinsRAG/HawkinsDB"&gt;HawkinsDB&lt;/a&gt; pulls from Thousand Brains theory. &lt;a href="https://docs.letta.com/"&gt;Letta&lt;/a&gt; just ships, production-ready blocks you can use today. &lt;a href="https://github.com/CaviraOSS/OpenMemory"&gt;OpenMemory&lt;/a&gt; goes further, separating emotional memory from procedural from episodic, with actual decay curves instead of hard timeouts. &lt;a href="https://mem0.ai/blog/llm-chat-history-summarization"&gt;Mem0&lt;/a&gt; reports 80-90% token cost reduction while quality goes up 26%. I can&amp;rsquo;t validate the claim, but if it holds, that&amp;rsquo;s more than optimization.&lt;/p&gt;
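The decay-curve idea reduces to an exponential half-life on recall strength; a toy model, not the actual OpenMemory implementation:

```python
import math

def recall_strength(age_days, half_life_days=30.0):
    """Recall strength under exponential decay (toy model only):
    strength halves every half_life_days, instead of a hard timeout
    that drops the memory entirely at some cutoff."""
    return 0.5 ** (age_days / half_life_days)

# A fresh memory is at full strength; a 30-day-old one at half.
assert recall_strength(0) == 1.0
assert math.isclose(recall_strength(30), 0.5)
assert math.isclose(recall_strength(90), 0.125)
```

Reinforcement then becomes resetting (or extending) the half-life on access, which is how repeated recall keeps a memory alive.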
&lt;p&gt;&lt;a href="https://github.com/FYYFU/HeadKV/"&gt;HeadKV&lt;/a&gt; figured out that attention heads aren&amp;rsquo;t created equal: some matter for memory, most don&amp;rsquo;t. Throw away 98.5% of your key-value cache, keep the important heads, lose almost nothing. &lt;a href="https://arxiv.org/abs/2410.13346"&gt;Sakana AI&lt;/a&gt; went weirder: tiny neural networks that decide per-token whether to remember or forget, evolved rather than trained. Sounds like it shouldn&amp;rsquo;t work. Apparently works great.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what I keep coming back to: in any mature system, most of the graph will be memories of memories. You ask me my favorite restaurants, I think about it, answer, and now &amp;ldquo;that list I made&amp;rdquo; becomes its own retrievable thing. Next time someone asks about dinner plans, I don&amp;rsquo;t re-derive preferences from first principles. I remember what I concluded last time. Psychologists say &lt;a href="https://www.taylorfrancis.com/books/mono/10.4324/9781315755854/working-memory-pierre-barrouillet-val%C3%A9rie-camos"&gt;this is how human recall actually works&lt;/a&gt;; you&amp;rsquo;re not accessing the original, you&amp;rsquo;re accessing the last retrieval. Gets a little distorted each time.&lt;/p&gt;
&lt;p&gt;Should the model control its own memory? Give it a &amp;ldquo;remember this&amp;rdquo; tool? I don&amp;rsquo;t think so, not yet. &lt;a href="https://arxiv.org/abs/2505.02151"&gt;These things are overconfident&lt;/a&gt;. Maybe that changes. For now, memory probably needs to happen around the model, not through it. Eventually some learned architecture will make all this scaffolding obsolete. Train memory into the weights directly. I have no idea what that looks like. Sparse mixture of experts with overnight updates? Some forgotten recurrent trick? Right now it&amp;rsquo;s all duct tape and cognitive science papers.&lt;aside class="inline-newsletter" aria-label="Newsletter signup"&gt;
&lt;div class="inline-newsletter-content"&gt;
&lt;p class="inline-newsletter-headline"&gt;Enjoy this writing? Get new posts, projects, and articles delivered monthly.&lt;/p&gt;
&lt;form id="inline-newsletter-3-form" class="inline-newsletter-form"&gt;
&lt;label for="inline-newsletter-3-email" class="visually-hidden"&gt;Email address&lt;/label&gt;
&lt;input
type="email"
id="inline-newsletter-3-email"
name="email"
placeholder="your@email.com"
required
class="inline-newsletter-input"
aria-label="Email address"
/&gt;
&lt;button type="submit" class="inline-newsletter-button"&gt;Sign Up&lt;/button&gt;
&lt;/form&gt;
&lt;p id="inline-newsletter-3-privacy" class="inline-newsletter-privacy"&gt;&lt;a href="https://philippdubach.com/posts/building-a-no-tracking-newsletter-from-markdown-to-distribution/"&gt;No tracking&lt;/a&gt;. Unsubscribe anytime.&lt;/p&gt;
&lt;div id="inline-newsletter-3-message" class="inline-newsletter-message" style="display: none;"&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/aside&gt;
&lt;script&gt;
(function() {
var formId = 'inline-newsletter-3-form';
var messageId = 'inline-newsletter-3-message';
var emailId = 'inline-newsletter-3-email';
var privacyId = 'inline-newsletter-3-privacy';
function init() {
var form = document.getElementById(formId);
var messageDiv = document.getElementById(messageId);
var emailInput = document.getElementById(emailId);
var privacyDiv = document.getElementById(privacyId);
if (privacyDiv &amp;&amp; !privacyDiv.dataset.countLoaded) {
privacyDiv.dataset.countLoaded = 'true';
fetch('https://newsletter-api.philippd.workers.dev/api/subscriber-count')
.then(function(r) { return r.json(); })
.then(function(data) {
if (data.display) {
var countText = document.createTextNode('Join ' + data.display + ' readers. ');
privacyDiv.insertBefore(countText, privacyDiv.firstChild);
}
})
.catch(function() { });
}
if (!form) return;
form.addEventListener('submit', function(e) {
e.preventDefault();
var email = emailInput.value.trim();
if (!email) return;
var emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
if (!emailRegex.test(email)) {
showMessage('Please enter a valid email address.', 'error');
return;
}
var submitButton = form.querySelector('button[type="submit"]');
submitButton.disabled = true;
submitButton.textContent = 'Subscribing...';
fetch('https://newsletter-api.philippd.workers.dev/api/subscribe', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ email: email })
})
.then(function(response) { return response.json(); })
.then(function(data) {
if (data.success) {
form.style.display = 'none';
document.querySelector('#' + formId).closest('.inline-newsletter').querySelector('.inline-newsletter-privacy').style.display = 'none';
showMessage('Thanks for subscribing! You\'ll receive the next newsletter in your inbox.', 'success');
} else {
showMessage(data.error || 'Something went wrong. Please try again.', 'error');
submitButton.disabled = false;
submitButton.textContent = 'Sign Up';
}
})
.catch(function() {
showMessage('Something went wrong. Please try again later.', 'error');
submitButton.disabled = false;
submitButton.textContent = 'Sign Up';
});
});
function showMessage(text, type) {
messageDiv.textContent = text;
messageDiv.className = 'inline-newsletter-message inline-newsletter-message-' + type;
messageDiv.style.display = 'block';
}
}
if (document.readyState === 'loading') {
document.addEventListener('DOMContentLoaded', init);
} else {
init();
}
})();
&lt;/script&gt;
&lt;/p&gt;</description></item><item><title>65% of Hacker News Posts Have Negative Sentiment, and They Outperform</title><link>https://philippdubach.com/posts/65-of-hacker-news-posts-have-negative-sentiment-and-they-outperform/</link><pubDate>Wed, 07 Jan 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/65-of-hacker-news-posts-have-negative-sentiment-and-they-outperform/</guid><description>&lt;h2 id="negativity-bias-and-engagement-on-hacker-news"&gt;Negativity Bias and Engagement on Hacker News&lt;/h2&gt;
&lt;p&gt;This Hacker News sentiment analysis began with a simple observation: posts with negative sentiment average 35.6 points on &lt;a href="https://news.ycombinator.com"&gt;Hacker News&lt;/a&gt;. The overall average is 28 points. That&amp;rsquo;s a 27% performance premium for negativity.&lt;/p&gt;
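The 27% figure is just the ratio of the two averages quoted above:

```python
# The two averages reported in the post.
negative_avg = 35.6   # mean points, negative-sentiment posts
overall_avg = 28.0    # mean points, all posts

premium = negative_avg / overall_avg - 1
print(f"{premium:.0%}")  # 27%
```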
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-hn-sentiment-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/hn-sentiment.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/hn-sentiment.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/hn-sentiment.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/hn-sentiment.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hn-sentiment.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/hn-sentiment.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/hn-sentiment.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/hn-sentiment.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hn-sentiment.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/hn-sentiment.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/hn-sentiment.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hn-sentiment.png"
alt="Hacker News sentiment analysis distribution across 32,000 posts showing negative skew"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-hn-sentiment-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/hn-sentiment.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hacker News sentiment analysis distribution across 32,000 posts showing negative skew" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;This finding comes from an empirical study I&amp;rsquo;ve been running on HN attention dynamics, covering decay curves, preferential attachment, survival probability, and early-engagement prediction. The preprint is &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5910263"&gt;available on SSRN&lt;/a&gt;. I already had a gut feeling, and the data bears it out: across 32,000 posts and 340,000 comments, nearly 65% register as negative. This could be an artifact of a classifier miscalibrated toward negativity, yet the pattern holds across six different models.&lt;/p&gt;
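For concreteness, the premium is just a ratio of conditional means. A minimal sketch, with hypothetical numbers standing in for the classifier output:

```python
# Hypothetical (sentiment, score) pairs; the real input is the labeled post set.
posts = [
    ("negative", 42), ("negative", 30), ("positive", 25),
    ("neutral", 20), ("negative", 35), ("positive", 16),
]

overall_mean = sum(score for _, score in posts) / len(posts)
neg_scores = [score for label, score in posts if label == "negative"]
neg_mean = sum(neg_scores) / len(neg_scores)

# Premium of negative posts over the overall average
premium = neg_mean / overall_mean - 1
print(f"negative mean: {neg_mean:.1f}, overall: {overall_mean:.1f}, premium: {premium:.0%}")
```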
&lt;h2 id="six-model-sentiment-comparison-transformers-vs-llms"&gt;Six-Model Sentiment Comparison: Transformers vs LLMs&lt;/h2&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-sentiment_models_comparison_6models-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/sentiment_models_comparison_6models.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/sentiment_models_comparison_6models.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/sentiment_models_comparison_6models.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/sentiment_models_comparison_6models.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/sentiment_models_comparison_6models.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/sentiment_models_comparison_6models.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/sentiment_models_comparison_6models.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/sentiment_models_comparison_6models.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/sentiment_models_comparison_6models.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/sentiment_models_comparison_6models.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/sentiment_models_comparison_6models.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/sentiment_models_comparison_6models.png"
alt="Sentiment classification comparison across six NLP models: DistilBERT, BERT Multi, RoBERTa, Llama 3.1 8B, Mistral 3.1 24B, and Gemma 3 12B"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-sentiment_models_comparison_6models-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/sentiment_models_comparison_6models.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Sentiment classification comparison across six NLP models: DistilBERT, BERT Multi, RoBERTa, Llama 3.1 8B, Mistral 3.1 24B, and Gemma 3 12B" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;I tested three transformer-based classifiers (DistilBERT, BERT Multi, RoBERTa) and three LLMs (Llama 3.1 8B, Mistral 3.1 24B, Gemma 3 12B). The distributions vary, but the negative skew persists across all of them (inverted scale for 2-6). The results I use in my dashboard are from DistilBERT because it runs efficiently in my Cloudflare-based pipeline.&lt;/p&gt;
&lt;p&gt;What counts as &amp;ldquo;negative&amp;rdquo; here? Criticism of technology, skepticism toward announcements, complaints about industry practices, frustration with APIs. The usual. It&amp;rsquo;s worth noting that technical critique reads differently from personal attacks; most HN negativity is substantive rather than toxic. But does negativity cause engagement, or does controversial content attract both negative framing and attention? Probably some of both.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="hackerbook-dataset-cross-validation-with-22gb-of-hacker-news-data"&gt;HackerBook Dataset: Cross-Validation With 22GB of Hacker News Data&lt;/h2&gt;
&lt;p&gt;Related to this, I also saw &lt;a href="https://news.ycombinator.com/item?id=46435308"&gt;this Show HN&lt;/a&gt;: 22GB of Hacker News in SQLite, served via WASM shards. Downloaded the &lt;a href="https://github.com/DOSAYGO-STUDIO/HackerBook"&gt;HackerBook&lt;/a&gt; export and ran a subset of my paper&amp;rsquo;s analytics on it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Caveat: HackerBook is a single static snapshot (no time-series data). I therefore could not run the lifecycle analysis, early-velocity prediction, or decay fitting. What can be computed: distributional statistics, inequality metrics, and circadian patterns.&lt;/em&gt;&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-hackerbook_stats_table2-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/hackerbook_stats_table2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/hackerbook_stats_table2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/hackerbook_stats_table2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/hackerbook_stats_table2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hackerbook_stats_table2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/hackerbook_stats_table2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/hackerbook_stats_table2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/hackerbook_stats_table2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hackerbook_stats_table2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/hackerbook_stats_table2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/hackerbook_stats_table2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/hackerbook_stats_table2.png"
alt="Summary statistics table for HackerBook Hacker News data sample"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-hackerbook_stats_table2-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/hackerbook_stats_table2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Summary statistics table for HackerBook Hacker News data sample" decoding="async"&gt;
&lt;/dialog&gt;
&lt;h3 id="score-distribution-and-power-law-fit"&gt;Score Distribution and Power-Law Fit&lt;/h3&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-score_power_law_hackerbook2-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/score_power_law_hackerbook2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/score_power_law_hackerbook2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/score_power_law_hackerbook2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/score_power_law_hackerbook2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/score_power_law_hackerbook2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/score_power_law_hackerbook2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/score_power_law_hackerbook2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/score_power_law_hackerbook2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/score_power_law_hackerbook2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/score_power_law_hackerbook2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/score_power_law_hackerbook2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/score_power_law_hackerbook2.png"
alt="Hacker News score distribution CCDF with power-law fit showing heavy-tailed engagement"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-score_power_law_hackerbook2-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/score_power_law_hackerbook2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hacker News score distribution CCDF with power-law fit showing heavy-tailed engagement" decoding="async"&gt;
&lt;/dialog&gt;
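For the tail fit, one standard approach is the continuous maximum-likelihood estimator for the exponent, alpha = 1 + n / sum(ln(x_i / x_min)) over the tail. This is a sketch of that estimator with hypothetical scores, not necessarily the exact fitting procedure from the paper:

```python
import math

def powerlaw_alpha(scores, xmin):
    """Continuous MLE of the power-law tail exponent:
    alpha = 1 + n / sum(ln(x / xmin)) over the tail x >= xmin."""
    tail = [x for x in scores if x >= xmin]
    return 1 + len(tail) / sum(math.log(x / xmin) for x in tail)

# Hypothetical scores; a real run would stream them from the HackerBook dump.
scores = [1, 2, 2, 3, 5, 8, 13, 40, 120, 500]
alpha = powerlaw_alpha(scores, xmin=5)
```

Picking `xmin` properly (e.g. by minimizing the Kolmogorov-Smirnov distance) matters more than the estimator itself for heavy-tailed score data.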
&lt;h3 id="attention-inequality-lorenz-curve-and-gini-coefficient"&gt;Attention Inequality: Lorenz Curve and Gini Coefficient&lt;/h3&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-attention_inequality_hackerbook2-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/attention_inequality_hackerbook2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/attention_inequality_hackerbook2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/attention_inequality_hackerbook2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/attention_inequality_hackerbook2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/attention_inequality_hackerbook2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/attention_inequality_hackerbook2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/attention_inequality_hackerbook2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/attention_inequality_hackerbook2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/attention_inequality_hackerbook2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/attention_inequality_hackerbook2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/attention_inequality_hackerbook2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/attention_inequality_hackerbook2.png"
alt="Lorenz curve of Hacker News story scores measuring attention inequality with Gini coefficient"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-attention_inequality_hackerbook2-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/attention_inequality_hackerbook2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Lorenz curve of Hacker News story scores measuring attention inequality with Gini coefficient" decoding="async"&gt;
&lt;/dialog&gt;
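The Gini coefficient behind the Lorenz curve is straightforward to compute from raw scores. A self-contained sketch (the data here is hypothetical; the real run uses the HackerBook story scores):

```python
def gini(values):
    """Gini coefficient of a non-negative distribution:
    0 = perfect equality, approaching 1 = one item holds everything."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Standard identity over the rank-weighted sorted values
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

# One post hoarding almost all the points drives the Gini up.
print(gini([1, 1, 1, 1, 96]))
```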
&lt;h3 id="circadian-posting-patterns"&gt;Circadian Posting Patterns&lt;/h3&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-circadian_patterns_hackerbook2-png-6" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/circadian_patterns_hackerbook2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/circadian_patterns_hackerbook2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/circadian_patterns_hackerbook2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/circadian_patterns_hackerbook2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/circadian_patterns_hackerbook2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/circadian_patterns_hackerbook2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/circadian_patterns_hackerbook2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/circadian_patterns_hackerbook2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/circadian_patterns_hackerbook2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/circadian_patterns_hackerbook2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/circadian_patterns_hackerbook2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/circadian_patterns_hackerbook2.png"
alt="Hacker News circadian posting patterns in UTC showing volume versus mean score by hour"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-circadian_patterns_hackerbook2-png-6" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/circadian_patterns_hackerbook2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hacker News circadian posting patterns in UTC showing volume versus mean score by hour" decoding="async"&gt;
&lt;/dialog&gt;
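The circadian aggregation is just a group-by on UTC hour. A sketch of the bucketing, with hypothetical (unix timestamp, score) pairs standing in for the stories table:

```python
from collections import defaultdict
from datetime import datetime, timezone

def hourly_pattern(posts):
    """Group (unix_time, score) pairs by UTC hour; returns
    {hour: (post_count, mean_score)}."""
    buckets = defaultdict(list)
    for ts, score in posts:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        buckets[hour].append(score)
    return {h: (len(s), sum(s) / len(s)) for h, s in buckets.items()}

# Two posts in the same UTC hour on different days share a bucket.
sample = [(1700000000, 10), (1700003600, 30), (1700086400, 20)]
pattern = hourly_pattern(sample)
```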
&lt;h3 id="score-vs-comment-engagement"&gt;Score vs Comment Engagement&lt;/h3&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-score_vs_direct_comments_hackerbook2-png-7" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/score_vs_direct_comments_hackerbook2.png"
alt="Hacker News score versus direct comments log-log scatter plot"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-score_vs_direct_comments_hackerbook2-png-7" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/score_vs_direct_comments_hackerbook2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hacker News score versus direct comments log-log scatter plot" decoding="async"&gt;
&lt;/dialog&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-direct_comments_ccdf_hackerbook2-png-8" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/direct_comments_ccdf_hackerbook2.png"
alt="Direct comments distribution CCDF on Hacker News showing power-law tail"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-direct_comments_ccdf_hackerbook2-png-8" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/direct_comments_ccdf_hackerbook2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Direct comments distribution CCDF on Hacker News showing power-law tail" decoding="async"&gt;
&lt;/dialog&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-mean_score_vs_direct_comments_binned_hackerbook2-png-9" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png"
alt="Mean score versus direct comments on Hacker News binned in log-spaced buckets"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-mean_score_vs_direct_comments_binned_hackerbook2-png-9" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/mean_score_vs_direct_comments_binned_hackerbook2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Mean score versus direct comments on Hacker News binned in log-spaced buckets" decoding="async"&gt;
&lt;/dialog&gt;</description></item><item><title>RSS Swipr: Find Blogs Like You Find Your Dates</title><link>https://philippdubach.com/posts/rss-swipr-find-blogs-like-you-find-your-dates/</link><pubDate>Mon, 05 Jan 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/rss-swipr-find-blogs-like-you-find-your-dates/</guid><description>&lt;p&gt;&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-rss-tinder-demo2-gif-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/rss-tinder-demo2.gif 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/rss-tinder-demo2.gif 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/rss-tinder-demo2.gif 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/rss-tinder-demo2.gif 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/rss-tinder-demo2.gif 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/rss-tinder-demo2.gif 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/rss-tinder-demo2.gif 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/rss-tinder-demo2.gif 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/rss-tinder-demo2.gif 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/rss-tinder-demo2.gif 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/rss-tinder-demo2.gif 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/rss-tinder-demo2.gif"
alt="GIF with interactive demo of the RSS Tinder App"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-rss-tinder-demo2-gif-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/rss-tinder-demo2.gif"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="GIF with interactive demo of the RSS Tinder App" decoding="async"&gt;
&lt;/dialog&gt;
Algorithmic timelines are everywhere now. But I still prefer the control of RSS. Readers are good at aggregating content but bad at filtering it. What I wanted was something borrowed from dating apps: instead of an infinite list, give me cards. Swipe right to like, left to dislike. Then train a model to surface what I actually want to read. So I built &lt;em&gt;RSS Swipr&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The frontend is vanilla JavaScript—no React, no build steps, just DOM manipulation and CSS transitions. You drag a card; it follows your finger and snaps away with a satisfying animation. Behind the scenes, the app tracks everything: votes (like/neutral/dislike), time spent viewing each card, and whether you actually opened the link. If I swipe right but don&amp;rsquo;t click through, that&amp;rsquo;s a signal. If I spend 0.3 seconds on a card before swiping left, that&amp;rsquo;s a signal too.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-screenshot_feed_import1-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/screenshot_feed_import1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/screenshot_feed_import1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/screenshot_feed_import1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/screenshot_feed_import1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_feed_import1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/screenshot_feed_import1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/screenshot_feed_import1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/screenshot_feed_import1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_feed_import1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/screenshot_feed_import1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/screenshot_feed_import1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_feed_import1.png"
alt="Feed management interface showing 1084 imported RSS feeds with 9327 total entries"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-screenshot_feed_import1-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/screenshot_feed_import1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Feed management interface showing 1084 imported RSS feeds with 9327 total entries" decoding="async"&gt;
&lt;/dialog&gt;
Feed management happens through a simple CSV import. Paste a list of &lt;code&gt;name,url&lt;/code&gt; pairs, click refresh, and the fetcher pulls articles with proper HTTP caching (ETag/Last-Modified) to avoid hammering servers. You can use your own feed list or load a predefined list. Thanks to Manuel Moreale, who created &lt;a href="https://blogroll.org/"&gt;blogroll&lt;/a&gt;, I was able to get an OPML export and load all curated RSS feeds directly. Something similar works with &lt;a href="https://minifeed.net/global"&gt;minifeed&lt;/a&gt; or &lt;a href="https://kagi.com/api/v1/smallweb/feed"&gt;Kagi&amp;rsquo;s smallweb&lt;/a&gt;. Or use one of the &lt;a href="https://hnrss.github.io"&gt;Hacker News RSS&lt;/a&gt; feeds. If that feels too adventurous, I created &lt;a href="https://rss-aggregator.philippd.workers.dev"&gt;curated feeds&lt;/a&gt; for the most popular HN bloggers.&lt;/p&gt;
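The caching behaviour is plain HTTP conditional requests: replay the server's validators so an unchanged feed costs a 304 instead of a full download. A standard-library sketch (the function names and cache shape are mine, not the app's actual code):

```python
import urllib.error
import urllib.request

def conditional_headers(cached):
    """Build If-None-Match / If-Modified-Since from a prior response's validators."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers

def fetch_feed(url, cache):
    """Conditional GET: a 304 Not Modified means the cached body is still fresh."""
    req = urllib.request.Request(url, headers=conditional_headers(cache.get(url, {})))
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            cache[url] = {"etag": resp.headers.get("ETag"),
                          "last_modified": resp.headers.get("Last-Modified"),
                          "body": resp.read()}
            return cache[url]["body"]
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return cache[url]["body"]
        raise
```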
&lt;p&gt;Building the model, I started with XGBoost and some hand-engineered features (title length, word count, time of day, feed source). Decent—around 66% ROC-AUC. It learned that I dislike short, clickbaity titles. But it didn&amp;rsquo;t understand context.&lt;/p&gt;
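Those baseline features are cheap to compute. A sketch of the kind of extraction involved, where the field names (`title`, `description`, `published`, `feed`) are illustrative rather than the project's actual schema:

```python
from datetime import datetime

def basic_features(article):
    """Hand-engineered features in the spirit of the XGBoost baseline:
    title length, word count, time of day, and feed source."""
    title = article["title"]
    published = article["published"]  # datetime of publication
    return {
        "title_len": len(title),
        "title_words": len(title.split()),
        "desc_words": len(article.get("description", "").split()),
        "hour_of_day": published.hour,
        "is_weekend": int(published.weekday() >= 5),
        # Categorical; would be one-hot or target-encoded downstream.
        "feed": article["feed"],
    }

example = {
    "title": "Why I Rewrote My RSS Reader",
    "description": "A short note on feeds and caching.",
    "published": datetime(2026, 5, 11, 8, 30),
    "feed": "philippdubach.com",
}
print(basic_features(example))
```

Features like these capture surface signals (clickbait titles tend to be short, reading habits follow the clock) but nothing about what an article is actually about, which is exactly the gap the embedding upgrade fills.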
&lt;p&gt;The upgrade was MPNet (&lt;code&gt;all-mpnet-base-v2&lt;/code&gt; from sentence-transformers) to generate 768-dimensional embeddings for every article&amp;rsquo;s title and description. Combined with engineered features—feed preferences, temporal patterns, text statistics—this gets fed into a Hybrid Random Forest.&lt;/p&gt;
&lt;div class="code-block" data-lang="python"&gt;&lt;span class="code-lang" aria-hidden="true"&gt;python&lt;/span&gt;&lt;button type="button" class="code-copy" aria-label="Copy code to clipboard"&gt;
&lt;span class="code-copy-text"&gt;Copy&lt;/span&gt;
&lt;/button&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict_preference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Generate semantic embeddings (768 dims)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mpnet&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Extract behavioral + text features&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feature_pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Predict with Hybrid RF&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Training happens on Google Colab (a free T4 GPU, or faster with an A100 or H100 on a subscription). Upload your training CSV, run the notebook, and download a &lt;code&gt;.pkl&lt;/code&gt; file.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-screenshot_colab_head1-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/screenshot_colab_head1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/screenshot_colab_head1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/screenshot_colab_head1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/screenshot_colab_head1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_colab_head1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/screenshot_colab_head1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/screenshot_colab_head1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/screenshot_colab_head1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_colab_head1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/screenshot_colab_head1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/screenshot_colab_head1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_colab_head1.png"
alt="Google Colab notebook showing model training setup with GPU configuration"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-screenshot_colab_head1-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/screenshot_colab_head1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Google Colab notebook showing model training setup with GPU configuration" decoding="async"&gt;
&lt;/dialog&gt;
The notebook handles everything: installing sentence-transformers, downloading the feature engineering pipeline, checking GPU availability, and running 5-fold cross-validation.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-screenshot_colab_results1-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/screenshot_colab_results1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/screenshot_colab_results1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/screenshot_colab_results1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/screenshot_colab_results1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_colab_results1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/screenshot_colab_results1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/screenshot_colab_results1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/screenshot_colab_results1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_colab_results1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/screenshot_colab_results1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/screenshot_colab_results1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_colab_results1.png"
alt="Training results showing ROC-AUC of 0.7537 across 5-fold cross-validation"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-screenshot_colab_results1-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/screenshot_colab_results1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Training results showing ROC-AUC of 0.7537 across 5-fold cross-validation" decoding="async"&gt;
&lt;/dialog&gt;
With ~1400 training samples, the model achieves &lt;em&gt;75.4% ROC-AUC (± 0.019 std)&lt;/em&gt;. Not state-of-the-art, but enough to noticeably improve my reading experience. The model now understands that I like systems programming and ML papers, but skip most crypto and generic startup advice.&lt;/p&gt;
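The notebook's evaluation is standard 5-fold cross-validation scored on ROC-AUC. A self-contained sketch on synthetic data, since the real run needs the exported CSV plus MPNet embeddings (so the score printed here is meaningless, roughly chance):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for ~1400 articles x (768 embedding dims + extra features).
rng = np.random.default_rng(0)
X = rng.normal(size=(1400, 768 + 20))
y = rng.integers(0, 2, size=1400)

model = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Cross-validation matters at this sample size: with only ~1400 labels, a single train/test split would make the reported AUC swing noticeably from run to run.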
&lt;p&gt;The problem with transformer models is latency. Generating MPNet embeddings takes ~1 second per article. In a swipe interface, that lag is unbearable. The next best thing is a preload queue. While you&amp;rsquo;re reading the current card, the backend is scoring and fetching the next 3-5 articles in the background. By the time you swipe, the next card is already waiting.&lt;/p&gt;
&lt;div class="code-block" data-lang="javascript"&gt;&lt;span class="code-lang" aria-hidden="true"&gt;javascript&lt;/span&gt;&lt;button type="button" class="code-copy" aria-label="Copy code to clipboard"&gt;
&lt;span class="code-copy-text"&gt;Copy&lt;/span&gt;
&lt;/button&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-javascript" data-lang="javascript"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;async&lt;/span&gt; &lt;span class="nx"&gt;loadNextBatch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;excludeIds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cardQueue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kr"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sb"&gt;`/api/posts/batch?count=3&amp;amp;exclude=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;excludeIds&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kr"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cardQueue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Article selection uses a Thompson Sampling-style explore/exploit policy: roughly 80% of the time it shows what the model thinks you&amp;rsquo;ll like (exploit), and 20% of the time it throws in something unexpected (explore). This counteracts the filter-bubble problem and lets the model discover whether your tastes have changed.&lt;/p&gt;
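The 80/20 split itself is a few lines. A simplified sketch with names of my own choosing; the actual sampler in the app is more involved than this:

```python
import random

def pick_next(candidates, scores, explore_rate=0.2, rng=random):
    """Return the next card: the model's top pick most of the time,
    a uniformly random article otherwise, so new interests can surface."""
    if rng.random() > explore_rate:
        # Exploit: highest predicted preference wins.
        best = max(range(len(candidates)), key=lambda i: scores[i])
        return candidates[best]
    # Explore: ignore the model and surface something unexpected.
    return rng.choice(candidates)

# Greedy path when explore_rate is 0: picks the top-scored article.
print(pick_next(["a", "b", "c"], [0.2, 0.9, 0.5], explore_rate=0.0))
```

The `explore_rate` knob is the whole trade-off: set it to 0 and the feed slowly narrows to what you already like; set it too high and predictions stop mattering.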
&lt;p&gt;The whole system is designed as a closed loop:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Swipe&lt;/strong&gt; → votes get stored in SQLite&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Export&lt;/strong&gt; → download training CSV with votes + engagement data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Train&lt;/strong&gt; → run Colab notebook, get new model&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Upload&lt;/strong&gt; → drag-drop the &lt;code&gt;.pkl&lt;/code&gt; file back into the app&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-screenshot_export1-png-6" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/screenshot_export1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/screenshot_export1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/screenshot_export1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/screenshot_export1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_export1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/screenshot_export1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/screenshot_export1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/screenshot_export1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_export1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/screenshot_export1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/screenshot_export1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_export1.png"
alt="Export interface showing 1421 votes with breakdown: 583 likes, 193 neutral, 645 dislikes"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-screenshot_export1-png-6" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/screenshot_export1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Export interface showing 1421 votes with breakdown: 583 likes, 193 neutral, 645 dislikes" decoding="async"&gt;
&lt;/dialog&gt;
The export includes everything the model needs: article text, feed metadata, your votes, link opens, and time spent. You can also &lt;strong&gt;import&lt;/strong&gt; a previous training CSV to restore your voting history on a fresh install—useful if you want to clone the repo on a new machine without losing your data.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-screenshot_model_selection1-png-7" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/screenshot_model_selection1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/screenshot_model_selection1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/screenshot_model_selection1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/screenshot_model_selection1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_model_selection1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/screenshot_model_selection1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/screenshot_model_selection1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/screenshot_model_selection1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_model_selection1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/screenshot_model_selection1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/screenshot_model_selection1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/screenshot_model_selection1.png"
alt="Model management interface showing active hybrid_rf model with ROC-AUC 0.7537"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-screenshot_model_selection1-png-7" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/screenshot_model_selection1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Model management interface showing active hybrid_rf model with ROC-AUC 0.7537" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;p&gt;Uploaded models show their ROC-AUC score so you can compare performance across training runs. Activate whichever one works best.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;: Python, Flask, SQLite
&lt;strong&gt;Frontend&lt;/strong&gt;: Vanilla JS, CSS variables
&lt;strong&gt;ML&lt;/strong&gt;: scikit-learn, XGBoost, sentence-transformers (MPNet)
&lt;strong&gt;Training&lt;/strong&gt;: Google Colab (free GPU tier)&lt;/p&gt;
&lt;p&gt;Total infrastructure cost: zero. Everything runs locally. No accounts, no cloud dependencies, no tracking.&lt;/p&gt;
&lt;div class="code-block" data-lang="bash"&gt;&lt;span class="code-lang" aria-hidden="true"&gt;bash&lt;/span&gt;&lt;button type="button" class="code-copy" aria-label="Copy code to clipboard"&gt;
&lt;span class="code-copy-text"&gt;Copy&lt;/span&gt;
&lt;/button&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;git clone https://github.com/philippdubach/rss-swipr.git
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; rss-swipr
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;python -m venv .venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install -r requirements.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;python app.py&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;a href="https://github.com/philippdubach/rss-swipr"&gt;full source&lt;/a&gt; and &lt;a href="https://colab.research.google.com/drive/1XjnAuwF3naPElKH9yZ3UEdslzN7qAUrQ?usp=sharing"&gt;Colab notebook&lt;/a&gt; are available on GitHub.&lt;/p&gt;</description></item><item><title>Apple's AI Bet: Playing the Long Game or Missing the Moment?</title><link>https://philippdubach.com/posts/apples-ai-bet-playing-the-long-game-or-missing-the-moment/</link><pubDate>Tue, 30 Dec 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/apples-ai-bet-playing-the-long-game-or-missing-the-moment/</guid><description>&lt;p&gt;&lt;a href="https://www.theinformation.com/articles/2026-predictions-apple-will-reverse-ai-slump"&gt;The Information&lt;/a&gt; published a piece today arguing that Apple&amp;rsquo;s restrained AI approach may finally pay off in 2026. The thesis: while OpenAI, Google, and Meta pour hundreds of billions into data centers and model training, Apple has kept its powder dry, sitting on &lt;a href="https://www.apple.com/newsroom/2025/10/apple-reports-fourth-quarter-results/"&gt;$157 billion in cash and marketable securities&lt;/a&gt; as of Q4 2025. If the AI spending bubble deflates, Apple&amp;rsquo;s position looks rather clever. This piqued my interest, from a strategy point of view: Apple hasn&amp;rsquo;t been absent from AI. They&amp;rsquo;ve been making a specific bet that large language models will commoditize, and that value will flow to distribution and customer relationships rather than to whoever has the best model. The revamped Siri expected in spring 2026 will reportedly be powered by &lt;a href="https://www.bloomberg.com/news/articles/2025-11-05/apple-plans-to-use-1-2-trillion-parameter-google-gemini-model-to-power-new-siri"&gt;Google&amp;rsquo;s Gemini through a deal worth $1 billion annually&lt;/a&gt;. 
The custom Gemini model will run on Apple&amp;rsquo;s Private Cloud Compute servers.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai_capex_comparison-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai_capex_comparison.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai_capex_comparison.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai_capex_comparison.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai_capex_comparison.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_capex_comparison.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai_capex_comparison.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai_capex_comparison.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai_capex_comparison.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_capex_comparison.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai_capex_comparison.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai_capex_comparison.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_capex_comparison.png"
alt="Big Tech AI Capital Expenditure 2023-2025"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai_capex_comparison-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai_capex_comparison.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Big Tech AI Capital Expenditure 2023-2025" decoding="async"&gt;
&lt;/dialog&gt;
This is consistent with Apple&amp;rsquo;s history. They didn&amp;rsquo;t build their own search engine. They took Google&amp;rsquo;s money to be the default on Safari. &lt;a href="https://www.apple.com/newsroom/2025/12/john-giannandrea-to-retire-from-apple/"&gt;John Giannandrea&amp;rsquo;s retirement&lt;/a&gt; earlier this month, with Siri now under Mike Rockwell, signals internal recognition that something had to change.&lt;/p&gt;
&lt;p&gt;The iPhone distribution advantage is underappreciated. Apple can push AI features through software updates to &lt;a href="https://www.macrumors.com/2025/01/30/apple-1q-2025-earnings/"&gt;over 2.3 billion active devices&lt;/a&gt;. When Apple Intelligence features ship, they just appear. This is the same advantage that made Apple Music competitive against Spotify, or keeps Safari relevant despite Chrome&amp;rsquo;s benchmarks.&lt;/p&gt;
&lt;p&gt;The commoditization evidence is suggestive. I&amp;rsquo;ve &lt;a href="https://philippdubach.com/posts/is-ai-really-eating-the-world-1/2/"&gt;written before&lt;/a&gt; about these dynamics. GPT-4 launched with a substantial lead; within months, &lt;a href="https://www.anthropic.com/news/claude-3-family"&gt;Claude and Gemini were comparable&lt;/a&gt;. &lt;a href="https://newsletter.semianalysis.com/p/deepseek-debates"&gt;DeepSeek proved frontier models can be built for a fraction of OpenAI&amp;rsquo;s cost&lt;/a&gt;. API pricing has &lt;a href="https://openai.com/api/pricing/"&gt;dropped 97% since GPT-3&amp;rsquo;s launch&lt;/a&gt;. The hyperscalers are spending &lt;a href="https://www.wsj.com/tech/ai/big-tech-ai-spending-7b6c8d8a"&gt;$400 billion collectively on AI infrastructure in 2025&lt;/a&gt;, more than global telecom capex. The question isn&amp;rsquo;t whether this produces capable models. It&amp;rsquo;s whether it produces defensible advantages.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai_api_pricing-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai_api_pricing.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai_api_pricing.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai_api_pricing.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai_api_pricing.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_api_pricing.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai_api_pricing.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai_api_pricing.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai_api_pricing.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_api_pricing.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai_api_pricing.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai_api_pricing.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai_api_pricing.png"
alt="AI API Pricing Collapse 2020-2025"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai_api_pricing-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai_api_pricing.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="AI API Pricing Collapse 2020-2025" decoding="async"&gt;
&lt;/dialog&gt;
&lt;a href="https://www.bloomberg.com/news/newsletters/2025-11-02/apple-s-nearly-140-billion-quarter-when-ios-26-1-will-be-out-ipad-mini-revamp-mhhpy1ax"&gt;Mark Gurman&amp;rsquo;s Bloomberg reporting&lt;/a&gt; suggests Apple views LLMs as commodities not worth proprietary development costs. The counterargument is obvious: what if the next capability jump makes current models look like toys?&lt;/p&gt;
&lt;p&gt;But the AI investment boom resembles previous cycles. Enormous capital flowing into a sector where barriers keep falling. That pattern often ends with winners who have distribution and customer relationships, not winners who spent the most on R&amp;amp;D. Apple&amp;rsquo;s bet isn&amp;rsquo;t guaranteed to be correct, but it&amp;rsquo;s defensible.&lt;/p&gt;
&lt;p&gt;The spring Siri update will matter. Reports that &lt;a href="https://9to5mac.com/2025/10/19/apple-employees-concerned-by-early-ios-26-4-apple-intelligence-sir-version/"&gt;Apple employees have concerns about performance in early iOS 26.4 builds&lt;/a&gt; aren&amp;rsquo;t encouraging. But Apple delayed the launch multiple times, suggesting they&amp;rsquo;re trying to get it right rather than shipping half-baked.&lt;/p&gt;
&lt;p&gt;Apple&amp;rsquo;s $157 billion cash pile provides optionality. If AI startups face a funding crunch, Apple can acquire capability. If someone achieves a breakthrough, Apple has resources to respond. Apple has preserved its options.&lt;/p&gt;</description></item><item><title>Is AI Really Eating the World? AGI, Networks, Value [2/2]</title><link>https://philippdubach.com/posts/is-ai-really-eating-the-world-agi-networks-value-2/2/</link><pubDate>Mon, 24 Nov 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/is-ai-really-eating-the-world-agi-networks-value-2/2/</guid><description>&lt;p&gt;&lt;em&gt;Start by reading &lt;a href="https://philippdubach.com/posts/is-ai-really-eating-the-world-1/2/"&gt;Is AI Really Eating the World? What we&amp;rsquo;ve Learned [1/2]&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;All current &lt;a href="https://en.wikipedia.org/wiki/Recommender_system"&gt;recommendation systems&lt;/a&gt; work by capturing and analyzing user behavior at scale. Netflix needs millions of users watching millions of hours to train its recommendation algorithm. Amazon needs billions of purchases. The &lt;a href="https://en.wikipedia.org/wiki/Network_effect"&gt;network effect&lt;/a&gt; comes from data scale. What if LLMs can bypass this? What if an LLM can provide useful recommendations by reasoning about conceptual relationships rather than requiring massive behavioral datasets? If I ask for &amp;ldquo;books like Pirsig&amp;rsquo;s Zen and the Art of Motorcycle Maintenance but more focused on Eastern philosophy,&amp;rdquo; a sufficiently capable LLM might answer well without needing to observe 100 million readers. It understands (or appears to understand) the conceptual space. I&amp;rsquo;m uncertain whether LLMs can do this reliably by the end of 2025. The fundamental question is whether they reason or pattern-match at a very sophisticated level. &lt;a href="https://arxiv.org/abs/2308.03762"&gt;Recent research suggests LLMs may rely more on statistical correlations than true reasoning&lt;/a&gt;. If it&amp;rsquo;s mostly pattern-matching, they still need the massive datasets and we&amp;rsquo;re back to conventional network effects. If they can actually reason over conceptual spaces, that&amp;rsquo;s different. That would unbundle data network effects from recommendation quality. Recommendation quality would depend on model capability, not data scale. And if model capability is commoditizing, then the value in recommendations flows to whoever owns customer relationships and distribution, not to whoever has the most data or the best model. I lean toward thinking LLMs are sophisticated pattern-matchers rather than reasoners, which means traditional network effects still apply. 
But this is one area where I&amp;rsquo;m genuinely waiting to see more evidence.&lt;/p&gt;
&lt;p&gt;Now, on AGI. The Silicon Valley consensus, articulated by &lt;a href="https://sherwood.news/tech/gi-artificial-general-intelligence-when-predictions/"&gt;Sutskever, Altman, Musk, and others&lt;/a&gt;, is that we&amp;rsquo;re on a clear path to artificial general intelligence in the next few years, possibly by 2027 or 2028. The argument goes: &lt;a href="https://arxiv.org/abs/2001.08361"&gt;scaling laws&lt;/a&gt; continue to hold, we&amp;rsquo;re seeing emergent capabilities at each scale jump, and there&amp;rsquo;s no obvious wall before we reach human-level performance across all cognitive domains. I remain unconvinced. Not because I think AGI is impossible, but because the path from &amp;ldquo;really good at pattern completion and probabilistic next-token prediction&amp;rdquo; to &amp;ldquo;general reasoning and planning capabilities&amp;rdquo; seems less straightforward than the AI CEOs suggest. &lt;a href="https://arxiv.org/abs/2305.00050"&gt;Current LLMs still fail in characteristic ways on tasks requiring actual causal reasoning&lt;/a&gt;, spatial reasoning, or planning over extended horizons. They&amp;rsquo;re getting better, but the improvement curve on these specific capabilities looks different from the improvement curve on language modeling perplexity. That suggests to me that we might need architectural innovations beyond just scaling, and those are harder to predict.&lt;/p&gt;
&lt;p&gt;But let&amp;rsquo;s say I&amp;rsquo;m wrong. Let&amp;rsquo;s say AGI arrives by 2028. Even then, I find it hard to model why this would be tremendously economically beneficial specifically to the companies that control the models. Here&amp;rsquo;s why: we already have multiple competing frontier models (ChatGPT, Claude, Gemini, Microsoft&amp;rsquo;s offerings, and now DeepSeek). If AGI arrives, it likely arrives for multiple players at roughly the same time, given how quickly capabilities diffuse in this space. Multiple competing AGIs means price competition. Price competition in a product with near-zero marginal cost means prices collapse toward marginal cost. Where does economic value flow in that scenario? It flows to the users of AI, not the providers. Engineering firms using AGI for materials development capture value through better materials. Pharmaceutical companies using AGI for drug discovery capture value through better drugs. Retailers using AGI for inventory management capture value through better margins. The AGI providers compete with each other to offer the capability at the lowest price. This is basic microeconomics. You capture value when you have market power, either through monopoly, through differentiation, or through control of a scarce input. If models are commodities or near-commodities, model providers have none of these.&lt;/p&gt;
&lt;p&gt;The counterargument is that one provider achieves escape velocity and reaches AGI first with enough of a lead that they establish dominance before others catch up. This is the OpenAI/Microsoft theory of the case. Maybe. But the evidence so far suggests capability leads are measured in months, not years. &lt;a href="https://openai.com/index/gpt-4-research/"&gt;GPT-4 launched in March 2023&lt;/a&gt; with a substantial lead. Within six months, &lt;a href="https://www.anthropic.com/news/claude-2"&gt;Claude 2 was comparable&lt;/a&gt;. Within a year, multiple models clustered around similar capability. The diffusion is fast. Another counterargument is vertical integration. Maybe the hyperscalers that control cloud infrastructure plus model development plus customer relationships plus application distribution can capture value even if models themselves commoditize. This is more plausible, essentially the AWS playbook. Amazon didn&amp;rsquo;t make money by having the best database. They made money by owning the infrastructure, the customer relationships, and the entire stack from hardware to application platform. Microsoft is clearly pursuing this strategy with &lt;a href="https://www.microsoft.com/en-us/microsoft-365/blog/2023/03/16/introducing-microsoft-365-copilot-a-whole-new-way-to-work/"&gt;Azure plus OpenAI plus Copilot plus Office integration&lt;/a&gt;. Google has Search plus Cloud plus Gemini plus Workspace. This could work, but it&amp;rsquo;s a different thesis than &amp;ldquo;we have the best model.&amp;rdquo; It&amp;rsquo;s &amp;ldquo;we control the distribution and can bundle.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Evans shows a scatter plot (Slide 34) of model benchmark scores from &lt;a href="https://arxiv.org/abs/2009.03300"&gt;standard evaluations like MMLU and HumanEval&lt;/a&gt;. Leaders change weekly. The gaps are small. Meanwhile, consumer awareness doesn&amp;rsquo;t track model quality. ChatGPT dominates with over &lt;a href="https://openai.com/index/how-people-are-using-chatgpt/"&gt;700 million weekly active users&lt;/a&gt; not because it has the best model anymore, but because it got there first and built brand. If models are commodities, value moves up the stack to product design, distribution, vertical integration, and customer relationships. This is exactly what happened with databases. Oracle didn&amp;rsquo;t win because they had the best database engine. They won through enterprise sales, support contracts, and ecosystem lock-in. Microsoft didn&amp;rsquo;t beat them with a better database. They won by bundling SQL Server with Windows Server and offering acceptable performance at a lower price. The SaaS pattern suggests something similar happens here. The model becomes an input. The applications built on top, the customer relationships, the distribution, those become the valuable assets. Why do I think this pattern applies rather than, say, the search pattern where Google maintained dominance despite no fundamental technical moat? Two reasons: (1) Search had massive data network effects. Every search improved the algorithm, and Google&amp;rsquo;s scale meant they improved faster. LLMs have weaker data network effects because the pretraining data is largely static and publicly available, and fine-tuning data requirements are smaller. (2) Search had winner-take-all dynamics through defaults and single-answer demand. You pick one search engine and use it for everything. AI applications look more diverse. You might use different models for different tasks, or your applications might switch between models transparently based on price and performance. 
The switching costs are lower.&lt;/p&gt;
&lt;p&gt;So where does this leave us? The technology exists and the underlying capabilities are real. But I think the current evidence points toward a world where value flows to applications and customer relationships, and where the $400 billion the hyperscalers are spending buys them competitive positioning rather than monopoly. The integrators are making money now by helping enterprises navigate uncertainty. Some of that will produce real productivity gains. Much of it is expensive signaling and competitive positioning. The startups unbundling existing software will see mixed results; the ones that succeed will do so by owning distribution or solving really specific problems where switching costs are high, not by having better access to AI. The biggest uncertainty is whether the hyperscalers can use vertical integration to capture value anyway, or whether the applications layer fragments and value flows to thousands of specialized companies. That depends less on AI capabilities and more on competitive dynamics, regulation, and whether enterprises prefer integrated platforms or best-of-breed solutions. My guess is we end up somewhere in between. The hyperscalers maintain strong positions through bundling and infrastructure control. A long tail of specialized applications captures value in specific verticals. The model providers themselves, unless they&amp;rsquo;re also infrastructure providers, struggle to capture value proportional to the capability they&amp;rsquo;re creating. But I&amp;rsquo;m genuinely uncertain, and that uncertainty is where the interesting bets are.&lt;/p&gt;
&lt;p&gt;What makes Evans&amp;rsquo; presentation valuable is precisely what frustrated me about it initially: his refusal to collapse uncertainty prematurely. I&amp;rsquo;ve spent this entire post arguing for a specific view of how value will flow in AI markets, but Evans is right that we&amp;rsquo;re pattern-matching from incomplete data. Every previous platform shift looked obvious in retrospect and uncertain in real time. The PC revolution, the internet boom, mobile, they all had credible skeptics who turned out wrong and credible bulls who were right for the wrong reasons. Evans&amp;rsquo; discipline in laying out the full range of possibilities, from commodity to monopoly to something entirely new, is the intellectually honest position. I&amp;rsquo;ve made specific bets here because that&amp;rsquo;s useful for readers trying to navigate the space, but I&amp;rsquo;m more confident in my framework than in my conclusions.&lt;/p&gt;</description></item><item><title>Is AI Really Eating the World? [1/2]</title><link>https://philippdubach.com/posts/is-ai-really-eating-the-world-1/2/</link><pubDate>Sun, 23 Nov 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/is-ai-really-eating-the-world-1/2/</guid><description>&lt;p&gt;In August 2011, Marc Andreessen wrote &lt;a href="https://a16z.com/why-software-is-eating-the-world/"&gt;&amp;ldquo;Why Software Is Eating the World&amp;rdquo;&lt;/a&gt;, an essay about how software was transforming industries, disrupting traditional businesses, and revolutionizing the global economy. Recently, &lt;a href="https://www.ben-evans.com/benedictevans/2014/1/18/a16z"&gt;Benedict Evans&lt;/a&gt;, a former a16z partner, gave a presentation on the generative AI platform shift three years after ChatGPT&amp;rsquo;s launch. His argument in short:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;we know this matters, but we don&amp;rsquo;t know how.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article I will try to explain why I find his framing fascinating but incomplete, and why the evidence points toward AI model commoditization rather than durable competitive advantages at the model layer. Evans structures technology history in cycles. Every 10-15 years, the industry reorganizes around a new platform: &lt;a href="https://en.wikipedia.org/wiki/Mainframe_computer"&gt;mainframes&lt;/a&gt; (1960s-70s), PCs (1980s), web (1990s), smartphones (2000s-2010s). Each shift pulls all innovation, investment, and company creation into its orbit. Generative AI appears to be the next platform shift, or it could break the cycle entirely. The range of outcomes spans from &amp;ldquo;just more software&amp;rdquo; to a single unified intelligence that handles everything. The pattern recognition is smart, but I think the current evidence points more clearly toward commoditization than Evans suggests, with value flowing up the AI value chain to applications rather than to model providers.&lt;/p&gt;
&lt;p&gt;The hyperscalers are spending historic amounts on AI infrastructure. In 2025, &lt;a href="https://techblog.comsoc.org/2025/11/01/ai-spending-boom-accelerates-big-tech-to-invest-invest-an-aggregate-of-400-billion-in-2025-more-in-2026/"&gt;Microsoft, Google, Amazon, and Meta will invest roughly $400 billion&lt;/a&gt; in AI capex, more than global telecommunications capex. Microsoft now spends over 30% of revenue on capex, double what Verizon spends. What has this produced? Models that are simultaneously more capable and less defensible. When ChatGPT launched in November 2022, OpenAI had a massive quality advantage. Today, dozens of models cluster around similar performance. &lt;a href="https://newsletter.semianalysis.com/p/deepseek-debates"&gt;DeepSeek proved that anyone with $500 million can build a frontier AI model&lt;/a&gt;. LLM pricing has collapsed. &lt;a href="https://techcrunch.com/2025/08/08/openai-priced-gpt-5-so-low-it-may-spark-a-price-war/"&gt;OpenAI&amp;rsquo;s API pricing has dropped by 97% since GPT-3&amp;rsquo;s launch&lt;/a&gt;, and every year brings an order of magnitude decline in inference cost.&lt;/p&gt;
&lt;p&gt;Now, $500 million is still an enormous barrier. Only a few dozen entities globally can deploy that capital with acceptable risk. &lt;a href="https://arxiv.org/abs/2303.08774"&gt;GPT-4&amp;rsquo;s performance on complex reasoning tasks&lt;/a&gt;, &lt;a href="https://www.anthropic.com/news/claude-2-1"&gt;Claude&amp;rsquo;s extended context windows of up to 200,000 tokens&lt;/a&gt;, &lt;a href="https://blog.google/technology/ai/google-gemini-ai/"&gt;Gemini&amp;rsquo;s multimodal capabilities&lt;/a&gt;, these represent genuine breakthroughs. But the economic moat isn&amp;rsquo;t obvious to me (yet). Open-source AI models from Meta and Mistral keep narrowing the gap, and if the model layer commoditizes fully, the competitive advantage shifts to data, distribution, and integration.&lt;/p&gt;
&lt;p&gt;Evans uses an extended metaphor: automation that works disappears. In the 1950s, automatic elevators were AI. Today they&amp;rsquo;re just elevators. As &lt;a href="https://en.wikipedia.org/wiki/Larry_Tesler"&gt;Larry Tesler&lt;/a&gt; noted in 1970,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;AI is whatever machines can&amp;rsquo;t do yet. Once it works, it&amp;rsquo;s just software.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The question: will LLMs follow this pattern, or is this different?&lt;/p&gt;
&lt;p&gt;Current enterprise AI deployment shows clear winners but also real constraints. Software development has seen massive adoption, with &lt;a href="https://github.blog/news-insights/research/survey-ai-wave-grows/"&gt;GitHub reporting that 92% of developers now use AI coding tools&lt;/a&gt;. Marketing has found immediate uses generating ad assets at scale. Customer support has attracted investment, though with the caveat that LLMs produce plausible answers, not necessarily correct ones. Beyond these areas, the enterprise AI adoption rate looks scattered. &lt;a href="https://www.deloitte.com/us/en/insights/industry/telecommunications/connectivity-mobile-trends-survey.html"&gt;Deloitte surveys from June 2025 show that roughly 20% of U.S. consumers use generative AI chatbots daily&lt;/a&gt;, with another 34% using them weekly or monthly. Enterprise deployment is further behind. &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai"&gt;McKinsey data shows most AI &amp;ldquo;agents&amp;rdquo; remain in pilot or experimental stages&lt;/a&gt;. A quarter of CIOs have launched something. Forty percent don&amp;rsquo;t expect production deployment until 2026 or later.&lt;/p&gt;
&lt;p&gt;But I think here&amp;rsquo;s where Evans&amp;rsquo; &amp;ldquo;we don&amp;rsquo;t know&amp;rdquo; approach misses something important. Consulting firms are booking billions in AI integration contracts right now. &lt;a href="https://www.crn.com/news/ai/2025/accenture-s-3b-ai-bet-is-paying-off-inside-a-massive-transformation-fueled-by-advanced-ai"&gt;Accenture alone expects $3 billion in GenAI bookings for fiscal 2025&lt;/a&gt;. The revenue isn&amp;rsquo;t coming from the models. It&amp;rsquo;s coming from integration projects, change management, and process redesign. The pitch is simple: your competitors are moving on this, you can&amp;rsquo;t afford to wait. If your competitors are investing and you&amp;rsquo;re not, you risk being left behind. If everyone invests and AI delivers modest gains, you&amp;rsquo;ve maintained relative position. If everyone invests and AI delivers nothing, you&amp;rsquo;ve wasted money but haven&amp;rsquo;t lost competitive ground. Evans notes that cloud adoption took 20 years to reach 30% of enterprise workloads and is still growing. New technology platform cycles always take longer than advocates expect. His most useful analogy is spreadsheets. &lt;a href="https://en.wikipedia.org/wiki/VisiCalc"&gt;VisiCalc&lt;/a&gt; in the late 1970s transformed accounting. If you were an accountant, you had to have it. If you were a lawyer, you thought &amp;ldquo;that&amp;rsquo;s nice for my accountant.&amp;rdquo; ChatGPT today has the same dynamic. Certain people with certain jobs find it immediately essential. Everyone else sees a demo and doesn&amp;rsquo;t know what to do with the blank prompt. This is right, and it suggests we&amp;rsquo;re early. But it doesn&amp;rsquo;t tell us where value will accumulate in the AI value chain.&lt;/p&gt;
&lt;p&gt;The standard pattern for deploying technology goes in stages: (1) Absorb it (make it a feature, automate obvious tasks). (2) Innovate (create new products, unbundle incumbents). (3) Disrupt (redefine what the market is). We&amp;rsquo;re mostly in stage one. Stage two is happening in pockets. &lt;a href="https://www.ycombinator.com/companies"&gt;Y Combinator&amp;rsquo;s recent batches are overwhelmingly AI-focused&lt;/a&gt;, betting on thousands of new companies unbundling existing software (startups are attacking specific enterprise problems like converting COBOL to Java or reconfiguring telco billing systems). Stage three remains speculative. From an economic perspective, there&amp;rsquo;s the automation question: do you do the same work with fewer people, or more work with the same people? This echoes debates about &lt;a href="https://en.wikipedia.org/wiki/Technological_change#Labor-augmenting_technological_change"&gt;labor-augmenting technical change&lt;/a&gt; in economics. Companies whose competitive advantage was &amp;ldquo;we can afford to hire enough people to do this&amp;rdquo; face real pressure. Companies whose advantage was unique data, customer relationships, or distribution may get stronger. This is standard economic analysis of labor-augmenting technical change, and it probably holds here too.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Continue reading &lt;a href="https://philippdubach.com/posts/is-ai-really-eating-the-world-agi-networks-value-2/2/"&gt;Is AI Really Eating the World? AGI, Networks, and Value [2/2]&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Weather Forecasts Have Improved a Lot</title><link>https://philippdubach.com/posts/weather-forecasts-have-improved-a-lot/</link><pubDate>Sat, 22 Nov 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/weather-forecasts-have-improved-a-lot/</guid><description>&lt;p&gt;Reading the press release for Google DeepMind&amp;rsquo;s &lt;a href="https://deepmind.google/discover/blog/weathernext-2-our-most-advanced-weather-forecasting-model/"&gt;WeatherNext 2&lt;/a&gt;, I wondered: have weather forecasts actually improved over the past years?&lt;/p&gt;
&lt;p&gt;Turns out they have, dramatically. &lt;a href="https://ourworldindata.org/weather-forecasts"&gt;A four-day forecast today matches the accuracy of a one-day forecast from 30 years ago&lt;/a&gt;. Hurricane track errors that once exceeded 400 nautical miles for 72-hour forecasts now sit below 80 miles. The &lt;a href="https://charts.ecmwf.int"&gt;European Centre for Medium-Range Weather Forecasts reports three-day forecasts now reach 97% accuracy&lt;/a&gt;, with seven-day forecasts approaching that threshold.&lt;/p&gt;
&lt;p&gt;Google&amp;rsquo;s new model accelerates this trend. &lt;a href="https://arstechnica.com/science/2025/11/googles-new-weather-model-impressed-during-its-first-hurricane-season/"&gt;The hurricane model performed remarkably well this season when tested against actual paths&lt;/a&gt;. WeatherNext 2 generates forecasts 8 times faster than its predecessor with resolution down to one hour. Each prediction takes under a minute on a single TPU compared to hours on a supercomputer using physics-based models. The speed comes from a smarter training approach. WeatherNext 2 (along with &lt;a href="https://www.nature.com/articles/s41586-024-07744-y"&gt;neuralgcm&lt;/a&gt;) uses a continuous ranked probability score (CRPS) objective rather than the L2 losses common in earlier neural weather models. The method adds random noise to parameters and trains the model to minimize L1 loss while maximizing differences between ensemble members with different noise initializations.&lt;/p&gt;
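&lt;p&gt;A minimal sketch of that objective, assuming nothing about DeepMind&amp;rsquo;s actual code: the standard empirical CRPS for an ensemble scores a forecast by its mean L1 error to the observation minus half the mean pairwise spread between members, so training on it rewards both accuracy and calibrated disagreement.&lt;/p&gt;

```python
import numpy as np

def crps_ensemble(forecasts, obs):
    """Empirical CRPS for one scalar observation and an m-member ensemble.

    First term: mean L1 error to the observation (accuracy).
    Second term: mean pairwise spread, subtracted, so the score
    rewards ensembles whose members disagree just enough.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    skill = np.mean(np.abs(forecasts - obs))
    spread = np.abs(forecasts[:, None] - forecasts[None, :]).mean()
    return skill - 0.5 * spread

# A spread ensemble bracketing the truth scores better (lower) than a
# collapsed, all-identical ensemble with the same kind of miss.
collapsed = crps_ensemble([2.0, 2.0, 2.0], obs=1.0)   # pure L1 error: 1.0
spread    = crps_ensemble([0.5, 1.0, 1.5], obs=1.0)
```

&lt;p&gt;With these toy numbers the spread ensemble scores roughly 0.11 against 1.0 for the collapsed one, which is the incentive that keeps ensemble members from converging to a blurred mean.&lt;/p&gt;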
&lt;p&gt;This matters because L2 losses blur predictions when models roll out autoregressively over multiple time steps. Spatial features degrade and the model truncates extremes. &lt;a href="https://news.ycombinator.com/item?id=45957193"&gt;Models trained with L2 losses struggle to forecast high-impact extreme weather at moderate lead times&lt;/a&gt;. The CRPS objective preserves the sharp spatial features and extreme values needed for cyclone tracking and heat wave prediction. These improvements stem from better satellite and ground station data, faster computers running higher-resolution models, and improved communication through apps and online services. AI systems like WeatherNext 2 and Pangu-Weather (which performs forecasts up to 10,000 times faster than traditional methods) are accelerating progress that has been building for decades.&lt;/p&gt;</description></item><item><title>The Bicycle Needs Riding to be Understood</title><link>https://philippdubach.com/posts/the-bicycle-needs-riding-to-be-understood/</link><pubDate>Fri, 14 Nov 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/the-bicycle-needs-riding-to-be-understood/</guid><description>&lt;blockquote&gt;
&lt;p&gt;Some concepts are easy to grasp in the abstract. Boiling water: apply heat and wait. Others you really need to try. You only think you understand how a bicycle works, until you learn to ride one.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You should write an LLM agent—not because agents are revolutionary, but because the bicycle needs riding to be understood. Having built agents myself, I find that Ptacek&amp;rsquo;s central insight resonates: the behavior surprises in specific ways, particularly in how models scale effort with complexity before inexplicably retreating.&lt;/p&gt;
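&lt;p&gt;In that spirit, here is a toy version of the loop, with the model injected as a plain function so a scripted stub can stand in for a real LLM API. Every name in it is illustrative, not Ptacek&amp;rsquo;s code.&lt;/p&gt;

```python
# A toy agent loop: feed the model a growing transcript, execute any tool
# call it emits, stop when it answers in plain text. All names hypothetical.
import json

TOOLS = {
    "ping": lambda host: f"PING {host}: 3 packets transmitted, 3 received",
}

def run_agent(model, task, max_steps=5):
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        reply = model(transcript)
        transcript.append(reply)
        try:
            call = json.loads(reply)          # tool calls arrive as JSON
        except json.JSONDecodeError:
            return reply                      # plain text = final answer
        result = TOOLS[call["tool"]](call["arg"])
        transcript.append(f"observation: {result}")
    return transcript[-1]

# Scripted stand-in for an LLM: pings once, then answers.
def stub_model(transcript):
    if not any(t.startswith("observation:") for t in transcript):
        return json.dumps({"tool": "ping", "arg": "8.8.8.8"})
    return "8.8.8.8 is reachable."

print(run_agent(stub_model, "check whether 8.8.8.8 is up"))
# → 8.8.8.8 is reachable.
```

&lt;p&gt;Swapping the stub for a real model call is the whole trick; everything interesting—what goes in the transcript, which tools exist, when to stop—is ordinary programming.&lt;/p&gt;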
&lt;p&gt;Ptacek walks through building a functioning agent in roughly 50 lines of Python, demonstrating how an LLM with ping access autonomously chose multiple Google endpoints without explicit instruction, a moment that crystallizes both promise and unpredictability. His broader point matches my experience: context engineering isn&amp;rsquo;t mystical but straightforward programming—managing token budgets, orchestrating sub-agents, balancing explicit loops against emergent behavior. The open problems in agent design—titrating nondeterminism, connecting to ground truth, allocating tokens—remain remarkably accessible to individual experimentation, each iteration taking minutes rather than requiring institutional resources.&lt;/p&gt;</description></item><item><title>AI Models as Standalone P&amp;Ls</title><link>https://philippdubach.com/posts/ai-models-as-standalone-pls/</link><pubDate>Sun, 09 Nov 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/ai-models-as-standalone-pls/</guid><description>&lt;blockquote&gt;
&lt;p&gt;Microsoft reported earnings for the quarter ended Sept. [&amp;hellip;] buried in its financial filings were a couple of passages suggesting that OpenAI suffered a net loss of $11.5 billion or more during the quarter.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For every dollar of revenue, they&amp;rsquo;re allegedly spending roughly $5 to deliver the product. These OpenAI losses initially sound like a joke about &amp;ldquo;making it up on volume,&amp;rdquo; but they point to a more fundamental problem facing OpenAI and its competitors. AI companies are locked into continuously releasing more powerful (and expensive) models. If they stop, &lt;a href="https://arxiv.org/abs/2311.16989"&gt;open-source alternatives will catch up&lt;/a&gt; and offer equivalent capabilities at substantially lower costs. This creates an uncomfortable dynamic. If your current model requires spending more than you earn just to fund the next generation, the path to profitability becomes unclear—perhaps impossible.&lt;/p&gt;
&lt;p&gt;Anthropic CEO Dario Amodei (everybody&amp;rsquo;s favorite AI CEO) recently offered a different perspective in a &lt;a href="https://youtu.be/GcqQ1ebBqkc?si=sEDGAVBuZsjtLpZS&amp;amp;t=1016"&gt;conversation with Stripe co-founder John Collison&lt;/a&gt;. He argues that treating each model as an independent business unit reveals a different picture than conventional accounting suggests.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let&amp;rsquo;s say in 2023, you train a model that costs $100 million, and then you deploy it in 2024 and it makes $200 million of revenue.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So far, this looks profitable, a solid 2x return on the training investment. But here&amp;rsquo;s where it gets complicated.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Meanwhile, because of the scaling laws, in 2024, you also train a model that costs $1 billion. If you look in a conventional way at the profit and loss of the company you&amp;rsquo;ve lost $100 million the first year, you&amp;rsquo;ve lost $800 million the second year, and you&amp;rsquo;ve lost $8 billion in the third year, so it looks like it&amp;rsquo;s getting worse and worse.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The pattern continues:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In 2025, you get $2 billion of revenue from that $1 billion model trained the previous year.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Again, viewed in isolation, this model returned 2x its training cost.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;And you spend $10 billion to train the model for the following year.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The losses appear to accelerate dramatically, from $100 million to $800 million to $8 billion.&lt;/p&gt;
&lt;p&gt;This is where Amodei&amp;rsquo;s reframing becomes interesting.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you consider each model to be a company, the model that was trained in 2023 was profitable. You paid $100 million and then it made $200 million of revenue.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He also acknowledges there are inference costs (the actual computing expenses of running the model for users) but suggests these don&amp;rsquo;t fundamentally change the picture in his simplified example. His core argument:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If every model was a company, the model in this example is actually profitable. What&amp;rsquo;s going on is that at the same time as you&amp;rsquo;re reaping the benefits from one company, you&amp;rsquo;re founding another company that&amp;rsquo;s much more expensive and requires much more upfront R&amp;amp;D investment.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is essentially an argument that AI companies are building a portfolio of profitable products, but the accounting makes it look terrible because each successive &amp;ldquo;product&amp;rdquo; costs 10x more than the last to develop. The losses stem from overlapping these profitable cycles while exponentially increasing investment scale. But this framework only works if two critical assumptions hold: (1) Each model consistently returns roughly 2x its training cost in revenue, and (2) The improvements from spending 10x more justify that investment—meaning customers will pay enough more for the better model to maintain that 2x return.&lt;/p&gt;
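&lt;p&gt;The arithmetic behind this framing is easy to make explicit (figures in millions, taken straight from the quotes, and ignoring inference costs as Amodei does in his simplified example):&lt;/p&gt;

```python
# Each year: (training spend committed that year, revenue earned by the
# model trained the previous year). Figures in millions, from the quotes.
years = {2023: (100, 0), 2024: (1_000, 200), 2025: (10_000, 2_000)}

# Conventional P&L: this year's revenue minus this year's training spend.
company_pnl = {y: rev - spend for y, (spend, rev) in years.items()}
print(company_pnl)   # {2023: -100, 2024: -800, 2025: -8000}

# Per-model P&L: match each model's revenue to its own training cost.
model_2023 = years[2024][1] - years[2023][0]   # 200 - 100   -> +100
model_2024 = years[2025][1] - years[2024][0]   # 2000 - 1000 -> +1000
print(model_2023, model_2024)
```

&lt;p&gt;Every model is individually profitable at the assumed 2x return, yet the company-level loss widens tenfold each year. That mismatch between the two views of the same cash flows is exactly what Amodei is pointing at.&lt;/p&gt;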
&lt;p&gt;Amodei outlines two ways this resolves:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;So the way that it&amp;rsquo;s going to shake out is this will keep going up until the numbers go very large and the models can&amp;rsquo;t get larger, and, you know, then it&amp;rsquo;ll be a large, very profitable business.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this first scenario, scaling hits physical or practical limits. You&amp;rsquo;ve maxed out available compute, data, or capability improvements. Training costs plateau because you literally can&amp;rsquo;t build a meaningfully larger model. At that point, companies stop needing exponentially larger investments and begin harvesting profits from their final-generation models. The second scenario is less optimistic:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Or at some point the models will stop getting better, right? The march to AGI will be halted for some reason.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If the improvements stop delivering proportional returns before reaching natural limits, companies face what Amodei calls overhang.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;And then perhaps there&amp;rsquo;ll be some overhang, so there&amp;rsquo;ll be a one-time, &amp;lsquo;Oh man, we spent a lot of money and we didn&amp;rsquo;t get anything for it,&amp;rsquo; and then the business returns to whatever scale it was at.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What Amodei&amp;rsquo;s framework doesn&amp;rsquo;t directly address is the open-source problem. If training Model C costs $10 billion but open-source alternatives &lt;a href="https://synaptic.com/resources/open-source-ai-2024"&gt;reach comparable performance six months later&lt;/a&gt;, that 2x return window might not materialize. The entire argument depends on maintaining a significant capability lead that customers will pay premium prices for. There&amp;rsquo;s also the question of whether the 2x return assumption holds as models become more expensive. The jump from $100 million to $1 billion to $10 billion in training costs assumes that customers will consistently value the improvements enough to double revenue.&lt;/p&gt;</description></item><item><title>Working with Models</title><link>https://philippdubach.com/posts/working-with-models/</link><pubDate>Sat, 08 Nov 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/working-with-models/</guid><description>&lt;p&gt;There was this &amp;ldquo;&lt;a href="https://us1.discourse-cdn.com/flex001/uploads/ultralytics1/original/1X/45c604467b6f4212858281cf28f71a77083fb45e.jpeg"&gt;I work with Models&lt;/a&gt;&amp;rdquo; joke which I first heard years ago from an analyst working on a valuation model (&lt;a href="https://philippdubach.com/posts/everything-is-a-dcf-model/"&gt;see my previous post&lt;/a&gt;). I guess it has become more relevant than ever:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This monograph presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffusion modeling starts by defining a forward process that gradually corrupts data into noise, linking the data distribution to a simple prior through a continuum of intermediate distributions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you want to get into this topic in the first place, be sure to check out &lt;a href="https://deepgenerativemodels.github.io"&gt;Stefano Ermon&amp;rsquo;s CS236 Deep Generative Models Course&lt;/a&gt;. Lecture recordings of the full course can also be found on &lt;a href="https://www.youtube.com/playlist?list=PLoROMvodv4rPOWA-omMM6STXaWW4FvJT8"&gt;YouTube&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Sentiment Trading Revisited</title><link>https://philippdubach.com/posts/sentiment-trading-revisited/</link><pubDate>Mon, 07 Jul 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/sentiment-trading-revisited/</guid><description>&lt;p&gt;Interesting new paper on news sentiment embeddings for stock price forecasting that builds on many of the ideas &lt;a href="https://philippdubach.com/posts/trading-on-market-sentiment/"&gt;I explored in this project&lt;/a&gt;. The research, by Ayaan Qayyum, an &lt;a href="https://soe.rutgers.edu/news/ayaan-qayyum-electrical-and-computer-engineering"&gt;Undergraduate Research Scholar at Rutgers&lt;/a&gt;, shows that the core concept of using advanced language models for sentiment trading is not only viable but highly effective. The study takes a similar but more advanced approach. Instead of using a model like GPT-3.5 to generate a simple sentiment score, it uses &lt;a href="https://platform.openai.com/docs/guides/embeddings/embedding-models"&gt;OpenAI&amp;rsquo;s embedding models&lt;/a&gt; to convert news headlines into rich, high-dimensional vectors. By training a &lt;a href="https://arxiv.org/html/2507.01970v1/extracted/6556003/diagrams/model_comb_diagram.png"&gt;battery of neural networks&lt;/a&gt; including&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Gated Recurrent Units (GRU), Hidden Markov Model (HMM), Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and a Feed-Forward Neural Network (FFNN). All were implemented using PyTorch.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;on these embeddings alongside economic data, the study found it could &lt;a href="https://arxiv.org/html/2507.01970v1/extracted/6556003/diagrams/models_ranked_smape.png"&gt;reduce prediction errors by up to 40%&lt;/a&gt; compared to models without the news data.&lt;/p&gt;
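As a rough sketch of that pipeline's shape (not the paper's implementation): headline embeddings and economic features are concatenated and fed to a learned head. Here synthetic arrays and a linear least-squares head stand in for OpenAI embeddings and the GRU/LSTM/TCN/FFNN battery, with sMAPE as the error metric the paper ranks models by:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 100 trading days, 16-dim headline embeddings
# (the paper uses much higher-dimensional OpenAI embeddings) plus
# 3 economic features.
emb = rng.normal(size=(100, 16))
econ = rng.normal(size=(100, 3))
X = np.hstack([emb, econ, np.ones((100, 1))])  # features + bias column

# Synthetic next-day prices with a weak linear dependence on the features.
true_w = rng.normal(size=X.shape[1])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# A linear head stands in for the paper's neural-network battery.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w

# Symmetric mean absolute percentage error (sMAPE), in percent.
smape = 100 * np.mean(2 * np.abs(pred - y) / (np.abs(pred) + np.abs(y)))
print(round(smape, 2))
```

Dropping the `emb` columns from `X` and refitting is the paper's ablation in miniature: the reported result is that models without the news features score meaningfully worse on sMAPE.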
&lt;p&gt;The most surprising insight to me, and one that directly addresses the challenge of temporal drift I discussed, was that Qayyum&amp;rsquo;s time-independent models performed as well as, if not better than, the time-dependent ones. By shuffling the data, the models were forced to learn the pure semantic impact of a headline, independent of its specific place in time. This suggests that the market reacts to the substance of news in consistent ways, even if the narratives themselves change.&lt;/p&gt;</description></item><item><title>Not All AI Skeptics Think Alike</title><link>https://philippdubach.com/posts/not-all-ai-skeptics-think-alike/</link><pubDate>Thu, 12 Jun 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/not-all-ai-skeptics-think-alike/</guid><description>&lt;p&gt;Apple&amp;rsquo;s recent paper &amp;ldquo;The Illusion of Thinking&amp;rdquo; has been widely understood to demonstrate that reasoning models don&amp;rsquo;t &amp;lsquo;actually&amp;rsquo; reason. Using controllable puzzle environments instead of contaminated math benchmarks, the researchers discovered something fascinating: there are three distinct performance regimes when it comes to AI reasoning complexity. For simple problems, standard models actually outperform reasoning models while being more token-efficient. At medium complexity, reasoning models show their advantage. But at high complexity? Both collapse completely.
Here&amp;rsquo;s the kicker: reasoning models exhibit counterintuitive scaling behavior—their thinking effort increases with problem complexity up to a point, then declines despite having adequate token budget. It&amp;rsquo;s like watching a student give up mid-exam when the questions get too hard, even though they have plenty of time left.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We observe that reasoning models initially increase their thinking tokens proportionally with problem complexity. However, upon approaching a critical threshold—which closely corresponds to their accuracy collapse point—models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The researchers found something even more surprising: even when they provided explicit algorithms—essentially giving the models the answers—performance didn&amp;rsquo;t improve. The collapse happened at roughly the same complexity threshold.&lt;/p&gt;
&lt;p&gt;On the other hand, &lt;a href="https://www.seangoedecke.com/illusion-of-thinking/"&gt;Sean Goedecke&lt;/a&gt; is not buying Apple&amp;rsquo;s methodology. His core objection? Puzzles &amp;ldquo;require computer-like algorithm-following more than they require the kind of reasoning you need to solve math problems.&amp;rdquo;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can&amp;rsquo;t compare eight-disk to ten-disk Tower of Hanoi, because you&amp;rsquo;re comparing &amp;ldquo;can the model work through the algorithm&amp;rdquo; to &amp;ldquo;can the model invent a solution that avoids having to work through the algorithm&amp;rdquo;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From his own testing, models &amp;ldquo;decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to even start.&amp;rdquo; That&amp;rsquo;s strategic behavior, not reasoning failure. This matters because it shows how evaluation methodology shapes our understanding of AI capabilities. Goedecke argues Tower of Hanoi puzzles aren&amp;rsquo;t useful for determining reasoning ability, and that the complexity threshold of reasoning models may not be fixed.&lt;/p&gt;</description></item><item><title>The Model Said So</title><link>https://philippdubach.com/posts/the-model-said-so/</link><pubDate>Wed, 28 May 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/the-model-said-so/</guid><description>&lt;p&gt;LLMs make your life easier until they don&amp;rsquo;t.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Their intrinsic complexity and lack of transparency pose significant challenges, especially in the highly regulated financial sector&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Unlike other industries where &amp;ldquo;the model said so&amp;rdquo; might suffice, finance demands audit trails, bias detection,
and explainable decision-making—requirements that sit uncomfortably with neural networks containing billions of parameters.
The research highlights a fundamental tension that&amp;rsquo;s about to reshape fintech:
the same complexity that makes LLMs powerful at parsing market sentiment or generating investment reports also makes them regulatory nightmares
in a sector where you need to explain every decision to examiners.&lt;/p&gt;</description></item><item><title>Trading on Market Sentiment</title><link>https://philippdubach.com/posts/trading-on-market-sentiment/</link><pubDate>Thu, 20 Feb 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/trading-on-market-sentiment/</guid><description>&lt;p&gt;&lt;em&gt;This post is based in part on a 2022 presentation I gave for the &lt;a href="https://www.ft.com/content/3bd45acd-b323-3c6b-ba98-ac78b456f308"&gt;ICBS Student Investment Fund&lt;/a&gt; and my seminar work at Imperial College London.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;As we were looking for new investment strategies for our Macro Sentiment Trading team, OpenAI had just published their &lt;a href="https://platform.openai.com/docs/models/gpt-3-5-turbo"&gt;GPT-3.5 Model&lt;/a&gt;. After first experiments with the model, we asked ourselves: How would large language models like GPT-3.5 perform in predicting sentiment in financial markets, where the signal-to-noise ratio is notoriously low? And could they potentially even outperform industry benchmarks at interpreting market sentiment from news headlines? The idea wasn&amp;rsquo;t entirely new. &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3389884"&gt;Studies&lt;/a&gt; &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1702854"&gt;[2]&lt;/a&gt; &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=685145"&gt;[3]&lt;/a&gt; have shown that investor sentiment, extracted from news and social media, can forecast market movements. But most approaches rely on traditional NLP models or proprietary systems like &lt;a href="https://www.ravenpack.com"&gt;RavenPack&lt;/a&gt;. With the recent advances in large language models, I wanted to test whether these more sophisticated models could provide a competitive edge in sentiment-based trading. Before looking at model selection, it&amp;rsquo;s worth understanding what makes trading on sentiment so challenging. News headlines present two fundamental problems that any robust system must address.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-news-relevance-timeline-jpg-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/news-relevance-timeline.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/news-relevance-timeline.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/news-relevance-timeline.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/news-relevance-timeline.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/news-relevance-timeline.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/news-relevance-timeline.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/news-relevance-timeline.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/news-relevance-timeline.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/news-relevance-timeline.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/news-relevance-timeline.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/news-relevance-timeline.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/news-relevance-timeline.jpg"
alt="Relative frequency of monthly Google News Search terms over 5 years. Numbers represent search interest relative to highest point. A value of 100 is the peak popularity for the term."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-news-relevance-timeline-jpg-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/news-relevance-timeline.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Relative frequency of monthly Google News Search terms over 5 years. Numbers represent search interest relative to highest point. A value of 100 is the peak popularity for the term." decoding="async"&gt;
&lt;/dialog&gt;
First, headlines are inherently non-stationary. Unlike other data sources, news reflects the constantly shifting landscape of global events, political climates, economic trends, etc. A model trained on COVID-19 vaccine headlines from 2020 might struggle with geopolitical tensions in 2023. This temporal drift means algorithms must be adaptive to maintain relevance.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-headline-market-impact-jpg-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/headline-market-impact.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/headline-market-impact.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/headline-market-impact.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/headline-market-impact.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/headline-market-impact.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/headline-market-impact.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/headline-market-impact.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/headline-market-impact.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/headline-market-impact.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/headline-market-impact.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/headline-market-impact.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/headline-market-impact.jpg"
alt="Impact of headlines measured by subsequent index move (Data Source: Bloomberg)"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-headline-market-impact-jpg-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/headline-market-impact.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Impact of headlines measured by subsequent index move (Data Source: Bloomberg)" decoding="async"&gt;
&lt;/dialog&gt;
Second, the relationship between headlines and market impact is far from obvious. Consider these actual headlines from November 2020: &amp;ldquo;Pfizer Vaccine Prevents 90% of COVID Infections&amp;rdquo; drove the S&amp;amp;P 500 up 1.85%, while &amp;ldquo;Pfizer Says Safety Milestone Achieved&amp;rdquo; barely moved the market at -0.05%. The same company, similar positive news, dramatically different market reactions.&lt;/p&gt;
&lt;p&gt;When developing a sentiment-based trading system, you essentially have two conceptual approaches: forward-looking and backward-looking.
Forward-looking models try to predict which news themes will drive markets, often working qualitatively by creating logical frameworks that capture market expectations. This approach is highly adaptable but requires deep domain knowledge and is time-consuming to maintain.
Backward-looking models analyze historical data to understand which headlines have moved markets in the past, then look for similarities in current news. This approach can leverage large datasets and scale efficiently, but suffers from low signal-to-noise ratios and the challenge that past relationships may not hold in the future.
For this project, I chose the backward-looking approach, primarily for its scalability and ability to work with existing datasets.&lt;/p&gt;
&lt;p&gt;Rather than rely on traditional approaches like &lt;a href="https://github.com/ProsusAI/finBERT"&gt;FinBERT&lt;/a&gt; (which only provides discrete positive/neutral/negative classifications), I decided to test OpenAI&amp;rsquo;s GPT-3.5 Turbo model. The key advantage was its ability to provide continuous sentiment scores from -1 to 1, giving much more nuanced signals for trading decisions. I used news headlines from the Dow Jones Newswire covering the 30 DJI companies from 2018-2022, filtering for quality sources like the Wall Street Journal and Bloomberg. After removing duplicates, this yielded 2,072 headlines. I then prompted GPT-3.5 to score sentiment with the instruction: &lt;code&gt;Rate the sentiment of the following news headlines from -1 (very bad) to 1 (very good), with two decimal precision&lt;/code&gt;. To validate the approach, I compared GPT-3.5 scores against RavenPack—the industry&amp;rsquo;s leading commercial sentiment provider.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-score-comparison-openai-rpa-jpg-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/score-comparison-openai-rpa.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/score-comparison-openai-rpa.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/score-comparison-openai-rpa.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/score-comparison-openai-rpa.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/score-comparison-openai-rpa.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/score-comparison-openai-rpa.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/score-comparison-openai-rpa.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/score-comparison-openai-rpa.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/score-comparison-openai-rpa.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/score-comparison-openai-rpa.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/score-comparison-openai-rpa.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/score-comparison-openai-rpa.jpg"
alt="Sample entries of the combined data set."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-score-comparison-openai-rpa-jpg-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/score-comparison-openai-rpa.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Sample entries of the combined data set." decoding="async"&gt;
&lt;/dialog&gt;
The correlation was 0.59, indicating the models generally agreed on sentiment direction while providing different granularities of scoring. More interesting was comparing the distribution of the sentiment ratings between the two models. The two distributions could likely have been brought closer together by fine-tuning the (minimal) prompt used earlier.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-distribution-of-sentiment-openai-rpa-jpg-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/distribution-of-sentiment-openai-rpa.jpg"
alt="Comparing the distribution of the sentiment scores generated using the GPT-3.5 model with the benchmark scores from RavenPack."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-distribution-of-sentiment-openai-rpa-jpg-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/distribution-of-sentiment-openai-rpa.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Comparing the distribution of the sentiment scores generated using the GPT-3.5 model with the benchmark scores from RavenPack." decoding="async"&gt;
&lt;/dialog&gt;
I implemented a simple strategy: go long when sentiment hits the top 5% of scores, close positions at 25% profit (to reduce transaction costs), and maintain a fully invested portfolio with 1% commission per trade.
The results were mixed but promising. Over the full 2018-2022 period, the GPT-3.5 strategy generated 41.02% returns compared to RavenPack&amp;rsquo;s 40.99%—essentially matching the industry benchmark. However, both underperformed a simple buy-and-hold approach (58.13%) during this generally bullish period. Relying on market sentiment when news flow is low can be a tricky strategy. As can be seen from the example of the Salesforce stock performance, the strategy remained uninvested for long stretches due to a (sometimes long-lasting) negative sentiment signal.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-crm-stock-sentiment-jpg-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/crm-stock-sentiment.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/crm-stock-sentiment.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/crm-stock-sentiment.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/crm-stock-sentiment.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/crm-stock-sentiment.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/crm-stock-sentiment.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/crm-stock-sentiment.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/crm-stock-sentiment.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/crm-stock-sentiment.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/crm-stock-sentiment.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/crm-stock-sentiment.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/crm-stock-sentiment.jpg"
alt="Stock performance of Salesforce (CRM) for 5 years from 2018 with sentiment indicators overlayed."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-crm-stock-sentiment-jpg-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/crm-stock-sentiment.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Stock performance of Salesforce (CRM) for 5 years from 2018 with sentiment indicators overlayed." decoding="async"&gt;
&lt;/dialog&gt;
When I tested different timeframes, the sentiment strategy showed its strength during volatile periods. From 2020-2022, it outperformed buy-and-hold (22.83% vs 21.00%). As expected, sentiment-based approaches work better when markets are less directional and more driven by news flow. To evaluate whether the scores generated by our GPT prompt were more accurate than those from the RavenPack benchmark, I calculated returns for different holding windows. The GPT-generated scores perform significantly better in the short term (1 and 10 days) for positive sentiment and in the long term (90 days) for negative sentiment.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-sentiment-trading-results-jpg-6" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/sentiment-trading-results.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/sentiment-trading-results.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/sentiment-trading-results.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/sentiment-trading-results.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/sentiment-trading-results.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/sentiment-trading-results.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/sentiment-trading-results.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/sentiment-trading-results.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/sentiment-trading-results.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/sentiment-trading-results.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/sentiment-trading-results.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/sentiment-trading-results.jpg"
alt="Average 1, 10, 30, and 90-day holding period return for both models."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-sentiment-trading-results-jpg-6" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/sentiment-trading-results.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Average 1, 10, 30, and 90-day holding period return for both models." decoding="async"&gt;
&lt;/dialog&gt;
&lt;em&gt;(Note: For lower sentiment, negative returns are desirable since the stock would be shorted)&lt;/em&gt;&lt;/p&gt;
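The strategy rules described above (enter long when sentiment hits the top 5% of scores, take profit at 25%, 1% commission per trade) can be sketched for a single name as follows. The scores and returns are synthetic placeholders, and the fully invested multi-stock portfolio is simplified to one position:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: daily sentiment scores in [-1, 1] and daily simple
# returns for one stock; in the project these come from GPT-3.5 and
# Dow Jones price data.
scores = rng.uniform(-1, 1, size=1000)
returns = rng.normal(0.0005, 0.02, size=1000)

entry_threshold = np.quantile(scores, 0.95)  # top 5% of sentiment scores
take_profit = 0.25   # close at 25% profit to limit transaction costs
commission = 0.01    # 1% per trade, charged on entry and on exit

equity, in_pos, basis = 1.0, False, 1.0
for s, r in zip(scores, returns):
    if in_pos:
        equity *= 1 + r                        # compound while invested
        if equity / basis - 1 >= take_profit:  # take-profit hit: exit
            equity *= 1 - commission
            in_pos = False
    elif s >= entry_threshold:                 # strong sentiment: enter long
        equity *= 1 - commission
        in_pos, basis = True, equity

print(f"final equity multiple: {equity:.3f}")
```

Even in this toy form, the effect noted in the Salesforce example is visible: whenever `s` stays below `entry_threshold`, the sketch sits in cash and forgoes any market drift.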
&lt;p&gt;While the model performed well technically, this project highlighted several practical challenges. First, data accessibility remains a major hurdle—getting real-time, high-quality news feeds is expensive and often restricted. Second, the strategy worked best in volatile environments, which prompted many individual trades and created substantial transaction costs that significantly eroded returns. Perhaps most importantly, any real-world implementation would need to compete with high-frequency traders who can act on news within milliseconds. The few seconds required for GPT-3.5 to process headlines and generate sentiment scores are far from competitive. Despite these challenges, the project demonstrated that LLMs can match industry benchmarks for sentiment analysis—and this was using a general-purpose model, not one specifically fine-tuned for financial applications. OpenAI (and others) today offer more powerful models at very low cost as well as fine-tuning capabilities that could further improve performance. The bigger opportunity might be in combining sentiment signals with other factors, using sentiment as one input in a more sophisticated trading system rather than the sole decision criterion. There&amp;rsquo;s also potential in expanding beyond simple long-only strategies to include short positions on negative sentiment, or developing &amp;ldquo;sentiment indices&amp;rdquo; that smooth out individual headline noise.
Market sentiment strategies may not be optimal for long-term investing, but they show clear promise for shorter-term trading in volatile environments. As LLMs continue to improve and become more accessible, this might offer an opportunity to revisit this project.&lt;/p&gt;
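The holding-window evaluation used above (average 1, 10, 30, and 90-day returns conditioned on extreme sentiment) can be sketched like this, again on synthetic stand-in data rather than the project's actual score and price series:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins: one sentiment score per headline day and a
# daily return series; the project compared GPT-3.5 and RavenPack scores.
scores = rng.uniform(-1, 1, size=500)
returns = rng.normal(0.0004, 0.015, size=600)

def holding_return(start, days):
    """Cumulative simple return over `days` trading days from `start`."""
    return np.prod(1 + returns[start:start + days]) - 1

for days in (1, 10, 30, 90):
    fwd = np.array([holding_return(i, days) for i in range(len(scores))])
    hi = fwd[scores >= np.quantile(scores, 0.95)].mean()  # top 5% sentiment
    lo = fwd[scores <= np.quantile(scores, 0.05)].mean()  # bottom 5%
    print(f"{days:>2}d  high-sentiment {hi:+.4f}  low-sentiment {lo:+.4f}")
```

For a skillful score, the high-sentiment bucket should show higher forward returns than the low-sentiment bucket at some horizon; on this random data the two buckets are, by construction, indistinguishable.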
&lt;aside class="disclaimer" role="note" aria-label="Disclaimer"&gt;
&lt;div class="disclaimer-content"&gt;&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; All opinions expressed are my own. This is not investment, financial, tax, or legal advice. Past performance does not indicate future results. Do your own research and consult qualified professionals before making financial decisions. No liability accepted for any losses.&lt;/p&gt;&lt;/div&gt;
&lt;/aside&gt;</description></item></channel></rss>