<?xml version="1.0" encoding="utf-8" standalone="yes"?><?xml-stylesheet type="text/xsl" href="/rss.xsl"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Tech on Philipp D. Dubach | Quantitative Finance &amp; AI Strategy</title><link>https://philippdubach.com/categories/tech/</link><description>Recent content in Tech on Philipp D. Dubach | Quantitative Finance &amp; AI Strategy</description><image><url>https://static.philippdubach.com/ograph/ograph-post.jpg</url><title>Philipp D. Dubach | Quantitative Finance &amp; AI Strategy</title><link>https://philippdubach.com/</link></image><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>me@philippdubach.com (Philipp D. Dubach)</managingEditor><webMaster>me@philippdubach.com (Philipp D. Dubach)</webMaster><atom:link href="https://philippdubach.com/categories/tech/index.xml" rel="self" type="application/rss+xml"/><item><title>Karpathy's Software 3.0 Playbook</title><link>https://philippdubach.com/posts/karpathys-software-3.0-playbook/</link><pubDate>Fri, 01 May 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/karpathys-software-3.0-playbook/</guid><description>&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-karpathy_header-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/karpathy_header.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/karpathy_header.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/karpathy_header.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/karpathy_header.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/karpathy_header.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/karpathy_header.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/karpathy_header.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/karpathy_header.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/karpathy_header.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/karpathy_header.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/karpathy_header.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/karpathy_header.png"
alt="Still from Andrej Karpathy&amp;#39;s interview with Sequoia at AI Ascent discussing Software 3.0, vibe coding, and agentic engineering"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-karpathy_header-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/karpathy_header.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Still from Andrej Karpathy&amp;#39;s interview with Sequoia at AI Ascent discussing Software 3.0, vibe coding, and agentic engineering" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;Andrej Karpathy is one of the few people who has both built modern AI and explained it for the rest of us. He co-founded OpenAI, ran computer vision at Tesla (where he got Autopilot working), and his courses on neural networks are some of the most-watched lectures on the internet. He also has a habit of naming the era we&amp;rsquo;re already in. &amp;ldquo;Vibe coding&amp;rdquo; was his. &amp;ldquo;Software 3.0&amp;rdquo; looks like the next one.&lt;/p&gt;
&lt;p&gt;So when Karpathy says he has &amp;ldquo;never felt more behind as a programmer,&amp;rdquo; it is worth slowing down. That isn&amp;rsquo;t false modesty from a guy with his résumé. Something shifted under the field and most people haven&amp;rsquo;t recalibrated. The Sequoia interview below is his attempt to describe what shifted. The lessons here are pulled from it, ordered roughly by how much they should change what you do tomorrow.&lt;/p&gt;
&lt;h2 id="1-inflection-point-december-2025"&gt;1. Inflection point December 2025&lt;/h2&gt;
&lt;p&gt;Until late last year, agentic coding tools were &amp;ldquo;kind of helpful.&amp;rdquo; Good in stretches, often wrong in ways you had to babysit. Over the December break, the latest models crossed a line:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;I kept asking for more and just came out fine. And then I can&amp;rsquo;t remember the last time I corrected it. And then I just trusted the system more and more. And then I was vibe coding.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He flagged it on the record, loudly:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;A lot of people experienced AI last year as a ChatGPT-adjacent thing. But you really had to look again, and you had to look as of December, because things have changed fundamentally — especially on this agentic, coherent workflow that really started to actually work.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If your mental model of these tools was set by ChatGPT, it is already a generation stale. The agentic workflow is a different product, and it now works.&lt;/p&gt;
&lt;h2 id="2-you-can-outsource-your-thinking-but-not-your-understanding"&gt;2. You can outsource your thinking, but not your understanding&lt;/h2&gt;
&lt;p&gt;The most quotable line of the interview:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;You can outsource your thinking, but you can&amp;rsquo;t outsource your understanding.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As agents do more of the thinking, the bottleneck moves into your head. You still have to know what is worth building and why, and you still have to direct the work.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;I&amp;rsquo;m still part of the system, and I still have to — somehow, information still has to make it into my brain. And I feel like I&amp;rsquo;m becoming a bottleneck of just even knowing what are we trying to build, why is it worth doing, how do I direct my agents.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Your value sits upstream of execution. The bottleneck of the next decade is less about compute than about how fast humans can deepen comprehension to keep directing systems that out-execute them. That is why Karpathy keeps building knowledge bases out of his own reading: he wants the same information re-projected into a form his brain can absorb faster.&lt;/p&gt;
&lt;h2 id="3-verifiability-is-the-map-of-what-automates-next"&gt;3. Verifiability is the map of what automates next&lt;/h2&gt;
&lt;p&gt;Why are these models freakishly good at code and math, and yet stupid about whether you should walk 50 meters to a car wash? Because frontier labs train via reinforcement learning, and RL needs verifiable rewards. Verifiable domains attract environments and signal, so they get the steepest gains. Everything else stays jagged.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;How is it possible that state-of-the-art Opus 4.7 will simultaneously refactor a hundred-thousand-line code base or find zero-day vulnerabilities, and yet tells me to walk to this car wash? This is insane.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The GPT-3.5 to GPT-4 chess jump is the proof point. Capability tracks what the labs choose to feed in.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;We are slightly at the mercy of whatever the labs are doing, whatever they happen to put into the mix&amp;hellip; If you&amp;rsquo;re in the circuits that were part of the RL, you fly. And if you&amp;rsquo;re in the circuits that are out of the data distribution, you&amp;rsquo;re going to struggle.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Two things follow. If you are a founder and you can build a verifiable environment in your domain, even one the labs aren&amp;rsquo;t focused on, you can fine-tune a model that flies. That is real leverage. If you are a worker, the more useful question than &amp;ldquo;is my job safe?&amp;rdquo; is &amp;ldquo;is my job verifiable?&amp;rdquo; Karpathy thinks everything is automatable eventually. Verifiability mainly sets the order.&lt;/p&gt;
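&lt;p&gt;To make the mechanism concrete, here is a minimal sketch of what a &amp;ldquo;verifiable reward&amp;rdquo; looks like for code. This is my illustration, not anything from the interview: run the model&amp;rsquo;s output against tests and grade it pass/fail. Domains where a function like this exists get dense RL signal; domains where it doesn&amp;rsquo;t stay jagged.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import os, subprocess, sys, tempfile

def verifiable_reward(candidate_code, test_code, timeout=10):
    """Binary reward for an RL loop: 1.0 if the candidate passes its tests."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "check.py")
        with open(path, "w") as f:
            f.write(candidate_code + "\n" + test_code)
        try:
            proc = subprocess.run([sys.executable, path],
                                  capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if proc.returncode == 0 else 0.0

# The environment is just code plus asserts: cheap to build, trivial to
# grade. That is why coding and math improve fastest under RL.
solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(verifiable_reward(solution, tests))  # 1.0
&lt;/code&gt;&lt;/pre&gt;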
&lt;h2 id="4-software-30-prompting-is-the-new-programming"&gt;4. Software 3.0: prompting is the new programming&lt;/h2&gt;
&lt;p&gt;The frame that makes the rest of this make sense. Karpathy&amp;rsquo;s three eras:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Software 1.0:&lt;/strong&gt; humans write explicit code.&lt;br&gt;
&lt;strong&gt;Software 2.0:&lt;/strong&gt; humans curate datasets and train neural networks; the weights are the program.&lt;br&gt;
&lt;strong&gt;Software 3.0:&lt;/strong&gt; humans write prompts; the LLM is the interpreter, and the context window is the program.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;Your programming now turns to prompting. And what&amp;rsquo;s in the context window is over the interpreter, that is the LLM, that is kind of like interpreting your context and performing computation in the digital information space.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;His sharpest example: installing OpenCode is no longer a shell script. It is a block of text you copy-paste to your agent, which reads your environment and figures the rest out.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;It&amp;rsquo;s just like, what is the piece of text to copy-paste to your agent? That&amp;rsquo;s the programming paradigm now.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The unit of programming used to be a function. Now it is closer to a paragraph.&lt;/p&gt;
&lt;h2 id="5-vibe-coding-raises-the-floor-agentic-engineering-raises-the-ceiling"&gt;5. Vibe coding raises the floor; agentic engineering raises the ceiling&lt;/h2&gt;
&lt;p&gt;If you build software for a living, this is the lesson with the most direct implications:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;Vibe coding is about raising the floor for everyone in terms of what they can do in software&amp;hellip; But agentic engineering is about preserving the quality bar of what existed before in professional software. You&amp;rsquo;re still responsible for your software just as before, but can you go faster?&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Karpathy thinks the ceiling is very high:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;People used to talk about the 10x engineer previously. I think that this is magnified a lot more — 10x is not the speed up you gain. People who are very good at this peak a lot more than 10x.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The gap between mediocre and excellent users of these tools is widening. Worth taking seriously when you decide what to learn next.&lt;/p&gt;
&lt;h2 id="6-the-new-human-skill-is-taste-spec-and-oversight"&gt;6. The new human skill is taste, spec, and oversight&lt;/h2&gt;
&lt;p&gt;What humans should still do, in his telling, is design and judgment work. Holding the spec in your head. Setting the architecture. Making sure the agent is being asked for the right thing in the first place.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;You&amp;rsquo;re in charge of the taste, the engineering, the design, and that it makes sense, and that you&amp;rsquo;re asking for the right things&amp;hellip; You&amp;rsquo;re doing some of the design and development, and the engineers are doing the fill in the blanks.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The MenuGen bug is the kind of mistake only a human holding the spec catches. The agent silently tried to associate Stripe and Google accounts by matching email addresses, with no persistent user ID. It worked until a user&amp;rsquo;s two email addresses diverged.&lt;/p&gt;
&lt;p&gt;He is not sure this division will hold forever:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;When you actually look at the code, sometimes I get a little bit of a heart attack, because it&amp;rsquo;s not super amazing code&amp;hellip; It&amp;rsquo;s very bloaty, and there&amp;rsquo;s a lot of copy-paste, and there&amp;rsquo;s awkward abstractions that are brittle and — like, it works, but it&amp;rsquo;s just really gross.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Nothing fundamental stops the labs from training for taste. They just haven&amp;rsquo;t yet. Until they do, the taste layer is still your responsibility.&lt;/p&gt;
&lt;h2 id="7-some-apps-shouldnt-exist-anymore"&gt;7. Some apps shouldn&amp;rsquo;t exist anymore&lt;/h2&gt;
&lt;p&gt;The MenuGen anecdote, again. Karpathy built an app: photograph a restaurant menu, OCR it, generate images of each dish, render a new menu. Vercel deployment, the full stack.&lt;/p&gt;
&lt;p&gt;Then he saw the Software 3.0 version. Hand the photo to Gemini, say &amp;ldquo;use NanoBanana to overlay the dishes onto the menu,&amp;rdquo; and a single model call returns the same menu with images rendered into the pixels.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;All of my MenuGen is spurious. It&amp;rsquo;s working in the old paradigm. That app shouldn&amp;rsquo;t exist.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A lot of what we are building today is scaffolding around a capability the model could perform end-to-end. Before writing the next CRUD app, ask whether the model is the app.&lt;/p&gt;
&lt;h2 id="8-new-possibilities-matter-more-than-the-speed-ups"&gt;8. New possibilities matter more than the speed-ups&lt;/h2&gt;
&lt;p&gt;The flip side of &amp;ldquo;some apps shouldn&amp;rsquo;t exist&amp;rdquo; is that some products could not have existed before. Karpathy&amp;rsquo;s knowledge-base project is the example. Take a pile of documents, ask the LLM to recompile them into a wiki, surface the connections you would never have stitched together by hand.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;This is not even a program. This is not something that could exist before, because there was no code that would create a knowledge base based on a bunch of facts. But now you can just take these documents and basically recompile them in a different way&amp;hellip; I almost think that that&amp;rsquo;s more exciting.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you only ask what gets faster, you will miss the more interesting question, which is what becomes possible at all.&lt;/p&gt;
&lt;h2 id="9-jagged-intelligence-ghosts-not-animals"&gt;9. Jagged intelligence: ghosts, not animals&lt;/h2&gt;
&lt;p&gt;Karpathy&amp;rsquo;s metaphor: we are not building animals. Animal intelligence comes with intrinsic motivation, embodiment, drives shaped by evolution. What we have instead is more like a ghost. A statistical simulator shaped by pre-training, with RL bolted on.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;These things are not animal intelligences. Like, if you yell at them, they&amp;rsquo;re not going to work better. Or worse. Or it doesn&amp;rsquo;t have any impact. And it&amp;rsquo;s all just kind of these statistical simulation circuits where the substrate is pre-training. So, statistics. And then there&amp;rsquo;s RL bolting on top.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The practical takeaway is to stop reasoning about LLMs by analogy to humans. Be suspicious of where the model seems confident, probe the edges, and figure out which circuits your task is actually landing in.&lt;/p&gt;
&lt;h2 id="10-build-agent-native-infrastructure"&gt;10. Build agent-native infrastructure&lt;/h2&gt;
&lt;p&gt;For infra builders, Karpathy&amp;rsquo;s pet peeve is also the opportunity:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;Why are people still telling me what to do? Like, I don&amp;rsquo;t want to do anything. What is the thing I should copy-paste to my agent?&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Rebuild the developer stack so the primary consumer of docs, configs, APIs, and deployment flows is an agent rather than a human. Data structures should be legible to LLMs by default, and sensors and actuators over the world should sit behind agent-callable interfaces.&lt;/p&gt;
&lt;p&gt;His test: can you say &amp;ldquo;build and deploy MenuGen&amp;rdquo; and never touch a settings panel? When the answer is yes, the infrastructure has caught up.&lt;/p&gt;
&lt;h2 id="11-hire-for-big-projects-not-puzzles"&gt;11. Hire for big projects, not puzzles&lt;/h2&gt;
&lt;p&gt;A direct shot at hiring managers. Most companies have not refactored their interview loops for the agentic era.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;Hiring has to look like, give me a really big project and see someone implement that big project. Like, let&amp;rsquo;s write, say, a Twitter clone for agents, and then make it really good, make it really secure, and then have some agents simulate some activity on this Twitter. And then I&amp;rsquo;m going to use 10 Codex 5.4-X-high to try to break your website.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Whiteboard puzzles measure the wrong thing. If your interview loop has not changed since 2022, you are selecting for the previous era.&lt;/p&gt;
&lt;h2 id="12-imagine-the-weird-endpoint"&gt;12. Imagine the weird endpoint&lt;/h2&gt;
&lt;p&gt;The closing speculation is genuinely strange:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;In the early days of computing, people were a little bit confused as to whether computers would look like calculators or computers would look like neural nets. And in the &amp;rsquo;50s and &amp;rsquo;60s, it was not really obvious which way it would go&amp;hellip; You could imagine that a lot of this will flip and that the neural net becomes kind of the host process, and the CPUs become kind of the coprocessor.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;UIs diffusion-rendered moment by moment from raw video and audio. No apps in between.&lt;/p&gt;
&lt;p&gt;You do not have to buy this exact picture. The point is simply that the linear extrapolation, the same software but smarter, is almost certainly the wrong frame for where this ends up.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Based on Andrej Karpathy&amp;rsquo;s &lt;a href="https://www.youtube.com/watch?v=96jN2OCOfLs"&gt;interview with Sequoia&lt;/a&gt; at AI Ascent&lt;/em&gt;&lt;/p&gt;</description></item><item><title>On-Device AI Models Will Be The New Reason to Upgrade Your Phone</title><link>https://philippdubach.com/posts/on-device-ai-models-will-be-the-new-reason-to-upgrade-your-phone/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/on-device-ai-models-will-be-the-new-reason-to-upgrade-your-phone/</guid><description>&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-chip-cover-jpg-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/chip-cover.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/chip-cover.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/chip-cover.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/chip-cover.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/chip-cover.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/chip-cover.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/chip-cover.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/chip-cover.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/chip-cover.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/chip-cover.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/chip-cover.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/chip-cover.jpg"
alt="Editorial cover illustration for an analysis of on-device AI models as the new smartphone upgrade driver"
class=""
width="1200"
fetchpriority="high"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-chip-cover-jpg-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/chip-cover.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Editorial cover illustration for an analysis of on-device AI models as the new smartphone upgrade driver" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The iPhone 17 runs a &lt;a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models"&gt;3 billion parameter language model on-device&lt;/a&gt; at 30 tokens per second. Obviously, the average consumer has no idea what that sentence means, and Apple hasn&amp;rsquo;t figured out how to make them care.&lt;/p&gt;
&lt;p&gt;I believe that&amp;rsquo;s about to change. Apple now has &lt;a href="https://9to5mac.com/2026/03/25/new-details-on-apple-google-ai-deal-revealed-including-gemini-changes-report/"&gt;complete access to Google&amp;rsquo;s Gemini model&lt;/a&gt; in its own data centers, with &lt;a href="https://www.theinformation.com/newsletters/ai-agenda/apple-can-distill-googles-big-gemini-model"&gt;the ability to distill it into smaller models&lt;/a&gt; built for iPhones and iPads. Knowledge distillation works like this: you take a large model, have it perform tasks with detailed reasoning, then feed those reasoning traces to a smaller model until the student learns to mimic the teacher. The smaller model ends up far more capable than if you&amp;rsquo;d trained it from scratch on the same data. Apple can now do this with the full Gemini, not just their own in-house models, and the distilled output runs locally. No internet required.&lt;/p&gt;
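&lt;p&gt;The deal reportedly covers trace-based distillation, but the classic logit-matching formulation is compact enough to sketch. What follows is the textbook Hinton-style loss, a generic illustration rather than Apple&amp;rsquo;s actual pipeline: the student is trained to match the teacher&amp;rsquo;s softened output distribution alongside the ordinary next-token objective.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of (a) KL against the teacher's softened distribution and
    (b) ordinary cross-entropy on the ground-truth tokens."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy shapes: a batch of 4 positions over a 10-token vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
&lt;/code&gt;&lt;/pre&gt;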
&lt;p&gt;Smartphones haven&amp;rsquo;t had a real upgrade story in years. The camera is great. The screen is great. The processor was fast enough three generations ago. &lt;a href="https://www.sellcell.com/blog/how-often-do-people-upgrade-their-phone/"&gt;Battery life has overtaken price as the top purchase driver&lt;/a&gt; for the first time. The global &lt;a href="https://sqmagazine.co.uk/smartphone-statistics/"&gt;replacement cycle has stretched to 3.5 years&lt;/a&gt;. People hold onto their phones because nothing about the new one feels different enough. &lt;a href="https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2025/gen-ai-on-smartphones.html"&gt;Deloitte&amp;rsquo;s 2025 TMT Predictions report&lt;/a&gt; frames on-device generative AI as the feature that could break this cycle, if the experience delivers on the promise. On-device AI might become the next reason.&lt;/p&gt;
&lt;h2 id="the-spec"&gt;The spec&lt;/h2&gt;
&lt;p&gt;In the late 1990s it was megahertz: Intel and AMD raced clock speeds past the point where consumers could distinguish real-world performance differences, but the number on the box still drove purchases. Then it was megapixels. Samsung shipped a &lt;a href="https://semiconductor.samsung.com/news-events/tech-blog/isocell-hp3-200mp-image-sensor-for-epic-details/"&gt;200 MP camera sensor&lt;/a&gt; knowing that most phones use 16-to-1 pixel binning to output a &lt;strong&gt;12.5 MP&lt;/strong&gt; image by default.&lt;/p&gt;
&lt;p&gt;Parameters could be next. The &lt;a href="https://www.apple.com/iphone-17/specs/"&gt;iPhone 17&amp;rsquo;s standard A19 chip&lt;/a&gt; has 8GB of RAM. The &lt;a href="https://www.apple.com/iphone-17-pro/specs/"&gt;Pro gets 12GB&lt;/a&gt; with faster memory bandwidth, which determines how large a model the phone can run and how quickly. Samsung&amp;rsquo;s 2026 flagships with the &lt;a href="https://semiconductor.samsung.com/processor/mobile-processor/exynos-2600/"&gt;Exynos 2600 hit &lt;strong&gt;80 TOPS&lt;/strong&gt;&lt;/a&gt; on a 2nm process, more than double the prior generation. These are already the numbers in press releases. It&amp;rsquo;s not hard to imagine an Apple keynote where someone says, with rehearsed enthusiasm, that the iPhone 18 Pro runs a 7 billion parameter model while the standard model is limited to 3 billion.&lt;/p&gt;
&lt;p&gt;The difference from previous spec wars is that this one might actually correlate with user experience. Megahertz past a certain threshold didn&amp;rsquo;t make Word open faster. Megapixels past 12 MP didn&amp;rsquo;t make photos look better on a phone screen. But a 7 billion parameter model running locally outperforms a 3 billion one on nearly every task. It handles longer documents, follows more complex instructions, holds better conversational context.&lt;/p&gt;
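&lt;p&gt;The tiers map to memory in a simple way. A back-of-envelope sketch, assuming 4-bit quantized weights and ignoring the KV cache and activations (real deployments vary):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def model_memory_gb(params_billion, bits_per_weight=4):
    # Weights only: parameter count times bits per weight, in bytes.
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for p in (3, 7):
    print(f"{p}B model at 4-bit: ~{model_memory_gb(p):.1f} GB of weights")
# 3B at ~1.5 GB fits in an 8 GB phone alongside the OS;
# 7B at ~3.5 GB realistically needs the 12 GB tier.
&lt;/code&gt;&lt;/pre&gt;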
&lt;h2 id="breaking-the-stalemate"&gt;Breaking the stalemate&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-09-09-gartner-says-worldwide-generative-artificial-intelligence-smartphone-end-user-spending-to-total-us-dollars-298-billion-by-the-end-of-2025"&gt;Gartner projects&lt;/a&gt; GenAI smartphone spending will reach &lt;strong&gt;$393 billion&lt;/strong&gt; in 2026, up 32% from &lt;strong&gt;$298 billion&lt;/strong&gt; in 2025. &lt;a href="https://my.idc.com/getdoc.jsp?containerId=prUS52478124"&gt;IDC reports&lt;/a&gt; GenAI smartphone shipments growing &lt;strong&gt;73%&lt;/strong&gt; year over year. &lt;a href="https://finance.yahoo.com/news/exclusive-samsung-double-mobile-devices-030312758.html"&gt;Samsung has publicly committed&lt;/a&gt; to 800 million AI-enabled devices by end of 2026, doubling its 2025 footprint. &lt;a href="https://www.cnbc.com/2024/12/13/apple-is-a-top-pick-for-2025-as-ai-will-drive-iphone-upgrade-cycle-morgan-stanley-says.html"&gt;Morgan Stanley&amp;rsquo;s latest survey&lt;/a&gt; found iPhone upgrade intentions at &lt;strong&gt;37%&lt;/strong&gt;, an all-time high, with FY26 shipment forecasts of 260 million units sitting 3% above Street consensus.&lt;/p&gt;
&lt;p&gt;On-device AI creates hard hardware requirements in a way that camera improvements and screen upgrades never did. You cannot run a 3 billion parameter model on an iPhone 14. The Neural Engine isn&amp;rsquo;t powerful enough and the memory bandwidth isn&amp;rsquo;t there. &lt;a href="https://support.apple.com/en-us/121115"&gt;Apple Intelligence requires an A17 Pro or later&lt;/a&gt;, which means the feature itself creates an upgrade floor. Every year that floor rises. When Apple ships distilled Gemini models that need the A19 Pro&amp;rsquo;s 12GB of RAM, every phone older than 2025 is locked out.&lt;/p&gt;
&lt;p&gt;The Gemini deal matters for the hardware cycle because of the distillation pipeline. Apple doesn&amp;rsquo;t need to build frontier-scale models from scratch. They can take Gemini&amp;rsquo;s best capabilities, run them through distillation, and compress the results into models sized for their hardware tiers. A 3 billion parameter model for the standard iPhone. A 5 billion version for the Pro. Maybe a 10 billion model for a future iPad Pro with enough memory and thermal headroom.&lt;/p&gt;
&lt;p&gt;Google is playing a similar game from the other side. The original &lt;a href="https://en.wikipedia.org/wiki/Gemini_(language_model)"&gt;Gemini Nano shipped at 1.8 billion parameters&lt;/a&gt;; the updated Nano-2 rose to 3.25 billion. Samsung&amp;rsquo;s &lt;a href="https://news.samsung.com/global/samsung-unveils-galaxy-s26-series-the-most-intuitive-galaxy-ai-phone-yet"&gt;Galaxy S26 ships with on-device Gemini&lt;/a&gt; running on NPUs that are 39% faster than the prior generation. On-device models get larger every hardware generation. Each generation&amp;rsquo;s models don&amp;rsquo;t run well on older hardware. You see where this goes.&lt;/p&gt;
&lt;p&gt;I find it plausible that within two product cycles, on-device model capability becomes the primary differentiator between phone tiers and between generations. The data isn&amp;rsquo;t there yet: &lt;a href="https://www.twice.com/research/the-smartphone-upgrade-cycle-slows"&gt;only 17% of Americans&lt;/a&gt; say AI is a major purchase influence today, Apple Intelligence &lt;a href="https://finance.yahoo.com/markets/stocks/articles/morgan-stanley-stark-message-investors-164700952.html"&gt;ranked seventh globally&lt;/a&gt; as a reason to upgrade in Morgan Stanley&amp;rsquo;s survey, and &lt;a href="https://www.phonearena.com/news/is-the-ai-boom-destroying-your-next-flagship-phones-value_id176913"&gt;over 40% of users&lt;/a&gt; have privacy concerns about smartphone AI, with half unwilling to pay extra for it. But you can&amp;rsquo;t tell the difference between a 48 MP photo and a 12 MP photo on your phone screen. You can absolutely tell the difference between an AI assistant that understands your question and one that doesn&amp;rsquo;t. The feedback loop is immediate and personal. If the bigger model actually works better, and if the distillation pipeline from Gemini delivers real capability gains, the upgrade incentive is self-reinforcing. People will upgrade not because the spec sheet says they should, but because they tried their friend&amp;rsquo;s phone and the AI was better.&lt;/p&gt;
&lt;p&gt;Whether this arrives with iOS 27 this fall or takes another generation to mature, I don&amp;rsquo;t know. But the next reason to buy a new phone is much more likely to be the model than the camera.&lt;/p&gt;</description></item><item><title>The Last Architecture Designed by Hand</title><link>https://philippdubach.com/posts/the-last-architecture-designed-by-hand/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/the-last-architecture-designed-by-hand/</guid><description>&lt;blockquote&gt;
&lt;p&gt;I bet there is another new architecture to find that is gonna be as big of a gain as transformers were over LSTMs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sam Altman, the CEO of the company most invested in the transformer, is telling a room of students it isn&amp;rsquo;t the final form. So what comes after the transformer? He&amp;rsquo;s probably right that something will, and the evidence is no longer anecdotal. Several recent papers have proved that the transformer&amp;rsquo;s worst properties are structural: not engineering problems to be fixed with better data or more compute, but mathematical lower bounds.&lt;/p&gt;
&lt;p&gt;The transformer, born from the 2017 paper &lt;a href="https://arxiv.org/abs/1706.03762"&gt;&amp;ldquo;Attention Is All You Need,&amp;rdquo;&lt;/a&gt; took us from barely-coherent GPT-2 to GPT-4 in five years. An extraordinary run. But &lt;a href="https://arxiv.org/abs/2209.04881"&gt;Duman Keles et al.&lt;/a&gt; proved that O(n²) attention complexity isn&amp;rsquo;t an implementation detail. It&amp;rsquo;s a necessary lower bound unless a foundational conjecture in complexity theory turns out to be wrong. Double the context, quadruple the cost. The KV cache for a 70B model at one-million-token context eats roughly &lt;strong&gt;320 GB&lt;/strong&gt; of GPU memory. Most hardware can&amp;rsquo;t hold it.&lt;/p&gt;
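&lt;p&gt;Where does a number like 320 GB come from? A back-of-envelope sketch, assuming a Llama-2-70B-style layout (80 layers, 8 grouped-query KV heads of dimension 128, fp16 values); exact figures vary by model:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    # Two tensors (K and V) per layer, each seq_len x n_kv_heads x head_dim.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val / 1e9

# 80 layers, 8 KV heads (GQA), head_dim 128, fp16, 1M-token context:
print(kv_cache_gb(80, 8, 128, 1_000_000))  # ~327.7 GB
&lt;/code&gt;&lt;/pre&gt;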
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-last-architecture-quadratic-attention-1-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/last-architecture-quadratic-attention-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/last-architecture-quadratic-attention-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/last-architecture-quadratic-attention-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/last-architecture-quadratic-attention-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-quadratic-attention-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/last-architecture-quadratic-attention-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/last-architecture-quadratic-attention-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/last-architecture-quadratic-attention-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-quadratic-attention-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/last-architecture-quadratic-attention-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/last-architecture-quadratic-attention-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-quadratic-attention-1.png"
alt="Quadratic attention scaling: a 4x4 attention matrix requires 16 computations while an 8x8 matrix requires 64, showing how doubling context quadruples cost in transformer architectures"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-last-architecture-quadratic-attention-1-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/last-architecture-quadratic-attention-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Quadratic attention scaling: a 4x4 attention matrix requires 16 computations while an 8x8 matrix requires 64, showing how doubling context quadruples cost in transformer architectures" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The problems run deeper than compute costs. &lt;a href="https://arxiv.org/abs/2311.14648"&gt;Kalai and Vempala&lt;/a&gt; proved that any calibrated language model &lt;em&gt;must&lt;/em&gt; hallucinate at a certain rate. A &lt;a href="https://arxiv.org/abs/2509.04664"&gt;2025 follow-up&lt;/a&gt; goes further: no computable LLM can be universally correct on unbounded queries. Not fixable with better training data. Not fixable with RLHF. A statistical property of how these models generate text.&lt;/p&gt;
&lt;p&gt;On reasoning: &lt;a href="https://arxiv.org/abs/2305.18654"&gt;Dziri et al.&lt;/a&gt; showed transformers collapse multi-step reasoning into pattern matching. Performance drops exponentially as task complexity rises. GPT-4 gets &lt;strong&gt;59%&lt;/strong&gt; on 3-digit multiplication. &lt;a href="https://arxiv.org/abs/2603.10123"&gt;Chowdhury&lt;/a&gt; proved that the &amp;ldquo;lost in the middle&amp;rdquo; problem (models performing 20-30% worse on information buried mid-context) is a geometric property of the architecture itself, present at initialization, before any training occurs.&lt;/p&gt;
&lt;p&gt;These are theorems. The architecture that runs every frontier AI system has a ceiling, and the ceiling is proved.&lt;/p&gt;
&lt;h2 id="the-post-transformer-stack-is-already-in-production"&gt;The post-transformer stack is already in production&lt;/h2&gt;
&lt;p&gt;A &lt;a href="https://arxiv.org/abs/2510.05364"&gt;survey by Fichtl et al.&lt;/a&gt; checked the top 10 models on every major benchmark. Zero were non-transformer. The transformer is still winning on the leaderboards. But the field is moving toward hybrid architectures. Over &lt;strong&gt;60%&lt;/strong&gt; of frontier models released in 2025 already use Mixture of Experts. &lt;a href="https://arxiv.org/abs/2412.19437"&gt;DeepSeek-V3&lt;/a&gt; has 671B total parameters but activates only 37B per token. It trained for &lt;strong&gt;2.788 million H800 GPU hours&lt;/strong&gt;, a fraction of what a comparable dense model would require, and matched frontier closed-source performance. By late 2025, &lt;a href="https://c3.unu.edu/blog/inside-deepseeks-end-of-year-ai-breakthrough-what-the-new-models-deliver"&gt;DeepSeek-V3.2 reportedly hit GPT-5-level performance at 90% lower training cost&lt;/a&gt;. MoE doesn&amp;rsquo;t replace the transformer. It changes the economics so radically that it&amp;rsquo;s arguably the single biggest practical advance since the original architecture.&lt;/p&gt;
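&lt;p&gt;The routing trick is small enough to show. A minimal top-k router sketch (illustrative only; production MoE adds load-balancing losses, capacity limits, and fused kernels):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    """Send each token to its top-k experts; only those experts run."""
    probs = F.softmax(router(x), dim=-1)           # (tokens, n_experts)
    weights, idx = torch.topk(probs, k, dim=-1)    # (tokens, k)
    weights = weights / weights.sum(-1, keepdim=True)
    out = torch.zeros_like(x)
    for i, expert in enumerate(experts):
        tok, slot = (idx == i).nonzero(as_tuple=True)
        if tok.numel():
            out[tok] += weights[tok, slot, None] * expert(x[tok])
    return out

# Toy setup: 8 experts, 2 active per token.
d = 16
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(8)])
router = nn.Linear(d, 8)
print(moe_forward(torch.randn(5, d), router, experts).shape)  # (5, 16)
&lt;/code&gt;&lt;/pre&gt;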
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-last-architecture-moe-routing-1-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/last-architecture-moe-routing-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/last-architecture-moe-routing-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/last-architecture-moe-routing-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/last-architecture-moe-routing-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-moe-routing-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/last-architecture-moe-routing-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/last-architecture-moe-routing-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/last-architecture-moe-routing-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-moe-routing-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/last-architecture-moe-routing-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/last-architecture-moe-routing-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-moe-routing-1.png"
alt="Mixture of Experts routing: an input token passes through a router that activates only 2 of 8 expert blocks, meaning DeepSeek-V3 uses just 37B of its 671B total parameters per token"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-last-architecture-moe-routing-1-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/last-architecture-moe-routing-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Mixture of Experts routing: an input token passes through a router that activates only 2 of 8 expert blocks, meaning DeepSeek-V3 uses just 37B of its 671B total parameters per token" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The more interesting part is what happens when you blend attention with state space models. &lt;a href="https://goombalab.github.io/blog/2024/mamba2-part1-model/"&gt;Gu and Dao (2024)&lt;/a&gt; proved SSMs and attention are mathematically dual: two views of the same computation. That theoretical result is showing up in production. &lt;a href="https://www.ai21.com/jamba/"&gt;AI21&amp;rsquo;s Jamba&lt;/a&gt; runs a 1:7 attention-to-Mamba ratio and gets &lt;strong&gt;256K&lt;/strong&gt; context at &lt;strong&gt;3x&lt;/strong&gt; throughput over Mixtral. Alibaba&amp;rsquo;s Qwen3-Next shipped the first top-tier model with a hybrid backbone: &lt;a href="https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/08_deltanet/README.md"&gt;Gated DeltaNet&lt;/a&gt; for linear attention at a 3:1 ratio with full attention. Microsoft&amp;rsquo;s Phi-4-mini-flash-reasoning is 75% Mamba layers with &lt;strong&gt;10x&lt;/strong&gt; throughput at &lt;strong&gt;2-3x&lt;/strong&gt; lower latency.&lt;/p&gt;
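&lt;p&gt;What a 1:7 ratio means in practice: out of every eight layers, one is full attention and seven are linear-cost Mamba blocks. The attention layers preserve precise recall over the whole context; the SSM layers keep per-token cost flat in sequence length. A toy schedule (illustrative, not AI21&amp;rsquo;s actual config):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def hybrid_schedule(n_layers=32, period=8):
    # One full-attention layer per block of `period` layers, Mamba elsewhere:
    # a Jamba-style 1:7 attention-to-Mamba ratio when period=8.
    return ["attention" if i % period == period - 1 else "mamba"
            for i in range(n_layers)]

print(hybrid_schedule(16))
&lt;/code&gt;&lt;/pre&gt;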
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-last-architecture-hybrid-layer-stack-1-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/last-architecture-hybrid-layer-stack-1.png"
alt="Hybrid layer stack comparison: a traditional transformer uses 8 attention layers while Jamba uses a 1:7 attention-to-Mamba ratio, achieving 256K context at 3x throughput with the same quality"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-last-architecture-hybrid-layer-stack-1-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/last-architecture-hybrid-layer-stack-1.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hybrid layer stack comparison: a traditional transformer uses 8 attention layers while Jamba uses a 1:7 attention-to-Mamba ratio, achieving 256K context at 3x throughput with the same quality" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;Diffusion language models are the wild card. &lt;a href="https://arxiv.org/abs/2502.09992"&gt;LLaDA&lt;/a&gt;, the first 8B-parameter diffusion LLM, treats text generation as denoising rather than sequential token prediction. It matches Llama3-8B and does something no autoregressive model can: it solves the &amp;ldquo;reversal curse,&amp;rdquo; outperforming GPT-4o on reversal tasks. &lt;a href="https://medium.com/@ML-today/diffusion-models-for-language-from-early-promise-to-a-bold-new-frontier-with-llada-and-the-rise-of-ee80c7ffb8fa"&gt;Gemini Diffusion&lt;/a&gt; hit &lt;strong&gt;1,479 tokens per second&lt;/strong&gt;. Over 50 papers on diffusion LLMs appeared in 2025. If parallel generation works reliably at scale, inference economics change completely.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2510.05364"&gt;Alman and Yu&lt;/a&gt; proved there are tasks where every subquadratic alternative has a fundamental theoretical gap. That&amp;rsquo;s the strongest mathematical argument for why hybrids, not clean replacements, are what comes next.&lt;/p&gt;
&lt;h2 id="the-search-is-no-longer-human-speed"&gt;The search is no longer human-speed&lt;/h2&gt;
&lt;p&gt;The part of this I find most interesting is the recursion. AI systems are now running the search for their own architectural successors.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/"&gt;AlphaEvolve&lt;/a&gt; an evolutionary coding agent built on Gemini 2.0 found a way to multiply 4x4 complex matrices in 48 scalar multiplications: the first improvement on Strassen&amp;rsquo;s 56-year-old bound. Across &lt;a href="https://www.infoq.com/news/2025/05/google-alpha-evolve/"&gt;50+ open math problems&lt;/a&gt;, it matched the best known solutions 75% of the time and beat them 20% of the time. The recursive part: AlphaEvolve found a &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/alphaevolve-on-google-cloud"&gt;23% speedup on a kernel inside Gemini&amp;rsquo;s own architecture&lt;/a&gt;, cutting Gemini&amp;rsquo;s training time by 1% and recovering &lt;strong&gt;0.7%&lt;/strong&gt; of Google&amp;rsquo;s total compute. Gemini making Gemini faster.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.marktechpost.com/2026/03/08/andrej-karpathy-open-sources-autoresearch-a-630-line-python-tool-letting-ai-agents-run-autonomous-ml-experiments-on-single-gpus/"&gt;Karpathy&amp;rsquo;s AutoResearch&lt;/a&gt;, released March 7, 2026, is a 630-line Python script that lets an AI agent modify training code, run 5-minute experiments, check results, and iterate. He pointed it at his own highly-tuned &amp;ldquo;Time to GPT-2&amp;rdquo; codebase. The agent found about 20 additive improvements that transferred to larger models, cutting the metric by &lt;strong&gt;11%&lt;/strong&gt;. &lt;a href="https://officechai.com/ai/andrej-karpathys-autoresearch-project-lets-agents-run-100-ai-research-experiments-while-you-sleep/"&gt;Shopify CEO Tobi Lutke tried it overnight&lt;/a&gt;: 37 experiments, 19% validation improvement, a 0.8B model outperforming a 1.6B one. &lt;a href="https://github.com/SakanaAI/AI-Scientist-v2"&gt;Sakana AI&amp;rsquo;s AI Scientist v2&lt;/a&gt; went further and produced the first AI-authored paper accepted through standard peer review. &lt;a href="https://controlai.news/p/the-ultimate-risk-recursive-self"&gt;OpenAI said publicly in late 2025&lt;/a&gt; that it&amp;rsquo;s researching how to safely build AI systems capable of recursive self-improvement. Two years ago this was a thought experiment.&lt;/p&gt;
&lt;h2 id="what-the-hardware-decides"&gt;What the hardware decides&lt;/h2&gt;
&lt;p&gt;The transformer won not because attention was theoretically prettier than recurrence. It won because it parallelized well on GPUs. Whatever comes next has to clear the same bar.&lt;/p&gt;
&lt;p&gt;Pre-training scaling for dense transformers is flattening. &lt;a href="https://fortune.com/2025/02/25/what-happened-gpt-5-openai-orion-pivot-scaling-pre-training-llm-agi-reasoning/"&gt;OpenAI spent at least $500 million per major training run on Orion&lt;/a&gt;. The model hit GPT-4 performance after 20% of training; the remaining 80% gave diminishing returns. They downgraded it from GPT-5 to GPT-4.5. &lt;a href="https://artificialintelligencemonaco.substack.com/p/ilya-sutskever-on-superintelligence"&gt;Sutskever&lt;/a&gt; at NeurIPS 2024: &amp;ldquo;Pre-training as we know it will end. The data is not growing because we have but one internet.&amp;rdquo; His startup SSI has &lt;a href="https://www.arturmarkus.com/ilya-sutskevers-ssi-raises-1b-at-30b-valuation-with-zero-revenue-6x-jump-in-5-months-redefines-ai-investment-logic/"&gt;raised at a $32 billion valuation with about 20 employees and zero revenue&lt;/a&gt;. A bet that the next leap requires something architecturally new.&lt;/p&gt;
&lt;p&gt;But test-time compute opened a different axis entirely. OpenAI&amp;rsquo;s o3 hit &lt;strong&gt;87.5%&lt;/strong&gt; on ARC-AGI, beating most humans. DeepSeek-R1 matched o1-level reasoning at &lt;strong&gt;70%&lt;/strong&gt; lower cost. &lt;a href="https://aibusiness.com/language-models/ai-model-scaling-isn-t-over-it-s-entering-a-new-era"&gt;OpenAI&amp;rsquo;s inference spending reached $2.3 billion in 2024&lt;/a&gt;: &lt;strong&gt;15x&lt;/strong&gt; what they spent training GPT-4.5. &lt;a href="https://www.dwarkesh.com/p/dario-amodei"&gt;Dario Amodei&lt;/a&gt; at Morgan Stanley in March 2026: &amp;ldquo;We do not see hitting the wall. We don&amp;rsquo;t see a wall.&amp;rdquo; He&amp;rsquo;s talking about this axis, inference-time compute and RL from verifiable rewards, not about pre-training bigger dense models. The Densing Law now shows capability per parameter doubling every &lt;strong&gt;3.5 months&lt;/strong&gt; through better data, MoE, and distillation. Last year&amp;rsquo;s frontier, matched with a fraction of the parameters.&lt;/p&gt;
&lt;p&gt;Inference demand is projected to &lt;a href="https://v-chandra.github.io/on-device-llms/"&gt;exceed training demand by 118x&lt;/a&gt;. Global data center power is heading toward &lt;a href="https://www.iea.org/reports/energy-and-ai/executive-summary"&gt;945 TWh by 2030&lt;/a&gt;, roughly Japan&amp;rsquo;s total electricity consumption. An architecture that scores 2x better on benchmarks but runs 3x worse at inference won&amp;rsquo;t win. What ships is whatever fits the hardware. The transformer isn&amp;rsquo;t going away. It&amp;rsquo;s becoming one component in a larger stack: attention for recall, SSMs for cheap sequence processing, MoE for capacity, maybe diffusion for parallel output. &lt;a href="https://www.ai21.com/jamba/"&gt;Jamba&lt;/a&gt;, &lt;a href="https://arxiv.org/html/2411.13676v1"&gt;Hymba&lt;/a&gt;, and Qwen3-Next already ship this way. That&amp;rsquo;s not a prediction. It&amp;rsquo;s what&amp;rsquo;s in production.&lt;/p&gt;
&lt;p&gt;How fast the stack evolves is the open question. The answer, given AlphaEvolve and AutoResearch and AI Scientist v2, is faster than any previous architectural transition. I don&amp;rsquo;t know whether the transformer remains the dominant layer for two years or five. But I&amp;rsquo;m fairly confident that whatever comes next, humans won&amp;rsquo;t have designed it alone.&lt;/p&gt;</description></item><item><title>MCP vs A2A in 2026: How the AI Protocol War Ends</title><link>https://philippdubach.com/posts/mcp-vs-a2a-in-2026-how-the-ai-protocol-war-ends/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/mcp-vs-a2a-in-2026-how-the-ai-protocol-war-ends/</guid><description>&lt;p&gt;On March 26, 2025, Sam Altman posted the following &lt;a href="https://x.com/sama/status/1904957253456941061"&gt;three sentences&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;people love MCP and we are excited to add support across our products.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MCP is Anthropic&amp;rsquo;s Model Context Protocol. OpenAI is Anthropic&amp;rsquo;s most direct competitor. Altman was endorsing a rival&amp;rsquo;s standard. That post may be the most significant event in enterprise AI infrastructure this year. When your main competitor adopts your protocol, the war is close to over. I&amp;rsquo;ve been watching this play out since &lt;a href="https://www.anthropic.com/news/model-context-protocol"&gt;Anthropic launched MCP in November 2024&lt;/a&gt;, and I want to work through what&amp;rsquo;s happening: who controls what, what &amp;ldquo;interoperability&amp;rdquo; means in practice, and whether any of this follows patterns we&amp;rsquo;ve seen before.&lt;/p&gt;
&lt;h2 id="what-is-mcp"&gt;What is MCP&lt;/h2&gt;
&lt;p&gt;MCP is a client-server protocol, licensed MIT, built on JSON-RPC 2.0. The mental model is simple: an AI agent (the host) connects through a client to MCP servers that expose tools, data sources, and context. Instead of building a bespoke integration every time Claude or GPT needs to talk to Salesforce, GitHub, or your internal database, you build one MCP server. Any compatible host can then use it.&lt;/p&gt;
&lt;p&gt;The problem it solves, which explains why it spread so fast, is that without a standard like this, integration complexity grows quadratically. Every new AI model times every new tool equals a new custom integration. MCP tries to make it linear.&lt;/p&gt;
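&lt;p&gt;The arithmetic: ten hosts times fifty tools is 500 bespoke integrations; with a shared protocol it is 50 servers plus 10 clients. And a server is genuinely small. A minimal sketch using the FastMCP helper from the official Python SDK (the tool itself is hypothetical, and the interface follows the SDK&amp;rsquo;s documented usage; verify against the current docs):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("order-lookup")

@mcp.tool()
def order_status(order_id: str):
    """Look up the status of an order. (Stubbed for illustration.)"""
    return {"order_id": order_id, "status": "shipped"}

if __name__ == "__main__":
    mcp.run()  # speaks JSON-RPC 2.0 over stdio by default
&lt;/code&gt;&lt;/pre&gt;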
&lt;p&gt;By December 2025, &lt;a href="https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation"&gt;Anthropic&amp;rsquo;s own count&lt;/a&gt; put the public MCP server ecosystem at &lt;strong&gt;10,000+&lt;/strong&gt; active servers and &lt;strong&gt;97 million&lt;/strong&gt; monthly SDK downloads across the Python and TypeScript SDKs. &lt;a href="https://github.blog/news-insights/octoverse/octoverse-a-new-developer-joins-github-every-second-as-ai-leads-typescript-to-1/"&gt;GitHub&amp;rsquo;s 2025 Octoverse report&lt;/a&gt; flagged MCP as a standout, hitting &lt;strong&gt;37,000 stars&lt;/strong&gt; in eight months. The unofficial registry mcp.so lists over 18,000 servers. Official SDKs now cover ten languages, including Python, TypeScript, Java, C#, Go, Kotlin, Rust, and Swift.&lt;/p&gt;
&lt;p&gt;The companies building MCP integrations: Microsoft, Salesforce, Cloudflare, GitHub, Stripe, Atlassian, Figma, Snowflake, Databricks, New Relic. At &lt;a href="https://blog.cloudflare.com/mcp-demo-day/"&gt;Cloudflare&amp;rsquo;s MCP Demo Day in May 2025&lt;/a&gt;, Asana, PayPal, Sentry, and Webflow all shipped remote servers in a single afternoon. Gartner predicts 75% of API gateway vendors will have MCP features by 2026.&lt;/p&gt;
&lt;p&gt;OpenAI&amp;rsquo;s adoption went beyond Altman&amp;rsquo;s post. MCP support rolled out across their Agents SDK (March 2025), &lt;a href="https://openai.com/index/new-tools-and-features-in-the-responses-api/"&gt;Responses API (May 2025)&lt;/a&gt;, &lt;a href="https://openai.com/index/introducing-gpt-realtime/"&gt;Realtime API (August 2025)&lt;/a&gt;, and &lt;a href="https://help.openai.com/en/articles/12584461-developer-mode-and-mcp-apps-in-chatgpt-beta"&gt;ChatGPT Developer Mode (September 2025)&lt;/a&gt;. The two companies later &lt;a href="http://blog.modelcontextprotocol.io/posts/2025-11-21-mcp-apps/"&gt;co-authored the MCP Apps Extension&lt;/a&gt;. You don&amp;rsquo;t see that often between direct competitors.&lt;/p&gt;
&lt;p&gt;One performance claim circulates in blog posts and marketing materials: that organizations implementing MCP report &amp;ldquo;40–60% faster agent deployment times.&amp;rdquo; I have not found a primary source for this. No survey, no case study, no named company. I&amp;rsquo;d treat it as marketing content until someone produces the underlying data.&lt;/p&gt;
&lt;h2 id="googles-a2a-fills-a-different-layer"&gt;Google&amp;rsquo;s A2A fills a different layer&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/"&gt;Google launched A2A, the Agent-to-Agent protocol, at Cloud Next on April 9, 2025&lt;/a&gt;, five months after MCP. Google didn&amp;rsquo;t position A2A as MCP replacement. They called it a complement. I think that&amp;rsquo;s honest, but it takes a minute to see why.&lt;/p&gt;
&lt;p&gt;MCP connects an agent to tools; A2A connects agents to each other. The distinction sounds slight, but the two protocols produce different behavior.&lt;/p&gt;
&lt;p&gt;When an MCP host calls an MCP server, it knows exactly what it&amp;rsquo;s getting: structured tool descriptions, specific function signatures, predictable outputs. The agent can see inside the tool. A2A works differently. Agents remain opaque to each other. An A2A agent publishes an &amp;ldquo;Agent Card,&amp;rdquo; a JSON metadata document at a well-known URL, describing its capabilities and authentication requirements. Other agents discover it, negotiate tasks through a defined lifecycle (submitted, working, input-required, completed), and collaborate without sharing memory or internal state.&lt;/p&gt;
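&lt;p&gt;An Agent Card is ordinary JSON fetched over HTTP (early spec drafts serve it from &lt;code&gt;/.well-known/agent.json&lt;/code&gt;). A sketch of the shape, with every value invented for illustration and field names following the published schema only loosely:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-typescript"&gt;// Hypothetical Agent Card for a parts-supplier agent.
const agentCard = {
  name: "parts-supplier-agent",
  description: "Checks parts availability and quotes lead times.",
  url: "https://agents.example.com/a2a", // endpoint that accepts A2A requests
  version: "1.0.0",
  capabilities: { streaming: true }, // optional protocol features
  authentication: { schemes: ["bearer"] }, // how callers must authenticate
  skills: [
    {
      id: "quote-lead-time",
      name: "Quote lead time",
      description: "Given a part number, returns availability and lead time.",
    },
  ],
};

// Discovery is a plain HTTP GET; a task then moves through the lifecycle
// (submitted, working, input-required, completed) via JSON-RPC calls,
// without either agent exposing its memory or internal state.
&lt;/code&gt;&lt;/pre&gt;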
&lt;p&gt;Google&amp;rsquo;s own documentation uses a repair shop analogy. MCP is how the mechanic uses diagnostic equipment. A2A is how the customer talks to the shop manager, or how the manager coordinates with a parts supplier. The analogy holds: both conversations happen in a real repair shop, and cutting either one doesn&amp;rsquo;t simplify anything.&lt;/p&gt;
&lt;p&gt;A2A &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/"&gt;launched with 50+ partner organizations&lt;/a&gt; and &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/agent2agent-protocol-is-getting-an-upgrade"&gt;grew to 150+ by July 2025&lt;/a&gt;. The list includes Atlassian, Salesforce, SAP, ServiceNow, McKinsey, BCG, Accenture. &lt;a href="https://developers.googleblog.com/en/google-cloud-donates-a2a-to-linux-foundation/"&gt;Google donated A2A to the Linux Foundation in June 2025&lt;/a&gt;. &lt;a href="https://lfaidata.foundation/communityblog/2025/08/29/acp-joins-forces-with-a2a-under-the-linux-foundations-lf-ai-data/"&gt;IBM&amp;rsquo;s competing Agent Communication Protocol merged into A2A in August&lt;/a&gt;, with IBM&amp;rsquo;s engineers joining the technical steering committee. As of February 2026, A2A has roughly &lt;strong&gt;21,900 GitHub stars&lt;/strong&gt;, about 40% of MCP&amp;rsquo;s total. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-mcp-vs-a2a-protocol-race-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/mcp-vs-a2a-protocol-race.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/mcp-vs-a2a-protocol-race.png"
alt="Exhibit comparing MCP and A2A protocol adoption: MCP leads with 37,000 GitHub stars, 18,000&amp;#43; public servers, 97M monthly SDK downloads, and 10 SDK languages versus A2A at 21,900 stars, no public registry, and 3 languages"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-mcp-vs-a2a-protocol-race-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/mcp-vs-a2a-protocol-race.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit comparing MCP and A2A protocol adoption: MCP leads with 37,000 GitHub stars, 18,000&amp;#43; public servers, 97M monthly SDK downloads, and 10 SDK languages versus A2A at 21,900 stars, no public registry, and 3 languages" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="what-history-can-tell-us-about-how-this-ends"&gt;What history can tell us about how this ends&lt;/h2&gt;
&lt;p&gt;Standards wars have a consistent pattern. The winner is almost never the technically superior option; it&amp;rsquo;s the one that ships first and gets adopted before anyone can catch up.&lt;/p&gt;
&lt;p&gt;TCP/IP and OSI are the canonical example. The OSI model, published by ISO in 1983, was architecturally more rigorous than TCP/IP&amp;rsquo;s four-layer stack. It had real institutional backing: the US Commerce Department published its GOSIP mandate in August 1988, with formal enforcement beginning in 1990. European governments followed. OSI still lost. TCP/IP won because it had running code, freely available implementations bundled with BSD Unix workstations, while OSI remained elegant theory trapped in committee processes. By 1994 the outcome was obvious. David Clark&amp;rsquo;s &lt;a href="https://groups.csail.mit.edu/ana/People/DDC/future_ietf_92.pdf"&gt;IETF motto captures why&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We reject kings, presidents and voting. We believe in rough consensus and running code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;VHS versus Betamax is the other lesson people cite, often incorrectly. Betamax had better picture quality. VHS won anyway, and the usual explanation is the movie library. That&amp;rsquo;s part of it. But JVC openly licensed VHS to manufacturers across the industry, which drove prices down and built a content ecosystem Sony couldn&amp;rsquo;t match. By 1987, &lt;a href="https://en.wikipedia.org/wiki/Videotape_format_war"&gt;VHS held 90% of the US VCR market&lt;/a&gt;. Sony conceded in 1988 by manufacturing VHS players. Ecosystem breadth, once established, creates a gravitational field that technical superiority alone can&amp;rsquo;t escape.&lt;/p&gt;
&lt;p&gt;USB is a more recent example with a twist. The consortium (Compaq, DEC, IBM, Intel, Microsoft, NEC, Nortel) formed in 1994 and &lt;a href="https://ethw.org/Milestones:Universal_Serial_Bus_(USB),_1996"&gt;shipped USB 1.0 in January 1996&lt;/a&gt;. Adoption was sluggish until &lt;a href="https://en.wikipedia.org/wiki/IMac_G3"&gt;Apple shipped the iMac G3 in August 1998&lt;/a&gt; with only USB ports, forcing the entire peripheral industry to follow. Sometimes one player is so central to the ecosystem that its adoption forces everyone else&amp;rsquo;s hand. OpenAI adopting MCP in March 2025 is MCP&amp;rsquo;s iMac moment.&lt;/p&gt;
&lt;p&gt;But USB also offers a warning. USB-C&amp;rsquo;s physical connector won universally, then the underlying protocol fragmented. The same connector could carry anything from USB 2.0 to USB4, 5W to 240W of power, depending on what you plugged together. &lt;a href="https://single-market-economy.ec.europa.eu/sectors/electrical-and-electronic-engineering-industries-eei/radio-equipment-directive-red/one-common-charging-solution-all_en"&gt;The EU eventually legislated convergence through its Radio Equipment Directive, which took effect December 28, 2024&lt;/a&gt;. A standard can win and still fragment when nobody governs the details. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-standards-war-precedents-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/standards-war-precedents.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/standards-war-precedents.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/standards-war-precedents.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/standards-war-precedents.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/standards-war-precedents.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/standards-war-precedents.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/standards-war-precedents.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/standards-war-precedents.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/standards-war-precedents.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/standards-war-precedents.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/standards-war-precedents.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/standards-war-precedents.png"
alt="Exhibit comparing historical standards wars: TCP/IP versus OSI decided by running code, VHS versus Betamax decided by open licensing, USB decided by Apple iMac catalyst event, all paralleling MCP ecosystem-first trajectory"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-standards-war-precedents-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/standards-war-precedents.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit comparing historical standards wars: TCP/IP versus OSI decided by running code, VHS versus Betamax decided by open licensing, USB decided by Apple iMac catalyst event, all paralleling MCP ecosystem-first trajectory" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="what-now"&gt;What now?&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation"&gt;The Linux Foundation&amp;rsquo;s Agentic AI Foundation (AAIF), launched December 9, 2025&lt;/a&gt; with Anthropic, OpenAI, and Block as co-founders, &lt;a href="https://www.linuxfoundation.org/press/agentic-ai-foundation-welcomes-97-new-members"&gt;now has 146 member organizations&lt;/a&gt;, including JPMorgan Chase, American Express, Autodesk, Red Hat, and Huawei. A2A has its own Linux Foundation governance body. MCP sits within AAIF. Both are under the same umbrella, but they&amp;rsquo;re not the same project.&lt;/p&gt;
&lt;p&gt;This is the governance structure you typically see after a standards war has been decided in principle but before the implementation details have been hammered out. Think of the W3C in 1994, not the W3C in 1998. For anyone making architectural decisions right now, the practical question isn&amp;rsquo;t MCP versus A2A. Most major enterprise platforms already support both. Salesforce, SAP, IBM, Microsoft, and AWS have committed to both. The question is sequencing and depth.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://research.isg-one.com/analyst-perspectives/a2a-v-mcp-why-ai-agents-need-both"&gt;ISG analyst David Menninger&lt;/a&gt; put it clearly: &amp;ldquo;MCP first for sharing context; then A2A for dynamic interaction among agents.&amp;rdquo; That&amp;rsquo;s the sequence I&amp;rsquo;d follow. MCP is the more mature protocol with the larger server ecosystem. The 10,000+ existing servers represent integration work that doesn&amp;rsquo;t need to be rebuilt. Start there. Layer A2A on top when your use cases require multi-agent coordination across organizational boundaries, supply chain, cross-platform orchestration, which is exactly where the Tyson Foods and Adobe deployments have landed.&lt;/p&gt;
&lt;p&gt;MCP security deserves a separate conversation. &lt;a href="https://astrix.security/learn/blog/state-of-mcp-server-security-2025/"&gt;Astrix Security&amp;rsquo;s research&lt;/a&gt; found that 53% of MCP servers rely on static credentials rather than OAuth. A critical vulnerability in the mcp-remote npm package (CVE-2025-6514) exposed 437,000+ installations to shell injection. TCP/IP had its share of early-stage security problems in the 1980s, so I&amp;rsquo;m not calling this fatal. But these are real vulnerabilities, and they will cause real incidents before the posture matures.&lt;/p&gt;
&lt;p&gt;Multiple analyst firms converge on an agentic AI market of roughly &lt;strong&gt;$7–8 billion in 2025&lt;/strong&gt;, growing at 40–50% annually, with projections ranging from &lt;a href="https://www.grandviewresearch.com/industry-analysis/ai-agents-market-report"&gt;$50 billion by 2030&lt;/a&gt; to &lt;a href="https://www.precedenceresearch.com/agentic-ai-market"&gt;$199 billion by 2034&lt;/a&gt;. NVIDIA&amp;rsquo;s CUDA is the comparison that matters: 4 million developers, 15 years of compounding library investment, and switching costs that produce &lt;a href="https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-fourth-quarter-and-fiscal-2025"&gt;$130.5 billion in annual revenue at 73% gross margins&lt;/a&gt;. MCP&amp;rsquo;s 97 million monthly downloads aren&amp;rsquo;t CUDA yet. But the trajectory points the same direction. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-agentic-ai-market-trajectory-png-5" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/agentic-ai-market-trajectory.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/agentic-ai-market-trajectory.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/agentic-ai-market-trajectory.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/agentic-ai-market-trajectory.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/agentic-ai-market-trajectory.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/agentic-ai-market-trajectory.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/agentic-ai-market-trajectory.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/agentic-ai-market-trajectory.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/agentic-ai-market-trajectory.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/agentic-ai-market-trajectory.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/agentic-ai-market-trajectory.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/agentic-ai-market-trajectory.png"
alt="Exhibit showing agentic AI market projections from $7-8 billion in 2025 to $50 billion by 2030 and up to $199 billion by 2034, with consensus 45% CAGR and comparison to NVIDIA CUDA $131B annual revenue"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-agentic-ai-market-trajectory-png-5" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/agentic-ai-market-trajectory.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing agentic AI market projections from $7-8 billion in 2025 to $50 billion by 2030 and up to $199 billion by 2034, with consensus 45% CAGR and comparison to NVIDIA CUDA $131B annual revenue" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;p&gt;My best guess (and I want to be clear it&amp;rsquo;s a guess): MCP becomes the infrastructure layer, A2A becomes the coordination layer, much as TCP handles transport while HTTP handles application-layer communication. Different floors of the same building. The question remains whether 146 AAIF members can hold coherent standards against the competitive pressure of &lt;a href="https://tracxn.com/d/sectors/agentic-ai/__oyRAfdUfHPjf2oap110Wis0Qg12Gd8DzULlDXPJzrzs"&gt;over 1,000 active agentic AI startups&lt;/a&gt;, each with economic incentives to differentiate.&lt;/p&gt;</description></item><item><title>93% of Developers Use AI Coding Tools. Productivity Hasn't Moved.</title><link>https://philippdubach.com/posts/93-of-developers-use-ai-coding-tools.-productivity-hasnt-moved./</link><pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/93-of-developers-use-ai-coding-tools.-productivity-hasnt-moved./</guid><description>&lt;p&gt;A &lt;a href="https://arxiv.org/abs/2507.09089"&gt;study published in July 2025&lt;/a&gt; gave AI coding tools their most credible test yet. Sixteen experienced open-source developers, 246 real tasks, randomized controlled design. The researchers expected to measure how much faster AI made them. What they found: developers using AI took &lt;strong&gt;19% longer&lt;/strong&gt; to complete tasks than those working without it.&lt;/p&gt;
&lt;p&gt;The developers themselves thought they were 20% faster.&lt;/p&gt;
&lt;p&gt;That &lt;strong&gt;39-point gap&lt;/strong&gt; between perception and reality is the most important number in &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/"&gt;METR&amp;rsquo;s paper&lt;/a&gt;. It lands inside two years of adoption data pointing in the opposite direction. &lt;a href="https://getdx.com/"&gt;DX&lt;/a&gt; surveyed 121,000 developers across 450+ companies and found &lt;strong&gt;92.6%&lt;/strong&gt; use AI coding tools at least monthly. &lt;a href="https://blog.jetbrains.com/ai/2026/02/the-best-ai-models-for-coding-accuracy-integration-and-developer-fit/"&gt;JetBrains&amp;rsquo; AI Pulse&lt;/a&gt; measured 93%. The &lt;a href="https://dora.dev/dora-report-2025"&gt;DORA 2025 report&lt;/a&gt; put it at 90%. On the productivity side: six independent research efforts converge on roughly the same ceiling, &lt;strong&gt;10%&lt;/strong&gt; at the system level, if you&amp;rsquo;re being generous.&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai-coding-perception-gap-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai-coding-perception-gap.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai-coding-perception-gap.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai-coding-perception-gap.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai-coding-perception-gap.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-coding-perception-gap.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai-coding-perception-gap.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai-coding-perception-gap.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai-coding-perception-gap.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-coding-perception-gap.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai-coding-perception-gap.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai-coding-perception-gap.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-coding-perception-gap.png"
alt="Exhibit showing METR study results: developers using AI took 19% longer to complete tasks while believing they were 20% faster, a 39-point perception gap across 246 tasks with 56% of AI suggestions rejected"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai-coding-perception-gap-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai-coding-perception-gap.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing METR study results: developers using AI took 19% longer to complete tasks while believing they were 20% faster, a 39-point perception gap across 246 tasks with 56% of AI suggestions rejected" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="the-bottleneck-was-never-the-typing"&gt;The bottleneck was never the typing&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.tocinstitute.org/theory-of-constraints.html"&gt;Goldratt&amp;rsquo;s Theory of Constraints&lt;/a&gt; makes the following prediction: optimizing a step that isn&amp;rsquo;t the bottleneck doesn&amp;rsquo;t improve system throughput. You can make the fastest machine on the factory floor twice as fast. If it&amp;rsquo;s feeding a queue that&amp;rsquo;s already backed up, you&amp;rsquo;ve accomplished nothing at the output level.&lt;/p&gt;
&lt;p&gt;Writing code has never been that bottleneck. &lt;a href="https://www.bain.com/insights/from-pilots-to-payoff-generative-ai-in-software-development-technology-report-2025/"&gt;Bain&amp;rsquo;s analysis&lt;/a&gt; found that writing and testing code accounts for roughly 25-35% of the total software development lifecycle. The rest goes to code review, understanding requirements, debugging, meetings, documentation. Even a 100% speedup on the coding step therefore yields only a 15-25% overall improvement (the arithmetic is sketched after the quote below), and that&amp;rsquo;s before accounting for what happens downstream when you generate a lot more code. Gergely Orosz, who runs The Pragmatic Engineer, &lt;a href="https://aws.amazon.com/blogs/enterprise-strategy/measuring-the-impact-of-ai-assistants-on-software-development/"&gt;put it directly&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Speed of typing out code has never been the bottleneck for software development.&lt;/p&gt;
&lt;/blockquote&gt;
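&lt;p&gt;The arithmetic behind that ceiling is Amdahl&amp;rsquo;s law applied to the development lifecycle. A quick sketch, using Bain&amp;rsquo;s 25-35% coding share and assuming AI doubles the speed of that one step:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-typescript"&gt;// Amdahl's law: overall speedup when a fraction p of the work gets s times faster.
function overallSpeedup(p: number, s: number): number {
  return 1 / ((1 - p) + p / s);
}

// Coding is ~25-35% of the lifecycle (Bain); assume a 2x speedup on that step.
for (const p of [0.25, 0.35]) {
  const gain = (overallSpeedup(p, 2) - 1) * 100;
  console.log(`coding share ${p * 100}%: ~${gain.toFixed(1)}% system-level improvement`);
}
// Prints roughly 14% and 21%: the theoretical best case, before downstream
// review costs eat into it.
&lt;/code&gt;&lt;/pre&gt;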
&lt;p&gt;What the data shows now is that AI tools don&amp;rsquo;t just fail to clear the bottleneck. They move it downstream and make it worse. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai-coding-impact-ceiling-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai-coding-impact-ceiling.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai-coding-impact-ceiling.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai-coding-impact-ceiling.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai-coding-impact-ceiling.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-coding-impact-ceiling.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai-coding-impact-ceiling.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai-coding-impact-ceiling.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai-coding-impact-ceiling.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-coding-impact-ceiling.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai-coding-impact-ceiling.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai-coding-impact-ceiling.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-coding-impact-ceiling.png"
alt="Exhibit showing coding is 25-35% of the software development lifecycle with developers writing code only 52 minutes per day, meaning even a 100% coding speedup yields at most 15% system improvement under Amdahl&amp;#39;s Law"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai-coding-impact-ceiling-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai-coding-impact-ceiling.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing coding is 25-35% of the software development lifecycle with developers writing code only 52 minutes per day, meaning even a 100% coding speedup yields at most 15% system improvement under Amdahl&amp;#39;s Law" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="the-code-review-bottleneck"&gt;The code review bottleneck&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.faros.ai/ai-productivity-paradox"&gt;Faros AI&lt;/a&gt; measured this across 10,000+ developers on 1,255 teams in June 2025. Teams with high AI adoption completed 21% more tasks and merged 98% more pull requests. PR size grew 154%. Then: review time up 91%, bugs up 9%, organizational DORA metrics flat.&lt;/p&gt;
&lt;p&gt;More PRs, bigger PRs, slower reviews, more bugs, no throughput improvement. The coding step accelerated. The review step, already a constraint, got worse. Michael Truell, &lt;a href="https://fortune.com/2025/12/19/cursor-ai-coding-startup-graphite-competition-heats-up/"&gt;Cursor&amp;rsquo;s CEO&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Cursor has made it much faster to write production code. However, for most engineering teams, reviewing code looks the same as it did three years ago&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor then &lt;a href="https://cursor.com/blog/graphite"&gt;acquired Graphite&lt;/a&gt;, a code review startup. The acquisition is a more honest statement about where the constraint lives than anything in Cursor&amp;rsquo;s marketing. The &lt;a href="https://dora.dev/research/2024/dora-report/"&gt;DORA 2024 report&lt;/a&gt; found that for every 25 percentage point increase in AI adoption, delivery throughput dropped 1.5% and delivery stability dropped 7.2%. &lt;a href="https://dora.dev/dora-report-2025"&gt;DORA 2025&lt;/a&gt;, at 90% adoption, put it tersely: &amp;ldquo;AI doesn&amp;rsquo;t fix a team; it amplifies what&amp;rsquo;s already there.&amp;rdquo; The negative relationship with stability holds even as adoption saturates. &lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-ai-coding-bottleneck-shift-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/ai-coding-bottleneck-shift.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/ai-coding-bottleneck-shift.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/ai-coding-bottleneck-shift.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/ai-coding-bottleneck-shift.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-coding-bottleneck-shift.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/ai-coding-bottleneck-shift.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/ai-coding-bottleneck-shift.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/ai-coding-bottleneck-shift.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-coding-bottleneck-shift.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/ai-coding-bottleneck-shift.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/ai-coding-bottleneck-shift.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/ai-coding-bottleneck-shift.png"
alt="Exhibit showing Faros AI data across 10,000&amp;#43; developers: high AI adoption teams merged 98% more pull requests but review time increased 91%, bugs rose 9%, and DORA delivery metrics were unchanged"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-ai-coding-bottleneck-shift-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/ai-coding-bottleneck-shift.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Exhibit showing Faros AI data across 10,000&amp;#43; developers: high AI adoption teams merged 98% more pull requests but review time increased 91%, bugs rose 9%, and DORA delivery metrics were unchanged" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/p&gt;
&lt;h2 id="41-of-what"&gt;41% of what?&lt;/h2&gt;
&lt;p&gt;One number circulates constantly in press coverage: 41% of code is now AI-generated. It comes from Emad Mostaque, who took GitHub&amp;rsquo;s figure about the share of code accepted by Copilot users and &lt;a href="https://decrypt.co/147191/no-human-programmers-five-years-ai-stability-ceo"&gt;extrapolated it&lt;/a&gt; into a claim about all code everywhere. The original figure applied only to developers already using Copilot, a fraction of GitHub&amp;rsquo;s user base at the time. The extrapolation doesn&amp;rsquo;t hold.&lt;/p&gt;
&lt;p&gt;The more defensible numbers: &lt;a href="https://shiftmag.dev/this-cto-says-93-of-developers-use-ai-but-productivity-is-still-10-8013/"&gt;DX&amp;rsquo;s measurement across 4.2 million developers&lt;/a&gt; puts AI-generated production code at 26.9%. A &lt;a href="https://arxiv.org/abs/2506.08945"&gt;study published in Science&lt;/a&gt; found roughly 30% of Python functions from U.S. contributors on GitHub were AI-generated by late 2024. &lt;a href="https://fortune.com/2024/10/30/googles-code-ai-sundar-pichai/"&gt;Sundar Pichai&lt;/a&gt; said more than a quarter of all new code at Google is AI-generated. These numbers cluster around 25-30%.&lt;/p&gt;
&lt;p&gt;The inflated figure matters because it supports a specific argument: that AI has already crossed some threshold, that the transformation is done, that the productivity gains are already baked in. At 27%, AI is a meaningful contributor to software production. At 41%, you&amp;rsquo;re telling a different story, and the decisions that follow from it are different decisions.&lt;/p&gt;
&lt;p&gt;The quality picture at 27% is not reassuring. &lt;a href="https://www.businesswire.com/news/home/20250730694951/en/AI-Generated-Code-Poses-Major-Security-Risks-in-Nearly-Half-of-All-Development-Tasks-Veracode-Research-Reveals"&gt;Veracode tested 100+ LLMs&lt;/a&gt; across 80 coding tasks and found 45% of AI-generated code introduced OWASP Top 10 vulnerabilities. &lt;a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation"&gt;CodeRabbit&amp;rsquo;s analysis&lt;/a&gt; found AI-generated code contains 2.74x more security vulnerabilities than human-written code. &lt;a href="https://www.blackduck.com/blog/open-source-trends-ossra-report.html"&gt;Black Duck&amp;rsquo;s 2026 OSSRA report&lt;/a&gt; found vulnerabilities per codebase up 107% year over year, the mean codebase going from 280 to 581 known vulnerabilities. &lt;a href="https://thenewstack.io/martin-fowler-on-preparing-for-ais-nondeterministic-computing/"&gt;Martin Fowler&amp;rsquo;s framing&lt;/a&gt; is still the most honest I&amp;rsquo;ve seen: &amp;ldquo;Treat every slice as a PR from a rather dodgy collaborator who&amp;rsquo;s very productive in the lines-of-code sense, but you can&amp;rsquo;t trust a thing they&amp;rsquo;re doing.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="perception-is-reality"&gt;Perception is reality&lt;/h2&gt;
&lt;p&gt;The 19% slowdown number has been contested, fairly: the CI is wide (+2% to +39%), the study covered experienced developers on complex codebases, and METR has acknowledged design limitations. In February 2026, &lt;a href="https://metr.org/blog/2026-02-24-uplift-update/"&gt;METR published an update&lt;/a&gt; changing their experiment design after discovering that 30-50% of invited developers declined to participate without AI access, a selection effect that biased the original sample toward developers who benefit least from AI. Their newer cohort (800+ tasks, 57 developers) showed a -4% effect, a slight average speedup, with a CI of -15% to +9%, a substantially weaker result than the original slowdown. METR&amp;rsquo;s conclusion: &amp;ldquo;AI likely provides productivity benefits in early 2026.&amp;rdquo; The perception gap and the bottleneck problem remain real, but the exact magnitude of the July 2025 finding should be read with that caveat.&lt;/p&gt;
&lt;p&gt;METR&amp;rsquo;s companion &lt;a href="https://arxiv.org/abs/2503.14499"&gt;Horizon benchmark&lt;/a&gt; (Kwa et al., 2025) puts numbers to that curve: the 50%-task-completion time horizon for Claude 3.7 Sonnet was 60 minutes. Claude Opus 4.6, released February 2026, reached 719 minutes. The doubling time from 2023 is approximately 128 days. METR frames the productivity result as a point on that trend, not a fixed constant, though they also note that their benchmark tasks are cleaner than real production work and performance on &amp;ldquo;messier&amp;rdquo; tasks may improve more slowly. But the perception gap itself is more robust than the exact slowdown figure, and it replicates.&lt;/p&gt;
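&lt;p&gt;The doubling time is easy to sanity-check from the two endpoints quoted above. The one-year gap between the two models is my reading of the release dates, not METR&amp;rsquo;s figure, so treat this as a back-of-envelope check:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-typescript"&gt;// Implied doubling time between two time-horizon measurements.
function doublingDays(h0: number, h1: number, elapsedDays: number): number {
  return elapsedDays / Math.log2(h1 / h0);
}

// 60 minutes to 719 minutes over roughly one year.
console.log(doublingDays(60, 719, 365).toFixed(0)); // ~102
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Endpoint to endpoint that comes out near 100 days, somewhat faster than the 128-day fit from 2023 onward, though two points prove little about a trend.&lt;/p&gt;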
&lt;p&gt;&lt;a href="https://survey.stackoverflow.co/2025/ai/"&gt;Stack Overflow&amp;rsquo;s 2025 Developer Survey&lt;/a&gt; found favorable views of AI tools dropped from 70% to 60%, with 46% not trusting AI output and 66% citing &amp;ldquo;almost right but not quite&amp;rdquo; as their top frustration. &lt;a href="https://www.software.com/reports/code-time-report"&gt;Software.com&amp;rsquo;s monitoring&lt;/a&gt; of 250,000 developers found the median developer codes for 52 minutes per day, about 11% of a 40-hour week. The tools are fighting over 11% of the workday.&lt;/p&gt;
&lt;p&gt;A &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4945566"&gt;field experiment across 4,867 developers&lt;/a&gt; from MIT, Princeton, Wharton, and Microsoft found that above-median-tenure developers showed no significant productivity increase from AI tools. The people capable of using AI most effectively are also the people most likely to catch when it&amp;rsquo;s wrong and fix it. It&amp;rsquo;s why the tools work better for junior developers on simple tasks than for senior developers on the things that actually matter most.&lt;/p&gt;
&lt;h2 id="githubs-2022-copilot-study"&gt;GitHub&amp;rsquo;s 2022 Copilot study&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2302.06590"&gt;GitHub&amp;rsquo;s 2022 Copilot study&lt;/a&gt;, the &amp;ldquo;55% faster&amp;rdquo; figure, still appears in enterprise sales decks in 2026. One JavaScript task: implementing a web server with HTTP endpoints. Thirty-five completers. No assessment of output quality, test coverage, or whether the code would survive production. Confidence interval: 21% to 89%. Participants knew they were being timed for productivity.&lt;/p&gt;
&lt;p&gt;What the study actually shows is that when you pick a task specifically suited to AI assistance and measure completion time without checking correctness, AI looks fast. That&amp;rsquo;s a real finding. It&amp;rsquo;s just not the one being used to justify eight-figure licensing deals.&lt;/p&gt;
&lt;h2 id="macro-data"&gt;Macro data&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.apolloacademy.com/waiting-for-the-ai-j-curve/"&gt;Apollo&amp;rsquo;s Torsten Slok&lt;/a&gt; wrote in early 2026: &amp;ldquo;AI is everywhere except in the incoming macroeconomic data.&amp;rdquo; An &lt;a href="https://www.nber.org/papers/w34836"&gt;NBER paper from February 2026&lt;/a&gt; surveying nearly 6,000 executives found over 80% of firms reported AI had no impact on productivity over the preceding three years. Expected improvement over the next three: 1.4%.&lt;/p&gt;
&lt;p&gt;Daron Acemoglu, who shared the 2024 Nobel Prize in Economics partly for his work on technology and labor markets, &lt;a href="https://www.nber.org/papers/w32487"&gt;projected&lt;/a&gt; a 0.5% total factor productivity increase from AI over the next decade. His reasoning: the economic value of AI concentrates in a narrow set of tasks that don&amp;rsquo;t represent enough of total economic activity to move aggregate numbers. The Bain arithmetic, at macroeconomic scale.&lt;/p&gt;
&lt;p&gt;The standard optimist response is the IT comparison: computers entered enterprises in the 1970s and 1980s without producing measurable productivity improvements for a decade, then the gains came in the mid-1990s. It&amp;rsquo;s a reasonable historical parallel. I&amp;rsquo;m genuinely uncertain whether it applies. Computers replaced manual processes wholesale. AI coding tools are a faster ingredient inside a process whose other ingredients haven&amp;rsquo;t changed: the requirements still need to be understood, the review still needs to happen, the tests still need to pass. The productivity lag might resolve. Or the structure of the workflow might mean it doesn&amp;rsquo;t, even eventually. I don&amp;rsquo;t know, and the honest answer is that nobody does yet.&lt;/p&gt;
&lt;h2 id="where-the-value-actually-lands"&gt;Where the value actually lands&lt;/h2&gt;
&lt;p&gt;Exploration is faster. When I&amp;rsquo;m working on something unfamiliar, a library I haven&amp;rsquo;t used, an API I&amp;rsquo;m integrating for the first time, the startup cost drops. A working first draft arrives in minutes rather than hours. That&amp;rsquo;s real, and I notice it. Whether it shows up in throughput metrics is a different question, and the data suggests mostly not, because the constraint was never the first draft.&lt;/p&gt;
&lt;p&gt;Boilerplate, test scaffolding, documentation: these genuinely benefit too. The tasks that are well-scoped and low-stakes if approximately wrong are where these tools earn their keep. Anyone who&amp;rsquo;s used them seriously already knew this before the research said so.&lt;/p&gt;
&lt;p&gt;Simon Willison, in an &lt;a href="https://www.npr.org/2025/10/21/nx-s1-5506141/ai-code-software-productivity-claims"&gt;NPR interview&lt;/a&gt;: &amp;ldquo;Our job is not to type code into a computer. Our job is to deliver systems that solve problems.&amp;rdquo; The tools handle the first part better than they did a year ago. The second part hasn&amp;rsquo;t changed.&lt;/p&gt;
&lt;h2 id="the-right-question"&gt;The right question&lt;/h2&gt;
&lt;p&gt;The useful product question, if the bottleneck is now review, is what makes review faster and more reliable, not what generates more code faster. AI tools that flag security issues, catch logic errors, and surface context about why code was written a certain way would attack the actual constraint. This is at least part of what Cursor is working toward with Graphite.&lt;/p&gt;
&lt;p&gt;The harder problem is cultural. &lt;a href="https://www.bain.com/insights/from-pilots-to-payoff-generative-ai-in-software-development-technology-report-2025/"&gt;Bain&lt;/a&gt; and DORA say the same thing from different angles: AI amplifies what&amp;rsquo;s already there. Teams with good review practices and clear requirements get leverage. Teams without them produce more code that still doesn&amp;rsquo;t ship on time. The organizations that most want a tool to fix their velocity tend to be the ones with the process debt that prevents any tool from working.&lt;/p&gt;
&lt;p&gt;I have no idea what the five-year picture looks like. The Solow paradox took a decade to resolve and resolved in ways nobody expected. Maybe the AI productivity gains show up in 2029 and the 2026 skeptics look naive. Genuinely possible. I try to hold that view honestly rather than dismiss it.&lt;/p&gt;
&lt;p&gt;What the data shows now: at 92.6% monthly adoption and roughly 27% of production code AI-generated, the experiment has run at real scale. Organizational throughput hasn&amp;rsquo;t moved past 10%. Experienced developers are slower with AI assistance than without it. Bugs are up, review times are up, code quality metrics are declining, and DORA stability goes the wrong way as adoption increases.&lt;/p&gt;</description></item><item><title>Building a No-Tracking Newsletter from Markdown to Distribution</title><link>https://philippdubach.com/posts/building-a-no-tracking-newsletter-from-markdown-to-distribution/</link><pubDate>Wed, 24 Dec 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/building-a-no-tracking-newsletter-from-markdown-to-distribution/</guid><description>&lt;p&gt;&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-Newsletter_Overview2-jpg-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/Newsletter_Overview2.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/Newsletter_Overview2.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/Newsletter_Overview2.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/Newsletter_Overview2.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/Newsletter_Overview2.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/Newsletter_Overview2.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/Newsletter_Overview2.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/Newsletter_Overview2.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/Newsletter_Overview2.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/Newsletter_Overview2.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/Newsletter_Overview2.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/Newsletter_Overview2.jpg"
alt="Screenshot of rendered newsletter showing article preview cards with images and descriptions"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-Newsletter_Overview2-jpg-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/Newsletter_Overview2.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Screenshot of rendered newsletter showing article preview cards with images and descriptions" decoding="async"&gt;
&lt;/dialog&gt;
Friends have been asking how they can stay up to date with what I&amp;rsquo;m working on and keep track of the things I read, write, and share. RSS feeds don&amp;rsquo;t seem to be in vogue anymore, apparently. So I built a mailing list. What else would you do over the Christmas break?&lt;/p&gt;
&lt;p&gt;From a previous marketing job I knew Mailchimp. Also, every newsletter I unsubscribe from is Mailchimp. I no longer wish to receive these emails.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-unsubscribe2-png-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/unsubscribe2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/unsubscribe2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/unsubscribe2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/unsubscribe2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/unsubscribe2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/unsubscribe2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/unsubscribe2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/unsubscribe2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/unsubscribe2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/unsubscribe2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/unsubscribe2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/unsubscribe2.png"
alt="Unsubscribe confirmation from Mailchimp newsletters"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-unsubscribe2-png-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/unsubscribe2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Unsubscribe confirmation from Mailchimp newsletters" decoding="async"&gt;
&lt;/dialog&gt;
Or obviously Substack. I read &lt;a href="https://simonw.substack.com"&gt;Simon Willison&amp;rsquo;s Newsletter&lt;/a&gt; sometimes. And obviously &lt;a href="https://philippdubach.com/posts/michael-burrys-379-newsletter/"&gt;Michael Burry&amp;rsquo;s $379 Substack&lt;/a&gt;. Those are solid options, but I had a clear picture in mind of what I wanted. I wanted only HTML, no tracking (also why I use &lt;a href="https://www.goatcounter.com/"&gt;GoatCounter&lt;/a&gt; on my site and not Google Analytics), and full control of the creation and distribution chain from end to end. So I sat down and sketched in my notebook, which is what I always do when I have an idea after a long walk or a hot shower.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-newsletter_scetch3-jpg-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/newsletter_scetch3.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/newsletter_scetch3.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/newsletter_scetch3.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/newsletter_scetch3.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/newsletter_scetch3.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/newsletter_scetch3.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/newsletter_scetch3.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/newsletter_scetch3.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/newsletter_scetch3.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/newsletter_scetch3.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/newsletter_scetch3.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/newsletter_scetch3.jpg"
alt="Hand-drawn notebook sketch of newsletter architecture showing markdown to HTML to distribution flow"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-newsletter_scetch3-jpg-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/newsletter_scetch3.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Hand-drawn notebook sketch of newsletter architecture showing markdown to HTML to distribution flow" decoding="async"&gt;
&lt;/dialog&gt;
I then went over to Illustrator (actually &lt;a href="https://affinity.serif.com/en-us/designer/"&gt;Affinity Designer&lt;/a&gt;, which I have been happily using since my Creative Cloud subscription ran out, sorry Adobe) and built a quick mockup of my drawing. I fed the mockup to Claude to generate pure HTML. After a few iterations it looked more or less the way I wanted.&lt;/p&gt;
&lt;p&gt;The architecture: write the newsletter in Markdown (as I do for all of &lt;a href="https://philippdubach.com/about"&gt;my blog&lt;/a&gt;). Render it as HTML. Fetch OpenGraph images from my Cloudflare CDN at the lowest feasible resolution and pull descriptions automatically. Format links with preview cards. Keep some space for free text at the top and bottom.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-newsletter_architecture2-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/newsletter_architecture2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/newsletter_architecture2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/newsletter_architecture2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/newsletter_architecture2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/newsletter_architecture2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/newsletter_architecture2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/newsletter_architecture2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/newsletter_architecture2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/newsletter_architecture2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/newsletter_architecture2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/newsletter_architecture2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/newsletter_architecture2.png"
alt="Flowchart showing newsletter pipeline: Write Markdown, Render HTML, Host on R2, Fetch KV for subscribers, Send via Resend API"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-newsletter_architecture2-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/newsletter_architecture2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Flowchart showing newsletter pipeline: Write Markdown, Render HTML, Host on R2, Fetch KV for subscribers, Send via Resend API" decoding="async"&gt;
&lt;/dialog&gt;
I built a &lt;a href="https://github.com/philippdubach/newsletter-generator"&gt;Python engine&lt;/a&gt; that renders my &lt;code&gt;.md&lt;/code&gt; files to email-safe HTML. The script handles several things automatically: (1) It fetches OpenGraph metadata for every link using Beautiful Soup, caching results to avoid repeated requests. (2) It optimizes images using Cloudflare&amp;rsquo;s image transformation service; for email, I use 240px width (2x the display size of 120px for retina displays). (3) It generates LinkedIn-style preview cards with images on the left and text on the right. The output is table-based HTML because email clients from 2003 still exist and they&amp;rsquo;re apparently immortal.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-Newsletter_Overview-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/Newsletter_Overview.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/Newsletter_Overview.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/Newsletter_Overview.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/Newsletter_Overview.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/Newsletter_Overview.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/Newsletter_Overview.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/Newsletter_Overview.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/Newsletter_Overview.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/Newsletter_Overview.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/Newsletter_Overview.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/Newsletter_Overview.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/Newsletter_Overview.png"
alt="Screenshot of rendered newsletter showing article preview cards with images and descriptions"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-Newsletter_Overview-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/Newsletter_Overview.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Screenshot of rendered newsletter showing article preview cards with images and descriptions" decoding="async"&gt;
&lt;/dialog&gt;
Originally I intended to copy-paste the HTML into an email manually, since I did not expect many subscribers at first (or at all). But that left another challenge: how do people sign up?&lt;/p&gt;
&lt;p&gt;Since I had already been using &lt;a href="https://developers.cloudflare.com/kv/"&gt;Cloudflare Workers KV&lt;/a&gt; to build an API serving historical values from my home temperature and humidity sensor, I reached for it again. The API is simple: POST to &lt;code&gt;/api/subscribe&lt;/code&gt; with an email address, and it gets stored in KV with a timestamp and some metadata.&lt;/p&gt;
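&lt;p&gt;For a sense of the shape of it, here is a hypothetical client-side sketch; the exact request fields and KV record layout are assumptions, not the published contract:&lt;/p&gt;
&lt;div class="code-block" data-lang="python"&gt;&lt;pre&gt;&lt;code class="language-python"&gt;
# Client-side sketch of the subscribe call; request fields and the stored
# record shape are assumptions.
import requests

resp = requests.post(
    "https://philippdubach.com/api/subscribe",  # host assumed; path from the post
    json={"email": "reader@example.com"},
    timeout=10,
)
print(resp.status_code)

# Conceptually, the Worker then writes something like:
#   key:   "subscriber:reader@example.com"
#   value: {"subscribedAt": "...", "userAgent": "...", ...}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;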
&lt;p&gt;After some Copilot iterations, the Worker includes rate limiting, honeypot fields for spam protection, proper CORS headers, and RFC-compliant email validation. (I&amp;rsquo;m not a security person, so I&amp;rsquo;m not sure how I feel about handing all the security work and testing to an agent; please reach out if you can help.)&lt;/p&gt;
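&lt;p&gt;Conceptually, those checks look something like the following sketch (written in Python for readability; the real Worker is TypeScript, and the regex below is a pragmatic subset rather than the full RFC grammar):&lt;/p&gt;
&lt;div class="code-block" data-lang="python"&gt;&lt;pre&gt;&lt;code class="language-python"&gt;
# Illustrative logic only; the actual Worker implementation differs.
import re
import time

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
_hits = {}  # ip -&gt; recent request timestamps

def allow_request(ip, limit=5, window=60.0):
    """Sliding-window rate limit: at most `limit` requests per `window` seconds."""
    now = time.time()
    recent = [t for t in _hits.get(ip, []) if now - t &lt; window]
    recent.append(now)
    _hits[ip] = recent
    return len(recent) &lt;= limit

def validate_signup(form, ip):
    if form.get("website"):            # honeypot field: real users leave it empty
        return False, "spam"
    if not allow_request(ip):
        return False, "rate limited"
    if not EMAIL_RE.match(form.get("email", "")):
        return False, "invalid email"
    return True, "ok"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;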
&lt;p&gt;I then wanted to get a confirmation email every time someone signed up. Since SMTP sending over my domain did not work reliably at first, I had to look for other options. Even though I wanted everything self-hosted, I ended up using the &lt;a href="https://resend.com/"&gt;Resend API&lt;/a&gt;. The API is straightforward:&lt;/p&gt;
&lt;div class="code-block" data-lang="typescript"&gt;&lt;span class="code-lang" aria-hidden="true"&gt;typescript&lt;/span&gt;&lt;button type="button" class="code-copy" aria-label="Copy code to clipboard"&gt;
&lt;span class="code-copy-text"&gt;Copy&lt;/span&gt;
&lt;/button&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-typescript" data-lang="typescript"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;sendWelcomeEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;subscriberEmail&lt;/span&gt;: &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;: &lt;span class="kt"&gt;Env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;https://api.resend.com/emails&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;POST&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;Authorization&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="sb"&gt;`Bearer &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RESEND_API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;Content-Type&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;application/json&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;body&lt;/span&gt;: &lt;span class="kt"&gt;JSON.stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Philipp Dubach &amp;lt;noreply@notifications.philippdubach.com&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;subscriberEmail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Welcome to the Newsletter&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="sb"&gt;`&amp;lt;p&amp;gt;Thanks for subscribing!&amp;lt;/p&amp;gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;After implementing this, I figured: why not send a confirmation to the subscriber and a copy to me? Why not use Resend for the whole distribution? (This is not a paid advertisement.) The HTML newsletter I generate goes straight into the email body. No images hosted elsewhere (except for the optimized preview thumbnails). No tracking pixels. No click tracking. The email is just HTML.&lt;/p&gt;
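&lt;p&gt;The distribution loop is then just the same Resend call repeated. A rough sketch, assuming the subscriber list has already been read out of KV and the API key sits in the environment:&lt;/p&gt;
&lt;div class="code-block" data-lang="python"&gt;&lt;pre&gt;&lt;code class="language-python"&gt;
# Sketch of the send loop; endpoint and from-address match the Worker code
# above, everything else is illustrative.
import os
import requests

def send_newsletter(subject, html, subscribers):
    for email in subscribers:
        resp = requests.post(
            "https://api.resend.com/emails",
            headers={
                "Authorization": f"Bearer {os.environ['RESEND_API_KEY']}",
                "Content-Type": "application/json",
            },
            json={
                "from": "Philipp Dubach &lt;noreply@notifications.philippdubach.com&gt;",
                "to": [email],
                "subject": subject,
                "html": html,  # the rendered newsletter goes straight in
            },
            timeout=15,
        )
        resp.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;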
&lt;p&gt;I also looked at &lt;a href="https://www.mailgun.com/"&gt;Mailgun&lt;/a&gt; and &lt;a href="https://sendgrid.com/"&gt;SendGrid&lt;/a&gt; before settling on Resend. Mailgun has better deliverability monitoring but a more complex API. SendGrid has more features but felt overengineered for what I needed. Resend&amp;rsquo;s free tier and simple API won. If you have strong opinions on email APIs, I&amp;rsquo;m curious to hear them.&lt;/p&gt;
&lt;p&gt;The total cost of running this: zero. Cloudflare Workers has a generous free tier. Cloudflare R2 (where the HTML newsletters are hosted) has 10GB free storage. Resend gives 3,000 emails per month. The Python script runs locally or on my Azure instance.&lt;/p&gt;
&lt;p&gt;You can find &lt;a href="https://static.philippdubach.com/newsletter/newsletter-2025-12.html"&gt;my first newsletter here&lt;/a&gt;. The full code for both the &lt;a href="https://github.com/philippdubach/newsletter-generator"&gt;newsletter generator&lt;/a&gt; and the &lt;a href="https://github.com/philippdubach/newsletter-api"&gt;subscriber API&lt;/a&gt; is on GitHub.&lt;/p&gt;</description></item><item><title>Visualizing Gradients with PyTorch</title><link>https://philippdubach.com/posts/visualizing-gradients-with-pytorch/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/visualizing-gradients-with-pytorch/</guid><description>&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Gradient"&gt;Gradients&lt;/a&gt; are one of the most important concepts in calculus and machine learning, but they&amp;rsquo;re often poorly understood. Trying to understand them better myself, I wanted to build a visualization tool that helps me develop the correct mental picture of what the gradient of a function is. I came across &lt;a href="https://github.com/GistNoesis/VisualizeGradient"&gt;GistNoesis/VisualizeGradient&lt;/a&gt; and went on from there to write my own iteration. This mental model generalizes beautifully to higher dimensions and is the foundation for understanding optimization algorithms like gradient descent.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-torch-gradients_Figure_2-png-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/torch-gradients_Figure_2.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/torch-gradients_Figure_2.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/torch-gradients_Figure_2.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/torch-gradients_Figure_2.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/torch-gradients_Figure_2.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/torch-gradients_Figure_2.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/torch-gradients_Figure_2.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/torch-gradients_Figure_2.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/torch-gradients_Figure_2.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/torch-gradients_Figure_2.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/torch-gradients_Figure_2.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/torch-gradients_Figure_2.png"
alt="2D Gradient Plot: The colored surface shows function values. Black arrows show gradient vectors in the input plane (x-y space), pointing toward the direction of steepest ascent."
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-torch-gradients_Figure_2-png-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/torch-gradients_Figure_2.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="2D Gradient Plot: The colored surface shows function values. Black arrows show gradient vectors in the input plane (x-y space), pointing toward the direction of steepest ascent." decoding="async"&gt;
&lt;/dialog&gt;
&lt;em&gt;The colored surface shows function values. Black arrows show gradient vectors in the input plane (x-y space), pointing toward the direction of steepest ascent.&lt;/em&gt;&lt;/p&gt;
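&lt;p&gt;The core of such a tool fits in a few lines. A minimal sketch of the idea, not the repo&amp;rsquo;s exact code: evaluate a toy function on a grid, let autograd supply the partial derivatives, and draw them as arrows in the input plane:&lt;/p&gt;
&lt;div class="code-block" data-lang="python"&gt;&lt;pre&gt;&lt;code class="language-python"&gt;
# Minimal version of the idea (not the repo's exact code).
import torch
import matplotlib.pyplot as plt

xs = torch.linspace(-3, 3, 20)
X, Y = torch.meshgrid(xs, xs, indexing="ij")
X = X.clone().requires_grad_(True)
Y = Y.clone().requires_grad_(True)

Z = torch.sin(X) * torch.cos(Y)  # any differentiable toy function
# Summing gives backward() a scalar; each grid point's gradient is independent.
Z.sum().backward()

plt.contourf(X.detach(), Y.detach(), Z.detach(), levels=30)
plt.quiver(X.detach(), Y.detach(), X.grad, Y.grad)  # arrows point uphill
plt.title("Gradient field: direction of steepest ascent")
plt.show()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;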
&lt;p&gt;If you are interested in having a closer look or replicating my approach, the full project can be found on my &lt;a href="https://github.com/philippdubach/torch-gradients/"&gt;GitHub&lt;/a&gt;. I&amp;rsquo;m also looking forward to doing something similar on the &lt;a href="https://blog.foletta.net/post/2025-07-14-clt/"&gt;Central Limit Theorem&lt;/a&gt; as well as doing a short tutorial on &lt;a href="https://static.philippdubach.com/opt_vol_surface_plot_fig1.png"&gt;plotting options volatility surfaces with python&lt;/a&gt;, a project I have been waiting to finish for some time now.&lt;/p&gt;</description></item><item><title>Counting Cards with Computer Vision</title><link>https://philippdubach.com/posts/counting-cards-with-computer-vision/</link><pubDate>Sun, 06 Jul 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/counting-cards-with-computer-vision/</guid><description>&lt;p&gt;After installing &lt;a href="https://www.anthropic.com/claude-code"&gt;Claude Code&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster through natural language commands&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I was looking for a task to test its abilities. Fairly quickly we wrote &lt;a href="https://gist.github.com/philippdubach/741cbd56498e43375892966ca691b9c2"&gt;fewer than 200 lines of Python code predicting blackjack odds&lt;/a&gt; using Monte Carlo simulation. When I went on to test this little tool on &lt;a href="https://games.washingtonpost.com/games/blackjack"&gt;Washington Post&amp;rsquo;s&lt;/a&gt; online blackjack (I didn&amp;rsquo;t know that existed either!) I quickly noticed how impractical it was to manually input all the card values on the table. What if the tool could also handle blackjack card detection automatically and calculate the odds from it? I had never done anything with computer vision, so this seemed like a good challenge.
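To give a flavor of the approach, here is a much-reduced Monte Carlo sketch (the actual gist covers full player/dealer odds); the infinite-deck assumption and the stand-on-all-17s rule are simplifications of mine:
&lt;div class="code-block" data-lang="python"&gt;&lt;pre&gt;&lt;code class="language-python"&gt;
# Estimate the dealer's bust probability for a given upcard.
import random

RANKS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]  # 11 = ace

def dealer_bust_probability(upcard, trials=100_000):
    busts = 0
    for _ in range(trials):
        total, aces = upcard, int(upcard == 11)
        while total &lt; 17:                 # dealer hits below 17
            card = random.choice(RANKS)
            total += card
            aces += card == 11
            while total &gt; 21 and aces:    # demote an ace from 11 to 1
                total -= 10
                aces -= 1
        busts += total &gt; 21
    return busts / trials

print(f"P(dealer busts | upcard 6) ~ {dealer_bust_probability(6):.3f}")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;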
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-classification-gif-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/classification.gif 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/classification.gif 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/classification.gif 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/classification.gif 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/classification.gif 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/classification.gif 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/classification.gif 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/classification.gif 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/classification.gif 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/classification.gif 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/classification.gif 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/classification.gif"
alt="alt text here"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-classification-gif-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/classification.gif"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="alt text here" decoding="async"&gt;
&lt;/dialog&gt;
To get to any reasonable result we have to start with classification, where we &amp;ldquo;teach&amp;rdquo; the model to categorize data by showing it lots of examples with correct labels. But where do the labels come from? I manually annotated &lt;a href="https://universe.roboflow.com/cards-agurd/playing_card_classification"&gt;409 playing cards across 117 images&lt;/a&gt; using Roboflow Annotate (at first I annotated only half as many; why that wasn&amp;rsquo;t a good idea we&amp;rsquo;ll see in a minute). Once enough screenshots of cards were annotated, I could train the model to recognize the cards and predict card values on tables it has never seen before. I was able to use an &lt;a href="https://www.nvidia.com/en-us/data-center/tesla-t4/"&gt;NVIDIA T4 GPU&lt;/a&gt; inside Google Colab, which offers some GPU time for free when capacity is available.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-gpu_setup_colab-png-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/gpu_setup_colab.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/gpu_setup_colab.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/gpu_setup_colab.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/gpu_setup_colab.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/gpu_setup_colab.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/gpu_setup_colab.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/gpu_setup_colab.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/gpu_setup_colab.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/gpu_setup_colab.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/gpu_setup_colab.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/gpu_setup_colab.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/gpu_setup_colab.png"
alt="alt text here"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-gpu_setup_colab-png-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/gpu_setup_colab.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="alt text here" decoding="async"&gt;
&lt;/dialog&gt;
During training, the algorithm learns patterns from this example data, adjusting its internal parameters millions of times until it gets really good at recognizing the differences between categories (in this case, different cards). Once trained, the model can then make predictions on new, unseen data by applying the patterns it learned. With the annotated dataset ready, it was time to implement the actual computer vision model. I chose &lt;a href="https://docs.ultralytics.com/de/models/yolo11/"&gt;Ultralytics&amp;rsquo; YOLOv11&lt;/a&gt;, a leading object detection model. I set up the environment in Google Colab following the &lt;a href="https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/train-yolo11-object-detection-on-custom-dataset.ipynb"&gt;&amp;ldquo;How to Train YOLO11 Object Detection on a Custom Dataset&amp;rdquo;&lt;/a&gt; notebook. After extracting the annotated dataset from Roboflow, I began training the model using the pre-trained YOLOv11s weights as a starting point. This approach, called &lt;a href="https://en.wikipedia.org/wiki/Transfer_learning"&gt;transfer learning&lt;/a&gt;, lets the model reuse patterns already learned from millions of general images and adapt them to this specific task.
I initially set it up to &lt;a href="https://docs.ultralytics.com/guides/model-training-tips/#other-techniques-to-consider-when-handling-a-large-dataset"&gt;run for 350 epochs&lt;/a&gt;, though the model&amp;rsquo;s built-in early stopping mechanism kicked in after 242 epochs when no improvement was observed for 100 consecutive epochs. The best results were achieved at epoch 142, taking around 13 minutes to complete on the Tesla T4 GPU.
The initial results were quite promising, with an overall mean Average Precision (mAP) of 80.5% at IoU threshold 0.5. Most individual card classes achieved good precision and recall scores, with only a few cards like the 6 and Queen showing slightly lower precision values.
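The training call itself is short. A sketch using the Ultralytics API, with an illustrative dataset path:
&lt;div class="code-block" data-lang="python"&gt;&lt;pre&gt;&lt;code class="language-python"&gt;
# Training setup roughly as described above, via the Ultralytics API.
from ultralytics import YOLO

model = YOLO("yolo11s.pt")           # pre-trained weights for transfer learning
model.train(
    data="playing-cards/data.yaml",  # illustrative path to the Roboflow export
    epochs=350,
    imgsz=640,
    patience=100,                    # early stopping ended the run at epoch 242
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;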
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-run1_results-png-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/run1_results.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/run1_results.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/run1_results.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/run1_results.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/run1_results.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/run1_results.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/run1_results.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/run1_results.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/run1_results.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/run1_results.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/run1_results.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/run1_results.png"
alt="Training results showing confusion matrix and loss curves"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-run1_results-png-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/run1_results.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Training results showing confusion matrix and loss curves" decoding="async"&gt;
&lt;/dialog&gt;
However, looking at the confusion matrix and loss curves revealed some interesting patterns. While the model was learning effectively (as shown by the steadily decreasing loss), there were still some misclassifications between similar cards, particularly among the numbered cards. This highlighted exactly why annotating only half the amount of data initially &amp;ldquo;wasn&amp;rsquo;t a good idea&amp;rdquo;: more training examples would likely improve these edge cases and reduce confusion between similar-looking cards. My first attempt at solving the remaining accuracy issues was to add another layer to the workflow by sending the detected cards to Anthropic&amp;rsquo;s Claude API for additional OCR processing.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-claude_vision_workflow_results-png-4" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/claude_vision_workflow_results.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/claude_vision_workflow_results.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/claude_vision_workflow_results.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/claude_vision_workflow_results.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude_vision_workflow_results.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/claude_vision_workflow_results.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/claude_vision_workflow_results.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/claude_vision_workflow_results.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude_vision_workflow_results.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/claude_vision_workflow_results.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/claude_vision_workflow_results.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/claude_vision_workflow_results.png"
alt="Roboflow workflow with Claude API integration"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-claude_vision_workflow_results-png-4" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/claude_vision_workflow_results.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Roboflow workflow with Claude API integration" decoding="async"&gt;
&lt;/dialog&gt;
This hybrid approach was very effective: combining YOLO&amp;rsquo;s object detection, which dynamically crops the blackjack table down to individual cards, with Claude&amp;rsquo;s vision capabilities yielded 99.9% accuracy on the predicted cards. However, this solution came with a significant drawback: the additional API round-trip and the large model&amp;rsquo;s processing overhead made it impractical for real-time gameplay.&lt;/p&gt;
&lt;p&gt;Seeking a faster solution, I implemented the same workflow &lt;a href="https://github.com/JaidedAI/EasyOCR"&gt;locally using EasyOCR&lt;/a&gt; instead. EasyOCR seems to be really good at extracting black text on a white background but &lt;a href="https://stackoverflow.com/questions/68261703/how-to-improve-accuracy-prediction-for-easyocr"&gt;might struggle with everything else&lt;/a&gt;. While it correctly identified card numbers when it detected them, it failed to recognize around half of the cards in the first place, even when fed pre-cropped card images directly from the YOLO model. This inconsistency made it unreliable for the application.
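For reference, the local OCR step looked roughly like this sketch (crop extraction omitted; the allowlist value is my assumption):
&lt;div class="code-block" data-lang="python"&gt;&lt;pre&gt;&lt;code class="language-python"&gt;
# Run EasyOCR over card crops produced by the YOLO model.
import easyocr

reader = easyocr.Reader(["en"], gpu=False)  # downloads its model on first run

def read_rank(crop_path):
    # An allowlist keeps EasyOCR from guessing characters outside card ranks.
    results = reader.readtext(crop_path, allowlist="0123456789AJQK")
    return results[0][1] if results else None  # entries are (bbox, text, conf)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;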
Rather than continue with band-aid solutions, I decided to go back and improve my dataset. I doubled the training data by adding another 60 screenshots with the same train/test split as before. More importantly, I went through all the previous annotations and fixed many of the bounding polygons. Several misidentifications turned out to be the model detecting face-down dealer cards as valid cards, which happened because some annotations for face-up cards inadvertently included parts of the card backs next to them. The improved dataset and cleaned annotations delivered what I was hoping for: the confusion matrix now shows a much cleaner diagonal pattern, indicating that the model correctly identifies most cards without the cross-contamination issues we saw earlier.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-run_best-png-6" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/run_best.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/run_best.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/run_best.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/run_best.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/run_best.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/run_best.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/run_best.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/run_best.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/run_best.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/run_best.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/run_best.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/run_best.png"
alt="Final training results with improved dataset"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-run_best-png-6" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/run_best.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Final training results with improved dataset" decoding="async"&gt;
&lt;/dialog&gt;
Both the training and validation losses converge smoothly without signs of overfitting, while the precision and recall metrics climb steadily to plateau near perfect scores. The mAP@50 reaches an impressive 99.5%. Most significantly, the confusion matrix now shows that the model has virtually eliminated false positives with background elements. The &amp;ldquo;background&amp;rdquo; column (rightmost) in the confusion matrix is now much cleaner, with only minimal misclassifications of actual cards as background noise.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-local_run_interference_visual-png-7" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/local_run_interference_visual.png 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/local_run_interference_visual.png 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/local_run_interference_visual.png 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/local_run_interference_visual.png 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/local_run_interference_visual.png 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/local_run_interference_visual.png 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/local_run_interference_visual.png 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/local_run_interference_visual.png 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/local_run_interference_visual.png 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/local_run_interference_visual.png 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/local_run_interference_visual.png 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/local_run_interference_visual.png"
alt="Real-time blackjack card detection and odds calculation"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-local_run_interference_visual-png-7" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/local_run_interference_visual.png"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Real-time blackjack card detection and odds calculation" decoding="async"&gt;
&lt;/dialog&gt;
With the model trained and performing well, it was time to deploy it and play some blackjack. Initially, I tested the system using &lt;a href="https://docs.roboflow.com/deploy/serverless-hosted-api-v2"&gt;Roboflow&amp;rsquo;s hosted API&lt;/a&gt;, which took around 4 seconds per inference, far too slow for practical gameplay. However, running the model locally on my laptop dramatically improved performance, achieving inference times of less than 0.1 seconds per image (1.3ms preprocess, 45.5ms inference, 0.4ms postprocess). I then &lt;a href="https://python-mss.readthedocs.io/"&gt;integrated the model with MSS&lt;/a&gt; to capture a real-time feed of my browser window. The system automatically overlays the detected cards with their predicted values and confidence scores.
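The capture-and-detect loop is conceptually simple. A sketch with an illustrative capture region and weights path:
&lt;div class="code-block" data-lang="python"&gt;&lt;pre&gt;&lt;code class="language-python"&gt;
# Grab the browser region with MSS and run the trained weights on each frame.
import mss
import numpy as np
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # path from the training run
region = {"top": 100, "left": 100, "width": 1280, "height": 720}

with mss.mss() as sct:
    while True:
        frame = np.array(sct.grab(region))[:, :, :3]  # drop alpha: BGRA to BGR
        result = model(frame, verbose=False)[0]
        for box in result.boxes:
            label = result.names[int(box.cls)]
            print(label, f"{float(box.conf):.2f}")  # overlay/odds logic goes here
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;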
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-black_jack_odds_demo-gif-8" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/black_jack_odds_demo.gif 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/black_jack_odds_demo.gif 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/black_jack_odds_demo.gif 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/black_jack_odds_demo.gif 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/black_jack_odds_demo.gif 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/black_jack_odds_demo.gif 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/black_jack_odds_demo.gif 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/black_jack_odds_demo.gif 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/black_jack_odds_demo.gif 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/black_jack_odds_demo.gif 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/black_jack_odds_demo.gif 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/black_jack_odds_demo.gif"
alt="Overview of selected fitted curves"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-black_jack_odds_demo-gif-8" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/black_jack_odds_demo.gif"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Overview of selected fitted curves" decoding="async"&gt;
&lt;/dialog&gt;
The final implementation successfully combines the pieces: the computer vision model detects and identifies cards in real-time, feeds this information to the Monte Carlo simulation, and displays both the card recognition results and the calculated odds directly on screen - do not try this at your local (online) casino!&lt;/p&gt;</description></item><item><title>Modeling Glycemic Response with XGBoost</title><link>https://philippdubach.com/posts/modeling-glycemic-response-with-xgboost/</link><pubDate>Fri, 30 May 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/modeling-glycemic-response-with-xgboost/</guid><description>&lt;br&gt;
&lt;p&gt;Earlier this year I wrote how &lt;a href="https://philippdubach.com/posts/i-built-a-cgm-data-reader/"&gt;I built a CGM data reader&lt;/a&gt; after wearing a continuous glucose monitor myself. Since I was already logging my macronutrients and learning more about molecular biology in an &lt;a href="https://ocw.mit.edu/courses/res-7-008-7-28x-molecular-biology/"&gt;MIT MOOC&lt;/a&gt;, I became curious: given a meal&amp;rsquo;s macronutrients (carbs, protein, fat) and some basic individual characteristics (age, BMI), could a machine learning model predict the shape of my postprandial glucose curve? I came across &lt;a href="https://www.cell.com/cell/fulltext/S0092-8674(15)01481-6"&gt;Zeevi et al.&lt;/a&gt;&amp;rsquo;s paper on Personalized Nutrition by Prediction of Glycemic Responses, which used machine learning to predict individual glycemic responses from meal data. Exactly what I had in mind. Unfortunately, neither the data nor the code were publicly available. So I decided to build my own model. In the process I wrote this &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5914902"&gt;working paper&lt;/a&gt;.&lt;/p&gt;
&lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5914902"&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-working_paper_overview-jpg-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/working_paper_overview.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/working_paper_overview.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/working_paper_overview.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/working_paper_overview.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/working_paper_overview.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/working_paper_overview.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/working_paper_overview.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/working_paper_overview.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/working_paper_overview.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/working_paper_overview.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/working_paper_overview.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/working_paper_overview.jpg"
alt="Overview of Working Paper Pages"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-working_paper_overview-jpg-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/working_paper_overview.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Overview of Working Paper Pages" decoding="async"&gt;
&lt;/dialog&gt;
&lt;/a&gt;
&lt;p&gt;The paper documents my attempt to build an open, reproducible glucose prediction pipeline, and what I learned about why that is harder than it sounds. The methodologies employed were largely inspired by &lt;a href="https://www.cell.com/cell/fulltext/S0092-8674(15)01481-6?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867415014816%3Fshowall%3Dtrue"&gt;Zeevi et al.&lt;/a&gt;&amp;rsquo;s approach. This matters because the landscape of personalized nutrition is increasingly dominated by proprietary systems. Companies like ZOE, DayTwo, and Ultrahuman all run versions of this pipeline on closed data. Open-source alternatives remain scarce.&lt;/p&gt;
&lt;h2 id="why-not-only-use-my-own-data"&gt;Why not only use my own data?&lt;/h2&gt;
&lt;p&gt;I quickly realized that training a model only on my own CGM data was not going to work. Over several weeks of diligent logging, I collected roughly 40 meal-response pairs. To make matters worse, &lt;a href="https://doi.org/10.1093/ajcn/nqaa198"&gt;Howard, Guo &amp;amp; Hall (2020)&lt;/a&gt; showed that two CGMs worn simultaneously on the same person can give discordant meal rankings for postprandial glucose, meaning some of the variance in the signal is measurement noise, not biology.&lt;/p&gt;
&lt;p&gt;To get enough data, I used the publicly available &lt;a href="https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2005143"&gt;Hall dataset&lt;/a&gt; containing continuous glucose monitoring data from 57 adults, which I narrowed down to 112 standardized meals from 19 non-diabetic subjects with their respective glucose curve after the meal (full methodology in the paper).&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-cgm-workflow-graph-jpg-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/cgm-workflow-graph.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/cgm-workflow-graph.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/cgm-workflow-graph.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/cgm-workflow-graph.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cgm-workflow-graph.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/cgm-workflow-graph.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/cgm-workflow-graph.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/cgm-workflow-graph.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cgm-workflow-graph.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/cgm-workflow-graph.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/cgm-workflow-graph.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cgm-workflow-graph.jpg"
alt="Overview of the CGM pipeline workflow"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-cgm-workflow-graph-jpg-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/cgm-workflow-graph.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Overview of the CGM pipeline workflow" decoding="async"&gt;
&lt;/dialog&gt;
&lt;h2 id="gaussian-curve-fitting"&gt;Gaussian curve fitting&lt;/h2&gt;
&lt;p&gt;Rather than trying to predict the entire glucose curve, I simplified the problem by fitting each postprandial response to a normalized Gaussian function. This gave me three key parameters to predict: amplitude (how high glucose rises), time-to-peak (when it peaks), and curve width (how long the response lasts).&lt;/p&gt;
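&lt;p&gt;A minimal sketch of the fitting step with &lt;code&gt;scipy.optimize.curve_fit&lt;/code&gt;, on made-up data; the paper&amp;rsquo;s exact parameterization may differ slightly:&lt;/p&gt;
&lt;div class="code-block" data-lang="python"&gt;&lt;pre&gt;&lt;code class="language-python"&gt;
# Fit one postprandial response to a three-parameter Gaussian.
import numpy as np
from scipy.optimize import curve_fit

def gaussian(t, amplitude, t_peak, width):
    return amplitude * np.exp(-((t - t_peak) ** 2) / (2 * width ** 2))

t = np.arange(0, 180, 15)  # minutes after the meal
g = np.array([0, 5, 18, 35, 42, 38, 27, 16, 9, 4, 2, 1], dtype=float)  # illustrative baseline-subtracted glucose

params, _ = curve_fit(gaussian, t, g, p0=[g.max(), t[g.argmax()], 30.0])
amplitude, t_peak, width = params
print(f"amplitude={amplitude:.1f}, time-to-peak={t_peak:.0f} min, width={width:.0f} min")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;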
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-cgm-fitted-curve-large1-jpg-2" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/cgm-fitted-curve-large1.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/cgm-fitted-curve-large1.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/cgm-fitted-curve-large1.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/cgm-fitted-curve-large1.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cgm-fitted-curve-large1.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/cgm-fitted-curve-large1.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/cgm-fitted-curve-large1.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/cgm-fitted-curve-large1.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cgm-fitted-curve-large1.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/cgm-fitted-curve-large1.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/cgm-fitted-curve-large1.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cgm-fitted-curve-large1.jpg"
alt="Overview of single fitted curve of cgm measurements"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-cgm-fitted-curve-large1-jpg-2" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/cgm-fitted-curve-large1.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Overview of single fitted curve of cgm measurements" decoding="async"&gt;
&lt;/dialog&gt;
&lt;p&gt;The Gaussian approximation worked surprisingly well for characterizing most glucose responses. While some curves fit better than others, the majority of postprandial responses were well captured, though there is clear variation between individuals and meals. Some responses were high-amplitude and narrow; others were more gradual and prolonged.&lt;/p&gt;
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-example-fitted-cgm-measurements-jpg-3" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/example-fitted-cgm-measurements.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/example-fitted-cgm-measurements.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/example-fitted-cgm-measurements.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/example-fitted-cgm-measurements.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/example-fitted-cgm-measurements.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/example-fitted-cgm-measurements.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/example-fitted-cgm-measurements.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/example-fitted-cgm-measurements.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/example-fitted-cgm-measurements.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/example-fitted-cgm-measurements.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/example-fitted-cgm-measurements.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/example-fitted-cgm-measurements.jpg"
alt="Overview of selected fitted curves"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-example-fitted-cgm-measurements-jpg-3" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/example-fitted-cgm-measurements.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Overview of selected fitted curves" decoding="async"&gt;
&lt;/dialog&gt;
&lt;h2 id="xgboost-pipeline"&gt;XGBoost pipeline&lt;/h2&gt;
&lt;p&gt;I then trained an XGBoost regressor with 27 engineered features including meal composition, participant characteristics, and interaction terms. XGBoost was chosen for its ability to handle mixed data types, built-in feature importance, and strong performance on tabular data. The pipeline included hyperparameter tuning with 5-fold cross-validation to optimize learning rate, tree depth, and regularization parameters. Rather than relying solely on basic meal macronutrients, I engineered features across multiple categories and implemented CGM statistical features calculated over different time windows (24-hour and 4-hour periods), including time-in-range and glucose variability metrics. Architecture-wise, I trained three separate XGBoost regressors, one for each Gaussian parameter.&lt;/p&gt;
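&lt;p&gt;A skeleton of that setup, with placeholder data and grid values standing in for the actual 27 engineered features:&lt;/p&gt;
&lt;div class="code-block" data-lang="python"&gt;&lt;pre&gt;&lt;code class="language-python"&gt;
# One tuned XGBRegressor per Gaussian parameter; data and grid are placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(112, 27))  # 112 meals x 27 engineered features
y = {name: rng.normal(size=112) for name in ("amplitude", "time_to_peak", "width")}

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "reg_lambda": [1.0, 5.0],
}

models = {}
for name, target in y.items():
    search = GridSearchCV(
        XGBRegressor(n_estimators=300, objective="reg:squarederror"),
        param_grid,
        cv=5,            # 5-fold cross-validation, as in the paper
        scoring="r2",
    )
    search.fit(X, target)
    models[name] = search.best_estimator_  # one model per curve parameter
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;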
&lt;h2 id="results"&gt;Results&lt;/h2&gt;
&lt;p&gt;The model could predict &lt;em&gt;how high&lt;/em&gt; my blood sugar rises after a meal with moderate accuracy (R² = 0.46, correlation = 0.73, p &amp;lt; 0.001). Not good enough for clinical guidance, which typically requires R² &amp;gt; 0.7, but meaningfully better than the multi-linear regression baseline (R² = 0.24).&lt;/p&gt;
&lt;p&gt;The more telling result is what the model could not do. It had no idea &lt;em&gt;when&lt;/em&gt; blood sugar would peak. The time-to-peak prediction was literally worse than guessing the average every time (R² = -0.76, p = 0.896). Curve width prediction was marginally better but still not useful (R² = 0.10). In other words: meal composition tells you something about the magnitude of your glucose spike, but almost nothing about its timing or duration. That is a meaningful finding in itself, consistent with the idea that temporal dynamics are driven by factors like gastric emptying, insulin sensitivity, and gut microbiome composition, none of which were captured in the feature set.&lt;/p&gt;
&lt;p&gt;For context, &lt;a href="https://www.sciencedirect.com/science/article/abs/pii/S1746809423012429"&gt;Cappon et al. (2023)&lt;/a&gt; trained a similar XGBRegressor on 3,296 meals from 927 healthy individuals and achieved a correlation of r = 0.48 for predicting glycemic response magnitude. Their larger dataset did not dramatically improve over my amplitude correlation of 0.73, but they also found systematic bias in predictions, suggesting that XGBoost captures the general direction well while missing individual-level variation. Separately, &lt;a href="https://www.nature.com/articles/s41598-025-01367-7"&gt;Shin et al. (2025)&lt;/a&gt; tried a bidirectional LSTM on 171 healthy adults and achieved r = 0.43, worse than XGBoost on amplitude. Deep learning does not automatically win here, especially at small-to-medium dataset sizes. Data quantity matters more than model complexity.&lt;/p&gt;
&lt;p&gt;A study on &lt;a href="https://www.nature.com/articles/s41522-025-00650-9"&gt;glycemic prediction in pregnant women&lt;/a&gt; found that adding gut microbiome data increased explained variance in glucose peaks from 34% to 42%, underscoring that meal composition alone leaves a lot on the table.&lt;/p&gt;
&lt;p&gt;The complete code, Jupyter notebooks, processed datasets, and supplementary results are available in my &lt;a href="https://github.com/philippdubach/glucose-response-analysis"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;(10/06/2025) Update: Today I came across Marcel Salathé&amp;rsquo;s &lt;a href="https://www.linkedin.com/posts/salathe_myfoodrepo-digitalhealth-precisionnutrition-activity-7337806988082393088-2Lsu?utm_source=share&amp;amp;utm_medium=member_ios&amp;amp;rcm=ACoAADeInT4BJMhtg5DSjxX1jVtIAs5w_KxZm-g"&gt;LinkedIn post&lt;/a&gt; on a publication out of EPFL: &lt;a href="https://www.frontiersin.org/journals/nutrition/articles/10.3389/fnut.2025.1539118/full"&gt;Personalized glucose prediction using in situ data only&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;With data from over 1,000 participants of the Food &amp;amp; You digital cohort, we show that a machine learning model using only food data from myFoodRepo and a glucose monitor can closely track real blood sugar responses to any meal (correlation of 0.71).&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;As expected, Singh et al. achieve substantially better predictive performance (R = 0.71 vs. my R² = 0.46, though the two metrics are not directly comparable). The most critical difference is sample size: their 1,000+ participants versus my 19 (from the &lt;a href="https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2005143"&gt;Hall dataset&lt;/a&gt;). They leveraged the &lt;a href="https://pubmed.ncbi.nlm.nih.gov/38033170/"&gt;&amp;ldquo;Food &amp;amp; You&amp;rdquo; study&lt;/a&gt;, with high-resolution nutritional intake data covering more than 46 million kcal across 315,126 dishes, 1,470,030 blood glucose measurements, and 1,024 gut microbiota samples. Both studies use XGBoost, SHAP for interpretability, cross-validation for evaluation, and a mathematical characterization of glucose responses (Gaussian curve fitting in my case, incremental AUC in theirs). The methodological overlap is reassuring; what separates the results is data at scale.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The &lt;a href="https://www.nature.com/articles/s41597-025-05851-7"&gt;CGMacros dataset&lt;/a&gt; (Das et al., Scientific Data, 2025) now provides the first publicly available multimodal dataset with CGM readings, food macronutrients, meal photos, activity data, and microbiome profiles for 45 participants. It even includes an XGBoost example script for predicting postprandial AUC. This is exactly the kind of open resource the field needs more of.&lt;/em&gt;&lt;/p&gt;
&lt;aside class="disclaimer" role="note" aria-label="Disclaimer"&gt;
&lt;div class="disclaimer-content"&gt;&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; For informational purposes only, not medical advice. Consult a qualified healthcare provider for any medical questions or conditions.&lt;/p&gt;&lt;/div&gt;
&lt;/aside&gt;</description></item><item><title>I Built a CGM Data Reader</title><link>https://philippdubach.com/posts/i-built-a-cgm-data-reader/</link><pubDate>Thu, 02 Jan 2025 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/i-built-a-cgm-data-reader/</guid><description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you&amp;rsquo;re reading this, you might also be interested in: &lt;a href="https://philippdubach.com/posts/modeling-glycemic-response-with-xgboost/"&gt;Modeling Glycemic Response with XGBoost&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Last year I put a Continuous Glucose Monitor (CGM) sensor, specifically the &lt;a href="https://www.freestyle.abbott"&gt;Abbott Freestyle Libre 3&lt;/a&gt;, on my left arm. Why? I wanted to optimize my nutrition for endurance cycling competitions. Where I live, the sensor is easy to get—without any medical prescription—and even easier to use. Unfortunately, Abbott&amp;rsquo;s &lt;a href="https://apps.apple.com/us/app/freestyle-librelink-us/id1325992472"&gt;FreeStyle LibreLink&lt;/a&gt; app is less than optimal (3,250 other people with an average rating of 2.9/5.0 seem to agree). In their defense, the web app LibreView does offer some nice reports which can be generated as PDFs—not very dynamic, but still something! What I had in mind was more in the fashion of the &lt;a href="https://ultrahuman.com/m1"&gt;Ultrahuman M1 dashboard&lt;/a&gt;. Unfortunately, I wasn&amp;rsquo;t allowed to use my Libre sensor (EU firmware) with their app (yes, I spoke to customer service).&lt;/p&gt;
&lt;p&gt;At that point, I wasn&amp;rsquo;t left with much enthusiasm, only a coin-sized sensor in my arm. The LibreView website fortunately lets you download most of your (own) data in a CSV report (&lt;em&gt;there is also a &lt;a href="https://github.com/FokkeZB/libreview-unofficial"&gt;reverse engineered API&lt;/a&gt;&lt;/em&gt;), which is nice. So that&amp;rsquo;s what I did: download the data, &lt;code&gt;pd.read_csv()&lt;/code&gt; it into my notebook, calculate summary statistics, and plot the values.
&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-libre-measurements-jpg-0" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/libre-measurements.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/libre-measurements.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/libre-measurements.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/libre-measurements.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/libre-measurements.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/libre-measurements.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/libre-measurements.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/libre-measurements.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/libre-measurements.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/libre-measurements.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/libre-measurements.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/libre-measurements.jpg"
alt="Visualized CGM Datapoints"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-libre-measurements-jpg-0" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/libre-measurements.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Visualized CGM Datapoints" decoding="async"&gt;
&lt;/dialog&gt;
After some interpolation, I had the same view the LibreLink app (which I had dismissed earlier) provides. Yet this setup allowed further analysis and visualization by adding other datapoints (workouts, sleep, nutrition) I was also collecting at the time; a condensed loading-and-alignment sketch follows the list:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Blood sugar from &lt;a href="https://www.libreview.com/"&gt;LibreView&lt;/a&gt;: Measurement timestamps + glucose values&lt;/li&gt;
&lt;li&gt;Nutrition from &lt;a href="https://macrofactorapp.com/"&gt;MacroFactor&lt;/a&gt;: Meal timestamps + macronutrients (carbs, protein, and fat)&lt;/li&gt;
&lt;li&gt;Sleep data from &lt;a href="https://sleepcycle.com/"&gt;Sleep Cycle&lt;/a&gt;: Sleep start timestamp + time in bed + time asleep (+ sleep quality, which is a proprietary measure calculated by the app)&lt;/li&gt;
&lt;li&gt;Cardio workouts from &lt;a href="https://connect.garmin.com/"&gt;Garmin&lt;/a&gt;: Workout start timestamp + workout duration&lt;/li&gt;
&lt;li&gt;Strength workouts from &lt;a href="https://www.hevyapp.com/"&gt;Hevy&lt;/a&gt;: Workout start timestamp + workout duration&lt;/li&gt;
&lt;/ul&gt;
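&lt;p&gt;Here is roughly what the loading and alignment step looked like. The file names are placeholders and the column names match my LibreView export (they vary by region and app version), so treat this as a sketch rather than a drop-in script.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import pandas as pd

# LibreView CSV exports carry a one-line report header before the column row;
# the column names below match my export and may differ by region/app version.
glucose = pd.read_csv("libreview_export.csv", skiprows=1)
glucose["timestamp"] = pd.to_datetime(glucose["Device Timestamp"], dayfirst=True)
glucose = (glucose.rename(columns={"Historic Glucose mmol/L": "glucose"})
                  [["timestamp", "glucose"]]
                  .dropna()
                  .sort_values("timestamp"))

# Resample onto a regular 5-minute grid and interpolate short gaps so the
# other event streams (meals, sleep, workouts) can be aligned against it.
trace = (glucose.set_index("timestamp")["glucose"]
                .resample("5min").mean()
                .interpolate(limit=6))      # bridge gaps up to 30 minutes

# Nutrition events from the MacroFactor export (placeholder file/columns).
meals = pd.read_csv("macrofactor_meals.csv", parse_dates=["timestamp"])
meals = meals.sort_values("timestamp")

# merge_asof tags each meal with the glucose reading closest in time.
events = pd.merge_asof(meals, trace.reset_index(),
                       on="timestamp", direction="nearest")
&lt;/code&gt;&lt;/pre&gt;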
&lt;p&gt;&lt;figure class="post-figure" style="width: 80%; margin: 1.5rem auto;"&gt;
&lt;button type="button" class="img-trigger" data-lightbox-target="lightbox-cgm-dashboard-jpg-1" aria-label="View full-size image"&gt;
&lt;picture class="img-lightbox"&gt;
&lt;source media="(max-width: 768px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=320,quality=80,format=auto/cgm-dashboard.jpg 320w,
https://static.philippdubach.com/cdn-cgi/image/width=480,quality=80,format=auto/cgm-dashboard.jpg 480w,
https://static.philippdubach.com/cdn-cgi/image/width=640,quality=80,format=auto/cgm-dashboard.jpg 640w,
https://static.philippdubach.com/cdn-cgi/image/width=960,quality=80,format=auto/cgm-dashboard.jpg 960w,
https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cgm-dashboard.jpg 1200w"
sizes="80vw"&gt;
&lt;source media="(max-width: 1024px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=768,quality=80,format=auto/cgm-dashboard.jpg 768w,
https://static.philippdubach.com/cdn-cgi/image/width=1024,quality=80,format=auto/cgm-dashboard.jpg 1024w,
https://static.philippdubach.com/cdn-cgi/image/width=1440,quality=80,format=auto/cgm-dashboard.jpg 1440w"
sizes="80vw"&gt;
&lt;source media="(min-width: 1025px)"
srcset="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cgm-dashboard.jpg 1200w,
https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=80,format=auto/cgm-dashboard.jpg 1600w,
https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=80,format=auto/cgm-dashboard.jpg 2000w"
sizes="80vw"&gt;
&lt;img src="https://static.philippdubach.com/cdn-cgi/image/width=1200,quality=80,format=auto/cgm-dashboard.jpg"
alt="Final Dashboard"
class=""
width="1200"
loading="lazy"
decoding="async"&gt;
&lt;/picture&gt;
&lt;/button&gt;
&lt;/figure&gt;
&lt;dialog id="lightbox-cgm-dashboard-jpg-1" class="lightbox-dialog" aria-label="Full-size image" data-hires="https://static.philippdubach.com/cdn-cgi/image/width=2000,quality=85,format=auto/cgm-dashboard.jpg"&gt;
&lt;form method="dialog" class="lightbox-close-form"&gt;
&lt;button type="submit" class="lightbox-close" aria-label="Close"&gt;×&lt;/button&gt;
&lt;/form&gt;
&lt;img alt="Final Dashboard" decoding="async"&gt;
&lt;/dialog&gt;
After structuring those datapoints in a dataframe and normalizing timestamps, I could quickly mark sleep (blue boxes with callouts for time in bed, time asleep, and sleep quality) and workouts (red trace segments over the glucose curve for strength workouts, green for cardio) by drawing highlighted spans and traces on top of the historic glucose trail for a set period. I also added annotations for nutrition events with their respective macronutrients.&lt;/p&gt;
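&lt;p&gt;In matplotlib terms the overlay is little more than shaded spans, recolored trace segments, and text annotations. A stripped-down version, continuing from the loading sketch above and using made-up event windows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import matplotlib.pyplot as plt

# Hypothetical event windows; in the notebook these come from the
# Sleep Cycle, Garmin/Hevy, and MacroFactor exports.
sleep_windows = [(pd.Timestamp("2024-03-01 23:10"),
                  pd.Timestamp("2024-03-02 06:40"))]
workout_windows = [(pd.Timestamp("2024-03-02 17:00"),
                    pd.Timestamp("2024-03-02 18:15"), "cardio")]
meal_annotations = [(pd.Timestamp("2024-03-02 12:30"), "62g C / 28g P / 14g F")]

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(trace.index, trace.values, color="gray", lw=1, label="glucose")

for start, end in sleep_windows:            # sleep as shaded blue boxes
    ax.axvspan(start, end, color="tab:blue", alpha=0.15)

for start, end, kind in workout_windows:    # workouts re-drawn in color on top
    seg = trace.loc[start:end]
    ax.plot(seg.index, seg.values, lw=2,
            color="tab:red" if kind == "strength" else "tab:green")

for ts, label in meal_annotations:          # macros annotated at meal time
    ax.annotate(label, (ts, trace.asof(ts)), textcoords="offset points",
                xytext=(0, 12), fontsize=8, ha="center")

plt.show()
&lt;/code&gt;&lt;/pre&gt;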
&lt;p&gt;I asked Claude to create some sample data and streamline the functions to reduce dependencies on my specific data sources. The resulting notebook loads and processes glucose readings alongside lifestyle data (nutrition, workouts, and sleep) and builds an integrated dashboard from them. Preprocessing covers interpolation of missing glucose values and timeline synchronization across the different sources; the statistics include key metrics like time-in-range and coefficient of variation. The main output is a day-by-day dashboard that overlays workout periods, nutrition events, and sleep phases onto the CGM trace, making it easy to spot patterns between lifestyle factors and blood sugar responses.&lt;/p&gt;
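&lt;p&gt;The two headline statistics are one-liners against the interpolated trace. Assuming mmol/L units as above, with the standard 3.9-10.0 mmol/L target band and the commonly cited 36% coefficient-of-variation stability threshold:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Time-in-range: share of readings inside the 3.9-10.0 mmol/L target band.
in_range = trace.between(3.9, 10.0)
time_in_range = 100 * in_range.mean()

# Coefficient of variation: glucose variability normalized by the mean;
# a CV below 36% is the commonly cited stability target for CGM data.
cv = 100 * trace.std() / trace.mean()

print(f"Time in range: {time_in_range:.1f}%  |  CV: {cv:.1f}%")
&lt;/code&gt;&lt;/pre&gt;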
&lt;p&gt;You can find the complete &lt;a href="https://github.com/philippdubach/glucose-tracker/blob/fd5992961cfb4630dad439c782430190937414a3/notebooks/data_exploration.ipynb"&gt;notebook&lt;/a&gt; as well as the sample data in my &lt;a href="https://github.com/philippdubach/glucose-tracker/"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>The Tech behind this Site</title><link>https://philippdubach.com/posts/the-tech-behind-this-site/</link><pubDate>Mon, 15 Jan 2024 00:00:00 +0000</pubDate><author>me@philippdubach.com (Philipp D. Dubach)</author><guid>https://philippdubach.com/posts/the-tech-behind-this-site/</guid><description>&lt;p&gt;This site runs on Hugo, deployed to GitHub Pages with Cloudflare CDN. Images are hosted on R2 (&lt;code&gt;static.philippdubach.com&lt;/code&gt;) with automatic resizing and WebP conversion.&lt;/p&gt;
&lt;p&gt;The core challenge was responsive images. Standard markdown &lt;code&gt;![alt](url)&lt;/code&gt; doesn&amp;rsquo;t support multiple sizes. I built a &lt;a href="https://gist.github.com/philippdubach/167189c7090c6813c5110c467cb5ebe9"&gt;Hugo shortcode&lt;/a&gt; that generates &lt;code&gt;&amp;lt;picture&amp;gt;&lt;/code&gt; elements with breakpoint-specific sources—upload once at full quality, serve optimized versions (320px mobile to 1600px desktop) automatically.&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;Updates&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;May 2026&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Cron Reliability&lt;/em&gt; — Scheduled rebuilds moved off GitHub Actions cron (drifts 30+ min, occasionally skips during platform incidents) onto a Cloudflare Worker that fires &lt;code&gt;workflow_dispatch&lt;/code&gt; on a deterministic &lt;code&gt;17 */3 * * *&lt;/code&gt; UTC schedule. Builds still run on Actions; only the trigger moved. Cadence increased from 3× daily to every 3 hours, so max publish delay for future-dated posts dropped from ~7h to ~3h. Worker source: &lt;code&gt;social-automation/build-trigger/&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Hugo 0.161.1 Upgrade&lt;/em&gt; — Bumped from 0.157.0. Byte-identical output, zero deprecation warnings. Added a local diff harness (&lt;code&gt;scripts/upgrade-diff.sh&lt;/code&gt;) that builds with two Hugo versions and diffs &lt;code&gt;public/&lt;/code&gt;, used to validate the bump as a no-op. The releases in between brought &lt;code&gt;strings.ReplacePairs&lt;/code&gt; (which collapsed an 8-call entity-decode chain in &lt;code&gt;llms-full.txt&lt;/code&gt;) and fixed enough Goldmark / &lt;code&gt;RenderShortcodes&lt;/code&gt; edge cases that 8 of 9 post-render regex passes in the markdown variant template could be dropped. One was actively harmful: the math-delimiter regex was corrupting Wikipedia URLs like &lt;code&gt;Universal_Serial_Bus_\(USB\)&lt;/code&gt; into &lt;code&gt;_$USB$_&lt;/code&gt; because the pattern was over-eager.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Index Redesign&lt;/em&gt; — Articles and projects now share the same structure: hero dropped, the featured row &lt;em&gt;is&lt;/em&gt; the masthead (red overline, 1.5px red rule, large headline), hairline divider, filter chips reframed as &amp;ldquo;browse the archive&amp;rdquo; rather than page nav. Quiet &lt;code&gt;&amp;lt;h1 class=&amp;quot;page-label&amp;quot;&amp;gt;&lt;/code&gt; for SEO and screen readers without competing visually with the featured headline. Featured card image now requests a 1200×630 landscape source matching the CSS aspect-ratio (was getting a 480×600 portrait stretched sideways, which clipped faces).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Markdown Variant Maturity&lt;/em&gt; — &lt;code&gt;/posts/&amp;lt;slug&amp;gt;/index.md&lt;/code&gt; now emits clean markdown via output-format-aware sibling shortcodes (&lt;code&gt;img.markdown.md&lt;/code&gt;, &lt;code&gt;disclaimer.markdown.md&lt;/code&gt;, &lt;code&gt;newsletter.markdown.md&lt;/code&gt;, &lt;code&gt;readnext.markdown.md&lt;/code&gt;). HTML pollution never enters the markdown stream. YAML preamble parseable by gray-matter and every major LLM tooling SDK. Dropped the &lt;code&gt;eq .Section &amp;quot;posts&amp;quot;&lt;/code&gt; gate so &lt;code&gt;/about/&lt;/code&gt;, &lt;code&gt;/research/&lt;/code&gt;, &lt;code&gt;/subscribe/&lt;/code&gt; ship real markdown content too.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Worker Audit&lt;/em&gt; — Three parallel reviews (perf, reliability, security) of the Hugo site and remaining Cloudflare Workers found nine issues worth fixing, clustered in the social-automation workers. Bluesky&amp;rsquo;s 300-char post limit was being silently exceeded when the appended URL pushed total length over budget, leaving stuck-loop posts that retried every 15 minutes. The &lt;code&gt;og:image&lt;/code&gt; URL was being fetched without domain validation (SSRF gap, low probability). Cron ticks could race the same article when a tick took longer than 15 minutes. The bluesky worker fetched each article URL twice per post. All four of those fixed, plus a GoatCounter title sanitizer at the Worker boundary so a downstream consumer that uses &lt;code&gt;innerHTML&lt;/code&gt; can&amp;rsquo;t be tricked into executing script regardless of how the rendering side is written.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Template Hardening&lt;/em&gt; — &lt;code&gt;readnext&lt;/code&gt; shortcode now emits a build-time &lt;code&gt;warnf&lt;/code&gt; when the slug doesn&amp;rsquo;t match a post, instead of silently rendering nothing. FAQ aggregation extracted to a cached partial (one scan per category instead of two on every faq/* page). Homepage and projects &lt;code&gt;ItemList&lt;/code&gt; JSON-LD capped at 20 entries (Google rich-result band). &lt;code&gt;posts.json&lt;/code&gt; prefers &lt;code&gt;.Description&lt;/code&gt; over &lt;code&gt;.Summary&lt;/code&gt; to skip 86 of 87 markdown renders at build time.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Decommissioning&lt;/em&gt; — Removed the post composer (&lt;code&gt;post-composer.pages.dev&lt;/code&gt;) and URL shortener (&lt;code&gt;pdub.click&lt;/code&gt;). Both unused. Deleted source from the repo, then deleted the Pages project, KV namespaces, D1 database via wrangler.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Accessibility&lt;/em&gt; — Sidebar wordmark &lt;code&gt;aria-label&lt;/code&gt; now starts with the visible text per WCAG 2.5.3 (voice-control users saying &amp;ldquo;click philippdubach&amp;rdquo; can now activate it). Newsletter card meta switched to &lt;code&gt;--text-secondary&lt;/code&gt; for 7.0:1 contrast on the pink tint (was 3.91:1, below AA).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;April 2026&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Agent Readiness&lt;/em&gt; — Shipped three coordinated changes so AI agents and content-aware crawlers can discover and consume the site through standardized protocols. Every response now carries a &lt;code&gt;Link:&lt;/code&gt; header (RFC 8288) advertising machine-readable resources: the api-catalog, sitemap, RSS and JSON feeds, &lt;code&gt;llms.txt&lt;/code&gt;, and a per-page markdown alternate. A new &lt;code&gt;/.well-known/api-catalog&lt;/code&gt; endpoint returns an RFC 9264 Linkset enumerating those endpoints (RFC 9727). Content negotiation now works: requesting any page with &lt;code&gt;Accept: text/markdown&lt;/code&gt; returns a markdown variant with &lt;code&gt;Content-Type: text/markdown&lt;/code&gt;, &lt;code&gt;Vary: Accept&lt;/code&gt;, and an &lt;code&gt;x-markdown-tokens&lt;/code&gt; count for LLM context-window planning. The robots.txt declares &lt;code&gt;Content-Signal: search=yes, ai-input=yes, ai-train=yes&lt;/code&gt; per draft-romm-aipref-contentsignals.&lt;/p&gt;
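&lt;p&gt;A quick client-side check of the negotiation, sketched with Python&amp;rsquo;s &lt;code&gt;requests&lt;/code&gt; (the URL is just an example page on this site):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import requests

# Ask for the markdown alternate of a page via content negotiation.
resp = requests.get(
    "https://philippdubach.com/posts/the-tech-behind-this-site/",
    headers={"Accept": "text/markdown"},
)
print(resp.headers.get("Content-Type"))       # text/markdown
print(resp.headers.get("Vary"))               # Accept
print(resp.headers.get("x-markdown-tokens"))  # size hint for context planning
print(resp.text[:200])                        # front matter, then markdown
&lt;/code&gt;&lt;/pre&gt;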
&lt;p&gt;&lt;em&gt;Worker Refactor&lt;/em&gt; — The 60-line security-headers Worker grew into four focused modules (&lt;code&gt;accept&lt;/code&gt;, &lt;code&gt;links&lt;/code&gt;, &lt;code&gt;cache&lt;/code&gt;, &lt;code&gt;index&lt;/code&gt;) with 32 unit tests. The &lt;code&gt;cache&lt;/code&gt; module is the interesting bit: Cloudflare&amp;rsquo;s edge cache doesn&amp;rsquo;t honor &lt;code&gt;Vary: Accept&lt;/code&gt; by default, so the Worker uses the Cache API with synthetic keys (&lt;code&gt;?_v=html|md&lt;/code&gt;) to keep HTML and Markdown variants isolated under the same URL. Origin fetches HTML or rewrites to &lt;code&gt;/index.md&lt;/code&gt; based on the client&amp;rsquo;s Accept header.&lt;/p&gt;
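&lt;p&gt;The Worker itself is JavaScript, but the key trick is small enough to show in Python; a hypothetical sketch of the synthetic-key derivation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def synthetic_cache_key(url: str, accept: str) -&gt; str:
    """Derive an isolated cache key per content variant.

    The edge cache does not honor Vary: Accept, so HTML and Markdown
    variants of the same URL get distinct synthetic keys instead.
    """
    variant = "md" if "text/markdown" in accept else "html"
    sep = "&amp;" if "?" in url else "?"
    return f"{url}{sep}_v={variant}"

assert synthetic_cache_key("https://philippdubach.com/about/",
                           "text/markdown").endswith("?_v=md")
&lt;/code&gt;&lt;/pre&gt;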
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;March 2026&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Hugo Upgrade&lt;/em&gt; — Upgraded from Hugo v0.128.0 to v0.157.0. Migrated deprecated &lt;code&gt;.Site.AllPages&lt;/code&gt; to &lt;code&gt;.Site.Pages&lt;/code&gt; in the sitemap template and &lt;code&gt;.Site.Data&lt;/code&gt; to &lt;code&gt;site.Data&lt;/code&gt; across navigation, structured data, and research templates. Removed a dead &lt;code&gt;readFile&lt;/code&gt; security config key from &lt;code&gt;hugo.toml&lt;/code&gt;. No breaking changes, zero deprecation warnings.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;February 2026&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Frontmatter Unification&lt;/em&gt; — Converted all 70 YAML frontmatter posts to TOML and added Key Takeaways to all 73 posts. Takeaways render as a visible summary box between the post header and content body, optimized for Generative Engine Optimization (GEO) so AI search engines can extract citation-ready passages.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Design Streamlining&lt;/em&gt; — Unified left-bordered aside components (key takeaways, newsletter CTA, disclaimer) to consistent 3px borders and aligned padding. Established a vertical spacing rhythm across post zones: key takeaways, content body, newsletter CTA, footer divider, related posts. Added breathing room around images (1.5rem padding). Refined key takeaways heading to 0.85rem uppercase label with square bullets.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Homepage Redesign&lt;/em&gt; — Rebuilt the homepage with a tabbed layout (Articles/Projects), year dividers, and thumbnail images served via Cloudflare Image Resizing. Consolidated navigation into a unified sidebar.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Security Headers Worker&lt;/em&gt; — Deployed a dedicated Cloudflare Worker on &lt;code&gt;philippdubach.com/*&lt;/code&gt; that injects HSTS, CSP with &lt;code&gt;frame-ancestors&lt;/code&gt;, COEP, COOP, and &lt;code&gt;Permissions-Policy&lt;/code&gt; headers. GitHub Pages doesn&amp;rsquo;t process &lt;code&gt;_headers&lt;/code&gt; files, so the Worker fills that gap. SHA-pinned all GitHub Actions and added Hugo binary checksum verification in CI.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Machine-Readable Feeds&lt;/em&gt; — Added &lt;a href="https://philippdubach.com/feed.json"&gt;JSON Feed 1.1&lt;/a&gt; alongside RSS, a &lt;a href="https://philippdubach.com/api/posts.json"&gt;Posts API&lt;/a&gt; for programmatic access, and &lt;code&gt;llms.txt&lt;/code&gt;/&lt;code&gt;llms-full.txt&lt;/code&gt; for AI crawler discovery. All output formats configured in &lt;code&gt;hugo.toml&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;GoatCounter &amp;ldquo;Most Read&amp;rdquo; API&lt;/em&gt; — Built a Cloudflare Worker proxy that queries the GoatCounter API for the top 10 posts over the past 7 days. The footer&amp;rsquo;s &amp;ldquo;Most Read&amp;rdquo; section now fetches live data from this worker instead of a static list.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;FAQ Section&lt;/em&gt; — New &lt;code&gt;/faq/&lt;/code&gt; section with per-category pages (Finance, AI, Tech, Economics, Medicine). Each post can define &lt;code&gt;faq&lt;/code&gt; entries in frontmatter; Hugo aggregates them into browsable FAQ pages with &lt;code&gt;FAQPage&lt;/code&gt; structured data for search engines.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Readnext Shortcode&lt;/em&gt; — Inline &amp;ldquo;Related&amp;rdquo; link to another post: &lt;code&gt;{{&amp;lt; readnext slug=&amp;quot;post-slug&amp;quot; &amp;gt;}}&lt;/code&gt;. Links are validated against live permalinks at build time.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;RSS Feed Fixes&lt;/em&gt; — Stripped lightbox overlay elements from full-content RSS to prevent images appearing twice in feed readers. Added XSLT stylesheet for browser-friendly RSS rendering.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cloudflare Cache Purge&lt;/em&gt; — GitHub Actions deployment now automatically purges the Cloudflare cache after each build.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Research Page&lt;/em&gt; — Dynamic &lt;code&gt;/research/&lt;/code&gt; page pulling publication data from &lt;code&gt;data/research.yaml&lt;/code&gt; with SSRN links, DOIs, and structured data.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;January 2026&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Social Automation &amp;amp; AI Model Upgrade&lt;/em&gt; — Upgraded Workers AI model from Llama 3.1 8B to &lt;strong&gt;Llama 4 Scout 17B&lt;/strong&gt; for better post generation. Added Twitter/X automation worker alongside Bluesky. AI generates neutral, non-clickbait posts with extensive banned word filtering.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;UI/UX Polish&lt;/em&gt; — Fixed mobile footer spacing consistency. Increased homepage post spacing (3.75rem). Disclaimers now only display on individual posts, hidden on homepage. Centered related posts heading.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Content Organization&lt;/em&gt; — Taxonomy system with categories (Finance, AI, Medicine, Tech, Economics) and types (Project, Commentary, Essay, Review). Hugo generates browsable &lt;code&gt;/categories/&lt;/code&gt; and &lt;code&gt;/types/&lt;/code&gt; pages.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer Shortcode&lt;/em&gt; — Six types: finance, medical, general, AI, research, gambling. Syntax: &lt;code&gt;{{&amp;lt; disclaimer type=&amp;quot;finance&amp;quot; &amp;gt;}}&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;IndexNow Integration&lt;/em&gt; — Automated submissions via GitHub Actions for faster search engine discovery. Only pings recently changed URLs based on &lt;code&gt;lastmod&lt;/code&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;December 2025&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Code Blocks&lt;/em&gt; — Syntax highlighting via Chroma with line numbers in table layout. GitHub-inspired color theme.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Newsletter System&lt;/em&gt; — &lt;a href="https://philippdubach.com/posts/building-a-no-tracking-newsletter-from-markdown-to-distribution/"&gt;Integrated email subscriptions&lt;/a&gt; via Cloudflare Workers + KV. Welcome emails via Resend.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Security &amp;amp; Performance Audit&lt;/em&gt; — Fixed multiple H1 tags per page. Hardened CSP with &lt;code&gt;frame-ancestors&lt;/code&gt;. Added preconnect hints for external domains. Added &lt;code&gt;seoTitle&lt;/code&gt; frontmatter for long titles.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;November 2025&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Shortcodes&lt;/em&gt; — &lt;a href="https://gist.github.com/philippdubach/b703005536d6030c87e17d21cb0d430b"&gt;HTML table shortcode&lt;/a&gt;. Lightbox support on images.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;June 2025&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;SEO &amp;amp; Math&lt;/em&gt; — &lt;a href="https://gist.github.com/philippdubach/39838f8e9e1b9fb085947a6b92062e0a"&gt;Open Graph integration&lt;/a&gt; for social previews. Per-post keyword management. &lt;a href="https://gist.github.com/philippdubach/42ef6e05f5c44b76ef3f66f27a17c41e"&gt;LaTeX rendering via MathJax 3&lt;/a&gt; (conditional loading).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;May 2025&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Full Rebuild&lt;/em&gt; — Migrated from &lt;a href="https://github.com/hugo-sid/hugo-blog-awesome"&gt;hugo-blog-awesome&lt;/a&gt; fork to fully custom Hugo build.&lt;/p&gt;</description></item></channel></rss>