Confession before we start. This article was written in collaboration with an LLM. The article argues that LLMs cannot do the kind of cognitive work that produced this article. Both statements are true. The space between them is what I want you to look at.

I had a tension I couldn't name. Something about how everyone keeps describing this new suite of tools felt off, but I couldn't put my finger on the off-ness. So I did what anyone with a laptop and a moderate AI bill does in 2026: I took the tension to a chatbot. Twenty minutes of conversation later, I had names for things I had been sensing for months. I also had a clearer view of why the chatbot, for all its help, would never have noticed the tension in the first place. Fifty minutes after that, with research checked and prose constructed, this article came into being. And it wouldn't have without the linguistic power of an LLM.

That's the thesis.

Before we go further: I'm not here to tell you AI is overhyped or under-hyped. The camps have been at this for three years, producing a lot of heat and not much that helps you make a decision on Thursday afternoon. Watching the battle with a bowl of popcorn can be entertaining and occasionally informative, but I want to look at something deeper than the usual battle lines reach.

The actual thing is consequential. The pace at which knowledge work is being reshaped is something we haven't seen in living memory. The blast radius is wide. Which means the quality of our thinking about these tools matters, and the cost of getting it wrong is high.

So this article is going to try to do something the camps haven't: name what's actually going on underneath these tools, in language useful for people who have to make consequential decisions about delegation, hiring, and capital allocation. Because the underlying technology is changing faster than the vocabulary we use to describe it, and the vocabulary gap is where the expensive mistakes live.

The place where humans operate in business and society is changing. It always has been, but never at this pace and never across this much of knowledge work at once. Which means the question of where humans still bring a distinct contribution is interesting and useful, personally and economically.

I think that place is discomfort. Something being off. Something not smelling right, in a domain where AI systems have no nose. The capacity to feel that something is wrong before you can name why is the function I'm pointing at. We call it intuition. It comes from aeons of biological evolution acting on our sense-making organ, layered with our own direct, complex, and often subconscious experiences.

This capacity isn't a soft skill. It's a structural feature of how humans process the world, and current AI tools, on their own, cannot reach the place it lives.

Here's what I mean by "structural feature."

Human cognition has a substrate that predates language by a long way. Our basic machinery for pattern recognition, threat detection, social modeling, prediction, and emotional integration evolved over millions of years. Language is comparatively recent (roughly 100,000-300,000 years), and the strong contemporary evidence is that it interacts with older cognitive systems without being the medium in which thinking actually happens. Language shapes thought in real ways. It just isn't the substrate of thought.

The evidence comes from several directions and they triangulate.

A 2024 Nature paper by Evelina Fedorenko, Steven Piantadosi, and Edward Gibson (MIT and UC Berkeley) makes the case in its title: "Language is primarily a tool for communication rather than thought." Their decade-plus of brain imaging shows that the brain's language network is highly specialized for language and largely separate from the networks that do reasoning, math, music, and social cognition. Patients with severe aphasia who have lost most or all of their language often retain the ability to solve logic puzzles, do multi-step arithmetic, reason about probability, play chess, and navigate social situations. The language is gone; the thinking is intact.

Pre-linguistic infants reason about objects, agents, numbers, and social interactions before they have words for any of it. Crows plan tool use. Octopuses solve novel problems. None of these systems run on syntax.

One piece of evidence that often goes unmentioned: people vary substantially in how much they use inner speech at all. Some have rich running internal monologues. Others have very little verbal self-talk. On most cognitive tasks, performance is comparable across the spectrum. This is an active line of research, and the variation is exactly what you'd expect if language is the interface and not the substrate.

Antonio Damasio's somatic marker hypothesis adds another layer: affective bodily states do real work in decision-making, in ways pure reasoning can't replicate. Jonathan Haidt's social intuitionist model finds the same pattern in moral judgment, where intuitions arrive first and language justifies them after the fact. George Lakoff and Mark Johnson's work on embodied cognition argues that most of our abstract reasoning is a metaphorical extension of bodily experience.

There is a dissenting camp here. Jerry Fodor's "language of thought" hypothesis argued for a mental language underneath thinking, and it still has serious defenders. But the empirical center of gravity over the last two decades has shifted toward the view I'm describing: language and substrate interact constantly, but a lot of the heavy lifting in cognition happens in places language can't fully reach.

Call this the substrate: the older, richer cognitive machinery your brain runs on, much of which works without recruiting language. Language is the interface: how you and other people get the contents of that substrate in and out of each other's heads. The two interact constantly. The interaction doesn't make them the same thing.

[Image: two horizontal layers connected by translucent organic roots. The lower layer shows amber biological root textures; the upper layer shows cool blue circuit-board patterns. Caption: Substrate beneath, interface above, organic connection between. Image generated with Gemini.]

Now consider how LLMs work.

An LLM is trained on next-token prediction over a massive text corpus. Its internal representations are vectors in high-dimensional spaces. Recent mechanistic interpretability work by Anthropic and others has shown that these internal representations are surprisingly structured. The 2024 "Scaling Monosemanticity" research on Claude 3 Sonnet found discrete, interpretable features that encode concepts, including ones that aren't obviously tied to specific surface tokens. So in a real sense LLMs do have internal representations that are more than strings of words.

Here's the catch. Those internal features can only encode patterns that show up in the training data. The training data is, overwhelmingly, language. (Multimodal models extend the corpus to images, audio, and video, but the data is still curated, structured, and consciously produced. It is the surface of human experience, not the substrate that generated it.) An LLM's internal representations are shaped by what language carries, and bounded by what the training corpus can put in front of them.
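
To make "bounded by the corpus" concrete, here is a minimal sketch of next-token prediction in its simplest possible form: a bigram counter, not a neural network. Everything in it is hypothetical and chosen for illustration (the three-sentence `corpus`, the `train_bigram` and `predict_next` names). Production LLMs learn dense vector representations rather than count tables, so treat this as a toy with the same shape of training objective, not as a description of how any real model is built.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if never seen."""
    followers = counts.get(token.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical toy corpus: the only "experience" this model will ever have.
corpus = [
    "the patient is stable",
    "the patient is stable",
    "the patient is improving",
]
model = train_bigram(corpus)

print(predict_next(model, "is"))       # most common continuation in the text
print(predict_next(model, "patient"))  # the only continuation in the text
print(predict_next(model, "machine"))  # never appeared, so nothing to predict
```

The toy model continues what the text said most often and has nothing to offer for a word the text never contained. Real models generalize far beyond a lookup table, but the argument here is that they generalize along directions the corpus supports, which is the same boundary at a much larger scale.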

Where the human's substrate predates language and runs on machinery language can only partially express, the LLM's substrate is its interface. The thing on top is also the thing underneath. There is no older, richer layer holding the weight.

There's also no consequence-centering. An LLM has no stake in what happens after its response. It doesn't persist as a continuous entity with a future to protect or a reputation to maintain. Its training optimized it to be useful inside the context window; outside that context, nothing remains to care. The cost of being wrong, for a human, is part of how the cognition works. For an LLM, it isn't there.

Worth noting: this description is about LLMs specifically, not about all AI architectures. Non-LLM approaches (world models, embodied agents, continual-learning systems) are being actively developed precisely to address the substrate gap. Whether any of them succeeds, and on what timeline, is open. The vocabulary I'm offering is about the tools currently dominating the discourse and the deployment decisions in front of you. It isn't a prophecy about AI in general.

This is not a critique of how well LLMs work. They work very well at what they do. It is a description of the kind of thing they are.

Knowing what LLMs are doesn't stop them from feeling like something else.

Thinking clearly about these tools is hard because the interactions feel humanlike at every register we put them through. LLMs argue, hedge, encourage, joke, sympathize, push back when you're wrong, admit when they're not sure (at least sometimes). Then they confabulate a citation that doesn't exist, get stuck in a loop, or fail at something a five-year-old would handle, in a way that makes you want to tear your hair out. Our expectations break because our only prior interactions with entities capable of this kind of linguistic facility have been with other humans, who function very differently underneath.

The reason this keeps happening, even to people who know better, is structural. In humans, language and person-detection co-evolved. The same neural and cognitive machinery that lets you tell that another person is in the room with you also runs on the linguistic signal that person produces. When something else produces the same signal at sufficient quality, that machinery fires anyway. The anthropomorphizing isn't a bug in your reasoning; it's the predictable output of an interface-first system running through a person-detector built to read humans through language.

This is the category error currently sitting in many boardrooms.

Smart leaders are making capital-allocation, headcount, and delegation decisions inside it. They're not naive about AI. They've read the briefings. They know "LLMs aren't conscious." But the interaction with the tool keeps feeling enough like working with a colleague that the operational decisions get made inside that feeling rather than inside the structural reality. You hire AI to do a job. You measure it on the deliverables. You free up the budget that used to fund the human. And then six months later something quietly goes wrong in a place no one was watching, because the entity you assigned the work to has no taste buds to tell when something's gone bitter or spoiled, and no stake in what happens after the output ships.

The expensive mistakes don't come from the moments you know you're talking to an LLM. They come from the moments you forget.

Three categories of cognitive work fall out of the substrate-interface gap with particular force.

The first is instinct: pattern recognition trained on years of unstructured, lived, often unrecorded experiences. A doctor knows a patient is sicker than the chart shows. An investor smells a quarter that doesn't add up before the numbers tell her why. An experienced operator hears a machine that's slightly off before any gauge moves. None of these people can fully articulate what they noticed. They learned the patterns by being in the room for years, and the learning happened beneath the level of language. LLMs can pattern-match within what's been written down. They can't pattern-match across what was never said.

The second is multi-signal weak-evidence integration: weighing many weak, ambiguous, slow-feedback signals to reach a high-stakes judgment. Clinical diagnosis under uncertainty. Hiring decisions where every individual data point is unreliable. Executive judgment about whether to push a project that has mixed indicators. The pattern is the same in each: no single signal is strong enough, and the feedback on whether you got it right is delayed by months or years. Humans build the weighting through years of watching outcomes drift away from predictions. The training signal for this work is too sparse, too slow, and too ambiguous to scale into the corpora that produce LLMs.

The third is motivated judgment, which is where the no-stake property of LLMs starts mattering operationally. Humans don't just have stakes in the outcome; the stakes are integrated into the cognitive process. A surgeon's cognition while operating is shaped by the fact that the patient could die. A pilot's cognition while landing in weather is shaped by the cost of a wrong call. A leader's cognition about a layoff is shaped by knowing whose mortgage gets defaulted on. This isn't sentimentality. It's somatic-marker integration in the Damasio sense: the bodily and affective weight of the stakes is what separates a system that could make the right call from one that does make it, because being wrong costs something. An LLM can simulate stake-aware reasoning when prompted. The simulation sits outside the cognitive process. The stakes don't bend the cognition itself.

These three gaps share something. They're not capability gaps in the conventional sense; they're training-signal gaps. The kind of cognition each one requires runs on signal that doesn't make it into any training corpus at the scale current methods need. Years of unrecorded lived experience. Sparse, delayed, ambiguous outcome feedback. Stakes that shape cognition because they're real. None of it transfers cleanly to text.

Which is why getting AI deployment right isn't a question of capability matching. It's a question of which work has a substrate that has to stay human, and which doesn't.

[Image: a silhouetted figure works on a glowing laptop at the circuit-board boardroom table. Saplings rise from the table near the figure; an amber substrate is visible beneath. Caption: The substrate function in action. Image generated with Gemini.]

The personal version is simpler than the org version, and you can start tomorrow.

The discipline isn't "review what the AI produced before shipping it." That's still treating the LLM as a colleague whose work you spot-check. The discipline is staying in the cognitive loop while the work is happening. Co-explore the problem with the tool. Notice when the framing it offers feels too clean, or when its examples skip the case that matters, or when its confident answer is answering a slightly different question than the one you asked. Pull on the discomfort. The places where something feels off are where the substrate is doing work the LLM can't do alone.

The version of this article you're reading is the worked example. It was built in an hour-long, iterative conversation with an LLM. The LLM helped me articulate, sharpen, find citations, anticipate counterarguments, and tighten prose. It also, repeatedly, overreached: claimed too much, picked metaphors that broke voice, leaked working-document scaffolding into the prose, drifted toward formulations that pattern-matched to camps I was trying to differentiate from. Every one of those moves had to be caught by me, in flight, while the work was happening. Not after.

That asymmetry is the personal practice. Substrate work doesn't just feed in at the start. It stays active throughout. The interface tool is faster and broader; the substrate function is the thing that catches when the interface is wrong.

The organizational version is harder, because it requires rethinking how you frame the question.

Most companies are asking "where can we use AI," a deployment question that treats each task as a slot to fill with either a human or a machine. The frame the substrate-interface gap suggests is different: what collaboration does each task need? Some tasks have feedback structure that AI can train on cleanly, and you can hand those over with light human oversight. Others live in the substrate zone and need humans in the cognitive loop throughout, with AI as a faster interface but never the primary cognition. Still others shouldn't be touched by AI at all, because the substrate development the work produces is itself the long-term asset.

That last category is the one most easily missed and most expensive to lose. Klarna is the canonical case. After shrinking its workforce through layoffs and a hiring freeze, the company announced in early 2024 that its OpenAI-powered assistant was doing the work of roughly 700 full-time customer service agents. By 2025, with quality dropping, it was hiring human agents again. CEO Sebastian Siemiatkowski publicly acknowledged that cost had weighed too heavily in those decisions, at the expense of quality. The lesson generalizes. It wasn't only that the AI couldn't handle the hard cases. It was that there was no longer anyone in the org whose substrate had been built by years of handling hard cases. Today's noticers could be replaced. Tomorrow's noticers had nowhere to grow.

The implication is not "don't deploy AI." The implication is that AI deployment is a decision about which substrate-development pipelines you preserve and which you let dry up. That decision belongs at the level of people who think about succession, not the level of people who think about cost savings.

Right now the vocabulary in use mostly treats AI as a colleague-substitute or a magical capability, and both are wrong in ways that lead to expensive mistakes. The frame that fits is asymmetric collaboration: the substrate function (the noticing, the discomfort, the bitter taste) stays with the humans; the interface function gets shared with the tool. The mistakes don't come from AI being too powerful or too weak. They come from misreading what kind of thing it is.

The confession at the start of this article holds. It was written in collaboration with an LLM, and it argues that LLMs can't do the kind of work that produced it. Both still true. The space between them is the whole point.