Francois Chollet on Measuring Intelligence, the ARC Prize, and Why LLMs Are Not the Path to AGI
The creator of Keras and the ARC benchmark makes the most technically precise case for why LLMs are memorization engines, not intelligence — and puts $1 million on the line to prove it.
Top Claims — Verdict Check
Current AI benchmarks measure memorization, not intelligence — the ARC benchmark is the corrective
🟢 Real: “Every time a model scores well on a benchmark, we discover it memorized the patterns rather than learned the concepts. ARC was designed to be unsolvable by memorization — every puzzle requires genuine abstraction from novel inputs. [representative paraphrase]”
LLMs are sophisticated pattern matchers that cannot perform genuine abstraction or reasoning from novel situations
🟢 Real: “A language model can retrieve and recombine patterns from its training data with remarkable fluency. What it cannot do is encounter a truly novel situation and generate a solution from first principles. That is the definition of intelligence, and LLMs do not have it. [representative paraphrase]”
Intelligence should be measured as skill-acquisition efficiency, not task performance
🟢 Real: “If you train on 10 billion chess games and play chess well, that is not intelligence. Intelligence is learning to play chess from 100 games. The measure that matters is how efficiently you acquire new skills from limited experience. [representative paraphrase]”
The ARC Prize ($1M) will prove that solving general intelligence requires fundamentally different approaches than scaling LLMs
🟡 Partially True: “We put up a million-dollar prize because we believe no amount of LLM scaling will solve ARC. If someone does solve it, they will have built something fundamentally different from a language model — and that will be a genuine step toward AGI. [representative paraphrase]”
The AI field is suffering from a measurement crisis — we don't have good metrics for what actually matters
🟢 Real: “Goodhart's Law is destroying AI research. Every benchmark becomes a target, and when a model optimizes for the target, the benchmark stops measuring what it was designed to measure. We are drowning in impressive numbers that mean less than we think. [representative paraphrase]”
What's Real
Chollet's measurement critique is the most technically grounded argument in the current AI discourse. The benchmark contamination problem is documented: studies have shown that GPT-4's performance on coding benchmarks like HumanEval drops significantly when evaluated on problems published after its training cutoff — the model was performing well partly because it had seen similar problems during training, not because it could code in a generalizable way. The ARC benchmark was specifically designed to resist this: each puzzle requires identifying an abstract pattern from 2-3 examples and applying it to a novel input. The puzzles are easy for humans, including children (roughly 85% average success rate), yet extremely hard for frontier AI models (best performance around 34% as of the ARC Prize 2024 competition). This gap — between human ease and AI difficulty — precisely targets the abstraction capability Chollet argues LLMs lack. The skill-acquisition efficiency framework from his 2019 paper 'On the Measure of Intelligence' is one of the few rigorous attempts to define what intelligence actually means in a testable way. It's not a thought experiment — it's a formal mathematical framework.
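To make "abstraction from 2-3 examples" concrete, here is a minimal Python sketch of the publicly released ARC task format: each task is a JSON file with a few "train" input/output grid pairs and one or more "test" pairs, where grids are lists of lists of integers 0-9. The file path, scoring loop, and placeholder solver below are illustrative, not part of Chollet's own tooling.

```python
import json

def load_task(path):
    """Load one ARC task (path is illustrative, e.g. 'data/training/0a1b2c3d.json')."""
    with open(path) as f:
        return json.load(f)

def score_task(task, solver):
    """Exact-match scoring: every cell of every predicted test output must be correct."""
    correct = 0
    for pair in task["test"]:
        predicted = solver(task["train"], pair["input"])
        correct += int(predicted == pair["output"])
    return correct / len(task["test"])

# A trivial "solver" that just echoes the input. It fails on nearly every task,
# which is the point: there is nothing to memorize, only a rule to abstract
# from the handful of train examples and apply to an unseen test input.
identity_solver = lambda train_pairs, test_input: test_input
```

Scoring is all-or-nothing per test grid, which is part of what makes partial pattern matching insufficient.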
What's Hype
The 'LLMs cannot do this and scaling won't fix it' claim, while directionally sound for pure next-token prediction, has been partially complicated by approaches like OpenAI's o1 and o3 models. The o3 model reportedly achieved around 75-88% on ARC-AGI in late 2024 (with high compute budgets), though this came at enormous inference cost and the results are debated. Chollet acknowledged this was a meaningful advance while arguing it still relied on massive compute rather than efficient intelligence. The $1M prize structure also has limitations: it tests a very specific kind of abstraction (visual-spatial pattern matching on 2D grids) and solving ARC does not necessarily demonstrate general intelligence any more than solving chess demonstrates it. Chollet is honest about this, but the public framing of ARC as 'the intelligence test' overstates what it measures. The implicit claim that there is a fundamentally different architecture waiting to be discovered — one that will solve ARC elegantly — is a research bet, not a demonstrated finding.
What They Missed
The commercial irrelevance of the intelligence debate is the gap in Chollet's framework. Whether LLMs are 'truly intelligent' or 'merely memorizing' matters enormously for AGI research but barely at all for business applications. A customer service chatbot that resolves 75% of tickets through pattern matching is commercially valuable regardless of whether it 'understands' the customer's problem. Whether AI is 'really' intelligent is a philosophical question that most business owners can safely ignore — what matters is whether it reliably does the task. The cost dimension of the ARC Prize results is important: if o3 can score 75% on ARC but costs $100+ per puzzle in compute, the theoretical capability is economically impractical. The practical question isn't 'can AI reason?' but 'can AI reason at a cost that makes products viable?' The framework also doesn't address the hybrid approaches that are actually working in production — RAG, tool use, chain-of-thought, agent architectures — which achieve reliable task completion not through intelligence but through engineering scaffolding around limited capabilities.
The One Thing
If your AI product works because of memorization rather than intelligence, that's fine for business — but know the difference, because it determines which problems you can and cannot solve.
So What?
- Don't confuse benchmark scores with capability — when evaluating AI models for your business, test on YOUR data and YOUR edge cases, not published benchmarks that may be contaminated
- The ARC Prize results show that genuine reasoning is orders of magnitude more expensive than pattern matching — price your AI features accordingly and don't promise reasoning capabilities your model can't deliver cost-effectively
- Chollet's framework gives you a vocabulary for distinguishing tasks your AI can handle (pattern matching, retrieval, summarization) from tasks it can't (novel problem-solving, genuine analysis of unprecedented situations) — use this distinction in your product roadmap
Action Items
1. Try 10 ARC puzzles yourself at arcprize.org — they take 2 minutes each and permanently recalibrate your intuition about what 'intelligence' means vs what AI currently does. Show them to your team.
2. Audit your AI product for 'memorization dependence': identify three tasks where your AI performs well and ask — would this still work on data the model has never seen before? If you're relying on the model having seen similar patterns in training, your product is more fragile than you think.
3. Read Chollet's 2019 paper 'On the Measure of Intelligence' — the abstract and Section 1 (10 pages) are accessible to non-researchers and provide the most rigorous framework available for thinking about what AI can and cannot do.
Tools Mentioned
ARC Benchmark
Abstraction and Reasoning Corpus — the intelligence test that exposes LLM limitations. $1M prize at arcprize.org.
Keras
Deep learning framework created by Chollet — used by millions of developers and one of the most popular high-level neural network APIs
OpenAI o3
Reasoning-focused model that achieved high ARC scores but at enormous compute cost — complicates the 'scaling won't work' thesis
Workflow Idea
Build a 'novelty stress test' for your AI product. Once per quarter, create 10 test inputs that are deliberately different from anything in the model's likely training data: unusual formatting, domain-specific jargon from your industry, edge cases unique to your market, questions that combine concepts in ways that wouldn't appear on the internet. Run them through your AI and score the outputs. This tells you exactly where your product relies on memorization (fine for common cases) vs where it fails on novelty (dangerous for edge cases). Takes 2 hours per quarter and is the single most valuable QA process for AI products.
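A minimal harness for that quarterly check might look like the sketch below. It assumes an OpenAI-compatible chat endpoint via the official openai Python client; the model name, prompts, output file, and 1-5 scoring rubric are placeholders to adapt to your own product and stack.

```python
import csv
import datetime
from openai import OpenAI  # pip install openai; any OpenAI-compatible endpoint works

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Replace these with inputs deliberately unlike the model's likely training data:
# odd formatting, niche jargon from your industry, market-specific edge cases,
# questions that combine concepts in ways unlikely to appear on the internet.
NOVELTY_PROMPTS = [
    "Summarize this invoice formatted as a nested bullet outline: ...",
    "A customer combines our refund policy with a carnet ATA import -- what applies?",
    # ... 8 more, refreshed each quarter
]

def run_stress_test(model="gpt-4o-mini", out_path=None):
    """Run every novelty prompt through the model and dump outputs for human scoring."""
    out_path = out_path or f"novelty_test_{datetime.date.today()}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", "output", "score_1_to_5"])  # score column filled in by a reviewer
        for prompt in NOVELTY_PROMPTS:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            writer.writerow([prompt, response.choices[0].message.content, ""])
    return out_path

if __name__ == "__main__":
    print(run_stress_test())  # score each row by hand, then track totals quarter over quarter
```

The value is in the scored CSV over time: a stable score on common cases with a declining or flat score on novel ones tells you precisely where the product leans on memorization.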
Context & Connections
Agrees With
- Gary Marcus
- Yann LeCun
Contradicts
- Sam Altman
- Dario Amodei
- Ilya Sutskever
Further Reading
- On the Measure of Intelligence — Francois Chollet (2019, arXiv:1911.01547)
- ARC Prize — arcprize.org — the $1M challenge for AGI-relevant intelligence
- Keras documentation — keras.io — Chollet's deep learning framework used by millions