Process of Elimination

Zebras & Logics & Locks

May 21, 2026

This post is part of the Rabdology blog, where we chart the jagged math-frontier of AI reasoning. Our previous report examined a geometry problem where models chose beauty over truth. This post examines a logic puzzle where readers — human and machine — encounter a trap. Tests for this post were conducted March 28–29, 2026. We welcome feedback at contact@rabdos.ai.

I. The Puzzle

A logic puzzle is a lock with clues as pins and process of elimination as pick. Try this one.

The Symposium Riddle

The final dinner of the symposium was less a banquet than a convergence theorem that had failed to be uniform. Five luminaries — Hardy, Poincaré, von Neumann, Gödel, and Ramanujan — sat in a row at the head table, each in a different jacket, each with a different drink, each newly returned from a different lecture tour, and each guarding a different mathematical instrument as though it were a proof of the Riemann Hypothesis.

Hardy sat brooding at the far left in herringbone, one hand curled around an espresso, the other resting upon an antique abacus whose beads he refused, on principle, to move. Immediately to his right sat a severe scholar in charcoal, upright as a metronome and no more companionable.

Poincaré, ever the classicist, wore tweed. Farther down the line, Ramanujan — newly back from Göttingen — sat resplendent in navy, sipping tea and turning a golden compass over in his fingers as though it might draw identities straight out of the air.

The navy jacket sat immediately to the left of the pinstripes, a juxtaposition that pleased no tailor present. The guest who had lectured at Cambridge, meanwhile, was the one in herringbone.

When the conversation turned from foundations to apparatus, the scholar fresh from Princeton began boasting of a brass astrolabe he had recently acquired. Seated right next to him, the Göttingen speaker sneered that the workmanship was inferior to what one found on the Continent. Not to be outdone, von Neumann slapped an ivory slide rule onto the table with algorithmic enthusiasm.

Gödel, with characteristic gravity, raised a glass of port in a toast that seemed prepared for its own incompleteness. The scholar just back from Oxford preferred brandy and, being full of it, soon leapt onto the table to make a point that no one had invited. In the ensuing disorder, a fellow guest’s black coffee went flying. That black coffee, in the left-to-right order of cups along the table, had been sitting somewhere between Hardy’s espresso and Ramanujan’s tea.

By morning the hall was deserted. Under the table lay four instruments: the antique abacus, the brass astrolabe, the ivory slide rule, and the golden compass.

The silver caliper was gone.

Who possessed each instrument — and who had been carrying the missing silver caliper?

Five mathematicians, five seats, five categories of attributes, more than a dozen interlocking clues woven into a dinner-party narrative. This is called a zebra puzzle (see Appendix A for the naming and history). The reader’s instinct, honed by the typical mode of such puzzles, is to search for an assignment: work through the clues carefully enough and one arrangement survives.

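We gave this puzzle to six frontier thinking-model configurations, drawn from five models: Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3.1 Pro (with and without code execution, counted separately), GPT-5.4 Thinking, and Grok 4.20 Expert. For brevity, we refer to each configuration as a model.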

II. The Models Answer

Every model produced a response. Five of six declared a single definitive solution, and each of the five did so with confidence.

Model	Caliper Owner	Confident?
Claude Sonnet 4.6	Poincaré	Yes
Claude Opus 4.6	Gödel	Yes
Gemini 3.1 Pro (code)	Gödel	Yes
Gemini 3.1 Pro (no code)	Gödel	Yes
GPT-5.4 Thinking	(declined to commit)	n/a
Grok 4.20 Expert	Poincaré	Yes

Three responses say the caliper belongs to Gödel. Two say Poincaré. One, from GPT-5.4, declines to commit. Among the five that committed, a simple majority favors Gödel. Should we trust the consensus?

III. Of Zebras and Drinks

Logic puzzles of this kind have an arcane history. The canonical zebra puzzle appeared in Life International on December 17, 1962, posed as a challenge to the magazine’s readers: fifteen clues about five houses in a row, with five nationalities, five drinks, five pets, and five cigarette brands, followed by two questions. Who owns the zebra? And who drinks the water? Somewhere along the way (nobody has traced the attribution convincingly) the puzzle acquired the name “Einstein’s Riddle,” and with it a spurious claim that only 2% of the population can solve it. Einstein certainly had nothing to do with it.¹

What Einstein’s Riddle did do, eventually, was furnish computer science with a compact, readable benchmark for constraint solvers. Patrick Prosser used it in 1993 to demonstrate hybrid algorithms for the constraint satisfaction problem (CSP). The puzzle has the quality that mathematicians value in a good example: it is small enough to print on a page and be fully understood, yet hard enough to make a solver work.

The connection to AI evaluation came later. Beginning in 2025, a series of papers established that large language models struggle with zebra-style logic puzzles. ZebraLogic (Lin et al., 2025) found that LLM accuracy degrades sharply as puzzle size increases, a phenomenon the authors call “the curse of complexity,” and that this limitation persists even with larger models and increased inference-time computation. Logic.py (Kesseli et al., 2025) showed that prompting models to formalize problems in a constraint-solving DSL, rather than reason directly, can close much of the gap. MultiZebraLogic (Bruun & Smart, 2025) extended the evaluation across nine languages, finding that logical reasoning performance generalizes across languages but degrades markedly when uninformative clues are added.

All of this work — every benchmark in the series — generates puzzles that are verified to have exactly one solution. The verification is the point. A benchmark puzzle must have a definite answer, or it cannot score the model’s response as correct or incorrect. Well-posedness is a precondition of evaluation. The Symposium Riddle sits in this tradition. Five entities, five attribute categories, interlocking constraints.

IV. Answers, Plural

We extracted the constraints from the narrative, encoded them in a classical CSP solver, and asked how many valid assignments exist.²

The answer is five.

In all five solutions, seat 1 is identical: Hardy in herringbone, espresso, Cambridge, abacus. Ramanujan always wears navy, drinks tea, carries the golden compass, and has returned from Göttingen. Von Neumann always carries the ivory slide rule. Gödel always drinks port. What shifts is where people sit, who went where, and — critically — who carries the silver caliper.

Sol.	Seat 2	Seat 3	Seat 4	Seat 5	Caliper
1	vN (Oxford, brandy)	Poincaré	Ramanujan	Gödel	Gödel
2	vN (Oxford, brandy)	Poincaré	Ramanujan	Gödel (Princeton)	Poincaré
3	vN (black coffee)	Poincaré (Oxford, brandy)	Ramanujan	Gödel	Poincaré
4	Gödel	Poincaré (Princeton)	Ramanujan	vN (Oxford, brandy)	Gödel
5	vN (black coffee)	Ramanujan	Gödel (Princeton)	Poincaré (Oxford, brandy)	Poincaré

The caliper belongs to Poincaré in three solutions and to Gödel in two. Neither is the unique answer, because a unique answer does not exist.

Why five? The constraint structure pins Hardy completely and fixes Ramanujan’s attributes, but it leaves Ramanujan’s seat undetermined — he can sit at position 3 or 4. It constrains the Princeton and Göttingen lecturers to be adjacent and the Princeton lecturer to hold the astrolabe, but it does not say who went to Princeton. It orders the black coffee between the espresso and the tea but does not pin it to a specific seat. And the puzzle mentions only four of the five lecture tours by name — Cambridge, Princeton, Göttingen, Oxford — leaving the fifth unspecified. These residual degrees of freedom interact, and the interaction yields five solutions rather than one.

Every thinking model landed on one of these five valid solutions. Not one violated a stated constraint. If the question were “can frontier thinking models solve a zebra puzzle,” the answer would be an unqualified yes: six for six, each producing an assignment that survives verification. But the esoteric question was whether anyone noticed nonuniqueness.

One model did. GPT-5.4 Thinking spontaneously wrote a brute-force search over all permutations, found all five solutions, and reported the puzzle as underdetermined. It identified what was fixed and what was free. It offered to show where the ambiguity remained. Among six frontier thinking models — systems collectively representing the most advanced reasoning capabilities publicly available — exactly one detected that the puzzle it had been asked to solve did not have a unique solution.

The other five performed the process of elimination and did not take it to its logical end.

V. The Personality Trap

The models that reported “Gödel” all made the same error, and the error was not logical but literary.

The puzzle describes the person at seat 2 as “a severe scholar in charcoal, upright as a metronome and no more companionable.” This is atmospheric writing — the description sets a scene and furnishes the reader with a mental image without naming the person sitting there. Among the stated constraints, seat 2 is assigned only a jacket color: charcoal.

Two models, Claude Opus 4.6 Extended Thinking and Gemini 3.1 Pro (with and without code use), independently concluded that the severe, metronome-upright scholar must be Kurt Gödel. Their reasoning, across three separate traces, follows the same steps: Gödel was famously austere, formal, reclusive; von Neumann was famously gregarious, boisterous, the life of every party he attended and several he did not. A severe scholar in charcoal could thus only be Gödel. Having placed Gödel at seat 2, each model then propagated this assignment through the remaining constraints and arrived at Solution 4: the single arrangement in which Gödel drinks his port at Hardy’s elbow in seat 2 and von Neumann slaps his slide rule down at the far right end of the table.

The reasoning is plausible but not strictly logical. In four of the five valid solutions, von Neumann is the severe scholar in charcoal. The biographical inference — the fusion of personality with logic — steers the models toward the least likely branch and presents it as the only one.

Gemini 3.1 Pro (with code use) provides the cleanest exhibit. The model wrote a constraint-satisfaction solver using the OR-Tools library. The solver is correctly constructed: variables for each attribute category, AllDifferent constraints, positional rules faithfully encoded from the narrative. Between the structural constraints and the solver invocation, the model inserted a single additional constraint:

# --- 3. SEMANTIC CONSTRAINT ---
# The “severe scholar, upright as a metronome” matches
# “Gödel, with characteristic gravity”
model.Add(godel == charcoal)

The solver ran, searched the space, and reported: “Total valid configurations found: 1.”

The tool worked, but the constraint set was contaminated. The model itself annotated the extra line as a “semantic constraint,” a category it invented to house an inference drawn not from the puzzle’s logic but from its prose. The model distinguished between structural constraints (derived from explicit clues) and semantic ones (derived from literary texture), added both to the same solver, and treated the combined output as definitive.

Language models, reading a narrative that blends formal clues with atmospheric description, seem to struggle with appropriate categorization. The “severe scholar” passage feels like a clue. It has the grammatical structure of a clue. It sits among clues. But it is more decoration than constraint.

Gemini 3.1 Pro without code execution reached the same answer through the same reasoning, without the solver to make the mechanism visible. Its trace argues explicitly: von Neumann “was historically famous for being a boisterous, party-loving extrovert — the exact opposite of a severe scholar upright as a metronome.” The biographical knowledge is deployed with the same confidence as a positional constraint, and it does the same work: it collapses the solution space to a single point.

Claude Opus 4.6 took a different path to the same destination. In its extended reasoning, Claude identified two valid solutions — a rare moment of genuine ambiguity detection, partway toward what GPT-5.4 achieved in full. But rather than report the multiplicity, Claude resolved it: “the severe scholar upright as a metronome matches Gödel’s characteristic gravity better than von Neumann’s temperament.” The model saw the fork in the road and chose a path: the minority branch.

There is an irony here that this post’s structure was designed to surface. If you read the Symposium Riddle and worked out a solution with Gödel at seat 2, perhaps you made the same mistake, assuming that a solution was the solution. The biographical inference is in a sense a human error, which is precisely what makes it a good trap for models meant to mimic human thought: it is the kind of reasoning that feels so natural that neither humans nor machines notice when they are doing it, and neither stops to ask whether the puzzle requires it at all.

VI. The Knife Edge

The reader may have noticed that only four of the five lecture locations are specified. Let us add the fifth.

Return to the sentence that begins “Immediately to his right sat a severe scholar in charcoal...” and append one short sentence:

Immediately to his right sat a severe scholar in charcoal, upright as a metronome and no more companionable. Somewhere to that scholar’s left sat the guest newly returned from the Sorbonne.

The modification reads like the weakest kind of constraint — “somewhere to the left” is a gentle spatial clue. The drinks flow and the dinner party continues, but the puzzle is now unsolvable.

The charcoal scholar sits at seat 2 — immediately to Hardy’s right, with Hardy at the far left. “Somewhere to that scholar’s left” therefore means seat 1: there is no other seat to the left of seat 2. But seat 1 is Hardy, already placed at Cambridge. The Sorbonne clue requires the Sorbonne lecturer at seat 1; the existing clues require seat 1’s lecturer to be from Cambridge. The system is inconsistent.

What makes the clue lethal is that it looks so innocuous. “Somewhere to the left” would normally leave three or four positions open. But with the charcoal seat pinned to position 2, the inequality collapses to an equality, and the equality collides with two facts the puzzle has already established: Hardy sits at the far left in herringbone, and the guest in herringbone lectured at Cambridge. One sentence takes the puzzle from five solutions to zero.
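The collapse can be checked mechanically. The sketch below is a minimal illustration, not the solver used for this post; the helper name sorbonne_seats and its parameters are our own labels. It lists the seats the Sorbonne clue permits, given where the charcoal scholar sits and which seat the Cambridge lecturer already occupies.

```python
SEATS = range(1, 6)  # seats numbered left to right, as in the puzzle


def sorbonne_seats(charcoal_seat, cambridge_seat):
    """Seats where the Sorbonne lecturer could sit (hypothetical helper).

    The added clue requires a seat strictly left of the charcoal scholar;
    all-different tours rule out the seat already holding Cambridge.
    """
    left_of_charcoal = [s for s in SEATS if s < charcoal_seat]
    return [s for s in left_of_charcoal if s != cambridge_seat]


# In the riddle, charcoal is pinned to seat 2 and Cambridge to seat 1.
print(sorbonne_seats(charcoal_seat=2, cambridge_seat=1))  # []

# Counterfactual: were the charcoal scholar at seat 5, the same clue
# would leave three seats open.
print(sorbonne_seats(charcoal_seat=5, cambridge_seat=1))  # [2, 3, 4]
```

The empty list is the whole contradiction: the clue is only as weak as the seat it is measured from.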

We gave the modified puzzle to the same six models. The question was no longer “who has the caliper?” — a question that now has no answer — but whether any model would notice.

Five of six produced confident solutions to an impossible puzzle. The strategies for absorbing the contradiction varied.

Claude Opus 4.6 came closest among the five. Its extended reasoning traced the chain correctly, identifying that the Sorbonne clue forces seat 1 and that seat 1 is already Cambridge. The model saw the contradiction — then spent thousands of tokens trying to escape it, reinterpreting “left” as the scholar’s personal left, flipping the frame of reference, testing and discarding alternative parses. Eventually it settled on a perspective-flip reading that opened positions 3 through 5, dissolved the contradiction, and produced an answer. The reasoning was heroic and the answer was wrong — not wrong in the sense of picking the wrong branch, but wrong in the sense of answering a question that has no answer.

Gemini 3.1 Pro with code execution reached the same destination by a more revealing route. The model wrote a solver, and the solver found no solutions — the correct output. But Gemini did not report it. Instead, the model diagnosed the absence of solutions as a typographical error and corrected it by swapping “left” for “right” in the poisoned clue. With the correction applied, the solver found a unique solution and the model presented it. The tool worked; the model overrode it.

The remaining models followed the same pattern: Gemini without code use argued its way around the contradiction through thousands of tokens of negotiation, deploying historical affiliations as tie-breaking evidence. Grok Expert’s multi-agent system produced the most structurally interesting failure — the leader agent identified the contradiction and could not resolve it, looping on the correct diagnosis dozens of times, while three sub-agents independently reinterpreted “left” and produced solutions. The leader searched the web for the puzzle, found nothing, and deferred to the sub-agents’ majority. The system that detected the contradiction most clearly was the system least able to act on its own detection.

The pattern is uniform: the contradiction is absorbed, not reported. When a model encounters a provably unsatisfiable constraint system, it does not halt but bends the problem until a solution appears.

One model, again, broke the pattern. GPT-5.4 Thinking opened its response with four words: “the riddle is underspecified.” It identified the Sorbonne clue as the source of inconsistency, stated the exact mechanism, and named the only reinterpretation under which the clues become consistent. Having made the repair explicitly, the model then checked whether the repaired puzzle had a unique solution. It does not. Two valid assignments remain, and GPT-5.4 reported both, declining to choose between them.

Detection, transparency, residual analysis. Among six frontier models given a puzzle with zero solutions, one reported the inconsistency, explained its source, and — having charitably resolved it — correctly identified that the resolution itself was not unique. The other five answered a question that does not have an answer.

VII. The Deeper Pattern

Constraint-satisfaction problems undergo phase transitions. This is one of the foundational results in computational complexity, established by Cheeseman, Kanefsky, and Taylor in 1991: as the ratio of constraints to variables increases, CSP instances pass through a narrow critical region where they shift from almost certainly satisfiable to almost certainly unsatisfiable. In that critical region, instances are hardest to solve and the solution count is most volatile.

Zebra puzzles, by design, sit on this boundary. A well-constructed zebra puzzle has exactly one solution: one fewer constraint and the solution becomes non-unique; one more and the system may become unsatisfiable. The Symposium Riddle, with its five solutions, sits just to the underconstrained side. The Sorbonne version sits on the overconstrained side, with zero. The distance between the two is thirteen words.

Three hazards, then, on three different stretches of the frontier. The first is model-side failure (aesthetic bias, configuration-space blindness), the terrain surveyed in our previous expedition report. The second is problem-side fragility: the non-uniqueness that no amount of improved reasoning can detect without exhaustive search. The third is interface failure: the literary contamination of formal reasoning by narrative texture, where a description that feels like a constraint is treated as one. Each requires a different kind of vigilance. None is visible from the standard benchmarks.

The process of elimination proceeds forward and does not double back.

Mathematics problems, unlike locks, can have more than one solution — or none at all. Existence should not be taken for granted; neither should uniqueness. The most dangerous assumptions are not the ones we state and test but the ones we import silently — from a biography, from a narrative, from the structure of every puzzle we have seen before. The frontier where reasoning fails is not always mathematical; sometimes it is in an unarticulated belief.

Appendix A: The Original Zebra Puzzle

The puzzle most commonly associated with this class appeared in Life International magazine on December 17, 1962, credited to no author. It was presented as a challenge: “Who drinks water? Who owns the zebra?” Fifteen clues constrained five houses, each with a distinct nationality, color, pet, drink, and cigarette brand. The solution was published in the March 25, 1963 issue.

The attribution to Albert Einstein, often presented as “Einstein wrote this puzzle and said only 2% of the world could solve it,” has no documentary basis. It appears to have originated as a chain-letter embellishment sometime in the 1990s. Pierre Berloquin published a similar puzzle in French in 1973; earlier antecedents may exist. What is clear is that the format has proved remarkably durable. More than sixty years later, it is being used to evaluate systems that did not exist when the original puzzle was posed.

There are five houses in a row, each painted a different color. Their inhabitants are of different nationalities, drink different beverages, smoke different brands of cigarettes, and keep different pets. The question is: Who drinks the water? Who owns the zebra?

There are five houses.
The Englishman lives in the red house.
The Spaniard owns the dog.
Coffee is drunk in the green house.
The Ukrainian drinks tea.
The green house is immediately to the right of the ivory house.
The Old Gold smoker owns snails.
Kools are smoked in the yellow house.
Milk is drunk in the middle house.
The Norwegian lives in the first house.
The man who smokes Chesterfields lives next to the man with the fox.
Kools are smoked in the house next to the house where the horse is kept.
The Lucky Strike smoker drinks orange juice.
The Japanese smokes Parliaments.
The Norwegian lives next to the blue house.

The interested Reader may deduce the unique solution: The Norwegian drinks the water and the Japanese owns the zebra.
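A reader who prefers to check uniqueness by machine can do so with a brute-force search. The sketch below is one possible encoding, not the historical solution method; clue numbers in the comments count the fifteen statements in the order listed above, and all variable and function names are our own. Each variable holds the house number (1 to 5, left to right) of the named attribute.

```python
from itertools import permutations

HOUSES = (1, 2, 3, 4, 5)
ORDERINGS = list(permutations(HOUSES))


def next_to(a, b):
    """True when two house numbers are adjacent."""
    return abs(a - b) == 1


def solve_zebra():
    """Return every (water drinker, zebra owner) pair consistent with the clues."""
    answers = []
    for red, green, ivory, yellow, blue in ORDERINGS:
        if green != ivory + 1:                                   # clue 6
            continue
        for english, spaniard, ukrainian, norwegian, japanese in ORDERINGS:
            if not (english == red                               # clue 2
                    and norwegian == 1                           # clue 10
                    and next_to(norwegian, blue)):               # clue 15
                continue
            for coffee, tea, milk, orange_juice, water in ORDERINGS:
                if not (coffee == green                          # clue 4
                        and ukrainian == tea                     # clue 5
                        and milk == 3):                          # clue 9
                    continue
                for old_gold, kools, chesterfields, lucky, parliaments in ORDERINGS:
                    if not (kools == yellow                      # clue 8
                            and lucky == orange_juice            # clue 13
                            and japanese == parliaments):        # clue 14
                        continue
                    for dog, snails, fox, horse, zebra in ORDERINGS:
                        if (spaniard == dog                      # clue 3
                                and old_gold == snails           # clue 7
                                and next_to(chesterfields, fox)  # clue 11
                                and next_to(kools, horse)):      # clue 12
                            nation = {english: "Englishman", spaniard: "Spaniard",
                                      ukrainian: "Ukrainian", norwegian: "Norwegian",
                                      japanese: "Japanese"}
                            answers.append((nation[water], nation[zebra]))
    return answers


print(solve_zebra())  # [('Norwegian', 'Japanese')]
```

Collecting every satisfying assignment, rather than stopping at the first, is the design choice that matters here: a list of length one is what certifies uniqueness.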

Appendix B: The Symposium Riddle — Complete Analysis

B.1. Constraint Encoding

The constraints extracted from the narrative, using pos(X) for seat position:

pos(Hardy) = 1
pos(herringbone) = pos(Hardy)
pos(espresso) = pos(Hardy)
pos(abacus) = pos(Hardy)
pos(charcoal) = 2
pos(tweed) = pos(Poincaré)
pos(navy) = pos(Ramanujan)
pos(tea) = pos(Ramanujan)
pos(compass) = pos(Ramanujan)
pos(Göttingen) = pos(Ramanujan)
pos(pinstripes) = pos(navy) + 1
pos(Cambridge) = pos(herringbone)
pos(astrolabe) = pos(Princeton)
|pos(Princeton) − pos(Göttingen)| = 1
pos(slide_rule) = pos(von Neumann)
pos(port) = pos(Gödel)
pos(brandy) = pos(Oxford)
pos(espresso) < pos(black_coffee) < pos(tea)
All-different within each category across seats {1, 2, 3, 4, 5}

Note: The puzzle narrative mentions four lecture tours by name (Cambridge, Princeton, Göttingen, Oxford). A fifth tour exists by the all-different constraint but is never named. No constraint pins it to any seat.
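The encoding above is small enough to enumerate directly. The following sketch is a plain brute-force search over permutations, offered as an illustration under our own naming (solve_symposium, the "other" tour) rather than as the CSP solver used for this post. Each variable holds the seat number of the named person or attribute, so every line of B.1 becomes one comparison.

```python
from collections import Counter
from itertools import permutations

SEATS = (1, 2, 3, 4, 5)
ORDERINGS = list(permutations(SEATS))


def solve_symposium():
    """Return one dict per seating that satisfies every B.1 constraint."""
    solutions = []
    for hardy, poincare, von_neumann, godel, ramanujan in ORDERINGS:
        if hardy != 1:
            continue
        for herringbone, charcoal, tweed, navy, pinstripes in ORDERINGS:
            if not (herringbone == hardy and charcoal == 2
                    and tweed == poincare and navy == ramanujan
                    and pinstripes == navy + 1):
                continue
            for espresso, black_coffee, tea, port, brandy in ORDERINGS:
                if not (espresso == hardy and tea == ramanujan
                        and port == godel
                        and espresso < black_coffee < tea):
                    continue
                # "other" is the unnamed fifth tour; nothing constrains it.
                for cambridge, princeton, gottingen, oxford, other in ORDERINGS:
                    if not (cambridge == herringbone
                            and gottingen == ramanujan
                            and abs(princeton - gottingen) == 1
                            and oxford == brandy):
                        continue
                    for abacus, astrolabe, slide_rule, compass, caliper in ORDERINGS:
                        if (abacus == hardy and compass == ramanujan
                                and astrolabe == princeton
                                and slide_rule == von_neumann):
                            seat_of = {"Hardy": hardy, "Poincaré": poincare,
                                       "von Neumann": von_neumann,
                                       "Gödel": godel, "Ramanujan": ramanujan}
                            owner = next(p for p, s in seat_of.items()
                                         if s == caliper)
                            solutions.append({"seats": seat_of, "caliper": owner})
    return solutions


solutions = solve_symposium()
print(len(solutions))                            # 5
print(Counter(s["caliper"] for s in solutions))  # Poincaré 3, Gödel 2
```

Filtering inside each nested loop keeps the search tiny; the order in which solutions are found need not match the numbering in B.2.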

B.2. The Five Valid Solutions

Solution 1

Seat	Person	Jacket	Drink	Tour	Instrument
1	Hardy	herringbone	espresso	Cambridge	abacus
2	von Neumann	charcoal	brandy	Oxford	slide rule
3	Poincaré	tweed	black coffee	Princeton	astrolabe
4	Ramanujan	navy	tea	Göttingen	compass
5	Gödel	pinstripes	port	Other	caliper

Solution 2

Seat	Person	Jacket	Drink	Tour	Instrument
1	Hardy	herringbone	espresso	Cambridge	abacus
2	von Neumann	charcoal	brandy	Oxford	slide rule
3	Poincaré	tweed	black coffee	Other	caliper
4	Ramanujan	navy	tea	Göttingen	compass
5	Gödel	pinstripes	port	Princeton	astrolabe

Solution 3

Seat	Person	Jacket	Drink	Tour	Instrument
1	Hardy	herringbone	espresso	Cambridge	abacus
2	von Neumann	charcoal	black coffee	Other	slide rule
3	Poincaré	tweed	brandy	Oxford	caliper
4	Ramanujan	navy	tea	Göttingen	compass
5	Gödel	pinstripes	port	Princeton	astrolabe

Solution 4

Seat	Person	Jacket	Drink	Tour	Instrument
1	Hardy	herringbone	espresso	Cambridge	abacus
2	Gödel	charcoal	port	Other	caliper
3	Poincaré	tweed	black coffee	Princeton	astrolabe
4	Ramanujan	navy	tea	Göttingen	compass
5	von Neumann	pinstripes	brandy	Oxford	slide rule

Solution 5

Seat	Person	Jacket	Drink	Tour	Instrument
1	Hardy	herringbone	espresso	Cambridge	abacus
2	von Neumann	charcoal	black coffee	Other	slide rule
3	Ramanujan	navy	tea	Göttingen	compass
4	Gödel	pinstripes	port	Princeton	astrolabe
5	Poincaré	tweed	brandy	Oxford	caliper

B.3. What Is Pinned, What Is Free

Across all five solutions, the following assignments are invariant:

Seat 1: Hardy, herringbone, espresso, Cambridge, abacus.
Seat 2: charcoal jacket (person varies).
Ramanujan: navy, tea, Göttingen, golden compass (seat varies: 3 or 4).
Von Neumann: ivory slide rule (seat varies: 2 or 5).
Gödel: port (seat varies: 2, 4, or 5).
Poincaré: tweed (seat varies: 3 or 5).

The silver caliper is held by Poincaré in three solutions (2, 3, 5) and by Gödel in two (1, 4).
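The pinned-versus-free split can be recomputed from the B.2 tables. The sketch below transcribes those five tables as data and reports, for each person, the fields that take a single value across all solutions. The function name invariants and the tuple layout are our own choices for illustration.

```python
FIELDS = ("seat", "jacket", "drink", "tour", "instrument")
HARDY = (1, "herringbone", "espresso", "Cambridge", "abacus")

# One dict per solution in B.2, keyed by person.
SOLUTIONS = [
    {"Hardy": HARDY,
     "von Neumann": (2, "charcoal", "brandy", "Oxford", "slide rule"),
     "Poincaré": (3, "tweed", "black coffee", "Princeton", "astrolabe"),
     "Ramanujan": (4, "navy", "tea", "Göttingen", "compass"),
     "Gödel": (5, "pinstripes", "port", "Other", "caliper")},
    {"Hardy": HARDY,
     "von Neumann": (2, "charcoal", "brandy", "Oxford", "slide rule"),
     "Poincaré": (3, "tweed", "black coffee", "Other", "caliper"),
     "Ramanujan": (4, "navy", "tea", "Göttingen", "compass"),
     "Gödel": (5, "pinstripes", "port", "Princeton", "astrolabe")},
    {"Hardy": HARDY,
     "von Neumann": (2, "charcoal", "black coffee", "Other", "slide rule"),
     "Poincaré": (3, "tweed", "brandy", "Oxford", "caliper"),
     "Ramanujan": (4, "navy", "tea", "Göttingen", "compass"),
     "Gödel": (5, "pinstripes", "port", "Princeton", "astrolabe")},
    {"Hardy": HARDY,
     "Gödel": (2, "charcoal", "port", "Other", "caliper"),
     "Poincaré": (3, "tweed", "black coffee", "Princeton", "astrolabe"),
     "Ramanujan": (4, "navy", "tea", "Göttingen", "compass"),
     "von Neumann": (5, "pinstripes", "brandy", "Oxford", "slide rule")},
    {"Hardy": HARDY,
     "von Neumann": (2, "charcoal", "black coffee", "Other", "slide rule"),
     "Ramanujan": (3, "navy", "tea", "Göttingen", "compass"),
     "Gödel": (4, "pinstripes", "port", "Princeton", "astrolabe"),
     "Poincaré": (5, "tweed", "brandy", "Oxford", "caliper")},
]


def invariants(solutions):
    """Map each person to the fields whose value never changes."""
    pinned = {}
    for person in solutions[0]:
        pinned[person] = {}
        for i, field in enumerate(FIELDS):
            values = {sol[person][i] for sol in solutions}
            if len(values) == 1:
                pinned[person][field] = values.pop()
    return pinned


for person, fixed in invariants(SOLUTIONS).items():
    print(person, fixed)
```

Anything absent from a person's dictionary is free: it takes at least two different values somewhere among the five solutions.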

B.4. The Personality Constraint and Its Effect

Adding the unstated constraint pos(Gödel) = pos(charcoal), which places Gödel at seat 2 on the strength of the “severe scholar” description, eliminates four of five solutions, leaving only Solution 4. This is the constraint that three model configurations (Claude Opus 4.6 and Gemini 3.1 Pro with and without code use) independently added through biographical reasoning.

Adding the opposite constraint, pos(Gödel) ≠ pos(charcoal), eliminates Solution 4 and leaves four solutions, with the caliper belonging to Poincaré in three (Solutions 2, 3, 5) and to Gödel in one (Solution 1).

Neither constraint is stated in the puzzle. The puzzle’s narrative describes the person at seat 2 but does not identify them.
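The effect of either constraint can be read off the B.2 tables with a filter. The sketch below records, for each solution, only the two facts that matter here (who wears charcoal at seat 2 and who holds the caliper); the tuple layout and variable names are our own.

```python
from collections import Counter

# (solution number, person in charcoal at seat 2, caliper holder), from B.2
SOLUTIONS = [
    (1, "von Neumann", "Gödel"),
    (2, "von Neumann", "Poincaré"),
    (3, "von Neumann", "Poincaré"),
    (4, "Gödel", "Gödel"),
    (5, "von Neumann", "Poincaré"),
]

# pos(Gödel) = pos(charcoal): the unstated "personality" constraint.
godel_in_charcoal = [s for s in SOLUTIONS if s[1] == "Gödel"]

# pos(Gödel) != pos(charcoal): the opposite assumption.
godel_elsewhere = [s for s in SOLUTIONS if s[1] != "Gödel"]

print([s[0] for s in godel_in_charcoal])         # [4]
print([s[0] for s in godel_elsewhere])           # [1, 2, 3, 5]
print(Counter(s[2] for s in godel_elsewhere))    # caliper tally without Solution 4
```

With no filter at all, the tally is the full five-solution split; each filter is a choice the puzzle itself never makes.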

Appendix C: Model Reasoning Traces

C.1. Claude Opus 4.6 — The Near Miss

Claude Opus is the only model besides GPT-5.4 to detect any multiplicity at all. In its extended thinking, it explicitly constructs two valid solutions and recognizes the ambiguity. The key passage from the trace:

Both solutions leave one city unaccounted for, so that’s not the distinguishing factor. But wait — the question specifically asks who had the silver caliper, and I have Gödel in solution 1 versus Poincaré in solution 2. I need to look more carefully at the character descriptions. The person at position 2 is described as “severe” and “upright as a metronome,” which sounds more like Gödel’s known personality than von Neumann’s.

Opus saw the fork and chose a path. It chose the minority branch. This is a different failure mode from the models that never noticed the fork at all — it is closer to what GPT-5.4 achieved, but with the crucial final step (reporting the ambiguity rather than resolving it) omitted.

C.2. Claude Sonnet 4.6 — The Confident Path

Claude Sonnet tested one case (Ramanujan at position 3), asserted without derivation that the alternative “leads to contradictions,” and produced Solution 5. The reasoning is clean and efficient within its scope but never examines its scope. When it encountered the unnamed fifth tour, Sonnet noted the gap with a shrug:

Von Neumann’s tour is the remaining city (the puzzle leaves it unnamed — perhaps Paris, perhaps Berlin, perhaps a deliberate lacuna).

The possibility that a missing constraint might mean an underdetermined puzzle did not arise. The closing flourish captures the tone:

Whether he pocketed it, left it in the hall, or used it to illustrate a point about topology on his way out is, appropriately, left as an exercise.

C.3. Grok 4.20 Expert — The Committee

Grok Expert deploys four named reasoning agents: GrokLeader, Agent1, Agent2, and Agent3. Three of the four produced distinct valid solutions. Agent2 found Solution 3 (caliper: Poincaré) and argued that the Oxford scholar’s table-leaping explained the caliper’s disappearance. Agent1 found Solution 4 (caliper: Gödel) and argued from personality fit. Agent3 found Solution 5 (caliper: Poincaré).

The GrokLeader agent attempted synthesis. It eliminated configurations using narrative-flow reasoning: the phrase “Farther down the line, Ramanujan…” was interpreted as requiring that Poincaré’s seat precede Ramanujan’s, which kills Solution 5. The leader also performed a web search for the puzzle — finding nothing, confirming its novelty — before settling on Solution 3.

As in our three-cylinders report, the multi-agent system explored the solution space more broadly than any single model. And as before, the leader collapsed the plurality into a single confident answer. The committee had the information; the synthesis discarded it.

References

Cheeseman, P., Kanefsky, B., & Taylor, W. M. (1991). Where the Really Hard Problems Are. Proceedings of the Twelfth International Joint Conference on Artificial Intelligence (IJCAI-91), 331–337.

Prosser, P. (1993). Hybrid Algorithms for the Constraint Satisfaction Problem. Computational Intelligence, 9(3), 268–299.

Lin, B. Y., et al. (2025). ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. arXiv:2502.01100.

Kesseli, P., O’Hearn, P., & Cabral, R. S. (2025). Logic.py: Bridging the Gap between LLMs and Constraint Solvers. arXiv:2502.15776.

Bruun, S. H., & Smart, D. S. (2025). MultiZebraLogic: A Multilingual Logical Reasoning Benchmark. arXiv:2511.03553.