It began with a deceptively simple challenge: analyze a raw transaction ledger and identify the most profitable customers.
The First Attempt: GPT-5 and the “Illusion of Thinking”
When I first tested GPT-5, it looked confident. But the moment the ledger got messy, the logic fell apart. It was fast, but it was trapped in an infinite loop, making the same mistake over and over. This is what Apple’s The Illusion of Thinking study describes: large reasoning models (LRMs) seem capable until tasks become too complex.[1]
Apple’s research outlines three regimes:
- Low complexity: conventional models often outperform LRMs.
- Medium complexity: LRMs provide an advantage.
- High complexity: performance breaks down for both.
My ledger project quickly pushed us into the high-complexity regime. Along the way, GPT-5 demonstrated several well-known failure modes:
- Confabulation & hallucination — inventing rules that didn’t exist.
- Inference & reasoning errors — drawing incorrect connections.
- Contextual drift — forgetting prior corrections.
- Overthinking & analysis paralysis — exploring irrelevant alternatives.
- Confirmation bias — sticking to an incorrect early assumption.
The Turning Point: Two AIs, Two Roles
At that point, I stopped asking GPT-5 to “figure it out” and instead asked it to follow instructions. I distilled the logic GPT-5 and I had developed into a structured playbook, then handed that playbook to Gemini.
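To make that concrete, here is a minimal sketch of the playbook idea: a frozen set of explicit rules prepended to every prompt, so the model executes instead of improvising. The specific rules, the PLAYBOOK name, and the build_prompt helper are illustrative assumptions, not the actual instruction set.

```python
# A minimal sketch of the "structured playbook": distilled ledger rules kept
# in one place and prepended to every prompt. The rules below are invented
# for illustration; the real playbook was distilled from the GPT-5 sessions.
PLAYBOOK = [
    "1. Treat each ledger row as exactly one transaction; never merge or split rows.",
    "2. Classify every transaction by its type before doing any arithmetic.",
    "3. 'Legal Fees' reduce profit; 'Misc Credit' increases it. Never reclassify.",
    "4. If a row matches no rule, flag it for human review; do not invent a rule.",
]

def build_prompt(task: str) -> str:
    """Prepend the frozen rule set so the model executes, not improvises."""
    rules = "\n".join(PLAYBOOK)
    return f"Follow these rules exactly. Do not deviate.\n\n{rules}\n\nTask: {task}"

print(build_prompt("Compute net revenue per borrower for March."))
```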
Could Gemini execute the instructions consistently and solve the problem? GPT-5 had me going in circles, and the project was on the brink.
The Gemini Difference: Methodical over Memorizing
Gemini thrived where GPT-5 struggled. Its strength aligned with the “medium-complexity advantage” described in Apple’s research.[1]
Rather than improvising, Gemini followed the established rules step by step. It was methodical:
- Precise discrepancy analysis: traced numerical gaps to specific transaction types (e.g., “Legal Fees,” “Misc Credit”); a sketch follows this list.
- Rule permanence: corrections were retained in a growing Master Instruction Block.
- No drift: context was preserved, avoiding contradictions.
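Here is what that discrepancy tracing looks like in miniature. The column names, the reported total, and the pandas approach are my assumptions for illustration; the real ledger and pipeline differ.

```python
import pandas as pd

# Illustrative ledger; the schema ("borrower", "txn_type", "amount") is an
# assumption, not the real one.
ledger = pd.DataFrame({
    "borrower": ["A", "A", "B", "B", "B"],
    "txn_type": ["Payment", "Legal Fees", "Payment", "Misc Credit", "Legal Fees"],
    "amount":   [1000.0, -150.0, 2000.0, 75.0, -300.0],
})

computed_total = ledger["amount"].sum()
reported_total = 2700.0  # hypothetical figure from the source system

gap = reported_total - computed_total
if abs(gap) > 0.005:
    # Trace the gap to specific transaction types instead of guessing at it.
    by_type = ledger.groupby("txn_type")["amount"].sum().sort_values()
    print(f"Discrepancy of {gap:+.2f}; subtotals by transaction type:")
    print(by_type)
```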
AI is not taking our jobs anytime soon, but it is a good collaboration partner. It functioned like a junior analyst who’s not totally wet behind the ears.
Why AI Works Better as a Collaborator
The problems I saw in practice are echoed in the research.[2][3] Multiple studies come to the same conclusion:
AI today is most effective as a collaborator, not a replacement.
Taken together, these studies highlight a central truth: AI often suffers from hallucination, logical errors, and bias. But paired with human oversight, it becomes a powerful tool for consistency, scale, and rule-based execution.
The Takeaway: Scope over Scale
At Intro XL, the lesson is clear: success comes not from deploying the largest models, but from scoping tightly, codifying rules, and treating AI as a disciplined collaborator.
- Humans: provide judgment, context, and course corrections.
- AI: provides memory, consistency, and scalability.
That balance turned a messy ledger into a somewhat reliable Borrower Profitability Scorecard. AI’s true power is unlocked not when it tries to “think” like us, but when it works with us. It’s a tool, not an end-all, be-all.
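For a flavor of the end product, here is a toy version of the scorecard’s core aggregation. The schema and amounts are invented, and the real scorecard layers on the rules from the Master Instruction Block.

```python
import pandas as pd

# Toy Borrower Profitability Scorecard: net profit per borrower, ranked.
# The schema and amounts are invented for illustration.
ledger = pd.DataFrame({
    "borrower": ["A", "A", "B", "B", "C"],
    "txn_type": ["Payment", "Legal Fees", "Payment", "Misc Credit", "Payment"],
    "amount":   [1000.0, -150.0, 2000.0, 75.0, 500.0],
})

scorecard = (
    ledger.groupby("borrower")["amount"]
          .sum()
          .rename("net_profit")
          .sort_values(ascending=False)
          .reset_index()
)
print(scorecard)  # borrowers ranked from most to least profitable
```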
Footnotes
1. Apple. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, 2025.
2. MIT NANDA (Challapally, Pease, Raskar, Chari). State of AI in Business 2025: The GenAI Divide, July 2025.
3. Gu, Jain, Li, Shetty, Shao, Li, Yang, Ellis, Sen, Solar-Lezama. Challenges and Paths for AI in Software Engineering, arXiv:2503.22625 (2025).