We Pitted Two AIs Against Each Other to Solve a Wall Street-Level Puzzle. The Results Were Not What I Expected.

It began with a deceptively simple challenge: analyze a raw transaction ledger and identify the most profitable customers.  

But this wasn’t a tidy spreadsheet. It was an uncurated SQL export with 50+ transaction types, hidden rules, and edge cases. Anyone who has worked with real-world finance data knows this is where analytics projects usually stall. It was the classic case of “garbage in, garbage out”: without structure (or someone who knows what they’re doing), even the smartest system will spin its wheels. My real aim wasn’t just to build a Borrower Profitability Scorecard; it was to test whether an AI could learn the underlying data schema and consistently crunch vast amounts of raw ledger data without falling apart. Spoiler alert: it mostly did.
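The core scoring problem can be sketched in a few lines. Everything below is illustrative: the row shape (borrower, transaction type, amount), the tiny rule table, and the transaction names are my own stand-ins, not the project’s actual schema or rules. The one design point worth copying is that unmapped transaction types are surfaced rather than guessed at — exactly the gap where a model starts hallucinating rules.

```python
from collections import defaultdict

# Hypothetical ledger rows: (borrower_id, txn_type, amount).
# A real export has 50+ transaction types; this toy rule table maps
# each known type to a sign for profitability (earnings +, costs -).
RULES = {
    "Interest": +1,
    "Late Fee": +1,
    "Legal Fees": -1,
    "Misc Credit": -1,
}

def score_borrowers(ledger):
    """Sum signed amounts per borrower; flag unknown types instead of guessing."""
    totals = defaultdict(float)
    unknown = set()
    for borrower, txn_type, amount in ledger:
        sign = RULES.get(txn_type)
        if sign is None:
            unknown.add(txn_type)  # surface the gap for a human to classify
            continue
        totals[borrower] += sign * amount
    return dict(totals), unknown

ledger = [
    ("A-101", "Interest", 120.0),
    ("A-101", "Legal Fees", 30.0),
    ("B-202", "Late Fee", 25.0),
    ("B-202", "Charge-Off", 500.0),  # deliberately absent from RULES
]
totals, unknown = score_borrowers(ledger)
```

With these toy numbers, borrower A-101 nets 120 − 30 = 90, B-202 nets 25, and “Charge-Off” is reported back as unclassified rather than silently folded in.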

The First Attempt: GPT-5 and the “Illusion of Thinking”

When I first tested GPT-5, it looked confident. But the moment the ledger got messy, the logic fell apart. It settled into an infinite loop: fast, but making the same mistake over and over and over. This is what Apple’s The Illusion of Thinking study describes: large reasoning models (LRMs) seem capable until tasks become too complex.[1]

Apple’s research outlines three regimes:

  1. Low complexity: conventional models often outperform LRMs.
  2. Medium complexity: LRMs provide an advantage.
  3. High complexity: performance breaks down for both.

My ledger project quickly pushed us into the high-complexity regime. Along the way, GPT-5 demonstrated several well-known failure modes:

  • Confabulation & hallucination — inventing rules that didn’t exist.
  • Inference & reasoning errors — drawing incorrect connections.
  • Contextual drift — forgetting prior corrections.
  • Overthinking & analysis paralysis — exploring irrelevant alternatives.
  • Confirmation bias — sticking to an incorrect early assumption.

The Turning Point: Two AIs, Two Roles

At that point, I stopped asking GPT-5 to “figure it out” and instead asked it simply to follow instructions. I distilled the logic that GPT-5 and I had developed into a structured playbook, then handed that playbook to Gemini.

Could Gemini execute the instructions consistently and solve the problem? GPT-5 had me in an infinite loop and the project was on the brink.
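To make “structured playbook” concrete, here is a minimal sketch of how distilled rules can be stored so that a correction, once made, stays made. The class name, the transaction labels, and the “newest correction wins” policy are my assumptions for illustration, not the actual playbook from the project.

```python
# Sketch of a playbook that accumulates corrections as explicit rules.
# Later corrections override earlier ones, and every change is logged,
# so a fixed mistake cannot quietly reappear on the next run.

class Playbook:
    def __init__(self):
        self._rules = {}    # txn_type -> classification
        self._history = []  # audit trail of every rule and correction

    def add_rule(self, txn_type, classification, note=""):
        self._history.append((txn_type, classification, note))
        self._rules[txn_type] = classification  # newest correction wins

    def classify(self, txn_type):
        return self._rules.get(txn_type, "UNCLASSIFIED")

book = Playbook()
book.add_rule("Legal Fees", "cost")
book.add_rule("Legal Fees", "pass-through", note="correction after review")
```

After the correction, “Legal Fees” classifies as “pass-through” on every subsequent run, while anything not yet ruled on comes back “UNCLASSIFIED” instead of being improvised.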

The Gemini Difference: Methodical over Memorizing

Gemini thrived where GPT-5 struggled. Its strength aligned with the “medium-complexity advantage” described in Apple’s research.[1]

Rather than improvising, Gemini worked through the established rules step by step. It was methodical:

  1. Precise discrepancy analysis: traced numerical gaps to specific transaction types (e.g., “Legal Fees,” “Misc Credit”).
  2. Rule permanence: corrections were retained in a growing Master Instruction Block.
  3. No drift: context was preserved, avoiding contradictions.
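Step 1, tracing a numerical gap to specific transaction types, amounts to a per-type diff between two totals. A hedged sketch, with illustrative names and amounts rather than real ledger data:

```python
# Trace a discrepancy between an expected ledger and a computed one
# down to the transaction types that account for it.

def totals_by_type(ledger):
    """Sum amounts per transaction type from (txn_type, amount) rows."""
    out = {}
    for txn_type, amount in ledger:
        out[txn_type] = out.get(txn_type, 0.0) + amount
    return out

def explain_gap(expected, computed):
    """Return {txn_type: difference} for every type whose totals disagree."""
    gaps = {}
    for t in set(expected) | set(computed):
        diff = expected.get(t, 0.0) - computed.get(t, 0.0)
        if abs(diff) > 1e-9:
            gaps[t] = diff
    return gaps

expected = totals_by_type([("Legal Fees", 150.0), ("Misc Credit", 40.0), ("Interest", 300.0)])
computed = totals_by_type([("Legal Fees", 100.0), ("Interest", 300.0)])
gap = explain_gap(expected, computed)
```

Here the diff pins the discrepancy on “Legal Fees” (50.0 short) and a “Misc Credit” (40.0) that the computed run missed entirely, while matching types like “Interest” drop out.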

AI is not taking our jobs anytime soon, but it is a good collaboration partner. It functioned like a junior analyst who’s not totally wet behind the ears.

Why AI Works Better as a Collaborator

The problems I saw in practice are echoed in research. Multiple studies come to the same conclusion:

AI today is most effective as a collaborator, not a replacement.[2][3]

Taken together, these studies highlight a central truth: AI often suffers from hallucination, logical errors, and bias. But paired with human oversight, it becomes a powerful tool for consistency, scale, and rule-based execution.

The Takeaway: Scope over Scale

At Intro XL, the lesson is clear: success comes NOT from deploying the largest models, but from scoping tightly, codifying rules, and treating AI as a disciplined collaborator.

  • Humans: provide judgment, context, and course corrections.
  • AI: provides memory, consistency, and scalability.

That balance turned a messy ledger into a somewhat reliable Borrower Profitability Scorecard. AI’s true power is unlocked not when it tries to “think” like us, but when it works with us. It’s a tool, not an end-all, be-all.

Footnotes

  1. Apple. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, 2025.
  2. MIT NANDA (Challapally, Pease, Raskar, Chari). State of AI in Business 2025: The GenAI Divide, July 2025.
  3. Gu, Jain, Li, Shetty, Shao, Li, Yang, Ellis, Sen, Solar-Lezama. Challenges and Paths for AI in Software Engineering, arXiv:2503.22625 (2025).
