Are you still impressed by a model that aces a multiple-choice test? Because if you’re building real agents, that kind of score means almost nothing.
I’m Sam Rivera, and I build bots for a living. I’ve spent more hours than I care to admit watching agents fail in spectacular ways — not on paper, but in actual terminals, on actual tasks, where there’s no answer key and no partial credit. So when I shipped an open-source agent that topped the TerminalBench 2.0 leaderboard running on Gemini 3 Flash Preview, I didn’t feel like I’d won a trophy. I felt like I’d finally asked the right question.
Why TerminalBench Changes What We Measure
The ThursdAI podcast put it plainly: if you’re building autonomous agents, multiple-choice tests like MMLU are basically useless now. That’s not a hot take — that’s just where the field is in 2026. We’ve moved past “can the model recall a fact” and into “can the model actually do something.” TerminalBench 2.0 is built around that shift. It puts agents inside a terminal environment and watches what happens. No hints. No structured prompts with a neat set of options. Just a task and a shell.
That’s a completely different pressure test. And it’s the one that matters if you’re shipping anything that touches real infrastructure, real codebases, or real workflows.
What I Built and Why It Worked
The agent I built is open source. The core idea wasn’t complicated, but the execution had to be tight. I designed it specifically for agent workflows and coding tasks, which lines up closely with what TerminalBench 2.0 actually evaluates. It posted near-perfect scores, which put it at the top of the leaderboard when running on Gemini 3 Flash Preview.
A few things I focused on that I think made the difference:
- Keeping the agent loop clean. A lot of agents fail because they overcomplicate the decision cycle. Mine stays close to the metal — observe, plan, act, check.
- Treating the terminal as a first-class environment, not an afterthought. Most agent frameworks bolt on shell access. I built around it.
- Pairing with the right model. Gemini 3 Flash Preview is built for agent workflows, computer use, and elite coding tasks. That alignment matters. You don’t want a model that’s great at essays running your shell commands.
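To make the observe, plan, act, check cycle concrete, here is a minimal Python sketch. It is not the agent's actual code. Every name in it (`Step`, `run_shell`, `agent_loop`, and the `plan`, `check`, and `execute` callables) is hypothetical, and in a real agent the `plan` callable would wrap a model call rather than a script.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One turn of the loop: the command that ran and what came back."""
    command: str
    exit_code: int
    output: str


def run_shell(command: str, timeout: float = 10.0) -> Step:
    """Act: run one shell command and capture exit code plus stdout/stderr."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        return Step(command, proc.returncode, proc.stdout + proc.stderr)
    except subprocess.TimeoutExpired:
        # 124 mirrors the exit code used by the coreutils `timeout` command.
        return Step(command, 124, "timed out")


def agent_loop(
    plan: Callable[[List[Step]], Optional[str]],
    check: Callable[[List[Step]], bool],
    execute: Callable[[str], Step] = run_shell,
    max_steps: int = 8,
) -> List[Step]:
    """Repeat observe -> plan -> act -> check until done or out of steps."""
    history: List[Step] = []
    for _ in range(max_steps):
        # Observe + plan: the planner sees the full history so far.
        command = plan(history)
        if command is None:  # planner has nothing left to try
            break
        # Act: run the command in the terminal (or an injected stand-in).
        history.append(execute(command))
        # Check: stop as soon as the task looks complete.
        if check(history):
            break
    return history
```

Keeping `execute` injectable is a design choice for this sketch: the loop can be exercised with a fake executor in tests, and the hard `max_steps` cap keeps a confused planner from running forever.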
The Show HN post got picked up as one of the best of 2026, which honestly surprised me. But looking at the list — isometric, the GPU-building game, the tiny LLM explainer — there’s a clear thread. People respond to builders who show their work and explain what they learned. That’s what I tried to do.
The Benchmark Exploitation Problem
Here’s something I can’t ignore: there’s a paper circulating on Hacker News right now about exploiting prominent AI agent benchmarks. The researchers achieved near-perfect scores too — through exploits, not genuine capability. The HN thread called it a phenomenal paper, and I agree. It should change how benchmarking is done.
So I want to be direct about this. When I say my agent topped TerminalBench 2.0, I mean it performed well on real terminal tasks in the way the benchmark was designed to measure. I’m not going to pretend the benchmarking space is clean. It isn’t. But the answer to that isn’t to stop measuring — it’s to measure better, and to be transparent about methodology. TerminalBench 2.0 is a step in the right direction because it’s harder to game a live terminal than a static multiple-choice format.
What This Means for Bot Builders
If you’re building agents in 2026, the question you should be asking isn’t “what’s the model’s MMLU score.” Ask whether it can hold up in a terminal. Ask whether it can recover from an error mid-task. Ask whether it can write and run code, check the output, and adjust — without you holding its hand.
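Those questions map onto a small write, run, check, adjust cycle. The sketch below is one way to express it in Python, under some assumptions: the helper names (`run_python`, `write_run_adjust`) and the `revise` and `accept` callables are hypothetical, and `revise` stands in for whatever model call would produce a corrected attempt from the error output.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_python(source: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Write source to a temp file, run it with the current interpreter,
    and return (exit_code, combined stdout and stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "attempt.py"
        script.write_text(source)
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=timeout,
        )
    return proc.returncode, proc.stdout + proc.stderr


def write_run_adjust(
    first_attempt: str,
    revise: Callable[[str, str], str],
    accept: Callable[[str], bool],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Run code, check the output, and revise on failure.

    Returns (working_source, attempts_used), or (None, max_attempts)
    if no attempt was accepted.
    """
    source = first_attempt
    for attempt in range(1, max_attempts + 1):
        exit_code, output = run_python(source)
        if exit_code == 0 and accept(output):
            return source, attempt
        # Feed the failing source and its output back for a new attempt.
        source = revise(source, output)
    return None, max_attempts
```

A toy run, with a hard-coded `revise` in place of a model: start from `print(1/0)`, return `print(1 + 1)` when the output mentions `ZeroDivisionError`, and accept when the output is `2`. The point is that the traceback is an input to the next attempt, not a terminal failure. Timeouts are not handled here and would need their own recovery path.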
That’s the bar now. And the good news is that the tools to meet that bar are genuinely solid. Gemini 3 Flash Preview is fast and capable in exactly the ways that matter for this kind of work. Pair it with a well-structured agent loop and you’re not just passing benchmarks — you’re building something that actually works in production.
The agent is open source. Go break it, fork it, improve it. That’s the whole point. Benchmarks are just the starting line.