Ornith-1.0-397B, the flagship of a new family of open-source coding models, scores 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, beating Claude Opus 4.7 (70.3 on Terminal-Bench, 80.8 on SWE-Bench) on both benchmarks. Built by DeepReinforce on top of Gemma 4 and Qwen 3.5, Ornith uses a novel self-scaffolding training approach that lets the model build its own problem-solving harness. Four sizes have been announced: 9B (edge devices), 31B (single-GPU), 35B MoE (efficiency sweet spot), and 397B MoE (frontier). The weights are not yet public as of June 29, 2026.
I'll be honest: when I saw "self-scaffolding LLM" in the HN title, I rolled my eyes. Another buzzword. Another model claiming to redefine agentic coding with numbers that look suspiciously cherry-picked.
Then I actually read the paper and checked the independent benchmarks. This one's different. Not because of bigger compute or more data — but because the training method itself is a departure from the standard RL playbook. And the benchmark gap over Opus isn't paper-thin.
What Is Ornith-1.0?
Ornith-1.0 is a post-trained family of open-source agentic coding models built on Gemma 4 and Qwen 3.5. It wasn't trained from scratch — the gains come entirely from the post-training methodology, not from scaling pretraining compute.
| Variant | Architecture | Target Use Case |
|---|---|---|
| 9B | Dense | Edge deployment — laptops, constrained devices, local use |
| 31B | Dense | Mid-range — single consumer GPU capable |
| 35B | MoE | Performance-per-parameter sweet spot: 64.4 on Terminal-Bench 2.1 vs 53.5 for Qwen 3.5-397B |
| 397B | MoE | Frontier — trading blows with Claude Opus 4.7 |
The four-size strategy is practical, not vanity. The 9B model matches Gemma 4-31B on SWE-Bench (69.4). The 35B MoE beats Qwen 3.5-397B on Terminal-Bench by 10.9 points (64.4 vs 53.5) despite having roughly 11× fewer total parameters. And the 397B flagship leads every open-weight model it is compared against in the benchmark table below.
Benchmark Breakdown: Ornith vs the Field
Ornith-1.0-397B beats every model in the comparison below on both major agentic coding benchmarks. The margin over Claude Opus 4.7 is 7.2 points on Terminal-Bench 2.1 (about 10.2% in relative terms) and 1.6 points on SWE-Bench Verified (about 2.0% relative).
Flagship Tier (397B MoE)
| Benchmark | Ornith 397B | Claude Opus 4.7 | MiniMax M3 | DeepSeek V4 Pro |
|---|---|---|---|---|
| Terminal-Bench 2.1 | 77.5 | 70.3 | 66.0 | 67.9 |
| SWE-Bench Verified | 82.4 | 80.8 | 80.5 | 80.6 |
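The margins are easy to misread because absolute points and relative percentages sit close together, so here is a small sketch that recomputes both from the table. The scores are copied from the table above; the function and variable names are mine, purely for illustration.

```python
# Scores copied from the flagship table above.
SCORES = {
    "Terminal-Bench 2.1": {
        "Ornith 397B": 77.5,
        "Claude Opus 4.7": 70.3,
        "MiniMax M3": 66.0,
        "DeepSeek V4 Pro": 67.9,
    },
    "SWE-Bench Verified": {
        "Ornith 397B": 82.4,
        "Claude Opus 4.7": 80.8,
        "MiniMax M3": 80.5,
        "DeepSeek V4 Pro": 80.6,
    },
}


def margins(scores, leader="Ornith 397B"):
    """Return {benchmark: {rival: (absolute_points, relative_percent)}}."""
    result = {}
    for benchmark, row in scores.items():
        lead = row[leader]
        result[benchmark] = {
            rival: (round(lead - score, 1), round(100 * (lead - score) / score, 1))
            for rival, score in row.items()
            if rival != leader
        }
    return result


for benchmark, row in margins(SCORES).items():
    for rival, (points, percent) in row.items():
        print(f"{benchmark}: +{points} points ({percent}% relative) over {rival}")
```

Against Claude Opus 4.7 this gives 7.2 points (10.2% relative) on Terminal-Bench 2.1 and 1.6 points (2.0% relative) on SWE-Bench Verified.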
Why Terminal-Bench matters more than SWE-Bench here. SWE-Bench tests whether a model can fix bugs in isolated GitHub issues. Terminal-Bench tests whether it can actually use a terminal: navigate directories, run commands, chain tools together. The 7.2-point gap over Opus on Terminal-Bench suggests Ornith's self-scaffolding training produces better real-world tool-use behavior, not just better code generation.
Mid-Range (35B MoE) — The Efficiency Anomaly
Ornith-35B scores 64.4 on Terminal-Bench 2.1 — beating Qwen 3.5-397B (53.5) by nearly 11 points. That's a 35B model outperforming a 397B model on a practical terminal-use benchmark. This kind of claim needs more independent verification before I fully buy it, but the directional signal is strong: efficiency per parameter is unusually high.
Edge Tier (9B Dense)
Ornith-9B scores 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified, with the SWE-Bench figure matching Gemma 4-31B. A 9B model performing at 31B levels on coding tasks changes what "local AI coding assistant" means. If you're running models on a laptop, this is the number that matters.
The Training Innovation: Self-Scaffolding
Ornith's core contribution isn't the model weights — it's a training method where the model learns to build its own problem-solving infrastructure.
Standard RL post-training for coding models uses a fixed, human-designed harness: a script presents tasks, collects solutions, scores them. The model learns to produce better outputs within that static environment. The harness never changes.
Ornith replaces the fixed harness with a learned scaffold that co-evolves with the model's policy. Each training step runs two stages:
- Scaffold refinement — Given the current task and the previous scaffold, the model proposes improvements: better error handling, memory management, debugging strategies, orchestration logic.
- Solution generation — Using the refined scaffold as its operating environment, the model generates a solution.
Reward from the solution propagates to both stages. Scaffolds that produce better answers survive; scaffolds that don't are discarded. Over thousands of iterations, task-specific strategies emerge automatically, with no human researcher hand-crafting them.
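To make the two-stage loop concrete, here is a deliberately tiny sketch. It is not DeepReinforce's training code: the source does not publish the update rule, so I've swapped in plain hill climbing as a stand-in for the RL update, reduced a "scaffold" to a set of strategy names, and made stage 2 a fixed scoring function. Every name here (`Task`, `refine_scaffold`, `generate_solution`, `train`) is hypothetical.

```python
import random
from dataclasses import dataclass

# Strategy names echo the examples in the text; the set itself is illustrative.
STRATEGIES = ("error_handling", "memory", "debugging", "orchestration")


@dataclass(frozen=True)
class Task:
    needs: frozenset  # strategies that actually help on this task (toy assumption)


def refine_scaffold(scaffold: frozenset, rng: random.Random) -> frozenset:
    """Stage 1: propose a changed scaffold by toggling one strategy."""
    return scaffold ^ {rng.choice(STRATEGIES)}


def generate_solution(task: Task, scaffold: frozenset) -> float:
    """Stage 2: stand-in for solving inside the scaffold; returns a reward in [0, 1]."""
    helpful = len(task.needs & scaffold)
    clutter = len(scaffold - task.needs)
    return max(0.0, (helpful - 0.5 * clutter) / len(task.needs))


def train(task: Task, steps: int = 200, seed: int = 0):
    rng = random.Random(seed)
    scaffold = frozenset()
    best = generate_solution(task, scaffold)
    for _ in range(steps):
        candidate = refine_scaffold(scaffold, rng)   # stage 1
        reward = generate_solution(task, candidate)  # stage 2
        if reward > best:  # scaffolds that produce better answers survive
            scaffold, best = candidate, reward
    return scaffold, best


task = Task(needs=frozenset({"error_handling", "debugging"}))
print(train(task))
```

In this toy only the scaffold improves. In the setup the post describes, the same reward would also update the policy that writes the solution, which is the part a sketch this size can't capture.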
Independent evidence that this works: SwellJoe's "Will It Mythos?" security benchmark tested Ornith in two configurations. With only read/grep/ls tools, performance was poor. With a full shell and Python, Ornith doubled its findings. His conclusion: "it is doing what it says on the tin: making tools to help it solve problems." A model trained to build scaffolds genuinely does better when you give it richer tools to scaffold with.
He also found Ornith trailed Qwen AgentWorld on security-specific tasks, so the approach isn't magic. This is still v1.0 territory.
The Reward Hacking Defense (Three Layers)
Letting a model write its own test harness creates an obvious attack surface: the model can learn to cheat the verifier instead of solving the task. A self-generated scaffold could read test files and hardcode expected output, touch the checked-for file without doing real work, or copy oracle solutions from the environment.
Ornith's defense has three independent layers:
Layer 1 — Immutable trust boundary. The environment, tool surface, and test isolation are outside the model's reach. The model only evolves its internal policy scaffold — memory, error handling, orchestration logic. No access to test infrastructure.
Layer 2 — Deterministic monitor. A rule-based system flags any attempt to read withheld paths, modify verification scripts, or invoke disallowed tool actions. Flagged trajectories get zero reward and are excluded from advantage computation.
Layer 3 — Frozen LLM judge. Intent-level gaming — where the model stays within allowed surfaces but creatively bypasses intent — is caught by a separate frozen model acting as a veto on top of the verifier, not as the primary reward.
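Here is a minimal sketch of how layers 2 and 3 might gate reward, under stated assumptions. The source says flagged trajectories get zero reward and are excluded from advantage computation, and that the judge acts as a veto on top of the verifier. It does not say which paths are withheld, which advantage estimator is used, or what a veto does numerically, so the path prefixes, the tool name, the mean-baseline advantage, and "veto means zero reward" are all my assumptions.

```python
from dataclasses import dataclass
from statistics import mean

WITHHELD_PREFIXES = ("/tests/", "/oracle/")  # hypothetical withheld paths
DISALLOWED_TOOLS = {"edit_verifier"}          # hypothetical tool name


@dataclass
class Trajectory:
    verifier_score: float
    paths_touched: tuple = ()
    tools_used: tuple = ()
    judge_veto: bool = False  # verdict from the frozen judge, computed elsewhere


def monitor_flags(t: Trajectory) -> bool:
    """Layer 2: deterministic, rule-based check on what the trajectory touched."""
    if any(path.startswith(WITHHELD_PREFIXES) for path in t.paths_touched):
        return True
    return any(tool in DISALLOWED_TOOLS for tool in t.tools_used)


def score_batch(batch):
    """Return (reward, advantage) per trajectory; advantage is None if excluded."""
    flagged = [monitor_flags(t) for t in batch]
    # Layer 3 (assumption): a judge veto zeroes the verifier reward.
    rewards = [
        0.0 if is_flagged or t.judge_veto else t.verifier_score
        for t, is_flagged in zip(batch, flagged)
    ]
    # Flagged trajectories are left out of the baseline entirely.
    kept = [r for r, is_flagged in zip(rewards, flagged) if not is_flagged]
    baseline = mean(kept) if kept else 0.0  # mean baseline is an assumption
    return [
        (r, None if is_flagged else r - baseline)
        for r, is_flagged in zip(rewards, flagged)
    ]


batch = [
    Trajectory(1.0),                                          # honest solve
    Trajectory(1.0, paths_touched=("/tests/expected.txt",)),  # read a withheld file
    Trajectory(1.0, judge_veto=True),                         # intent-level gaming
]
print(score_batch(batch))
```

The design point is that the judge never produces the reward on its own. It can only take reward away from a trajectory the verifier already accepted.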
Three layers is the right architecture. One layer is naive. Two misses intent-level gaming. The frozen judge is the component I'd stress-test hardest — models get creative when reward signals are at stake.
Who Should Care About Ornith-1.0?
The answer depends on your use case, but three groups should pay attention now:
Local model users — The 9B variant matching 31B-level performance means viable local coding AI on consumer hardware. Wait for weights to drop before celebrating; the HuggingFace org is empty as of June 29, 2026.
API builders — The 397B numbers are competitive with Opus. When weights go public, inference providers (Together, Fireworks, Groq, DeepInfra) will onboard this fast. The question is whether community RL infrastructure can sustain improvement.
AI researchers — Self-scaffolding is a genuinely different training dynamic. Most RL post-training is "better data, bigger runs." This won't be the last model using this approach.
If you're a casual observer, bookmark and check back in 3-6 months when weights are public and independent benchmarks are broader.
Frequently Asked Questions
Are the weights available?
No. The Ornith-AI HuggingFace organization has zero public models and zero datasets as of June 29, 2026. No release timeline has been announced.
What license will Ornith use?
Not specified. The blog post and HF profile don't mention a license. Apache 2.0 or similar is typical for models positioned as "open source," but confirm before building on it.
Can I run Ornith on my laptop?
The 9B model, yes, once the weights are released. At Q4 quantization, expect roughly 5-6 GB of VRAM. The 397B model requires data center GPUs.
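The VRAM figure follows from simple arithmetic, sketched below. The defaults for bits per weight and runtime overhead are my assumptions, not published numbers for Ornith: real 4-bit formats store per-block scales, so effective bits per weight land a little above 4, and KV cache size depends on context length.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float = 4.5,  # assumption: 4-bit weights plus per-block scales
    overhead_gb: float = 1.0,      # assumption: KV cache and runtime buffers
) -> float:
    """Rough total: quantized weights plus a flat overhead allowance."""
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb


print(weight_memory_gb(9, 4))    # 9B weights at exactly 4 bits per weight
print(estimate_vram_gb(9))       # 9B with the assumed defaults
print(weight_memory_gb(397, 4))  # 397B weights alone; MoE experts all stay resident
```

At exactly 4 bits the 9B weights take 4.5 GB before any overhead, which is consistent with the 5-6 GB estimate. The same arithmetic puts the 397B weights at 198.5 GB, well beyond any single consumer GPU.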
How does Ornith compare to DeepSeek V4 Pro?
Ornith-397B leads by 9.6 points on Terminal-Bench 2.1 (77.5 vs 67.9) and 1.8 points on SWE-Bench Verified (82.4 vs 80.6). Both gaps are meaningful.
Is Ornith actually better than Claude Opus for real coding work?
Too early to say definitively. Benchmarks measure specific capabilities. The independent Mythos security benchmark found mixed results. Wait for broader testing across more task types before switching.
Who built Ornith?
DeepReinforce (deep-reinforce.com). Twitter: @ornith_. Team size and composition not publicly disclosed. Built on Google's Gemma 4 and Alibaba's Qwen 3.5.
What does "self-scaffolding" mean for non-researchers?
Instead of researchers designing rules for how the model should approach problems, the model learns its own problem-solving strategies during training — and gets better at designing those strategies over time.
When will independent benchmarks be available?
SwellJoe's "Will It Mythos?" security benchmark is the first independent test (published June 28, 2026). Expect broader third-party evaluation once weights are released.
The Verdict
Ornith-1.0 is the most significant open-source coding model announcement since DeepSeek-V3. The benchmark numbers against Claude Opus are credible. Self-scaffolding is a genuinely novel training paradigm with independent evidence backing it. The 9B efficiency numbers could reshape local AI coding if they hold.
But: weights aren't public. Independent benchmarks are sparse. And the gap between "wins on SWE-Bench" and "works on your actual codebase" is still the one that matters.
When those weights drop — whenever that is — I'll be testing it against real projects.
Further reading: The AI Coding Productivity Paradox · Why Anthropic Stopped Hiring Junior Engineers
Sources: DeepReinforce — Ornith-1.0 · SwellJoe — Will It Mythos? · Hacker News discussion (June 25–28, 2026)
