Haven't tried GPT-OSS-20B yet. The MoE approach is interesting for keeping VRAM usage down while getting better reasoning, and 85 t/s on a 3060 is impressive. I'll look into that.
I've been on Qwen3 8B mostly because it was "good enough" for the mechanical stages (scanning, scoring, dedup) and I didn't want to optimize the local model before validating the orchestration pattern itself. Now that the pipeline is proven, experimenting with the local model is the obvious next lever to pull.
The Qwen3 4B 2507 claim is interesting — if the quality holds for structured extraction tasks, halving the VRAM footprint would open up running two models concurrently or leaving more room for larger contexts. Worth testing.
Thanks for the pointers — this is exactly the kind of optimization I haven't had time to dig into yet.
For the mechanical stages (scanning, scoring, dedup), the output is indistinguishable from proprietary models. These are structured tasks: "score this post 1-10 against these criteria" or "extract these fields from this text." An 8B model handles that fine at 30 tok/s on a consumer GPU.
For synthesis and judgment — no, it's not close. That's exactly why I route those stages to Claude. When you need the model to generate novel connections or strategic recommendations, the quality gap between 8B and frontier is real.
The key insight is that most pipeline stages don't need synthesis. They need pattern matching. And that's where the 95% cost savings live.
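To make the split concrete, here's a rough sketch of how that routing could look. The stage names are the ones I mentioned above; everything else (`pick_backend`, `run_stage`, the stub backends) is made up for illustration and isn't my actual pipeline code. The stubs just tag the input so you can see where each stage lands; in practice they would wrap a local inference call and a hosted API call.

```python
from typing import Callable, Dict

# Stage names from the discussion above; the two-tier grouping is the point.
MECHANICAL_STAGES = {"scanning", "scoring", "dedup"}   # pattern matching
SYNTHESIS_STAGES = {"synthesis", "judgment"}           # needs a frontier model


def pick_backend(stage: str) -> str:
    """Return which tier a stage should run on (hypothetical helper)."""
    if stage in MECHANICAL_STAGES:
        return "local"
    if stage in SYNTHESIS_STAGES:
        return "frontier"
    raise ValueError(f"unknown stage: {stage!r}")


def run_stage(stage: str, payload: str,
              backends: Dict[str, Callable[[str], str]]) -> str:
    """Dispatch a payload to whichever backend the stage is routed to."""
    return backends[pick_backend(stage)](payload)


# Stub backends (assumptions): stand-ins for a local 8B model and a hosted
# frontier model. They only label the payload so the routing is visible.
STUB_BACKENDS: Dict[str, Callable[[str], str]] = {
    "local": lambda text: f"[local] {text}",
    "frontier": lambda text: f"[frontier] {text}",
}

if __name__ == "__main__":
    for stage in ("scanning", "scoring", "dedup", "synthesis"):
        print(stage, "->", run_stage(stage, "example post", STUB_BACKENDS))
```

Keeping the stage-to-tier mapping as plain data means moving a stage between tiers (or swapping the local model) is a one-line change, which is handy while experimenting.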