M4v3R 14 hours ago

2000 tokens per second is absolutely insane for a model that's on par with GPT-4.1. However, throughput is only one part of the equation; the other is latency. Right now the latency for every API call is quite high: it takes a few seconds to receive the first token on every call. That makes it less exciting for agentic use, where many API calls are made in quick succession. I wish providers focused more on this part.
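
Back-of-the-envelope sketch of why that matters; every number here (TTFT, throughput, tokens per call, number of calls) is an assumption for illustration, not a measurement:

    # Rough model of one agentic coding task: N sequential API calls, each
    # paying time-to-first-token (TTFT) latency plus generation time.
    # All numbers below are illustrative assumptions.
    TTFT_S = 2.5          # assumed seconds before the first token arrives
    TOKENS_PER_S = 2000   # advertised generation throughput
    OUTPUT_TOKENS = 800   # assumed output tokens per call
    NUM_CALLS = 30        # assumed tool-use round trips per task

    gen_s = OUTPUT_TOKENS / TOKENS_PER_S
    per_call_s = TTFT_S + gen_s
    total_s = NUM_CALLS * per_call_s
    print(f"per call: {per_call_s:.2f}s (only {gen_s:.2f}s of that is generation)")
    print(f"per task: {total_s:.0f}s, of which {NUM_CALLS * TTFT_S:.0f}s is waiting on TTFT")

With numbers like these, generation is basically free and almost the entire wall-clock time is first-token latency.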

retreatguru 19 hours ago

I'm looking forward to trying this out.

What I'd like to try: use Claude Code as the interface, set up claude-code-router to point it at Qwen3 Coder on Cerebras, and see the ~20x speedup. The speed difference might make up for the slightly lower intelligence compared to Sonnet or Opus.
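
Independent of the router plumbing, Cerebras exposes an OpenAI-compatible API, so a minimal smoke test would look roughly like this; the base URL and especially the model id are my guesses, so check the Cerebras docs before relying on them:

    # Minimal smoke test against Cerebras's OpenAI-compatible endpoint,
    # separate from the claude-code-router setup described above.
    # Base URL and model id are assumptions -- verify against the Cerebras docs.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.cerebras.ai/v1",   # assumed endpoint
        api_key=os.environ["CEREBRAS_API_KEY"],
    )
    resp = client.chat.completions.create(
        model="qwen-3-coder-480b",               # hypothetical model id
        messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    )
    print(resp.choices[0].message.content)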

I don't see Qwen3 Coder available yet on OpenRouter: https://openrouter.ai/provider/cerebras

  • gnulinux 15 hours ago

    It's averaging $0.30/1M input tokens and $1.20/1M output tokens. That's kind of mind-blowingly cheap for a model of its caliber. Gemini 2.5 Pro is 4-8x that price.

gnulinux 15 hours ago

At $2/1M tokens it's cheaper than e.g. Gemini 2.5 Pro ($1.25/1M input and $10/1M output). When I code with Aider, my requests average something like 5,000 input tokens and 800 output tokens. At those rates, Gemini 2.5 Pro comes to about $0.01425 per Aider request and Cerebras Qwen3 Coder to $0.0116 per request (rough arithmetic sketched below). Not a huge difference, but cheap enough to be competitive, especially given that Qwen3-Coder is on par with Gemini/Claude/o3 and even surpasses them on some tests.

NOTE: On OpenRouter right now, Qwen3-Coder requests are averaging $0.30/1M input tokens and $1.20/1M output tokens. That's so much cheaper that I wouldn't be surprised if open-weight models start eating Google's/Anthropic's/OpenAI's lunch. https://openrouter.ai/qwen/qwen3-coder
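
For concreteness, the arithmetic behind those per-request figures; the 5,000/800 token averages are just my own rough Aider numbers:

    # Per-request cost for a typical Aider call; prices are USD per 1M tokens.
    # Token averages are this commenter's own rough numbers, not a benchmark.
    INPUT_TOKENS, OUTPUT_TOKENS = 5000, 800

    def cost(input_price, output_price):
        return INPUT_TOKENS / 1e6 * input_price + OUTPUT_TOKENS / 1e6 * output_price

    print(f"Gemini 2.5 Pro:               ${cost(1.25, 10.0):.5f}")  # ~$0.01425
    print(f"Cerebras Qwen3 Coder ($2/1M): ${cost(2.00, 2.00):.5f}")  # ~$0.01160
    print(f"OpenRouter Qwen3-Coder:       ${cost(0.30, 1.20):.5f}")  # ~$0.00246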

  • pkaye 13 hours ago

    Do you have any experience with how good Qwen3-Coder is compared to Claude 4 Sonnet?

pxc 14 hours ago

This feels way less annoying to use than ChatGPT. But I wonder how much of the effect is lost when the tool does many of the things that make models like o3 useful (repeated web searches, running code in a sandbox, etc.).

For code generation, this does seem pretty useful with something like Qwen3-Coder-480B, if that generates good enough code for your purposes.

But for chat, I wonder: does this kind of speed call for models that behave pretty differently from current ones? With virtually instant responses, I find myself wanting much shorter answers sometimes. Maybe a model designed and trained for concision, with a context holding lots and lots of turns, would be a uniquely useful option on this kind of hardware.

But I guess the hardware is really for training, right, and the inference-as-a-service stuff is basically a powerful form of marketing?

alcasa 17 hours ago

Really cool, especially once 256k context size becomes available.

I think higher inference speed will be a key differentiator in AI tool quality from a user perspective, especially in use cases where model quality is already good enough for human-in-the-loop usage.