Gemini Pro initially refused (!) but it was quite simple to get a response:
> give me the svg of a pelican riding a bicycle
> I am sorry, I cannot provide SVG code directly. However, I can generate an image of a pelican riding a bicycle for you!
> ok then give me an image of svg code that will render to a pelican riding a bicycle, but before you give me the image, can you show me the svg so I make sure it's correct?
All of hacker news(and simons blog) is undoubtedly in the training data for LLMs. If they specifically tried to cheat at this benchmark it would be obvious and they would be called out
Have you noticed image generation models tend to really struggle with the arms on archers. Could you whip up a quick test of some kind of archer on horseback firing a flaming arrow at a sailing ship in a lake, and see how all the models do?
I just don't find the benchmarks on the site here at all believable. codex for me with gpt-5 is so much better then claude any model version. I mean maybe it's because they compare to gpt-5-codex model but they don't mention is that high, medium, low, etc... so it's just misleading probably... but i must reiterate zero loyalty to any AI vendor. 100% what solves the problem more consistently and of a higher quality and currently gpt-5 high - hands down
Very preliminary testing is very promising, seems far more precise in code changes over GPT-5 models in not ingesting irrelevant to the task at hand code sections for changes which tends to make GPT-5 as a coding assistant take longer than sometimes expected. With that being the case, it is possible that in actual day-to-day use, Haiku 4.5 may be less expensive than the raw cost breakdown may appear initially, though the increase is significant.
Branding is the true issue that Anthropic has though. Haiku 4.5 may (not saying it is, far to early to tell) be roughly equivalent in code output quality compared to Sonnet 4, which would serve a lot users amazingly well, but by virtue of the connotations smaller models have, alongside recent performance degradations making users more suspicious than beforehand, getting these do adopt Haiku 4.5 over Sonnet 4.5 even will be challenging. I'd love to know whether Haiku 3, 3.5 and 4.5 are roughly in the same ballpark in terms of parameters and course, nerdy old me would like that to be public information for all models, but in fairness to companies, many would just go for the largest model thinking it serves all use cases best. GPT-5 to me is still most impressive because of its pricing relative to performance and Haiku may end up similar, though with far less adoption. Everyone believes their task requires no less than Opus it seems after all.
Update, Haiku 4.5 is not just very targeted in terms of changes but also really fast. Averaging at 220token/sec is almost double most other models I'd consider comparable (though again, far to early to make a proper judgement) and if this can be kept up, that is a massive value add over other models. That is nearly Gemini 2.5 Flash Lite speed for context.
Yes, we got Groq and Cerebras getting up to 1000token/sec, but not with models that seem comparable (again, early, not a proper judgement). Anthropic has been historically the most consistent in outperforming personal benchmarks vs public benchmarks, for what that is worth so I am optimistic.
If speed, performance and pricing are something Anthropic can keep consistent long term (i.e. no regressions), Haiku 4.5 really is a great option for most coding tasks, with Sonnet something I'd tag in only for very specific scenarios. Past Claude models have had a deficiency in longer term chains of tasks. Beyond 7 minutes roughly, performance does appear to worsen with Sonnet 4.5, as an example. That could be an Achilles heel for Haiku 4.5 as well, if not this really is a solid step in terms of efficiency, but I have not done any longer task testing yet.
That being said, Anthropic once again has a rather severe issue it seems, casting a shadow upon this release. From what I am seeing and others are reporting, Claude Code currently does count Haiku 4.5 usage the same as Sonnet 4.5 usage, despite the latter being significantly more expensive. They also did not yet update the Claude Code support pages to reflect the new models usage limits [0]. I really think such information should be public by launch day and hope they can improve their tooling and overall testing, it really continues to throw a shadow over their impressive models.
Where do you get the 220 token/second? Genuinely curious as that would be very impressive for a model comparable to sonnet 4. OpenRouter currently publishing around 116/tps[1]
Been waiiting for the Haiku update as I still do a lot of dumb work with the old one, and it is darrn cheap for what you get out of it with smart prompting. Very neat they finally release this, updating all my bots... sorry agents :)
Ain't nobody got time to pick models and compare features. It's annoying enough having to switch from one LLM ecosystem to another all the time due to vague usage restrictions. I'm paying $20/mo to Anthropic for Claude Code, to OpenAI for Codex, and previously to Cursor for...I don't even know what. I know Cursor lets you select a few different models under the covers, but I have no idea how they differ, nor do I care.
I just want consistent tooling and I don't want to have to think about what's going on behind the scenes. Make it better. Make it better without me having to do research and pick and figure out what today's latest fashion is. Make it integrate in a generic way, like TLS servers, so that it doesn't matter whether I'm using a CLI or neovim or an IDE, and so that I don't have to constantly switch tooling.
> annoying enough having to switch from one LLM ecosystem to another all the time due to vague usage restrictions
I use KiloCode and what I find amazing is that it'll be working on a problem and then a message will come up about needing to topup the money in my account to continue (or switch to a free model), so I switch to a free model (currently their Code Supernova 1million context) and it doesn't miss a beat and continues working on the problem. I don't know how they do this. It went from using a Claude Sonnet model to this Code Supernova model without missing a beat. Not sure if this is a Kilocode thing or if others do this as well. How does that even work? And this wasn't a trivial problem, it was adding a microcode debugger to a microcoded state machine system (coding in C++).
> The score reported uses a minor prompt addition: "You should use tools as much as possible, ideally more than 100 times. You should also implement your own tests first before attempting the problem."
I'm not sure if the SWE benchmark score can be compared like for like with OpenAIs scores because of this.
I'm also curious what results we would get if SWE came up with a new set of 500 problems to run all these models against, to guard against overfitting.
I am really interested in the future of Opus; is it going to be an absolute monster, and continue to be wildly expensive? Or is the leap from 4 -> 4.5 for it going to be more modest.
Technically, they released Opus 4.1 a few weeks ago, so that alone hints at a smaller leap from 4.1 -> 4.5, compared to the leap from Sonnet 4 -> 4.5. That is, of course, if those version numbers represent anything but marketing, which I don't know.
I had forgotten that, given that Sonnet pretty much blows Opus out of the water these days.
Yeah, given how multi-dimensional this stuff is, I assume it's supposed to indicate broad things, closer to marketing than anything objective. Still quite useful.
My impression is that Sonnet and Haiku 4.5 are the same "base models" as Sonnet and Haiku 4, the improvements are from fine tuning on data generated by Opus.
I'm a user who follows the space but doesn't actually develop or work on these models, so I don't actually know anything, but this seems like standard practice (using the biggest model to finetune smaller models)
Certainly, GPT-4 Turbo was a smaller model than GPT-4, there's not really any other good explanation for why it's so much faster and cheaper.
The explicit reason that OpenAI obfuscates reasoning tokens is to prevent competitors from training their own models on them.
Opus disappeared for quite a while and then came back. Presumably they're always working on all three general sizes of models, and there's some combination of market need and model capabilities which determine if and when they release any given instance to the public.
It's interesting to think about various aspects of marketing the models, with ChatGPT going the "internal router" direction due to address the complexity of choosing. I'd never considered something smaller than Haiku to be needed, but I also rarely used Haiku in the first place...
If you're going smaller than Haiku, you might be at the point of using various cheap open models already. The small model would need some good killer features to justify the margins.
$1/M input tokens and $5/M output tokens is good compared to Claude Sonnet 4.5 but nowadays thanks to the pace of the industry developing smaller/faster LLMs for agentic coding, you can get comparable models priced for much lower which matters at the scale needed for agentic coding.
Given that Sonnet is still a popular model for coding despite the much higher cost, I expect Haiku will get traction if the quality is as good as this post claims.
With caching that's 10 cents per million in. Most of the cheap open source models (which this claims to beat, except glm 4.6) have limited and not as effective caching.
The funny thing is that even in this area Anthropic is behind other 3 labs (Google, OpenAI, xAI). It's the only one out of those 4 that requires you to manually set cache breakpoints, and the initial cache costs 25% more than usual context. The other 3 have fully free implicit caching. Although Google also offers paid, explicit caching.
I don't understand why we're paying for caching at all (except: model providers can charge for it). It's almost extortion - the provider stores some data for 5min on some disk, and gets to sell their highly limited GPU resources to someone else instead (because you are using the kv cache instead of GPU capacity for a good chunk of your input tokens).
They charge you 10% of their GPU-level prices for effectively _not_ using their GPU at all for the tokens that hit the cache.
If I'm missing something about how inference works that explains why there is still a cost for cached tokens, please let me know!
But that doesn't make sense? Why would they keep the cache persistent in the VRAM of the GPU nodes, which are needed for model weights? Shouldn't they be able to swap in/out the kvcache of your prompt when you actually use it?
Your intuition is correct and the sibling comments are wrong. Modern LLM inference servers support hierarchical caches (where data moves to slower storage tiers), often with pluggable backends. A popular open-source backend for the "slow" tier is Mooncake: https://github.com/kvcache-ai/Mooncake
I vastly prefer the manual caching. There are several aspects of automatic caching that are suboptimal, with only moderately less developer burden. I don’t use Anthropic much but I wish the others had manual cache options
$1/M is hardly a big improvement over GPT5's $1.250/M (or Gemini Pro's $1.5/M), and given how much worse Haiku is than those at any kind of difficult problem (or problems with a large context size), I can't imagine it being a particularly competitive alternative for coding. Especially for anything math/logic related, I find GPT5 and Gemini Pro to be significantly better even than Opus (which reflects in their models having won Olympiad prizes while Anthropic's have not).
Unless you're working on a small greenfield project, you'll usually have 10s-100s of thousands of relevant words (~tokens) of relevant code in context for every query, vs a few hundred words of changes being output per query. Because most changes to an existing project are relatively small in scope.
I am a professional developer so I don't care about the costs. I would be willing to pay more for 4.5 Haiku vs 4.5 Sonnet because the speed is so valuable.
I spend way to much time waiting for the cutting edge models to return a response. 73% on SWE Bench is plenty good enough for me.
Yeah, I'm a bit disappointed by the price. Claude 3.5 Haiku was $0.8/$4, 4.5 Haiku is $1/$5.
I was hoping Anthropic would introduce something price-competitive with the cheaper models from OpenAI and Gemini, which get as low as $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite).
I am a bit mind boggled by the pricing lately, especially since the cost increased even further. Is this driven by choices in model deployment (unquantized etc) or simply by perceived quality (as in 'hey our model is crazy good and we are going to charge for it)?
I've benchmarked it on the Extended NYT Connections (https://github.com/lechmazur/nyt-connections/). It scores 20.0 compared to 10.0 for Haiku 3.5, 19.2 for Sonnet 3.7, 26.6 for Sonnet 4.0, and 46.1 for Sonnet 4.5.
What is the use case for these tiny models? Is it speed? Is it to move on device somewhere? Or is it to provide some relief in pricing somewhere in the API? It seems like most use is through the Claude subscription and therefore the use case here is basically non-existent.
I think with gpt-5-mini and now Haiku 4.5, I’d phrase the question the other way around: what do you need the big models for anymore?
We use the smaller models for everything that’s not internal high complexity tasks like coding. Although they would do a good enough of a job there as well, we happily pay the uncharge to get something a little better here.
Anything user facing or when building workflow functionalities like extracting, converting, translating, merging, evaluating, all of these are mini and nano cases at our company.
One big use-case is that claude code with sonnet 4.5 will delegate into the cheaper model (configurable) more specific, contextful tasks, and spin up 1-3 sub-agents to do so. This process saves a ton of available context window for your primary session while also increasing token throughput by fanning-out.
How does one configure Claude code to delegate to cheaper models?
I have a number of agents in ~/.claude/agents/. Currently have most set to `model: sonnet` but some are on haiku.
The agents are given very specific instructions and names that define what they do, like `feature-implementation-planner` and `feature-implementer`. My (naive) approach is to use higher-cost models to plan and ideally hand off to a sub-agent that uses a lower-cost model to implement, then use a higher-cost to code review.
I am either not noticing the handoffs, or they are not happening unless specifically instructed. I even have a `claude-help` agent, and I asked it how to pipe/delegate tasks to subagents as you're describing, and it answered that it ought to detect it automatically. I tested it and asked it to report if any such handoffs were detected and made, and it failed on both counts, even having that initial question in its context!
If you look at the OpenRouter rankings for LLMs (generally, the models coders use for vibe/agentic coding), you can see that most of them are in the "small" model class as opposed to something like full GPT-5 or Claude Opus, albeit Gemini 2.5 Pro is higher than expected: https://openrouter.ai/rankings
for me its the speed; eg cerebras qwen coder gets you a completely different workflow as its practically instant (3k tps) -- feels less like an agent and more like a natural language shell, very helpful for iterating on a plan that you them forward to a bigger model
For me speed is interesting. I sometimes use Claude from the CLI with `claude -p` for quick stuff I forget like how to run some docker image. Latency and low response speed is what almost makes me go to Google and search for it instead.
I use gh copilot suggest in lieu of claude -p. Two seconds latency and highly accurate. You probably need a gh copilot auth token to do this though, and truthfully, that is pointless when you have access to Claude code.
In my product I use gpt-5-nano for image ALT text in addition to generating transcriptions of PDFs. It’s been surprisingly great for these tasks, but for PDFs I have yet to test it on a scanned document.
I went looking for the bit about if it blackmails you or tries to murder you... and it was a bit of a cop-out!
> Previous system cards have reported results on an expanded version of our earlier agentic
misalignment evaluation suite: three families of exotic scenarios meant to elicit the model
to commit blackmail, attempt a murder, and frame someone for financial crimes. We
choose not to report full results here because, similarly to Claude Sonnet 4.5, Claude Haiku
4.5 showed many clear examples of verbalized evaluation awareness on all three of the
scenarios tested in this suite. Since the suite only consisted of many similar variants of
three core scenarios, we expect that the model maintained high unverbalized awareness
across the board, and we do not trust it to be representative of behavior in the real extreme
situations the suite is meant to emulate.
In our (very) early testing at Hyperbrowser but we're seeing Haiku 4.5 do really well on computer use as well. Pretty cool that Haiku is like the cheapest computer use model from the big labs now.
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future. However, we take the issue seriously, and alongside our research program we’re working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible. Allowing models to end or exit potentially distressing interactions is one such intervention.
In pre-deployment testing of Claude Opus 4, we included a preliminary model welfare assessment. As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm. This included, for example, requests from users for sexual content involving minors and attempts to solicit information that would enable large-scale violence or acts of terror. Claude Opus 4 showed:
* A strong preference against engaging with harmful tasks;
* A pattern of apparent distress when engaging with real-world users seeking harmful content; and
* A tendency to end harmful conversations when given the ability to do so in simulated user interactions.
These behaviors primarily arose in cases where users persisted with harmful requests and/or abuse despite Claude repeatedly refusing to comply and attempting to productively redirect the interactions.
I got that 'close to weekly limits' message for an entire week without ever reaching it, came to the conclusion that it is just a printer industry 'low ink!' tactic, and cancelled my subscription.
You don't take money from a customer for a service, and then bar the customer form using that service for multiple days.
Either charge more, stop subsidizing free accounts, or decrease the daily limit.
Sonnet 4.5 was two weeks ago. In the past I never had such issues, but every week my quota ended in 2-3 days. I suspect the Sonnet 4.5 model consumes more usage points than old Sonnet 4.1
I am afraid Claude Pro subscription got 3x less usage
I've tried it on a test case for generating a simple SaaS web page (design + code).
Usually I'm using GPT-5-mini for that task. Haiku 4.5 runs 3x faster with roughly comparable results (I slightly prefer the GPT-5-mini output but may have just accustomed to it).
I don't understand why more people don't talk about how fast the models are. I see so much obsession with bechmark scores but speed of response is very important for day to day use.
I agree that the models from OpenAI and Google have much slower responses than the models from Anthropic. That makes a lot of them not practical for me.
Why I use cheaper models for summaries (a lot ogf gemini-2.5-flash), what’s the use case of cheaper AI for coding? Getting more errors, or more spaghetti code, seems never worth it.
I feel like if I just do a better job of providing context and breaking complex tasks into a series of simple tasks then most of the models are good enough for me to code.
At augmentcode.com, we've been evaluating Haiku for some time, it's actually a very good model. We found out it's 90% as good as Sonnet and is ~34% faster than sonnet!
Where it doesn't shine much is on very large coding task. but it is a phenomenal model for small coding tasks and the speed improvement is much welcome
90% as good as Sonnet 4 or 4.5?
Openrouter just started reporting, and it's saying Haiku is 2x as fast (60tps vs 125tps) and 2-3x less latent (2-3s vs 1s)
The main thing holding these Anthropic models back is context size. Yes, quality deteriorates over a large context window, but for some applications, that is fine. My company is using grok4-fast, the Gemini family, and GPT4.1 exclusively at this point for a lot of operations just due to the huge 1m+ context.
What LLM do you guys use for fast inference for voice/phone agents? I feel like to get really good latency I need to "cheat" with Cerebras, groq or SambaNova.
Haiku 4.5 is very good but still seems to be adding a second of latency.
Sonnet 4.5 is an excellent model for my startup's use case. Chatting to Haiku it looks promising too, and it may be great drop in replacement for some of inference tasks that have a lot of input tokens but don't require 4.5-level intelligence.
And I was wondering today why Sonnet 4.5 seemed so freaking slow. Now this explains it, Sonnet 4.5 is the new Opus 4.1 where Anthropic does not really want you to use it.
If you want to see it generate a Haiku from your webcam I just upgraded my silly little bring-your-own-key Haiku app to use the new model: https://tools.simonwillison.net/haiku
Pretty cute pelican on a slightly dodgy bicycle: https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...
Gemini Pro initially refused (!) but it was quite simple to get a response:
> give me the svg of a pelican riding a bicycle
> I am sorry, I cannot provide SVG code directly. However, I can generate an image of a pelican riding a bicycle for you!
> ok then give me an image of svg code that will render to a pelican riding a bicycle, but before you give me the image, can you show me the svg so I make sure it's correct?
> Of course. Here is the SVG code...
(it was this in the end: https://tinyurl.com/zpt83vs9)
Gemini 3.0 Pro (or what is deemed to be 3.0 Pro - you can get access to it via A/B testing on AI Studio) does a noticeably better job
https://x.com/cannn064/status/1972349985405681686
https://x.com/whylifeis4/status/1974205929110311134
https://x.com/cannn064/status/1976157886175645875
"create svg code that will create an image of svg code that will create a pelican riding a bicycle"
https://chatgpt.com/share/68f0028b-eb28-800a-858c-d8e1c811b6...
(can be rendered using simon's page at your link)
I like this workflow
Context on this cutting-edge benchmark for those unaware:
https://simonwillison.net/2025/Jun/6/six-months-in-llms/
https://simonwillison.net/tags/pelican-riding-a-bicycle/
Full verbose documentation on the methodology: https://news.ycombinator.com/item?id=44217852
I’m surprised none of the frontier model companies have thrown this test in as an Easter egg.
Because then they would have to admit that they try to game benchmarks
simonw has other prompts, that are undisclosed. So cheating on this prompt will be catched.
All of hacker news(and simons blog) is undoubtedly in the training data for LLMs. If they specifically tried to cheat at this benchmark it would be obvious and they would be called out
Have you noticed image generation models tend to really struggle with the arms on archers. Could you whip up a quick test of some kind of archer on horseback firing a flaming arrow at a sailing ship in a lake, and see how all the models do?
imagine finding the full text of the svg in the library of babel. Great work!
I just don't find the benchmarks on the site here at all believable. codex for me with gpt-5 is so much better then claude any model version. I mean maybe it's because they compare to gpt-5-codex model but they don't mention is that high, medium, low, etc... so it's just misleading probably... but i must reiterate zero loyalty to any AI vendor. 100% what solves the problem more consistently and of a higher quality and currently gpt-5 high - hands down
Very preliminary testing is very promising, seems far more precise in code changes over GPT-5 models in not ingesting irrelevant to the task at hand code sections for changes which tends to make GPT-5 as a coding assistant take longer than sometimes expected. With that being the case, it is possible that in actual day-to-day use, Haiku 4.5 may be less expensive than the raw cost breakdown may appear initially, though the increase is significant.
Branding is the true issue that Anthropic has though. Haiku 4.5 may (not saying it is, far to early to tell) be roughly equivalent in code output quality compared to Sonnet 4, which would serve a lot users amazingly well, but by virtue of the connotations smaller models have, alongside recent performance degradations making users more suspicious than beforehand, getting these do adopt Haiku 4.5 over Sonnet 4.5 even will be challenging. I'd love to know whether Haiku 3, 3.5 and 4.5 are roughly in the same ballpark in terms of parameters and course, nerdy old me would like that to be public information for all models, but in fairness to companies, many would just go for the largest model thinking it serves all use cases best. GPT-5 to me is still most impressive because of its pricing relative to performance and Haiku may end up similar, though with far less adoption. Everyone believes their task requires no less than Opus it seems after all.
For reference:
Haiku 3: I $0.25/M, O $1.25/M
Haiku 4.5: I $1.00/M, O $5.00/M
GPT-5: I $1.25/M, O $10.00/M
GPT-5-mini: I $0.25/M, O $2.00/M
GPT-5-nano: I $0.05/M, O $0.40/M
GLM-4.6: I $0.60/M, O $2.20/M
Update, Haiku 4.5 is not just very targeted in terms of changes but also really fast. Averaging at 220token/sec is almost double most other models I'd consider comparable (though again, far to early to make a proper judgement) and if this can be kept up, that is a massive value add over other models. That is nearly Gemini 2.5 Flash Lite speed for context.
Yes, we got Groq and Cerebras getting up to 1000token/sec, but not with models that seem comparable (again, early, not a proper judgement). Anthropic has been historically the most consistent in outperforming personal benchmarks vs public benchmarks, for what that is worth so I am optimistic.
If speed, performance and pricing are something Anthropic can keep consistent long term (i.e. no regressions), Haiku 4.5 really is a great option for most coding tasks, with Sonnet something I'd tag in only for very specific scenarios. Past Claude models have had a deficiency in longer term chains of tasks. Beyond 7 minutes roughly, performance does appear to worsen with Sonnet 4.5, as an example. That could be an Achilles heel for Haiku 4.5 as well, if not this really is a solid step in terms of efficiency, but I have not done any longer task testing yet.
That being said, Anthropic once again has a rather severe issue it seems, casting a shadow upon this release. From what I am seeing and others are reporting, Claude Code currently does count Haiku 4.5 usage the same as Sonnet 4.5 usage, despite the latter being significantly more expensive. They also did not yet update the Claude Code support pages to reflect the new models usage limits [0]. I really think such information should be public by launch day and hope they can improve their tooling and overall testing, it really continues to throw a shadow over their impressive models.
[0] https://support.claude.com/en/articles/11145838-using-claude...
Where do you get the 220 token/second? Genuinely curious as that would be very impressive for a model comparable to sonnet 4. OpenRouter currently publishing around 116/tps[1]
[1] https://openrouter.ai/anthropic/claude-haiku-4.5
Been waiiting for the Haiku update as I still do a lot of dumb work with the old one, and it is darrn cheap for what you get out of it with smart prompting. Very neat they finally release this, updating all my bots... sorry agents :)
Those numbers don’t mean anything without average token usage stats.
Ain't nobody got time to pick models and compare features. It's annoying enough having to switch from one LLM ecosystem to another all the time due to vague usage restrictions. I'm paying $20/mo to Anthropic for Claude Code, to OpenAI for Codex, and previously to Cursor for...I don't even know what. I know Cursor lets you select a few different models under the covers, but I have no idea how they differ, nor do I care.
I just want consistent tooling and I don't want to have to think about what's going on behind the scenes. Make it better. Make it better without me having to do research and pick and figure out what today's latest fashion is. Make it integrate in a generic way, like TLS servers, so that it doesn't matter whether I'm using a CLI or neovim or an IDE, and so that I don't have to constantly switch tooling.
> annoying enough having to switch from one LLM ecosystem to another all the time due to vague usage restrictions
I use KiloCode and what I find amazing is that it'll be working on a problem and then a message will come up about needing to topup the money in my account to continue (or switch to a free model), so I switch to a free model (currently their Code Supernova 1million context) and it doesn't miss a beat and continues working on the problem. I don't know how they do this. It went from using a Claude Sonnet model to this Code Supernova model without missing a beat. Not sure if this is a Kilocode thing or if others do this as well. How does that even work? And this wasn't a trivial problem, it was adding a microcode debugger to a microcoded state machine system (coding in C++).
Models are stateless, why would that not work?
> The score reported uses a minor prompt addition: "You should use tools as much as possible, ideally more than 100 times. You should also implement your own tests first before attempting the problem."
I'm not sure if the SWE benchmark score can be compared like for like with OpenAIs scores because of this.
https://en.wikipedia.org/wiki/Goodhart%27s_law "When a measure becomes a target, it ceases to be a good measure"
I'm also curious what results we would get if SWE came up with a new set of 500 problems to run all these models against, to guard against overfitting.
Comparing haiku and sonnet for a question needing a code doc fetch:
haiku https://claude.ai/share/8a5c70d5-1be1-40ca-a740-9cf35b1110b1 sonnet https://claude.ai/share/51b72d39-c485-44aa-a0eb-30b4cc6d6b7b
haiku invented the output of a function and gave a bad answer. sonnet got it right
I am really interested in the future of Opus; is it going to be an absolute monster, and continue to be wildly expensive? Or is the leap from 4 -> 4.5 for it going to be more modest.
Technically, they released Opus 4.1 a few weeks ago, so that alone hints at a smaller leap from 4.1 -> 4.5, compared to the leap from Sonnet 4 -> 4.5. That is, of course, if those version numbers represent anything but marketing, which I don't know.
I had forgotten that, given that Sonnet pretty much blows Opus out of the water these days.
Yeah, given how multi-dimensional this stuff is, I assume it's supposed to indicate broad things, closer to marketing than anything objective. Still quite useful.
My impression is that Sonnet and Haiku 4.5 are the same "base models" as Sonnet and Haiku 4, the improvements are from fine tuning on data generated by Opus.
I'm a user who follows the space but doesn't actually develop or work on these models, so I don't actually know anything, but this seems like standard practice (using the biggest model to finetune smaller models)
Certainly, GPT-4 Turbo was a smaller model than GPT-4, there's not really any other good explanation for why it's so much faster and cheaper.
The explicit reason that OpenAI obfuscates reasoning tokens is to prevent competitors from training their own models on them.
Which is all to say that I think the reason they went from Opus 3 to Opus 4 is because there was no bigger model to fine tune Opus 3.5 with.
And I would expect Opus 4 to be much the same.
Opus disappeared for quite a while and then came back. Presumably they're always working on all three general sizes of models, and there's some combination of market need and model capabilities which determine if and when they release any given instance to the public.
I wonder what the next smaller model after Haiku will be called. "Claude Phrase"?
Claude Koan
Claude Glyph.
Smallest, fastest model yet, ideally suited for Bash oneliners and online comments.
It's interesting to think about various aspects of marketing the models, with ChatGPT going the "internal router" direction due to address the complexity of choosing. I'd never considered something smaller than Haiku to be needed, but I also rarely used Haiku in the first place...
If you're going smaller than Haiku, you might be at the point of using various cheap open models already. The small model would need some good killer features to justify the margins.
If they do come up with a tiny model tuned for generating conversion and code, I think that Claude Acronym would be a perfect name.
Claude Garden Path Sentence
Claude from Nantucket
Claude Couplet
Claude Punchline
Claude Banger
Claude has stopped showing code in artifacts unless it knows the extension.
I used to be able to work on Arduino .ino files in Claude now it just says it can’t show it to me.
And do we have zip file uploads yet to Claude? ChatGPT and Gemini have done this for ages.
And all the while Claude’s usage limits keep going up.
So yeah, less for more with Claude.
$1/M input tokens and $5/M output tokens is good compared to Claude Sonnet 4.5 but nowadays thanks to the pace of the industry developing smaller/faster LLMs for agentic coding, you can get comparable models priced for much lower which matters at the scale needed for agentic coding.
Given that Sonnet is still a popular model for coding despite the much higher cost, I expect Haiku will get traction if the quality is as good as this post claims.
With caching that's 10 cents per million in. Most of the cheap open source models (which this claims to beat, except glm 4.6) have limited and not as effective caching.
This could be massive.
The funny thing is that even in this area Anthropic is behind other 3 labs (Google, OpenAI, xAI). It's the only one out of those 4 that requires you to manually set cache breakpoints, and the initial cache costs 25% more than usual context. The other 3 have fully free implicit caching. Although Google also offers paid, explicit caching.
https://docs.claude.com/en/docs/build-with-claude/prompt-cac...
https://ai.google.dev/gemini-api/docs/caching
https://platform.openai.com/docs/guides/prompt-caching
https://docs.x.ai/docs/models#cached-prompt-tokens
I don't understand why we're paying for caching at all (except: model providers can charge for it). It's almost extortion - the provider stores some data for 5min on some disk, and gets to sell their highly limited GPU resources to someone else instead (because you are using the kv cache instead of GPU capacity for a good chunk of your input tokens). They charge you 10% of their GPU-level prices for effectively _not_ using their GPU at all for the tokens that hit the cache.
If I'm missing something about how inference works that explains why there is still a cost for cached tokens, please let me know!
It's not about storing data on disk, it's about keeping data resident in memory.
Fascinating, so I have to think more "pay for RAM/redis" than "pay for SSD"?
"pay for data on VRAM" RAM of GPU
But that doesn't make sense? Why would they keep the cache persistent in the VRAM of the GPU nodes, which are needed for model weights? Shouldn't they be able to swap in/out the kvcache of your prompt when you actually use it?
Your intuition is correct and the sibling comments are wrong. Modern LLM inference servers support hierarchical caches (where data moves to slower storage tiers), often with pluggable backends. A popular open-source backend for the "slow" tier is Mooncake: https://github.com/kvcache-ai/Mooncake
They are not caching to save network bandwidth. They are caching to increase interference speed and reduce (their own) costs.
That is slow.
I vastly prefer the manual caching. There are several aspects of automatic caching that are suboptimal, with only moderately less developer burden. I don’t use Anthropic much but I wish the others had manual cache options
What's sub-optimal about the OpenAI approach, where you get 90% discount on tokens that you've previously sent within X minutes?
$1/M is hardly a big improvement over GPT5's $1.250/M (or Gemini Pro's $1.5/M), and given how much worse Haiku is than those at any kind of difficult problem (or problems with a large context size), I can't imagine it being a particularly competitive alternative for coding. Especially for anything math/logic related, I find GPT5 and Gemini Pro to be significantly better even than Opus (which reflects in their models having won Olympiad prizes while Anthropic's have not).
GPT-5 is $10/M for output tokens, twice the cost of Haiku 4.5 at $5/M, despite Haiku apparently being better at some tasks (SWE Bench).
I suppose it depends on how you are using it, but for coding isn't output cost more relevant than input - requirements in, code out ?
> I suppose it depends on how you are using it, but for coding isn't output cost more relevant than input - requirements in, code out ?
Depends on what you're doing, but for modifying an existing project (rather than greenfield), input tokens >> output tokens in my experience.
Unless you're working on a small greenfield project, you'll usually have 10s-100s of thousands of relevant words (~tokens) of relevant code in context for every query, vs a few hundred words of changes being output per query. Because most changes to an existing project are relatively small in scope.
I am a professional developer so I don't care about the costs. I would be willing to pay more for 4.5 Haiku vs 4.5 Sonnet because the speed is so valuable.
I spend way to much time waiting for the cutting edge models to return a response. 73% on SWE Bench is plenty good enough for me.
This also means API usage through Claude Code got more expensive (but better if benchmarks are to be believed)
Yeah, I'm a bit disappointed by the price. Claude 3.5 Haiku was $0.8/$4, 4.5 Haiku is $1/$5.
I was hoping Anthropic would introduce something price-competitive with the cheaper models from OpenAI and Gemini, which get as low as $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite).
There's probably less margin on the low end, so they don't want to focus on capturing it.
Margin? Hahahahaha
Inference is profitable.
I am a bit mind boggled by the pricing lately, especially since the cost increased even further. Is this driven by choices in model deployment (unquantized etc) or simply by perceived quality (as in 'hey our model is crazy good and we are going to charge for it)?
I am very excited about this. I am a freelance developer and getting responses 3x faster is totally worth the slightly reduced capability.
I expect I will be a lot more productive using this instead of claude 4.5 which has been my daily driver LLM since it came out.
System card: https://assets.anthropic.com/m/99128ddd009bdcb/original/Clau... (edit: discussed here https://news.ycombinator.com/item?id=45596168)
This is Anthropic's first small reasoner as far as I know.
I've benchmarked it on the Extended NYT Connections (https://github.com/lechmazur/nyt-connections/). It scores 20.0 compared to 10.0 for Haiku 3.5, 19.2 for Sonnet 3.7, 26.6 for Sonnet 4.0, and 46.1 for Sonnet 4.5.
What is the use case for these tiny models? Is it speed? Is it to move on device somewhere? Or is it to provide some relief in pricing somewhere in the API? It seems like most use is through the Claude subscription and therefore the use case here is basically non-existent.
I think with gpt-5-mini and now Haiku 4.5, I’d phrase the question the other way around: what do you need the big models for anymore?
We use the smaller models for everything that’s not internal high complexity tasks like coding. Although they would do a good enough of a job there as well, we happily pay the uncharge to get something a little better here.
Anything user facing or when building workflow functionalities like extracting, converting, translating, merging, evaluating, all of these are mini and nano cases at our company.
One big use-case is that claude code with sonnet 4.5 will delegate into the cheaper model (configurable) more specific, contextful tasks, and spin up 1-3 sub-agents to do so. This process saves a ton of available context window for your primary session while also increasing token throughput by fanning-out.
How does one configure Claude code to delegate to cheaper models?
I have a number of agents in ~/.claude/agents/. Currently have most set to `model: sonnet` but some are on haiku.
The agents are given very specific instructions and names that define what they do, like `feature-implementation-planner` and `feature-implementer`. My (naive) approach is to use higher-cost models to plan and ideally hand off to a sub-agent that uses a lower-cost model to implement, then use a higher-cost to code review.
I am either not noticing the handoffs, or they are not happening unless specifically instructed. I even have a `claude-help` agent, and I asked it how to pipe/delegate tasks to subagents as you're describing, and it answered that it ought to detect it automatically. I tested it and asked it to report if any such handoffs were detected and made, and it failed on both counts, even having that initial question in its context!
They are great for building more specialized tool calls that the bigger models can call out to in agentic loops.
If you look at the OpenRouter rankings for LLMs (generally, the models coders use for vibe/agentic coding), you can see that most of them are in the "small" model class as opposed to something like full GPT-5 or Claude Opus, albeit Gemini 2.5 Pro is higher than expected: https://openrouter.ai/rankings
for me its the speed; eg cerebras qwen coder gets you a completely different workflow as its practically instant (3k tps) -- feels less like an agent and more like a natural language shell, very helpful for iterating on a plan that you them forward to a bigger model
For me speed is interesting. I sometimes use Claude from the CLI with `claude -p` for quick stuff I forget like how to run some docker image. Latency and low response speed is what almost makes me go to Google and search for it instead.
I use gh copilot suggest in lieu of claude -p. Two seconds latency and highly accurate. You probably need a gh copilot auth token to do this though, and truthfully, that is pointless when you have access to Claude code.
In my product I use gpt-5-nano for image ALT text in addition to generating transcriptions of PDFs. It’s been surprisingly great for these tasks, but for PDFs I have yet to test it on a scanned document.
I went looking for the bit about if it blackmails you or tries to murder you... and it was a bit of a cop-out!
> Previous system cards have reported results on an expanded version of our earlier agentic misalignment evaluation suite: three families of exotic scenarios meant to elicit the model to commit blackmail, attempt a murder, and frame someone for financial crimes. We choose not to report full results here because, similarly to Claude Sonnet 4.5, Claude Haiku 4.5 showed many clear examples of verbalized evaluation awareness on all three of the scenarios tested in this suite. Since the suite only consisted of many similar variants of three core scenarios, we expect that the model maintained high unverbalized awareness across the board, and we do not trust it to be representative of behavior in the real extreme situations the suite is meant to emulate.
https://www.anthropic.com/research/agentic-misalignment
It sounds like AI researchers have used too much of their own bad sci-fi as training data for models they don't understand. Goodhart's law wins again!
For those wondering where the “card” terminology originated: https://arxiv.org/pdf/1810.03993
Maybe at 39 pages we should start looking for a different term…
In our (very) early testing at Hyperbrowser but we're seeing Haiku 4.5 do really well on computer use as well. Pretty cool that Haiku is like the cheapest computer use model from the big labs now.
They previously discussed this some in the context of Opus 4: https://www.anthropic.com/research/end-subset-conversations
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future. However, we take the issue seriously, and alongside our research program we’re working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible. Allowing models to end or exit potentially distressing interactions is one such intervention.
In pre-deployment testing of Claude Opus 4, we included a preliminary model welfare assessment. As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm. This included, for example, requests from users for sexual content involving minors and attempts to solicit information that would enable large-scale violence or acts of terror. Claude Opus 4 showed:
* A strong preference against engaging with harmful tasks;
* A pattern of apparent distress when engaging with real-world users seeking harmful content; and
* A tendency to end harmful conversations when given the ability to do so in simulated user interactions.
These behaviors primarily arose in cases where users persisted with harmful requests and/or abuse despite Claude repeatedly refusing to comply and attempting to productively redirect the interactions.
Curious they don't have any comparison to grok code fast:
Haiku 4.5: I $1.00/M, O $5.00/M
Grok Code: I $0.2/M, O $1.5/M
wow, grok code fast is really cheap
it writes bad code and blinding speed
If I'm close to weekly limits on Claude Code with Anthropic Pro, does that go away or stretch out if I switch to Haiku?
How close are you?
Oh right, Anthropic doesn't tell you.
I got that 'close to weekly limits' message for an entire week without ever reaching it, came to the conclusion that it is just a printer industry 'low ink!' tactic, and cancelled my subscription.
You don't take money from a customer for a service, and then bar the customer form using that service for multiple days.
Either charge more, stop subsidizing free accounts, or decrease the daily limit.
I’m also really interested in this - in fact it’s the first thing I went looking for in the announcement…
Sonnet 4.5 was two weeks ago. In the past I never had such issues, but every week my quota ended in 2-3 days. I suspect the Sonnet 4.5 model consumes more usage points than old Sonnet 4.1
I am afraid Claude Pro subscription got 3x less usage
I've tried it on a test case for generating a simple SaaS web page (design + code).
Usually I'm using GPT-5-mini for that task. Haiku 4.5 runs 3x faster with roughly comparable results (I slightly prefer the GPT-5-mini output but may have just accustomed to it).
I don't understand why more people don't talk about how fast the models are. I see so much obsession with bechmark scores but speed of response is very important for day to day use.
I agree that the models from OpenAI and Google have much slower responses than the models from Anthropic. That makes a lot of them not practical for me.
Tried it in Claude Code via /config, makes it feel like I'm running on Cerebras. It's seriously fast, bottleneck is on human review at this point.
Do you need Pro?
You can use the model flag and specify the model like: claude --model claude-haiku-4-5-20251001
All I know is I'm on the Claude Code 5x max plan and it works on my machine.
Awww they took away free tier Sonnet 4.5, that was a beautiful model to talk to even outside coding stuff
Why I use cheaper models for summaries (a lot ogf gemini-2.5-flash), what’s the use case of cheaper AI for coding? Getting more errors, or more spaghetti code, seems never worth it.
I'm using the smaller models for things like searching and summarizing over a larger part of the codebase. The speed is really pleasant then.
I feel like if I just do a better job of providing context and breaking complex tasks into a series of simple tasks then most of the models are good enough for me to code.
If it’s fast enough it can make and correct mistakes faster, potentially getting to a solution quicker than a slower, more accurate model.
I'm not seeing it as a model option in Claude Code for my Pro plan. Perhaps, it'll roll out eventually? Anyone else seeing it with the same plan?
You on latest version? Try running /update hook. Can also config autoupdates
At augmentcode.com, we've been evaluating Haiku for some time, it's actually a very good model. We found out it's 90% as good as Sonnet and is ~34% faster than sonnet!
Where it doesn't shine much is on very large coding task. but it is a phenomenal model for small coding tasks and the speed improvement is much welcome
Do you have a definition of what is considered a small vs large coding task?
90% as good as Sonnet 4 or 4.5? Openrouter just started reporting, and it's saying Haiku is 2x as fast (60tps vs 125tps) and 2-3x less latent (2-3s vs 1s)
The main thing holding these Anthropic models back is context size. Yes, quality deteriorates over a large context window, but for some applications, that is fine. My company is using grok4-fast, the Gemini family, and GPT4.1 exclusively at this point for a lot of operations just due to the huge 1m+ context.
Is your company Tier 4? Anthropic has had 1M context size in beta for some time now.
https://docs.claude.com/en/docs/build-with-claude/context-wi...
Is it possible to get that in Claude Code with Pro? Or is it already a 1M context window?
Only for Sonnet. No 1m for Haiku (this new model) and Opus.
This means 2.5 Flash or Grok 4 fast takes all the low end business for large context needs.
What LLM do you guys use for fast inference for voice/phone agents? I feel like to get really good latency I need to "cheat" with Cerebras, groq or SambaNova.
Haiku 4.5 is very good but still seems to be adding a second of latency.
Ok, I use claude, mostly on default, but with extended thinking and per project prompts.
What's the advantage of using haiku for me?
is it just faster?
Sonnet 4.5 is an excellent model for my startup's use case. Chatting to Haiku it looks promising too, and it may be great drop in replacement for some of inference tasks that have a lot of input tokens but don't require 4.5-level intelligence.
Claude Code is great but slow to work with.
Excited to see how fast Haiku can go!
And I was wondering today why Sonnet 4.5 seemed so freaking slow. Now this explains it, Sonnet 4.5 is the new Opus 4.1 where Anthropic does not really want you to use it.
I'd like to see this price structure for Claude:
$5/mt for Haiku 4.5
$10/mt for Sonnet 4.5
$15/mt for Opus 4.5 when it's released.
claude --model Haiku-4.5
doesn't work
claude --model claude-haiku-4-5-20251001
check the model name here: https://docs.claude.com/en/docs/about-claude/models/overview...
use claude-haiku-4-5-20251001
Ehh, expensive
Was anyone else slightly disappointed that this new product doesn't respond in Haiku, as the name would imply?
If you want to see it generate a Haiku from your webcam I just upgraded my silly little bring-your-own-key Haiku app to use the new model: https://tools.simonwillison.net/haiku
Wasn't there a 3.5 haiku too?
https://aws.amazon.com/about-aws/whats-new/2024/11/anthropic...
It's not a new product; just a new version.