As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between
- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)
- o3-mini (web search, CoT, canvas, but no image generation)
- o1 (CoT, maybe better than o3, but no canvas or web search and also no images)
- Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
- 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
- 4o "with scheduled tasks" (why on earth is that a model and not a tool that the other models can use!?)
> - Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
Same here, which is a real shame. I've switched to DeepResearch with Gemini 2.5 Pro over the last few days where paid users have a 20/day limit instead of 10/month and it's been great, especially since now Gemini seems to browse 10x more pages than OpenAI Deep Research (on the order of 200-400 pages versus 20-40).
The reports are too verbose but having it research random development ideas, or how to do something particularly complex with a specific library, or different approaches or architectures to a problem has been very productive without sliding into vibe coding territory.
Wow, I wondered what the limit was. I never checked, but I've been using it hesitantly since I burn up OpenAI's limit as soon as it resets. Thanks for the clarity.
I'm all-in on Deep Research. It can conduct research on niche historical topics that have no central articles in minutes, which typically were taking me days or weeks to delve into.
I like Deep Research but as a historian I have to tell you. I've used it for history themes to calibrated my expectations and it is a nice tool but... It can easily brush over nuanced discussions and just return folk wisdom from blogs.
What I love most about history is it has lots of irreducible complexity and poring over the literature, both primary and secondary sources, is often the only way to develop an understanding.
I read Being and Time recently and it has a load of concepts that are defined iteratively. There's a lot wrong with how it's written but it's an unfinished book written a 100 years ago so, I cant complain too much.
Because it's quite long, if I asked Perplexity* to remind me what something meant, it would very rarely return something helpful, but, to be fair, I cant really fault it for being a bit useless with a very difficult to comprehend text, where there are several competing styles of reading, many of whom are convinced they are correct.
But I started to notice a pattern of where it would pull answers from some weird spots, especially when I asked it to do deep research. Like, a paper from a University's server that's using concepts in the book to ground qualitative research, which is fine and practical explications are often useful ways into a dense concept, but it's kinda a really weird place to be the first initial academic source. It'll draw on Reddit a weird amount too, or it'll somehow pull a page of definitions from a handout for some University tutorial. And it wont default to the peer reviewed free philosophy encyclopedias that are online and well known.
It's just weird. I was just using it to try and reinforce my actual reading of the text but I more came away thinking that in certain domains, this end of AI is allowing people to conflate having access to information, with learning about something.
When I've wanted it to not do things like this, I've had good luck directing it to... not look at those sources.
For example when I've wanted to understand an unfolding story better than the news, I've told it to ignore the media and go only to original sources (e.g. speech transcripts, material written by the people involved, etc.)
Deep Search is pretty good for current news stories. I've had it analyze some legal developments in a European nation recently and it gave me a great overview.
LLMs seem fantastic at generalizing broad thought and is not great at outliers. It sort of smooths over the knowledge curve confidently, which is a bit like in psychology where only CBT therapy is accepted, even if there are many much more highly effectual methodologies on individuals, just not at the population level.
Interesting use case. My problem is that for niche subjects the crawled pages probably have not captured the information and the response becomes irrelevant. Perhaps gemini will produce better results just because it takes into account much more pages
o1-pro: anything important involving accuracy or reasoning. Does the best at accomplishing things correctly in one go even with lots of context.
deepseek R1: anything where I want high quality non-academic prose or poetry. Hands down the best model for these. Also very solid for fast and interesting analytical takes. I love bouncing ideas around with R1 and Grok-3 bc of their fast responses and reasoning. I think R1 is the most creative yet also the best at mimicking prose styles and tone. I've speculated that Grok-3 is R1 with mods and think it's reasonably likely.
4o: image generation, occasionally something else but never for code or analysis. Can't wait till it can generate accurate technical diagrams from text.
o3-mini-high and grok-3: code or analysis that I don't want to wait for o1-pro to complete.
claude 3.7: occasionally for code if the other models are making lots of errors. Sometimes models will anchor to outdated information in spite of being informed of newer information.
gemini models: occasionally I test to see if they are competitive, so far not really, though I sense they are good at certain things. Excited to try 2.5 Deep Research more, as it seems promising.
Perplexity: discontinued subscription once the search functionality in other models improved.
I'm really looking forward to o3-pro. Let's hope it's available soon as there are some things I'm working on that are on hold waiting for it.
I really loved Phind and always think of it as the OG perplexity / RAG search engine.
Sadly stopped my subscription, when you removed the ability to weight my own domains...
Otherwise the fine-tune for your output format for technical questions is great, with the options, the pro/contra and the mermaid diagrams. Just way better for technical searches, than what all the generic services can provide.
Same here, 2.5 Pro is very good at coding. But it’s also cocky and blames everything but itself for something not working. Eg “the linter must be wrong you should reinstall it”, “looks to be a problem with the Go compiler”, “this function HAS to exist, that’s weird that we’re getting an error”
And it often just stops like “ok this is still not working. You fix it and tell me when it’s done so I can continue”.
But for coding: Gemini Pro 2.5 > Sonnet 3.5 > Sonnet 3.7
Weird. For me, sonnet 3.7 is much more focussed and in particular works much better when finding the places that needs change and using other tooling. I guess the integration in cursor is just much better and more mature.
In my experience whenever these models solve a math or logic puzzle with reasoning, they generate extremely long and convoluted chains of thought which show up in the solution.
In contrast a human would come up with a solution with 2-3 steps. Perhaps something similar is going on here with the generated code.
My experience is it often produces terrible diagrams. Things clearly overlap, lines make no sense. I'm not surprised as if you told me to layout a diagram in XML/YAML there would be obvious mistakes and layout issues.
I'm not really certain a text output model can ever do well here.
FWIW I think a multimodal model could be trained to do extremely well with it given sufficient training data. A combination of textual description of the system and/or diagram, source code (mermaid, SVG, etc.) for the diagram, and the resulting image, with training to translate between all three.
Had the same experience. o3-mini failing misreably, claude 3.7 as well, but gemini 2.5 pro solved it perfectly. (image of diagram without source to tikz diagram)
I've had mixed and inconsistent results and it hasn't been able to iterate effectively when it gets close. Could be that I need to refine my approach to prompting. I've tried mermaid and SVG mostly, but will also try graphviz based on your suggestion.
You probably know this and are looking for consistency but, a little trick I use
is to feed the original data of what I need as a diagram and to re-imagine, it as an image “ready for print” - not native, but still a time saver and just studying with unstructured data or handles this surprisingly well. Again not native…naive, yes. Native, not yet. Be sure to double check triple check as always. give it the ol’ OCD treatment.
re: "grok-3 is r1 with mods" -- do you mean you believe they distilled deepseek r1? that was my assumption as well, though i thought it more jokingly at first it would make a lot of sense. i actually enjoy grok 3 quite a lot, it has some of the most entertaining thinking traces.
The notion that Google is worse at carefully managing PII than a Wild West place like OpenAI (or Meta, or almost any major alternative) is…not an accurate characterization, in my experience. Ad tech companies (and AI companies) obsessively capture data, but Google internally has always been equally obsessive about isolating and protecting that data. Almost no one can touch it; access is highly restricted and carefully managed; anything that even smells adjacent to ML on personal data has gotten high-level employees fired.
Fully private and local inference is indeed great, but of the centralized players, Google, Microsoft, and Apple are leagues ahead of the newer generation in conservatism and care around personal data.
> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
For code it's actually quite good so far IME. Not quite as good as Gemini 2.5 Pro but much faster. I've integrated it into polychat.co if you want to try it out and compare with other models. I usually ask 2 to 5 models the same question there to reduce the model overload anxiety.
My thoughts is this model release is driven by the agentic app push if this year. Since to my knowledge all the big agentic apps (cursor, bolt, shortwave) that I know of use claude 3.7 because it’s so much better at instruction following and tool calling than GPT 4o so this model feels like GPT 4o (or distilled 4.5?) with some post training focusing on what these agentic workloads need most
Hey also try out Monday, it did something pretty cool. Its a version of 4o which switched between reasoning and plain token generation on the fly. My guess is that is what GPT V will be.
Disagree. It's really not complicated at all to me. Not sure why people make a big fuss over this. I don't want an AI automating which AI it chooses for me. I already know through lots of testing intuitively which one I want.
If they abstract all this away into one interface I won't know which model I'm getting. I prefer reliability.
I'm not sure this is really an apples-to-apples comparison as it may involve different test scaffolding and levels of "thinking". Tokens per second numbers are from here: https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr... and I'm assuming 4.1 is the speed of 4o given the "latency" graph in the article putting them at the same latency.
Did you benchmarked combo: DeepSeek R1 + DeepSeek V3 (0324)?
There is combo on 3rd place : DeepSeek R1 + claude-3-5-sonnet-20241022 and also V3 new beating claude 3.5 so in theory R1 + V3 should be even on 2nd place. Just curious if that would be the case
Looks like they also added the cost of the benchmark run to the leaderboard, which is quite cool. Cost per output token is no longer representative of the actual cost when the number of tokens can vary by an order of magnitude for the same problem just based on how many thinking tokens the model is told to use.
Based on some DMs with the Gemini team, they weren't aware that aider supports a "diff-fenced" edit format. And that it is specifically tuned to work well with Gemini models. So they didn't think to try it when they ran the aider benchmarks internally.
Beyond that, I spend significant energy tuning aider to work well with top models. That is in fact the entire reason for aider's benchmark suite: to quantitatively measure and improve how well aider works with LLMs.
Aider makes various adjustments to how it prompts and interacts with most every top model, to provide the very best possible AI coding results.
Thanks, that's interesting info. It seems to me that such tuning, while making Aider more useful, and making the benchmark useful in the specific context of deciding which model to use in Aider itself, reduces the value of the benchmark in evaluating overall model quality for use in other tools or contexts, as people use it for today. Models that get more tuning will outperform models that get less tuning, and existing models will have an advantage over new ones by virtue of already being tuned.
I think you could argue the other side too... All of these models do better and worse with subtly different prompting that is non-obvious and unintuitive. Anybody using different models for "real work" are going to be tuning their prompts specifically to a model. Aider (without inside knowledge) can't possibly max out a given model's ability, but it can provide a reasonable approximation of what somebody can achieve with some effort.
There are different scores reported by Google for "diff" and "whole" modes, and the others were "diff" so I chose the "diff" score. Hard to make a real apples-to-apples comparison.
Huh, seems like Aider made a special mode specifically for Gemini[1] some time after Google's announcement blog post with official performance numbers. Still not sure it makes sense to quote that new score next to the others. In any case Gemini's 69% is the top score even without a special mode.
OK but it was still added specifically to improve Gemini and nobody else on the leaderboard uses it. Google themselves do not use it when they benchmark their own models against others. They use the regular diff mode that everyone else uses. https://blog.google/technology/google-deepmind/gemini-model-...
One task I do is I feed the models the text of entire books, and ask them various questions about it ('what happened in Chapter 4', 'what did character X do in the book' etc.).
GPT 4.1 is the first model that has provided a human-quality answer to these questions. It seems to be the first model that can follow plotlines, and character motivations accurately.
I'd say since text processing is a very important use case for LLMs, that's quite noteworthy.
don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for:
- telling the model to be persistent (+20%)
- dont self-inject/parse toolcalls (+2%)
- prompted planning (+4%)
- JSON BAD - use XML or arxiv 2406.13121 (GDM format)
- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD
- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
As an aside, one of the worst aspects of the rise of LLMs, for me, has been the wholesale replacement of engineering with trial-and-error hand-waving. Try this, or maybe that, and maybe you'll see a +5% improvement. Why? Who knows.
I think trial-and-error hand-waving isn't all that far from experimentation.
As an aside, I was working in the games industry when multi-core was brand new. Maybe Xbox-360 and PS3? I'm hazy on the exact consoles but there was one generation where the major platforms all went multi-core.
No one knew how to best use the multi-core systems for gaming. I attended numerous tech talks by teams that had tried different approaches and were give similar "maybe do this and maybe see x% improvement?". There was a lot of experimentation. It took a few years before things settled and best practices became even somewhat standardized.
Some people found that era frustrating and didn't like to work in that way. Others loved the fact it was a wide open field of study where they could discover things.
Yes, it was the generation of the X360 and PS3. X360 was 3 core and the PS3 was 1+7 core (sort of a big.little setup).
Although it took many, many more years until games started to actually use multi-core properly. With rendering being on a 16.67ms / 8.33ms budget and rendering tied to world state, it was just really hard to not tie everything into eachother.
Even today you'll usually only see 2-4 cores actually getting significant load.
there probably was still a structured way to test this through cross hatching but yeah like blind guessing might take longer and arrive at the same solution
Performance optimization is different, because there's still some kind of a baseline truth. Every knows what a FPS is, and +5% FPS is +5% FPS. Even the tricky cases have some kind of boundary (+5% FPS on this hardware but -10% on this other hardware, +2% on scenes meeting these conditions but -3% otherwise, etc).
Meanwhile, nobody can agree on what a "good" LLM in, let alone how to measure it.
The disadvantage is that LLMs are probabilistic, mercurial, unreliable.
The advantage is that humans are probabilistic, mercurial and unreliable, and LLMs are a way to bridge the gap between humans and machines that, while not wholly reliable, makes the gap much smaller than it used to be.
If you're not making software that interacts with humans or their fuzzy outputs (text, images, voice etc.), and have the luxury of well defined schema, you're not going to see the advantage side.
I feel like this a common pattern with people who work in STEM. As someone who is used to working with formal proofs, equations, math, having a startup taught me how to rewire myself to work with the unknowns, imperfect solutions, messy details. I'm going on a tangent, but just wanted to share.
One of the major advantages and disadvantages of LLMs is they act a bit more like humans. I feel like most "prompt advice" out there is very similar to how you would teach a person as well. Teachers and parents have some advantages here.
Usually when we’re doing it in practice there’s _somewhat_ more awareness of the mechanics than just throwing random obstructions in and hoping for the best.
LLMs are still very young. We'll get there in time. I don't see how it's any different than optimizing for new CPU/GPU architectures other than the fact that the latter is now a decades-old practice.
> I don't see how it's any different than optimizing for new CPU/GPU architectures
I mean that seems wild to say to me. Those architectures have documentation and aren't magic black boxes that we chuck inputs at and hope for the best: we do pretty much that with LLMs.
If that's how you optimise, I'm genuinely shocked.
Not to pick on you, but this is exactly the objectionable handwaving. What makes you think we'll get there? The kinds of errors that these technologies make have not changed, and anything that anyone learns about how to make them better changes dramatically from moment to moment and no one can really control that. It is different because those other things were deterministic ...
In comp sci it’s been deterministic, but in other science disciplines (eg medicine) it’s not. Also in lots of science it looks non-deterministic until it’s not (eg medicine is theoretically deterministic, but you have to reason about it experimentally and with probabilities - doesn’t mean novel drugs aren’t technological advancements).
And while the kind of errors hasn’t changed, the quantity and severity of the errors has dropped dramatically in a relatively short span of time.
In my experience, even simple CRUD apps generally have some domain-specific intricacies or edge cases that take some amount of experimentation to get right.
Yeah this is why I don't like statistical and ML solutions in general. Monte Carlo sampling is already kinda throwing bullshit at the wall and hoping something works with absolutely zero guarantees and it's perfectly explainable.
But unfortunately for us, clean and logical classical methods suck ass in comparison so we have no other choice but to deal with the uncertainty.
> no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
Challenge accepted.
That said, the exact quote from the linked notebook is "It’s generally not necessary to use all-caps or other incentives like bribes or tips, but developers can experiment with this for extra emphasis if so desired.", but the demo examples OpenAI provides do like using ALL CAPS.
I'm surprised and a little disappointed by the result concerning instructions at the top, because it's incompatible with prompt caching: I would much rather cache the part of the prompt that includes the long document and then swap out the user question at the end.
The way I understand it: if the instruction are at the top, the KV entries computed for "content" can be influenced by the instructions - the model can "focus" on what you're asking it to do and perform some computation, while it's "reading" the content. Otherwise, you're completely relaying on attention to find the information in the content, leaving it much less token space to "think".
Prompt on bottom is also easier for humans to read as I can have my actual question and the model’s answer on screen at the same time instead of scrolling through 70k tokens of context between them.
It's placing instructions AND user query at top and bottom. So if you have a prompt like this:
[Long system instructions - 200 tokens]
[Very long document for reference - 5000 tokens]
[User query - 32 tokens]
The key-values for first 5200 tokens can be cached and it's efficient to swap out the user query for a different one, you only need to prefill 32 tokens and generate output.
But the recommendation is to use this, where in this case you can only cache the first 200 tokens and need to prefill 5264 tokens every time the user submits a new query.
[Long system instructions - 200 tokens]
[User query - 32 tokens]
[Very long document for reference - 5000 tokens]
[Long system instructions - 200 tokens]
[User query - 32 tokens]
If you're skimming a text to answer a specific question, you can go a lot faster than if you have to memorize the text well enough to answer an unknown question after the fact.
The size of that SWE-bench Verified prompt shows how much work has gone into the prompt to get the highest possible score for that model. A third party might go to a model from a different provider before going to that extent of fine-tuning of the prompt.
I have been trying GPT-4.1 for a few hours by now through Cursor on a fairly complicated code base. For reference, my gold standard for a coding agent is Claude Sonnet 3.7 despite its tendency to diverge and lose focus.
My take aways:
- This is the first model from OpenAI that feels relatively agentic to me (o3-mini sucks at tool use, 4o just sucks). It seems to be able to piece together several tools to reach the desired goal and follows a roughly coherent plan.
- There is still more work to do here. Despite OpenAI's cookbook[0] and some prompt engineering on my side, GPT-4.1 stops quickly to ask questions, getting into a quite useless "convo mode". Its tool calls fails way too often as well in my opinion.
- It's also able to handle significantly less complexity than Claude, resulting in some comical failures. Where Claude would create server endpoints, frontend components and routes and connect the two, GPT-4.1 creates simplistic UI that calls a mock API despite explicit instructions. When prompted to fix it, it went haywire and couldn't handle the multiple scopes involved in that test app.
- With that said, within all these parameters, it's much less unnerving than Claude and it sticks to the request, as long as the request is not too complex.
My conclusion: I like it, and totally see where it shines, narrow targeted work, adding to Claude 3.7 - for creative work, and Gemini 2.5 Pro for deep complex tasks. GPT-4.1 does feel like a smaller model compared to these last two, but maybe I just need to use it for longer.
I feel the same way about these models as you conclude. Gemini 2.5 is where I paste whole projects for major refactoring efforts or building big new bits of functionality. Claude 3.7 is great for most day to day edits. And 4.1 okay for small things.
I hope they release a distillation of 4.5 that uses the same training approach; that might be a pretty decent model.
I completely agree. On initial takeaway I find 3.7 sonnet to still be the superior coding model. I'm suspicious now of how they decide these benchmarks...
> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better suggestion in 55% of cases. Notably, they found that GPT‑4.1 excels at both precision (knowing when not to make suggestions) and comprehensiveness (providing thorough analysis when warranted).
Good point. They said they validated the results by testing with other models (including Claude), as well as with manual sanity checks.
55% to 45% definitely isn't a blowout but it is meaningful — in terms of ELO it equates to about a 36 point difference. So not in a different league but definitely a clear edge
There are no great shorthands, but here are a few rules of thumb I use:
- for N=100, worst case standard error of the mean is ~5% (it shrinks parabolically the further p gets from 50%)
- multiply by ~2 to go from standard error of the mean to 95% confidence interval
- scale sample size by sqrt(N)
So:
- N=100: +/- 10%
- N=1000: +/- 3%
- N=10000: +/- 1%
(And if comparing two independent distributions, multiply by sqrt(2). But if they’re measured on the same problems, then instead multiply by between 1 and sqrt(2) to account for them finding the same easy problems easy and hard problems hard - aka positive covariance.)
p-value of 7.9% — so very close to statistical significance.
the p-value for GPT-4.1 having a win rate of at least 49% is 4.92%, so we can say conclusively that GPT-4.1 is at least (essentially) evenly matched with Claude Sonnet 3.7, if not better.
Given that Claude Sonnet 3.7 has been generally considered to be the best (non-reasoning) model for coding, and given that GPT-4.1 is substantially cheaper ($2/million input, $8/million output vs. $3/million input, $15/million output), I think it's safe to say that this is significant news, although not a game changer
I make it 8.9% with a binomial test[0]. I rounded that to 10%, because any more precision than that was not justified.
Specifically, the results from the blog post are impossible: with 200 samples, you can't possibly have the claimed 54.9/45.1 split of binary outcomes. Either they didn't actually make 200 tests but some other number, they didn't actually get the results they reported, or they did some kind of undocumented data munging like excluding all tied results. In any case, the uncertainty about the input data is larger than the uncertainty from the rounding.
[0] In R, binom.test(110, 200, 0.5, alternative="greater")
That's a marketing page for something called qodo that sells ai code reviews. At no point were the ai code reviews judged by competent engineers. It is just ai generated trash all the way down.
Rarely are two models put head-to-head though. If Claude Sonnet 3.7 isn't able to generate a good PR review (for whatever reason), a 2% better review isn't all that strong of a value proposition.
I think an under appreciated reality is that all of the large AI labs and OpenAI in particular are fighting multiple market battles at once. This is coming across in both the number of products and the packaging.
1, to win consumer growth they have continued to benefit on hyper viral moments, lately that was was image generation in 4o, which likely was technically possible a long time before launched. 2, for enterprise workloads and large API use, they seem to have focused less lately but the pricing of 4.1 is clearly an answer to Gemini which has been winning on ultra high volume and consistency. 3, for full frontier benchmarks they pushed out 4.5 to stay SOTA and attract the best researchers. 4, on top of all they they had to, and did, quickly answer the reasoning promise and DeepSeek threat with faster and cheaper o models.
They are still winning many of these battles but history highlights how hard multi front warfare is, at least for teams of humans.
Hey Simon, I love how you generates these summaries and share them on every model release. Do you have a quick script that allows you to do that? Would love to take a look if possible :)
He has a couple of nifty plugins to the LLM utility [1] so I would guess its something as simple as ```llm -t fabric:some_prompt_template -f hn:1234567890``` and that applies a template (in this case from a fabric library) and then appends a 'fragment' block from HN plugin which gets the comments, strips everything but the author and text, adds an index number (1.2.3.x), and inserts it into the prompt (+ SQLite).
Are there any benchmarks or someone who did tests of performance of using this long max token models in scenarios where you actually use more of this token limit?
I found from my experience with Gemini models that after ~200k that the quality drops and that it basically doesn't keep track of things. But I don't have any numbers or systematic study of this behavior.
I think all providers who announce increased max token limit should address that. Because I don't think it is useful to just say that max allowed tokens are 1M when you basically cannot use anything near that in practice.
The problem is that while you can train a model with the hyperparameter of "context size" set to 1M, there's very little 1M data to train on. Most of your model's ability to follow long context comes from the fact that it's trained on lots of (stolen) books; in fact I believe OpenAI just outright said in court that they can't do long context without training on books.
Novels are usually measured in terms of words; and there's a rule of thumb that four tokens make up about three words. So that 200k token wall you're hitting is right when most authors stop writing. 150k is already considered long for a novel, and to train 1M properly, you'd need not only a 750k book, but many of them. Humans just don't write or read that much text at once.
To get around this, whoever is training these models would need to change their training strategy to either:
- Group books in a series together as a single, very long text to be trained on
- Train on multiple unrelated books at once in the same context window
- Amplify the gradients by the length of the text being trained on so that the fewer long texts that do exist have greater influence on the model weights as a whole.
I suspect they're doing #2, just to get some gradients onto the longer end of the context window, but that also is going to diminish long-context reasoning because there's no reason for the model to develop a connection between, say, token 32 and token 985,234.
I'm not sure to which extent this opinion is accurately informed. It is well known that nobody trains on 1M token-long content. It wouldn't work anyway as the dependencies are too far fetched and you end up with vanishing gradients.
RoPE (Rotary Positional Embeddings, think modulo or periodic arithmetics) scaling is key, whereby the model is trained on 16k tokens long content, and then scaled up to 100k+ [0]. Qwen 1M (who has near perfect recall over the complete window [1]) and Llama 4 10M pushed the limits of this technique, with Qwen reliably training with a much higher RoPE base, and Llama 4 coming up with iRoPE which claims scaling to extremely long contexts up to infinity.
I am not sure about public evidence. But the memory requirements alone to train on 1M long windows would make it a very unrealistic proposition compared to RoPE scaling. And as I mentioned RoPE is essential for long context anyway. You can't train it in the "normal way". Please see the paper I linked previously for more context (pun not intended) on RoPE.
No, there's a fundamental limitation of Transformer architecture:
* information from the entire context has to be squeezed into an information channel of a fixed size; the more information you try to squeeze the more noise you get
* selection of what information passes through is done using just dot-product
Training data isn't the problem.
In principle, as you scale transformer you get more heads and more dimensions in each vector, so bandwidth of attention data bus goes up and thus precision of recall goes up too.
codebases of high quality open source projects and their major dependencies are probably another good source. also: "transformative fair use", not "stolen"
For reference, I think a common approximation is one token being 0.75 words.
For a 100 page book, that translates to around 50,000 tokens. For 1 mil+ tokens, we need to be looking at 2000+ page books. That's pretty rare, even for documentation.
It doesn't have to be text-based, though. I could see films and TV shows becoming increasingly important for long-context model training.
Synthetic data requires a discriminator that can select the highest quality results to feed back into training. Training a discriminator is easier than a full blown LLM, but it still suffers from a lack of high quality training data in the case of 1M context windows. How do you train a discriminator to select good 2,000 page synthetic books if the only ones you have to train it with are Proust and concatenated Harry Potter/Game of Thrones/etc.
Despite listing all presently known bats, the majority of "list of chiropterans" byte count is code that generates references to the IUCN Red List, not actual text. Most of Wikipedia's longest articles are code.
Isn't the problem more that the "needle in a haystack" eval (i said word X once, where) is really not relevant to most long context LLM use cases like code, where you need the context from all the stuff simultaneously rather than identifying a single, quite separate relevant section?
What you're describing as "needle in a haystack" is a necessary requirement for the downstream ability you want. The distinction is really how many "things" the LLM can process in a single shot.
LLMs process tokens sequentially, first in a prefilling stage, where it reads your input, then in the generation stage where it outputs response tokens. The attention mechanism is what allows the LLM as it is ingesting or producing tokens to "notice" that a token it has seen previously (your instruction) is related with a token it is now seeing (the code).
Of course this mechanism has limits (correlated with model size), and if the LLM needs to take the whole input in consideration to answer the question the results wouldn't be too good.
It's the best known performer on this benchmark, but still falls off quickly at even relatively modest context lengths (85% perf at 16K). (Cutting edge reasoning models like Gemini 2.5 Pro haven't been evaluated due to their cost and might outperform it.)
IMO this is the best long context benchmark. Hopefully they will run it for the new models soon. Needle-in-a-haystack is useless at this point. Llama-4 had perfect needle in a haystack results but horrible real-world-performance.
There are some benchmarks such as Fiction.LiveBench[0] that give an indication and the new Graphwalks approach looks super interesting.
But I'd love to see one specifically for "meaningful coding." Coding has specific properties that are important such as variable tracking (following coreference chains) described in RULER[1]. This paper also cautions against Single-Needle-In-The-Haystack tests which I think the OpenAI one might be. You really need at least Multi-NIAH for it to tell you anything meaningful, which is what they've done for the Gemini models.
I think something a bit more interpretable like `pass@1 rate for coding turns at 128k` would so much more useful than "we have 1m context" (with the acknowledgement that good-enough performance is often domain dependant)
This is a paper which echoes your experience, in general. I really wish that when papers like this one were created, someone took the methodology and kept running with it for every model:
> For instance, the NoLiMa benchmark revealed that models like GPT-4o experienced a significant drop from a 99.3% performance rate at 1,000 tokens to 69.7% at 32,000 tokens. Similarly, Llama 3.3 70B's effectiveness decreased from 97.3% at 1,000 tokens to 42.7% at 32,000 tokens, highlighting the challenges LLMs face with longer contexts.
As much as I enjoy Gemini models, I have to agree with you. At some point, interactions with them start resembling talking to people with short-term memory issues, and answers become increasingly unreliable. Now, there are also reports of AI Studio glitching out and not loading these longer conversations.
Is there a reliable method for pruning, summarizing, or otherwise compressing context to overcome such issues?
I agree. I use it a lot but there is endless frustration when the C++ code I am working on gets both complex and largish. Once it gets to a certain size and the context gets too long they all pretty much lose the plot and start producing complete rubbish. It would be great for it to give some measure so I know to take over and not have it start injecting random bugs or deleting functional code.
It even starts doing things like returning locally allocated pointers lately.
> Then again, the way things are rapidly improving I suspect I can wait 6 months and they'll have a model that can do what I want.
I believe this. I've been having the forgetting problem happen less with Gemini 2.5 Pro. It does hallucinate, but I can get far just pasting all the docs and a few examples, and asking it to double check everything according to the docs instead of relying on its memory.
Just some tiny feedback if you didn’t mind; in the free version 10 prompts/day is unticked which sort of hints that there isn’t a 10 prompt/day limit, but I’m guessing that’s not what you want to say?
I wonder if documentation would help to create an carefully and intentionally tokenized overview of the system. Maximize the amount of routine larger scope information provided in minimal tokens in order to leave room for more immediate context.
Similar to the function documentation provides to developers today, I suppose.
It does, shockingly well in my experience. Check out this blog post outlining such an approach, called Literate Development by the author: https://news.ycombinator.com/item?id=43524673
It's not the point of the announcement, but I do like the use of the (abs) subscript to demonstrate the improvement in LLM performance since in these types of benchmark descriptions I never can tell if the percentage increase is absolute or relative.
It's definitely an issue. Even the simplest use case of "create React app with Vite and Tailwind" is broken with these models right now because they're not up to date.
Whenever an LLM struggles with a particular library version, I use Cursor Rules to auto-include migration information and that generally worked well enough in my cases.
Periodically I keep trying these coding models in Copilot and I have yet to have an experience where it produced working code with a pretty straightforward TypeScript codebase. Specifically, it cannot for the life of it produce working Drizzle code. It will hallucinate methods that don't exist despite throwing bright red type errors. Does it even check for TS errors?
Not sure about Copilot, but the Cursor agent runs both eslint and tsc by default and fixes the errors automatically. You can tell it to run tests too, and whatever other tools. I've had a good experience writing drizzle schemas with it.
It has been really frustrating learning Godot (or any new technology you are not familiar with) 4.4.x with GPT4o or even worse, with custom GPT which use older GPT4turbo.
As you are new in the field, it kinda doesn't make sense to pick an older version. It would be better if there was no data than incorrect data. You literally have to include the version number on every prompt and even that doesn't guarantee a right result! Sometimes I have to play truth or dare three times before we finally find the right names and instructions. Yes I have the version info on all custom information dialogs, but it is not as effective as including it in the prompt itself.
Searching the web feels like an on-going "I'm feeling lucky" mode. Anyway, I still happen to get some real insights from GPT4o, even though Gemini 2.5 Pro has proven far superior for larger and more difficult contexts / problems.
The best storytelling ideas have come from GPT 4.5. Looking forward to testing this new 4.1 as well.
I'm afraid I'm only doing 2d ... Yes, GUI related LLM instructions have been exceptionally bad, with multiple prompts me saying "no there is no such thing"... But as I commented earlier, GPT has had it's moments.
I strongly recommend giving Gemini 2.5 Pro a shot. Personally I don't like their bloated UI, but you can set the temperature value, which is especially helpful when you are more certain what and how you want, then just lower that value. If you want to get some wilder ideas, turn it up. Also highly recommend reading the thought process it does! That was actually key in having very complex ideas working. Just spotting couple of lines there, that seem too vague or even just a little bit inaccurate ... then pasting them back, with your own comments, have helped me a ton.
Is there a specific part in which you struggle? And FWIW, I've been on a heavy learning spree for 2 weeks. I feel like I'm starting to see glimbses from the barrel's bottom ... it's not so deep, you just gotta hang in there and bombard different LLMs with different questions, different angles, stripping away most and trying the simplest variation, for both prompt and godot. Or sometimes by asking more general advice "what is current godot best practice in doing x".
And YouTube has also been helpful source, by listening how more experienced users make their stuff. You can mostly skim through the videos with doublespeed and just focus on how they are doing the basics. Best of luck!
It it annoying. The bigger cheaper context windows help this a little though:
E.g.: If context windows get big and cheap enough (as things are trending), hopefully you can just dump the entire docs, examples, and more in every request.
sometimes it feels like openai keeps serving the same base dish—just adding new toppings. sure, the menu keeps changing, but it all kinda tastes the same. now the menu is getting too big.
nice to see that we aren't stuck in october of 2023 anymore!
The real news for me is GPT 4.5 being deprecated and the creativity is being brought to "future models" and not 4.1. 4.5 was okay in many ways but it was absolutely a genius in production for creative writing. 4o writes like a skilled human, but 4.5 can actually write a 10 minute scene that gives me goosebumps. I think it's the context window that allows for it to actually build up scenes to hammer it down much later.
Cool to hear that you got something out of it, but for most users 4.5 might have just felt less capable on their solution-oriented questions. I guess this why they are deprecating it.
It is just such a big failure of OpenAI not to include smart routing on each question and hide the complexity of choosing a model from users.
Most of the improvements in this model, basically everything except the longer context, image understanding and better pricing, are basically things that reinforcement learning (without human feedback) should be good at.
Getting better at code is something you can verify automatically, same for diff formats and custom response formats. Instruction following is also either automatically verifiable, or can be verified via LLM as a judge.
I strongly suspect that this model is a GPT-4.5 (or GPT-5???) distill, with the traditional pretrain -> SFT -> RLHF pipeline augmented with an RLVR stage, as described in Lambert et al[1], and a bunch of boring technical infrastructure improvements sprinkled on top.
If so, the loss of fidelity versus 4.5 is really noticeable and a loss for numerous applications. (Finding a vegan restaurant in a random city neighborhood, for example.)
In your example the LLM should not be responsible for that directly. It should be calling out to an API or search results to get accurate and up-to-date information (relatively speaking) and then use that context to generate a response
You should actually try it. The really big models (4 and 4.5, sadly not 4o) have truly breathtaking ability to dig up hidden gems that have a really low profile on the internet. The recommendations also seem to cut through all the SEO and review manipulation and deliver quality recommendations. It really all can be in one massive model.
ChatGPT currently recommends I use o3-mini-high ("great at coding and logic") when I start a code conversation with 4o.
I don't understand why the comparison in the announcement talks so much about comparing with 4o's coding abilities to 4.1. Wouldn't the relevant comparison be to o3-mini-high?
4.1 costs a lot more than o3-mini-high, so this seems like a pertinent thing for them to have addressed here. Maybe I am misunderstanding the relationship between the models?
4.1 is a pinned API variant with the improvements from the newer iterations of 4o you're already using in the app, so that's why the comparison focuses between those two.
Pricing wise the per token cost of o3-mini is less than 4.1 but keep in mind o3-mini is a reasoning model and you will pay for those tokens too, not just the final output tokens. Also be aware reasoning models can take a long time to return a response... which isn't great if you're trying to use an API for interactive coding.
> I don't understand why the comparison in the announcement talks so much about comparing with 4o's coding abilities to 4.1. Wouldn't the relevant comparison be to o3-mini-high?
There are tons of comparisons to o3-mini-high in the linked article.
Sam Altman wrote in February that GPT-4.5 would be "our last non-chain-of-thought model" [1], but GPT-4.1 also does not have internal chain-of-thought [2].
It seems like OpenAI keeps changing its plans. Deprecating GPT-4.5 less than 2 months after introducing it also seems unlikely to be the original plan. Changing plans is necessarily a bad thing, but I wonder why.
Did they not expect this model to turn out as well as it did?
Anyone making claims with a horizon beyond two months about structure or capabilities will be wrong - it's sama's job to show confidence and vision and calm stakeholders, but if you're paying attention to the field, the release and research cycles are still contracting, with no sense of slowing any time soon. I've followed AI research daily since GPT-2, the momentum is incredible, and even if the industry sticks with transformers, there are years left of low hanging fruit and incremental improvements before things start slowing.
There doesn't appear to be anything that these AI models cannot do, in principle, given sufficient data and compute. They've figured out multimodality and complex integration, self play for arbitrary domains, and lots of high-cost longer term paradigms that will push capabilities forwards for at least 2 decades in conjunction with Moore's law.
Things are going to continue getting better, faster, and weirder. If someone is making confident predictions beyond those claims, it's probably their job.
Maybe that's true for absolute arm-chair-engineering outsiders (like me) but these models are in training for months, training data is probably being prepared year(s) in advance. These models have a knowledge cut-off in 2024 - so they have been in training for a while. There's no way sama did not have a good idea that this non-COT was in the pipeline 2 months ago. It was probably finished training then and undergoing evals.
Maybe
1. he's just doing his job and hyping OpenAI's competitive advantages (afair most of the competition didn't have decent COT models in Feb), or
2. something changed and they're releasing models now that they didn't intend to release 2 months ago (maybe because a model they did intend to release is not ready and won't be for a while), or
3. COT is not really as advantageous as it was deemed to be 2+ months ago and/or computationally too expensive.
I doubt it's going to be weeks, the months were already turning into years despite Nvidia's previous advances.
(Not to say that it takes openai years to train a new model, just that the timeline between major GPT releases seems to double... be it for data gathering, training, taking breaks between training generations, ... - either way, model training seems to get harder not easier).
GPT Model | Release Date | Months Passed Between Former Model
The capabilities and general utility of the models are increasing on an entirely different trajectory than model names - the information you posted is 99% dependent on internal OAI processes and market activities as opposed to anything to do with AI.
I'm talking more broadly, as well, including consideration of audio, video, and image modalities, general robotics models, and the momentum behind applying some of these architectures to novel domains. Protocols like MCP and automation tooling are rapidly improving, with media production and IT work rapidly being automated wherever possible. When you throw in the chemistry and materials science advances, protein modeling, etc - we have enormously powerful AI with insufficient compute and expertise to apply it to everything we might want to. We have research being done on alternate architectures, and optimization being done on transformers that are rapidly reducing the cost/performance ratio. There are models that you can run on phones that would have been considered AGI 10 years ago, and there doesn't seem to be any fundamental principle decreasing the rate of improvement yet. If alternate architectures like RWKV get funded, there might be several orders of magnitude improvement with relatively little disruption to production model behaviors, but other architectures like text diffusion could obsolete a lot of the ecosystem being built up around LLMs right now.
There are a million little considerations pumping transformer LLMs right now because they work and there's every reason to expect them to continue improving in performance and value for at least a decade. There aren't enough researchers and there's not enough compute to saturate the industry.
Fair point, I guess my question is how long it would take them to train GPT-2 on the absolute bleedingest generation of Nvidia chips vs what they had in 2019, with the budget they have to blow on Nvidia supercomputers today.
the release and research cycles are still contracting
Not necessarily progress or benchmarks that as a broader picture you would look at (MMLU etc)
GPT-3 was an amazing step up from GPT-2, something scientists in the field really thought was 10-15 years out at least done in 2, instruct/RHLF for GPTs was a similar massive splash, making the second half of 2021 equally amazing.
However nothing since has really been that left field or unpredictable from then, and it's been almost 3 years since RHLF hit the field. We knew good image understanding as input, longer context, and improved prompting would improve results. The releases are common, but the progress feels like it has stalled for me.
What really has changed since Davinci-instruct or ChatGPT to you? When making an AI-using product, do you construct it differently? Are agents presently more than APIs talking to databases with private fields?
In some dimensions I recognize the slow down in how fast new capabilities develop, but the speed still feels very high:
Image generation suddenly went from gimmick to useful now that prompt adherence is so much better (eagerly waiting for that to be in the API)
Coding performance continues to improve noticeably (for me). Claude 3.7 felt like a big step from 4o/3.5. Gemini 2.5 in a similar way.compared to just 6 months ago I can give bigger and more complex pieces of work to it and get relatively good output back. (Net acceleration)
Audio-2-audio seems like it will be a big step as well. I think this has much more potential than the STT-LLM-TTS architecture commonly used today (latency, quality)
I see a huge progress made since the first gpt-4 release. The reliability of answers has improved an order of magnitude. Two years ago, more than half of my questions resulted in incorrect or partially correct answers (most of my queries are about complicated software algorithms or phd level research brainstorming). A simple “are you sure” prompt would force the model to admit it was wrong most of the time. Now with o1 this almost never happens and the model seems to be smarter or at least more capable than me - in general. GPT-4 was a bright high school student. o1 is a postdoc.
> Things are going to continue getting better, faster, and weirder.
I love this. Especially the weirder part. This tech can be useful in every crevice of society and we still have no idea what new creative use cases there are.
Who would’ve guessed phones and social media would cause mass protests because bystanders could record and distribute videos of the police?
Yep, it’s literally just a slightly higher tech version of (for example) the 1992 Los Angeles riots over Rodney King but with phones and Facebook instead of handheld camcorders and television.
Maybe that's why they named this model 4.1, despite coming out after 4.5 and supposedly outperforming it. They can pretend GPT-4.5 is the last non-chain-of-thought model by just giving all non-chain-of-thought-models version numbers below 4.5
Everyone assumed malice when the board fired him for not always being "candid" - but it seems more and more that he's just clueless. He's definitely capable when it comes to raising money as a business, but I wouldn't count on any tech opinion from him.
I think that people balked at the cost of 4.5 and really wanted just a slightly more improved 4o . Now it almost seems that they will have a separate products that are non chain of thought and chain of thought series which actually makes sense because some want a cheap model and some don't.
> Deprecating GPT-4.5 less than 2 months after introducing it also seems unlikely to be the original plan.
Well they actually hinted already of possible depreciation in their initial announcement of gpt4.5 [0]. Also, as others said, this model was already offered in the api as chatgpt-latest, but there was no checkpoint which made it unreliable for actual use.
When I saw them say 'no more non COT models', I was minorly panicked.
While their competitors have made fantastic models, at the time I perceived ChatGPT4 was the best model for many applications. COT was often tricked by my prompts, assuming things to be true, when a non-COT model would say something like 'That isnt necessarily the case'.
I use both COT and non when I have an important problem.
Seeing them keep a non-COT model around is a good idea.
Looks like the Quasar and Optimus stealth models on Openrouter were in fact GPT-4.1. This is what I get when I try to access the openrouter/optimus-alpha model now:
{"error":
{"message":"Quasar and Optimus were stealth models, and
revealed on April 14th as early testing versions of GPT 4.1.
Check it out: https://openrouter.ai/openai/gpt-4.1","code":404}
As a user I'm getting so confused as to what's the "best" for various categories. I don't have time/want to dig into benchmarks for different categories, look into the example data to see which best maps onto my current problems.
The graphs presented don't even show a clear winner across all categories. The one with the biggest "number", GPT-4.5, isn't even in the best in most categories, actually it's like 3rd in a lot of them.
This is quite confusing as a user.
Otherwise big fan of OAI products thus far. I keep paying $20/mo, they keep improving across the board.
I think "best" is slightly subjective / user. But I understand your gripe. I think the only way is using them iteratively, settling on the one that best fits you / your use-case, whilst reading other peoples' experiences and getting a general vibe
Very interesting. For my use cases, Gemini's responses beat Sonnet 3.7's like 80% of the time (gut feeling, didn't collect actual data). It beats Sonnet 100% of the time when the context gets above 120k.
As usual with LLMs. In my experience, all those metrics are useful mainly to tell which models are definitely bad, but doesn't tell you much about which ones are good, and especially not how the good ones stack against each other in real world use cases.
Andrej Karpathy famously quipped that he only trusts two LLM evals: Chatbot Arena (which has humans blindly compare and score responses), and the r/LocalLLaMA comment section.
Okay, it's common across other industries, but not this one. Here is Google, Facebook, and Anthropic comparing their frontier models to others[1][2][3].
Confusing take - Gemini 2.5 is probably the best general purpose coding model right now, and before that it was Sonnet 3.5. (Maybe 3.7 if you can get it to be less reward-hacky.) OpenAI hasn't had the best coding model for... coming up on a year, now? (o1-pro probably "outperformed" Sonnet 3.5 but you'd be waiting 10 minutes for a response, so.)
Who has a (publicly released) model that is SOTA is constantly changing. It’s more interesting to see who is driving the innovation in the field, and right now that is pretty clearly OpenAI (GPT-3, first multi-modal model, first reasoning model, ect).
The deprecation of GPT-4.5 makes me sad. It's an amazing model with great world-knowledge and subtly. It KNOWS THINGS that, on a quick experiment, 4.1 just does not. 4.5 could tell me what I would see from a random street corner in New Jersey, or how to use minor features of my niche API (well, almost), and it could write remarkably. But 4.1 doesn't hold a candle to it. Please, continue to charge me $150/1M tokens. Sometimes you need a Big Model. Tells me it was costing more than $150/1M to serve (!).
> You're eligible for free daily usage on traffic shared with OpenAI through April 30, 2025.
> Up to 1 million tokens per day across gpt-4.5-preview, gpt-4.1, gpt-4o and o1
> Up to 10 million tokens per day across gpt-4.1-mini, gpt-4.1-nano, gpt-4o-mini, o1-mini and o3-mini
> Usage beyond these limits, as well as usage for other models, will be billed at standard rates. Some limitations apply.
>Note that GPT‑4.1 will only be available via the API. In ChatGPT, many of the improvements in instruction following, coding, and intelligence have been gradually incorporated into the latest version
If anyone here doesn't know, OpenAI does offer the ChatGPT model version in the API as chatgpt-4o-latest, but it's bad because they continuously update it so businesses can't reliably rely on it being stable, that's why OpenAI made GPT 4.1.
Lots of the other models are checkpoint releases, and latest is a pointer to the latest checkpoint. Something being continuously updated is quite different and worth knowing about.
OpenAI (and most LLM providers) allow model version pinning for exactly this reason, e.g. in the case of GPT-4o you can specify gpt-4o-2024-05-13, gpt-4o-2024-08-06, or gpt-4o-2024-11-20.
Yes, and they don't make snapshots for chatgpt-4o-latest, but they made them for GPT 4.1, that's why 4.1 is only useful for API, since their ChatGPT product already has the better model.
Yeah, in the last week, I had seen a strong benchmark for chatgpt-4o-latest and tried it for a client's use case. I ended up wasting like 4 days, because after my initial strong test results, in the following days, it gave results that were inconsistent and poor, and sometimes just outputting spaces.
Juice not worth the squeeze I imagine. 4.5 is chonky, and having to reserve GPU space for it must not have been worth it.
Makes sense to me - I hadn't founding anything it was so much better at that it was worth the incremental cost over Sonnet 3.7 or o3-mini.
Marginally on-topic: I'd love if the charts included prior models, including GPT 4 and 3.5.
Not all systems upgrade every few months. A major question is when we reach step-improvements in performance warranting a re-eval, redesign of prompts, etc.
There's a small bleeding edge, and a much larger number of followers.
Sonnet 3.7 non-reasoning is better on its own. In fact even Sonnet 3.5-v2 is, and that was released 6 months ago. Now to be fair, they're close enough that there will be usecases - especially non-coding - where 4.1 beats it consistently. Also, 4.1 is quite a lot cheaper and faster. Still, OpenAI is clearly behind.
There is no OpenAI model better than R1, reasoning or not (as confirmed by the same Aider benchmark; non-coding tests are less objective, but I think it still holds).
With Gemini (current SOTA) and Sonnet (great potential, but tends to overengineer/overdo things) it is debatable, they are probably better than R1 (and all OpenAI models by extension).
The user shoudn't have to research which model is the best for them. OpenAI needs to do a better job in UX and putting the best model forward in chatgpt.
As far as I can tell there's no way to discover the details of a model via the API right now.
Given the announced adoption of MCP and MCP's ability to perform model selection for Sampling based on a ranking for speed and intelligence, it would be great to have a model discovery endpoint that came with all the details on that page.
I like how Nano matches Gemini 2.0 Flash's price. That will help drive down prices which will be good for my app. However I don't like how Nano behaves worse than 4o Mini in some benchmarks. Maybe it will be good enough, we'll see.
yeah and considering that gemini 2.0 flash is much better than 4o-mini. On top of that gemini have also audio input as modality and realtime API for both audio input and output + web search grounding + free tier.
Theory here is that 4.1-nano is competing with that tier, 4.1 with flash-thinking (although likely to do significantly worse), and o4-mini or o3-large will compete with 2.5 thinking
> One last note: we’ll also begin deprecating GPT-4.5 Preview in the API today as GPT-4.1 offers improved or similar performance on many key capabilities at lower latency and cost. GPT-4.5 in the API will be turned off in three months, on July 14, to allow time to transition (and GPT 4.5 will continue to be available in ChatGPT).
I'm wondering if one of the big reasons that OpenAI is making gpt-4.5 deprecated is not only because it's not cost-effective to host but because they don't want their parent model being used to train competitors' models (like deepseek).
I'm not really bullish on OpenAI. Why would they only compare with their own models? The only explanation could be that they aren't as competitive with other labs as they were before.
It's the same organization that kept repeating that sharing weights of GPT would be "too dangerous for the world". Eventually DeepSeek thankfully did something like that, though they are supposed to be the evil guys.
They continue to baffle users with their version numbering.
Intiutively 4.5 is newer/better than 4.1 and perhaps 4o but of course this is not the case.
By leaving out scale or prior models they are effectively manipulating improvement. If from 3 to 4 it was from 10 to 80, and from 4 to 4o it was 80 to 82, leaving out 3 would let us see a steep line instead of steep decrease of growth.
The benchmarks and charts they have up are frustrating because they don’t include 03-mini(-high) which they’ve been pushing as the low-latency+low-cost smart model to use for coding challenges instead of 4o and 4o-mini. Why won’t they include that in the charts?
I've been using it in Cursor for the past few hours and prefer it to Sonnet 3.7. It's much faster and doesn't seem to make the sort of stupid mistakes Sonnet has been making recently.
I feel there's some "benchmark-hacking" is going on with GPT4.1 model as its metrics on livebench.com aren't all that exciting.
- It's basically GPT4o level on average.
- More optimized for coding, but slightly inferior in other areas.
It seems to be a better model than 4o for coding tasks, but I'm not sure if it will replace the current leaders -- Gemini 2.5 Pro, o3-mini / o1, Claude 3.7/3.5.
Is the version number a retcon of 4.5? On OpenAI's models page the names appear completely reasonable [1]: The o1 and o3 reasoning models, and non-reasoning there is 3.5, 4, 4o and 4.1 (let's pretend 4o makes sense). But that is only reasonable as long as we pretend 4.5 never happened, which the models page apparently does
We will also begin deprecating GPT‑4.5 Preview in the API, as GPT‑4.1 offers improved or similar performance on many key capabilities at much lower cost and latency. GPT‑4.5 Preview will be turned off in three months
Here's something I just don't understand, how can ChatGPT 4.5 be worse than 4.1? Or the only thing bad is that the OpenAI naming ability?
It would be incredible to be able to feed an entire codebase into a model and say "add this feature" or "we're having a bug where X is happening, tell me why", but then you are limited by the output token length
As others have pointed out too, the more tokens you use, the less accuracy you get and the more it gets confused, I've noticed this too
We are a ways away yet from being able to input an entire codebase, and have it give you back an updated version of that codebase.
> These Terms and your use of T3 Chat will be governed by and construed in accordance with the laws of the jurisdiction where T3 Tools Inc. is incorporated, without regard to its conflict of law provisions. Any disputes arising out of or in connection with these Terms will be resolved exclusively in the courts located in that jurisdiction, unless otherwise required by applicable law.
Would be nice if there was at least some hint as to where T3 Tools Inc. is located and what jurisdiction applies.
I'm using models which scored at least 50% in Aider leaderboard but I'm micromanaging 50 line changes instead of being more vibe. Is it worth experimenting with a model that didnt crack 10%?
While impressive that the assistants can use dynamic tools and reason about images, I'm most excited about the improvements to factual accuracy and instruction following. The RAG capabilities with cross-validation seem particularly useful for serious development work, not just toy demos.
Big focus on coding. It feels like a defensive move against Claude (and more recently, Gemini Pro) which became very popular in that regime. I guess they recently figured out some ways to train the model for these "agentic" coding through RL or something - and the finding is too new to apply 4.5 on time.
The 'oN' schema was a such strange choice for branding. They had to skip 'o2' because it's already trademarked, and now 'o4' can easily be confused with '4o'.
I tried 4.1-mini and 4.1-nano. The response are a lot faster, but for my use-case they seem to be a lot worse than 4o-mini(they fail to complete the task when 4o-mini could do it). Maybe I have to update my prompts...
It's quite complex, but the task is to parse some HTML content, or to choose from a list of URLs which one is the best.
I will check again the prompt, maybe 4o-mini ignores some instructions that 4.1 doesn't (instructions which might result in the LLM returning zero data).
Too bad OpenAI named it 4.1 instead of 4.10. You can either claim 4.10 > 4.5 (the dots separate natural numbers) or 4.1 == 4.10 (they are decimal numbers), but you can't have both at once
> Note that GPT‑4.1 will only be available via the API. In ChatGPT, many of the improvements in instruction following, coding, and intelligence have been gradually incorporated into the latest version (opens in a new window) of GPT‑4o, and we will continue to incorporate more with future releases.
The lack of availability in ChatGPT is disappointing, and they're playing on ambiguity here. They are framing this as if it were unnecessary to release 4.1 on ChatGPT, since 4o is apparently great, while simultaneously showing how much better 4.1 is relative to GPT-4o.
One wager is that the inference cost is significantly higher for 4.1 than for 4o, and that they expect most ChatGPT users not to notice a marginal difference in output quality. API users, however, will notice. Alternatively, 4o might have been aggressively tuned to be conversational while 4.1 is more "neutral"? I wonder.
There's a HUGE difference that you are not mentioning: there are "gpt-4o" and "chatgpt-4o-latest" on the API. The former is the stable version (there are a few snapshot but the newest snapshot has been there for a while), and the latter is the fine-tuned version that they often update on ChatGPT. All those benchmarks were done for the API stable version of GPT-4o, since that's what businesses rely on, not on "chatgpt-4o-latest".
Good point, but how does that relate to, or explain, the decision not to release 4.1 in ChatGPT? If they have a nice post-training pipeline to make 4o "nicer" to talk to, why not use it to fine-tune the base 4.1 into e.g. chatgpt-4.1-latest?
Because chatgpt-4o-latest already has all of those improvements, the largest point of this release (IMO) is to offer developers a stable snapshot of something that compares to modern 4o latest. Altman said that they'd offer a stable snapshot of chatgpt 4o latest on the API, he perhaps did really mean GPT 4.1.
> Because chatgpt-4o-latest already has all of those improvements
Does it, though? They said that "many" have already been incorporated. I simply don't buy their vague statements there. These are different models. They may share some training/post-training recipe improvements, but they are still different.
I disagree. From the average user perspective, it's quite confusing to see half a dozen models to choose from in the UI. In an ideal world, ChatGPT would just abstract away the decision. So I don't need to be an expert in the relatively minor differences between each model to have a good experience.
Vs in the API, I want to have very strict versioning of the models I'm using. And so letting me run by own evals and pick the model that works best.
I agree on both naming on stability. However, this wasn't my point.
They still have a mess of models in ChatGPT for now, and it doesn't look like this is going to get better immediately (even though for GPT-5, they ostensibly want to unify them). You have to choose among all of them anyway.
> We will also begin deprecating GPT‑4.5 Preview in the API, as GPT‑4.1 offers improved or similar performance on many key capabilities at much lower cost and latency.
why would they deprecate when it's the better model? too expensive?
> why would they deprecate when it's the better model? too expensive?
Too expensive, but not for them - for their customers. The only reason they’d deprecated it is if it wasn’t seeing usage worth keeping it up and that probably stems from it being insanely more expensive and slower than everything else.
Where did you find that 4.5 is a better model? Everything from the video told me that 4.5 was largely a mistake and 4.1 beats 4.5 at everything. There's no point keeping 4.5 at this point.
Bigger numbers are supposed to mean better. 3.5, 4, 4.5. Going from 4 to 4.5 to 4.1 seems weird to most people. If it's better, it should of been GPT-4.6 or 5.0 or something else, not a downgraded number.
Awesome, thank you for posting. As someone who regularly uses 4o mini from the API, any guesses or intuitions about the performance of Nano?
I'm not as concerned about nomenclature as other people, which I think is too often reacting to a headline as opposed to the article. But in this case, I'm not sure if I'm supposed to understand nano as categorically different than many in terms of what it means as a variation from a core model.
they share in livestream that 4.1-nano is worse than 4o-mini - so nano is cheaper, faster and have bigger context but worse in intelligence. 4.1mini is smarter but there is price increase.
That's what I was thinking. I hoped to see a price drop, but this does not change anything for my use cases.
I was using gpt-4o-mini with batch API, which I recently replaced with mistral-small-latest batch API, which costs $0.10/$0.30 (or $0.05/$0.15 when using the batch API). I may change to 4.1-nano, but I'd have to be overwhelmed by its performance in comparision to mistral.
I don't think they ever committed themselves to uniformed pricing for mini models. Of course cheaper is better but I understand pricing to be contingent on factors specific to every next model rather than following from a blanket policy.
GPT-4.1 probably is a distilled version of GPT-4.5
I dont understand the constant complaining about naming conventions. The number system differentiates the models based on capability, any other method would not do that. After ten models with random names like "gemini", "nebula" you would have no idea which is which. Its a low IQ take. You dont name new versions of software as completely different software
Also, Yesterday, using v0, I replicated a full nextjs UI copying a major saas player. No backend integration, but the design and UX were stunning, and better than I could do if I tried. I have 15 years of backend experience at FAANG. Software will get automated, and it already is, people just havent figured it out yet
Of these, some are mostly obsolete: GPT-4 and GPT-4 Turbo are worse than GPT-4o in both speed and capabilities. o1 is worse than o3-mini-high in most aspects.
Then, some are not available yet: o3 and o4-mini. GPT-4.1 I haven't played with enough to give you my opinion on.
Among the rest, it depends on what you're looking for:
Multi-modal: GPT-4o > everything else
Reasoning: o1-pro > o3-mini-high > o3-mini
Speed: GPT-4o > o3-mini > o3-mini-high > o1-pro
(My personal favorite is o3-mini-high for most things, as it has a good tradeoff between speed and reasoning. Although I use 4o for simpler queries.)
Well, okay, but I'm certainly not an expert who knows the fine differences between all the models available on chat.com. So I'm somewhere between your definition of "layman" and your definition of "expert" (as are, I suspect, most people on this forum).
If you know the difference between 4.5 and 4o, it'll take you 20 minutes max to figure out the theoretical differences between the other models, which is not bad for a highly technical emerging field.
There's no single ordering -- it really depends on what you're trying to do, how long you're willing to wait, and what kinds of modalities you're interested in.
I recognize this is a somewhat rhetorical question and your point is well taken. But something that maps well is car makes and models:
- Is Ford Better than Chevy? (Comparison across providers) It depends on what you value, but I guarantee there's tribes that are sure there's only one answer.
- Is the 6th gen 2025 4Runner better than 5th gen 2024 4Runner? (Comparison of same model across new releases) It depends on what you value. It is a clear iteration on the technology, but there will probably be more plastic parts that will annoy you as well.
- Is the 2025 BMW M3 base model better than the 2022 M3 Competition (Comparing across years and trims)? Starts to depend even more on what you value.
Providers need to delineate between releases, and years, models, and trims help do this. There are companies that will try to eschew this and go the Tesla route without models years, but still can't get away from it entirely. To a certain person, every character in "2025 M3 Competition xDrive Sedan" matters immensely, to another person its just gibberish.
SGD pretraining w/ RL on verifiable tasks to improve reasoning ability: o1-preview/o1-mini -> o1/o3-mini/o3-mini-high (technically the same product with a higher reasoning token budget) -> o3/o4-mini (not yet released)
reasoning model with some sort of Monte Carlo Search algorithm on top of reasoning traces: o1-pro
Some sort of training pipeline that does well with sparser data, but doesn't incorporate reasoning (I'm positing here, training and architecture paradigms are not that clear for this generation): gpt-4.5, gpt-4.1 (likely fine-tuned on 4.5)
By performance: hard to tell! Depends on what your task is, just like with humans. There are plenty of benchmarks. Roughly, for me, the top 3 by task are:
It's not to dismiss that their marketing nomenclature is bad, just to point out that it's not that confusing for people that are actively working with these models have are a reasonable memory of the past two years.
> Software will get automated, and it already is, people just havent figured it out yet
To be honest I think this is most AI labs (particularly the American ones) not-so-secret goal now, for a number of strong reasons. You can see it in this announcements, Anthrophic's recent Claude 3.7 announcement, OpenAI's first planned agent (SWE-Agent), etc etc. They have to justify their worth somehow and they see it as a potential path to do that. Remains to be seen how far they will get - I hope I'm wrong.
The reasons however for picking this path IMO are:
- Their usage statistics show coding as the main user: Anthrophic recently released their stats. Its become the main usage of these models, with other usages at best being novelty or conveniences for people in relative size. Without this market IMO the hype would of already fizzled awhile ago at best a novelty when looking at the rest of the user base size.
- They "smell blood" to disrupt and fear is very effective to promote their product: This IMO is the biggest one. Disrupting software looks to be an achievable goal, but it also is a goal that has high engagement compared to other use cases. No point solving something awesome if people don't care, or only care for awhile (e.g. meme image generation). You can see the developers on this site and elsewhere in fear. Fear is the best marketing tool ever and engagement can last years. It keeps people engaged and wanting to know more; and talking about how "they are cooked" almost to the exclusion of everything else (i.e. focusing on the threat). Nothing motivates you to know a product more than not being able to provide for yourself, your family, etc to the point that most other tech topics/innovations are being drowned out by AI announcements.
- Many of them are losing money and need a market to disrupt: Currently the existing use cases of a chat bot are not yet impressive enough (or haven't been till very recently) to justify the massive valuations of these companies. Its coding that is allowing them to bootstrap into other domains.
- It is a domain they understand: AI dev's know models, they understand the software process. It may be a complex domain requiring constant study, but they know it back to front. This makes it a good first case for disruption where the data, and the know how is already with the teams.
TL;DR: They are coming after you, because it is a big fruit that is easier to pick for them than other domains. Its also one that people will notice either out of excitement (CEO, VC's, Management, etc) or out of fear (tech workers, academics, other intellectual workers).
> Yesterday, using v0, I replicated a full nextjs UI copying a major saas player. No backend integration, but the design and UX were stunning, and better than I could do if I tried.
Exactly. Those who do frontend or focus on pretty much anything Javascript are, how should I say it? Cooked?
> Software will get automated
The first to go are those that use JavaScript / TypeScript engineers have already been automated out of a job. It is all over for them.
Yeah its over for them. Complicated business logic and sprawling systems are what are keeping backend safe for now. But the big front end code bases where individual files (like react components) are largely decoupled from the rest of the code base is why front end is completely cooked
I have a medium-sized typescript personal project I work on. It probably has 20k LOC of well organized typescript (react frontend, express backend). I also have somewhat comprehensive docs and cursor project rules.
In general I use Cursor in manual mode asking it to make very well scoped small changes (e.g. “write this function that does this in this exact spot”). Yesterday I needed to make a largely mechanical change (change a concept in the front end, make updates to the corresponding endpoints, update the data access methods, update the database schema).
This is something very easy I would expect a junior developer to be able to accomplish. It is simple, largely mechanical, but touches a lot of files. Cursor agent mode puked all over itself using Gemini 2.5. It could summarize what changes would need to be made, but it was totally incapable of making the changes. It would add weird hard coded conditions, define new unrelated files, not follow the conventions of the surrounding code at all.
TLDR; I think LLMs right now are good for greenfield development (create this front end from scratch following common patterns), and small scoped changes to a few files. If you have any kind of medium sized refactor on an existing code base forget about it.
> Cursor agent mode puked all over itself using Gemini 2.5. It could summarize what changes would need to be made, but it was totally incapable of making the changes.
Gemini 2.5 is currently broken with the Cursor agent; it doesn't seem to be able to issue tool calls correctly. I've been using Gemini to write plans, which Claude then executes, and this seems to work well as a workaround. Still unfortunate that it's like this, though.
My personal opinion is leveraging LLMs on a large code base requires skill. How you construct the prompt, and what you keep in context, which model you use, all have a large effect on the output. If you just put it into cursor and throw your hands up, you probably didnt do it right
I gave it a list of the changes I needed and pointed it to the area of the different files that needed updated. I also have comprehensive cursor project rules. If I needed to hand hold any more than that it would take considerably less time to just make the changes myself.
> using v0, I replicated a full nextjs UI copying a major saas player. No backend integration, but the design and UX were stunning
AI is amazing, now all you need to create a stunning UI is for someone else to make it first so an AI can rip it off. Not beating the "plagiarism machine" allegations here.
Heres a secret: Most of the highest funded VC backed software companies are just copying a competitor with a slight product spin/different pricing model
Right, now it's up and comparison against Claude 3.7 is better than I feared based on the wording. Though why does the OpenAI announcement talk of comparison against multiple leading models when the Qodo blog post only tests against Claude 3.7...
They don't disclose parameter counts so it's hard to say exactly how far apart they are in terms of size, but based on the pricing it seems like a pretty wild comparison, with one being an attempt at an ultra-massive SOTA model and one being a model scaled down for efficiency and probably distilled from the big one. The way they're presented as version numbers is business nonsense which obscures a lot about what's going on.
and it's worse on just as many benchmarks by a significant amount. as a consumer I don't care about cheapness, I want the maximum accuracy and performance
If you're looking to test an LLMs ability to solve a coding task without prior knowledge of the task at hand, I don't think their benchmark is super useful.
If you care about understanding relative performance between models for solving known problems and producing correct output format, it's pretty useful.
- Even for well-known problems, we see a large distribution of quality between models (5 to 75% correctness)
- Additionally, we see a large distribution of model's ability to produce responses in formats they were instructed in
At the end of the day, benchmarks are pretty fuzzy, but I always welcome a formalized benchmark as a means to understand model performance over vibe checking.
Answering my own question after some research. It looks like OpenAI decided not to introduce 4.1 in ChatGPT UI because 4.1 is not necessarily a better model than 4o because it is not multi modal.
Now you can imagine introducing a newer "type" of model like 4.1 that's better at following instructions and better at coding to bring a sort of overhead thats already too much with the given options.
OpenAI confirmed somewhere that they have already incorporated the enhancements made in 4.1 to 4o model in ChatGPT UI. I assume they would delegate to 4.1 model if the prompt doesn't require specific 4o capabilities.
Also one of the improvements made to 4.1 is following instructions. This type of thing is better suited for agentic use cases that are typically used in the form of an API.
i've recently set claude 3.7 as the default option for customers when they start new chats in my app. this was a recent change, and i'm feeling good about it. supporting multiple providers can be a nightmare for customer service, especially when it comes to billing and handling response quality queries. with so many choices from just one provider, it simplifies things significantly. curious about how openai manages customer service internally.
Is this correct: OpenAI will sequester 4.1 in the API permanently? And, since November 2024, they've already wrapped much of 4.1's features into ChatGPT 4o?
It seems that OpenAI is really differentiating itself in the AI market by developing the most incomprehensible product names in the history of software.
> We will also begin deprecating GPT‑4.5 Preview in the API, as GPT‑4.1 offers improved or similar performance on many key capabilities at much lower cost and latency. GPT‑4.5 Preview will be turned off in three months, on July 14, 2025, to allow time for developers to transition.
Think of 4.5 as being the lacklustre major upgrade to a software package, pick one maybe Photoshop or whatever. The 4.0 version is still available and most people are continuing to use it, then suddenly 4.0 gets a small upgrade which makes it considerably better and the vendor starts talking about how the real future is in 5.0.
I wish OpenAI had invented this but it’s not that uncommon.
The better the benchmarks, the worse the model is. Subjectively for me the more advanced models dont follow instructions, and are less capable of implementing features or building stuff. I could not tell a difference in blind testing SOTA models gemini, claude, openai, deepseek. There has been no major improvements in the LLM space since the original models gained popularity. Each release claims to be much better the last, and every time i have been disappointed and think this is worse.
First it was the models stopped putting in effort and felt lazy, tell it to do something and it will tell you to do it your self. Now its the opposite and the models go ham changing everything they see, instead of changing one line, SOTA models rather rewrite the whole project and still not fix the issue.
Two years back I totally thought these models are amazing. I always would test out the newest models and would get hyped up about it. Every problem i had i thought if i just prompt it differently I can get it to solve this. Often times i have spent hours prompting starting new chats, adding more context. Now i realize its kinda useless and its better to just accept the models where they are, rather then try and make them a one stop shop, or try to stretch capabilities.
I think this release I won’t even test it out, im not interested anymore. I’ll probably just continue using deepseek free, and gemini free. I canceled my openai subscription like 6 months ago, and canceled claude after 3.7 disappointment.
As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between
- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)
- o3-mini (web search, CoT, canvas, but no image generation)
- o1 (CoT, maybe better than o3, but no canvas or web search and also no images)
- Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
- 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
- 4o "with scheduled tasks" (why on earth is that a model and not a tool that the other models can use!?)
Why do I have to figure all of this out myself?
> - Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
Same here, which is a real shame. I've switched to DeepResearch with Gemini 2.5 Pro over the last few days where paid users have a 20/day limit instead of 10/month and it's been great, especially since now Gemini seems to browse 10x more pages than OpenAI Deep Research (on the order of 200-400 pages versus 20-40).
The reports are too verbose but having it research random development ideas, or how to do something particularly complex with a specific library, or different approaches or architectures to a problem has been very productive without sliding into vibe coding territory.
Wow, I wondered what the limit was. I never checked, but I've been using it hesitantly since I burn up OpenAI's limit as soon as it resets. Thanks for the clarity.
I'm all-in on Deep Research. It can conduct research on niche historical topics that have no central articles in minutes, which typically were taking me days or weeks to delve into.
I like Deep Research but as a historian I have to tell you. I've used it for history themes to calibrated my expectations and it is a nice tool but... It can easily brush over nuanced discussions and just return folk wisdom from blogs.
What I love most about history is it has lots of irreducible complexity and poring over the literature, both primary and secondary sources, is often the only way to develop an understanding.
I read Being and Time recently and it has a load of concepts that are defined iteratively. There's a lot wrong with how it's written but it's an unfinished book written a 100 years ago so, I cant complain too much.
Because it's quite long, if I asked Perplexity* to remind me what something meant, it would very rarely return something helpful, but, to be fair, I cant really fault it for being a bit useless with a very difficult to comprehend text, where there are several competing styles of reading, many of whom are convinced they are correct.
But I started to notice a pattern of where it would pull answers from some weird spots, especially when I asked it to do deep research. Like, a paper from a University's server that's using concepts in the book to ground qualitative research, which is fine and practical explications are often useful ways into a dense concept, but it's kinda a really weird place to be the first initial academic source. It'll draw on Reddit a weird amount too, or it'll somehow pull a page of definitions from a handout for some University tutorial. And it wont default to the peer reviewed free philosophy encyclopedias that are online and well known.
It's just weird. I was just using it to try and reinforce my actual reading of the text but I more came away thinking that in certain domains, this end of AI is allowing people to conflate having access to information, with learning about something.
*it's just what I have access to.
When I've wanted it to not do things like this, I've had good luck directing it to... not look at those sources.
For example when I've wanted to understand an unfolding story better than the news, I've told it to ignore the media and go only to original sources (e.g. speech transcripts, material written by the people involved, etc.)
That use case seems pretty self defeating when a good news source will usually try to at least validate first-party materials which an llm cannot do.
Deep Search is pretty good for current news stories. I've had it analyze some legal developments in a European nation recently and it gave me a great overview.
LLMs seem fantastic at generalizing broad thought and is not great at outliers. It sort of smooths over the knowledge curve confidently, which is a bit like in psychology where only CBT therapy is accepted, even if there are many much more highly effectual methodologies on individuals, just not at the population level.
Interesting use case. My problem is that for niche subjects the crawled pages probably have not captured the information and the response becomes irrelevant. Perhaps gemini will produce better results just because it takes into account much more pages
I also like Perplexity’s 3/day limit! If I use them up (which I almost never do) I can just refresh the next day
I've only ever had to use DeepResearch for academic literature review. What do you guys use it for which hits your quotas so quickly?
I use it for mundane shit that I don’t want to spend hours doing.
My son and I go to a lot of concerts and collect patches. Unfortunately we started collecting long after we started going to concerts.
I had a list of about 30 bands I wanted patches for.
I was able to give precise instructions on what I wanted. Deep research came back with direct links for every patch I wanted.
It took me two minutes to write up the prompt and it did all the heavy lifting.
Write a comparison between X and Y
[dead]
I use them as follows:
o1-pro: anything important involving accuracy or reasoning. Does the best at accomplishing things correctly in one go even with lots of context.
deepseek R1: anything where I want high quality non-academic prose or poetry. Hands down the best model for these. Also very solid for fast and interesting analytical takes. I love bouncing ideas around with R1 and Grok-3 bc of their fast responses and reasoning. I think R1 is the most creative yet also the best at mimicking prose styles and tone. I've speculated that Grok-3 is R1 with mods and think it's reasonably likely.
4o: image generation, occasionally something else but never for code or analysis. Can't wait till it can generate accurate technical diagrams from text.
o3-mini-high and grok-3: code or analysis that I don't want to wait for o1-pro to complete.
claude 3.7: occasionally for code if the other models are making lots of errors. Sometimes models will anchor to outdated information in spite of being informed of newer information.
gemini models: occasionally I test to see if they are competitive, so far not really, though I sense they are good at certain things. Excited to try 2.5 Deep Research more, as it seems promising.
Perplexity: discontinued subscription once the search functionality in other models improved.
I'm really looking forward to o3-pro. Let's hope it's available soon as there are some things I'm working on that are on hold waiting for it.
Phind was fine-tuned specifically to produce inline Mermaid diagrams for technical questions (I'm the founder).
I really loved Phind and always think of it as the OG perplexity / RAG search engine.
Sadly stopped my subscription, when you removed the ability to weight my own domains...
Otherwise the fine-tune for your output format for technical questions is great, with the options, the pro/contra and the mermaid diagrams. Just way better for technical searches, than what all the generic services can provide.
Have you been interviewed anywhere? Curious to read your story.
Gemini 2.5 Pro is quite good at code.
Has become my go to for use in Cursor. Claude 3.7 needs to be restrained too much.
Same here, 2.5 Pro is very good at coding. But it’s also cocky and blames everything but itself for something not working. Eg “the linter must be wrong you should reinstall it”, “looks to be a problem with the Go compiler”, “this function HAS to exist, that’s weird that we’re getting an error”
And it often just stops like “ok this is still not working. You fix it and tell me when it’s done so I can continue”.
But for coding: Gemini Pro 2.5 > Sonnet 3.5 > Sonnet 3.7
Weird. For me, sonnet 3.7 is much more focussed and in particular works much better when finding the places that needs change and using other tooling. I guess the integration in cursor is just much better and more mature.
I find that Gemini 2.5 Pro tends to produce working but over-complicated code more often than Claude 3.7.
Which might be a side-effect of the reasoning.
In my experience whenever these models solve a math or logic puzzle with reasoning, they generate extremely long and convoluted chains of thought which show up in the solution.
In contrast a human would come up with a solution with 2-3 steps. Perhaps something similar is going on here with the generated code.
This. sonnet 3.7 is a wild horse. Gemini 2.5 Pro is like a 33 yo expert. o1 feels like a mature, senior colleague.
Gemini 2.5 is very good. Since you have to wait for reasoning tokens, it takes longer to come back, but the responses are high quality IME.
You probably know this but it can already generate accurate diagrams. Just ask for the output in a diagram language like mermaid or graphviz
My experience is it often produces terrible diagrams. Things clearly overlap, lines make no sense. I'm not surprised as if you told me to layout a diagram in XML/YAML there would be obvious mistakes and layout issues.
I'm not really certain a text output model can ever do well here.
FWIW I think a multimodal model could be trained to do extremely well with it given sufficient training data. A combination of textual description of the system and/or diagram, source code (mermaid, SVG, etc.) for the diagram, and the resulting image, with training to translate between all three.
Agreed. Even simply I'm sure a service like this already exists (or could easily exist) where the workflow is something like:
1. User provides information
2. LLM generates structured output for whatever modeling language
3. Same or other multimodal LLM reviews the generated graph for styling / positioning issues and ensure its matches user request.
4. LLM generates structured output based on the feedback.
5. etc...
But you could probably fine-tune a multimodal model to do it in one shot, or way more effectively.
I had a latex tikz diagram problem which sonnet 3.7 couldn't handle even after 10 attempts. Gemini 2.5 Pro solved it on the second try.
Had the same experience. o3-mini failing misreably, claude 3.7 as well, but gemini 2.5 pro solved it perfectly. (image of diagram without source to tikz diagram)
I've had mixed and inconsistent results and it hasn't been able to iterate effectively when it gets close. Could be that I need to refine my approach to prompting. I've tried mermaid and SVG mostly, but will also try graphviz based on your suggestion.
Plantuml (action) diagrams are my go to
You probably know this and are looking for consistency but, a little trick I use is to feed the original data of what I need as a diagram and to re-imagine, it as an image “ready for print” - not native, but still a time saver and just studying with unstructured data or handles this surprisingly well. Again not native…naive, yes. Native, not yet. Be sure to double check triple check as always. give it the ol’ OCD treatment.
re: "grok-3 is r1 with mods" -- do you mean you believe they distilled deepseek r1? that was my assumption as well, though i thought it more jokingly at first it would make a lot of sense. i actually enjoy grok 3 quite a lot, it has some of the most entertaining thinking traces.
> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers
Ha! That's the funniest and best description of 4.5 I've seen.
Switch to Gemini 2.5 Pro, and be happy. It's better in every aspect.
Warning to potential users: it's Google.
Not sure how or why OpenAI would be any better?
It's not. It's closed source. But Google is still the worst when it comes to privacy.
I prefer to use only open source models that don't have the possibility to share my data with a third party.
The notion that Google is worse at carefully managing PII than a Wild West place like OpenAI (or Meta, or almost any major alternative) is…not an accurate characterization, in my experience. Ad tech companies (and AI companies) obsessively capture data, but Google internally has always been equally obsessive about isolating and protecting that data. Almost no one can touch it; access is highly restricted and carefully managed; anything that even smells adjacent to ML on personal data has gotten high-level employees fired.
Fully private and local inference is indeed great, but of the centralized players, Google, Microsoft, and Apple are leagues ahead of the newer generation in conservatism and care around personal data.
> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
Is that an LLM hallucination?
It’s a tongue in cheek reference to how audiophiles claim to hear differences in audio quality.
Pretty dark times on HN, when a silly (and obvious) joke gets someone labeled as AI.
Obvious to you perhaps not to everyone. Self-awareness goes a long way
Possibly, but it's running on 100% wetware, I promise!
Looks like NDA violation )
For code it's actually quite good so far IME. Not quite as good as Gemini 2.5 Pro but much faster. I've integrated it into polychat.co if you want to try it out and compare with other models. I usually ask 2 to 5 models the same question there to reduce the model overload anxiety.
I do like the vinyl and analog amplifiers. I certainly hear the warmth in this case.
This sounds like whole lot of mental overhead to avoid using Gemini.
My thoughts is this model release is driven by the agentic app push if this year. Since to my knowledge all the big agentic apps (cursor, bolt, shortwave) that I know of use claude 3.7 because it’s so much better at instruction following and tool calling than GPT 4o so this model feels like GPT 4o (or distilled 4.5?) with some post training focusing on what these agentic workloads need most
I'm also very curious of each limit for each model. Never thought about limit before upgrading my plan
Hey also try out Monday, it did something pretty cool. Its a version of 4o which switched between reasoning and plain token generation on the fly. My guess is that is what GPT V will be.
Disagree. It's really not complicated at all to me. Not sure why people make a big fuss over this. I don't want an AI automating which AI it chooses for me. I already know through lots of testing intuitively which one I want.
If they abstract all this away into one interface I won't know which model I'm getting. I prefer reliability.
What do you mean when you say that 4o doesn’t have chain-of-thought?
Must be weird to not have an "AI router" in this case.
Just ask the first AI that comes to mind which one you could ask.
what's hilarious to me is that I asked ChatGPT about the model names and approachs and it did a better job than they have.
Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, output tokens per second, and knowledge cutoff month/year:
I'm not sure this is really an apples-to-apples comparison as it may involve different test scaffolding and levels of "thinking". Tokens per second numbers are from here: https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr... and I'm assuming 4.1 is the speed of 4o given the "latency" graph in the article putting them at the same latency.Is it available in Cursor yet?
I just finished updating the aider polyglot leaderboard [0] with GPT-4.1, mini and nano. My results basically agree with OpenAI's published numbers.
Results, with other models for comparison:
Aider v0.82.0 is also out with support for these new models [1]. Aider wrote 92% of the code in this release, a tie with v0.78.0 from 3 weeks ago.[0] https://aider.chat/docs/leaderboards/
[1] https://aider.chat/HISTORY.html
Did you benchmarked combo: DeepSeek R1 + DeepSeek V3 (0324)? There is combo on 3rd place : DeepSeek R1 + claude-3-5-sonnet-20241022 and also V3 new beating claude 3.5 so in theory R1 + V3 should be even on 2nd place. Just curious if that would be the case
What model are you personally using in your aider coding? :)
Mostly Gemini 2.5 Pro lately.
I get asked this often enough that I have a FAQ entry with automatically updating statistics [0].
[0] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-bui...https://aider.chat/docs/leaderboards/ shows 73% rather than 69% for Gemini 2.5 Pro?
Looks like they also added the cost of the benchmark run to the leaderboard, which is quite cool. Cost per output token is no longer representative of the actual cost when the number of tokens can vary by an order of magnitude for the same problem just based on how many thinking tokens the model is told to use.
Aider author here.
Based on some DMs with the Gemini team, they weren't aware that aider supports a "diff-fenced" edit format. And that it is specifically tuned to work well with Gemini models. So they didn't think to try it when they ran the aider benchmarks internally.
Beyond that, I spend significant energy tuning aider to work well with top models. That is in fact the entire reason for aider's benchmark suite: to quantitatively measure and improve how well aider works with LLMs.
Aider makes various adjustments to how it prompts and interacts with most every top model, to provide the very best possible AI coding results.
Thanks, that's interesting info. It seems to me that such tuning, while making Aider more useful, and making the benchmark useful in the specific context of deciding which model to use in Aider itself, reduces the value of the benchmark in evaluating overall model quality for use in other tools or contexts, as people use it for today. Models that get more tuning will outperform models that get less tuning, and existing models will have an advantage over new ones by virtue of already being tuned.
I think you could argue the other side too... All of these models do better and worse with subtly different prompting that is non-obvious and unintuitive. Anybody using different models for "real work" are going to be tuning their prompts specifically to a model. Aider (without inside knowledge) can't possibly max out a given model's ability, but it can provide a reasonable approximation of what somebody can achieve with some effort.
Thank you for providing such amazing tools for us. Aider is a godsend, when working with large codebase to get an overview.
There are different scores reported by Google for "diff" and "whole" modes, and the others were "diff" so I chose the "diff" score. Hard to make a real apples-to-apples comparison.
The 73% on the current leaderboard is using "diff", not "whole". (Well, diff-fenced, but the difference is just the location of the filename.)
Huh, seems like Aider made a special mode specifically for Gemini[1] some time after Google's announcement blog post with official performance numbers. Still not sure it makes sense to quote that new score next to the others. In any case Gemini's 69% is the top score even without a special mode.
[1] https://aider.chat/docs/more/edit-formats.html#diff-fenced:~...
The mode wasn't added after the announcement, Aider has had it for almost a year: https://aider.chat/HISTORY.html#aider-v0320
This benchmark has an authoritative source of results (the leaderboard), so it seems obvious that it's the number that should be used.
OK but it was still added specifically to improve Gemini and nobody else on the leaderboard uses it. Google themselves do not use it when they benchmark their own models against others. They use the regular diff mode that everyone else uses. https://blog.google/technology/google-deepmind/gemini-model-...
They just pick the best performer out of the built-in modes they offer.
Interesting data point about the models behavior, but even moreso it's a recommendation of which way to configure the model for optimal performance.
I do consider this to be an apple-to-apples benchmark since they're evaluating real world performance.
Yes, it is available in Cursor[1] and Windsurf[2] as well.
[1] https://twitter.com/cursor_ai/status/1911835651810738406
[2] https://twitter.com/windsurf_ai/status/1911833698825286142
And free on windsurf for a week! Vibe time.
Its available for free in Windsurf so you can try it out there.
Edit: Now also in Cursor
Yes on both Cursor and Windsurf.
https://twitter.com/cursor_ai/status/1911835651810738406
Yup GPT 4.1 isn't good at all compared to the others. I tried a bunch of different scenarios, for me the winners:
Deepseek for general chat and research Claude 3.7 for coding Gemini 2.5 Pro experimental for deep research
In terms of price Deepseek is still absolutely fire!
OpenAI is in trouble honestly.
One task I do is I feed the models the text of entire books, and ask them various questions about it ('what happened in Chapter 4', 'what did character X do in the book' etc.).
GPT 4.1 is the first model that has provided a human-quality answer to these questions. It seems to be the first model that can follow plotlines, and character motivations accurately.
I'd say since text processing is a very important use case for LLMs, that's quite noteworthy.
don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for:
- telling the model to be persistent (+20%)
- dont self-inject/parse toolcalls (+2%)
- prompted planning (+4%)
- JSON BAD - use XML or arxiv 2406.13121 (GDM format)
- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD
- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
source: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...
As an aside, one of the worst aspects of the rise of LLMs, for me, has been the wholesale replacement of engineering with trial-and-error hand-waving. Try this, or maybe that, and maybe you'll see a +5% improvement. Why? Who knows.
It's just not how I like to work.
I think trial-and-error hand-waving isn't all that far from experimentation.
As an aside, I was working in the games industry when multi-core was brand new. Maybe Xbox-360 and PS3? I'm hazy on the exact consoles but there was one generation where the major platforms all went multi-core.
No one knew how to best use the multi-core systems for gaming. I attended numerous tech talks by teams that had tried different approaches and were give similar "maybe do this and maybe see x% improvement?". There was a lot of experimentation. It took a few years before things settled and best practices became even somewhat standardized.
Some people found that era frustrating and didn't like to work in that way. Others loved the fact it was a wide open field of study where they could discover things.
Yes, it was the generation of the X360 and PS3. X360 was 3 core and the PS3 was 1+7 core (sort of a big.little setup).
Although it took many, many more years until games started to actually use multi-core properly. With rendering being on a 16.67ms / 8.33ms budget and rendering tied to world state, it was just really hard to not tie everything into eachother.
Even today you'll usually only see 2-4 cores actually getting significant load.
there probably was still a structured way to test this through cross hatching but yeah like blind guessing might take longer and arrive at the same solution
Performance optimization is different, because there's still some kind of a baseline truth. Every knows what a FPS is, and +5% FPS is +5% FPS. Even the tricky cases have some kind of boundary (+5% FPS on this hardware but -10% on this other hardware, +2% on scenes meeting these conditions but -3% otherwise, etc).
Meanwhile, nobody can agree on what a "good" LLM in, let alone how to measure it.
The disadvantage is that LLMs are probabilistic, mercurial, unreliable.
The advantage is that humans are probabilistic, mercurial and unreliable, and LLMs are a way to bridge the gap between humans and machines that, while not wholly reliable, makes the gap much smaller than it used to be.
If you're not making software that interacts with humans or their fuzzy outputs (text, images, voice etc.), and have the luxury of well defined schema, you're not going to see the advantage side.
I feel like this a common pattern with people who work in STEM. As someone who is used to working with formal proofs, equations, math, having a startup taught me how to rewire myself to work with the unknowns, imperfect solutions, messy details. I'm going on a tangent, but just wanted to share.
Software engineering has involved a lot of people doing trial-and-error hand-waving for at least a decade. We are now codifying the trend.
One of the major advantages and disadvantages of LLMs is they act a bit more like humans. I feel like most "prompt advice" out there is very similar to how you would teach a person as well. Teachers and parents have some advantages here.
Out of curiosity, what do you work on where you don’t have to experiment with different solutions to see what works best?
Usually when we’re doing it in practice there’s _somewhat_ more awareness of the mechanics than just throwing random obstructions in and hoping for the best.
LLMs are still very young. We'll get there in time. I don't see how it's any different than optimizing for new CPU/GPU architectures other than the fact that the latter is now a decades-old practice.
> I don't see how it's any different than optimizing for new CPU/GPU architectures
I mean that seems wild to say to me. Those architectures have documentation and aren't magic black boxes that we chuck inputs at and hope for the best: we do pretty much that with LLMs.
If that's how you optimise, I'm genuinely shocked.
i bet if we talked to a real low level hardware systems/chip engineer they'd laugh and take another shot at how we put them on a pedestal
Not really, in my experience. There's still fundamental differences between designed systems and trained LLMs.
Not to pick on you, but this is exactly the objectionable handwaving. What makes you think we'll get there? The kinds of errors that these technologies make have not changed, and anything that anyone learns about how to make them better changes dramatically from moment to moment and no one can really control that. It is different because those other things were deterministic ...
In comp sci it’s been deterministic, but in other science disciplines (eg medicine) it’s not. Also in lots of science it looks non-deterministic until it’s not (eg medicine is theoretically deterministic, but you have to reason about it experimentally and with probabilities - doesn’t mean novel drugs aren’t technological advancements).
And while the kind of errors hasn’t changed, the quantity and severity of the errors has dropped dramatically in a relatively short span of time.
The problem has always been that every token is suspect.
most people are building straightforward crud apps. no experimentation required.
[citation needed]
In my experience, even simple CRUD apps generally have some domain-specific intricacies or edge cases that take some amount of experimentation to get right.
Idk, it feels like this is what you’d expect versus the actual reality of building something.
From my experience, even building on popular platforms, there are many bugs or poorly documented behaviors in core controls or APIs.
And performance issues in particular can be difficult to fix without trial and error.
Not helpful when the llm knowledge cutoff is a year out of date and api and lib has been changed since
prompt tuning is a temporary necessity
Yeah this is why I don't like statistical and ML solutions in general. Monte Carlo sampling is already kinda throwing bullshit at the wall and hoping something works with absolutely zero guarantees and it's perfectly explainable.
But unfortunately for us, clean and logical classical methods suck ass in comparison so we have no other choice but to deal with the uncertainty.
> no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
Challenge accepted.
That said, the exact quote from the linked notebook is "It’s generally not necessary to use all-caps or other incentives like bribes or tips, but developers can experiment with this for extra emphasis if so desired.", but the demo examples OpenAI provides do like using ALL CAPS.
references for all the above + added more notes here on pricing https://x.com/swyx/status/1911849229188022278
and we'll be publishing our 4.1 pod later today https://www.youtube.com/@latentspacepod
I'm surprised and a little disappointed by the result concerning instructions at the top, because it's incompatible with prompt caching: I would much rather cache the part of the prompt that includes the long document and then swap out the user question at the end.
The way I understand it: if the instruction are at the top, the KV entries computed for "content" can be influenced by the instructions - the model can "focus" on what you're asking it to do and perform some computation, while it's "reading" the content. Otherwise, you're completely relaying on attention to find the information in the content, leaving it much less token space to "think".
Prompt on bottom is also easier for humans to read as I can have my actual question and the model’s answer on screen at the same time instead of scrolling through 70k tokens of context between them.
Wouldn’t it be the other way around?
If the instructions are at the top the LV cache entries can be pre computed and cached.
If they’re at the bottom the entries at the lower layers will have a dependency on the user input.
It's placing instructions AND user query at top and bottom. So if you have a prompt like this:
The key-values for first 5200 tokens can be cached and it's efficient to swap out the user query for a different one, you only need to prefill 32 tokens and generate output.But the recommendation is to use this, where in this case you can only cache the first 200 tokens and need to prefill 5264 tokens every time the user submits a new query.
Ahh I see. Thank you for the explanation. I didn’t realise their was user input straight after the system prompt.
yep. we address it in the podcast. presumably this is just a recent discovery and can be post-trained away.
If you're skimming a text to answer a specific question, you can go a lot faster than if you have to memorize the text well enough to answer an unknown question after the fact.
The size of that SWE-bench Verified prompt shows how much work has gone into the prompt to get the highest possible score for that model. A third party might go to a model from a different provider before going to that extent of fine-tuning of the prompt.
>- dont self-inject/parse toolcalls (+2%)
What is meant by this?
Use the OpenAI API/SDK for function calling instead of rolling your own inside the prompt.
> - JSON BAD - use XML or arxiv 2406.13121 (GDM format)
And yet, all function calling and MCP is done through JSON...
JSON is just MCP's transport layer. you can reformat to xml to pass into model
Yeah anyone who has worked with these models knows how much they struggle with JSON inputs.
Why XML over JSON? Are they just saying that because XML is more tokens so they can make more money?
I have been trying GPT-4.1 for a few hours by now through Cursor on a fairly complicated code base. For reference, my gold standard for a coding agent is Claude Sonnet 3.7 despite its tendency to diverge and lose focus.
My take aways:
- This is the first model from OpenAI that feels relatively agentic to me (o3-mini sucks at tool use, 4o just sucks). It seems to be able to piece together several tools to reach the desired goal and follows a roughly coherent plan.
- There is still more work to do here. Despite OpenAI's cookbook[0] and some prompt engineering on my side, GPT-4.1 stops quickly to ask questions, getting into a quite useless "convo mode". Its tool calls fails way too often as well in my opinion.
- It's also able to handle significantly less complexity than Claude, resulting in some comical failures. Where Claude would create server endpoints, frontend components and routes and connect the two, GPT-4.1 creates simplistic UI that calls a mock API despite explicit instructions. When prompted to fix it, it went haywire and couldn't handle the multiple scopes involved in that test app.
- With that said, within all these parameters, it's much less unnerving than Claude and it sticks to the request, as long as the request is not too complex.
My conclusion: I like it, and totally see where it shines, narrow targeted work, adding to Claude 3.7 - for creative work, and Gemini 2.5 Pro for deep complex tasks. GPT-4.1 does feel like a smaller model compared to these last two, but maybe I just need to use it for longer.
0: https://cookbook.openai.com/examples/gpt4-1_prompting_guide
I feel the same way about these models as you conclude. Gemini 2.5 is where I paste whole projects for major refactoring efforts or building big new bits of functionality. Claude 3.7 is great for most day to day edits. And 4.1 okay for small things.
I hope they release a distillation of 4.5 that uses the same training approach; that might be a pretty decent model.
I completely agree. On initial takeaway I find 3.7 sonnet to still be the superior coding model. I'm suspicious now of how they decide these benchmarks...
From OpenAI's announcement:
> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better suggestion in 55% of cases. Notably, they found that GPT‑4.1 excels at both precision (knowing when not to make suggestions) and comprehensiveness (providing thorough analysis when warranted).
https://www.qodo.ai/blog/benchmarked-gpt-4-1/
Interesting link. Worth noting that the pull requests were judged by o3-mini. Further, I'm not sure that 55% vs 45% is a huge difference.
Good point. They said they validated the results by testing with other models (including Claude), as well as with manual sanity checks.
55% to 45% definitely isn't a blowout but it is meaningful — in terms of ELO it equates to about a 36 point difference. So not in a different league but definitely a clear edge
Maybe not as much to us, but for people building these tools, 4.1 being significantly cheaper than Clause 3.7 is a huge difference.
I first read it as 55% better, which sounds significantly higher than ~22% which they report here. Sounds misleading.
That's not a lot of samples for such a small effect, I don't think it's statistically significant (p-value of around 10%).
is there a shorthand/heuristic to calculate pvalue given n samples and effect size?
There are no great shorthands, but here are a few rules of thumb I use:
- for N=100, worst case standard error of the mean is ~5% (it shrinks parabolically the further p gets from 50%)
- multiply by ~2 to go from standard error of the mean to 95% confidence interval
- scale sample size by sqrt(N)
So:
- N=100: +/- 10%
- N=1000: +/- 3%
- N=10000: +/- 1%
(And if comparing two independent distributions, multiply by sqrt(2). But if they’re measured on the same problems, then instead multiply by between 1 and sqrt(2) to account for them finding the same easy problems easy and hard problems hard - aka positive covariance.)
p-value of 7.9% — so very close to statistical significance.
the p-value for GPT-4.1 having a win rate of at least 49% is 4.92%, so we can say conclusively that GPT-4.1 is at least (essentially) evenly matched with Claude Sonnet 3.7, if not better.
Given that Claude Sonnet 3.7 has been generally considered to be the best (non-reasoning) model for coding, and given that GPT-4.1 is substantially cheaper ($2/million input, $8/million output vs. $3/million input, $15/million output), I think it's safe to say that this is significant news, although not a game changer
I make it 8.9% with a binomial test[0]. I rounded that to 10%, because any more precision than that was not justified.
Specifically, the results from the blog post are impossible: with 200 samples, you can't possibly have the claimed 54.9/45.1 split of binary outcomes. Either they didn't actually make 200 tests but some other number, they didn't actually get the results they reported, or they did some kind of undocumented data munging like excluding all tied results. In any case, the uncertainty about the input data is larger than the uncertainty from the rounding.
[0] In R, binom.test(110, 200, 0.5, alternative="greater")
That's a marketing page for something called qodo that sells ai code reviews. At no point were the ai code reviews judged by competent engineers. It is just ai generated trash all the way down.
>4.1 Was better in 55% of cases
Um, isn't that just a fancy way of saying it is slightly better
>Score of 6.81 against 6.66
So very slightly better
"they found that GPT‑4.1 excels at both precision..."
They didn't say it is better than Claude at precision etc. Just that it excels.
Unfortunately, AI has still not concluded that manipulations by the marketing dept is a plague...
A great way to upsell 2% better! I should start doing that.
Good marketing if you're selling a discount all purpose cleaner, not so much for an API.
I don't think the absolute score means much — judge models have a tendency to score around 7/10 lol
55% vs. 45% equates to about a 36 point difference in ELO. in chess that would be two players in the same league but one with a clear edge
Rarely are two models put head-to-head though. If Claude Sonnet 3.7 isn't able to generate a good PR review (for whatever reason), a 2% better review isn't all that strong of a value proposition.
the point is oai is saying they have a viable Claude Sonnet competitor now
I think an under appreciated reality is that all of the large AI labs and OpenAI in particular are fighting multiple market battles at once. This is coming across in both the number of products and the packaging.
1, to win consumer growth they have continued to benefit on hyper viral moments, lately that was was image generation in 4o, which likely was technically possible a long time before launched. 2, for enterprise workloads and large API use, they seem to have focused less lately but the pricing of 4.1 is clearly an answer to Gemini which has been winning on ultra high volume and consistency. 3, for full frontier benchmarks they pushed out 4.5 to stay SOTA and attract the best researchers. 4, on top of all they they had to, and did, quickly answer the reasoning promise and DeepSeek threat with faster and cheaper o models.
They are still winning many of these battles but history highlights how hard multi front warfare is, at least for teams of humans.
I agree. 4.1 seems to be a release that addresses shortcomings of 4o in coding compared to Claude 3.7 and Gemini 2.0 and 2.5
On that note, I want to see benchmarks for which LLM's are best at translating between languages. To me, it's an entire product category.
I would love to see a stackexchange-like site where humans ask questions and we get to vote on the reply by various LLMs.
is this like what you're thinking of? https://lmarena.ai
Kind of. But lmarena.ai has no way to see results to questions people asked and it only lets you look at two responses side by side.
There are probably many more small battles being fought or emerging. I think voice and PDF parsing are growing battles too.
Here's a summary of this Hacker News thread created by GPT-4.1 (the full sized model) when the conversation hit 164 comments: https://gist.github.com/simonw/93b2a67a54667ac46a247e7c5a2fe...
I think it did very well - it's clearly good at instruction following.
Total token cost: 11,758 input, 2,743 output = 4.546 cents.
Same experiment run with GPT-4.1 mini: https://gist.github.com/simonw/325e6e5e63d449cc5394e92b8f2a3... (0.8802 cents)
And GPT-4.1 nano: https://gist.github.com/simonw/1d19f034edf285a788245b7b08734... (0.2018 cents)
Hey Simon, I love how you generates these summaries and share them on every model release. Do you have a quick script that allows you to do that? Would love to take a look if possible :)
He has a couple of nifty plugins to the LLM utility [1] so I would guess its something as simple as ```llm -t fabric:some_prompt_template -f hn:1234567890``` and that applies a template (in this case from a fabric library) and then appends a 'fragment' block from HN plugin which gets the comments, strips everything but the author and text, adds an index number (1.2.3.x), and inserts it into the prompt (+ SQLite).
[1] https://llm.datasette.io/en/stable/plugins/directory.html#fr...
I use this one: https://til.simonwillison.net/llms/claude-hacker-news-themes
Now try Deepseek V3 and see the magic!
Are there any benchmarks or someone who did tests of performance of using this long max token models in scenarios where you actually use more of this token limit?
I found from my experience with Gemini models that after ~200k that the quality drops and that it basically doesn't keep track of things. But I don't have any numbers or systematic study of this behavior.
I think all providers who announce increased max token limit should address that. Because I don't think it is useful to just say that max allowed tokens are 1M when you basically cannot use anything near that in practice.
The problem is that while you can train a model with the hyperparameter of "context size" set to 1M, there's very little 1M data to train on. Most of your model's ability to follow long context comes from the fact that it's trained on lots of (stolen) books; in fact I believe OpenAI just outright said in court that they can't do long context without training on books.
Novels are usually measured in terms of words; and there's a rule of thumb that four tokens make up about three words. So that 200k token wall you're hitting is right when most authors stop writing. 150k is already considered long for a novel, and to train 1M properly, you'd need not only a 750k book, but many of them. Humans just don't write or read that much text at once.
To get around this, whoever is training these models would need to change their training strategy to either:
- Group books in a series together as a single, very long text to be trained on
- Train on multiple unrelated books at once in the same context window
- Amplify the gradients by the length of the text being trained on so that the fewer long texts that do exist have greater influence on the model weights as a whole.
I suspect they're doing #2, just to get some gradients onto the longer end of the context window, but that also is going to diminish long-context reasoning because there's no reason for the model to develop a connection between, say, token 32 and token 985,234.
I'm not sure to which extent this opinion is accurately informed. It is well known that nobody trains on 1M token-long content. It wouldn't work anyway as the dependencies are too far fetched and you end up with vanishing gradients.
RoPE (Rotary Positional Embeddings, think modulo or periodic arithmetics) scaling is key, whereby the model is trained on 16k tokens long content, and then scaled up to 100k+ [0]. Qwen 1M (who has near perfect recall over the complete window [1]) and Llama 4 10M pushed the limits of this technique, with Qwen reliably training with a much higher RoPE base, and Llama 4 coming up with iRoPE which claims scaling to extremely long contexts up to infinity.
[0]: https://arxiv.org/html/2310.05209v2
[1]: https://qwenlm.github.io/blog/qwen2.5-turbo/#passkey-retriev...
But Llama 4 Scout does badly on long context benchmarks despite claiming 10M. It scores 1 slot above Llama 3.1 8B in this one[1].
[1] https://github.com/adobe-research/NoLiMa
Indeed, but it does not take away the fact that long context is not trained through long content but by scaling short content instead.
Is there any evidence that GPT-4.1 is using RoPE to scale context?
Also, I don't know about Qwen, but I know Llama 4 has severe performance issues, so I wouldn't use that as an example.
I am not sure about public evidence. But the memory requirements alone to train on 1M long windows would make it a very unrealistic proposition compared to RoPE scaling. And as I mentioned RoPE is essential for long context anyway. You can't train it in the "normal way". Please see the paper I linked previously for more context (pun not intended) on RoPE.
Re: Llama 4, please see the sibling comment.
No, there's a fundamental limitation of Transformer architecture:
Training data isn't the problem.In principle, as you scale transformer you get more heads and more dimensions in each vector, so bandwidth of attention data bus goes up and thus precision of recall goes up too.
codebases of high quality open source projects and their major dependencies are probably another good source. also: "transformative fair use", not "stolen"
What about old books? Wikipedia? Law texts? Programming languages documentations?
How many tokens is a 100 pages PDF? 10k to 100k?
For reference, I think a common approximation is one token being 0.75 words.
For a 100 page book, that translates to around 50,000 tokens. For 1 mil+ tokens, we need to be looking at 2000+ page books. That's pretty rare, even for documentation.
It doesn't have to be text-based, though. I could see films and TV shows becoming increasingly important for long-context model training.
What about the role of synthetic data?
Synthetic data requires a discriminator that can select the highest quality results to feed back into training. Training a discriminator is easier than a full blown LLM, but it still suffers from a lack of high quality training data in the case of 1M context windows. How do you train a discriminator to select good 2,000 page synthetic books if the only ones you have to train it with are Proust and concatenated Harry Potter/Game of Thrones/etc.
Wikipedia does not have many pages that are 750k words. According to Special:LongPages[1], the longest page right now is a little under 750k bytes.
https://en.wikipedia.org/wiki/List_of_chiropterans
Despite listing all presently known bats, the majority of "list of chiropterans" byte count is code that generates references to the IUCN Red List, not actual text. Most of Wikipedia's longest articles are code.
[1] https://en.wikipedia.org/wiki/Special:LongPages
I mean, can’t they just train on some huge codebases? There’s lots of 100KLOC codebases out there which would probably get close to 1M tokens.
Isn't the problem more that the "needle in a haystack" eval (i said word X once, where) is really not relevant to most long context LLM use cases like code, where you need the context from all the stuff simultaneously rather than identifying a single, quite separate relevant section?
What you're describing as "needle in a haystack" is a necessary requirement for the downstream ability you want. The distinction is really how many "things" the LLM can process in a single shot.
LLMs process tokens sequentially, first in a prefilling stage, where it reads your input, then in the generation stage where it outputs response tokens. The attention mechanism is what allows the LLM as it is ingesting or producing tokens to "notice" that a token it has seen previously (your instruction) is related with a token it is now seeing (the code).
Of course this mechanism has limits (correlated with model size), and if the LLM needs to take the whole input in consideration to answer the question the results wouldn't be too good.
I ran NoLiMa on Quasar Alpha (GPT-4.1's stealth mode): https://news.ycombinator.com/item?id=43640166#43640790
Updated results from the authors: https://github.com/adobe-research/NoLiMa
It's the best known performer on this benchmark, but still falls off quickly at even relatively modest context lengths (85% perf at 16K). (Cutting edge reasoning models like Gemini 2.5 Pro haven't been evaluated due to their cost and might outperform it.)
https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...
IMO this is the best long context benchmark. Hopefully they will run it for the new models soon. Needle-in-a-haystack is useless at this point. Llama-4 had perfect needle in a haystack results but horrible real-world-performance.
There are some benchmarks such as Fiction.LiveBench[0] that give an indication and the new Graphwalks approach looks super interesting.
But I'd love to see one specifically for "meaningful coding." Coding has specific properties that are important such as variable tracking (following coreference chains) described in RULER[1]. This paper also cautions against Single-Needle-In-The-Haystack tests which I think the OpenAI one might be. You really need at least Multi-NIAH for it to tell you anything meaningful, which is what they've done for the Gemini models.
I think something a bit more interpretable like `pass@1 rate for coding turns at 128k` would so much more useful than "we have 1m context" (with the acknowledgement that good-enough performance is often domain dependant)
[0] https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...
[1] https://arxiv.org/pdf/2404.06654
This is a paper which echoes your experience, in general. I really wish that when papers like this one were created, someone took the methodology and kept running with it for every model:
> For instance, the NoLiMa benchmark revealed that models like GPT-4o experienced a significant drop from a 99.3% performance rate at 1,000 tokens to 69.7% at 32,000 tokens. Similarly, Llama 3.3 70B's effectiveness decreased from 97.3% at 1,000 tokens to 42.7% at 32,000 tokens, highlighting the challenges LLMs face with longer contexts.
https://arxiv.org/abs/2502.05167
As much as I enjoy Gemini models, I have to agree with you. At some point, interactions with them start resembling talking to people with short-term memory issues, and answers become increasingly unreliable. Now, there are also reports of AI Studio glitching out and not loading these longer conversations.
Is there a reliable method for pruning, summarizing, or otherwise compressing context to overcome such issues?
I’m not optimistic. It’s the Wild West and comparing models for one’s specific use case is difficult, essentially impossible at scale.
Have they implemented "I don't know" yet.
I probably spend 100$ a month on AI coding, and it's great at small straightforward tasks.
Drop it into a larger codebase and it'll get confused. Even if the same tool built it in the first place due to context limits.
Then again, the way things are rapidly improving I suspect I can wait 6 months and they'll have a model that can do what I want.
I agree. I use it a lot but there is endless frustration when the C++ code I am working on gets both complex and largish. Once it gets to a certain size and the context gets too long they all pretty much lose the plot and start producing complete rubbish. It would be great for it to give some measure so I know to take over and not have it start injecting random bugs or deleting functional code. It even starts doing things like returning locally allocated pointers lately.
> Then again, the way things are rapidly improving I suspect I can wait 6 months and they'll have a model that can do what I want.
I believe this. I've been having the forgetting problem happen less with Gemini 2.5 Pro. It does hallucinate, but I can get far just pasting all the docs and a few examples, and asking it to double check everything according to the docs instead of relying on its memory.
Have you tried using a tool like 16x Prompt to send only relevant code to the model?
This helps the model to focus on a subset of codebase thst is relevant to the current task.
https://prompt.16x.engineer/
(I built it)
Just some tiny feedback if you didn’t mind; in the free version 10 prompts/day is unticked which sort of hints that there isn’t a 10 prompt/day limit, but I’m guessing that’s not what you want to say?
Ah I see what you mean. I was trying to convey that this is a limitation, hence not a tick symbol.
But I guess it could be interpreted differently like you said.
I wonder if documentation would help to create an carefully and intentionally tokenized overview of the system. Maximize the amount of routine larger scope information provided in minimal tokens in order to leave room for more immediate context.
Similar to the function documentation provides to developers today, I suppose.
It does, shockingly well in my experience. Check out this blog post outlining such an approach, called Literate Development by the author: https://news.ycombinator.com/item?id=43524673
bahahaha spoken like someone who spends $100 to do the task a single semi decent software developer (yourself) should be able to do for... $0
It's a matter of time.
The promise of AI is I can spend 100$ to get 40 hours or so of work done.
It's not the point of the announcement, but I do like the use of the (abs) subscript to demonstrate the improvement in LLM performance since in these types of benchmark descriptions I never can tell if the percentage increase is absolute or relative.
> They feature a refreshed knowledge cutoff of June 2024.
As opposed to Gemini 2.5 Pro having cutoff of Jan 2025.
Honestly this feels underwhelming and surprising. Especially if you're coding with frameworks with breaking changes, this can hurt you.
It's definitely an issue. Even the simplest use case of "create React app with Vite and Tailwind" is broken with these models right now because they're not up to date.
Time to start moving back to Java & Spring.
100% backwards compatibility and well represented in 15 years worth of training data, hah.
Write once, run nowhere.
LOOOOL you have my upvote
(I did use Spring, once, ages ago, and we deployed the app to a local Tomcat server in the office...)
Maybe LLMs will be the forcing function to finally slow down the crazy pace of changing (and breaking) things in JavaScript land.
Whenever an LLM struggles with a particular library version, I use Cursor Rules to auto-include migration information and that generally worked well enough in my cases.
A few weeks back I couldn't even get ChatGPT to output TypeScript code that correctly used the OpenAI SDK.
You should give it documentation is can't guess.
By "broken" you mean it doesn't use the latest and greatest hot trend, right? Or does it literally not work?
Periodically I keep trying these coding models in Copilot and I have yet to have an experience where it produced working code with a pretty straightforward TypeScript codebase. Specifically, it cannot for the life of it produce working Drizzle code. It will hallucinate methods that don't exist despite throwing bright red type errors. Does it even check for TS errors?
Not sure about Copilot, but the Cursor agent runs both eslint and tsc by default and fixes the errors automatically. You can tell it to run tests too, and whatever other tools. I've had a good experience writing drizzle schemas with it.
It has been really frustrating learning Godot (or any new technology you are not familiar with) 4.4.x with GPT4o or even worse, with custom GPT which use older GPT4turbo.
As you are new in the field, it kinda doesn't make sense to pick an older version. It would be better if there was no data than incorrect data. You literally have to include the version number on every prompt and even that doesn't guarantee a right result! Sometimes I have to play truth or dare three times before we finally find the right names and instructions. Yes I have the version info on all custom information dialogs, but it is not as effective as including it in the prompt itself.
Searching the web feels like an on-going "I'm feeling lucky" mode. Anyway, I still happen to get some real insights from GPT4o, even though Gemini 2.5 Pro has proven far superior for larger and more difficult contexts / problems.
The best storytelling ideas have come from GPT 4.5. Looking forward to testing this new 4.1 as well.
hey- curious what your experience has been like learning godot w/ LLM tooling.
are you doing 3d? The 3D tutorial ecosystem is very GUI heavy and I have had major problems trying to get godot to do anything 3D
I'm afraid I'm only doing 2d ... Yes, GUI related LLM instructions have been exceptionally bad, with multiple prompts me saying "no there is no such thing"... But as I commented earlier, GPT has had it's moments.
I strongly recommend giving Gemini 2.5 Pro a shot. Personally I don't like their bloated UI, but you can set the temperature value, which is especially helpful when you are more certain what and how you want, then just lower that value. If you want to get some wilder ideas, turn it up. Also highly recommend reading the thought process it does! That was actually key in having very complex ideas working. Just spotting couple of lines there, that seem too vague or even just a little bit inaccurate ... then pasting them back, with your own comments, have helped me a ton.
Is there a specific part in which you struggle? And FWIW, I've been on a heavy learning spree for 2 weeks. I feel like I'm starting to see glimbses from the barrel's bottom ... it's not so deep, you just gotta hang in there and bombard different LLMs with different questions, different angles, stripping away most and trying the simplest variation, for both prompt and godot. Or sometimes by asking more general advice "what is current godot best practice in doing x".
And YouTube has also been helpful source, by listening how more experienced users make their stuff. You can mostly skim through the videos with doublespeed and just focus on how they are doing the basics. Best of luck!
Try getting then to output Svelte 5 code...
Svelte 5 is the antidote to vibe coding.
usually enabling "Search" fixes it sometimes as they fetch the newer methods.
It it annoying. The bigger cheaper context windows help this a little though:
E.g.: If context windows get big and cheap enough (as things are trending), hopefully you can just dump the entire docs, examples, and more in every request.
sometimes it feels like openai keeps serving the same base dish—just adding new toppings. sure, the menu keeps changing, but it all kinda tastes the same. now the menu is getting too big.
nice to see that we aren't stuck in october of 2023 anymore!
The real news for me is GPT 4.5 being deprecated and the creativity is being brought to "future models" and not 4.1. 4.5 was okay in many ways but it was absolutely a genius in production for creative writing. 4o writes like a skilled human, but 4.5 can actually write a 10 minute scene that gives me goosebumps. I think it's the context window that allows for it to actually build up scenes to hammer it down much later.
Cool to hear that you got something out of it, but for most users 4.5 might have just felt less capable on their solution-oriented questions. I guess this why they are deprecating it.
It is just such a big failure of OpenAI not to include smart routing on each question and hide the complexity of choosing a model from users.
Most of the improvements in this model, basically everything except the longer context, image understanding and better pricing, are basically things that reinforcement learning (without human feedback) should be good at.
Getting better at code is something you can verify automatically, same for diff formats and custom response formats. Instruction following is also either automatically verifiable, or can be verified via LLM as a judge.
I strongly suspect that this model is a GPT-4.5 (or GPT-5???) distill, with the traditional pretrain -> SFT -> RLHF pipeline augmented with an RLVR stage, as described in Lambert et al[1], and a bunch of boring technical infrastructure improvements sprinkled on top.
[1] https://arxiv.org/abs/2411.15124
If so, the loss of fidelity versus 4.5 is really noticeable and a loss for numerous applications. (Finding a vegan restaurant in a random city neighborhood, for example.)
In your example the LLM should not be responsible for that directly. It should be calling out to an API or search results to get accurate and up-to-date information (relatively speaking) and then use that context to generate a response
You should actually try it. The really big models (4 and 4.5, sadly not 4o) have truly breathtaking ability to dig up hidden gems that have a really low profile on the internet. The recommendations also seem to cut through all the SEO and review manipulation and deliver quality recommendations. It really all can be in one massive model.
ChatGPT currently recommends I use o3-mini-high ("great at coding and logic") when I start a code conversation with 4o.
I don't understand why the comparison in the announcement talks so much about comparing with 4o's coding abilities to 4.1. Wouldn't the relevant comparison be to o3-mini-high?
4.1 costs a lot more than o3-mini-high, so this seems like a pertinent thing for them to have addressed here. Maybe I am misunderstanding the relationship between the models?
4.1 is a pinned API variant with the improvements from the newer iterations of 4o you're already using in the app, so that's why the comparison focuses between those two.
Pricing wise the per token cost of o3-mini is less than 4.1 but keep in mind o3-mini is a reasoning model and you will pay for those tokens too, not just the final output tokens. Also be aware reasoning models can take a long time to return a response... which isn't great if you're trying to use an API for interactive coding.
> I don't understand why the comparison in the announcement talks so much about comparing with 4o's coding abilities to 4.1. Wouldn't the relevant comparison be to o3-mini-high?
There are tons of comparisons to o3-mini-high in the linked article.
Sam Altman wrote in February that GPT-4.5 would be "our last non-chain-of-thought model" [1], but GPT-4.1 also does not have internal chain-of-thought [2].
It seems like OpenAI keeps changing its plans. Deprecating GPT-4.5 less than 2 months after introducing it also seems unlikely to be the original plan. Changing plans is necessarily a bad thing, but I wonder why.
Did they not expect this model to turn out as well as it did?
[1] https://x.com/sama/status/1889755723078443244
[2] https://github.com/openai/openai-cookbook/blob/6a47d53c967a0...
Anyone making claims with a horizon beyond two months about structure or capabilities will be wrong - it's sama's job to show confidence and vision and calm stakeholders, but if you're paying attention to the field, the release and research cycles are still contracting, with no sense of slowing any time soon. I've followed AI research daily since GPT-2, the momentum is incredible, and even if the industry sticks with transformers, there are years left of low hanging fruit and incremental improvements before things start slowing.
There doesn't appear to be anything that these AI models cannot do, in principle, given sufficient data and compute. They've figured out multimodality and complex integration, self play for arbitrary domains, and lots of high-cost longer term paradigms that will push capabilities forwards for at least 2 decades in conjunction with Moore's law.
Things are going to continue getting better, faster, and weirder. If someone is making confident predictions beyond those claims, it's probably their job.
Maybe that's true for absolute arm-chair-engineering outsiders (like me) but these models are in training for months, training data is probably being prepared year(s) in advance. These models have a knowledge cut-off in 2024 - so they have been in training for a while. There's no way sama did not have a good idea that this non-COT was in the pipeline 2 months ago. It was probably finished training then and undergoing evals.
Maybe
1. he's just doing his job and hyping OpenAI's competitive advantages (afair most of the competition didn't have decent COT models in Feb), or
2. something changed and they're releasing models now that they didn't intend to release 2 months ago (maybe because a model they did intend to release is not ready and won't be for a while), or
3. COT is not really as advantageous as it was deemed to be 2+ months ago and/or computationally too expensive.
With new hardware from Nvidia announced coming out, those months turn into weeks.
I doubt it's going to be weeks, the months were already turning into years despite Nvidia's previous advances.
(Not to say that it takes openai years to train a new model, just that the timeline between major GPT releases seems to double... be it for data gathering, training, taking breaks between training generations, ... - either way, model training seems to get harder not easier).
GPT Model | Release Date | Months Passed Between Former Model
GPT-1 | 11.06.2018
GPT-2 | 14.02.2019 | 8.16
GPT-3 | 28.05.2020 | 15.43
GPT-4 | 14.03.2023 | 33.55
[1]https://www.lesswrong.com/posts/BWMKzBunEhMGfpEgo/when-will-...
The capabilities and general utility of the models are increasing on an entirely different trajectory than model names - the information you posted is 99% dependent on internal OAI processes and market activities as opposed to anything to do with AI.
I'm talking more broadly, as well, including consideration of audio, video, and image modalities, general robotics models, and the momentum behind applying some of these architectures to novel domains. Protocols like MCP and automation tooling are rapidly improving, with media production and IT work rapidly being automated wherever possible. When you throw in the chemistry and materials science advances, protein modeling, etc - we have enormously powerful AI with insufficient compute and expertise to apply it to everything we might want to. We have research being done on alternate architectures, and optimization being done on transformers that are rapidly reducing the cost/performance ratio. There are models that you can run on phones that would have been considered AGI 10 years ago, and there doesn't seem to be any fundamental principle decreasing the rate of improvement yet. If alternate architectures like RWKV get funded, there might be several orders of magnitude improvement with relatively little disruption to production model behaviors, but other architectures like text diffusion could obsolete a lot of the ecosystem being built up around LLMs right now.
There are a million little considerations pumping transformer LLMs right now because they work and there's every reason to expect them to continue improving in performance and value for at least a decade. There aren't enough researchers and there's not enough compute to saturate the industry.
Fair point, I guess my question is how long it would take them to train GPT-2 on the absolute bleedingest generation of Nvidia chips vs what they had in 2019, with the budget they have to blow on Nvidia supercomputers today.
the release and research cycles are still contracting
Not necessarily progress or benchmarks that as a broader picture you would look at (MMLU etc)
GPT-3 was an amazing step up from GPT-2, something scientists in the field really thought was 10-15 years out at least done in 2, instruct/RHLF for GPTs was a similar massive splash, making the second half of 2021 equally amazing.
However nothing since has really been that left field or unpredictable from then, and it's been almost 3 years since RHLF hit the field. We knew good image understanding as input, longer context, and improved prompting would improve results. The releases are common, but the progress feels like it has stalled for me.
What really has changed since Davinci-instruct or ChatGPT to you? When making an AI-using product, do you construct it differently? Are agents presently more than APIs talking to databases with private fields?
In some dimensions I recognize the slow down in how fast new capabilities develop, but the speed still feels very high:
Image generation suddenly went from gimmick to useful now that prompt adherence is so much better (eagerly waiting for that to be in the API)
Coding performance continues to improve noticeably (for me). Claude 3.7 felt like a big step from 4o/3.5. Gemini 2.5 in a similar way.compared to just 6 months ago I can give bigger and more complex pieces of work to it and get relatively good output back. (Net acceleration)
Audio-2-audio seems like it will be a big step as well. I think this has much more potential than the STT-LLM-TTS architecture commonly used today (latency, quality)
I see a huge progress made since the first gpt-4 release. The reliability of answers has improved an order of magnitude. Two years ago, more than half of my questions resulted in incorrect or partially correct answers (most of my queries are about complicated software algorithms or phd level research brainstorming). A simple “are you sure” prompt would force the model to admit it was wrong most of the time. Now with o1 this almost never happens and the model seems to be smarter or at least more capable than me - in general. GPT-4 was a bright high school student. o1 is a postdoc.
Excuse the pedantry; for those reading, it’s RLHF rather than RHLF.
> Things are going to continue getting better, faster, and weirder.
I love this. Especially the weirder part. This tech can be useful in every crevice of society and we still have no idea what new creative use cases there are.
Who would’ve guessed phones and social media would cause mass protests because bystanders could record and distribute videos of the police?
> Who would’ve guessed phones and social media would cause mass protests because bystanders could record and distribute videos of the police?
That would have been quite far down on my list of "major (unexpected) consequences of phones and social media"...
Yep, it’s literally just a slightly higher tech version of (for example) the 1992 Los Angeles riots over Rodney King but with phones and Facebook instead of handheld camcorders and television.
Maybe that's why they named this model 4.1, despite coming out after 4.5 and supposedly outperforming it. They can pretend GPT-4.5 is the last non-chain-of-thought model by just giving all non-chain-of-thought-models version numbers below 4.5
Ok, I know naming things is hard, but 4.1 comes out after 4.5? Just, wat.
For a long time, you could fool models with questions like "Which is greater, 4.10 or 4.5?" Maybe they're still struggling with that at OpenAI.
At this point, I'm just assuming most AI models — not just OpenAI's — name themselves. And that they write their own press releases.
Why do you expect to believe a single word Sam Altman says?
Everyone assumed malice when the board fired him for not always being "candid" - but it seems more and more that he's just clueless. He's definitely capable when it comes to raising money as a business, but I wouldn't count on any tech opinion from him.
I think that people balked at the cost of 4.5 and really wanted just a slightly more improved 4o . Now it almost seems that they will have a separate products that are non chain of thought and chain of thought series which actually makes sense because some want a cheap model and some don't.
> Deprecating GPT-4.5 less than 2 months after introducing it also seems unlikely to be the original plan.
Well they actually hinted already of possible depreciation in their initial announcement of gpt4.5 [0]. Also, as others said, this model was already offered in the api as chatgpt-latest, but there was no checkpoint which made it unreliable for actual use.
[0] https://openai.com/index/introducing-gpt-4-5/#:~:text=we%E2%...
Perhaps it is a distilled 4.5, or based on it's lineage, as some suggested.
When I saw them say 'no more non COT models', I was minorly panicked.
While their competitors have made fantastic models, at the time I perceived ChatGPT4 was the best model for many applications. COT was often tricked by my prompts, assuming things to be true, when a non-COT model would say something like 'That isnt necessarily the case'.
I use both COT and non when I have an important problem.
Seeing them keep a non-COT model around is a good idea.
Sam made a strange statement imo in a recent Ted Talk. He said (something like) models come and go but they want to be the best platform.
For me, it was jaw dropping. Perhaps he didn't mean it the way it sounded, but seemed like a major shift to me.
Before everyone caught up:
After everyone else caught up:OpenAI has been a product company ever since ChatGPT launched.
Their value is firmly rooted in how they wrap ux around models.
Looks like the Quasar and Optimus stealth models on Openrouter were in fact GPT-4.1. This is what I get when I try to access the openrouter/optimus-alpha model now:
As a user I'm getting so confused as to what's the "best" for various categories. I don't have time/want to dig into benchmarks for different categories, look into the example data to see which best maps onto my current problems.
The graphs presented don't even show a clear winner across all categories. The one with the biggest "number", GPT-4.5, isn't even in the best in most categories, actually it's like 3rd in a lot of them.
This is quite confusing as a user.
Otherwise big fan of OAI products thus far. I keep paying $20/mo, they keep improving across the board.
I think "best" is slightly subjective / user. But I understand your gripe. I think the only way is using them iteratively, settling on the one that best fits you / your use-case, whilst reading other peoples' experiences and getting a general vibe
No benchmark comparisons to other models, especially Gemini 2.5 Pro, is telling.
Gemini 2.5 Pro gets 64% on SWE-bench verified. Sonnet 3.7 gets 70%
They are reporting that GPT-4.1 gets 55%.
Very interesting. For my use cases, Gemini's responses beat Sonnet 3.7's like 80% of the time (gut feeling, didn't collect actual data). It beats Sonnet 100% of the time when the context gets above 120k.
As usual with LLMs. In my experience, all those metrics are useful mainly to tell which models are definitely bad, but doesn't tell you much about which ones are good, and especially not how the good ones stack against each other in real world use cases.
Andrej Karpathy famously quipped that he only trusts two LLM evals: Chatbot Arena (which has humans blindly compare and score responses), and the r/LocalLLaMA comment section.
Lmarena isn't that useful anymore lol
I actually agree with that, but it's generally better than other scores. Also, the quote is like a year old at this point.
In practice you have to evaluate the models yourself for any non-trivial task.
Are those with «thinking» or without?
Sonnet 3.7's 70% is without thinking, see https://www.anthropic.com/news/claude-3-7-sonnet
The thinking tokens (even just 1024) make a massive difference in real world tasks with 3.7 in my experience
based on their release cadence, I suspect that o4-mini will compete on price, performance, and context length with the rest of these models.
o4-mini, not to be confused with 4o-mini
With
Go look at their past blog posts. OpenAI only ever benchmarks against their own models.
This is pretty common across industries. The leader doesn’t compare themselves to the competition.
Okay, it's common across other industries, but not this one. Here is Google, Facebook, and Anthropic comparing their frontier models to others[1][2][3].
[1] https://blog.google/technology/google-deepmind/gemini-model-...
[2] https://ai.meta.com/blog/llama-4-multimodal-intelligence/
[3] https://www.anthropic.com/claude/sonnet
Right. Those labs aren’t leading the industry.
Confusing take - Gemini 2.5 is probably the best general purpose coding model right now, and before that it was Sonnet 3.5. (Maybe 3.7 if you can get it to be less reward-hacky.) OpenAI hasn't had the best coding model for... coming up on a year, now? (o1-pro probably "outperformed" Sonnet 3.5 but you'd be waiting 10 minutes for a response, so.)
Leader is debatable, especially given the actual comparisons...
That would make sense if OAI were the leader.
Except they are far from the lead in model performance
Who has a (publicly released) model that is SOTA is constantly changing. It’s more interesting to see who is driving the innovation in the field, and right now that is pretty clearly OpenAI (GPT-3, first multi-modal model, first reasoning model, ect).
There is no uniform tactic for this type of marketing. They will compare against whomever they need to to suit their marketing goals.
also sometimes if you get it wrong you catch unnecessary flak
• Flagship GPT-4.1: top‑tier intelligence, full endpoints & premium features
• GPT-4.1-mini: balances performance, speed & cost
• GPT-4.1-nano: prioritizes throughput & low cost with streamlined capabilities
All share a 1 million‑token context window (vs 120–200k on 4o-o3/o1), excelling in instruction following, tool calls & coding.
Benchmarks vs prior models:
• AIME ’24: 48.1% vs 13.1% (~3.7× gain)
• MMLU: 90.2% vs 85.7% (+4.5 pp)
• Video‑MME: 72.0% vs 65.3% (+6.7 pp)
• SWE‑bench Verified: 54.6% vs 33.2% (+21.4 pp)
The deprecation of GPT-4.5 makes me sad. It's an amazing model with great world-knowledge and subtly. It KNOWS THINGS that, on a quick experiment, 4.1 just does not. 4.5 could tell me what I would see from a random street corner in New Jersey, or how to use minor features of my niche API (well, almost), and it could write remarkably. But 4.1 doesn't hold a candle to it. Please, continue to charge me $150/1M tokens. Sometimes you need a Big Model. Tells me it was costing more than $150/1M to serve (!).
Is just this something I haven't noticed before? Or is this new?
Not new, launched in December 2024. https://community.openai.com/t/free-tokens-on-traffic-shared...
So, that's like $10/day to give all your data/prompts?
IIRC 4.5 was 75$/1M input and 150$/1M output.
O1 is 15$ in 60$ out.
So you could easily get 75+$ per day free from this.
Very important note:
>Note that GPT‑4.1 will only be available via the API. In ChatGPT, many of the improvements in instruction following, coding, and intelligence have been gradually incorporated into the latest version
If anyone here doesn't know, OpenAI does offer the ChatGPT model version in the API as chatgpt-4o-latest, but it's bad because they continuously update it so businesses can't reliably rely on it being stable, that's why OpenAI made GPT 4.1.
> chatgpt-4o-latest, but it's bad because they continuously update it
Version explicitly marked as "latest" being continuously updated it? Crazy.
No one's arguing that it's improperly labelled, but if you're going to use it via API, you might want consistency over bleeding edge.
Lots of the other models are checkpoint releases, and latest is a pointer to the latest checkpoint. Something being continuously updated is quite different and worth knowing about.
It can be both properly communicated and still bad for API use cases.
OpenAI (and most LLM providers) allow model version pinning for exactly this reason, e.g. in the case of GPT-4o you can specify gpt-4o-2024-05-13, gpt-4o-2024-08-06, or gpt-4o-2024-11-20.
https://platform.openai.com/docs/models/gpt-4o
Yes, and they don't make snapshots for chatgpt-4o-latest, but they made them for GPT 4.1, that's why 4.1 is only useful for API, since their ChatGPT product already has the better model.
Okay so is GPT 4.1 literally just the current chatpt-4o-latest or not?
I feel like it is. But that's just the vibe.
It isn't.
Yeah, in the last week, I had seen a strong benchmark for chatgpt-4o-latest and tried it for a client's use case. I ended up wasting like 4 days, because after my initial strong test results, in the following days, it gave results that were inconsistent and poor, and sometimes just outputting spaces.
So you're saying that "ChatGPT-4o-latest (2025-03-26)" in LMarena is 4.1?
No, that is saying that some of the improvements that went into 4.1 have also gone into ChatGPT, including chatgpt-4o-latest (2025-03-26).
yeah I was surprised in they benchmarks during livestream they didn't compare to ChatGPT-4o (2025-03-26) but only older one.
Easy to miss in the announcement that 4.5 is being shut down
> GPT‑4.5 Preview will be turned off in three months, on July 14, 2025
Juice not worth the squeeze I imagine. 4.5 is chonky, and having to reserve GPU space for it must not have been worth it. Makes sense to me - I hadn't founding anything it was so much better at that it was worth the incremental cost over Sonnet 3.7 or o3-mini.
Marginally on-topic: I'd love if the charts included prior models, including GPT 4 and 3.5.
Not all systems upgrade every few months. A major question is when we reach step-improvements in performance warranting a re-eval, redesign of prompts, etc.
There's a small bleeding edge, and a much larger number of followers.
With these being 1M context size, does that all but confirm that Quasar Alpha and Optimus Alpha were cloaked OpenAI models on OpenRouter?
Yes, OpenRouter confirmed it here - https://x.com/OpenRouterAI/status/1911833662464864452
I think Quasar is fairly confirmed [0] to be OpenAI.
[0] https://x.com/OpenAI/status/1911782243640754634
Yes, confirmed by citing Aider benchmarks: https://openai.com/index/gpt-4-1/
Which means that these models are _absolutely_ not SOTA, and Gemini 2.5 pro is much better, and Sonnet is better, and even R1 is better.
Sorry Sam, you are losing the game.
Aren’t all of these reasoning models?
Won’t the reasoning models of openAI benchmarked against these be a test of if Sam is losing?
Sonnet 3.7 non-reasoning is better on its own. In fact even Sonnet 3.5-v2 is, and that was released 6 months ago. Now to be fair, they're close enough that there will be usecases - especially non-coding - where 4.1 beats it consistently. Also, 4.1 is quite a lot cheaper and faster. Still, OpenAI is clearly behind.
Even without reasoning, isn't Deepseek V3 from March better?
There is no OpenAI model better than R1, reasoning or not (as confirmed by the same Aider benchmark; non-coding tests are less objective, but I think it still holds).
With Gemini (current SOTA) and Sonnet (great potential, but tends to overengineer/overdo things) it is debatable, they are probably better than R1 (and all OpenAI models by extension).
Did some quick tests. I believe its the same model as Quasar. It struggles with agentic loop [1]. You'd have to force it to do tool calls.
Tool use ability feels ability better than gemini-2.5-pro-exp [2] which struggles with JSON schema understanding sometimes.
Llama 4 has suprising agentic capabilities, better than both of them [3] but isn't as intelligent as the others.
[1] https://github.com/rusiaaman/chat.md/blob/main/samples/4.1/t...
[2] https://github.com/rusiaaman/chat.md/blob/main/samples/gemin...
[3] https://github.com/rusiaaman/chat.md/blob/main/samples/llama...
Correct. They've mentioned the name during the live announcement - https://www.youtube.com/live/kA-P9ood-cE?si=GYosi4FtX1YSAujE...
Excited to see 4.1 in the API. The Nano model pricing is comparable to Gemini Flash but not where we would like it to be: https://composableai.de/openai-veroeffentlicht-4-1-nano-als-...
The user shoudn't have to research which model is the best for them. OpenAI needs to do a better job in UX and putting the best model forward in chatgpt.
Is there an API endpoint at OpenAI that gives the information on this page as structured data?
https://platform.openai.com/docs/models/gpt-4.1
As far as I can tell there's no way to discover the details of a model via the API right now.
Given the announced adoption of MCP and MCP's ability to perform model selection for Sampling based on a ranking for speed and intelligence, it would be great to have a model discovery endpoint that came with all the details on that page.
Company worth hundreds of billions of dollars, on paper at least, has one of the worst naming schemes for their products in the recent history.
Sam acknowledged this a few months ago, but with another release not really bringing any clarity, this is getting ridiculous now.
I like how Nano matches Gemini 2.0 Flash's price. That will help drive down prices which will be good for my app. However I don't like how Nano behaves worse than 4o Mini in some benchmarks. Maybe it will be good enough, we'll see.
> That will help drive down prices which will be good for my app
Why not use Gemini?
yeah and considering that gemini 2.0 flash is much better than 4o-mini. On top of that gemini have also audio input as modality and realtime API for both audio input and output + web search grounding + free tier.
Theory here is that 4.1-nano is competing with that tier, 4.1 with flash-thinking (although likely to do significantly worse), and o4-mini or o3-large will compete with 2.5 thinking
For conversational AI, the most significant part is GPT-4.1 mini being 2x faster than GPT-4o at basically the same reasoning capabilities.
pretty wild versioning that GPT 4.1 is newer and better in many regards than GPT 4.5.
it's worse on nearly every benchmark
OpenAI themselves said
> One last note: we’ll also begin deprecating GPT-4.5 Preview in the API today as GPT-4.1 offers improved or similar performance on many key capabilities at lower latency and cost. GPT-4.5 in the API will be turned off in three months, on July 14, to allow time to transition (and GPT 4.5 will continue to be available in ChatGPT).
https://x.com/OpenAIDevs/status/1911860805810716929
no? it's better on AIME '24, Multilingual MMLU, SWE-bench, Aider’s polyglot, MMMU, ComplexFuncBench
and it ties on a lot of benchmarks
look at all the graphs in the article
the data i posted all came from the graphs/charts in the article
I think they're doing it deliberately at this point
Tomorrow they are releasing the open source GPT-1.4 model :P
I'm wondering if one of the big reasons that OpenAI is making gpt-4.5 deprecated is not only because it's not cost-effective to host but because they don't want their parent model being used to train competitors' models (like deepseek).
I'm not really bullish on OpenAI. Why would they only compare with their own models? The only explanation could be that they aren't as competitive with other labs as they were before.
See figure 1 for up-to-date benchmarks https://github.com/KCORES/kcores-llm-arena
(Direct Link) https://raw.githubusercontent.com/KCORES/kcores-llm-arena/re...
Apple compares against its own products most of the times.
I don't mind what they benchmark against as long as, when I use the model, it continues to give me better results than their competition.
Go look at their past blog posts. OpenAI only ever benchmarks against their own models.
Oh, ok. But it's still quite telling of their attitude as an organization.
It's the same organization that kept repeating that sharing weights of GPT would be "too dangerous for the world". Eventually DeepSeek thankfully did something like that, though they are supposed to be the evil guys.
They continue to baffle users with their version numbering. Intiutively 4.5 is newer/better than 4.1 and perhaps 4o but of course this is not the case.
By leaving out scale or prior models they are effectively manipulating improvement. If from 3 to 4 it was from 10 to 80, and from 4 to 4o it was 80 to 82, leaving out 3 would let us see a steep line instead of steep decrease of growth.
Lies, damn lies and statistics ;-)
The benchmarks and charts they have up are frustrating because they don’t include 03-mini(-high) which they’ve been pushing as the low-latency+low-cost smart model to use for coding challenges instead of 4o and 4o-mini. Why won’t they include that in the charts?
I've been using it in Cursor for the past few hours and prefer it to Sonnet 3.7. It's much faster and doesn't seem to make the sort of stupid mistakes Sonnet has been making recently.
If reasoning models are any good, then can they figure out overpowered builds for poe2?
Wait, wouldn’t this be a decent test for reasoning ?
Every patch changes things, and there’s massive complexity with the various interactions between items, uniques, runes, and more.
Once they can do this we are probably at AGI
And I can get a one button build at league start
I feel there's some "benchmark-hacking" is going on with GPT4.1 model as its metrics on livebench.com aren't all that exciting.
- It's basically GPT4o level on average.
- More optimized for coding, but slightly inferior in other areas.
It seems to be a better model than 4o for coding tasks, but I'm not sure if it will replace the current leaders -- Gemini 2.5 Pro, o3-mini / o1, Claude 3.7/3.5.
Is the version number a retcon of 4.5? On OpenAI's models page the names appear completely reasonable [1]: The o1 and o3 reasoning models, and non-reasoning there is 3.5, 4, 4o and 4.1 (let's pretend 4o makes sense). But that is only reasonable as long as we pretend 4.5 never happened, which the models page apparently does
1: https://platform.openai.com/docs/models
They tried something and it didn't work well. Branching paths of experimentation is not compatible with number-goes-up versioning.
The increased context length is interesting.
It would be incredible to be able to feed an entire codebase into a model and say "add this feature" or "we're having a bug where X is happening, tell me why", but then you are limited by the output token length
As others have pointed out too, the more tokens you use, the less accuracy you get and the more it gets confused, I've noticed this too
We are a ways away yet from being able to input an entire codebase, and have it give you back an updated version of that codebase.
And it is available at https://t3.chat/ (as well as claude, grok, gemini etc) for 8usd/month
> These Terms and your use of T3 Chat will be governed by and construed in accordance with the laws of the jurisdiction where T3 Tools Inc. is incorporated, without regard to its conflict of law provisions. Any disputes arising out of or in connection with these Terms will be resolved exclusively in the courts located in that jurisdiction, unless otherwise required by applicable law.
Would be nice if there was at least some hint as to where T3 Tools Inc. is located and what jurisdiction applies.
I'm using models which scored at least 50% in Aider leaderboard but I'm micromanaging 50 line changes instead of being more vibe. Is it worth experimenting with a model that didnt crack 10%?
I just wish they would start using human friendly names for them, and use a YY.rev version number so it's easier to know how new/old something is.
Broad Knowledge 25.1 Coder: Larger Problems 25.1 Coder: Line focused 25.1
Lots of improvements here (hopefully), but still no image generation updates, which is what I'm most eager for right now.
Or text to speech generation ... but I guess that is coming.
Yeah, I tried the 4o models and they severely mispronounced common words and read numbers incorrectly (eg reading 16000 as 1600)
They just realised a new image generation a couple of weeks ago, why are you eager for another one so soon?
Are the image generation improvements available via API? Don't think so
While impressive that the assistants can use dynamic tools and reason about images, I'm most excited about the improvements to factual accuracy and instruction following. The RAG capabilities with cross-validation seem particularly useful for serious development work, not just toy demos.
Big focus on coding. It feels like a defensive move against Claude (and more recently, Gemini Pro) which became very popular in that regime. I guess they recently figured out some ways to train the model for these "agentic" coding through RL or something - and the finding is too new to apply 4.5 on time.
More information here:
My theory: they need to move off the 4o version number before releasing o4-mini next week or so.
The 'oN' schema was a such strange choice for branding. They had to skip 'o2' because it's already trademarked, and now 'o4' can easily be confused with '4o'.
I tried 4.1-mini and 4.1-nano. The response are a lot faster, but for my use-case they seem to be a lot worse than 4o-mini(they fail to complete the task when 4o-mini could do it). Maybe I have to update my prompts...
Even after updating my prompts, 4o-mini still seems to do better than 4.1-mini or 4.1-nano for a data-processing task.
Mind sharing your system prompt?
It's quite complex, but the task is to parse some HTML content, or to choose from a list of URLs which one is the best.
I will check again the prompt, maybe 4o-mini ignores some instructions that 4.1 doesn't (instructions which might result in the LLM returning zero data).
That sounds incredibly disappointing given how high their benchmarks are, indicating they might be overtuned for those, similar to Llama4.
Yeah, I think so too. They seemed to be better at specific tasks, but worse overall, at broader tasks.
Hey OpenAI if you ever need a Version Engineer, I’m available.
4.10 > 4.5 — @stevenheidel
@sama: underrated tweet
Source: https://x.com/stevenheidel/status/1911833398588719274
Too bad OpenAI named it 4.1 instead of 4.10. You can either claim 4.10 > 4.5 (the dots separate natural numbers) or 4.1 == 4.10 (they are decimal numbers), but you can't have both at once
so true
> Note that GPT‑4.1 will only be available via the API. In ChatGPT, many of the improvements in instruction following, coding, and intelligence have been gradually incorporated into the latest version (opens in a new window) of GPT‑4o, and we will continue to incorporate more with future releases.
The lack of availability in ChatGPT is disappointing, and they're playing on ambiguity here. They are framing this as if it were unnecessary to release 4.1 on ChatGPT, since 4o is apparently great, while simultaneously showing how much better 4.1 is relative to GPT-4o.
One wager is that the inference cost is significantly higher for 4.1 than for 4o, and that they expect most ChatGPT users not to notice a marginal difference in output quality. API users, however, will notice. Alternatively, 4o might have been aggressively tuned to be conversational while 4.1 is more "neutral"? I wonder.
There's a HUGE difference that you are not mentioning: there are "gpt-4o" and "chatgpt-4o-latest" on the API. The former is the stable version (there are a few snapshot but the newest snapshot has been there for a while), and the latter is the fine-tuned version that they often update on ChatGPT. All those benchmarks were done for the API stable version of GPT-4o, since that's what businesses rely on, not on "chatgpt-4o-latest".
Good point, but how does that relate to, or explain, the decision not to release 4.1 in ChatGPT? If they have a nice post-training pipeline to make 4o "nicer" to talk to, why not use it to fine-tune the base 4.1 into e.g. chatgpt-4.1-latest?
Because chatgpt-4o-latest already has all of those improvements, the largest point of this release (IMO) is to offer developers a stable snapshot of something that compares to modern 4o latest. Altman said that they'd offer a stable snapshot of chatgpt 4o latest on the API, he perhaps did really mean GPT 4.1.
> Because chatgpt-4o-latest already has all of those improvements
Does it, though? They said that "many" have already been incorporated. I simply don't buy their vague statements there. These are different models. They may share some training/post-training recipe improvements, but they are still different.
I disagree. From the average user perspective, it's quite confusing to see half a dozen models to choose from in the UI. In an ideal world, ChatGPT would just abstract away the decision. So I don't need to be an expert in the relatively minor differences between each model to have a good experience.
Vs in the API, I want to have very strict versioning of the models I'm using. And so letting me run by own evals and pick the model that works best.
> it's quite confusing to see half a dozen models to choose from in the UI. In an ideal world, ChatGPT would just abstract away the decision
Supposedly that’s coming with GPT 5.
I agree on both naming on stability. However, this wasn't my point.
They still have a mess of models in ChatGPT for now, and it doesn't look like this is going to get better immediately (even though for GPT-5, they ostensibly want to unify them). You have to choose among all of them anyway.
I'd like to be able to choose 4.1.
“GPT‑4.1 scores 54.6% on SWE-bench Verified, improving by 21.4%abs over GPT‑4o and 26.6%abs over GPT‑4.5—making it a leading model for coding.”
4.1 is 26.6% better at coding than 4.5. Got it. Also…see the em dash
What's wrong with the em-dash? That's just...the typographically correct dash AFAIK.
Maybe a reference to the OpenAI models loving to output em-dashes?
Should have named it 4.10
But it’s so much weaker than 4.5 in broader tasks… maybe more optimized against benchmarks but it’s just no replacement for a huge model.
> We will also begin deprecating GPT‑4.5 Preview in the API, as GPT‑4.1 offers improved or similar performance on many key capabilities at much lower cost and latency.
why would they deprecate when it's the better model? too expensive?
> why would they deprecate when it's the better model? too expensive?
Too expensive, but not for them - for their customers. The only reason they’d deprecated it is if it wasn’t seeing usage worth keeping it up and that probably stems from it being insanely more expensive and slower than everything else.
Where did you find that 4.5 is a better model? Everything from the video told me that 4.5 was largely a mistake and 4.1 beats 4.5 at everything. There's no point keeping 4.5 at this point.
Bigger numbers are supposed to mean better. 3.5, 4, 4.5. Going from 4 to 4.5 to 4.1 seems weird to most people. If it's better, it should of been GPT-4.6 or 5.0 or something else, not a downgraded number.
OpenAI has decided to troll via crappy naming conventions as a sort of in joke. Sam Altman tweets about it pretty often
sits on too many GPUs, they mentioned it during the stream
I'm guessing the (API) demand isn't there to saturate them fully
GPT-4.1 Pricing (per 1M tokens):
gpt-4.1
- Input: $2.00
- Cached Input: $0.50
- Output: $8.00
gpt-4.1-mini
- Input: $0.40
- Cached Input: $0.10
- Output: $1.60
gpt-4.1-nano
- Input: $0.10
- Cached Input: $0.025
- Output: $0.40
Awesome, thank you for posting. As someone who regularly uses 4o mini from the API, any guesses or intuitions about the performance of Nano?
I'm not as concerned about nomenclature as other people, which I think is too often reacting to a headline as opposed to the article. But in this case, I'm not sure if I'm supposed to understand nano as categorically different than many in terms of what it means as a variation from a core model.
they share in livestream that 4.1-nano is worse than 4o-mini - so nano is cheaper, faster and have bigger context but worse in intelligence. 4.1mini is smarter but there is price increase.
The fact that they're raising the price for the mini models by 166% is pretty notable.
gpt-4o-mini for comparison:
- Input: $0.15
- Cached Input $0.075
- Output: $0.60
That's what I was thinking. I hoped to see a price drop, but this does not change anything for my use cases.
I was using gpt-4o-mini with batch API, which I recently replaced with mistral-small-latest batch API, which costs $0.10/$0.30 (or $0.05/$0.15 when using the batch API). I may change to 4.1-nano, but I'd have to be overwhelmed by its performance in comparision to mistral.
I don't think they ever committed themselves to uniformed pricing for mini models. Of course cheaper is better but I understand pricing to be contingent on factors specific to every next model rather than following from a blanket policy.
Seems like 4.1 nano ($0.10) is closer to the replacement and 4.1 mini is a new in-between price
The cached input price is notable here: previously with GPT-4o it was 1/2 the cost of raw input, now it's 1/4th.
It's still not as notable as Claude's 1/10th the cost of raw input, but it shows OpenAI's making improvements in this area.
Unless that has changed, anthropics (and gemini) caches are opt-in though if I recall, openai automatically chaches for you.
GPT-4.1 probably is a distilled version of GPT-4.5
I dont understand the constant complaining about naming conventions. The number system differentiates the models based on capability, any other method would not do that. After ten models with random names like "gemini", "nebula" you would have no idea which is which. Its a low IQ take. You dont name new versions of software as completely different software
Also, Yesterday, using v0, I replicated a full nextjs UI copying a major saas player. No backend integration, but the design and UX were stunning, and better than I could do if I tried. I have 15 years of backend experience at FAANG. Software will get automated, and it already is, people just havent figured it out yet
> The number system differentiates the models based on capability, any other method would not do that.
Please rank GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.1-nano, GPT-4.1-mini, GPT-4.1, GPT-4.5, o1-mini, o1, o1 pro, o3-mini, o3-mini-high, o3, and o4-mini in terms of capability without consulting any documentation.
Btw, as someone who agrees with your point, what’s the actual answer to this?
Of these, some are mostly obsolete: GPT-4 and GPT-4 Turbo are worse than GPT-4o in both speed and capabilities. o1 is worse than o3-mini-high in most aspects.
Then, some are not available yet: o3 and o4-mini. GPT-4.1 I haven't played with enough to give you my opinion on.
Among the rest, it depends on what you're looking for:
Multi-modal: GPT-4o > everything else
Reasoning: o1-pro > o3-mini-high > o3-mini
Speed: GPT-4o > o3-mini > o3-mini-high > o1-pro
(My personal favorite is o3-mini-high for most things, as it has a good tradeoff between speed and reasoning. Although I use 4o for simpler queries.)
So where was o1-pro in the comparisons in OpenAI's article? I just don't trust any of these first party benchmarks any more.
Is 4.5 not strictly better than 4o?
It depends on how you define "capability" since that's different for reasoning and nonreasoning models.
Whats the problem, for the layman it doesnt actually matter, and for the experts, its usually very obvious which model to use.
LLMs fundamentally have the same contraints no matter how much juice you give them or how much you toy with the models.
That’s not true. I’m a layman and 4.5 is obviously better than 4o for me, definitely enough to matter.
You are definitely not a layman if you know the difference between 4.5 and 4o. The average user thinks ai = openai = chatgpt.
Well, okay, but I'm certainly not an expert who knows the fine differences between all the models available on chat.com. So I'm somewhere between your definition of "layman" and your definition of "expert" (as are, I suspect, most people on this forum).
If you know the difference between 4.5 and 4o, it'll take you 20 minutes max to figure out the theoretical differences between the other models, which is not bad for a highly technical emerging field.
There's no single ordering -- it really depends on what you're trying to do, how long you're willing to wait, and what kinds of modalities you're interested in.
I recognize this is a somewhat rhetorical question and your point is well taken. But something that maps well is car makes and models:
- Is Ford Better than Chevy? (Comparison across providers) It depends on what you value, but I guarantee there's tribes that are sure there's only one answer.
- Is the 6th gen 2025 4Runner better than 5th gen 2024 4Runner? (Comparison of same model across new releases) It depends on what you value. It is a clear iteration on the technology, but there will probably be more plastic parts that will annoy you as well.
- Is the 2025 BMW M3 base model better than the 2022 M3 Competition (Comparing across years and trims)? Starts to depend even more on what you value.
Providers need to delineate between releases, and years, models, and trims help do this. There are companies that will try to eschew this and go the Tesla route without models years, but still can't get away from it entirely. To a certain person, every character in "2025 M3 Competition xDrive Sedan" matters immensely, to another person its just gibberish.
But a pure ranking isn't the point.
Yes, point taken.
However, it's still not as bad as Intel CPU naming in some generations or USB naming (until very recently). I know, that's a very low bar... :-)
Very easy with the naming system?
Really? Is o3-mini-high better than o1-pro?
In my experience it's better for value/price, but if you just need to solve a problem, o1 pro is the best tool available.
I meant this is actually straight-forward if you've been paying even the remotest of attention.
Chronologically:
GPT-4, GPT-4 Turbo, GPT-4o, o1-preview/o1-mini, o1/o3-mini/o3-mini-high/o1-pro, gpt-4.5, gpt-4.1
Model iterations, by training paradigm:
SGD pretraining with RLHF: GPT-4 -> turbo -> 4o
SGD pretraining w/ RL on verifiable tasks to improve reasoning ability: o1-preview/o1-mini -> o1/o3-mini/o3-mini-high (technically the same product with a higher reasoning token budget) -> o3/o4-mini (not yet released)
reasoning model with some sort of Monte Carlo Search algorithm on top of reasoning traces: o1-pro
Some sort of training pipeline that does well with sparser data, but doesn't incorporate reasoning (I'm positing here, training and architecture paradigms are not that clear for this generation): gpt-4.5, gpt-4.1 (likely fine-tuned on 4.5)
By performance: hard to tell! Depends on what your task is, just like with humans. There are plenty of benchmarks. Roughly, for me, the top 3 by task are:
Creative Writing: gpt-4.5 -> gpt-4o
Business Comms: o1-pro -> o1 -> o3-mini
Coding: o1-pro -> o3-mini (high) -> o1 -> o3-mini (low) -> o1-mini-preview
Shooting the shit: gpt-4o -> o1
It's not to dismiss that their marketing nomenclature is bad, just to point out that it's not that confusing for people that are actively working with these models have are a reasonable memory of the past two years.
> You dont name new versions of software as completely different software
macOS releases would like a word with you.
https://en.wikipedia.org/wiki/MacOS#Timeline_of_releases
Technically they still have numbers, but Apple hides them in marketing copy.
https://www.apple.com/macos/
Though they still have “macOS” in the name. I’m being tongue-in-cheek.
Just add SemVer with an extra tag:
4.0.5.worsethan4point5
> Software will get automated, and it already is, people just havent figured it out yet
To be honest I think this is most AI labs (particularly the American ones) not-so-secret goal now, for a number of strong reasons. You can see it in this announcements, Anthrophic's recent Claude 3.7 announcement, OpenAI's first planned agent (SWE-Agent), etc etc. They have to justify their worth somehow and they see it as a potential path to do that. Remains to be seen how far they will get - I hope I'm wrong.
The reasons however for picking this path IMO are:
- Their usage statistics show coding as the main user: Anthrophic recently released their stats. Its become the main usage of these models, with other usages at best being novelty or conveniences for people in relative size. Without this market IMO the hype would of already fizzled awhile ago at best a novelty when looking at the rest of the user base size.
- They "smell blood" to disrupt and fear is very effective to promote their product: This IMO is the biggest one. Disrupting software looks to be an achievable goal, but it also is a goal that has high engagement compared to other use cases. No point solving something awesome if people don't care, or only care for awhile (e.g. meme image generation). You can see the developers on this site and elsewhere in fear. Fear is the best marketing tool ever and engagement can last years. It keeps people engaged and wanting to know more; and talking about how "they are cooked" almost to the exclusion of everything else (i.e. focusing on the threat). Nothing motivates you to know a product more than not being able to provide for yourself, your family, etc to the point that most other tech topics/innovations are being drowned out by AI announcements.
- Many of them are losing money and need a market to disrupt: Currently the existing use cases of a chat bot are not yet impressive enough (or haven't been till very recently) to justify the massive valuations of these companies. Its coding that is allowing them to bootstrap into other domains.
- It is a domain they understand: AI dev's know models, they understand the software process. It may be a complex domain requiring constant study, but they know it back to front. This makes it a good first case for disruption where the data, and the know how is already with the teams.
TL;DR: They are coming after you, because it is a big fruit that is easier to pick for them than other domains. Its also one that people will notice either out of excitement (CEO, VC's, Management, etc) or out of fear (tech workers, academics, other intellectual workers).
Feel free to lay the naming convention rules out for us man.
> Yesterday, using v0, I replicated a full nextjs UI copying a major saas player. No backend integration, but the design and UX were stunning, and better than I could do if I tried.
Exactly. Those who do frontend or focus on pretty much anything Javascript are, how should I say it? Cooked?
> Software will get automated
The first to go are those that use JavaScript / TypeScript engineers have already been automated out of a job. It is all over for them.
Yeah its over for them. Complicated business logic and sprawling systems are what are keeping backend safe for now. But the big front end code bases where individual files (like react components) are largely decoupled from the rest of the code base is why front end is completely cooked
I have a medium-sized typescript personal project I work on. It probably has 20k LOC of well organized typescript (react frontend, express backend). I also have somewhat comprehensive docs and cursor project rules.
In general I use Cursor in manual mode asking it to make very well scoped small changes (e.g. “write this function that does this in this exact spot”). Yesterday I needed to make a largely mechanical change (change a concept in the front end, make updates to the corresponding endpoints, update the data access methods, update the database schema).
This is something very easy I would expect a junior developer to be able to accomplish. It is simple, largely mechanical, but touches a lot of files. Cursor agent mode puked all over itself using Gemini 2.5. It could summarize what changes would need to be made, but it was totally incapable of making the changes. It would add weird hard coded conditions, define new unrelated files, not follow the conventions of the surrounding code at all.
TLDR; I think LLMs right now are good for greenfield development (create this front end from scratch following common patterns), and small scoped changes to a few files. If you have any kind of medium sized refactor on an existing code base forget about it.
> Cursor agent mode puked all over itself using Gemini 2.5. It could summarize what changes would need to be made, but it was totally incapable of making the changes.
Gemini 2.5 is currently broken with the Cursor agent; it doesn't seem to be able to issue tool calls correctly. I've been using Gemini to write plans, which Claude then executes, and this seems to work well as a workaround. Still unfortunate that it's like this, though.
Interesting, I’ve found Gemini better than Claude so I defaulted to that. I’ll try another refactor in agent mode with Claude.
My personal opinion is leveraging LLMs on a large code base requires skill. How you construct the prompt, and what you keep in context, which model you use, all have a large effect on the output. If you just put it into cursor and throw your hands up, you probably didnt do it right
I gave it a list of the changes I needed and pointed it to the area of the different files that needed updated. I also have comprehensive cursor project rules. If I needed to hand hold any more than that it would take considerably less time to just make the changes myself.
> I don't understand the constant complaining about naming conventions.
Oh man. Unfolding my lawn chair and grabbing a bucket of popcorn for this discussion.
[flagged]
>Calling different opinions a low IQ take
I dont read it to imply like that.
> using v0, I replicated a full nextjs UI copying a major saas player. No backend integration, but the design and UX were stunning
AI is amazing, now all you need to create a stunning UI is for someone else to make it first so an AI can rip it off. Not beating the "plagiarism machine" allegations here.
Heres a secret: Most of the highest funded VC backed software companies are just copying a competitor with a slight product spin/different pricing model
> Jim Barksdale, used to say there’s only two ways to make money in business: One is to bundle; the other is unbundle
https://a16z.com/the-future-of-work-cars-and-the-wisdom-in-s...
Exactly, they like to call it “bringing new energy to an old industry”.
Got any examples?
Rippling
I was hoping for native image gen in the API but better pricing is always appreciated.
Gemini was drastically cheaper for image/video analysis, I'll have to see how 4.1 mini and nano compare.
anyone want to guess parameter sizes here for
GPT‑4.1, GPT‑4.1 mini GPT‑4.1 nano
I'll start with
800 bn MoE (probably 120 bn activated), 200 bn MoE (33 bn activated), and 7bn parameter for nano
It's another Daft Punk day. Change a string in your program* and it's better, faster, cheaper: pick 3.
*Then fix all your prompts over the next two weeks.
Does someone have the benchmarks compared to other models?
claude 3.7 no thinking (diff) - 60.4%
claude 3.7 32k thinking tokens (diff) - 64.9%
GPT-4.1 (diff) - 52.9% (stat is from the blog post)
https://aider.chat/docs/leaderboards/
I know this is somewhat off topic, but can someone explain the naming convention used by OpenAI? Number vs "mini" vs "o" vs "turbo" vs "chat"?
Mini means the size of the model (less parameters)
"o" means "omni", which means its multimodal.
I wish they would deprecate all existing ones when they bake a new model instead of aiming for pointless model diversity.
Does this mean that the o1 and o3-mini models are also using 4.1 as the base now?
Testing against unspecified other "leading" models allows for shenanigangs:
> Qodo tested GPT‑4.1 head-to-head against other leading models [...] they found that GPT‑4.1 produced the better suggestion in 55% of cases
The linked blog post goes 404: https://www.qodo.ai/blog/benchmarked-gpt-4-1/
The post seems to be up now and seems to compare it slightly favorable to Claude 3.7.
Right, now it's up and comparison against Claude 3.7 is better than I feared based on the wording. Though why does the OpenAI announcement talk of comparison against multiple leading models when the Qodo blog post only tests against Claude 3.7...
it's worse than 4.5 on nearly every benchmark. just an incremental improvement. AI is slowing down
Or OpenAI is? After using Gemini 2.5, I did not feel "AI is slowing down". It's just this model isn't SOTA.
They don't disclose parameter counts so it's hard to say exactly how far apart they are in terms of size, but based on the pricing it seems like a pretty wild comparison, with one being an attempt at an ultra-massive SOTA model and one being a model scaled down for efficiency and probably distilled from the big one. The way they're presented as version numbers is business nonsense which obscures a lot about what's going on.
It's like 30x cheaper though. Probably just distilled 4.5
Sorry what is the source for this?
Maybe progress is slowing down but after using gemini 2.5 there clearly is still a lot being made.
It's better on AIME '24, Multilingual MMLU, SWE-bench, Aider’s polyglot, MMMU, ComplexFuncBench while being much much cheaper and smaller.
and it's worse on just as many benchmarks by a significant amount. as a consumer I don't care about cheapness, I want the maximum accuracy and performance
As a consumer you care about speed tho, and GPT-4.5 is extremely slow, at this point just use a reasoning model if you want the best of the best.
Can someone explain to me why we should take Aider's polyglot benchmark seriously?
All the solutions are already available on the internet on which various models are trained, albeit in various ratios.
Any variance could likely be due to the mix of the data.
If you're looking to test an LLMs ability to solve a coding task without prior knowledge of the task at hand, I don't think their benchmark is super useful.
If you care about understanding relative performance between models for solving known problems and producing correct output format, it's pretty useful.
- Even for well-known problems, we see a large distribution of quality between models (5 to 75% correctness) - Additionally, we see a large distribution of model's ability to produce responses in formats they were instructed in
At the end of the day, benchmarks are pretty fuzzy, but I always welcome a formalized benchmark as a means to understand model performance over vibe checking.
To join in the faux rigor?
Could any one guess the reason as to why they didn't ship this in the chat UI?
Answering my own question after some research. It looks like OpenAI decided not to introduce 4.1 in ChatGPT UI because 4.1 is not necessarily a better model than 4o because it is not multi modal.
Now you can imagine introducing a newer "type" of model like 4.1 that's better at following instructions and better at coding to bring a sort of overhead thats already too much with the given options.
OpenAI confirmed somewhere that they have already incorporated the enhancements made in 4.1 to 4o model in ChatGPT UI. I assume they would delegate to 4.1 model if the prompt doesn't require specific 4o capabilities.
Also one of the improvements made to 4.1 is following instructions. This type of thing is better suited for agentic use cases that are typically used in the form of an API.
The memory thing? More resources intensive?
i've recently set claude 3.7 as the default option for customers when they start new chats in my app. this was a recent change, and i'm feeling good about it. supporting multiple providers can be a nightmare for customer service, especially when it comes to billing and handling response quality queries. with so many choices from just one provider, it simplifies things significantly. curious about how openai manages customer service internally.
LLMs are not intelligent
The big change about this announcement is the 1M context window on all models.
But the price is what matters.
Nothing compared to Llama 4's 7M. What matters is how well it performs with such long context, not what the technical maximum is.
I feel overwhelmed
Is this correct: OpenAI will sequester 4.1 in the API permanently? And, since November 2024, they've already wrapped much of 4.1's features into ChatGPT 4o?
More season 4’s than attack on titan
Main takeaways:
- Coding accuracy improved dramatically
- Handles 1M-token context reliably
- Much stronger instruction following
ok.
It seems that OpenAI is really differentiating itself in the AI market by developing the most incomprehensible product names in the history of software.
They learned from the best: Microsoft
Microsoft Neural Language Processing Hyperscale Datacenter Enterprise Edition 4.1
A massive transformer-based language model requiring:
- 128 Xeon server-grade CPUs
- 25,000MB RAM minimum (40,000MB recommended)
- 80GB hard disk space for model weights
- Dedicated NVIDIA Quantum Accelerator Cards (minimum 8)
- Enterprise-grade cooling solution
- Dedicated 30-amp power circuit
- Windows NT Advanced Server with Parallel Processing Extensions
~
Features:
- Natural language understanding and generation
- Context window of 8,192 tokens
- Enterprise security compliance module
- Custom prompt engineering interface
- API gateway for third-party applications
*Includes 24/7 on-call Microsoft support team and requires dedicated server room with raised floor cooling
GPT 4 Workgroups
GpTeams Classic
Or Intel.
"Hey buddy, want some .Net, oh I mean dotnet"
I wonder how they decide whether the o or the digit needs to come first. (eg. o3 vs 4o)
Reasoning models have the o first, non-reasoners have the digit first.
I need an AI to understand the naming conventions that OpenAI is using.
They envy the USB committee.
OAI are so ahead of the competition, they don't need to compare with the competition anymore /s
hahahahaha
[dead]
[dead]
> We will also begin deprecating GPT‑4.5 Preview in the API, as GPT‑4.1 offers improved or similar performance on many key capabilities at much lower cost and latency. GPT‑4.5 Preview will be turned off in three months, on July 14, 2025, to allow time for developers to transition.
Well, that didn't last long.
so we're going back... .4 of a gpt? make it make sense openai..
Think of 4.5 as being the lacklustre major upgrade to a software package, pick one maybe Photoshop or whatever. The 4.0 version is still available and most people are continuing to use it, then suddenly 4.0 gets a small upgrade which makes it considerably better and the vendor starts talking about how the real future is in 5.0.
I wish OpenAI had invented this but it’s not that uncommon.
The plagiarism machine got an update! Yay!
The better the benchmarks, the worse the model is. Subjectively for me the more advanced models dont follow instructions, and are less capable of implementing features or building stuff. I could not tell a difference in blind testing SOTA models gemini, claude, openai, deepseek. There has been no major improvements in the LLM space since the original models gained popularity. Each release claims to be much better the last, and every time i have been disappointed and think this is worse.
First it was the models stopped putting in effort and felt lazy, tell it to do something and it will tell you to do it your self. Now its the opposite and the models go ham changing everything they see, instead of changing one line, SOTA models rather rewrite the whole project and still not fix the issue.
Two years back I totally thought these models are amazing. I always would test out the newest models and would get hyped up about it. Every problem i had i thought if i just prompt it differently I can get it to solve this. Often times i have spent hours prompting starting new chats, adding more context. Now i realize its kinda useless and its better to just accept the models where they are, rather then try and make them a one stop shop, or try to stretch capabilities.
I think this release I won’t even test it out, im not interested anymore. I’ll probably just continue using deepseek free, and gemini free. I canceled my openai subscription like 6 months ago, and canceled claude after 3.7 disappointment.