Fewer instructions doesn't mean it's faster. It can be faster, but that's not guaranteed in general. An obvious counterexample is single-threaded vs. multi-threaded code: the single-threaded version will execute fewer instructions but won't necessarily be faster.
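A toy sketch of what I mean (the workload and numbers are made up purely for illustration): the multi-process version executes strictly more instructions in total, yet can finish sooner on a multi-core machine.

    # Sketch only: hypothetical CPU-bound workload, counting odd numbers.
    # The multi-process version does extra work (process startup, chunking,
    # result collection), i.e. more instructions overall, but can still win
    # on wall-clock time when several cores run chunks in parallel.
    import time
    from multiprocessing import Pool

    N = 20_000_000

    def count_odd(rng):
        return sum(i & 1 for i in rng)

    def single_threaded():
        return count_odd(range(N))

    def multi_process(workers=4):
        chunk = N // workers
        parts = [range(i * chunk, (i + 1) * chunk) for i in range(workers)]
        with Pool(workers) as pool:
            return sum(pool.map(count_odd, parts))

    if __name__ == "__main__":
        for fn in (single_threaded, multi_process):
            start = time.perf_counter()
            fn()
            print(fn.__name__, f"{time.perf_counter() - start:.2f}s")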
I didn’t ask you to be rude or wrong either, yet here we are. The assignment is explicitly single core and cycle accurate. Your point is completely irrelevant and shows a disconnect with the content being discussed.
It's neither rude nor wrong to ask for evidence to support claims being made in what appears to be corporate advertising. The claim is their LLM is better than a person, I asked for evidence. None was presented. It's not complicated.
You first claimed this task was poorly specified (it’s not) and then completely misrepresented what it’s looking for. When I pointed this out you became defensive and claimed this was not your point at all. That’s what I’m talking about.
It's hard to cut through the AI hype when there are billions of dollars at stake. I usually trust negative comments more, as long as the person isn't trying to sell a course. Even though Terence Tao is a respected scientist, I wonder if his recent comments are driven by a need for funding due to federal cuts. I’ve had similar experiences with LLMs—whenever I ask them about hard math or RL theory, they almost always give me the wrong answers.
I also care more about the failure modes than the successes, although in my case, it's because I keep finding them exceptionally useful at software development, and I:
1. Don't want to use them where they suck.
Think normalisation of deviance: "the problems haven't affected me therefore they don't exist" is a way to get really badly burned.
2. Want to train up in things they will still suck at by the time I've learned whatever it is.
I find LLMs seem kinda bad at writing sheet music, and Suno is kinda bad at weird instructions (like Stable Diffusion for images), but I expect them to get good before I can.
I also find them inconsistent at non-Euclidean problems: sometimes they can, sometimes they can't. I have absolutely no idea how to monetise that, but even if I could, "inconsistent" is itself an improvement on "cannot ever", which is what SOTA was a few years ago.
This is a response from a mathematician:
"This is quite something, congratulations to Boris and Aristotle!
On one hand, as the nice sketch provided below by tsaf confirms, the final proof is quite simple and elementary - indeed, if one was given this problem in a maths competition (so therefore expected a short simple solution existed) I'd guess that something like the below would be produced. On the other hand, if something like this worked, then surely the combined talents of Burr, Erdős, Graham, and Li would have spotted it.
Normally, this would make me suspicious of this short proof, in that there is overlooked subtlety. But (a) I can't see any and (b) the proof has been formalised in Lean, so clearly it just works!
Perhaps this shows what the real issue in the [BEGL96] conjecture is - namely the removal of 1 and the addition of the necessary gcd condition. (And perhaps at least some subset of the authors were aware of this argument for the easier version allowing 1, but this was overlooked later by Erdős in [Er97] and [Er97e], although if they were aware then one would hope they'd have included this in the paper as a remark.)
At the moment I'm minded to keep this as open, and add the gcd condition in the main statement, and note in the remarks that the easier (?) version allowing 1 and omitting the gcd condition, which was also asked independently by Erdős, has been solved."
The commentator is saying: "I can't believe this famous problem was solved so easily. I would have thought it was a fake proof, but the computer verified it. It turns out the solution works because it addresses a slightly different set of constraints (regarding the number 1) than what Erdős originally struggled with." (Generated by Gemini)
I started fully coding with Claude Code. It's not just vibe coding, but rather AI-assisted coding. I've noticed a considerable decrease in my understanding of the whole codebase, even though I'm the only one who has worked on this codebase for 2 years. I'm struggling to answer my colleagues' questions.
I am not arguing that we should drop AI, but we should really measure its effects and take action accordingly. There's more to it than just gaining productivity.
This is the chief reason I don't use integrations. I just use chat, because I want to understand the code and insert it myself. Otherwise you end up with the code outpacing your understanding of it.
Yes. I'm happy to have a sometimes-wrong expert to hand. Sometimes it provides just what I need, sometimes like with a human (who are also fallible), it helps to spur my own thinking along, clarify, converge on a solution, think laterally, or other productivity boosting effects.
I’m experiencing something similar. We have a codebase of about 150k lines of backend code. On one hand, I feel significantly more productive - perhaps 400% more efficient when it comes to actually writing code. I can iterate on the same feature multiple times, refining it until it’s perfect.
However, the challenge has shifted to code review. I now spend the vast majority of my time reading code rather than writing it. You really need to build strong code-reading muscles. My process has become: read, scrap it, rewrite it, read again… and repeat until it’s done. This approach produces good results for me.
The issue is that not everyone has the same discipline to produce well-crafted code when using AI assistance. Many developers are satisfied once the code simply works. Since I review everything manually, I often discover issues that weren’t even mentioned. During reviews, I try to visualize the entire codebase and internalize everything to maintain a comprehensive understanding of the system’s scope.
I'm very surprised you find this workflow more efficient than just writing the code. I find constructing the mental model of the solution and how it fits into the existing system and codebase to be 90% of the effort; actually writing the code is the other 10%. Admittedly, I don't have to write any boilerplate due to the problem domain and tech choices. Coding agents definitely help with the last 10% and also with all the adjacent work - one-off scripts where I don't care about code quality.
I doubt it actually is. All the extra effort it takes to make the AI do something useful on non-trivial tasks is going to end up being a wash in terms of productivity, if not a net negative. But it feels more productive because of how fast the AI can iterate.
And you get to pay some big corporation for the privilege.
> Many developers are satisfied once the code simply works.
In the general case, the only way to convince oneself that the code truly works is to reason through it, as testing only tests particular data points for particular properties. Hence, “simply works” is more like “appears to work for the cases I tried out”.
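A toy example (the function and tests are invented for illustration) of how "appears to work for the cases I tried out" can diverge from "works":

    # Buggy absolute-value function that nevertheless passes these spot checks.
    def absolute(x):
        return x if x != -1 else 1  # only "handles" the one negative we test

    assert absolute(0) == 0
    assert absolute(5) == 5
    assert absolute(-1) == 1   # all tests pass...
    # ...yet absolute(-3) == -3, so the tests only sampled lucky data points.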
I wrote a couple of Python scripts this week to help me with a MIDI integration project (3 devices with different cable types) and for quick debugging if something fails (yes, I know there are tools out there that do this, but I like learning).
I could have used an LLM to assist, but then I wouldn't have learned much.
But I did use an LLM to make a management wrapper that presents a menu of options (CLI right now) and calls the scripts. That probably saved me an hour, easily.
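Roughly the kind of wrapper I mean (a sketch only; the script names are invented for illustration):

    # Minimal CLI menu that shells out to the existing standalone scripts.
    import subprocess
    import sys

    SCRIPTS = {
        "1": ("Route MIDI between devices", "midi_router.py"),
        "2": ("Dump incoming MIDI messages", "midi_monitor.py"),
        "3": ("Run cable/port self-test", "midi_selftest.py"),
    }

    def main():
        while True:
            for key, (label, _script) in SCRIPTS.items():
                print(f"{key}) {label}")
            choice = input("Select an option (q to quit): ").strip()
            if choice == "q":
                break
            if choice in SCRIPTS:
                subprocess.run([sys.executable, SCRIPTS[choice][1]])
            else:
                print("Unknown option.")

    if __name__ == "__main__":
        main()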
That’s my comfort level for anything even remotely “complicated”.
I keep wanting to go back to using Claude Code, but I get worried about this issue. How best to use it to complement you, without it rewriting everything behind the scenes? What's the best protocol? Constant commit requests and reviews?
Yesterday, I was asked to scrape data from a website. My friend used ChatGPT to scrape it but didn't succeed, even after spending 3+ hours. I looked at the website's code, understood it with my web knowledge, did some research with an LLM, and then described to the LLM how to scrape the data; it took 30 minutes overall. The LLM can't come up with the best approach on its own, but you can come up with it using the LLM. It's always the same: at the end of the day you need someone who can really think.
LLMs can do anything, but the decision tree for what you can do in life is almost infinite. LLMs still need a coherent designer to make progress towards a goal.
It is not that easy: the page has lazy loading that is triggered by scrolling specific sections. You need to find a clever way; there's no way to scrape it with bs4 alone, and it's tough even with Selenium.
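The usual workaround (a rough sketch; the URL and CSS selector are placeholders) is to drive a real browser, keep scrolling until the lazy-loaded content stops growing, and only then hand the HTML to BeautifulSoup; even that can break when the lazy loading is tied to specific sections rather than the whole page:

    # Scroll-until-stable pattern for scroll-triggered lazy loading.
    import time
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()
    driver.get("https://example.com/listing")  # placeholder URL

    last_height = 0
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give the lazy-loaded section time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    soup = BeautifulSoup(driver.page_source, "html.parser")
    rows = soup.select(".data-row")  # placeholder selector
    driver.quit()
    print(len(rows), "rows found")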
Before last year we didn't have reasoning. It came with Quiet-STaR, then we got it in the form of o1, and then it became practical with DeepSeek's paper in January.
So we're only about a year since the last big breakthrough.
I think we got a second big breakthrough with Google's results on the IMO problems.
For this reason I think we're very far from hitting a wall. Maybe 'LLM parameter scaling is hitting a wall'. That might be true.
The IMO result is not a breakthrough; if you craft proper prompts you can excel at the IMO with 2.5 Pro. Paper: https://arxiv.org/abs/2507.15855. Google just threw its whole computational power at it, along with very high-quality data. It was test-time scaling. Why didn't it solve problem 6 as well?
Yes, it was a breakthrough, but it saturated quickly. Wait for the next breakthrough. If they can build self-adapting weights into LLMs we can talk about different things, but test-time scaling is coming to an end as the hallucination rate increases. No sign of AGI.
It wasn't long ago that test-time scaling wasn't possible. Test-time scaling is a core part of what makes this a breakthrough.
I don't believe your assessment though. The IMO is hard, and Google have said that they use search and some way of combining different reasoning traces. I haven't read that paper yet, and of course it may support your view, but I just don't believe it.
We are not close to solving IMO with publicly known methods.
Test-time scaling is based on methods from pre-2020. If you look at the details of modern LLMs, the probability of encountering a method from 2020 or later is pretty small (RoPE, GRPO). I am not saying the IMO result is not impressive, but it is not a breakthrough; if they had said they used a different paradigm than test-time scaling, I would call it a breakthrough.
> We are not close to solving IMO with publicly known methods.
The point here is not the method but rather computational power. You can solve any verifiable task with enough computation; of course there must be tweaks in the methods, but I don't think it is something very big and different. OAI just asserted they solved it with a breakthrough.
Wait for self-adapting LLMs. We will see them in at most 2 years; all the big tech companies are focusing on that now, I think.
Layman's perspective: we had hints of reasoning from the initial release of ChatGPT when people figured out you could prompt "think step by step" to drastically increase problem solving performance. Then yeah a year+ later it was cleverly incorporated into model training.
We still don't have reasoning. We have synthetic text extrusion machines priming themselves to output text that looks a certain way by first generating some extra text that gets piped back into their own input for a second round.
It's sometimes useful, it seems. But when and why it helps is unclear and understudied, and the text produced in the "reasoning trace" doesn't necessarily correspond to or predict the text produced in the main response (which, of course, actual reasoning would).
Boosters will often retreat to "I don't care if the thing actually thinks", but the whole industry is trading on anthropomorphic notions like "intelligence", "reasoning", "thinking", "expertise", even "hallucination", etc., in order to drive the engine of the hype train.
The massive amounts of capital wouldn't be here without all that.
i think this is more an effect of releasing a model every other month with gradual improvements. if there were no o-series or other thinking models on the market, people would be shocked by this upgrade. the only way to keep up with the market is to release improvements asap
I don't agree; the only thing that would shock me about this model is if it didn't hallucinate.
I think the actual effect of releasing more models every month has been to confuse people into thinking that progress is actually happening. Despite claims of exponentially improved performance and the ability to replace PhDs, doctors, and lawyers, it still routinely can't be trusted any more than the original ChatGPT, despite years of effort.
this is a very odd perspective. as someone who uses LLMs for coding/PRs - every time a new model was released, my personal experience was that it was a very solid improvement on the previous generation and not just meant to "confuse". the jump from raw GPT-4 two years ago to o3 full is so unbelievable that if you traveled back in time and showed me, i wouldn't have thought such technology would exist for 5+ years.
to the point on hallucination - that's just the nature of LLMs (and humans to some extent). without new architectures or fact-checking world models in place i don't think that problem will be solved anytime soon. but it seems gpt-5's main selling point is that they somehow reduced the hallucination rate by a lot + search helps with grounding.
I notice you don't bring any examples despite claiming the improvements are frequent and solid. It's likely because the improvements are actually hard to define and quantify. Which is why throughout this period of LLM development, there has been such an emphasis on synthetic benchmarks (which tell us nothing), rather than actual capabilities and real world results.
i didnt bring examples because i said personal experience. heres my "evidence" - gpt-4 took multiple shots and iterations and couldnt stay coherent with a prompt longer than 20k tokens (in my experience). then when 4o came out it improved on that (in my experience). o1 took 1-2 shots with fewer iterations (in my experience). o3 zero-shots most of the tasks i throw at it and stays coherent with very long prompts (in my experience).
heres something else to think about. try and tell everybody to go back to using gpt-4. then try and tell people to go back to using o1-full. you likely wont find any takers. its almost like the newer models are improved and generally more useful
I'm not saying they're not delivering better incremental results for people for specific tasks, I'm saying they're not improving as a technology in the way big tech is selling.
The technology itself is not really improving because all of the showstopping downsides from day one are still there: Hallucinations. Limited context window. Expensive to operate and train. Inability to recall simple information, inability to stay on task, support its output, or do long term planning. They don't self-improve or learn from their mistakes. They are credulous to a fault. There's been little progress on putting guardrails on them.
Little progress especially on the ethical questions that surround them, which seem to have gone out the window with all the dollar signs floating around. They've put waaaay more effort into the commoditization front. 0 concern for the impact of releasing these products to the world, 100% concern about how to make the most money off of them. These LLMs are becoming more than the model, they're now a full "service" with all the bullshit that entails like subscriptions, plans, limits, throttling, etc. The enshittification is firmly afoot.
not to offend - but it sounds like your response/worries are based more on an emotional reaction. and rightly so, this is by all means a very scary and uncertain time. and undeniably these companies have not taken into account the impact their products will cause and the safety surrounding that.
however, a lot of your claims are false - progress is being made in nearly all the areas you mentioned
"You can use these filters to adjust what's appropriate for your use case. For example, if you're building video game dialogue, you may deem it acceptable to allow more content that's rated as Dangerous due to the nature of the game. In addition to the adjustable safety filters, the Gemini API has built-in protections against core harms, such as content that endangers child safety. These types of harm are always blocked and cannot be adjusted."
now id like to ask you for evidence that none of these aspects have been improved - since you claim my examples are vague but make statements like
> Inability to recall simple information
> inability to stay on task
> (doesn't) support its output
> (no) long term planning
ive experienced the exact opposite. not 100% of the time but compared to GPT-4 all of these areas have been massively improved. sorry i cant provide every single chat log ive ever had with these models to satisfy your vagueness-o-meter or provide benchmarks which i assume you will brush aside.
as well as the examples ive provided above - you seem to be making claims out of thin air and then claim others are not providing examples up to your standard.
Big claims of PRs and shipped code, then links to people who are financially interested in the hype.
Not saying things are not getting better, but I have found that the claims of amazing results tend to come from people who are not expert enough in the given domain to judge the actual quality of the output.
I love vibing out Rust, and it compiles and runs, but I have no idea if it is good Rust because, well, I barely understand Rust.
> now id like to ask you for evidence that none of these aspects have been improved
You're arguing against a strawman. I'm not saying there haven't been incremental improvements for the benchmarks they're targeting. I've said that several times now. I'm sure you're seeing improvements in the tasks you're doing.
But for me to believe there's more than a shell game going on, I will have to see tools that do not hallucinate. A (claimed; who knows if that's right, they can't even get the physics questions or the charts right) reduction of 65% is helpful, but it doesn't make these things useful tools in the way they're claiming they are.
> sorry i cant provide every single chat log ive ever had with these models to satisfy your vagueness-o-meter
I'm not asking for all of them, you didn't even share one!
Like I said, despite all the advances in the breathless press releases you're touting, the brand-new model is just one bad roll away from the models of 3 years ago, and until that isn't the case, I'll continue to believe that the technology has hit a wall.
If it can't do this after how many years, then how is it supposed to be the smartest person I know in my pocket? How am I supposed to trust it, and build a foundation on it?
Interesting thread. I think the key issue around hallucinations is analogous to compilers: in order for the output to be implicitly trusted, it has to be as stable as a compiler. Hallucinations mean I cannot YOLO-trust the output. Having to manually scan the code for issues defeats the fundamental benefit.
Compilers were not and are not always perfect, but I think AI has a long way to go before it passes that threshold. People act like it will in the next few years, when the current trajectory strongly suggests that is not the case.
ill leave it at this: if “zero-hallucination omniscience” is your bar, you’ll stay disappointed - and that’s on your expectations, not the tech. personally i’ve been coding/researching faster and with fewer retries every time a new model drops - so my opinion is based on experience. you’re free to sit out the upgrade cycle
you dont remember deepseek introducing reasoning and blowing benchmarks led by private american companies out of the water? with an api that was way cheaper? and then offered the model free in a chat based system online? and you were a big fan?
Isn't the fact that it produced similar performance about 70x more cheaply a breakthrough? In the same way that the Hall-Héroult process was a breakthrough. Not like we didn't have aluminum before 1886.
I think the LLM wall was hit a while ago, and the jumps since have been about finessing LLMs in novel ways for better results. But the core is still very much the same as it has been for a while.
The crypto-level hype claims are all BS and we all knew that, but I do use an LLM more than Google now, which is the "there" there, so to speak.
This does feel like a flatlining of hype tho which is great because idk if i could take the ai hype train for much longer.
It's seemed that way for the last year. The only real improvements have been in the chat apps themselves (internet access, function calling). Until AI gets past the pre-training problem, it'll stagnate.
It is easier to get from 0% accurate to 99% accurate, than it is to get from 99% accurate to 99.9% accurate.
This is like the classic 9s problem in SRE. Each nine is exponentially more difficult.
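A quick back-of-the-envelope version of that (the availability targets are just illustrative):

    # Each extra nine cuts the allowed failure budget by another factor of ten.
    for availability in (0.99, 0.999, 0.9999):
        minutes_down = (1 - availability) * 365 * 24 * 60
        print(f"{availability:.2%} -> ~{minutes_down:,.0f} minutes of failure budget per year")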
How easy do we really think it will be for an LLM to get 100% accurate at physics, when we don't even know what 100% right is, and it's theoretically possible it's not even physically possible?
GPT-5 doesn't give any clues as to whether we've hit the wall, as OpenAI only needs to go one step beyond the competition. They are the market leader and more profitable than the others, so it's possible they are not showing us everything they have until they really need to.
Not really, it's just that our benchmarks are not good at showing how they've improved. Those that regularly try out LLMs can attest to major improvements in reliability over the past year.
> If an AI can replace these repeated tasks, I could spend more time with my fiancé, family, friends, and dog, which is awesome, and I am looking forward to that.
I cannot understand this optimism; aren't we living in a capitalist world?
It is indeed completely stupid: if he can do that, others can too, which means they can be more productive than he is, and the only way he'll end up spending more time with his fiancé, family, friends, and dog is by quickly becoming unemployed.
Yes this is what people constantly get wrong about AI. When AI starts to replace certain tasks, we will then create newer, larger tasks that will keep us busy, even when using AI to its full advantage.
Exactly. I have yet to see the manager who says to their employees: "Ah nice, you became 10% more efficient using AI, so from now on you can work 4 hours less every week."
I don't think it's about capitalism; people have repeatedly shown we simply don't like idle time over the long run.
Plenty of people could already work less today if they just spent less. Historically any of the last big productivity booms could have similarly let people work less, but here we are.
If AGI actually comes about and replaces humans at most cognitive labor, we'll find some way to keep ourselves busy, even if the jobs are ultimately as useless as the pet rock or the Jump to Conclusions Mat (an Office Space reference, for anyone who hasn't seen it).
I don’t think it’s that simple. Productivity gains are rarely universal. Much of the past century’s worth of advancement into automation and computing technology has generated enormous productivity gains in manufacturing, communication, and finance industries but had little or no benefit for a lot of human capital-intensive sectors such as service and education.
It still takes basically the same amount of labour hours to give a haircut today as it did in the late 19th century. An elementary school teacher today can still not handle more than a few tens up to maybe a hundred students at the extreme limit. Yet the hairdressing and education industries must still compete — on the labour market — with the industries showing the largest productivity gains. This has the effect of raising wages in these productivity-stagnant industries and increasing the cost of these services for everyone, driving inflation.
Inflation is the real time-killer, not a fear of idleness. The cost of living has gone up for everyone — rather dramatically, in nominal terms — without even taking housing costs into account.
Productivity gains aren't universal, agreed there for sure, though we have long since moved past needing to optimize productivity for the basics. Collectively we're addicted to trading our time and effort for gadgets, convenience, and status symbols.
I'm not saying those are bad things, people can do whatever they want with their own time and effort. It just seems obvious to me that we aren't interested in working less over any meaningful period of time, if that was a goal we could have reached it a long time ago by defining a lower bar for when we have "enough."
> But they're not talking about idle time, they're talking about quality time with loved ones.
I totally agree there, I wasn't trying to imply that "idle time" is a bad thing, in this context I simply meant its time not filled by obligations allowing them to choose what they do.
> But spending for leisure is often a part of that quality time.
I expect that varies a lot by person and situation. Some of the most enjoyable experiences I've had involved little or no cost; having a camp fire with friends, going on a hike, working outside in the garden, etc.
> I wasn't trying to imply that "idle time" is a bad thing
I hear you; I just mean that what they're talking about is also not idle time, as it's active time. If they were replacing work with sitting around at home, watching TV or whatever, then it would be idle time and would no doubt drive them crazy. But spending time actively with their family is quite different, and would give satisfaction in a way that work does.
> I expect that varies a lot by person and situation.
Indeed. Spending isn't an inherent part of leisure. But it can be a part of it, and an important part for some people. Telling them they could have more free time if they just gave up their passions or hobbies that cost money isn't likely to lead anywhere.
It's slightly more complicated than that. If people work less, they make less money, and that means they can't buy a house, to name just one example. Housing is not getting any cheaper for a myriad of reasons. The same goes for healthcare, and even for drinking beer.
People could work less, but it's a group effort. As long as some narcissistic idiots who want more instead of less are in charge, this is not going to change easily.
Yes, and now we have come full circle back to capitalism. As soon as a gap forms between capital and untapped resources, the capitalist engine keeps running: the rich get richer and the poor get poorer. It is difficult or impossible to break out of this on a large scale.
The poor don't necessarily get poorer; that is not a given in capitalism. But at some point capitalism will converge to feudalism, and at that point the poor will become slaves.
And if not needed, culled. For being "unproductive" or "unattractive" or generally "worthless".
That's my cynical take.
As long as the rich can be reined in in some way, the poor will not necessarily become poorer.
In neoliberal capitalism they do, though. Because companies can maximize profits without internalizing external costs (such as health care, social welfare, environmental costs).
I am from the EU, so I can see it happening here, or in some smaller countries. Here, you already sort of have a UBI, where you get enough social benefits to live off if unemployed.
This is a bad use of AI; we should spend our compute on making science faster. I am pretty confident the computational cost of this is maybe 100x that of a ChatGPT query. I don't even want to think about the environmental effects.