
Before last year we didn't have reasoning. It arrived with Quiet-STaR, then we got it in the form of o1, and then it became practical with DeepSeek's R1 paper in January.

So we're only about a year since the last big breakthrough.

I think we got a second big breakthrough with Google's results on the IMO problems.

For this reason I think we're very far from hitting a wall. Maybe 'LLM parameter scaling is hitting a wall'. That might be true.



IMO is not a breakthrough: if you craft proper prompts, you can excel at IMO problems with Gemini 2.5 Pro. Paper: https://arxiv.org/abs/2507.15855. Google just threw its whole computational power at it along with very high-quality data. It was test-time scaling. Otherwise, why didn't it solve problem 6 as well?

Yes, it was a breakthrough, but it saturated quickly. Wait for the next breakthrough. If they can build LLMs whose weights adapt at inference time, then we can talk about something different, but test-time scaling is coming to an end as hallucination rates increase. No sign of AGI.


It wasn't long ago that test-time scaling wasn't possible. Test-time scaling is a core part of what makes this a breakthrough.

I don't believe your assessment, though. IMO is hard, and Google have said that they use search and some way of combining different reasoning traces. I haven't read that paper yet, and of course it may support your view, but I just don't believe it.

We are not close to solving IMO with publicly known methods.
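
For what it's worth, the simplest publicly known way of "combining different reasoning traces" is self-consistency voting: sample many traces, keep the majority answer. A minimal sketch (ask_model is a hypothetical stand-in, not any real API):

    from collections import Counter

    def ask_model(problem, seed):
        # Hypothetical stand-in: returns (reasoning_trace, final_answer)
        # for one trace sampled at nonzero temperature.
        raise NotImplementedError("plug in your model call here")

    def self_consistency(problem, n_samples=16):
        # Sample several independent reasoning traces, then majority-vote
        # on the final answers and discard the traces themselves.
        answers = [ask_model(problem, seed)[1] for seed in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]

Turning n_samples up is exactly the "spend more compute at inference" move; whatever Google actually ran is presumably far more elaborate on top of something like this.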


Test-time scaling is based on methods from pre-2020. If you look at the details of modern LLMs, it's pretty rare to encounter a method from 2020 or later (RoPE and GRPO being exceptions). I am not saying the IMO result is not impressive, but it is not a breakthrough; if they had said they used a different paradigm than test-time scaling, I would call it a breakthrough.

> We are not close to solving IMO with publicly known methods.

The point here is not the method but the computational power. You can solve any verifiable task with enough compute; there are surely tweaks to the methods, but I don't think it's anything very big or different. OpenAI just asserted they solved it with a breakthrough.

Wait for self-adapting LLMs. We will see them within two years at most; I think all the big tech companies are focusing on that now.


What kind of test time scaling did we have pre-2020?

Non-output tokens were basically introduced by Quiet-STaR, which is quite recent. What method from five years ago does anything like that?


Layman's perspective: we had hints of reasoning from the initial release of ChatGPT, when people figured out you could prompt "think step by step" to drastically increase problem-solving performance. Then, a year or more later, it was cleverly incorporated into model training.
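
The trick was literally one extra line in the prompt. A minimal sketch using the OpenAI Python client (the model name is an assumption; any chat model illustrates the effect):

    # Zero-shot chain-of-thought: append "think step by step" to the question.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
                "more than the ball. How much does the ball cost?")

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: swap in whatever model you use
        messages=[{"role": "user",
                   "content": question + "\n\nLet's think step by step."}],
    )
    print(resp.choices[0].message.content)

Without that last line of the prompt, models of that era tended to blurt out "$0.10"; with it, they walk through the algebra and land on $0.05.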


Fine, but to me reasoning is where you have <think> tags and use RL to decide what gets generated in between them.

Of course, people regarded things like GSM8K with trained reasoning traces as reasoning too, but it's pretty obviously not quite the same thing.
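
Concretely, the trace is just a tagged span in the sampled text, and only the part between the tags gets treated as "thinking". A sketch of splitting such an output, assuming R1-style <think> tags:

    import re

    def split_trace(output):
        # Separate the hidden reasoning span from the visible answer.
        m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
        if m is None:
            return "", output.strip()  # model emitted no trace
        return m.group(1).strip(), output[m.end():].strip()

    reasoning, answer = split_trace(
        "<think>2 + 2 is 4, minus 1 is 3.</think>The answer is 3."
    )
    # reasoning == "2 + 2 is 4, minus 1 is 3."
    # answer    == "The answer is 3."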


We still don't have reasoning. We have synthetic text extrusion machines priming themselves to output text that looks a certain way by first generating some extra text that gets piped back into their own input for a second round.
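
Mechanically, that "second round" is no more than this (a sketch; generate stands in for a single sampling pass):

    def generate(prompt):
        # Hypothetical stand-in for one forward sampling pass of the model.
        raise NotImplementedError("plug in your model call here")

    def answer_with_scratchpad(question):
        # Round one: extrude some extra text.
        scratchpad = generate(question + "\n\nThink it through first:")
        # Round two: the model's own output becomes part of its input.
        return generate(question + "\n\n" + scratchpad + "\n\nFinal answer:")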


Right. The "reasoning" is an illusion. It's another hype generation tool.


It's sometimes useful, it seems. But when and why it helps is unclear and understudied, and the text produced in the "reasoning trace" doesn't necessarily correspond to or predict the text produced in the main response (which, of course, actual reasoning would).

Boosters will often retreat to "I don't care if the thing actually thinks", but the whole industry is trading on anthropomorphic notions like "intelligence", "reasoning", "thinking", "expertise", even "hallucination", etc., in order to drive the engine of the hype train.

The massive amounts of capital wouldn't be here without all that.



