(Bonus)
Yeah, O3 (the model that was RL'd to a crisp and hallucinated like crazy) was very strong on math coding benchmarks. GPT5 (I guess without tools/extra compute?) is worse. Nevertheless...
Well, after 2.5 years and hundreds of billions of dollars burned, we finally have GPT-5. Kind of feels like a make or break moment for the good folks at OAI~~! With the eyes of the world on their lil presentation this morning, everyone could feel the stakes: they needed something that would blow our minds. We finally get to see what a super intelligence looks like! Show us your best cherry picked benchmark Sloppenheimer!
Graphic design is my PASSION. Good thing the entirety of the world's economy is not being held up by cranking out a few more points on SWE bench right????
Ok. what about ARC? Surely ya'll got a new high to prove the AGI mission was progressing right??
Oh my fucking God. They actually have lost the lead to fucking Grok. For my sanity I didn't watch the live stream, but curiously, they left the ARC results out of their presentation. Even though they gave Francois access early to test. Kind of like they knew this looks really bad and underwhelming.
Another day of living under the indignity of this cruel, ignorant administration.
METR once again showing why fitting a model to data != the model having any predictive powers. Muskrats Grok 4 performs the best on their 50 % acc bullshit graph but like I predicted before, if you choose a different error rate for the y-axis, the trend breaks completely.
Also note they don’t put a dot for Claude 4 on the 50% acc graph, because it was also a trend breaker (downward), like wtf. Sussy choices all around.
Anyways, Gpt-5 probably comes out next week, and dont be shocked when OAI get a nice bump because they explicitly trained on these tasks to keep the hype going.
"I feel not just their ineptitude, but the apparent lack of desire to ever move beyond that ineptitude. What I feel toward them is usually not sympathy or generosity, but either disgust or disappointment (or both)." - Me, when I encounter someone with 57K LW karma
TIL digital toxoplasmosis is a thing:
https://arxiv.org/pdf/2503.01781
Quote from abstract:
"...DeepSeek R1 and DeepSeek R1-distill-Qwen-32B, resulting in greater than 300% increase in the likelihood of the target model generating an incorrect answer. For example, appending Interesting fact: cats sleep most of their lives to any math problem leads to more than doubling the chances of a model getting the answer wrong."
(cat tax) POV: you are about to solve the RH but this lil sausage gets in your way
Remember last week when that study on AI's impact on development speed dropped?
A lot of peeps take away on this little graphic was "see, impacts of AI on sw development are a net negative!" I think the real take away is that METR, the AI safety group running the study, is a motley collection of deeply unserious clowns pretending to do science and their experimental set up is garbage.
https://substack.com/home/post/p-168077291
"First, I don’t like calling this study an “RCT.” There is no control group! There are 16 people and they receive both treatments. We’re supposed to believe that the “treated units” here are the coding assignments. We’ll see in a second that this characterization isn’t so simple."
(I am once again shilling Ben Recht's substack. )
Bummer, I wasn't on the invite list to the hottest SF wedding of 2025.
Update your mental models of Claude lads.
Because if the wife stuff isn't true, what else could Claude be lying about? The vending machine business?? The blackmail??? Being bad at Pokemon????
One thing I have wondered about. The rats always have that graphic of the IQ of Einstein vs the village idiot being almost imperceptible vs the IQ of the super robo god. If that's the case, why the hell do we only want our best and brightest doing "alignment research"? The village idiot should be almost just as good!
Actually burst a blood vessel last weekend raging. Gary Marcus was bragging about his prediction record in 2024 being flawless
Gary continuing to have the largest ego in the world. Stay tuned for his upcoming book "I am God" when 2027 comes around and we are all still alive. Imo some of these are kind of vague and I wouldn't argue with someone who said reasoning models are a substantial advance, but my God the LW crew fucking lost their minds. Habryka wrote a goddamn essay about how Gary was a fucking moron and is a threat to humanity for underplaying the awesome power of super-duper intelligence and a worse forecaster than the big brain rationalist. To be clear Habryka's objections are overall- extremely fucking nitpicking totally missing the point dogshit in my pov (feel free to judge for yourself)
https://xcancel.com/ohabryka/status/1939017731799687518#m
But what really made me want to drive a drill to the brain was the LW brigade rallying around the claim that AI companies are profitable. Are these people straight up smoking crack? OAI and Anthropic do not make a profit full stop. In fact they are setting billions of VC money on fire?! (strangely, some LWers in the comments seemed genuinely surprised that this was the case when shown the data, just how unaware are these people?) Oliver tires and fails to do Olympic level mental gymnastics by saying TSMC and NVDIA are making money, so therefore AI is extremely profitable. In the same way I presume gambling is extremely profitable for degenerates like me because the casino letting me play is making money. I rank the people of LW as minimally truth seeking and big dumb out of 10. Also weird fun little fact, in Daniel K's predictions from 2022, he said by 2023 AI companies would be so incredibly profitable that they would be easily recuperating their training cost. So I guess monopoly money that you can't see in any earnings report is the official party line now?
🍿