Chatbots ate my cult.
Did you use any of that kind of notation in the prompt? Or did some poor squadron of task workers write out a few thousand examples of this notation for river crossing problems in an attempt to give it an internal structure?
I didn't use any notation in the prompt, but Gemini 2.5 Pro seems to always represent the state of the problem after every step in some way. When asked if it does anything with it, it says doing this is "very important", so it may be that there's some huge invisible prompt saying it's very important to do this.
It also mentioned N cannibals and M missionaries.
My theory is that they wrote a bunch of little scripts that generate puzzles and solutions in that format. Since river crossing is one of the most popular puzzles, it would be on the list (and N cannibals / M missionaries is easy to generate variants of), although their main focus would have been the puzzles in the benchmarks they're trying to cheat.
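For what it's worth, such a generator is trivial to write. Here's a minimal sketch (my own code, not anything from Google) of the kind of script that could mass-produce N cannibals / M missionaries instances together with shortest solutions via breadth-first search:

```python
from collections import deque

def solve_mc(m_total, c_total, boat_cap=2):
    """Breadth-first search over (missionaries, cannibals, boat) states
    counted on the starting bank; returns the list of states along a
    shortest solution, or None if the instance is unsolvable."""
    def safe(m, c):
        # Missionaries must not be outnumbered on either bank
        # (a bank with zero missionaries is fine).
        m2, c2 = m_total - m, c_total - c
        return (m == 0 or m >= c) and (m2 == 0 or m2 >= c2)

    start, goal = (m_total, c_total, 1), (0, 0, 0)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        m, c, boat = state
        sign = -1 if boat else 1   # boat on start bank -> people leave it
        for dm in range(boat_cap + 1):
            for dc in range(boat_cap + 1 - dm):
                if dm + dc == 0:   # boat can't cross empty
                    continue
                nm, nc = m + sign * dm, c + sign * dc
                nxt = (nm, nc, 1 - boat)
                if (0 <= nm <= m_total and 0 <= nc <= c_total
                        and safe(nm, nc) and nxt not in parent):
                    parent[nxt] = state
                    queue.append(nxt)
    return None
```

Sweep over (N, M), dump the state sequences in whatever notation you like, and you have an endless supply of training pairs for this puzzle family, unsolvable instances included.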
edit: here's one of the logs:
Basically it keeps trying to brute force the problem. It gets the first 2 moves correct, but in a stopped-clock manner: if there are 2 people and 1 boat, they both take the boat; if there are 2 people and >=2 boats, each of them takes a boat.
It keeps doing the same shit until eventually its state tracking fails, or its reading of the state fails, and then it outputs the failure as a solution. Sometimes it deems it impossible:
All tests were done with Gemini 2.5 Pro. I can post links if you need them, but links don't include the "thinking" log, and I also suspect that if more than N people come through a link, they take a look at it. Nobody really shares botshit unless it's funny or stupid. A lot of people independently asking the same problem would often happen when there's a new homework question, so they can't use that as a signal so easily.
And it is Google we're talking about, lol. If no one uses their AI shit, they just replace something people do use with it (see also: search).
It's Google though; if nobody uses their shit, they just put it inside their search.
It's only gonna go away when they run out of cash.
edit: whoops replied to the wrong comment
I just describe it as "computer scientology, nowhere near as successful as the original".
The other thing is that he's a Thiel project, different from but no saner than Curtis Yarvin aka Moldbug. So if they've heard of Moldbug's political theories (which increasingly many people have, because of, well, them being enacted), it's easy to give a general picture of total fucking insanity funded by Thiel money. It doesn't really matter what the particular insanity is, and it matters even less now that the AGI shit hit the mainstream entirely bypassing anything Yudkowsky had to say on the subject.
Not really. Here's the chain-of-word-vomit that led to the answers:
Note that in the "it's impossible" answer, it correctly echoes that you can take one other item with you, and does not bring the duck back (while the old overfitted GPT-4 obsessively brought items back). In the duck + 3 vegetables variant, it has a correct answer in the wordvomit, but, not being an AI enthusiast, it can't actually choose that correct answer (a problem shared with the monkeys on typewriters).
I'd say it clearly isn't ignoring the prompt or differences from the original river crossings. It just can't actually reason, and the problem requires a modicum of reasoning, much as unloading groceries from a car does.
It’s a failure mode that comes from pattern matching without actual reasoning.
Exactly. Also, looking at its chain-of-wordvomit (which apparently I can't share other than by cutting and pasting it somewhere), I don't think this is the same as GPT-4 overfitting to the original river crossing and always bringing items back needlessly.
Note also that in one example it discusses moving the duck and another item across the river (so "up to two other items" works); it is not ignoring the prompt, and it isn't even trying to bring anything back. And its answer (calling it impossible) has nothing to do with the original.
In the other one it does bring items back; it tries different orders, and even finds an order that actually works (with two unnecessary moves), but because it isn't an AI fanboy reading tea leaves, it still gives the wrong answer.
Here's the full logs:
Content warning: AI wordvomit which is so bad it is folded hidden in a google tool.
A Nobel Prize in Physics for attempting to use physics in AI, except it didn't really work very well; then one of the guys worked on a better, more purely mathematical approach that actually did work, and got the Turing Award for that, but that's not what the prize is for; while the other guy did some other work, but that's not what the prize is for either. AI will solve all physics!!!111
Maybe if the potato casserole is exploded in the microwave by another physicist, on his way to start a resonance cascade...
(i'll see myself out).
Frigging exactly. It's a dumbass dead end that is fundamentally incapable of doing the vast majority of things ascribed to it.
They keep imagining that it would actually learn some underlying logic from a lot of text. All it can do is store a bunch of applications of said logic, as in a giant table. Deducing underlying rules instead of simply memorizing particular instances of them is a form of compression; there wasn't much compression going on, and now that the models are so over-parametrized, there's even less.
The counting failure in general is even clearer, and lacks the excuse of unfavorable tokenization. The AI hype would have you believe that just an incremental improvement in multi-modality or scaffolding will overcome this, but I think they need to make more fundamental improvements to the entire architecture they are using.
Yeah.
I think the failure could be extremely fundamental - maybe local optimization of a highly parametrized model is fundamentally unable to properly learn counting (other than via memorization).
After all, there's a very large number of ways a highly parametrized model can do a good job of predicting the next token without actually counting. What makes counting special, versus memorization, is that it's a relatively compact representation, but there's no reason for a neural network to favor compact representations.
The "correct" counting may just be a very tiny local minimum, with a tall hill all around it and no valley leading there. If that's the case, then local optimization will never find it.
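To make the memorization-vs-compression point concrete, here's a toy illustration (plain Python, not a neural net; the "lookup" stand-in is my own construction) of the two ways a system can "know" how to count:

```python
# Two ways to "know" how to count the letter 'r' in a word:
# a lookup table built from seen examples, versus an actual loop.
# (Toy illustration only -- a real network is neither, but it can
# end up much closer to the table than to the loop.)

train = ["strawberry", "river", "carrot", "banana"]

# "Memorization": store answers for the training inputs only.
lookup = {w: w.count("r") for w in train}

def count_memorized(word):
    # Unseen inputs fall back to a confident wrong answer,
    # much like pattern-matching to the nearest seen example.
    return lookup.get(word, 0)

def count_algorithmic(word):
    # The compact rule: iterate and increment. Works on any input,
    # and its size doesn't grow with the training set.
    total = 0
    for ch in word:
        if ch == "r":
            total += 1
    return total

print(count_memorized("strawberry"))   # 3 (seen in training)
print(count_memorized("rural"))        # 0 (unseen: wrong)
print(count_algorithmic("rural"))      # 2 (the rule generalizes)
```

The loop is the "tiny local minimum": far more compact, but nothing about next-token loss pushes the optimizer toward it when the giant table also drives the loss down.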
It would have to be more than just river crossings, yeah.
Although I'm also dubious that their LLM is good enough for universal river-crossing puzzle solving using a tool. It's not that simple: the constraints have to be translated into a format the tool understands, and the answer translated back. I got told that o3 solves my river crossing variant, but the chat log they gave had incorrect code being run and then a correct answer magically appearing, so I don't think it was anything quite as general as that.
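That translation step is exactly where it would break down. As an illustration, here's a generic solver where the caller has to encode the variant's constraints by hand; the duck-eats-vegetables rule below is my assumption, since the thread doesn't spell out the variant's exact rules:

```python
from collections import deque
from itertools import combinations

def river_search(items, boat_capacity, bank_ok):
    """Generic river-crossing BFS. All `items` must cross with the
    farmer, who ferries up to `boat_capacity` of them per trip;
    `bank_ok(bank)` is a caller-supplied predicate saying whether a
    set of items may be left without the farmer. Returns the list of
    cargos per crossing, or None if unsolvable."""
    everything = frozenset(items)
    start = (everything, 0)      # (items on start bank, farmer's side)
    goal = (frozenset(), 1)
    parent = {start: (None, None)}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            moves = []
            while state != start:
                state, cargo = parent[state]
                moves.append(cargo)
            return moves[::-1]
        here, side = state
        bank = here if side == 0 else everything - here
        for k in range(boat_capacity + 1):
            for cargo in combinations(sorted(bank), k):
                new_here = (here - frozenset(cargo) if side == 0
                            else here | frozenset(cargo))
                # The bank the farmer just left is now unattended.
                unattended = (new_here if side == 0
                              else everything - new_here)
                nxt = (new_here, 1 - side)
                if bank_ok(unattended) and nxt not in parent:
                    parent[nxt] = (state, cargo)
                    queue.append(nxt)
    return None

# Hypothetical variant rule: the duck eats any vegetable left
# alone with it, and the boat fits the farmer plus two items.
def bank_ok(bank):
    return not ("duck" in bank and len(bank) > 1)

moves = river_search(["duck", "carrot", "cabbage", "pea"], 2, bank_ok)
```

The solver itself is the easy part; getting an LLM to reliably produce `bank_ok` and the item list from an arbitrary prose variant is the part I don't believe happens.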