It's also worth noting that your new variation of this “puzzle” may be the first one that describes a real-world use case. This kind of problem is probably being solved all over the world all the time (with boats, cars and many other means of transportation). Many people who don't know any logic puzzles at all would come up with the right answer straight away. Of course, AI also fails at this because it generates its answers from training data, where physical reality doesn't exist.
Yeah I think the best examples are everyday problems that people solve all the time but don't explicitly write out solutions step by step for, or not in the puzzle-answer form.
It's not even a novel problem at all, I'm sure there are even plenty of descriptions of solutions to it as part of stories and such. Just not as "logical puzzles" due to triviality.
What really annoys me is when they claim high performance on benchmarks consisting of fairly difficult problems. This is basically fraud, since they know full well it is still entirely "knowledge" reliant, and even take steps to augment it with generated problems and solutions.
I guess the big sell is that it could use bits and pieces of logic gleaned from other solutions to solve a "new" problem. Except it cannot.
It's google though, if nobody uses their shit they just put it inside their search.
It's only gonna go away when they run out of cash.
edit: whoops replied to the wrong comment
For a while now I have wondered how many of those "the llms all fail at this very basic task" problems that suddenly get fixed are not fixed by the model getting better but just by a bandaid solution which solves that specific problem. (Putting another llm in front of the input to detect the problem and then sending it to the llm that is trained on that specific problem would be a bandaid solution btw, it is just adding more under the trenchcoat). And even if somebody were to answer this question, the well is so poisoned I'm not sure if I could believe them.
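To make the "more under the trenchcoat" idea concrete, here is a minimal sketch of what such a bandaid could look like. Everything in it is hypothetical: the function names, the routing table, and the use of a regex as the detector (the comment above imagines another llm doing the detection) are stand-ins so the example runs, not a description of what any vendor actually does.

```python
import re

def general_model(prompt: str) -> str:
    """Hypothetical stand-in for the main model."""
    return "general answer"

def river_crossing_specialist(prompt: str) -> str:
    """Hypothetical stand-in for a model tuned on one puzzle family."""
    return "specialist answer"

# Hypothetical detector table. A real system might put another LLM here;
# a regex keeps the sketch self-contained and runnable.
ROUTES = [
    (re.compile(r"\bcross(ing)?\b.*\briver\b|\briver\b.*\bcross", re.I | re.S),
     river_crossing_specialist),
]

def answer(prompt: str) -> str:
    """Send the prompt to the first matching specialist, else the general model."""
    for pattern, handler in ROUTES:
        if pattern.search(prompt):
            return handler(prompt)
    return general_model(prompt)
```

From the outside this looks like one model that "got better" at river crossings, while nothing about the general model has changed.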
I have no doubt whatsoever that it is what they're doing. It's also how they fix hands and faces in picture generators.
If you do it from the duck's perspective it still breaks.
To reduce the traffic to their server...
A human who is not me, a carrot, a potato, and I need to cross a river. There is a boat that can only carry 3 things across the river. I cannot be left alone because if I am, I will run away. How does the human get us across the river
The response:
Here's how the human can get everyone across the river:
Trips:
Trip 1: The human takes the carrot and the potato across the river.
Trip 2: The human returns alone.
Trip 3: The human takes you across the river.
Trip 4: The human returns with the carrot.
Trip 5: The human takes the potato across the river.
Trip 6: The human returns alone.
Trip 7: The human takes the carrot and you across the river.
Explanation:
This solution ensures that you are never left alone on either side of the river. The human acts as the ferry, carefully transporting the items and you while always being present to prevent your escape.
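For comparison, a brute-force search over this puzzle is tiny. The sketch below is one reading of the prompt, not the only one. It assumes that only the human can row, that the human counts toward the boat's limit of 3, and that "I cannot be left alone" means the narrator must always be on the same bank as the human. The names `ITEMS` and `solve` are made up for illustration.

```python
from collections import deque
from itertools import combinations

ITEMS = ("me", "carrot", "potato")  # everything except the human

def solve(capacity=3):
    """Breadth-first search for the shortest list of boat trips.

    State: (frozenset of things on the far bank, human_on_far_bank).
    Assumptions: only the human rows, the human counts toward
    `capacity`, and "me" must always share a bank with the human.
    """
    start = (frozenset(), False)
    goal = (frozenset(ITEMS), True)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (far, human_far), path = queue.popleft()
        if (far, human_far) == goal:
            return path
        # Things on the human's current bank are candidates for this trip.
        here = [x for x in ITEMS if (x in far) == human_far]
        for n in range(capacity):  # 0 .. capacity - 1 passengers
            for cargo in combinations(here, n):
                new_far = far - set(cargo) if human_far else far | set(cargo)
                new_human = not human_far
                # "me" may never be separated from the human.
                if ("me" in new_far) != new_human:
                    continue
                state = (frozenset(new_far), new_human)
                if state not in seen:
                    seen.add(state)
                    direction = "back" if human_far else "across"
                    queue.append((state, path + [(direction, cargo)]))
    return None

for direction, cargo in solve():
    print(direction, cargo)
```

Under these assumptions the search returns three trips, with the narrator in the boat on every one of them. If the human does not count toward the limit (`capacity=4` here), a single trip is enough. The seven-trip answer above leaves the narrator behind on its very first trip.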
The fact that it appears to be trying to create a symbolic representation of the problem is interesting, since that's the closest I've ever seen this come to actually trying to model something rather than just spewing raw text, but the model itself looks nonsensical, especially for such a simple problem.
Did you use any of that kind of notation in the prompt? Or did some poor squadron of task workers write out a few thousand examples of this notation for river crossing problems in an attempt to give it an internal structure?
I would be 0% surprised to learn that the modelfarmers "iterated" to "hmm, people are doing a lot of logic tests, let's handle those better" and that that's what got us here
(I have no evidence for this, but to me it seems a completely obvious/evident way for them to try to keep the party going)
I have two theories on how the modelfarmers (I like that slang, it seems more fitting than "devs" or "programmers") approached this...
1. Like you theorized, they noticed people doing lots of logic tests, including twists on standard logic tests (that the LLMs were failing hard on), so they got (i.e. paid) temp workers to write a bunch of twists on standard logic tests. And here we are, with it able to solve a twist on the duck puzzle, but not really better in general.
2. There has been a lot of talk of synthetically generated data sets (since they've already robbed the internet of all the text they could). Simple logic puzzles could actually be procedurally generated, including the notation diz noted. The modelfarmers have over-generalized the "bitter lesson" (or maybe they're just lazy/uninspired/looking for a simple solution they can tell the VCs and business majors) and think just some more data, deeper network, more parameters, and more training will solve anything. So you get the buggy attempt at logic notation from synthetically generated logic notation. (Which still doesn't quite work, lol.)
I don't think either of these approaches will actually work for letting LLMs solve logic puzzles in general, these approaches will just solve individual cases (for solution 1) and make the hallucinations more convincing (for 2). For all their talk of reaching AGI... the approaches the modelfarmers are taking suggest a mindset of just reaching the next benchmark (to win more VC, and maybe market share?) and not of creating anything genuinely reliable, much less "AGI". (I'm actually on the far optimistic end of sneerclub in that I think something useful might be invented that lasts the coming AI winter... but if the modelfarmers just keep scaling and throwing more data at the problem, I doubt they'll even manage that much).
(excuse possible incoherence it’s 01:20 and I’m entirely in filmbrain (I’ll revise/edit/answer questions in morning))
re (1): while that is a possibility, keep in mind that all this shit also operates/exists in a metrics-as-targets obsessed space. they might not present end user with hit% but the number exists, and I have no reason to believe that isn’t being tracked. combine that with social effects (public humiliation of their Shiny New Model, monitoring usage in public, etc etc) - that’s where my thesis of directed prompt-improvement is grounded
re (2): while they could do something like that (synthetic derivation, etc), I dunno if that’d be happening for this. this is outright a guess on my part, a reach based on character based on what I’ve seen from some of the field, but just…..I don’t think they’d try that hard. I think they might try some limited form of it, but only so much as can be backed up in relatively little time and thought. “only as far as you can stretch 3 sprints” type long
(the other big input in my guesstimation re (2) is an awareness of the fucked interplay of incentives and glorycoders and startup culture)
I don’t think they’d try that hard.
Wow lol... 2) was my guess at an easy/lazy/fast solution, and you think they are too lazy for even that? (I think a "proper" solution would involve substantial modifications/extensions to the standard LLM architecture, and I've seen academic papers with potential approaches, but none of the modelfarmers are actually seriously trying anything along those lines.)
lol, yeah
"perverse incentives rule everything around me" is a big thing (observable) in "startup"[0] world because everything[1] is about speed/iteration. for example: why bother spending a few weeks working out a way to generate better training data for a niche kind of puzzle test if you can just code in "personality" and make the autoplag casinobot go "hah, I saw a puzzle almost like this just last week, let's see if the same solution works...."
i.e. when faced with a choice of hard vs quick, cynically I'll guess the latter in almost all cases. there are occasional exceptions, but none of the promptfondlers and modelfarmers are in that set imo
[0] - look, we may wish to argue about what having billions in vc funding categorizes a business as. but apparently "immature shitderpery" is still squarely "startup"
[1] - in the bayfucker playbook. I disagree.
I think they worked specifically on cheating the benchmarks, though. As well as popular puzzles like pre-existing variants of the river crossing - it is a very large puzzle category, very popular, if the river crossing puzzle is not on the list I don't know what would be.
Keep in mind that they are also true believers - they think that if they cram enough little pieces of logical reasoning, taken from puzzles, into the AI, then they will get a robot god that will actually start coming up with new shit.
I very much doubt that there's some general reasoning performance improvement that results in these older puzzle variants getting solved, while new ones that aren't particularly more difficult, fail.
Did you use any of that kind of notation in the prompt? Or did some poor squadron of task workers write out a few thousand examples of this notation for river crossing problems in an attempt to give it an internal structure?
I didn't use any notation in the prompt, but gemini 2.5 pro seems to always represent the state of the problem after every step in some way. When asked whether it does anything with it, it says it is "very important", so it may be that there's some huge invisible prompt that says it's very important to do this.
It also mentioned N cannibals and M missionaries.
My theory is that they wrote a bunch of little scripts that generate puzzles and solutions in that format. Since river crossing is one of the most popular puzzles, it would be on the list (and N cannibals M missionaries is easy to generate variants of), although their main focus would have been the puzzles in the benchmarks that they are trying to cheat.
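To show how little effort such a script would take, here is a rough sketch of a generator for missionaries-and-cannibals variants with step-by-step state notation. It is purely illustrative. The notation format, the parameter ranges, and the function names (`safe`, `solve`, `render`, `generate`) are all invented here, and it uses the textbook rule that missionaries may not be outnumbered on either bank (it ignores who is in the boat mid-crossing).

```python
from collections import deque
import random

def safe(m, c):
    """A bank is safe if it has no missionaries or they are not outnumbered."""
    return m == 0 or m >= c

def solve(M, C, cap):
    """BFS over states (missionaries_left, cannibals_left, boat_on_left).

    Returns the shortest list of states from start to goal, or None.
    """
    start, goal = (M, C, True), (0, 0, False)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        m, c, left = state
        # People standing on the bank where the boat currently is.
        avail_m, avail_c = (m, c) if left else (M - m, C - c)
        sign = -1 if left else 1
        for dm in range(avail_m + 1):
            for dc in range(avail_c + 1):
                if not 1 <= dm + dc <= cap:
                    continue  # boat needs a rower and has limited room
                nm, nc = m + sign * dm, c + sign * dc
                nxt = (nm, nc, not left)
                if safe(nm, nc) and safe(M - nm, C - nc) and nxt not in parent:
                    parent[nxt] = state
                    queue.append(nxt)
    return None

def render(M, C, cap, path):
    """Write the puzzle plus a state line for every step (made-up notation)."""
    lines = [f"{M} missionaries and {C} cannibals must cross a river. "
             f"The boat holds at most {cap} and needs at least one rower."]
    for step, (m, c, left) in enumerate(path):
        boat = "L" if left else "R"
        lines.append(f"step {step}: left=({m}M,{c}C) "
                     f"right=({M - m}M,{C - c}C) boat={boat}")
    return "\n".join(lines)

def generate(n, seed=0):
    """Produce n solvable puzzle/solution texts with random parameters."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        M = rng.randint(1, 5)
        C = rng.randint(1, M)
        cap = rng.randint(2, 4)
        path = solve(M, C, cap)
        if path is not None:  # skip unsolvable parameter combinations
            out.append(render(M, C, cap, path))
    return out

print(generate(1)[0])
```

A generator like this can produce unlimited correct examples of one puzzle family, which fits a model that handles the well-known variants and still falls over on a twist the script never emitted.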
edit: here's one of the logs:
Basically it keeps on trying to brute force the problem. It gets the first 2 moves correct, but in a stopped-clock manner - if there's 2 people and 1 boat they both take the boat, if there's 2 people and >=2 boats, then each of them takes a boat.
It keeps doing the same shit until eventually its state tracking fails, or its reading of the state fails, and then it outputs the failure as a solution. Sometimes it deems it impossible:
All tests done with gemini 2.5 pro, I can post links if you need them but links don't include their "thinking" log and I also suspect that if >N people come through a link they just look at it. Nobody really shares botshit unless it's funny or stupid. A lot of people independently asking the same question would also happen whenever there's a new homework question, so they can't use that as a signal so easily.
I'm not familiar with the cannibal/missionary framed puzzle, but reading through it the increasingly simplified notation reads almost like a comp sci textbook trying to find or outline an algorithm for something, but with an incredibly simple problem. We also see it once again explicitly acknowledge then implicitly discard part of the problem; in this case it opens by acknowledging that each boat can carry up to 6 people and that each boat needs at least one person, but somehow gets stuck on the pattern that we need to alternate trips left and right and each trip can only consist of one boat. It's still pattern matching rather than reasoning, even if the matching gets more sophisticated.
Engaging with AI to show its faults is part of why it won't go away, because it counts as usage.
I wouldn't think that our poking and prodding is sufficient to actually impact usage metrics, and even if it is I don't think diz is using a paid version (not that even the "pro" offerings are actually profitable per query) so at most we're hastening the financial death spiral.
Besides, they've shown an ability to force the narrative of their choosing onto basically any data in order to keep pulling in the new investor money that's driven this bubble well beyond any sensible assessment of the market's demand for it.
And it is Google we're talking about, lol. If no one uses their AI shit they just replace something people use with it (also see search).