I think they worked specifically on cheating the benchmarks, though, as well as on popular puzzles like pre-existing variants of the river crossing - it is a very large, very popular puzzle category; if the river crossing puzzle is not on the list, I don't know what would be.
Keep in mind that they are true believers, too - they think that if they cram enough little pieces of logical reasoning, taken from puzzles, into the AI, then they will get a robot god that will actually start coming up with new shit.
I very much doubt that there's some general reasoning performance improvement that results in these older puzzle variants getting solved while new ones, which aren't particularly more difficult, fail.
I seriously doubt he ever worked anywhere like that, not to mention that he’s too spineless to actually get in trouble IRL.