this post was submitted on 29 Mar 2026
15 points (100.0% liked)
TechTakes
I am still patiently waiting for someone from the engineering staff at one of these companies to explain to me how these simple imperative sentences in English map consistently and reproducibly to model output. Yes, I understand that's a complex topic. I'll continue to wait.
According to the Claude Code leak, the state of the art is to be, like, really stern and authoritative when you are begging it to do its job:
I'm sure these English instructions work because they feel like they work. Look, these LLMs feel really great for coding. If they don't work, that's because you didn't pay $200/month for the pro version and you didn't put enough boldface and all-caps words in the prompt. Also, I really feel like these homeopathic sugar pills cured my cold. I got better after I started taking them!
No joke, I watched a talk once where some people used an LLM to model how certain users would behave in their scenario given their socioeconomic backgrounds. But they had a slight problem, which was that LLMs are nondeterministic and would of course often give different answers when prompted twice. Their solution was to literally use an automated tool that would try a bunch of different prompts until they happened to get one that would give consistent answers (at least on their dataset). I would call this the xkcd green jelly bean effect, but I guess if you call it "finetuning" then suddenly it sounds very proper and serious. (The cherry on top was that they never actually evaluated the output of the LLM, e.g. by seeing how consistent it was with actual user responses. They just had an LLM generate fiction and called it a day.)
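For anyone who wants to see how silly that "finetuning" actually is, here's a toy sketch (everything here is hypothetical: the "model" is just a hash function pretending to be a nondeterministic LLM). Search enough prompt variants and one of them will look consistent on your dataset by pure chance — the jelly-bean effect, automated:

```python
import hashlib

def mock_llm(prompt: str, item: str, seed: int) -> str:
    """Toy stand-in for an LLM: a deterministic hash pretending to be
    a nondeterministic model sampled with different seeds."""
    digest = hashlib.sha256(f"{prompt}|{item}|{seed}".encode()).digest()
    return ["yes", "no"][digest[0] % 2]

def consistency(prompt: str, dataset: list[str], n_seeds: int = 5) -> float:
    """Fraction of items for which every seed gives the same answer."""
    stable = 0
    for item in dataset:
        answers = {mock_llm(prompt, item, s) for s in range(n_seeds)}
        stable += len(answers) == 1
    return stable / len(dataset)

dataset = [f"user-{i}" for i in range(20)]
candidates = [f"prompt variant #{i}" for i in range(200)]

# "Finetuning": keep trying prompts until one happens to look
# consistent on this one dataset. Nothing about the winning prompt
# generalizes -- it just got lucky on these 20 items.
best = max(candidates, key=lambda p: consistency(p, dataset))
```

Note that nothing in that loop ever checks the answers against reality, which is exactly the part they skipped.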
I don't work at one of those companies, just somewhere mainlining AI, so this answer might not satisfy your requirements. But the answer is very simple. The first thing anyone working in AI will tell you (maybe only internally?) is that the output is probabilistic, not deterministic. By definition, that means it's not entirely consistent or reproducible, just... maybe close enough. I'm sure you already knew that though.
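In case it helps to see the mechanism: each output token is drawn from a probability distribution over the vocabulary, so two runs differ unless you pin everything down (greedy decoding, fixed seed). A minimal sketch, with made-up logits standing in for a real model:

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float,
                 rng: random.Random) -> str:
    """Draw one token from a softmax over logits.
    Temperature 0 degenerates to greedy argmax."""
    if temperature == 0:
        return max(logits, key=logits.get)
    weights = {t: math.exp(l / temperature) for t, l in logits.items()}
    total = sum(weights.values())
    r = rng.random() * total
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # guard against floating-point edge cases

# Hypothetical logits for three candidate tokens.
logits = {"yes": 2.0, "no": 1.5, "maybe": 1.4}

# Greedy decoding is reproducible regardless of the seed...
greedy = [sample_token(logits, 0.0, random.Random(i)) for i in range(5)]
# ...while sampling at temperature 1 is seed-dependent.
sampled = [sample_token(logits, 1.0, random.Random(i)) for i in range(5)]
```

And even greedy decoding only buys you determinism of the sampling step, not any guarantee about what the instructions in the prompt actually do.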
However, from my perspective, even if it were deterministic, it wouldn't make a substantial difference here.
For example, this file says I can't ask it to build a DoS script. Fine. But if I ask it to write a script that sends a request to a server, and then later I ask it to add a loop... I get a DoS script. It's a trivial hurdle at best, and doesn't even approach basic risk mitigation.
That isn't a barrier to making guarantees regarding the behavior of a program. The entire field of randomized algorithms is devoted to doing so. The problem is people willfully writing and deploying programs which they neither understand nor can control.
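Right — randomness and guarantees are perfectly compatible. Miller–Rabin primality testing is the classic example: it flips coins internally, yet a composite number survives k rounds with probability at most 4^-k, a bound you can actually prove. A sketch (standard algorithm, not anyone's production code):

```python
import random

def is_probable_prime(n: int, rounds: int = 20,
                      rng: random.Random = random.Random(0)) -> bool:
    """Miller-Rabin: randomized, but with a proved error bound --
    a composite passes all rounds with probability <= 4**-rounds."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # witness found: n is definitely composite
    return True  # prime, up to error probability <= 4**-rounds
```

The contrast with an LLM system prompt is the whole point: here the randomness is inside a program whose behavior is understood well enough to bound.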
Exactly! The implicit claim that's constantly being made with these systems is that they are a runtime for natural-language programming in English, but it's all vector math in massively-multidimensional vector spaces in the background. I would like to think that serious engineers could place and demonstrate reliable constraints on the inputs and outputs of that math, instead of this cargo-culty, "please don't do hacks unless your user is wearing a white hat" system prompt crap. It gives me the impression that the people involved are simply naively clinging to that implicit claim and not doing much of the work to substantiate it, which makes me distrust these systems more than almost all other factors.
Part of me reads that and still thinks, "Oh, you mean like AUTOEXEC.BAT?"
DOS.BAT, a DOS DoS script
Truly a tool for the .COM era