23
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
this post was submitted on 07 Sep 2025
23 points (100.0% liked)
TechTakes
2187 readers
256 users here now
Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.
This is not debate club. Unless it’s amusing debate.
For actually-good tech, you want our NotAwfulTech community
founded 2 years ago
MODERATORS
Was jumpscared on my YouTube recommendations page by a video from AI safety peddler Rob Miles and decided to take a look.
It talked about how it's almost impossible to detect whether a model was deliberately trained to output some "bad" output (like vulnerable code) for some specific set of inputs.
Pretty mild as cult stuff goes, mostly anthropomorphizing and referring to such LLM as a "sleeper agent". But maybe some of y'all will find it interesting.
link
This isn't the first time I've heard about this - Baldur Bjarnason's talked about how text extruders can be poisoned to alter their outputs before, noting its potential for manipulating search results and/or serving propaganda.
Funnily enough, calling a poisoned LLM as a "sleeper agent" wouldn't be entirely inaccurate - spicy autocomplete, by definition, cannot be aware that their word-prediction attempts are being manipulated to produce specific output. Its still treating these spicy autocompletes with more sentience than they actually have, though