this post was submitted on 03 May 2026
16 points (100.0% liked)
TechTakes
This would actually be an interesting question for the more rigorous end of the mechanistic interpretability people to study. They decompose the model to find 'features' within different layers: directions (linear combinations of activations) that correspond to particular behaviors or concepts in the inputs and outputs, and that activate or deactivate one another. The famous example is when they identified a linear combination of activations in one layer corresponding to 'the Golden Gate Bridge'. When they reached in and held that feature's activation high while running the model, it would not stop talking about the bridge regardless of the topic, even while acknowledging that its answers were incorrect for the questions at hand.
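The 'reaching in and keeping the numbers high' step can be illustrated with a toy sketch. This is not Anthropic's actual code or API; `steer`, `feature_dir`, and `strength` are hypothetical names, and the real intervention happens inside a transformer's residual stream rather than on a bare vector. The math is just: project the hidden state onto the (unit-normalized) feature direction, then shift it so the component along that direction is clamped to a chosen value.

```python
import numpy as np

def steer(hidden, feature_dir, strength):
    """Toy activation steering: clamp the component of `hidden`
    along `feature_dir` to a fixed value `strength`.

    In the real experiment, an intervention like this is applied to a
    layer's activations on every forward pass, so the feature stays
    'on' no matter what the prompt is about.
    """
    d = feature_dir / np.linalg.norm(feature_dir)  # unit feature direction
    current = hidden @ d                           # how active the feature is now
    return hidden + (strength - current) * d       # shift along d to hit `strength`

rng = np.random.default_rng(0)
h = rng.normal(size=8)   # stand-in for one layer's hidden state
d = rng.normal(size=8)   # stand-in for the learned feature direction

h_steered = steer(h, d, 10.0)
d_unit = d / np.linalg.norm(d)
print(float(h_steered @ d_unit))  # feature activation is now pinned at 10.0
```

The interesting part of the question above is what an instruction like 'do not talk about X' does to `current` at each layer before any clamping: whether the model suppresses the feature, or activates it while routing around it.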
I would actually love to see what mechanistically happens to that feature when you give the model the input 'do not talk about the Golden Gate Bridge'.