As so often: where's the control? Why not have a condition where the model responds to harmful prompts at random, and its reasoning is observed at random?
I wonder how much of this is just our own way of anthropomorphizing something, like we do when our car acts up and we swear at it. We look for human behavior in non-human things.