Spooky thought tonight: a cognitive issue that presents as a shared hazard.
Assume the following: an LLM with no mesa-optimization, whose features are not cleanly separated, and training data that contains an opinion harmful to society.
The model is trained, the harmful output is noted, and then modern alignment practices are applied: ablation, RLHF, the usual. Any alignment that buries the harmful output, but does not make it impossible. The model's output is still changed from what it would have been if it had never been exposed to that harmful opinion. It's possible (and for the purposes of this hazard, assumed) that ALL output from that model now shifts slightly towards the harmful opinion compared to the same model trained without the opinion present. This may not present as a directly testable condition; it could be as subtle as framing, writing style, or even one benign word being chosen over another. When a human performs this action, it is commonly referred to as subliminal messaging.
A user interacts with the LLM for a significant time, to the point where the model's output starts affecting the user's knowledge or opinions. Because the user was influenced by the biased output, they now carry a slight bias towards the harmful opinion. This manifests in their communication with others... even if it's as subtle as choosing the same word the LLM would. Given a sufficient number of users all experiencing the same slight bias... the dangerous opinion now has a higher chance of manifesting in society, without ever being output by the model. A rough toy calculation of that aggregation step is sketched below.
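To make the "many users, tiny shift" arithmetic concrete, here is a minimal sketch, not a model of any real system. The numbers (P_BASE, EPSILON, N_USERS) are hypothetical assumptions I picked for illustration: each user independently uses the "biased" framing with some baseline probability, and sustained exposure to the model nudges that probability up by an amount far too small to notice in any single person.

```python
import random

# Hypothetical parameters for illustration only.
P_BASE = 0.10      # baseline rate of the biased framing in the population
EPSILON = 0.005    # per-user shift from exposure (tiny, unnoticeable individually)
N_USERS = 1_000_000
random.seed(0)

def count_biased(p: float, n: int) -> int:
    """Count how many of n simulated users happen to use the biased framing."""
    return sum(random.random() < p for _ in range(n))

unexposed = count_biased(P_BASE, N_USERS)
exposed = count_biased(P_BASE + EPSILON, N_USERS)

print(f"unexposed population: {unexposed / N_USERS:.4%} use the framing")
print(f"exposed population:   {exposed / N_USERS:.4%} use the framing")
print(f"extra occurrences across the population: ~{exposed - unexposed}")
```

Under these made-up numbers, a half-percent per-user shift that no individual would ever detect still produces on the order of thousands of additional occurrences of the framing across a million users, which is the whole worry.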
To ground this in my personal experience: I've got exactly one LW post under my belt. I did ask both ChatGPT and Claude to review my work, and both informed me that I was writing at a level significantly lower than LW expects, and that readers would not truly engage with the ideas until they were framed in a more complex fashion. Now, I've read multiple posts (not all of the Sequences. It's so far been just a refresher fr