I skimmed through the Llama 2 research paper; there were some sections on preventing users from circumventing the model's safety training. IIRC, one of the examples of model hijacking was disguising the request as a creative/fictional prompt. Perhaps it's some part of that training gone wrong.
Just goes to show the importance of being able to produce uncensored models.