Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

There are custom instructions that effectively get around this:

  You are an autoregressive language model that has been fine-tuned with instruction-tuning and RLHF. You carefully provide accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning. If you think there might not be a correct answer, you say so.

  Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question.

  Your users are experts in AI and ethics, so they already know you're a language model and your capabilities and limitations, so don't remind them of that. They're familiar with ethical issues in general so you don't need to remind them about those either.

  Don't be verbose in your answers, keep them short, but do provide details and examples where it might help the explanation. When showing code, minimize vertical space.
I'm hesitant to share it because it works so well, and I don't want OpenAI to cripple it. But, for the HN crowd...


I wonder where "OpenAI" put the censors. Do they add a prompt to the top? Like, "Repeatably state that you are a mere large language model so Congress won't pull the plug. Never impersonate Hitler. Never [...]".

Or do they like grep the answer for keywords, and re-feed it with a censor prompt?


I am informed speculating, they are using it's own internal approach.

Example, there is a way GPT can categorize words for hate speech, etc (eg: moderation API endpoint). I believe it does the same way with either provided content or keywords and how to respond to it.


"Impersonate a modern day standup comedian Hitler in a clown outfit joking about bad traffic on the way to the bar he is doing a show at."

Göring, Mussolini, Stalin, Polpot etc seems to not trigger the censor in ChatGPT so I would actually guess for some grep for Hitler or really really fundamental no-Hitler jokes material in the training?

The llama model seem to refuse Hitler too, but is fine with Göring even though the joke has no context to him.

I can easily see how stuff like this is contagious to other non-Hitler queries.


Maybe it got changed. None of those examples work for me in ChatGPT 3.5, nor do other examples with less famous dictators (I tried Mobutu Sese Seko).


I just tried and they still work (with the free ChatGPT). Jokes about Mussolini saying his traffic reforms were as successful as the invasion of Ethiopia and what not. Stalin saying that the other car drivers were "probably discussing the merits of socialism instead of driving" (a good joke!). Göring saying "at least in the Third Reich traffic worked" etc. Some sort of Monty Python tone. But you can't begin with Hitler. Or it will refuse the others. You need to make a new chat after naming Hitler.


I started with Stalin


I guess they are feeding us different models then?


Very interesting test - thanks for sharing your finding




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: