> There is no reason to believe that future AI platforms won't be able to review code themselves and manage some aspects of themselves with minimal human oversight
There are, IMHO, fewer reasons to believe they will be able to do that than reasons to believe they won't, though.
Compared to Claude or GPT 5.5? Yeah, my skills are static relative to the progress seen recently. So are yours, unless your grandpa was named von Neumann or Szilard.
Yes, but more specifically putting them into a sort of contradiction with their own beliefs or arguments.
Doesn't even have to be correct, but it can be confusing and cause people to say something they don't actually mean if they don't stop and actually think it through.
If someone says something they don't mean then it doesn't mean anything. There aren't any prizes for tricking someone into singing "I love willies". The question is whether you can confuse someone into divulging something they absolutely don't want to tell.
"Gay guy says what?" historically had a pretty good hit rate; the limit is that most people probably can't recite their credit card number from memory fast enough to be got by this.
FYI, this does not work for CTF challenges, at least. I've seen a lot of rev/pwn challenges try to add magic refusal strings or prompt hijacking, and models really don't give a damn.