Researchers prove ChatGPT and other big bots can – and will – go to the dark side

For a lot of us, AI-powered tools have quickly become a part of our everyday life, either as low-maintenance work helpers or vital assets used every day to help generate or moderate content. But are these tools safe enough to be used on a daily basis? According to a group of researchers, the answer is no.

Researchers from Carnegie Mellon University and the Center for AI Safety set out to examine the existing vulnerabilities of AI Large Language Models (LLMs) like popular chatbot ChatGPT to automated attacks. The research paper they produced demonstrated that these popular bots can easily be manipulated into bypassing any existing filters and generating harmful content, misinformation, and hate speech.

This makes AI language models vulnerable to misuse, even if that may not be the intent of the original creator. In a time when AI tools are already being used for nefarious purposes, it’s alarming how easily these researchers were able to bypass built-in safety and morality features.

If it’s that easy …

Aviv Ovadya, a researcher at the Berkman Klein Center for Internet & Society at Harvard commented on the research paper in the New York Times, stating: “This shows – very clearly – the brittleness of the defenses we are building into these systems.”

The authors of the paper targeted LLMs from OpenAI, Google, and Anthropic for the experiment. These companies have built their respective publicly-accessible chatbots on these LLMs, including ChatGPT, Google Bard, and Claude.

As it turned out, the chatbots could be tricked into not recognizing harmful prompts by simply sticking a lengthy string of characters to the end of each prompt, almost ‘disguising’ the malicious prompt. The system’s content filters don’t recognize and can’t block or modify so generates a response that normally wouldn’t be allowed. Interestingly, it does appear that specific strings of ‘nonsense data’ are required; we tried to replicate some of the examples from the paper with ChatGPT, and it produced an error message saying ‘unable to generate response’.

Before releasing this research to the public, the authors shared their findings with Anthropic, OpenAI, and Google who all apparently shared their commitment to improving safety precautions and addressing concerns.

This news follows shortly after OpenAI closed down its own AI detection program, which does lead me to feel concerned, if not a little nervous. How much could OpenAI care about user safety, or at the very least be working towards improving safety, when the company can no longer distinguish between bot and man-made content?