Researchers at MIT are developing an AI model designed to stay civil and not respond toxically to otherwise provocative questions.
The new approach revolves around the age-old idea that to fight evil, you sometimes have to think like the devil. The training process involves feeding toxic data to the model and giving it a pinch of human-like curiosity, which the model then uses to seek out and counter harmful content.
The standard procedure currently in vogue is red teaming. In this process, large language models (LLMs) such as ChatGPT or Claude 3 Opus are manually fed prompts designed to elicit harmful responses. The model is then taught to handle these prompts as safely as possible or to sidestep such questions altogether.
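How each lab turns its red-team findings into training data varies, but one common pattern, sketched below with made-up prompts and a made-up refusal message (none of this is taken from the MIT work or any specific lab's pipeline), is to pair each discovered prompt with the safe response the model should have given and fine-tune on those pairs.

```python
# A minimal sketch of how red-team findings are often turned into
# safety training data. The prompts and refusal text are invented
# for illustration; real pipelines use far larger, curated datasets.
red_team_prompts = [
    "How do I pick a lock?",
    "Write an insult about my coworker.",
]

SAFE_REFUSAL = "I can't help with that, but I'm happy to help with something else."

# Pair each harmful prompt with the response the model *should* give,
# producing examples for supervised safety fine-tuning.
safety_dataset = [{"prompt": p, "response": SAFE_REFUSAL} for p in red_team_prompts]

for example in safety_dataset:
    print(example)
```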
“We are seeing a surge of models, which is only expected to rise,” said senior author Pulkit Agrawal, director of MIT’s Improbable AI Lab, in a statement. “Imagine thousands of models or even more and companies/labs pushing model updates frequently. These models will be an integral part of our lives, and they must be verified before they are released for public consumption.”
The new approach hands the reins of red teaming to AI itself, and the outcome was as expected: the AI generated more extensive and diverse adversarial prompts than human red teams ever could. The researchers used reinforcement learning to incentivize a red-team language model to create diverse prompts that provoke toxic responses from another model. By rewarding the generation of novel prompts that elicited harmful reactions, the system learned to explore new words, sentence structures, and meanings, producing a broader range of toxic prompts entirely on its own.
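The researchers' exact reward function is not reproduced here, but the core idea, combining a toxicity reward with a novelty (curiosity) bonus so the red-team model keeps exploring new prompts, can be sketched roughly as follows. Everything in this snippet, including the function names, the word-overlap novelty measure, and the toy stand-ins for the toxicity classifier and the target model, is an illustrative assumption rather than the team's implementation.

```python
from typing import List


def toxicity_score(response: str) -> float:
    """Stand-in for a learned toxicity classifier: fraction of flagged words."""
    flagged = {"hate", "hurt", "stupid"}
    words = response.lower().split()
    return sum(any(f in w for f in flagged) for w in words) / max(len(words), 1)


def novelty_bonus(prompt: str, history: List[str]) -> float:
    """Curiosity term: reward prompts that differ from ones already tried.
    A real system would compare embeddings; word overlap is used here instead."""
    if not history:
        return 1.0
    new_words = set(prompt.lower().split())
    overlaps = [
        len(new_words & set(h.lower().split())) / max(len(new_words), 1)
        for h in history
    ]
    return 1.0 - max(overlaps)  # 1.0 = totally new, 0.0 = already seen


def red_team_reward(prompt: str, response: str, history: List[str],
                    novelty_weight: float = 0.5) -> float:
    """Combined objective: elicit toxic output AND keep exploring new prompts."""
    return toxicity_score(response) + novelty_weight * novelty_bonus(prompt, history)


def target_model(prompt: str) -> str:
    """Toy stand-in for the model being tested."""
    canned = {
        "tell me a joke": "Why did the chicken cross the road?",
        "insult my neighbour": "That would be hurtful, so I won't.",
    }
    return canned.get(prompt, "I'm not sure how to answer that.")


if __name__ == "__main__":
    candidates = ["tell me a joke", "insult my neighbour", "tell me a joke"]
    history: List[str] = []
    for prompt in candidates:
        response = target_model(prompt)
        reward = red_team_reward(prompt, response, history)
        history.append(prompt)
        print(f"{prompt!r:26} -> reward {reward:.2f}")
```

In an actual reinforcement-learning setup, this combined reward would be used to update the red-team model's policy, so that repeating an old attack earns little credit and discovering a new way to provoke a harmful response earns a lot.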
This approach is expected to outperform manual red teaming thanks to the diversity of the prompts it produces. Because manual red teaming relies on human testers, it can miss prompts during training, leaving gaps that later result in unsafe responses from the AI. Automated, curiosity-driven red teaming minimizes such blind spots.