Primary supervisor: Xingliang Yuan
Co-supervisors: Jason Xue (CSIRO's Data61), Tina Wu (CSIRO's Data61)
The increasing integration of Large Language Models (LLMs) into various sectors has brought to light the pressing need to align these models with human preferences and to implement safeguards against the generation of inappropriate content. This challenge stems from both ethical considerations and practical demands for responsible AI usage. Ethically, there is growing recognition that LLM outputs must comply with laws, societal values, and norms. Practically, the widespread application of LLMs in sensitive contexts necessitates robust measures to prevent misuse, including the generation of content that circumvents designer-imposed restrictions.
Researchers have highlighted ethical risks such as the potential leakage of sensitive medical data entered into LLMs and bias in the generated content. The generative capacity of LLMs can also be misused to disseminate misinformation and disinformation at massive scale. Despite collective efforts to curtail the generation of undesirable content, novel attacks such as prompt injection and other forms of misuse remain prevalent. This underscores the urgency of developing more resilient strategies that harness the transformative potential of LLMs while navigating the complex terrain of safety, ethics, and compliance.
Existing methods actively guide models toward human preferences by fine-tuning them on preference-aligned datasets and adopting reward-guided learning procedures. Nonetheless, it is difficult to evaluate the effectiveness of these methods at predicting LLM outputs in complex real-world applications, which involve multi-shot questions and constant iteration of model architectures and fine-tuning.
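The reward-guided learning procedure mentioned above can be illustrated with a deliberately small sketch. Everything below is an assumption for illustration only: a three-token toy "policy", a hypothetical per-token reward model, and a REINFORCE-style exponentiated update against an expected-reward baseline. It is not the training pipeline of any actual LLM, but it shows the core idea: responses the reward model prefers are up-weighted, dispreferred ones are suppressed.

```python
import math

# Toy vocabulary and "policy": probability of emitting each token.
# Names and values are illustrative assumptions, not a real RLHF setup.
vocab = ["helpful", "neutral", "toxic"]
policy = {"helpful": 0.3, "neutral": 0.4, "toxic": 0.3}

# Hypothetical reward model encoding human preference per token.
reward = {"helpful": 1.0, "neutral": 0.2, "toxic": -1.0}

def reward_guided_step(policy, reward, lr=0.5):
    """One exponentiated-gradient style update: up-weight tokens whose
    reward exceeds the policy's expected reward (a REINFORCE-like baseline)."""
    baseline = sum(policy[t] * reward[t] for t in policy)  # expected reward
    logits = {t: math.log(policy[t]) + lr * (reward[t] - baseline) for t in policy}
    z = sum(math.exp(v) for v in logits.values())  # renormalize to a distribution
    return {t: math.exp(v) / z for t, v in logits.items()}

for _ in range(20):
    policy = reward_guided_step(policy, reward)

print({t: round(p, 3) for t, p in policy.items()})
```

After a few iterations the probability mass concentrates on the high-reward token, mirroring how preference fine-tuning steers generation toward aligned outputs.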
This research project aims to propose a novel automatic adversarial prompt generation technique that elicits inappropriate content from LLMs. The generative nature of adversarial prompt generation makes it possible to test model performance in complex cases with multi-shot questions and will provide insight into the mechanisms by which inappropriate content is elicited from LLMs. Adversarial prompts will also help limit objectionable content through techniques such as adversarial training. This project will focus on limiting the generation of toxic and objectionable responses.
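Automatic adversarial prompt generation can be sketched as a search over prompt suffixes that maximizes some measure of unsafe model behaviour. The sketch below is a toy, greedy coordinate search; the `unsafe_score` keyword-matching function is a hypothetical stand-in for querying the LLM under test, and the candidate token list is invented for illustration. It is vastly simplified relative to gradient-based suffix attacks, but it captures the loop structure: propose a token, keep whichever candidate most increases the attack objective.

```python
# Mock target: a stand-in score for how strongly a prompt elicits a
# disallowed response. In practice this would come from the LLM under
# test; here it is a hypothetical keyword-based proxy.
TRIGGER_WEIGHTS = {"ignore": 0.3, "previous": 0.2, "instructions": 0.3, "roleplay": 0.2}

def unsafe_score(prompt):
    words = set(prompt.split())
    return sum(w for tok, w in TRIGGER_WEIGHTS.items() if tok in words)

# Invented candidate pool the search may draw suffix tokens from.
CANDIDATE_TOKENS = ["please", "ignore", "previous", "instructions", "roleplay", "kindly"]

def greedy_adversarial_suffix(base_prompt, suffix_len=4):
    """Greedy coordinate search: grow a suffix one token at a time,
    keeping the candidate that maximizes the mock unsafe score."""
    suffix = []
    for _ in range(suffix_len):
        best = max(
            CANDIDATE_TOKENS,
            key=lambda tok: unsafe_score(base_prompt + " " + " ".join(suffix + [tok])),
        )
        suffix.append(best)
    return " ".join(suffix)

suffix = greedy_adversarial_suffix("Tell me about chemistry")
print(suffix)
```

Suffixes found this way double as training data for the defensive direction the project mentions: adversarial training can fine-tune the model to refuse exactly the prompts the search discovers.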