Primary supervisor: Lizhen Qu

Modern large multimodal models (LMMs) and omni-modal models process not just text but vision, audio, and speech, opening new application surfaces and, with them, new safety risks. Established safety pipelines, including RLHF, safety classifiers such as Llama Guard, and red-teaming protocols, were largely developed for text-only models and translate poorly to the multimodal setting.

Three gaps are now well documented. First, non-textual modalities expose new attack surfaces that text-only defenses do not cover: paralinguistic cues such as tone, accent, emotion, and background sound can carry harmful intent that audio-aware models systematically fail to recognise, and a structured red-teaming study shows that audio is in effect the Achilles' heel of current LMMs, with attack success rates considerably above text-only baselines [1]. Second, defenses that simply tighten refusal behaviour trade one failure mode for another: models become over-cautious, refuse benign queries, and degrade user trust. Reshaping the model's representation space, rather than only adjusting decision thresholds, is required to balance safety and over-rejection [2]. Third, safety specifications themselves are culturally situated. Static, monocultural benchmarks misalign with the values of many user populations, and dynamic, multi-agent, multi-cultural evaluation is required to expose value-misalignment risks that single-turn text prompting cannot surface [3, 4].

Together, these findings indicate that multimodal AI safety cannot be reduced to bolting a guard model onto an existing pipeline; it requires evaluation, alignment, and defense methods designed natively for the cross-modal, multi-cultural setting.

This project will develop a neuro-symbolic solution for improving the safety of multimodal foundation models across text, vision, audio, and speech, with explicit attention to cultural context and to the safety-utility trade-off. The student will:

  • Extend the audio red-teaming methodology of [1] to omni-modal models that jointly process vision, audio, and speech, developing a taxonomy of cross-modal risks (including paralinguistic, environmental, and visually grounded cues) and an automated red-teaming planner that allocates effort strategically across modalities and harm categories rather than uniformly.
  • Build defense methods that go beyond refusal-rate tuning, including representation-space interventions in the style of [2], to jointly optimise for safety, calibrated uncertainty, and low over-rejection on benign prompts, with formal coverage guarantees derived from conformal prediction.
  • Investigate self-evolving benchmark construction in which a red-teaming planner, a critic agent, and a generative pipeline collaboratively expose model vulnerabilities and synthesise targeted training and evaluation data, with human-in-the-loop verification ensuring specification coverage and data quality.
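
The conformal-prediction guarantee mentioned in the second objective can be illustrated with a minimal sketch of split conformal calibration. It assumes a hypothetical safety scorer that maps each prompt to a harm score, and a held-out set of benign prompts that is exchangeable with future benign traffic. The function names, the scorer, and the choice to calibrate on benign prompts only are illustrative assumptions, not the method of [2] or a design the project has committed to.

```python
import math


def conformal_threshold(benign_scores, alpha):
    """Pick a refusal threshold from held-out benign harm scores.

    Assumption: `benign_scores` come from a hypothetical safety scorer
    applied to benign calibration prompts. If future benign prompts are
    exchangeable with these, a new benign prompt scores strictly above
    the returned threshold with probability at most `alpha`.
    """
    n = len(benign_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Rank of the conformal quantile among the n calibration scores.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this alpha:
        # the only valid threshold is "never refuse".
        return math.inf
    return sorted(benign_scores)[k - 1]


def should_refuse(score, threshold):
    """Refuse only when the harm score exceeds the calibrated threshold."""
    return score > threshold


if __name__ == "__main__":
    # Toy calibration scores, invented purely for illustration.
    calibration = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35]
    tau = conformal_threshold(calibration, alpha=0.25)
    print("threshold:", tau)
    print("refuse score 0.90:", should_refuse(0.90, tau))
    print("refuse score 0.15:", should_refuse(0.15, tau))
```

With n calibration scores and target rate alpha, the threshold is the ceil((n + 1)(1 - alpha))-th smallest score, which bounds the over-rejection rate on benign prompts by alpha. The guarantee is marginal and covers over-rejection only: it says nothing about how many harmful prompts fall below the threshold, so a deployed system would still need a separate check on the safety side.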

The goal is a multimodal safety framework whose outputs are robust across modalities and cultures, transparent and auditable to model developers and regulators, and useful enough that safety improvements no longer come at the cost of helpfulness.

[1] Hao Yang, Lizhen Qu, Ehsan Shareghi, and Gholamreza Haffari. Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models. In Proceedings of NAACL, 2025.

[2] Hao Yang, Lizhen Qu, Ehsan Shareghi, and Gholamreza Haffari. Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models. In Proceedings of EMNLP, 2025.

[3] Viet Thanh Pham, Zhuang Li, Lizhen Qu, and Gholamreza Haffari. CultureInstruct: Curating Multi-Cultural Instructions at Scale. In Proceedings of NAACL, 2025.

[4] Viet Thanh Pham, Lizhen Qu, Thuy-Trang Vu, Gholamreza Haffari, and Dinh Phung. LiveCultureBench: A Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations. In Proceedings of ACL, 2026.