Verifiable, Uncertainty-Aware World Models as Safety Guardrails for AI Agents

Primary supervisor

Lizhen Qu

Research area

Data Science and Artificial Intelligence

The rapid deployment of increasingly capable AI agents has prompted a fundamental reassessment of how safety should be built into AI systems. Bengio and colleagues have argued that purely agentic training objectives are intrinsically risky and have proposed an alternative paradigm: a non-agentic "Scientist AI" that explains the world from observations rather than acting in it, combining a world model that generates explanatory theories with a question-answering inference machine, and operating with explicit notions of uncertainty so as to mitigate overconfident predictions [1]. A complementary line of theoretical work asks whether such a system could serve as a runtime guardrail, deriving probabilistic bounds on the harm probability of a candidate agent action by reasoning over a Bayesian posterior across plausible hypotheses about the world [2]. Together, these two papers set out a research agenda in which the central object of study is no longer the agent itself but the world model that surrounds, predicts, and ultimately constrains it.
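To make the guardrail idea concrete, the following minimal sketch contrasts a posterior-averaged harm estimate with a more cautious bound taken over all sufficiently plausible hypotheses. It is a simplified illustration only and does not reproduce the bounds derived in [2]; the `Hypothesis` class, the plausibility ratio `alpha`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for one candidate theory of the world."""
    name: str
    posterior: float  # posterior probability of the hypothesis given the data
    p_harm: float     # probability that the candidate action is harmful under it


def posterior_mean_harm(hyps):
    """Posterior-weighted average harm probability (weights are renormalised)."""
    total = sum(h.posterior for h in hyps)
    return sum(h.posterior * h.p_harm for h in hyps) / total


def cautious_harm_bound(hyps, alpha=0.1):
    """Worst-case harm probability over hypotheses that are 'plausible',
    here assumed to mean posterior >= alpha * (largest posterior)."""
    top = max(h.posterior for h in hyps)
    plausible = [h for h in hyps if h.posterior >= alpha * top]
    return max(h.p_harm for h in plausible)


# Illustrative, made-up numbers for a single candidate action.
hyps = [
    Hypothesis("benign", posterior=0.70, p_harm=0.01),
    Hypothesis("risky", posterior=0.25, p_harm=0.40),
    Hypothesis("fringe", posterior=0.05, p_harm=0.90),
]

mean_harm = posterior_mean_harm(hyps)
bound = cautious_harm_bound(hyps, alpha=0.1)
print(f"posterior mean harm: {mean_harm:.3f}")
print(f"cautious bound:      {bound:.2f}")
```

The cautious bound exceeds the posterior mean whenever a plausible hypothesis assigns high harm to the action, which is the behaviour a guardrail needs when the most likely theory happens to be an optimistic one. Lowering `alpha` admits less probable hypotheses and makes the guardrail more conservative.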

Realising this agenda requires substantial technical advances on the world model itself. A useful world model for safety must do more than pattern-match: it must support causal queries about what would happen under intervention, generalise reliably to deployment contexts not present in training data, and expose its reasoning chain to inspection so that risk estimates can be audited by humans. Current LLM-based world models fall short on each of these requirements. They confuse correlation with causation, hallucinate causal links that contradict scientific evidence, and produce opaque predictions whose calibration is unknown [3, 4]. Causal discovery applied directly to observational data, conversely, is brittle outside narrow tabular regimes and rarely scales to the unstructured, multimodal data on which modern AI agents operate [5, 6]. The result is a gap between the conceptual proposal of a Scientist-style world-model guardrail and the technical machinery needed to instantiate it.

This project will develop a verifiable, causal, and uncertainty-aware world model designed specifically to function as a safety substrate for downstream AI agents. The student will:

Develop neuro-symbolic causal world models that combine LLM-based hypothesis generation with structured causal graphs, building on iterative and integrated verifiable causal discovery from non-tabular sources [3] and the abstract causal event discovery substrate of [6], to support faithful counterfactual and interventional queries in settings where tabular data are unavailable.
Build calibrated reliability and confidence estimation pipelines for LLM-generated causal claims [4, 5], so that the world model's outputs come with conformal-style probabilistic bounds that can be plugged directly into Bayesian-oracle-style guardrails of the kind formalised in [2].
Investigate runtime use of the world model as a Scientist-AI-style guardrail [1] for agentic systems: given a candidate action proposed by an external agent, the world model simulates likely consequences, scores them against an explicit safety specification, and accepts, rejects, or escalates the action to a human reviewer.
Evaluate the framework in safety-critical settings where the world-model substrate is most needed, including multimodal agents whose risks surface through non-textual modalities such as audio and speech [7], and report empirically on the cost of safety (over-refusal, latency, sample complexity) so that the guardrail's operating regime can be characterised honestly.
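The calibration and runtime-guardrail steps above can be sketched together as follows. This is a minimal illustration, not the project's method: it assumes the world model emits a scalar harm estimate in [0, 1], that a held-out calibration set of residuals (observed harm minus predicted harm) is available and exchangeable with deployment data, and that a standard split-conformal quantile is an acceptable stand-in for the "conformal-style" bound. The function names and the thresholds `accept_below` and `reject_above` are hypothetical.

```python
import math
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    ESCALATE = "escalate"


def conformal_margin(residuals, delta=0.2):
    """Split-conformal margin: the ceil((n + 1) * (1 - delta))-th smallest
    calibration residual. Under exchangeability, predicted + margin covers
    the observed harm with probability at least 1 - delta (marginally)."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - delta))
    if k > n:
        # Too few calibration points for this delta: no finite bound exists.
        return float("inf")
    return sorted(residuals)[k - 1]


def guard(predicted_harm, margin, accept_below=0.05, reject_above=0.5):
    """Accept only if the calibrated upper bound is low, reject if even the
    point estimate is high, and escalate to a human reviewer otherwise."""
    upper = min(1.0, predicted_harm + margin)
    if upper < accept_below:
        return Decision.ACCEPT
    if predicted_harm > reject_above:
        return Decision.REJECT
    return Decision.ESCALATE


# Made-up calibration residuals (observed harm - predicted harm).
residuals = [-0.03, -0.01, 0.00, 0.00, 0.01, 0.01, 0.02, 0.02, 0.03, 0.05]
margin = conformal_margin(residuals, delta=0.2)

for score in (0.01, 0.10, 0.60):
    print(score, guard(score, margin).value)
```

With an undersized calibration set the margin is infinite, so the upper bound saturates at 1 and no action can be accepted automatically. That failure mode is deliberately conservative, and it also shows how the over-refusal cost depends directly on how much calibration data is available.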

The goal is to develop a world model that translates the conceptual proposals of [1, 2] into an empirically validated system, with explicit causal structure, calibrated uncertainty, and audit-ready explanations that make it suitable both as a research instrument for safety science and as a deployable guardrail for emerging AI agents.

[1] Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, and David Williams-King. Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? arXiv:2502.15657, 2025.

[2] Yoshua Bengio, Michael K. Cohen, Nikolay Malkin, Matt MacDermott, Damiano Fornasiere, Pietro Greiner, and Younesse Kaddar. Can a Bayesian Oracle Prevent Harm from an Agent? In Proceedings of UAI, PMLR 286:257–270, 2025.

[3] Tao Feng, Lizhen Qu, Niket Tandon, and Gholamreza Haffari. IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data. In Proceedings of ACL, 2025.

[4] Tao Feng, Lizhen Qu, Niket Tandon, Zhuang Li, Xiaoxi Kang, and Gholamreza Haffari. On the Reliability of Large Language Models for Causal Discovery. In Proceedings of ACL, 2025.

[5] Tao Feng, Lizhen Qu, Xiaoxi Kang, and Gholamreza Haffari. CausalScore. In Proceedings of COLING, 2025.

[6] Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, and Gholamreza Haffari. ACCESS: A Benchmark for Abstract Causal Event Discovery and Reasoning. In Proceedings of NAACL, 2025.

[7] Hao Yang, Lizhen Qu, Ehsan Shareghi, and Gholamreza Haffari. Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models. In Proceedings of NAACL, 2025.
