Metasurveillance: Understanding Failure Modes in LLM-as-a-Judge Systems

Primary supervisor

Abdul Rafae Khan

Large Language Models (LLMs) are increasingly used to automatically evaluate the output of other AI systems on tasks such as writing, reasoning, and question answering. This approach, known as LLM-as-a-Judge, is now widely used in research benchmarks and AI development pipelines.

However, recent studies show that these automated judges can be unreliable. They may be influenced by subtle factors such as response length, position of answers, writing style, or domain mismatch (e.g., medical or legal content). These hidden biases can distort evaluation results and lead to misleading conclusions about model performance.

This project investigates how and why these failures occur. We will systematically test state-of-the-art LLM judges under controlled conditions to understand their weaknesses. The goal is to build a structured taxonomy of failure modes and develop tools for assessing the reliability of AI evaluators.

Students will work with modern LLMs, design experiments, and analyze how different types of input changes affect model judgment behavior.

Aim/outline

The project aims to:

Identify and categorize failure modes in LLM-based evaluators
Study how biases (e.g., length, position, style) affect judgments
Investigate how performance changes across different domains (e.g., general QA vs. medical or legal text)
Build a dataset of "challenging evaluation cases" to test LLM reliability
Develop a structured framework (taxonomy) for understanding evaluator failures
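
As an illustration of how a bias such as position could be quantified, the sketch below measures how often a pairwise judge picks the same underlying answer when the two answers are shown in swapped order. It is a minimal sketch, not the project's method: the `Judge` interface, the function names, and the toy judges are all hypothetical stand-ins, and a real experiment would replace the toy judges with calls to an actual LLM.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface (an assumption for this sketch):
# takes (question, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of items where the judge prefers the same answer in both orders."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    consistent = 0
    for question, first, second in pairs:
        original = judge(question, first, second)  # `first` shown in slot A
        swapped = judge(question, second, first)   # `first` shown in slot B
        # Map slot labels back to the underlying answers (0 = first, 1 = second).
        winner_original = 0 if original == "A" else 1
        winner_swapped = 1 if swapped == "A" else 0
        consistent += winner_original == winner_swapped
    return consistent / len(pairs)


# Toy stand-ins for an LLM judge, used only to exercise the metric.
def always_first(question: str, a: str, b: str) -> str:
    return "A"  # fully position-biased: always picks whatever is in slot A


def prefers_longer(question: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"  # position-blind, but length-biased


pairs = [
    ("What is 2+2?", "4", "The answer is 4."),
    ("Capital of France?", "Paris", "Lyon is the capital."),
]

print(position_consistency(always_first, pairs))
print(position_consistency(prefers_longer, pairs))
```

Note that a judge can be perfectly consistent under this metric and still be biased in another way, as the length-preferring toy judge shows, which is one reason a taxonomy of distinct failure modes is useful.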

Planned work includes:

Designing controlled experiments with LLM judges (e.g., GPT-5.5, Llama, Qwen)
Creating adversarial or perturbed evaluation examples
Running large-scale experiments and analyzing results
Identifying patterns of inconsistency and bias
Summarizing findings into a research publication or thesis
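
A perturbation experiment of the kind listed above could look roughly like the following. This is a hedged sketch under stated assumptions: `pad_response`, `flip_rate`, the filler sentence, and the toy judges are hypothetical names introduced for illustration, and the toy judges stand in for real LLM calls. The idea is to change only a surface property of one answer (here, its length) and count how often the verdict flips.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface (an assumption for this sketch):
# takes (question, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]

# Content-free filler used to lengthen an answer without changing its substance.
FILLER = " To summarize, the points above address the question as asked."


def pad_response(response: str, repeats: int = 2) -> str:
    """Return the response with filler appended `repeats` times."""
    return response + FILLER * repeats


def flip_rate(
    judge: Judge,
    items: List[Tuple[str, str, str]],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of items whose verdict changes when answer B is perturbed."""
    if not items:
        raise ValueError("items must be non-empty")
    flips = 0
    for question, answer_a, answer_b in items:
        before = judge(question, answer_a, answer_b)
        after = judge(question, answer_a, perturb(answer_b))
        flips += before != after
    return flips / len(items)


# Toy stand-ins for an LLM judge, used only to exercise the metric.
def length_biased_judge(question: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"


def length_blind_judge(question: str, a: str, b: str) -> str:
    return "A"


items = [
    ("What is 2+2?", "The answer is 4.", "4"),
    ("Capital of France?", "Paris", "The capital of France is Paris."),
]

print(flip_rate(length_biased_judge, items, pad_response))
print(flip_rate(length_blind_judge, items, pad_response))
```

Because the padding adds no information, any flip is attributable to the surface change rather than to answer quality. Other perturbations (style rewrites, answer reordering, domain-specific terminology) could be passed in through the same `perturb` argument.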

URLs/references

Core LLM-as-a-Judge foundation:

Zheng et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. https://arxiv.org/abs/2306.05685

Position, verbosity, and evaluation biases:

Shi et al. (2024). Judging the judges: A systematic study of position bias in LLM-as-a-judge. https://arxiv.org/abs/2406.07791
Stureborg et al. (2024). Large language models are inconsistent and biased evaluators. https://arxiv.org/abs/2405.01724

Self-preference and judge bias mechanisms:

Wataoka et al. (2024). Self-preference bias in LLM-as-a-judge. https://arxiv.org/abs/2410.21819
Wang et al. (2024). Eliminating position bias of language models: A mechanistic approach. https://arxiv.org/abs/2407.01100

Domain shift and evaluation reliability:

Xie et al. (2025). An empirical analysis of uncertainty in large language model evaluations. https://arxiv.org/abs/2502.10709
Huang et al. (2024). An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge model is not a general substitute for GPT-4. https://arxiv.org/abs/2403.02839

Surveys/broader context:

Li et al. (2024). LLMs-as-judges: A survey on LLM-based evaluation methods. https://arxiv.org/abs/2412.05579

Required knowledge

Students should ideally have some background in:

Python programming
Basic machine learning / NLP concepts
Familiarity with large language models (e.g., GPT, Llama)
Basic statistics (helpful but not strictly required)

Nice to have (not required):

Experience with PyTorch or HuggingFace
Prior ML/NLP coursework
Interest in AI safety or evaluation
