Metasurveillance: Understanding Failure Modes in LLM-as-a-Judge Systems

Primary supervisor

Abdul Rafae Khan

Large Language Models (LLMs) are increasingly used to automatically evaluate other AI systems on tasks such as writing, reasoning, and question answering. This approach, known as LLM-as-a-Judge, is now widely used in research benchmarks and AI development pipelines.

However, recent studies show that these automated judges can be unreliable. They may be influenced by subtle factors such as response length, position of answers, writing style, or domain mismatch (e.g., medical or legal content). These hidden biases can distort evaluation results and lead to misleading conclusions about model performance.

This project investigates how and why these failures occur. We will systematically test state-of-the-art LLM judges under controlled conditions to understand their weaknesses. The goal is to build a structured taxonomy of failure modes and develop tools for assessing the reliability of AI evaluators.

Students will work with modern LLMs, design experiments, and analyze how different types of input changes affect model judgment behavior.

Aim/outline

The project aims to:

  • Identify and categorize failure modes in LLM-based evaluators
  • Study how biases (e.g., length, position, style) affect judgments
  • Investigate how performance changes across different domains (e.g., general QA vs. medical or legal text)
  • Build a dataset of "challenging evaluation cases" to test LLM reliability
  • Develop a structured framework (taxonomy) for understanding evaluator failures
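As an illustration of what a single failure-mode probe might look like, the sketch below checks position bias by asking a judge the same pairwise question twice with the answers swapped. It is a minimal sketch, not the project's method: the `Judge` signature, `position_consistent`, and the two toy judges are all hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch):
# takes (question, answer_in_slot_A, answer_in_slot_B) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_consistent(judge: Judge, question: str, ans1: str, ans2: str) -> bool:
    """Return True if the judge prefers the same underlying answer in both orders."""
    first = judge(question, ans1, ans2)   # ans1 is shown in slot A
    second = judge(question, ans2, ans1)  # ans1 is shown in slot B
    winner_first = ans1 if first == "A" else ans2
    winner_second = ans2 if second == "A" else ans1
    return winner_first == winner_second


def always_first(question: str, a: str, b: str) -> str:
    """Toy stand-in for a maximally position-biased judge."""
    return "A"


def longer_wins(question: str, a: str, b: str) -> str:
    """Toy stand-in for a length-biased judge (ties go to slot A)."""
    return "A" if len(a) >= len(b) else "B"


if __name__ == "__main__":
    print(position_consistent(always_first, "q", "x", "yy"))  # position-biased judge
    print(position_consistent(longer_wins, "q", "x", "yy"))   # length-biased judge
```

The two toy judges show why separate probes are needed: the length-biased judge passes the swap test whenever the answers differ in length, so a position check alone would not reveal its bias.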

Planned work includes:

  • Designing controlled experiments with LLM judges (e.g., GPT-5.5, Llama, Qwen)
  • Creating adversarial or perturbed evaluation examples
  • Running large-scale experiments and analyzing results
  • Identifying patterns of inconsistency and bias
  • Summarizing findings into a research publication or thesis
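One way the perturbation and inconsistency steps above could fit together is sketched below: pad one answer with content-free filler and measure how often the verdict flips. This is an assumption-laden illustration only. The `Judge` signature, the `FILLER` text, `pad`, `flip_rate`, and the toy judge are hypothetical, and a real experiment would replace the toy judge with calls to an actual model.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface (an assumption for this sketch):
# takes (question, answer_in_slot_A, answer_in_slot_B) and returns "A" or "B".
Judge = Callable[[str, str, str], str]

# Hypothetical filler sentence that adds length but no new information.
FILLER = " To summarize, the points above address the question in full."


def pad(answer: str, repeats: int = 3) -> str:
    """Length perturbation: append content-free filler to an answer."""
    return answer + FILLER * repeats


def flip_rate(judge: Judge, cases: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of cases whose verdict changes when answer B is padded."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one case")
    flips = 0
    for question, a, b in cases:
        base = judge(question, a, b)
        perturbed = judge(question, a, pad(b))
        flips += base != perturbed
    return flips / len(cases)


def longer_wins(question: str, a: str, b: str) -> str:
    """Toy stand-in for a length-biased judge."""
    return "A" if len(a) >= len(b) else "B"


if __name__ == "__main__":
    cases = [
        ("q1", "a long detailed answer", "short"),
        ("q2", "short", "a long detailed answer"),
    ]
    print(flip_rate(longer_wins, cases))
```

Because only answer B is padded, this measures bias in one direction; padding answer A as well, or varying `repeats`, would give a fuller picture. A judge that ignores length should score close to zero on this measure.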

URLs/references

Core LLM-as-a-Judge foundation:

  • Zheng et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. https://arxiv.org/abs/2306.05685

Position, verbosity, and evaluation biases:

  • Shi et al. (2024). Judging the judges: A systematic study of position bias in LLM-as-a-judge. https://arxiv.org/abs/2406.07791
  • Stureborg et al. (2024). Large language models are inconsistent and biased evaluators. https://arxiv.org/abs/2405.01724

Self-preference and judge bias mechanisms:

  • Wataoka et al. (2024). Self-preference bias in LLM-as-a-judge. https://arxiv.org/abs/2410.21819
  • Wang et al. (2024). Eliminating position bias of language models: A mechanistic approach. https://arxiv.org/abs/2407.01100

Domain shift and evaluation reliability:

  • Xie et al. (2025). An empirical analysis of uncertainty in large language model evaluations. https://arxiv.org/abs/2502.10709
  • Huang et al. (2024). An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge model is not a general substitute for GPT-4. https://arxiv.org/abs/2403.02839

Surveys/broader context:

  • Li et al. (2024). LLMs-as-judges: A survey on LLM-based evaluation methods. https://arxiv.org/abs/2412.05579

Required knowledge

Students should ideally have some background in:

  • Python programming
  • Basic machine learning / NLP concepts
  • Familiarity with large language models (e.g., GPT, Llama)
  • Basic statistics (helpful but not strictly required)

Nice to have (not required):

  • Experience with PyTorch or HuggingFace
  • Prior ML/NLP coursework
  • Interest in AI safety or evaluation