From Main Sequence to Red Giant: Studying the Lifecycle of AI Evaluation Benchmarks and Leaderboards

Primary supervisor

Abdul Rafae Khan

Evaluation benchmarks are a foundational component of artificial intelligence (AI) research, providing standardized ways to measure and compare the capabilities of AI systems. Benchmarks such as MMLU, GSM8K, HumanEval, and HellaSwag have been instrumental in tracking progress in large language models and related systems. However, the usefulness of a benchmark is not static. As models improve rapidly, many benchmarks pass through distinct stages of utility: early emergence and rapid improvement, then maturity, and eventually saturation, where performance differences between models become less meaningful.

This project investigates the lifecycle of AI evaluation benchmarks from a meta-evaluation perspective. Rather than focusing on building new evaluation systems, it focuses on understanding how benchmarks themselves evolve over time and how their ability to measure progress changes as AI systems advance.

The project develops Benchmark Observatory, an analytical framework designed to study benchmark evolution using historical performance data obtained from existing evaluation infrastructures, public leaderboards, and benchmark repositories (including outputs from systems such as Benchmark-as-a-Service). The goal is to analyze how benchmark characteristics change across time, model generations, and task domains.

Using this data, the project explores whether benchmark "health" can be quantified through measurable indicators such as performance ceilings, score dispersion, improvement rates, and discriminative power between models. These indicators are used to identify lifecycle patterns and potential signs of benchmark saturation or declining usefulness.
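As a rough illustration of how such indicators could be computed, the Python sketch below derives four simple quantities for each time period: headroom below the maximum score (a ceiling indicator), the standard deviation of scores (dispersion), the gap between the best and the k-th best model (a crude proxy for discriminative power among leaders), and the change in best score since the previous period (improvement rate). The function name, the input layout, and the choice of `top_k` are assumptions made for this example, and the scores are invented, not real leaderboard results.

```python
from statistics import pstdev


def health_indicators(snapshots, max_score=100.0, top_k=3):
    """Compute simple per-period indicators for one benchmark.

    snapshots: chronological list of (period_label, [model scores]).
    All names and indicator definitions here are illustrative assumptions.
    """
    rows = []
    prev_best = None
    for period, scores in snapshots:
        ranked = sorted(scores, reverse=True)
        best = ranked[0]
        top = ranked[:top_k]
        rows.append({
            "period": period,
            # Distance between the best model and the score ceiling.
            "headroom": max_score - best,
            # Population standard deviation of all reported scores.
            "dispersion": pstdev(scores),
            # Gap between the best and the k-th best model.
            "top_k_spread": top[0] - top[-1],
            # Change in the best score since the previous period.
            "improvement": None if prev_best is None else best - prev_best,
        })
        prev_best = best
    return rows


# Invented scores for a hypothetical benchmark, for illustration only.
example = [
    ("2021", [35.0, 42.0, 50.0]),
    ("2022", [60.0, 68.0, 75.0]),
    ("2023", [86.0, 88.0, 89.0]),
]
for row in health_indicators(example):
    print(row)
```

In this toy data, shrinking headroom combined with a narrowing top-k spread is the kind of pattern that might signal approaching saturation; which thresholds are meaningful is a question for the project itself.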

The outcome of this project is both a reusable analytical framework and an empirical study of AI benchmark evolution. It contributes to a deeper understanding of how evaluation benchmarks behave over time and how they influence the interpretation of AI progress.

Aim/outline

The objectives of this project are:

1. Benchmark Lifecycle Data Integration

Assemble and curate benchmark performance histories by leveraging existing evaluation infrastructures (e.g., benchmark evaluation platforms and public leaderboards). The focus is on consolidating structured historical data rather than building new systems for running evaluations.
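Consolidation of this kind usually means mapping records from differently shaped sources onto one common layout. The sketch below shows one way this might look in Python. Both source layouts (`source_a`, `source_b`), their field names, and their score scales are hypothetical and do not describe the format of any real leaderboard.

```python
# Hypothetical field layouts for two made-up sources. Real platforms will
# differ, so a mapping like this would be written per source after
# inspecting its actual export format.
FIELD_MAPS = {
    "source_a": {"benchmark": "benchmark", "model": "model",
                 "score": "score", "date": "date", "scale": 100.0},
    "source_b": {"benchmark": "task", "model": "model_id",
                 "score": "accuracy", "date": "submitted", "scale": 1.0},
}


def normalize_record(raw, source):
    """Map one raw record onto a common layout with scores on a 0-100 scale."""
    fields = FIELD_MAPS[source]
    return {
        "benchmark": raw[fields["benchmark"]].strip().lower(),
        "model": raw[fields["model"]].strip(),
        # Rescale so that every source reports scores out of 100.
        "score": 100.0 * float(raw[fields["score"]]) / fields["scale"],
        "date": raw[fields["date"]],
        # Keep provenance so conflicting reports can be traced later.
        "source": source,
    }


# Invented records, for illustration only.
print(normalize_record(
    {"task": " MMLU ", "model_id": "model-x", "accuracy": "0.5",
     "submitted": "2023-01-01"},
    "source_b",
))
```

Keeping the `source` field on every normalized record is a deliberate choice: when two platforms report different scores for the same model and benchmark, the discrepancy itself can be analysed.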

2. Benchmark Lifecycle Dataset Construction

Construct a structured dataset containing:

Benchmark identifiers and metadata
Historical performance records across models and time
Task categories and evaluation domains
Model generation and release context
Temporal performance trends

This dataset forms the basis for longitudinal analysis of benchmark behaviour.
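One possible shape for a single row of such a dataset is sketched below as a Python dataclass, together with a helper that derives a temporal trend (the running best score) from the raw records. The class name, the field names, and the helper are assumptions for illustration, and the records are invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkRecord:
    """One reported score. Field names are illustrative assumptions."""
    benchmark_id: str
    task_category: str
    model_name: str
    model_release: date
    score: float
    reported_on: date
    source: str


def best_score_history(records, benchmark_id):
    """Return (date, best score so far) each time the best score improves."""
    rows = sorted(
        (r for r in records if r.benchmark_id == benchmark_id),
        key=lambda r: r.reported_on,
    )
    history = []
    best = float("-inf")
    for r in rows:
        if r.score > best:
            best = r.score
            history.append((r.reported_on, best))
    return history


# Invented records for hypothetical benchmarks and models.
records = [
    BenchmarkRecord("demo-bench", "reasoning", "model-a",
                    date(2021, 2, 1), 40.0, date(2021, 3, 1), "demo"),
    BenchmarkRecord("demo-bench", "reasoning", "model-b",
                    date(2021, 12, 1), 38.0, date(2022, 1, 15), "demo"),
    BenchmarkRecord("demo-bench", "reasoning", "model-c",
                    date(2022, 5, 1), 55.0, date(2022, 6, 1), "demo"),
    BenchmarkRecord("other-bench", "coding", "model-a",
                    date(2021, 2, 1), 70.0, date(2021, 3, 1), "demo"),
]
print(best_score_history(records, "demo-bench"))
```

Storing individual score reports and deriving trends from them, rather than storing precomputed trends, keeps the dataset usable for metrics that are defined later in the project.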

3. Benchmark Health and Lifecycle Metrics

Design and implement quantitative metrics to characterize benchmark evolution, including:

Performance growth and saturation trends
Score dispersion and variability
Benchmark discriminative power over time
Stability of rankings across model generations
Indicators of benchmark maturity and decline

Where appropriate, novel metrics may be proposed and evaluated.
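Ranking stability, for instance, could be approached with a rank correlation. The sketch below computes Kendall's tau (the tau-a variant, where tied pairs count as neither concordant nor discordant) between two score tables over the models they share. Interpreting the two tables as two snapshots of one leaderboard, or as two different benchmarks, is an assumption of this example, and the scores are invented.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {model: score} tables.

    Only models present in both tables are compared. Returns a value in
    [-1, 1], where 1 means both tables order the shared models identically.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared models")
    concordant = discordant = 0
    for m1, m2 in combinations(shared, 2):
        # The sign of the product tells whether both tables order the pair
        # the same way. A product of zero is a tie and counts for neither.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(shared) * (len(shared) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented scores for hypothetical models, for illustration only.
earlier = {"model-a": 10.0, "model-b": 20.0, "model-c": 30.0}
later = {"model-a": 10.0, "model-b": 30.0, "model-c": 20.0}
print(kendall_tau(earlier, later))
```

A tau that drifts toward zero between successive snapshots might suggest that remaining score differences are closer to noise than to real capability gaps, although establishing this would require the empirical work described in this project.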

4. Benchmark Evolution Analysis

Conduct systematic analysis of benchmark trajectories to study:

How quickly different benchmarks saturate
Which benchmarks remain informative over long time periods
Differences in lifecycle patterns across benchmark types (reasoning, coding, mathematics, etc.)
Structural differences between stable and rapidly saturating benchmarks
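To make "how quickly a benchmark saturates" concrete, one simple operationalisation is the time between the first reported result and the first result at or above a chosen score threshold. The sketch below implements this in Python. The threshold value and the function name are assumptions for illustration, the project may well adopt a different definition of saturation, and the history shown is invented.

```python
from datetime import date


def days_to_saturation(history, threshold):
    """Days from the first result to the first score at or above threshold.

    history: list of (date, best_score) pairs in any order.
    Returns None if the history is empty or the threshold is never reached.
    """
    if not history:
        return None
    ordered = sorted(history)
    start = ordered[0][0]
    for when, score in ordered:
        if score >= threshold:
            return (when - start).days
    return None


# Invented best-score history for a hypothetical benchmark.
history = [
    (date(2020, 1, 1), 30.0),
    (date(2021, 1, 1), 70.0),
    (date(2022, 1, 1), 91.0),
]
print(days_to_saturation(history, threshold=90.0))
print(days_to_saturation(history, threshold=95.0))
```

Computing this value for many benchmarks with a common threshold would allow saturation speed to be compared across benchmark types, subject to the caveat that the first public result is only an approximation of when a benchmark was introduced.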

5. Visualization and Exploratory Tools

Develop visualization tools (e.g., dashboards) to support exploration of:

Benchmark lifecycle curves
Model performance evolution across benchmarks
Comparative benchmark health indicators
Cross-benchmark lifecycle patterns

6. Empirical Evaluation and Interpretation

Evaluate the proposed metrics and analysis methods through case studies of selected benchmarks. The study will interpret results in the context of AI evaluation practices and discuss implications for:

Benchmark design and longevity
Interpretation of model performance improvements
Future directions in AI evaluation methodology

URLs/references

Benchmark and Evaluation Platforms

Open LLM Leaderboard (Hugging Face). https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
OpenRouter Rankings. https://openrouter.ai/rankings
HELM: Holistic Evaluation of Language Models (Stanford CRFM). https://crfm.stanford.edu/helm
EleutherAI LM Evaluation Harness. https://github.com/EleutherAI/lm-evaluation-harness
Artificial Analysis (LLM performance tracking). https://artificialanalysis.ai

Key Benchmark Papers

Hendrycks et al. (2020). MMLU: Measuring Massive Multitask Language Understanding. https://arxiv.org/abs/2009.03300
Cobbe et al. (2021). GSM8K: Training Verifiers to Solve Math Word Problems. https://arxiv.org/abs/2110.14168
Chen et al. (2021). HumanEval: Evaluating Large Language Models Trained on Code. https://arxiv.org/abs/2107.03374
Zellers et al. (2019). HellaSwag: Can a Machine Really Finish Your Sentence? https://arxiv.org/abs/1905.07830

AI Evaluation and Research Context

Stanford Center for Research on Foundation Models (CRFM). https://crfm.stanford.edu
OpenAI Research. https://openai.com/research

Required knowledge

Essential

Python programming
Data processing and analysis
Working with structured data (CSV, JSON, APIs, databases)
Basic software engineering principles
Data visualization fundamentals

Recommended

Machine learning and AI fundamentals
Statistics and exploratory data analysis
SQL and database design
Time-series analysis and trend modelling
Web scraping and API integration
Dashboard development (Streamlit or similar tools)

Desirable

Familiarity with AI evaluation benchmarks
Experience reading research papers
Interest in AI evaluation methodology
Reproducible research practices
