Project Background & Motivation
Active learning (AL) mitigates the heavy annotation costs of deep learning by strategically querying the most informative unlabelled samples. However, traditional AL paradigms rely on a fragile "closed-world" assumption: that the unlabelled pool perfectly reflects the distribution of the labelled seed set. In real-world deployments, this is rarely true. Data streams are inherently non-stationary, subject to temporal drift, environmental domain shifts, and the continuous emergence of novel categories (open-set conditions).
Under such distribution drift, standard AL frameworks, particularly those reliant on epistemic uncertainty or entropy-based querying, become acutely brittle. They fall victim to the purity-informativeness dilemma: the acquisition function cannot distinguish between highly informative in-distribution samples and out-of-distribution (OOD) noise. Consequently, the model wastes its limited labelling budget on anomalous samples that are impossible to classify, yielding negligible performance gains or even inducing catastrophic forgetting.
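The dilemma above can be made concrete with a minimal sketch (all sample names and probabilities are hypothetical, for illustration only): under a pure max-entropy query policy, an OOD sample on which the model is uniformly confused scores at least as highly as any genuinely ambiguous in-distribution sample, so the budget goes to the unlabelable point first.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a predictive distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

# Hypothetical 3-class softmax outputs from a classifier.
informative_id = np.array([0.40, 0.35, 0.25])  # ambiguous but in-distribution
easy_id        = np.array([0.98, 0.01, 0.01])  # confidently classified
ood_noise      = np.array([1/3, 1/3, 1/3])     # OOD: model uniformly confused

scores = {name: entropy(p) for name, p in
          [("informative_id", informative_id),
           ("easy_id", easy_id),
           ("ood_noise", ood_noise)]}

# Entropy ranks the OOD sample highest, so max-entropy
# acquisition spends its first query on it.
best = max(scores, key=scores.get)
```

This is exactly the failure mode that motivates decoupling OOD filtering from informativeness scoring.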
To bridge the gap between theoretical AL and real-world deployment, this PhD project will develop resilient active learning methodologies capable of operating under severe distribution drift. By decoupling OOD filtering from informativeness scoring and leveraging adaptive, drift-aware querying strategies, this research aims to maintain high label efficiency in non-stationary environments, supported by reproducible benchmarks and principled evaluation protocols.
Key Objectives
- O1: Drift-aware querying: Develop acquisition functions that explicitly account for drift and open-set contamination.
- O2: Robust uncertainty estimation: Improve calibration and uncertainty reliability under drift (e.g., ensembles, Bayesian approximations, conformal prediction).
- O3: Joint OOD detection + AL: Combine sample selection with OOD filtering/triage policies that decide what to label, what to defer, and what to reject.
- O4: Human- and compute-aware AL: Incorporate labelling cost, label noise, and selection-time constraints into AL design.
- O5: Standardised benchmarks: Build or extend a platform (e.g., ALScope-like) for CV and NLP drift settings: open-set, imbalanced, and temporal drift.
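One way the triage policy in O3 might look in practice (a hedged sketch only; the function name, scores, and thresholds are illustrative assumptions, not a prescribed method): reject clear OOD candidates, defer borderline ones, and rank only the surviving in-distribution pool by informativeness.

```python
import numpy as np

def triage(ood_scores, info_scores, reject_thr=0.8, defer_thr=0.5):
    """Decoupled OOD triage for active learning.

    Samples with ood_score >= reject_thr are rejected, those in
    [defer_thr, reject_thr) are deferred, and the rest are ranked
    by informativeness (highest first) for labelling.
    Returns (label_order, defer_idx, reject_idx) as index arrays.
    """
    ood_scores = np.asarray(ood_scores, dtype=float)
    info_scores = np.asarray(info_scores, dtype=float)
    reject = np.where(ood_scores >= reject_thr)[0]
    defer = np.where((ood_scores >= defer_thr) & (ood_scores < reject_thr))[0]
    keep = np.where(ood_scores < defer_thr)[0]
    # Rank kept (in-distribution) samples by descending informativeness.
    label_order = keep[np.argsort(-info_scores[keep])]
    return label_order, defer, reject

# Four hypothetical candidates:
order, defer, reject = triage(ood_scores=[0.1, 0.9, 0.6, 0.2],
                              info_scores=[0.5, 0.99, 0.7, 0.8])
# order -> [3, 0]; defer -> [2]; reject -> [1]
```

Note that the highly "informative" sample 1 is rejected outright: its informativeness is never consulted, which is the point of decoupling the two scores.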
Aimed Publications (Targets)
- NeurIPS / ICML / ICLR (method + theory)
- ACL / EMNLP (NLP drift settings)
- TPAMI / JMLR (benchmark + comprehensive study)
Required Knowledge
- Strong Mathematical Foundation: Demonstrated expertise in Probability and Statistics, Linear Algebra, and Optimisation. A deep understanding of Bayesian Inference (e.g., Variational Inference, Monte Carlo methods) is highly preferred.
- Machine Learning Expertise: Mastery of Deep Learning fundamentals, including experience with Vision Transformers (ViTs), Generative Models (e.g., Diffusion or VAEs), and Self-Supervised Learning.
- Information Theory: Familiarity with concepts such as Entropy, Mutual Information, and Kullback-Leibler (KL) Divergence, which are critical for designing acquisition functions.
- Deep Learning Frameworks: Advanced proficiency in PyTorch. The candidate must be able to implement custom training loops, handle complex data pipelines, and modify model architectures.
- Software Engineering: Strong Python programming skills, including experience with version control (Git) and writing clean, reproducible, and modular code.
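As a concrete taste of the information-theoretic quantities listed above (a self-contained NumPy sketch, not project code): predictive entropy and the BALD-style mutual information I = H[E_t p_t] - E_t H[p_t], computed from Monte Carlo posterior samples. The mutual information isolates epistemic disagreement between posterior draws, which plain entropy cannot.

```python
import numpy as np

def predictive_entropy(p):
    """H[p] in nats for a (..., C) array of class probabilities."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def bald_mutual_information(mc_probs):
    """BALD score from MC posterior samples of shape (T, C):
    I = H[mean_t p_t] - mean_t H[p_t]  (epistemic disagreement)."""
    mean_p = mc_probs.mean(axis=0)
    return predictive_entropy(mean_p) - predictive_entropy(mc_probs).mean()

# Two hypothetical MC-dropout posteriors over 2 classes:
agree = np.array([[0.5, 0.5],
                  [0.5, 0.5]])   # aleatoric: every draw is uncertain
disagree = np.array([[0.9, 0.1],
                     [0.1, 0.9]])  # epistemic: draws disagree sharply

# Both have identical mean-predictive entropy, yet only the
# disagreeing posterior yields non-zero mutual information.
```

This distinction is why BALD-style scores are a common starting point for the calibration and robustness work in O2.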