Primary supervisor
Trang Vu

Traditional active learning reduces labeling costs by selecting the most informative examples from a large pool of unlabeled data. In many real-world settings, however, such a pool does not exist or is expensive to collect. This project explores a new approach that uses large language models (LLMs) to generate synthetic unlabeled text instead. Rather than only selecting data to label, the model will also generate new examples that are diverse and potentially helpful for learning. The aim is to reduce both the amount of data that needs to be collected and the number of labels required to train accurate models. The project will implement and evaluate different generation strategies to ensure the diversity and coverage of the synthetic data, and will integrate them with active learning to help the model learn faster and better; a minimal sketch of this generate-then-select loop is given below.
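The sketch below is purely illustrative, assuming a Hugging Face `text-generation` pipeline as the synthetic-data generator and a placeholder `prob_fn` that returns class probabilities from the task classifier being trained. The model choice, prompt, and function names are hypothetical, not the project's prescribed design.

```python
"""Hypothetical sketch: generate synthetic candidates with an LLM, then
select the most uncertain ones for labeling. Names are placeholders."""
import torch
from transformers import pipeline

# Small stand-in LLM; the project would use a much stronger generator.
generator = pipeline("text-generation", model="gpt2")

def generate_synthetic_pool(prompt: str, n: int) -> list[str]:
    """Sample n candidate texts; temperature/top-p sampling encourages diversity."""
    outputs = generator(
        prompt,
        max_new_tokens=40,
        num_return_sequences=n,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
    )
    # The pipeline returns prompt plus continuation; keep only the continuation.
    return [o["generated_text"][len(prompt):].strip() for o in outputs]

def predictive_entropy(probs: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of class probabilities; higher means more uncertain."""
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def select_for_labeling(texts: list[str], prob_fn, k: int) -> list[str]:
    """Pick the k generated texts the current task model is least sure about.

    `prob_fn` (hypothetical) maps a list of texts to an (n, num_classes)
    tensor of class probabilities from the classifier being actively trained.
    """
    scores = predictive_entropy(prob_fn(texts))
    top = scores.topk(min(k, len(texts))).indices.tolist()
    return [texts[i] for i in top]

# Usage (hypothetical):
#   pool = generate_synthetic_pool("Write a short movie review:", 100)
#   queries = select_for_labeling(pool, classifier_probs, k=10)
```

Entropy-based uncertainty is only one possible selection criterion; the project would also explore generation and selection strategies that explicitly target the diversity and coverage of the synthetic pool.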
Student cohort
Aim/outline
- An open-source codebase
- A publication at an NLP venue such as ACL/EMNLP/NAACL/EACL
URLs/references
* Xia, Yu, et al. "From Selection to Generation: A Survey of LLM-based Active Learning." arXiv preprint arXiv:2502.11767 (2025).
Required knowledge
- Must: fluency in Python and PyTorch
- Must: academic or working knowledge of Large Language Models
- Must: solid grasp of basic machine learning concepts (both theory and practice)
- Preferred: experience building or fine-tuning a small language model (e.g., LLaMA)
- Preferred: interested in doing a PhD