Primary supervisor
Trang Vu

Traditional active learning reduces labeling costs by selecting the most informative examples from a large pool of unlabeled data. In many real-world settings, however, such a pool does not exist or is expensive to collect. This project explores a new approach that uses large language models (LLMs) to generate synthetic unlabeled text instead. Rather than only selecting data to label, the model will also generate new examples that are diverse and potentially helpful for learning. The aim is to reduce both the amount of data that needs to be collected and the number of labels required to train accurate models. The project will implement and evaluate different generation strategies to ensure the diversity and coverage of the synthetic data, and will integrate them with active learning to help the model learn faster and better; a minimal sketch of this generate-then-select loop is given below.
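The sketch below is purely illustrative, assuming a Hugging Face `text-generation` pipeline as the synthetic-data generator and a placeholder `prob_fn` that returns class probabilities from the task classifier being trained. The model choice, prompt, and function names are hypothetical, not the project's prescribed design.

```python
"""Hypothetical sketch: generate synthetic candidates with an LLM, then
select the most uncertain ones for labeling. Names are placeholders."""
import torch
from transformers import pipeline

# Small stand-in LLM; the project would use a much stronger generator.
generator = pipeline("text-generation", model="gpt2")

def generate_synthetic_pool(prompt: str, n: int) -> list[str]:
    """Sample n candidate texts; temperature/top-p sampling encourages diversity."""
    outputs = generator(
        prompt,
        max_new_tokens=40,
        num_return_sequences=n,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
    )
    # The pipeline returns prompt plus continuation; keep only the continuation.
    return [o["generated_text"][len(prompt):].strip() for o in outputs]

def predictive_entropy(probs: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of class probabilities; higher means more uncertain."""
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def select_for_labeling(texts: list[str], prob_fn, k: int) -> list[str]:
    """Pick the k generated texts the current task model is least sure about.

    `prob_fn` (hypothetical) maps a list of texts to an (n, num_classes)
    tensor of class probabilities from the classifier being actively trained.
    """
    scores = predictive_entropy(prob_fn(texts))
    top = scores.topk(min(k, len(texts))).indices.tolist()
    return [texts[i] for i in top]

# Usage (hypothetical):
#   pool = generate_synthetic_pool("Write a short movie review:", 100)
#   queries = select_for_labeling(pool, classifier_probs, k=10)
```

Entropy-based uncertainty is only one possible selection criterion; the project would also explore generation and selection strategies that explicitly target the diversity and coverage of the synthetic pool.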
Student cohort
Aim/outline
- An open-source codebase
- A publication at an NLP venue such as ACL/EMNLP/NAACL/EACL
URLs/references
* Xia, Yu, et al. "From Selection to Generation: A Survey of LLM-based Active Learning." arXiv preprint arXiv:2502.11767 (2025).
Required knowledge
- Must: fluency in Python and PyTorch
- Must: academic or working knowledge of Large Language Models
- Must: solid grasp of basic machine learning concepts (both theory and practice)
- Preferred: experience building or fine-tuning a small language model (e.g., LLaMA)
- Preferred: interested in doing a PhD