
Active learning for a text classifier using small data

Primary supervisor

Wray Buntine

Co-supervisors


Text classification has extensive uses, and deep learning with transformer language models has improved its performance.  A major hurdle for its use, however, is the paucity of labelled or annotated data: labelling by domain experts is expensive and tedious.  Active learning is an approach to speeding up learning by judiciously selecting the data to be annotated.  Recent advances have been made in active learning theory, but some experimental anomalies remain that need investigating.  Notably, Figure 4 of the BatchBALD paper (see references below) shows that acquiring data in small batches always beats a one-step lookahead technique.
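As a rough illustration of what "judiciously selecting data" can mean, the sketch below scores each unlabelled example with the BALD mutual-information criterion and takes the top-scoring ones as a batch. It is a minimal sketch under stated assumptions, not the project's system: the function names (entropy, bald_score, select_top_k) and the toy pool are hypothetical, and the class probabilities are assumed to come from several stochastic forward passes of a classifier (for example, with dropout left on). Note that this is simple top-k selection by individual score; BatchBALD instead scores a batch jointly, which this sketch does not attempt.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def bald_score(mc_probs):
    """BALD score: entropy of the mean prediction minus the mean entropy.

    mc_probs is a list of class-probability vectors, one per stochastic
    forward pass (an assumption of this sketch).
    """
    k = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean_probs = [sum(s[c] for s in mc_probs) / k for c in range(n_classes)]
    return entropy(mean_probs) - sum(entropy(s) for s in mc_probs) / k

def select_top_k(pool_mc_probs, batch_size):
    """Return indices of the batch_size pool items with the highest score."""
    scores = [bald_score(m) for m in pool_mc_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]

# Hypothetical pool of three unlabelled examples, two forward passes each.
pool = [
    [[0.5, 0.5], [0.5, 0.5]],      # uncertain, but the passes agree
    [[0.9, 0.1], [0.1, 0.9]],      # the passes disagree
    [[0.99, 0.01], [0.98, 0.02]],  # confident, and the passes agree
]
chosen = select_top_k(pool, 1)
```

In this toy pool the second example is chosen, because its forward passes disagree; the first example is equally uncertain on average but carries no disagreement, so labelling it is expected to tell the model little.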
 

Student cohort

Single Semester
Double Semester

Aim/outline

The project's objective is to develop an empirical understanding of the batch anomaly in active learning and to further develop the algorithm.  This will build on an advanced active learning system for text classification implemented in PyTorch.  Ideally, relevant theory will also be developed, with support from the supervisory team.
 

URLs/references

"Active learning literature survey," Burr Settles.  Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
"BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning," Andreas Kirsch, Joost van Amersfoort, and Yarin Gal.  In Advances in Neural Information Processing Systems, 2019.  https://arxiv.org/abs/1906.08158

Required knowledge

Practical knowledge of modern deep learning methods, extensive experience with Python programming, and some experience with PyTorch.

Standard Machine Learning, Artificial Intelligence and Natural Language Processing as covered in masters or advanced undergraduate subjects.

Good understanding of Machine Learning principles.