Prioritizing sample annotation for deep learning applications

Primary supervisor

Xiaoning Du

Despite rapid recent progress, deep learning (DL) approaches remain data-hungry: achieving their optimum performance requires a significantly large amount of labelled data. Very often, unlabelled data is abundant, but acquiring labels is costly and difficult; many domains, such as medicine, require a specialist to annotate the samples. Data dependency has therefore become one of the limiting factors in applying deep learning to many real-world scenarios. As reported, labelling ImageNet, one of the largest visual recognition datasets with millions of images in more than 20,000 categories, took more than 49,000 workers from 167 countries about 9 years. To make the training and evaluation of DL applications more efficient, there is an increasing need to make the most of limited resources and select the most valuable inputs for manual annotation. This project aims to address the problem of prioritizing error-revealing samples from a large pool of unlabelled data for various DL tasks.
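One common family of prioritization strategies ranks unlabelled samples by model uncertainty, on the intuition that inputs the model is least confident about are more likely to reveal errors. As an illustrative sketch only (the function name, toy data, and entropy criterion are assumptions for this example, not the project's prescribed method), predictive entropy over a model's softmax outputs can be used to pick an annotation batch:

```python
import numpy as np

def prioritize_by_entropy(probs: np.ndarray, budget: int) -> np.ndarray:
    """Rank unlabelled samples by predictive entropy (higher = more
    uncertain) and return the indices of the top `budget` candidates
    for manual annotation."""
    eps = 1e-12  # avoid log(0) for confident predictions
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    # Sort descending by entropy and keep the annotation budget's worth.
    return np.argsort(entropy)[::-1][:budget]

# Toy example: softmax outputs for 4 samples over 3 classes.
# Sample 2 is nearly uniform (most uncertain), sample 0 most confident.
probs = np.array([
    [0.98, 0.01, 0.01],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
    [0.90, 0.05, 0.05],
])
print(prioritize_by_entropy(probs, budget=2))  # → [2 1]
```

In practice, error-revealing power and uncertainty are not the same thing, so research in this space often combines uncertainty with other signals (e.g., diversity of the selected batch); this sketch only illustrates the selection problem the project addresses.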

Student cohort

Double Semester

Required knowledge

Natural language processing, software testing