
Developing a pipeline for semi-automatic annotation with transfer learning

Primary supervisor

Leimin Tian

Co-supervisors


  • Dr Pamela Carreno-Medrano (Engineering)

The quality of machine learning (ML) and deep learning models depends on the quality of the data they are trained on. Data annotation is therefore an essential task in many ML studies, especially when the goal is to analyze data with subjective labels, such as dialogue topics or a person's emotions. Manual annotation of such subjective data is costly. Recent developments in transfer learning and few-shot learning have enabled a more efficient approach, semi-automatic annotation, in which a classification model is trained on a small amount of manually labelled data and then generates reasonably accurate labels for the rest of the dataset. The goal of this project is to develop a semi-automatic annotation pipeline using transfer learning.

Student cohort

Double Semester

Aims


  1. Develop a semi-automatic annotation model capable of learning an annotation scheme from a small amount (10%) of labelled training data.
  2. Integrate the semi-automatic annotation model with an existing annotation tool to support manual input of gold-standard labels, auto-completion of the remaining labels (each with a confidence level given by the annotation model trained on the gold-standard labels), and manual correction of the auto-completed labels.
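
The workflow in the two aims above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the project's prescribed design: it assumes each item has already been turned into a feature vector by a frozen pre-trained encoder (the transfer-learning step, stubbed here with hand-written 2-D vectors), it swaps in a simple nearest-centroid classifier for the annotation model, and the function names, the example labels, and the 0.7 review threshold are all hypothetical.

```python
import math
from collections import defaultdict


def fit_centroids(features, labels):
    """Average the feature vectors of each gold-standard label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict_with_confidence(centroids, vec):
    """Return (label, confidence) from a softmax over negative distances."""
    labels = sorted(centroids)
    scores = [-math.dist(vec, centroids[label]) for label in labels]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


def auto_annotate(centroids, unlabelled, threshold=0.7):
    """Suggest a label per item; flag low-confidence ones for manual review."""
    suggestions = []
    for vec in unlabelled:
        label, conf = predict_with_confidence(centroids, vec)
        suggestions.append(
            {"label": label, "confidence": conf, "needs_review": conf < threshold}
        )
    return suggestions


# Hypothetical data: a small manually labelled subset and two unlabelled items.
gold_features = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.8, 3.1]]
gold_labels = ["neutral", "neutral", "happy", "happy"]
unlabelled_features = [[0.1, 0.0], [1.5, 1.5]]

model = fit_centroids(gold_features, gold_labels)
for suggestion in auto_annotate(model, unlabelled_features):
    print(suggestion)
```

In an annotation tool, items with `needs_review` set would be routed back to the annotator, and any corrected labels could be added to the gold-standard set before refitting the model.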


References

  1. Heimerl, A., Baur, T., Lingenfelser, F., Wagner, J. and André, E., 2019, September. NOVA-a tool for eXplainable Cooperative Machine Learning. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII) (pp. 109-115). IEEE.
  2. ELAN (Version 6.0) [Computer software]. (2020). Nijmegen: Max Planck Institute for Psycholinguistics, The Language Archive. Retrieved from
  3. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  4. Rietz, T. and Maedche, A., 2021, May. Cody: An AI-Based System to Semi-Automate Coding for Qualitative Research. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-14).
  5. Marathe, M. and Toyama, K., 2018, April. Semi-automated coding for qualitative research: A user-centered inquiry and initial prototypes. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (pp. 1-12).

Required knowledge

Machine Learning and Deep Learning, Python Programming