
Extreme Multi-label Text Classification with Metadata and Pretrained Knowledge

Primary supervisor

Ethan Zhao

Co-supervisors


In multi-label classification, a data sample can be associated with more than one active label, which makes the task more challenging than conventional single-label classification. This project focuses on eXtreme Multi-Label (XML) classification of text data (i.e., documents), where the label set can be extremely large, e.g., more than 10,000 labels. For example, the input texts can be item descriptions from an e-commerce website (e.g., Amazon), which need to be classified into a large set of item categories. The project will develop novel machine learning and deep learning models for XML classification of text data by leveraging document metadata and the knowledge in pretrained language models.
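To make the multi-label setting concrete, here is a minimal sketch in plain Python. It is an illustration only, not the method this project will develop: the toy scores, the six-label set, and the helper names (`multilabel_loss`, `top_k`, `precision_at_k`) are all assumptions introduced here. It treats each label as an independent binary decision and ranks labels by score, evaluating the ranking with precision@k, a metric commonly used when the label set is very large.

```python
import math


def sigmoid(score):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def multilabel_loss(scores, active_labels):
    """Mean binary cross-entropy, with one independent decision per label."""
    total = 0.0
    for label, score in enumerate(scores):
        p = sigmoid(score)
        target = 1.0 if label in active_labels else 0.0
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / len(scores)


def top_k(scores, k):
    """Return the indices of the k highest-scoring labels, best first."""
    ranked = sorted(range(len(scores)), key=lambda label: scores[label], reverse=True)
    return ranked[:k]


def precision_at_k(scores, active_labels, k):
    """Fraction of the top-k predicted labels that are truly active."""
    hits = sum(1 for label in top_k(scores, k) if label in active_labels)
    return hits / k


# Toy document with 6 candidate labels (hypothetical); labels 1 and 4 are active.
# In an XML setting the same structure would span 10,000+ labels.
scores = [-2.0, 3.0, -1.0, 0.5, 2.0, -3.0]
active = {1, 4}

print(top_k(scores, 2))
print(precision_at_k(scores, active, 2))
print(multilabel_loss(scores, active))
```

Storing the active labels as a set of indices, rather than a dense 0/1 vector, is one simple way to keep memory manageable when only a handful of labels out of thousands are active per document.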

Student cohort

Single Semester
Double Semester

Aim/outline

We aim to propose a new method for XML classification of text data. The primary goal is to publish the proposed method in top machine learning, data mining, and natural language processing venues (i.e., CORE A* or A ranked conferences). The secondary goal is to develop a demo and a research code package to accompany the publications. This project is particularly suitable for students who aim to pursue research degrees in machine learning and deep learning. The planned publications are expected to strengthen their future PhD applications.

Required knowledge

  • Proficiency in Python, especially TensorFlow and/or PyTorch
  • Foundations of machine learning and deep learning (e.g., FIT3181 or FIT5215)
  • Basic knowledge of probability and statistics
  • Prior familiarity with pretrained language models (e.g., BERT) is preferred but not required