A Data-Centric Study of Dataset Quality for TTP Extraction

Primary supervisor

Mengmeng Ge

Cyber Threat Intelligence (CTI) plays a vital role in today's cybersecurity landscape: it involves collecting and analysing data about current and potential threats, yielding insights that help organisations understand, mitigate, and respond to those threats in an ever-evolving environment. A core component of CTI is the identification of adversarial Tactics, Techniques, and Procedures (TTPs), which describe how attackers operate at a strategic and operational level. These TTPs are commonly structured using frameworks such as MITRE ATT&CK and are widely used to support threat hunting, attacker attribution, and incident response.

In recent years, substantial research effort has focused on automating the extraction of TTPs from unstructured CTI reports using Natural Language Processing (NLP) and machine learning techniques. While increasingly sophisticated models, including large language models (LLMs), have been proposed, recent systematisation studies reveal that performance improvements remain insufficient for reliable real-world deployment. A key reason identified is the lack of high-quality datasets for training and evaluating TTP extraction systems.

Aim/outline

This project focuses on a data-centric investigation of TTP extraction. It aims to examine existing datasets used for TTP extraction evaluation, analyse their limitations in depth, and explore ways to improve dataset quality and evaluation practices. The main objectives are listed below.

To survey and analyse publicly available datasets used for TTP extraction.
To identify the sources of quality issues in these datasets and study how such limitations affect model evaluation.
To propose practical recommendations and design guidelines for creating higher-quality TTP extraction datasets.
To implement small-scale analyses or experiments demonstrating how dataset improvements could lead to more reliable evaluation.
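
As an illustration of the kind of small-scale analysis the last objective has in mind, the sketch below audits a toy dataset for three common quality issues: label imbalance, near-verbatim duplicates, and conflicting labels for the same text. The data format (a list of sentence and technique-ID pairs), the function name audit_dataset, and the example sentences are assumptions made for illustration only; they are not taken from any existing TTP extraction dataset.

```python
from collections import Counter


def audit_dataset(samples):
    """Summarise basic quality indicators for (sentence, technique_id) pairs.

    The pair format is a hypothetical simplification; real datasets may
    attach several techniques to one sentence or label whole reports.
    """
    label_counts = Counter(label for _, label in samples)

    # Group labels by normalised text (lower-cased, whitespace collapsed)
    # so that trivially different copies are treated as the same sentence.
    labels_by_text = {}
    for text, label in samples:
        key = " ".join(text.lower().split())
        labels_by_text.setdefault(key, []).append(label)

    # Every copy beyond the first counts as one duplicate.
    duplicates = sum(len(labels) - 1 for labels in labels_by_text.values())
    # The same sentence annotated with different techniques is a conflict.
    conflicts = sorted(
        text for text, labels in labels_by_text.items() if len(set(labels)) > 1
    )
    majority_share = max(label_counts.values()) / len(samples) if samples else 0.0

    return {
        "label_counts": dict(label_counts),
        "duplicates": duplicates,
        "conflicts": conflicts,
        "majority_share": majority_share,
    }


# Made-up sentences labelled with MITRE ATT&CK technique IDs.
samples = [
    ("The malware creates a scheduled task.", "T1053"),
    ("the malware creates a  scheduled task.", "T1053"),
    ("Attackers sent a phishing email.", "T1566"),
    ("Attackers sent a phishing email.", "T1204"),
]

report = audit_dataset(samples)
for name, value in report.items():
    print(f"{name}: {value}")
```

Duplicates matter for evaluation because copies that land in both the training and test splits can inflate reported scores, while conflicting labels put a ceiling on the accuracy any model can achieve.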

Required knowledge

Strong Python programming skills.
Some interest in, and basic knowledge of, cybersecurity and/or machine learning.
