Primary supervisor

Lizhen Qu

Modern NLP applications increasingly process text carrying sensitive personal information, including clinical conversations, legal correspondence, customer support transcripts, and social media posts. Sharing such text with third-party models, annotators, or downstream pipelines remains constrained by data protection and AI regulation (e.g., the GDPR and the EU AI Act) and by growing user expectations around transparency.

Differential privacy (DP) provides formal protection but typically yields representations that are either not human-readable, such as distributed representations augmented with random noise [1], or, when materialised as text, unnatural and disfluent because tokens are replaced at random [2]. Such outputs complicate compliance audits, force downstream models to be re-trained on the noised distribution, and offer no protection when private attributes are statistically correlated with publicly observable ones. Rule-based redaction and named-entity scrubbing yield human-readable outputs but break utterance-level semantics by substituting sensitive spans with placeholder tokens such as <PERSON> or <LOCATION>, which degrades downstream tasks that depend on coherent context.

Privacy-aware text rewriting [3] addresses these limitations by reformulating privacy preservation as a generation task: produce a rewrite that removes or attenuates sensitive content while preserving meaning, fluency, and downstream utility. Building on the NAP² benchmark for evaluating the naturalness-privacy trade-off [4] and recent zero-shot rewriting methods based on iterative tree search [5], this project will develop the next generation of privacy-aware rewriting techniques.

The student will investigate (i) controllable rewriting that lets data subjects specify which categories of attributes to protect and at what granularity, (ii) calibrated privacy guarantees for rewrites against membership-inference and attribute-inference adversaries, including settings where private attributes are correlated with public ones, and (iii) extensions to multi-turn dialogue and domain-specific text such as clinical interviews and legal records. The goal is a rewriting framework whose outputs are simultaneously useful for downstream tasks, natural enough to be reviewed by humans, and auditable for transparency and regulatory compliance.

[1] Lingjuan Lyu, Xuanli He, and Yitong Li. Differentially Private Representation for NLP: Formal Guarantee and An Empirical Study on Privacy and Fairness. In Findings of the ACL: EMNLP 2020, pages 2355–2365, 2020.

[2] Xiang Yue, Minxin Du, Tianhao Wang, Yaliang Li, Huan Sun, and Sherman S. M. Chow. Differential Privacy for Text Analytics via Natural Text Sanitization. In Findings of the ACL-IJCNLP, 2021.

[3] Qiongkai Xu, Lizhen Qu, Chenchen Xu, and Ran Cui. Privacy-Aware Text Rewriting. In Proceedings of INLG, 2019.

[4] Shuo Huang, William Maclean, Xiaoxi Kang, Qiongkai Xu, Zhuang Li, Xingliang Yuan, Gholamreza Haffari, and Lizhen Qu. NAP²: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human. In Findings of EMNLP, 2025.

[5] Shuo Huang, Xingliang Yuan, Gholamreza Haffari, and Lizhen Qu. Zero-Shot Privacy-Aware Text Rewriting via Iterative Tree Search. In Findings of EMNLP, pages 9175–9190, 2025.

Required knowledge

Candidates are expected to have a solid background in machine learning and language technology. Preference will be given to candidates with strong written and oral communication skills and strong programming skills. Prior research experience in at least one of the following areas is desirable: deep learning, deep reinforcement learning, causality, natural language generation, or differential privacy.