Despite the popularity of text analysis services offered by major technology companies, it remains challenging to develop and deploy NLP applications that involve sensitive and demographic information, especially when that information must be shared with transparency and legislative compliance. Differential privacy (DP) is widely applied to protect the privacy of individuals by achieving an attractive trade-off between utility of information and confidentiality. However, current DP techniques largely neglect the readability and naturalness of privacy-preserving representations: they produce either non-human-readable representations, such as distributed representations of text perturbed with random noise [1], or unnatural text curated by replacing sensitive tokens with random non-sensitive ones [2]. First, such representations complicate transparency and compliance checks against data protection and privacy legislation (e.g., the GDPR), whether those checks are performed by humans or by computer systems. Second, both privacy-preserving distributed representations and sanitized texts are difficult to share with models for downstream tasks, because those models need to be re-trained or fine-tuned on the new representations. Lastly, DP does not guarantee that what one believes to be one's secrets will remain secret: a DP algorithm cannot ensure that private attributes are not inferable from publicly observable attributes when the two are strongly correlated. Therefore, this project aims to devise novel rewriting techniques that find a proper trade-off between utility of information and individual needs for privacy protection, while providing transparency and readability by ensuring the naturalness of rewrites.
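For concreteness, the two families of techniques criticized above can be sketched as (i) the Laplace mechanism applied to a norm-clipped text embedding and (ii) the exponential mechanism selecting a replacement for a sensitive token. The function names, the L1-clipping choice, and the similarity-based utility below are illustrative assumptions, not the exact mechanisms of [1] or [2]:

```python
import math
import random

def sample_laplace(scale: float) -> float:
    """Draw from Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def privatize_embedding(vec, epsilon, clip_bound=1.0):
    """DP-perturb a text embedding: clip its L1 norm to clip_bound, then add
    Laplace noise calibrated to the resulting L1 sensitivity (2 * clip_bound)."""
    norm = sum(abs(x) for x in vec)
    if norm > clip_bound:
        vec = [x * clip_bound / norm for x in vec]
    scale = 2.0 * clip_bound / epsilon
    return [x + sample_laplace(scale) for x in vec]

def sanitize_token(token, candidates, similarity, epsilon):
    """Replace a sensitive token via the exponential mechanism: candidates
    more similar to the original are exponentially more likely to be chosen.
    `similarity` is assumed to lie in [0, 1], giving sensitivity 1."""
    weights = [math.exp(epsilon * similarity(token, c) / 2.0) for c in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```

Note that the first sketch outputs an unreadable noisy vector and the second outputs text whose naturalness in context is not controlled, which are precisely the two shortcomings the project targets.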
[1] Lingjuan Lyu, Xuanli He, and Yitong Li. Differentially Private Representation for NLP: Formal Guarantee and An Empirical Study on Privacy and Fairness. In Findings of the ACL, pages 2355–2365, 2020.
[2] Xiang Yue, Minxin Du, Tianhao Wang, Yaliang Li, Huan Sun, and Sherman SM Chow. Differential Privacy for Text Analytics via Natural Text Sanitization. In Findings of the ACL-IJCNLP, 2021.
Required knowledge
Candidates are expected to have a solid background in machine learning and language technology. Preference will be given to candidates who have strong written and oral communication skills, as well as strong programming skills. It is desirable that candidates already have research experience in at least one of the following areas: deep learning, deep reinforcement learning, causality, natural language generation, and differential privacy.