Data-Efficient Deep Learning for De Novo Molecular Design from Analytical Spectra

Primary supervisor

Lan Du

Research area

Machine Learning

Project Background and Motivation

The "inverse design" of molecules from analytical spectra, such as tandem mass spectrometry (MS2), nuclear magnetic resonance (NMR), or infrared (IR) spectra, is a fundamental bottleneck in analytical chemistry, metabolomics, and drug discovery. While deep generative models have shown promise in proposing novel molecular structures, they typically require massive, cleanly labelled datasets to train effectively.

In practice, acquiring high-quality spectral data from wet-lab experiments is expensive and time-consuming. Furthermore, relying on a single spectral modality often leads to ambiguous generation, as different molecules can yield similar spectra (for example, near-identical fragmentation patterns in MS2). There is a critical need for generative frameworks that can learn from limited data by intelligently querying the most informative samples, and that can fuse multiple views of analytical data to guide the generation of accurate 2D/3D molecular graphs or SMILES representations.

Research Aims and Objectives

This project aims to develop a robust, data-efficient deep learning framework capable of translating complex analytical spectra into valid molecular structures.

Multimodal and Multiview Spectral Encoding. Develop advanced neural architectures (e.g., adapted Vision Transformers or 1D sequence models) capable of fusing multiple distinct analytical "views" (e.g., combining MS2 and NMR data) into a rich, shared latent representation.
Conditional Molecular Generation. Design and train a conditional generative model that maps these fused spectral representations directly to 1D SMILES strings or 2D/3D molecular graphs.
Uncertainty-Guided Molecular Generation. Design and train a conditional generative model that dynamically adapts to the confidence of the spectral representations. By quantifying predictive uncertainty, the framework will guide the generation process: it will constrain the output to specific substructures when the spectral evidence is strong, and prioritise structural diversity to cover the range of plausible structures when the spectral signals are ambiguous.
Active Learning for Data Efficiency. Design and implement novel active learning algorithms tailored for deep generative models. The system will iteratively evaluate its own performance and identify the most informative "missing" spectral data points, minimising the number of costly wet-lab experiments required to train the model to state-of-the-art accuracy.
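As a rough illustration of the active learning objective above, the sketch below ranks unlabelled pool samples by predictive entropy and picks the most uncertain ones as candidates for the next wet-lab measurement. It is only one common acquisition heuristic, not the project's method. The function names (`predictive_entropy`, `select_queries`), the idea of a fixed candidate set per sample, and the toy probabilities are all assumptions made for the example.

```python
import numpy as np


def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of each row of a (n_samples, n_candidates) array.

    Each row is assumed to be the model's predictive distribution over a
    hypothetical set of candidate structures for one pool sample.
    """
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)


def select_queries(probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` most uncertain samples, most uncertain first."""
    scores = predictive_entropy(probs)
    return np.argsort(-scores)[:budget]


# Toy pool of three samples (made-up numbers, for illustration only).
pool_probs = np.array([
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally ambiguous
    [0.60, 0.30, 0.05, 0.05],  # moderately ambiguous
])

# With a budget of two experiments, the two most ambiguous samples are chosen.
queries = select_queries(pool_probs, budget=2)
print(queries)
```

The same entropy score could, under similar assumptions, serve as the confidence signal for uncertainty-guided generation: low entropy would favour tightly constrained outputs, and high entropy would favour more diverse candidates.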

Expected Outcomes

A suite of novel, data-efficient algorithms for inverse molecular design.
High-impact publications in premier machine learning conferences (e.g., ICML, NeurIPS, ICLR) and leading interdisciplinary journals at the intersection of AI and chemistry.
Open-source software tools that bridge the gap between IT-driven deep learning research and practical analytical chemistry applications.

Required knowledge

The ideal candidate will have a strong background in Computer Science, Data Science, or Information Technology. Demonstrated proficiency in deep learning frameworks (PyTorch) is required. Prior experience or strong theoretical knowledge in active learning, multimodal deep learning, diffusion models, or representation learning is highly desirable.

A background in chemistry is not required, provided the candidate is willing to engage with chemical informatics.
