Audio captioning using machine learning

Primary supervisor

Thanh Thi Nguyen

This project involves the automated generation of textual descriptions for audio content, such as spoken language, sound events, or music. This task typically employs deep learning techniques, such as recurrent neural networks and transformer models, to analyse audio signals and generate coherent captions. By training on large datasets that pair audio recordings with corresponding textual descriptions, these models learn to recognise patterns and contextual meanings within the audio. This project entails the collection and generation of audio-description datasets to create a robust foundation for analysis. In addition, various deep learning models will be proposed and implemented to explore their effectiveness in processing and interpreting the audio data. Finally, a comprehensive evaluation will be conducted to assess the performance of these models, identifying their strengths and areas for improvement.
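To make the pipeline concrete, the sketch below shows the two stages an audio-captioning model couples together: extracting time-frequency features from a waveform, then autoregressively decoding caption tokens from an encoded representation. This is a minimal, illustrative toy in NumPy with random (untrained) weights and an invented six-word vocabulary; a real system would use learned mel-filterbank features and a trained recurrent or transformer decoder, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectrogram(signal, frame_len=256, hop=128):
    """Frame the waveform and take FFT magnitudes (a stand-in for log-mel features)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))  # shape: (time, freq)

# Toy vocabulary; a real captioner uses a subword tokenizer over the training captions.
VOCAB = ["<eos>", "a", "dog", "barks", "music", "plays"]

def greedy_caption(features, max_len=5):
    """Pool features into a context vector, then greedily emit tokens one at a time.
    The decoder weights are random: this demonstrates the decoding loop, not a trained model."""
    h = features.mean(axis=0)                  # crude "encoder": mean-pool over time
    W = rng.normal(size=(len(VOCAB), h.size))  # untrained token-projection weights
    tokens = []
    for _ in range(max_len):
        logits = W @ h                         # score each vocabulary token
        tok = int(np.argmax(logits))           # greedy choice (beam search in practice)
        if VOCAB[tok] == "<eos>":
            break                              # stop when the end-of-sequence token wins
        tokens.append(VOCAB[tok])
        h = np.roll(h, 1)                      # trivial stand-in for a recurrent state update
    return " ".join(tokens)

audio = rng.normal(size=16000)                 # 1 second of synthetic audio at 16 kHz
feats = spectrogram(audio)
caption = greedy_caption(feats)
```

Training replaces the random weights by minimising cross-entropy between predicted tokens and the reference captions in the paired dataset, which is why collecting audio-description pairs is the project's first step.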

Student cohort

Single Semester
Double Semester

Required knowledge

Python programming

Machine learning background