Primary supervisor
Thanh Thi Nguyen

This project involves the automated generation of textual descriptions for audio content such as spoken language, sound events, and music. This task typically employs deep learning techniques, such as recurrent neural networks and transformer models, to analyse audio signals and generate coherent captions. By training on large datasets that pair audio recordings with corresponding textual descriptions, these models learn to recognise patterns and contextual meaning within the audio. The project entails collecting and generating audio-description datasets to create a robust foundation for analysis. In addition, various deep learning models will be proposed and implemented to explore their effectiveness in processing and interpreting audio data. Finally, a comprehensive evaluation will be conducted to assess the performance of these models, identifying their strengths and areas for improvement.
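To illustrate the encoder-decoder approach described above, below is a minimal sketch of a transformer-based audio captioning model, assuming PyTorch and torchaudio. The AudioCaptioner class, layer sizes, and vocabulary size are illustrative placeholders, not the project's actual design:

# A minimal sketch of an encoder-decoder audio captioning model,
# assuming PyTorch and torchaudio; all hyperparameters are illustrative.
import torch
import torch.nn as nn
import torchaudio

class AudioCaptioner(nn.Module):
    def __init__(self, n_mels=64, d_model=256, vocab_size=5000):
        super().__init__()
        # Project each mel-spectrogram frame to the model dimension.
        self.frame_proj = nn.Linear(n_mels, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, time, n_mels) audio features;
        # tokens: (batch, seq) caption prefix token ids.
        src = self.frame_proj(mel)
        tgt = self.embed(tokens)
        # Causal mask so each caption position only attends to earlier ones.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(hidden)  # (batch, seq, vocab_size) next-token logits

# Usage on a dummy clip: 16 kHz waveform -> log-mel features -> logits.
wave = torch.randn(1, 16000)
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)(wave)
mel = mel.clamp(min=1e-5).log().transpose(1, 2)  # (batch, time, n_mels)
model = AudioCaptioner()
logits = model(mel, torch.zeros(1, 8, dtype=torch.long))

In practice, systems of this kind often replace the from-scratch encoder with a pretrained audio network and the decoder with a pretrained language model, fine-tuned on paired audio-caption data; the sketch above only shows the overall data flow.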
Student cohort
Required knowledge
Python programming
Machine learning background