Primary supervisor

Enes Makalic

Co-supervisors

  • Lisa Ellis

Background and Motivation

Australian healthcare generates substantial volumes of unstructured clinical documentation, including referral letters, discharge summaries, specialist correspondence, pathology reports, and medication lists. These documents vary significantly in format, terminology, and quality across providers and institutions. Large language models (LLMs) have demonstrated promising capability in extracting structured information from unstructured text, yet their performance on Australian medical documents specifically remains poorly characterised. As AI-assisted health information tools increasingly enter consumer- and patient-facing applications, understanding where these models succeed and fail becomes a patient-safety question, not merely a technical one. This project addresses a significant gap in the Australian health AI literature, with findings applicable across digital health, clinical decision support, and health information management.

Aim/outline

Project Description

This project will systematically evaluate the accuracy and failure modes of current large language models when parsing unstructured Australian medical documents. The student will develop a benchmarking framework and annotated document corpus, evaluate model performance across document types and clinical specialties, and classify errors by frequency and potential for downstream clinical harm.
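For illustration only, the sketch below shows one simple form such a benchmarking framework could take: field-level precision and recall for model extractions scored against gold-standard annotations. The field names, the assumption that every field is a list of strings, and the exact-match comparison are simplifications chosen for the example rather than a prescribed design.

    # Illustrative sketch: field-level precision/recall for extracted clinical
    # fields scored against gold-standard annotations. Field names and the
    # list-of-strings layout are assumptions for the example.
    from collections import Counter

    FIELDS = ["medications", "conditions", "dosages", "dates", "provider"]

    def evaluate(predictions: list[dict], gold: list[dict]) -> dict:
        """Compute per-field precision and recall over paired documents."""
        counts = Counter()
        for pred, ref in zip(predictions, gold):
            for field in FIELDS:
                pred_vals = {v.lower() for v in pred.get(field, [])}
                gold_vals = {v.lower() for v in ref.get(field, [])}
                counts[f"{field}_tp"] += len(pred_vals & gold_vals)
                counts[f"{field}_fp"] += len(pred_vals - gold_vals)
                counts[f"{field}_fn"] += len(gold_vals - pred_vals)
        results = {}
        for field in FIELDS:
            tp, fp, fn = counts[f"{field}_tp"], counts[f"{field}_fp"], counts[f"{field}_fn"]
            results[field] = {
                "precision": tp / (tp + fp) if (tp + fp) else 0.0,
                "recall": tp / (tp + fn) if (tp + fn) else 0.0,
            }
        return results

    # Toy usage: one document, case-insensitive exact matching on each field.
    pred = [{"medications": ["Metformin 500 mg"], "conditions": [], "dosages": [], "dates": [], "provider": []}]
    ref = [{"medications": ["metformin 500 mg"], "conditions": ["type 2 diabetes"], "dosages": [], "dates": [], "provider": []}]
    print(evaluate(pred, ref))

The error classification described above would sit on top of counts of this kind, for example by tagging each false positive or false negative with an estimated severity of downstream clinical harm.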

Key research questions include:

  • How accurately do LLMs extract structured clinical information from Australian medical documents, including medications, conditions, dosages, dates, and provider details?
  • How does accuracy vary across document types, clinical specialties, and document quality, including handwritten and scanned formats?
  • What prompt engineering approaches or output validation strategies most effectively reduce clinically significant errors?
  • How do models perform on Australian-specific medical terminology and drug names compared to international benchmarks?
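As a hedged illustration of the prompt-and-validate pattern raised in the third question, the sketch below requests JSON output against a fixed set of keys and rejects responses that fail to parse or that omit required fields. The call_llm function is a hypothetical placeholder for whichever model API is ultimately chosen, and the key set and retry behaviour are assumptions for illustration only.

    # Illustrative sketch: schema-constrained extraction with basic output
    # validation. call_llm() is a hypothetical stand-in for a real model API;
    # the key set and retry policy are assumptions for the example.
    import json
    from typing import Optional

    REQUIRED_KEYS = {"medications", "conditions", "dosages", "dates", "provider"}

    PROMPT_TEMPLATE = """Extract the following fields from the clinical document below and
    return ONLY valid JSON with the keys: medications, conditions, dosages, dates, provider.
    Use empty lists for fields that are not present. Do not invent values.

    Document:
    {document}
    """

    def call_llm(prompt: str) -> str:
        """Hypothetical placeholder: send the prompt to the chosen model and return its text."""
        raise NotImplementedError("Connect this to the selected model API.")

    def extract(document: str, max_retries: int = 2) -> Optional[dict]:
        """Prompt the model, then check that the output parses and contains the expected keys."""
        prompt = PROMPT_TEMPLATE.format(document=document)
        for _ in range(max_retries + 1):
            raw = call_llm(prompt)
            try:
                parsed = json.loads(raw)
            except json.JSONDecodeError:
                continue  # malformed JSON: retry rather than pass it downstream
            if isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed):
                return parsed
        return None  # flag the document for manual review instead of accepting bad output

Even a simple guard of this kind changes which errors reach downstream use, which is why output validation strategies appear alongside prompt engineering in the questions above.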

The project sits at the intersection of natural language processing, health informatics, and AI safety, and offers opportunities to extend into areas including human-AI interaction in health contexts, error taxonomy development, and evaluation framework design. The student will collaborate with an industry partner that brings clinical domain expertise, providing access to realistic document types and real-world validation of findings.

URLs/references

  1. Jurafsky, D., & Martin, J. H. Speech and Language Processing (3rd ed. draft).
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  3. Brown, T., et al. (2020). "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems (NeurIPS).

Required knowledge

Students undertaking this project should ideally have knowledge or experience in several of the following areas:

  • Programming experience in Python, including data processing and scripting
  • Familiarity with machine learning or natural language processing concepts
  • Understanding of large language models and prompt engineering fundamentals
  • Experience working with structured and unstructured textual data
  • Basic statistical analysis and evaluation methodology
  • Familiarity with version control systems such as Git
  • Interest in healthcare AI, digital health, or health informatics
  • Ability to critically analyse model outputs and error patterns
  • Experience with libraries or frameworks such as Hugging Face Transformers, PyTorch, LangChain, or spaCy is desirable but not essential
  • Strong written communication and research skills

The project is suitable for students with backgrounds in computer science, data science, artificial intelligence, software engineering, biomedical engineering, or related disciplines.