Primary supervisor
Helen Purchase
Co-supervisors
- Jonathan Yu (CSIRO)
This project concerns visualising the sources of the data used in scientific experiments, together with the experiments' results. The visualisations will focus on graphs.
Trust in the results of scientific experiments and scientific modelling relies on knowing how they have been derived – that is, the ‘scientific workflow’ that led to their production. Being able to reproduce the scientific workflow that led to such results is critical in ensuring trust, confidence and transparency [2].
Capturing the provenance (that is, the source) of the information used for scientific workflows is therefore foundational in achieving transparency and reproducibility. There has been an increased adoption of tools such as electronic notebooks, Laboratory Information Management Systems (LIMS) and Jupyter notebooks for computational modelling that allow automated capture of provenance records. However, solutions and approaches which collate provenance information across systems in a scalable and general way are lacking.
Knowledge graph tools can capture concepts and the relationships between them [5, 6], and provenance ontologies are well established in the community [7]. While some prior architectures exist for implementing a knowledge graph for scientific workflows, robust implementations are not yet widespread in scientific practice [1].
This project aims to build on existing technologies and ontologies to explore how knowledge graphs can be used to represent scientific workflows, using provenance information from a variety of sources – within the context of real-world science projects. In particular, the Provena open source system curates provenance data [8], and will form the basis for the project.
Specifically, this project will explore several aspects of provenance knowledge graphs. Possible approaches include:
- Developing adapters and client libraries that make it easy to record the provenance of scientific workflows from tools such as electronic notebooks, Laboratory Information Management Systems (LIMS) and Jupyter notebooks for computational modelling.
- Developing relevant and effective queries and visualisations for knowledge discovery of scientific workflows.
- Testing and benchmarking the performance of different knowledge graph implementations, e.g. neo4j [5], GraphDB [6].
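To make the first two approaches concrete, the following is a minimal sketch of how a client library might record workflow steps as provenance triples and answer a simple lineage query. It is illustrative only: the `ProvenanceGraph` class, its methods and the example identifiers (`run:clean`, `data:runoff` and so on) are hypothetical and are not part of Provena or any other existing system. The only borrowed names are the edge labels `prov:used` and `prov:wasGeneratedBy`, which are properties defined by PROV-O [7].

```python
# Edge labels taken from the PROV-O ontology.
USED = "prov:used"
WAS_GENERATED_BY = "prov:wasGeneratedBy"


class ProvenanceGraph:
    """Hypothetical in-memory triple store (an assumption, not the Provena API)."""

    def __init__(self):
        self.triples = set()   # (subject, predicate, object)
        self._out = {}         # subject -> set of (predicate, object)

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))
        self._out.setdefault(subject, set()).add((predicate, obj))

    def record_activity(self, activity, inputs, outputs):
        """Record one workflow step: the entities it used and generated."""
        for entity in inputs:
            self.add(activity, USED, entity)
        for entity in outputs:
            self.add(entity, WAS_GENERATED_BY, activity)

    def lineage(self, entity):
        """Return every upstream activity and entity that `entity` depends on."""
        seen = set()
        stack = [entity]
        while stack:
            node = stack.pop()
            # Both edge types point from a node to something upstream of it.
            for _, upstream in self._out.get(node, ()):
                if upstream not in seen:
                    seen.add(upstream)
                    stack.append(upstream)
        return seen

    def to_dot(self):
        """Serialise the graph as Graphviz DOT text for a node-link drawing."""
        lines = ["digraph provenance {"]
        for s, p, o in sorted(self.triples):
            lines.append(f'  "{s}" -> "{o}" [label="{p}"];')
        lines.append("}")
        return "\n".join(lines)


# Example: a made-up two-step hydrological workflow.
graph = ProvenanceGraph()
graph.record_activity("run:clean",
                      inputs=["data:raw_rainfall"],
                      outputs=["data:clean_rainfall"])
graph.record_activity("run:model",
                      inputs=["data:clean_rainfall", "data:terrain"],
                      outputs=["data:runoff"])

print(sorted(graph.lineage("data:runoff")))
print(graph.to_dot())
```

A production implementation would more likely store these triples in a graph database such as those listed above and query them there. The in-memory store is used here only to keep the sketch self-contained.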
An exciting prospect is the application of this implementation to several CSIRO projects in different impact areas, including modelling national bushfire risk and resilience, modelling interventions on the Great Barrier Reef, and hydrological modelling.
Student cohort
Aim/outline
The aim of this project is to develop tools to assist in recording the provenance of data collected as part of scientific workflows, to visualise these workflows as knowledge graphs, and to facilitate effective querying of the information.
The tools will be tested and applied to a range of different CSIRO projects.
URLs/references
[1] Auer, S., Kovtun, V., Prinz, M., Kasprzik, A., Stocker, M., & Vidal, M. (2018). Towards a Knowledge Graph for Science. Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics.
[2] Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). https://doi.org/10.1038/533452a
[3] Lebo, T., Sahoo, S.S., McGuinness, D.L., Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., & Zhao, J. (2013). PROV-O: The PROV Ontology.
[4] Miller, J. J. Graph database applications and concepts with Neo4j. Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA (Vol. 2324, No. 36).
[5] neo4j, https://neo4j.com/ (Accessed 18 August 2022)
[6] GraphDB, https://graphdb.ontotext.com/ (Accessed 18 August 2022)
[7] Prov-O ontology, https://www.w3.org/TR/prov-o/ (Accessed 18 August 2022)
[8] Yu, J., Baker, P., Cox, S.J.D., Petridis, R., Freebairn, A., Mirza, F., Thomas, L., Tickell, S., Lemon, D., Rezvani, M., Provena: A provenance system for large distributed modelling and simulation workflows, 25th International Congress on Modelling and Simulation (MODSIM), Darwin, Australia, July 2023 (to appear).
Required knowledge
The student needs to be a competent and versatile programmer. While knowledge of Information Visualisation tools and methods is not essential, it would be useful.