MML decision trees for survival analysis

Primary supervisor

Enes Makalic

Decision trees are powerful, interpretable models for prediction and classification that recursively partition the feature space into regions with homogeneous outcomes. Traditional decision tree algorithms like CART and C4.5 rely on heuristic splitting criteria and require ad-hoc pruning methods to prevent overfitting. In contrast, the Minimum Message Length (MML) framework provides a principled, information-theoretic approach to tree induction that naturally balances model complexity against data fit without requiring separate pruning phases.

Wallace and Patrick (1993) introduced MML decision trees, demonstrating how the two-part message length criterion simultaneously handles tree structure selection, split point determination, and leaf model specification. The MML approach encodes both the tree structure (assertion) and the data given the tree (detail), with the total message length providing an objective function for tree construction. This eliminates the need for cross-validation or hold-out sets for model selection, as the MML criterion inherently penalizes overly complex trees.
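In symbols, the two-part criterion described above can be written as follows. The notation is introduced here for illustration only: T denotes a candidate tree (structure, split points, and leaf parameters) and D the observed data.

```latex
I(T, D) \;=\; \underbrace{I(T)}_{\text{assertion}} \;+\; \underbrace{I(D \mid T)}_{\text{detail}}
        \;=\; -\log \Pr(T) \;-\; \log \Pr(D \mid T)
```

The tree minimising I(T, D) is preferred. A larger tree lengthens the assertion and is only worthwhile if it shortens the detail by more.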

Time-to-event (survival) data present unique analytical challenges, including censoring (incomplete observation of event times), time-varying covariates, and non-standard distributional assumptions. Traditional survival analysis methods are well established but restrictive: Cox proportional hazards regression assumes proportional hazards and, in its usual form, log-linear covariate effects, while Kaplan-Meier estimation does not adjust for covariates at all. Decision trees for survival analysis (e.g., survival trees and random survival forests) offer flexible, non-parametric alternatives that can capture complex interactions and non-linear effects.
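To make the role of censoring concrete, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimator. The function name and the toy data are illustrative assumptions, not part of the project. A censored subject counts in the risk set up to its censoring time but never triggers a drop in the survival curve.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if right-censored
    """
    survival = 1.0
    curve = []
    # Only times at which an event was actually observed produce a step.
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Events observed exactly at time t (censored subjects excluded).
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


if __name__ == "__main__":
    # Toy data: subjects 3 and 5 are right-censored (event flag 0).
    for t, s in kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]):
        print(f"t={t}: S(t)={s:.3f}")
```

In the toy data, the subject censored at time 3 still counts as at risk at time 3, and the subject censored at time 8 contributes to every risk set without ever producing a step.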

Combining MML principles with survival tree methodology presents an exciting opportunity to develop theoretically principled, interpretable models for time-to-event data. This approach can leverage parametric survival distributions (Weibull, log-normal, exponential) or semi-parametric methods within an MML framework, providing objective model selection across tree structures while handling censored observations appropriately.

Aim/outline

Aim 1: Develop MML Codelengths for Survival Distributions
Derive rigorous MML87 codelengths for common parametric survival distributions including exponential, Weibull, log-logistic, and log-normal models, accounting for right-censored, left-censored, and interval-censored data. Extend these results to handle time-varying covariates and competing risks, establishing the theoretical foundation for MML-based survival tree construction.
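As a rough illustration of what such a codelength looks like, the sketch below evaluates an MML87-style message length for an exponential model with right-censored data. It rests on two simplifying assumptions that are not taken from the project description: a unit-exponential prior on the rate, and the observed information d/λ² used in place of the expected Fisher information, which in general depends on the censoring mechanism. The function names are hypothetical. Treat it as a toy under those assumptions, not as the derivation Aim 1 calls for.

```python
import math

KAPPA_1 = 1.0 / 12.0  # quantising lattice constant for one parameter


def exp_codelength(times, events, lam):
    """MML87-style codelength (in nats) for rate `lam`.

    ASSUMPTIONS (illustrative only):
      * prior pi(lam) = exp(-lam) on lam > 0
      * Fisher information approximated by the observed information d / lam^2
    """
    d = sum(events)          # number of observed (uncensored) events
    total_time = sum(times)  # total follow-up time, censored or not
    if d == 0:
        raise ValueError("need at least one observed event")
    neg_log_prior = lam
    half_log_info = 0.5 * math.log(d / lam ** 2)
    # Right-censored exponential log-likelihood: d*log(lam) - lam*total_time
    neg_log_lik = -d * math.log(lam) + lam * total_time
    return neg_log_prior + half_log_info + neg_log_lik + 0.5 * (1.0 + math.log(KAPPA_1))


def exp_mml_estimate(times, events):
    """Rate minimising exp_codelength under the assumptions above.

    Setting the derivative 1 - (d + 1)/lam + total_time to zero gives
    lam = (d + 1) / (total_time + 1).
    """
    return (sum(events) + 1) / (sum(times) + 1)


if __name__ == "__main__":
    times, events = [2.0, 3.0, 5.0], [1, 0, 1]
    lam = exp_mml_estimate(times, events)
    print(lam, exp_codelength(times, events, lam))
```

Under these assumptions the estimate differs from the maximum likelihood estimate d / (total follow-up time) only through the prior and information terms, so the two agree closely once the number of events is large.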

Aim 2: Design and Implement MML Survival Tree Algorithm
Develop a complete algorithm for constructing MML decision trees specifically tailored to time-to-event data. This includes: (i) defining appropriate splitting criteria that account for censoring, (ii) implementing efficient search procedures for optimal split points in continuous and categorical predictors, (iii) determining optimal leaf models (parametric distributions or Kaplan-Meier estimates), and (iv) incorporating the tree structure encoding into the total message length calculation.
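A single step of such an algorithm might look like the sketch below, which decides whether to split one continuous predictor by comparing total message lengths. Everything here is an assumption made for illustration: the leaves are exponential models scored with a simplified MML87-style codelength (unit-exponential prior, observed information in place of the expected Fisher information), and the structure cost is a crude stand-in (log 2 nats per node to state leaf or split, plus the log of the number of candidate cut points). The function names are hypothetical, and this is not the encoding of Wallace and Patrick (1993).

```python
import math


def leaf_codelength(times, events):
    """Approximate codelength (nats) of an exponential leaf at its best rate.

    ASSUMPTIONS: unit-exponential prior on the rate; observed information
    d / lam^2 substituted for the expected Fisher information.
    """
    d, total = sum(events), sum(times)
    if d == 0:
        raise ValueError("leaf needs at least one observed event")
    lam = (d + 1) / (total + 1)  # minimiser of the expression below
    return (lam + 0.5 * math.log(d / lam ** 2)
            - d * math.log(lam) + lam * total
            + 0.5 * (1.0 + math.log(1.0 / 12.0)))


def best_split(x, times, events):
    """Return (cut, message_length); cut is None if not splitting is shorter."""
    values = sorted(set(x))
    cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]
    # Baseline: a single leaf, plus log(2) nats to say "this node is a leaf".
    best_cut = None
    best_len = math.log(2) + leaf_codelength(times, events)
    for c in cuts:
        left = [i for i, v in enumerate(x) if v <= c]
        right = [i for i, v in enumerate(x) if v > c]
        # Skip cuts that leave a child with no observed events.
        if not any(events[i] for i in left) or not any(events[i] for i in right):
            continue
        # Hypothetical structure cost: one split node, two leaf nodes,
        # and the choice of cut point among the candidates.
        structure = 3 * math.log(2) + math.log(len(cuts))
        detail = sum(
            leaf_codelength([times[i] for i in side], [events[i] for i in side])
            for side in (left, right)
        )
        if structure + detail < best_len:
            best_cut, best_len = c, structure + detail
    return best_cut, best_len


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    times = [0.5, 0.4, 0.6, 0.5, 20, 25, 22, 30]
    print(best_split(x, times, [1] * 8))
```

Because the structure cost is charged only when a split is made, a split is accepted only if the two child leaves shorten the data encoding by more than the split costs to state. This is the sense in which no separate pruning phase is needed.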

Aim 3: Empirical Evaluation and Clinical Applications
Conduct comprehensive simulation studies comparing MML survival trees against established methods including Cox regression, conditional inference trees, random survival forests, and other survival tree variants. Apply the methodology to real clinical datasets (cancer registries, cardiovascular studies, clinical trials) to demonstrate interpretability, predictive performance, and the ability to identify clinically meaningful patient subgroups with distinct survival profiles.

URLs/references

MML Decision Trees Foundation:

  • Makalic, E., & Schmidt, D. F. (2022). Introduction to minimum message length inference. arXiv:2209.14571
  • Wallace, C. S., & Patrick, J. D. (1993). Coding decision trees. Machine Learning, 11(1), 7-22. [Seminal paper on MML tree induction]
  • Tan, P. J., & Dowe, D. L. (2003). MML inference of decision graphs with multi-way joins and dynamic attributes. AI 2003: Advances in Artificial Intelligence, 269-281.

Survival Analysis Fundamentals:

  • Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.). Springer. [Comprehensive survival analysis textbook]

Survival Trees Literature:

  • LeBlanc, M., & Crowley, J. (1992). Relative risk trees for censored survival data. Biometrics, 48(2), 411-425. [Early survival tree method]
  • Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random survival forests. Annals of Applied Statistics, 2(3), 841-860.