
Optimal clustering of DNA and RNA binding sites from de novo motif discovery using Minimum Message Length

Primary supervisor

David Dowe

Co-supervisors

  • Mirana Ramialison

    DNA or RNA motif discovery is a popular biological method for identifying over-represented DNA or RNA sequences in next-generation sequencing experiments. These motifs represent the binding sites of transcription factors or RNA-binding proteins. DNA or RNA binding sites are often variable. However, all motif discovery tools report redundant motifs that poorly represent the biological variability of a single motif, which makes identifying the binding protein difficult. Here we propose, for the first time, to apply the Bayesian information-theoretic Minimum Message Length (MML) principle to optimise the clustering of over-represented DNA or RNA motifs in order to predict biologically relevant binding sites.

    This will be used primarily for regenerative medicine.

    Minimum Message Length (MML) (Wallace and Boulton, 1968; Wallace and Dowe, 1999a; Wallace, 2005) is a Bayesian information-theoretic principle in machine learning, statistics and data science. MML can be thought of in different ways. It is like Ockham's razor, seeking a simple theory that fits the data well. It can also be thought of as file compression: where data has structure, it is more likely to compress, and the greater the structure, the more it should compress.

The relationship (in principle) between MML and Solomonoff-Kolmogorov complexity (Wallace and Dowe, 1999a) means that MML can, given sufficient data and sufficient search time, infer any model underlying the data arbitrarily closely.
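The compression view of MML described above is commonly written as a two-part message. As a sketch (the symbols H for a hypothesis, such as a particular clustering of motifs, and D for the observed data, such as the binding-site sequences, are notation introduced here for illustration), the total message length in bits is:

```latex
I(H, D) = \underbrace{-\log_2 \Pr(H)}_{\text{cost of stating the model}}
        + \underbrace{-\log_2 \Pr(D \mid H)}_{\text{cost of the data given the model}}
```

MML prefers the hypothesis with the smallest I(H, D). A more complex model, for example one with more motif clusters, costs more bits in the first part, so it is only preferred when it shortens the second part by more than that cost.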


     

    Student cohort

    Double Semester

    Aim/outline

    We endeavour to do the following:

    1. Apply MML to test datasets (degenerate motifs of 6-12 base pairs with 1 to 2 variable nucleotides)
    2. Apply MML to real ChIP-seq and CLIP-seq datasets with well-defined binding sites, as a control to confirm that the method works on real biological data
    3. Apply MML to ChIP-seq cardiac datasets where no primary or secondary binding site has been identified
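As a rough illustration of the first step, the sketch below generates a small synthetic test dataset and compares the code lengths of two candidate clusterings. Everything in it is an assumption made for illustration: the function names (`make_sites`, `adaptive_code_length`, `clustering_length`), the two 8 bp consensus sequences, and the choice of a simple adaptive multistate code with a fixed-length cluster label as a stand-in for a full MML message. It is not the Snob program and not the method the project will necessarily use.

```python
import math
import random

BASES = "ACGT"

def make_sites(consensus, variable_positions, n_sites, rng):
    """Generate n_sites copies of a motif; variable positions get a random base."""
    sites = []
    for _ in range(n_sites):
        site = [rng.choice(BASES) if i in variable_positions else base
                for i, base in enumerate(consensus)]
        sites.append("".join(site))
    return sites

def adaptive_code_length(sites):
    """Bits needed to encode aligned, equal-length sites column by column,
    using an adaptive code with one pseudo-count per base (assumed prior)."""
    total = 0.0
    for column in zip(*sites):
        counts = dict.fromkeys(BASES, 1)
        seen = 0
        for base in column:
            total -= math.log2(counts[base] / (seen + len(BASES)))
            counts[base] += 1
            seen += 1
    return total

def clustering_length(clusters):
    """Bits for a clustering: a fixed-length cluster label per site
    (simplifying assumption) plus each cluster's sites under its own code."""
    n_sites = sum(len(cluster) for cluster in clusters)
    k = len(clusters)
    label_bits = n_sites * math.log2(k) if k > 1 else 0.0
    return label_bits + sum(adaptive_code_length(c) for c in clusters)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Two hypothetical 8 bp motifs, each with one variable nucleotide.
sites_a = make_sites("ACGTACGT", {3}, 50, rng)
sites_b = make_sites("TTGACCAA", {5}, 50, rng)

uniform_bits = 2 * 8 * (len(sites_a) + len(sites_b))    # 2 bits per base
one_cluster_bits = clustering_length([sites_a + sites_b])
two_cluster_bits = clustering_length([sites_a, sites_b])

print(f"uniform code : {uniform_bits:.1f} bits")
print(f"one cluster  : {one_cluster_bits:.1f} bits")
print(f"two clusters : {two_cluster_bits:.1f} bits")
```

On data like this, the clustering that matches the two underlying motifs should give the shortest encoding, which is the behaviour the test datasets are meant to check before moving on to real ChIP-seq and CLIP-seq data.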

    URLs/references

    References:

     Comley, Joshua W. and D.L. Dowe (2003). General Bayesian Networks and Asymmetric Languages, Proc. 2nd Hawaii International Conference on Statistics and Related Fields, 5-8 June, 2003.

     Comley, Joshua W. and D.L. Dowe (2005). Minimum Message Length and Generalized Bayesian Nets with Asymmetric Languages, Chapter 11 (pp265-294) in P. Grünwald, I. J. Myung and M. A. Pitt (eds.), Advances in Minimum Description Length: Theory and Applications, MIT Press, April 2005, ISBN 0-262-07262-9. [Final camera ready copy was submitted in October 2003.]

      Fitzgibbon, L.J., D. L. Dowe and F. Vahid (2004). Minimum Message Length Autoregressive Model Order Selection. In M. Palaniswami, C. Chandra Sekhar, G. Kumar Venayagamoorthy, S. Mohan and M. K. Ghantasala (eds.), International Conference on Intelligent Sensing and Information Processing (ICISIP), Chennai, India, 4-7 January 2004 (ISBN: 0-7803-8243-9, IEEE Catalogue Number: 04EX783), pp439-444.

      Molloy, S., D.W. Albrecht, D. L. Dowe and K.M. Ting (2006). Model-Based Clustering of Sequential Data, Proc. 5th Annual Hawaii Intl. Conf. on Statistics, Mathematics and Related Fields, 22 pages, 16th - 18th January, 2006, Hawaii, U.S.A.

      Tan, P.J. and D.L. Dowe (2003). MML Inference of Decision Graphs with Multi-Way Joins and Dynamic Attributes, Proc. 16th Australian Joint Conference on Artificial Intelligence (AI'03), Perth, Australia, 3-5 Dec. 2003, published in Lecture Notes in Artificial Intelligence (LNAI) 2903, Springer-Verlag, pp269-281.

      Wallace, C.S. (2005). Statistical and Inductive Inference by Minimum Message Length, Springer.

      Wallace, C.S. and D.L. Dowe (1994b). Intrinsic classification by MML - the Snob program, Proc. 7th Australian Joint Conf. on Artificial Intelligence, UNE, Armidale, Australia, November 1994, pp37-44.

      Wallace, C.S. and D.L. Dowe (1999a). Minimum Message Length and Kolmogorov Complexity, Computer Journal (special issue on Kolmogorov complexity), Vol. 42, No. 4, pp270-283

      Wallace, C.S. and D.L. Dowe (2000). MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions, Statistics and Computing, Vol. 10, No. 1, Jan. 2000, pp73-83.

    Required knowledge

    At least first year undergraduate mathematics, preferably more.

    Statistics, machine learning and/or data science at least to the level of an undergraduate degree.

    An ability to program.

    At least an interest in biology.