posted on 2017-03-02, 04:18 authored by Algama Appuhamilage Dona, Manjula Dilhani
Identifying and discerning the function of non-coding RNAs (ncRNAs) is an important
goal of genetic research. Much evidence suggests that ncRNAs play an important role
in the aetiology of many complex genetic diseases. Therefore the task of developing
methods to identify these elements in genomes has become increasingly urgent.
In this research my focus was to use a Bayesian approach to identify putative functional
non-coding genomic sequences contributing to various diseases. The analysis was mainly
carried out using a Bayesian segmentation model, implemented in the software package
changept, designed to segment discrete genomic data. In the first phase of the research,
I developed methods to expand the capabilities of changept. One simple but powerful
innovation was to develop several ways of encoding an alignment of sequences using a
D-character representation (where D is a positive integer). This enables sequence alignments
to be segmented based on multiple data types, specifically conservation, GC content
and transition/transversion ratio, and significantly generalizes the capacity of changept,
which previously could only segment on the basis of one of these characteristics at a
time. Incorporating multiple data types greatly helped to clearly identify complex
segmentation patterns and functional signatures among species, especially between
closely related species. A second methodological innovation was a new model selection
procedure to decide the optimal model for the data. A third, and most important,
methodological innovation was to build a process for systematically discovering genome-
wide putative ncRNAs, including data selection, cleaning, encoding, analysis and
post-processing. To validate these findings, both experimental methods and currently
available bioinformatics resources were used.
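To make the idea of a D-character encoding concrete, the following is a minimal sketch of one possible scheme for a pairwise alignment. It is an illustrative assumption, not the encoding used by changept or in the thesis: the function names `column_class` and `encode_alignment`, the five-symbol alphabet (so D = 5 here), and the choice to take GC content from matching columns only are all hypothetical.

```python
# Hypothetical 5-character encoding of a pairwise alignment (D = 5).
# Each column is reduced to one symbol combining conservation, GC content
# and transition/transversion information. The alphabet is an assumption
# made for illustration only.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def column_class(a, b):
    """Classify one alignment column as gap, match, transition or transversion."""
    # Gaps and ambiguous characters (e.g. '-' or 'N') are grouped together.
    if a not in BASES or b not in BASES:
        return "gap"
    if a == b:
        return "match"
    # A transition swaps purine for purine or pyrimidine for pyrimidine.
    same_ring = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_ring else "transversion"


def encode_alignment(seq1, seq2):
    """Map each column of two aligned sequences to a single character."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for a, b in zip(seq1.upper(), seq2.upper()):
        cls = column_class(a, b)
        if cls == "gap":
            symbols.append("g")
        elif cls == "match":
            # Conserved column: record whether the shared base is G/C or A/T.
            symbols.append("S" if a in "GC" else "W")
        elif cls == "transition":
            symbols.append("t")
        else:
            symbols.append("v")
    return "".join(symbols)


if __name__ == "__main__":
    print(encode_alignment("ACGT-A", "ACATGC"))
```

A string produced this way is a discrete sequence over a small alphabet, which is the kind of input a segmentation model for discrete genomic data operates on; a different choice of D or of symbol definitions would emphasize different characteristics.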
In the second phase of the research, my focus turned to application of changept, and
the new methods developed, to identify genome-wide putative non-coding elements
that may be associated with diseases. I was able to discover more than a thousand
highly conserved non-coding sequences in human, mouse and zebrafish genomes. A
complementary analysis focused on a set of genes involved in muscle development. Some
of the elements identified may contribute to muscle diseases. Discovery of putative
small ncRNAs in the bacterium Wolbachia pipientis is another successful application
of the new methods; this work was undertaken as part of the Eradicate Dengue project.
Application to malaria genomes revealed genetic mechanisms important in infecting
multiple hosts. I also identified putative regulatory sequences in 3' UTRs in three closely
related Drosophila species. Although this work focussed on Drosophila rather than
human diseases, mutations in 3' UTRs have been shown to play a crucial role in human
health and disease.