Posted on 2017-01-13, 03:57. Authored by Cao, Minh Duc
Molecular biology is the first information processing system on the planet. The genome
sequence of an organism stores the genetic information that virtually defines the organism.
Analysis of genomic sequences can help elucidate many aspects of life. This thesis
investigates approaches for sequence analysis that make use of the information content
of the sequences.
The information content of a sequence can be estimated by lossless compression.
The thesis develops the expert model, a fast and effective algorithm for compression of
biological sequences. The expert model uses a novel adaptive technique to combine
predictions from different sub-models for compression based on the well-founded Bayesian
statistical framework. Experiments show that the expert model outperforms existing
biological compression algorithms on standard DNA and protein sequence data sets while
maintaining a practical running time. Moreover, the expert model is capable of compressing
long sequences. It is applied to estimate the information content of the genomes of
species at various organism levels, including viruses, bacteria, archaea, single-celled eukaryotes,
invertebrates, plants and mammals. Most importantly, the expert model can produce
an estimate of the information content of every symbol in a sequence using background
knowledge in the form of known sequences or contexts. This is useful for performing
information extraction from genomic sequences.
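The idea of combining sub-model predictions and reading off a per-symbol information content can be illustrated with a minimal sketch. It is not the expert model itself: the `OrderKExpert` class, the add-one smoothing, and the plain Bayesian mixture over the whole history are assumptions made for illustration only. Each hypothetical expert predicts the next symbol from a fixed-length context, the predictions are averaged using posterior weights, and the cost of each symbol is the negative log probability, in bits, that the mixture assigned to it.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class OrderKExpert:
    """Hypothetical sub-model: adaptive order-k context counts, add-one smoothing."""

    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: {s: 1.0 for s in ALPHABET})

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is handled apart.
        return history[-self.k:] if self.k else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c.values())
        return {s: c[s] / total for s in ALPHABET}

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1.0


def per_symbol_information(seq, experts):
    """Return the information content (bits) of every symbol under a Bayesian mixture."""
    log_w = [0.0] * len(experts)  # log posterior weights, uniform prior
    info = []
    for i, sym in enumerate(seq):
        history = seq[:i]
        preds = [e.predict(history) for e in experts]
        # Normalise the weights in a numerically stable way.
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        p = sum(wi / z * pr[sym] for wi, pr in zip(w, preds))
        info.append(-math.log2(p))
        # Bayesian update: experts that predicted the symbol well gain weight.
        for j, pr in enumerate(preds):
            log_w[j] += math.log(pr[sym])
        for e in experts:
            e.update(history, sym)
    return info


if __name__ == "__main__":
    experts = [OrderKExpert(0), OrderKExpert(2)]
    bits = per_symbol_information("ACGT" * 50, experts)
    print(f"mean information content: {sum(bits) / len(bits):.3f} bits/symbol")
```

In this sketch, background knowledge could be supplied by calling `update` on known sequences before scoring the sequence of interest, so that the experts start with informed counts.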
The thesis suggests that since genomic sequences carry genetic information, sequence
analysis can be performed at the information level. The thesis presents XMAligner, a method
for pairwise local alignment of genomes. Instead of comparing sequences at
the character level, XMAligner considers a pair of sequences to be related if their mutual
information is significant. XMAligner is shown to be superior to conventional alignment
methods, especially on distantly related sequences or statistically biased data. The
method aligns sequences of eukaryotic genome size with only modest hardware requirements.
Importantly, the method has an objective function that can obviate the need to
choose parameter values for high-quality alignment.
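The notion of relating two sequences through their mutual information can be sketched with an off-the-shelf compressor. The example below is an illustration only and is not XMAligner: it assumes `zlib` as a stand-in compressor and uses the common approximation I(x; y) ≈ C(x) + C(y) − C(xy), where C is the compressed size in bits. The helper names `random_dna` and `mutate` are hypothetical and exist only to build toy data.

```python
import random
import zlib


def compressed_bits(seq: str) -> int:
    """Compressed size in bits, a rough upper bound on information content."""
    return 8 * len(zlib.compress(seq.encode("ascii"), 9))


def mutual_information_estimate(x: str, y: str) -> int:
    """Approximate I(x; y) as C(x) + C(y) - C(xy), in bits."""
    return compressed_bits(x) + compressed_bits(y) - compressed_bits(x + y)


def random_dna(n: int, rng: random.Random) -> str:
    """Toy data: a uniformly random DNA string of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def mutate(seq: str, n: int, rng: random.Random) -> str:
    """Toy data: substitute n distinct positions with a different base."""
    s = list(seq)
    for pos in rng.sample(range(len(s)), n):
        s[pos] = rng.choice([b for b in "ACGT" if b != s[pos]])
    return "".join(s)


if __name__ == "__main__":
    rng = random.Random(0)
    x = random_dna(2000, rng)
    related = mutate(x, 40, rng)
    unrelated = random_dna(2000, rng)
    print("related:  ", mutual_information_estimate(x, related))
    print("unrelated:", mutual_information_estimate(x, unrelated))
```

A general-purpose compressor such as `zlib` models DNA poorly, so the estimates are crude; a compressor designed for biological sequences would give tighter values.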
The information content of sequences can also be used for phylogenetic analysis. The
thesis formulates XMDistance, a measure of genetic distances between sequences based
on their information content estimated by lossless compression. The measure does not
rely on an evolutionary model. It is shown to be proportional to elapsed time if the
evolutionary rate is constant. The distance measure can be used for phylogenetic analysis
of sequences that cannot be reliably aligned, for example, whole genomes. On a set of
simulated data, phylogenetic analysis using XMDistance outperforms the maximum parsimony
method and the standard character-based distance measure. For small sequences,
the maximum likelihood method, which takes much longer to run, performs better.
XMDistance successfully infers plausible trees from real data and, most importantly,
handles problematic sets of whole genome sequences.
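An alignment-free, compression-based distance can be sketched as follows. The example does not implement XMDistance, whose definition is not given here; it swaps in the well-known normalized compression distance (NCD), with `zlib` assumed as the compressor. The helpers `random_dna` and `mutate` and the function `distance_matrix` are hypothetical names used only for this illustration.

```python
import random
import zlib


def csize(seq: str) -> int:
    """Compressed size in bytes under zlib (stand-in compressor)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs):
    """Symmetric pairwise distance matrix; the diagonal is set to 0 by convention."""
    n = len(seqs)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = ncd(seqs[i], seqs[j])
    return m


def random_dna(n: int, rng: random.Random) -> str:
    """Toy data: a uniformly random DNA string of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def mutate(seq: str, n: int, rng: random.Random) -> str:
    """Toy data: substitute n distinct positions with a different base."""
    s = list(seq)
    for pos in rng.sample(range(len(s)), n):
        s[pos] = rng.choice([b for b in "ACGT" if b != s[pos]])
    return "".join(s)


if __name__ == "__main__":
    rng = random.Random(1)
    ancestor = random_dna(2000, rng)
    seqs = [ancestor, mutate(ancestor, 20, rng), mutate(ancestor, 400, rng)]
    for row in distance_matrix(seqs):
        print(" ".join(f"{d:.3f}" for d in row))
```

A matrix like this could be passed to a standard distance-based tree-building method such as neighbour joining; no alignment of the input sequences is needed at any point.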
Campus location
Australia
Principal supervisor
Trevor Dix
Additional supervisor 1
Lloyd Allison
Year of Award
2010
Department, School or Centre
Information Technology (Monash University Clayton)