Information theoretic approaches to biological sequence analyses

Cao, Minh Duc

doi:10.4225/03/58785044858c3

monash63149.pdf (2.49 MB)

thesis

posted on 2017-01-13, 03:57 authored by Cao, Minh Duc

Molecular biology is the first information processing system on the planet. The genome sequence of an organism stores the genetic information that virtually defines the organism. Analysis of genomic sequences can help elucidate many aspects of life. This thesis investigates approaches to sequence analysis that make use of the information content of the sequences.

The information content of a sequence can be estimated by lossless compression. The thesis develops the expert model, a fast and effective algorithm for compression of biological sequences. The expert model uses a novel adaptive technique, based on the well-founded Bayesian statistical framework, to combine predictions from different sub-models. Experiments show that the expert model outperforms existing biological compression algorithms on standard DNA and protein sequence data sets while maintaining a practical running time. Moreover, the expert model is capable of compressing long sequences. It is applied to estimate the information content of the genomes of species at various organism levels, including viruses, bacteria, archaea, single-celled eukaryotes, invertebrates, plants and mammals. Most importantly, the expert model can produce an estimate of the information content of every symbol in a sequence, using background knowledge in the form of known sequences or contexts. This is useful for extracting information from genomic sequences.

The thesis suggests that since genomic sequences carry genetic information, sequence analysis can be performed at the information level. A method for pairwise local alignment of genomes, namely XMAligner, is presented. Instead of comparing sequences at the character level, XMAligner considers a pair of sequences to be related if their mutual information is significant. XMAligner is shown to be superior to conventional alignment methods, especially on distantly related sequences or statistically biased data. The method aligns sequences of eukaryote genome size with only modest hardware requirements. Importantly, the method has an objective function, which can obviate the need to choose parameter values for high-quality alignment.

The information content of sequences can also be used for phylogenetic analysis. The thesis formulates XMDistance, a measure of genetic distance between sequences based on their information content as estimated by lossless compression. The measure does not rely on an evolutionary model. It is shown to be proportional to elapsed time if the evolutionary rate is constant. The distance measure can be used for phylogenetic analysis of sequences that cannot be reliably aligned, for example, whole genomes. On a set of simulated data, phylogenetic analysis using XMDistance outperforms the maximum parsimony method and the standard character-based distance measure. For small sequences, the maximum likelihood method, which takes much longer to run, performs better. XMDistance successfully infers plausible trees from real data and, most importantly, handles problematic sets of whole genome sequences.
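The idea of estimating per-symbol information content by blending the predictions of several sub-models can be illustrated with a minimal sketch. This is not the thesis's expert model: the order-k context models, the add-one smoothing, the uniform prior, and the names `ContextModel`, `information_content`, `total_bits` and `shared_information` are all assumptions made for illustration. The sketch weights each sub-model by its posterior probability given the symbols seen so far (plain Bayesian model averaging) and reports the cost of each symbol in bits.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: the input contains only these four symbols


class ContextModel:
    """Adaptive order-k Markov model with add-one smoothing (illustrative sub-model)."""

    def __init__(self, order):
        self.order = order
        # Every unseen context starts with a count of 1 per symbol.
        self.counts = defaultdict(lambda: [1] * len(ALPHABET))

    def _context(self, history):
        return history[-self.order:] if self.order > 0 else ""

    def predict(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts)
        return [c / total for c in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def information_content(seq, background="", orders=(0, 1, 2)):
    """Return the cost in bits of each symbol of seq under a Bayesian blend.

    The models are first adapted on `background`, so the returned costs are
    conditional on that background knowledge.
    """
    models = [ContextModel(k) for k in orders]
    log_w = [0.0] * len(models)  # log posterior weights, uniform prior
    max_order = max(orders)
    text = background + seq
    costs = []
    for i, symbol in enumerate(text):
        history = text[max(0, i - max_order):i]
        j = ALPHABET.index(symbol)
        probs = [m.predict(history)[j] for m in models]
        # Normalise weights relative to the best model to avoid underflow.
        top = max(log_w)
        weights = [math.exp(lw - top) for lw in log_w]
        p = sum(w * q for w, q in zip(weights, probs)) / sum(weights)
        costs.append(-math.log2(p))
        # Bayesian update: each weight is multiplied by that model's likelihood.
        for k, m in enumerate(models):
            log_w[k] += math.log(probs[k])
            m.update(history, symbol)
    return costs[len(background):]


def total_bits(seq, background=""):
    """Estimated information content of seq, in bits."""
    return sum(information_content(seq, background))


def shared_information(x, y):
    """Bits saved on y when x is known: a rough mutual-information estimate."""
    return total_bits(y) - total_bits(y, background=x)


if __name__ == "__main__":
    x = "ACGT" * 50
    print(round(total_bits(x) / len(x), 3), "bits per symbol")
```

Under these assumptions, a sequence that costs far fewer bits once another sequence is supplied as background shares information with it, which is the intuition behind treating relatedness and distance at the information level rather than the character level.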

Campus location

Australia

Principal supervisor

Trevor Dix

Additional supervisor 1

Lloyd Allison

Year of Award

2010

Department, School or Centre

Information Technology (Monash University Clayton)

Course

Doctor of Philosophy

Degree Type

DOCTORATE

Faculty

Faculty of Information Technology

Keywords

Sequence alignment; Information theory; Compression; ethesis-20100930-19543; thesis(doctorate); Phylogenetics; Open access; monash:63149; Biological sequence analyses; 2010; 1959.1/471800

Licence

In Copyright
