Statistical inference problems with applications to computational structural biology

Kasarapu, Parthan

doi:10.4225/03/58b8a74c0f6c5

monash_169728.pdf (13.12 MB)

thesis

posted on 2017-03-02, 23:14 authored by Kasarapu, Parthan

In this data-pervasive world, the efficient and accurate modelling of data is crucial to support reliable analyses and to improve solutions to related problems. To describe the given data, the problem of selecting a suitable model has to be carefully addressed. Traditional approaches to optimal model selection have relied predominantly on the number of model parameters rather than on the parameters themselves. This limits their ability to correctly distinguish among models that, while of different types, have the same number of parameters. To address the problem of model selection satisfactorily, this thesis explores the Bayesian information-theoretic principle of minimum message length (MML). The inference framework based on the MML principle enables the optimal selection of models by using the constituent parameters to better balance the trade-off between a model's complexity and its goodness-of-fit to the data. The core of this thesis explores the MML-based inference of some commonly used probability distributions whose parameters have not yet been characterized, and of mixtures of these distributions. These distributions allow for accurate modelling of data in Euclidean space and of data that is directional in nature. These probabilistic models and their mixtures have widespread uses in statistical machine learning tasks. In this context, we have developed a general-purpose search method to determine the optimal number of mixture components, and their parameters, that describe the given data in a completely unsupervised setting. The use of the MML modelling paradigm and our proposed search method is explored in detail on a variety of real-world data, specifically on directional text data and on the spatial orientation data of protein three-dimensional structures.
Further, mixtures of directional probability distributions have facilitated the design of reliable computational models for protein structural data. In addition, the inference framework has been used to produce concise representations of protein folding patterns using a combination of non-linear parametric curves. The results of this work have a wide variety of important uses, including direct applications in protein structural biology.

Campus location

Australia

Principal supervisor

Arun Konagurthu

Additional supervisor 1

Maria Garcia de la Banda

Year of Award

2016

Department, School or Centre

Information Technology (Monash University Clayton)

Degree Type

DOCTORATE

Faculty

Faculty of Information Technology

Keywords

Computational biology; thesis(doctorate); 1959.1/1258758; monash:169728; Directional statistics; Open access; ethesis-20160421-07284; 2016; Minimum message length; Statistical modelling

Licence

In Copyright
