Incorporating and Generating Prior Knowledge to Improve Gene Regulatory Network Inference

2017-09-17T23:56:27Z (GMT) by Ajay Nair
Cells regulate gene expression and protein activity to grow and adapt to the external environment. Identifying the regulatory interactions in a cell is critical to understanding and engineering life processes. Gene regulatory network (GRN) inference is the process of reconstructing the network of regulatory interactions from experimental data by using statistical or machine-learning techniques. GRN inference remains an unsolved grand challenge. Incorporating prior knowledge into GRN inference is a promising approach proposed in the literature for accurate GRN reconstruction.

The reported methods of incorporating prior knowledge (termed priors) have limitations. Firstly, current methods focus on knowledge of the presence of interactions between genes (edge priors). Secondly, only a few methods are known to incorporate priors, and these incorporate them 'before' the inference. Thus, many high-performing methods are not known to incorporate priors. Thirdly, priors exist only for a few well-studied organisms.

The thesis demonstrated that edge priors provide only a limited improvement in the accuracy of GRN inference. It proposed and demonstrated that prior knowledge of the absence of interactions between genes (non-edge priors) contributes significantly to improving the overall accuracy. The specificity, precision, and F1-score improved by 2-10%, 5-40%, and 5-12%, respectively. A method to generate around 70% of the non-edge priors was also demonstrated.
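
To make the reported metrics concrete, the following is a minimal sketch, not the thesis's implementation, of how non-edge priors might be applied to a predicted edge set and how specificity, precision, and F1-score are computed for directed edges. The gene names, the toy edge sets, and the function names `apply_non_edge_priors` and `score` are all hypothetical and chosen only for illustration.

```python
from itertools import permutations


def apply_non_edge_priors(predicted, non_edges):
    """Drop predicted edges that the priors state are absent (assumed strategy)."""
    return predicted - non_edges


def score(predicted, truth, genes):
    """Return (specificity, precision, f1) over directed edges without self-loops."""
    universe = set(permutations(genes, 2))  # every possible directed edge
    tp = len(predicted & truth)             # correctly predicted edges
    fp = len(predicted - truth)             # predicted but not real
    fn = len(truth - predicted)             # real but missed
    tn = len(universe - predicted - truth)  # correctly left out
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return specificity, precision, f1


# Toy data (hypothetical): three genes, two true edges, three predicted edges.
genes = ["g1", "g2", "g3"]
truth = {("g1", "g2"), ("g2", "g3")}
predicted = {("g1", "g2"), ("g1", "g3"), ("g3", "g1")}
non_edges = {("g1", "g3")}  # prior: g1 does not regulate g3

before = score(predicted, truth, genes)
after = score(apply_non_edge_priors(predicted, non_edges), truth, genes)
print("before:", before)
print("after: ", after)
```

In this toy case, removing a single false positive raises all three metrics, which is the kind of effect the percentages above describe. Real gains depend on the data and on the inference method.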

The thesis analysed the maxP technique, which is widely used to reduce computational time, and identified its limitations. Two algorithms were proposed and developed that overcome these limitations while retaining the strengths of maxP by incorporating GRN topology priors 'during' the inference. The theoretical and experimental results showed that these algorithms take only one-third of the normal computational time without sacrificing accuracy.

The thesis proposed and developed two algorithms that integrate priors 'after' the GRN inference process. Further, a method to identify and remove wrong interactions by using priors was proposed and developed. The results showed that accuracy improved and errors were reduced: compared to a normal GRN inference, around 970 additional correct edges were obtained and 1,300 wrong interactions were removed when half of the total priors were incorporated. Moreover, this overcomes the limitation that only a few GRN inference methods can incorporate priors.
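
The abstract does not describe the two 'after' algorithms, so the following is only a naive set-based sketch of the general idea: because priors are applied to the output edge list, any inference method can be used upstream. The function name `integrate_priors_after`, the rule that non-edge priors override edge priors on conflict, and the toy gene names are assumptions made for illustration.

```python
def integrate_priors_after(inferred, edge_priors, non_edge_priors):
    """Post-process an inferred edge set with priors (illustrative only).

    Returns (corrected, added, removed):
      added   - prior edges the inference missed
      removed - inferred edges contradicted by non-edge priors
    Assumption: on conflict, a non-edge prior overrides an edge prior.
    """
    added = edge_priors - inferred - non_edge_priors
    removed = inferred & non_edge_priors
    corrected = (inferred | edge_priors) - non_edge_priors
    return corrected, added, removed


# Toy example (hypothetical genes a, b, c).
inferred = {("a", "b"), ("b", "c"), ("c", "a")}
edge_priors = {("a", "c")}       # known interaction missed by inference
non_edge_priors = {("c", "a")}   # known absence wrongly inferred

corrected, added, removed = integrate_priors_after(
    inferred, edge_priors, non_edge_priors
)
print(sorted(corrected))
print(sorted(added), sorted(removed))
```

The `added` and `removed` sets correspond in spirit to the "additional correct edges" and "wrong interactions removed" counts reported above, although the thesis's own algorithms may work quite differently.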

A generic mapping pipeline was developed that predicts regulatory interactions, with confidence ranks, in an organism by using the known regulatory interactions from another organism. This mapping pipeline was used to predict 20,280 regulatory interactions in 30 strains of cyanobacteria, which are less-studied but scientifically and industrially relevant organisms. A database of these regulatory interactions, RegCyanoDB, was developed and made available for public access.
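
The abstract does not state how genes are matched between organisms or how confidence is assigned. As a rough sketch under stated assumptions, the code below supposes a precomputed gene correspondence table with a similarity score in [0, 1] and takes the weaker of the two gene matches as the confidence of a transferred interaction. The function name `map_interactions`, the scoring rule, and all gene identifiers are hypothetical.

```python
def map_interactions(known, correspondence):
    """Transfer (regulator, target) pairs from a source to a target organism.

    known:          iterable of (regulator, target) pairs in the source organism
    correspondence: dict mapping a source gene to a list of
                    (target_organism_gene, similarity) tuples (assumed input)
    Returns a list of ((regulator, target), confidence), highest confidence first.
    """
    predictions = {}
    for reg, tgt in known:
        for reg_t, s_reg in correspondence.get(reg, []):
            for tgt_t, s_tgt in correspondence.get(tgt, []):
                conf = min(s_reg, s_tgt)  # weakest link sets the confidence
                key = (reg_t, tgt_t)
                if conf > predictions.get(key, 0.0):
                    predictions[key] = conf
    # Rank by descending confidence, then by name for a stable order.
    return sorted(predictions.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy example (all identifiers hypothetical).
known = [("tfA", "geneB"), ("tfA", "geneC")]
correspondence = {
    "tfA": [("x1", 0.9)],
    "geneB": [("x2", 0.8)],
    "geneC": [("x3", 0.95), ("x4", 0.4)],
}

ranked = map_interactions(known, correspondence)
for (reg, tgt), conf in ranked:
    print(reg, "->", tgt, conf)
```

Source genes with no counterpart in the target organism simply yield no prediction, so the output covers only interactions where both partners can be mapped.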

Thus, this thesis focused on developing efficient methods for incorporating priors into GRN inference and on generating priors for less-studied organisms. The thesis demonstrated that non-edge priors are significant for methods that incorporate priors 'before' inference. Further, methods that incorporate priors 'during' and 'after' inference were proposed and developed. A bioinformatic pipeline to predict regulatory interactions in less-studied organisms was also developed and applied.