Coherently integrating multiple biological datasets, especially across
different types of experiments such as RNA-Seq, ChIP-Seq, ATAC-Seq, Hi-C
and single-cell sequencing, is a difficult task. Many
existing data integration strategies involve repeatedly summarising
layers of information, as raw sequence data from each of the different
types of sequencing experiments is not directly comparable. This process
usually collapses the information to sets of gene regulatory networks
for direct comparison. As a result, a significant volume of quantitative
information is lost.
To address this problem, a data-driven approach was designed that takes
high-throughput sequencing data directly as input. In the overall
framework, known models of gene regulatory patterns, such as position
weight matrices, will be incorporated. These will be supplemented with
available biological information about the system, such as evolutionary
information (in the form of phylogenetic distances) and interaction maps
of biomolecules (DNA, RNA or protein).
The end result is a data-type-agnostic framework that can take any
combination of high-throughput sequencing data types and identify
regulatory patterns present within DNA sequences of interest. A major
advantage of the design is that it avoids strong assumptions about the
data, because the user supplies high-throughput sequencing data directly
rather than summarised or heavily processed data. At the same time,
providing data in its primary form reduces information loss, allowing
the algorithm to be more sensitive to weak signals in the data.