Monash University

A Fast Statistical Biological Sequence Compressor for Pattern Discovery

report
posted on 2022-07-25, 00:35 authored by M D Cao, T I Dix, L Allison, C Mears
This paper introduces a novel algorithm for biological sequence compression that exploits both the statistical properties of sequences and the repetition within them. It maintains a panel of experts, each of which predicts the probability distribution of the next symbol to be encoded; the experts' predictions are combined to obtain the final distribution. Experiments show that the algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time. The resulting information sequence provides insight for further study of the biological sequence. We demonstrate a number of pattern discovery tasks using our model.
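The abstract does not spell out how the experts are built or how their predictions are combined, so the following Python sketch is only an illustration of the general idea, not the report's algorithm. It assumes the experts are simple order-k Markov context models with add-one smoothing and that they are blended by a Bayesian-style mixture whose weights track each expert's recent predictive performance. The names `MarkovExpert` and `information_sequence`, the `orders` tuple, and the `decay` parameter are all hypothetical. The function returns the per-symbol information content in bits, which is one reading of the "information sequence" mentioned above.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: DNA alphabet only


class MarkovExpert:
    """Illustrative expert: order-k context model with add-one smoothing."""

    def __init__(self, order):
        self.order = order
        # Every unseen context starts with a count of 1 per symbol.
        self.counts = defaultdict(lambda: {s: 1.0 for s in ALPHABET})

    def _context(self, history):
        # Guard order 0 explicitly, because history[-0:] is the whole string.
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c.values())
        return {s: c[s] / total for s in ALPHABET}

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1.0


def information_sequence(seq, orders=(0, 1, 2), decay=0.9):
    """Return the cost in bits of each symbol under the blended prediction."""
    experts = [MarkovExpert(k) for k in orders]
    max_order = max(orders)
    log_w = [0.0] * len(experts)  # log-weights; all experts start equal
    info = []
    for i, sym in enumerate(seq):
        history = seq[max(0, i - max_order):i]
        preds = [e.predict(history) for e in experts]

        # Normalise weights in a numerically stable way, then blend.
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        mix = {s: sum(wj * p[s] for wj, p in zip(w, preds)) / z
               for s in ALPHABET}
        info.append(-math.log2(mix[sym]))

        # Reward experts that predicted the symbol well; old evidence decays.
        for j, (e, p) in enumerate(zip(experts, preds)):
            log_w[j] = decay * log_w[j] + math.log(p[sym])
            e.update(history, sym)
    return info


if __name__ == "__main__":
    bits = information_sequence("ACGT" * 50)
    print(f"mean bits per symbol: {sum(bits) / len(bits):.3f}")
```

Under these assumptions, a sequence with no exploitable structure costs about 2 bits per DNA symbol, while repetitive regions cost noticeably less. Plotting the returned list along the sequence is one way such dips could be used to flag candidate patterns.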

History

Technical report number

2006/203

Year of publication

2006

    Monash Information Technology Technical Reports
