Posted on 2022-08-31, 03:06; authored by L Allison, D Powell, T I Dix
Sequences having low information content cause problems for standard algorithms, for example by producing false-positive matches. Shuffling is a popular technique for correcting for the abnormally low alignment costs (or high scores) between such sequences. However, shuffling cannot be used safely on arbitrary populations of sequences. It is also only used "after the fact" to judge the significance of alignments, and it does not change their rank-order. We seek a better solution.
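For readers unfamiliar with shuffling, the idea can be sketched as follows. This is a minimal illustration, not the procedure evaluated in the paper: it assumes unit-cost edit distance as the alignment cost, and the names `edit_distance` and `shuffle_p_value`, the trial count, and the seed are all hypothetical choices made for the example.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Unit-cost global alignment cost (Levenshtein distance), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def shuffle_p_value(a: str, b: str, trials: int = 200, seed: int = 0) -> float:
    """Estimate how often a shuffled copy of b aligns to a at least as
    cheaply as b itself does. Small values suggest a significant alignment."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    observed = edit_distance(a, b)
    chars = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(chars)  # preserves composition, destroys order
        if edit_distance(a, "".join(chars)) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Two low-information sequences align perfectly, yet every shuffle
# aligns just as well, so the match is not significant.
print(shuffle_p_value("AAAAAAAA", "AAAAAAAA"))  # prints 1.0
```

Note that the test is applied only after the optimal alignment has been found, which is the "after the fact" limitation mentioned above.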
An alternative alignment methodology is described which directly models the information content of sequences. It can be used with a very large class of statistical models for different populations of sequences. In general, it not only judges the significance of alignments but can also change their rank-order, promoting some and demoting others. The populations that the sequences come from can probably be identified as well. The new methodology is compared to shuffling for the purpose of judging the significance of optimal alignments.
The methodology described can be incorporated into any alignment algorithm that allows mutation costs to be treated as (-logs of) probabilities.
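As an illustration of what treating mutation costs as negative logarithms of probabilities means, the sketch below converts a set of operation probabilities into costs in bits and uses them in a standard dynamic-programming alignment. The probabilities in `OP_PROBS` and the function names are invented for the example, and the sketch deliberately omits any model of the sequences' own information content, so it shows only the cost/probability correspondence, not the methodology itself.

```python
import math


def cost_bits(p: float) -> float:
    """Cost in bits of an event with probability p."""
    return -math.log2(p)


# Hypothetical operation probabilities (they sum to 1); not from the paper.
OP_PROBS = {"match": 0.7, "mismatch": 0.1, "insert": 0.1, "delete": 0.1}
OP_COSTS = {op: cost_bits(p) for op, p in OP_PROBS.items()}


def alignment_cost(a: str, b: str, costs=OP_COSTS) -> float:
    """Minimum total cost, in bits, of aligning a with b.

    Because costs are -log2 of probabilities, minimising the summed cost
    is the same as maximising the product of operation probabilities.
    """
    prev = [j * costs["insert"] for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * costs["delete"]]
        for j, cb in enumerate(b, start=1):
            step = costs["match"] if ca == cb else costs["mismatch"]
            curr.append(min(prev[j - 1] + step,
                            prev[j] + costs["delete"],
                            curr[j - 1] + costs["insert"]))
        prev = curr
    return prev[-1]


bits = alignment_cost("ACGT", "ACGT")
# Converting back: 2 ** -bits is the probability of the best alignment,
# here four matches in a row.
print(bits, 2 ** -bits)
```

Any algorithm whose costs can be read this way can, in principle, have its fixed costs replaced by costs derived from a richer probabilistic model.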