Statistical Tool Finds ‘Gaps’ in DNA Data Sets Shouldn’t Be Ignored

For Immediate Release

A simple statistical test shows that, contrary to current practice, the “gaps” within the DNA and protein sequence alignments commonly used in evolutionary biology can provide important information about nucleotide and amino acid substitutions over time. The finding could be particularly relevant to those studying distantly related species.

Biologists studying evolution do so by looking at how DNA and protein sequences change over time. These changes can be sequence length changes – when specific nucleotides are deleted or added at certain positions – or substitutions, where one nucleotide type is exchanged for a different type at a given point.

“Think of the DNA sequence and its evolution as a sentence being copied by different people over time,” says Jeff Thorne, professor of biological sciences and statistics at NC State and a co-corresponding author of the research. “Over time, a letter in a word will change – that’s a substitution. Leaving out or adding letters or words correspond to deletions or insertions.”

The first step analysts usually perform when looking at evolutionary DNA changes is to construct a sequence alignment. This means figuring out how all of the sequences correspond to one another and then aligning those corresponding positions into columns for comparison. Due to substitutions, insertions and deletions, however, nucleotide types within columns can vary among sequences, or be absent altogether. When a sequence does not have a corresponding nucleotide, a gap is placed in the alignment column for that sequence.
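As a rough illustration of the column layout described above, the following Python sketch stores a toy alignment as equal-length strings, with "-" marking a gap, and lists which columns contain a gap. The sequences and the helper names `columns` and `gap_columns` are invented for this example and do not come from the study.

```python
# Toy alignment: every row has the same length, and "-" marks a gap.
# These sequences are made up for illustration only.
alignment = {
    "seq1": "ACG-TA",
    "seq2": "ACGTTA",
    "seq3": "A-GTCA",
}


def columns(alignment):
    """Return the alignment columns as strings, one character per sequence."""
    rows = list(alignment.values())
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return ["".join(col) for col in zip(*rows)]


def gap_columns(alignment):
    """Return the 0-based indices of columns that contain at least one gap."""
    return [i for i, col in enumerate(columns(alignment)) if "-" in col]


print(columns(alignment))
print(gap_columns(alignment))
```

In this toy case, the second and fourth columns each contain a gap, while the fifth column shows a substitution (T, T, C) with no gap.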

“Conventionally, when using sequence alignments to do analyses, the gaps within alignment columns are treated as missing data that provide no information about the substitutions,” Thorne says. “Historically, the research community has assumed that gap locations are independent of the substitution process. But what if that assumption is incorrect?”

Thorne and his colleagues created a simple statistical test to assess whether gap locations are independent of the amino acid replacement process. They tested 1390 different sets of sequence alignments, and found that in roughly two-thirds of the sets, the usual assumption of independence between gap locations and amino acid replacement was rejected.

“One possibility is that gap locations provide useful information about the amino acid replacement process,” Thorne says. “If so, evolutionary biologists should develop better techniques for extracting this information.”

The research also illustrated how the usual approach of constructing a sequence alignment and then basing evolutionary conclusions on that single optimal alignment can be problematic. What if the alignment is wrong? Even worse, what if the alignment is biased?

For example, if substitutions occur more often than gaps, researchers tend to repeatedly choose substitutions over gaps when building the sequence alignment, and the resulting alignment can contain too few gaps overall. Those small alignment errors will most likely not affect outcomes in comparisons between closely related species. Over time, however, and particularly in comparisons between diverse species, that bias can create error that could affect subsequent analyses.
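The tendency described above can be sketched with a textbook global alignment (Needleman-Wunsch dynamic programming). This is a minimal toy, not one of the software packages evaluated in the study. The function name `global_align`, the scoring values and the two sequences are all assumptions chosen for illustration. Suppose the second sequence arose from the first by losing its leading A and gaining a trailing A.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Textbook global alignment; returns one optimal pair of aligned rows."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


# Mild gap penalty: the alignment recovers the two hypothetical indels.
print(global_align("ACGTT", "CGTTA", gap=-1))
# Harsh gap penalty: the "optimal" alignment has no gaps, only substitutions.
print(global_align("ACGTT", "CGTTA", gap=-5))
```

With the harsh penalty, the best-scoring alignment explains the data with four substitutions and no gaps, even though the assumed history involved one deletion and one insertion. That is the kind of systematic undercounting of gaps the paragraph above describes.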

“Sometimes our best guesses are biased,” says Tae-Kun Seo, principal research scientist at the Korea Polar Research Institute and co-corresponding author of the research. “There’s no simple solution, but hopefully this study will help us be mindful about potential pitfalls. We need to be aware of the problems with conventional statistical methods and work toward fixing them.”

The work appears in Proceedings of the National Academy of Sciences and was supported by the National Science Foundation and the Korea Polar Research Institute. Ben Redelings, research scientist at Duke University and the University of Kansas, also contributed to the work.


Note to editors: An abstract follows.

“Correlations between alignment gaps and nucleotide substitution or amino acid replacement”

DOI: 10.1073/pnas.2204435119

Authors: Tae-Kun Seo, Korea Polar Research Institute; Benjamin Redelings, Duke University and the University of Kansas; Jeffrey Thorne, North Carolina State University
Published: The week of Aug. 15, 2022 in Proceedings of the National Academy of Sciences

To assess the conventional treatment in evolutionary inference of alignment gaps as missing data, we propose a simple nonparametric test of the null hypothesis that the locations of alignment gaps are independent of the nucleotide substitution or amino acid replacement process. When we apply the test to 1390 protein alignments that are informed by protein tertiary structure and use a 5% significance level, the null hypothesis of independence between amino acid replacement and gap location is rejected for approximately 65% of data sets. Via simulations that include substitution and insertion-deletion, we show that the test performs well with true alignments. When we simulate according to the null hypothesis and then apply the test to optimal alignments that are inferred by each of four widely-used software packages, the null hypothesis is rejected too frequently. Via further simulations and analyses, we show that the overly frequent rejections of the null hypothesis are not solely due to weaknesses of widely-used software for finding optimal alignments. Instead, our evidence suggests that optimal alignments are unrepresentative of true alignments and that biased evolutionary inferences may result from relying upon individual optimal alignments.
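To give a feel for what a nonparametric test of independence can look like, here is a generic permutation test in Python. It is not the test proposed in the paper, whose details are not given here. The statistic (the mean number of distinct residues in gapped columns minus that in ungapped columns), the function names and the toy alignment are all assumptions made for illustration. The sketch also ignores phylogeny and the fact that gapped columns contain fewer residues.

```python
import random


def distinct_residues(column):
    """Crude variability score: number of distinct non-gap characters."""
    return len(set(column) - {"-"})


def gap_permutation_test(rows, n_perm=2000, seed=0):
    """Permutation test of whether column variability differs with gap presence.

    Returns (observed difference in means, two-sided permutation p-value).
    """
    cols = ["".join(c) for c in zip(*rows)]
    has_gap = ["-" in c for c in cols]
    scores = [distinct_residues(c) for c in cols]
    n_gap = sum(has_gap)
    if n_gap == 0 or n_gap == len(cols):
        raise ValueError("need both gapped and ungapped columns")

    def mean_diff(labels):
        gapped = [s for s, g in zip(scores, labels) if g]
        ungapped = [s for s, g in zip(scores, labels) if not g]
        return sum(gapped) / len(gapped) - sum(ungapped) / len(ungapped)

    observed = mean_diff(has_gap)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    labels = list(has_gap)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis, gap labels are exchangeable across columns.
        rng.shuffle(labels)
        if abs(mean_diff(labels)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Invented toy alignment; real analyses would use far longer alignments.
toy = ["ACG-TAGGA", "ACGTTAGCA", "A-GTCAG-A", "ATGTCAGTA"]
observed, p_value = gap_permutation_test(toy)
print(observed, p_value)
```

Under this setup, independence would be rejected for a data set when its p-value falls below the chosen significance level, such as the 5% level mentioned in the abstract.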