Interpreting ConSurf Results
This page is under construction and incomplete. Eric Martz 01:24, 24 December 2021 (UTC)
This page discusses how to decide whether a ConSurf result is optimal for the questions you wish to ask about a protein. It assumes that you already have one or more completed ConSurf results. For background principles and instructions on how to get a ConSurf result, please see ConSurf/Index.
Diversity in the MSA
A ConSurf result depends crucially on the sequences included in the multiple sequence alignment (MSA). The optimal diversity in those sequences depends on your goal. The diversity in an MSA is summarized by its Average Pairwise Distance (APD), described below.
- If you want to know which residues are important for the specific function of one protein, then the MSA should not include proteins with different functions. See Limiting ConSurf Analysis to Proteins of a Single Function.
- If you want to know which residues remain important throughout an entire protein family (or superfamily), then the MSA should be broad enough to include representatives of the entire family. Conserved residues will then include those that are conserved to enable proper folding of the domain. Such an MSA may obscure conservation of some residues important for a specific function.
For more about this, see What is the best way to collect homologous sequences in order to construct an MSA? at the ConSurf Server.
Average Pairwise Distance
The average pairwise distance (APD) in a multiple sequence alignment (MSA) is a measure of the evolutionary breadth of the range of sequences included. The APD is "The average number of replacements between any two sequences in the alignment; A distance of 0.01 means that on average, the expected replacement for every 100 positions is 1." (quoted from the ConSurf Server).
Generally, an APD below about 1 is consistent with an MSA whose sequences are limited to proteins with one specific function. As the APD climbs above 1, it becomes more likely that proteins with multiple functions are included in the MSA.
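To make the idea of an average pairwise distance concrete, here is a minimal Python sketch. It is not how ConSurf computes the APD: it uses the uncorrected fraction of differing positions (the "p-distance") as a stand-in, which is an assumption made purely for illustration. A p-distance can never exceed 1, whereas the APD reported by ConSurf can (1.62 in the example below), so the two numbers are not interchangeable. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations

GAP = "-"  # assumed gap character in the aligned sequences

def p_distance(seq_a, seq_b):
    """Fraction of mismatches over columns where neither sequence has a gap."""
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != GAP and b != GAP]
    if not compared:
        return None  # no overlapping columns to compare
    mismatches = sum(1 for a, b in compared if a != b)
    return mismatches / len(compared)

def average_pairwise_distance(aligned_seqs):
    """Mean p-distance over all pairs of aligned sequences (illustration only)."""
    distances = []
    for seq_a, seq_b in combinations(aligned_seqs, 2):
        d = p_distance(seq_a, seq_b)
        if d is not None:
            distances.append(d)
    return sum(distances) / len(distances) if distances else None

# Hypothetical toy alignment: three sequences, four columns.
toy_msa = ["ACDE", "ACDF", "ACGF"]
print(round(average_pairwise_distance(toy_msa), 3))
```

The sketch shows only the averaging over all sequence pairs. For judging a real MSA, use the APD that ConSurf itself reports.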
Example
At the ConSurf Server, click on Gallery, then MHC Class I heavy chain (2VAA). In the finished results for chain A of 2VAA, under the subheading Sequence Data, click on Sequences Used.
The APD is 0.99. The MSA has 150 sequences, largely limited to sequences for major histocompatibility complex class I proteins. The labels of 101 sequences (67% of 150) contain "class I" or "class 1". There is only one class II protein sequence. Three sequences are labeled "zinc-alpha-2-glycoprotein", clearly a different function. There are 22 sequences labeled "uncharacterized protein" which nevertheless have high similarity to the query. 19 sequences are labeled "UPI000... related cluster". If the uncharacterized and "UPI000..." sequences are in fact class I sequences, then up to 142/150 (95%) of the sequences could be MHC-I.
In contrast, ConSurfDB used 300 sequences for its 2VAA chain A result. The APD is 1.62, suggesting that a number of non-MHC-I proteins were included in the MSA. Only 146/300 sequences (49% of 300 total) in the MSA have labels that include "class I" (excluding the count with "class II"). The MSA includes 62 sequences labeled "Ig-like domain-containing protein", 20 "T-cell surface glycoprotein" sequences of the CD1 family, 17 apparently unrelated proteins (one or a few each), 14 histocompatibility class II proteins, 8 sequences for "hereditary hemochromatosis protein", 8 for "zinc-alpha-2-glycoprotein", and 11 uncharacterized proteins. Excluding the uncharacterized proteins, that leaves 129 (43% of 300) that do not or may not function as MHC I proteins.
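Label tallies like those in the two examples above can be made by hand, or roughly automated. The following Python sketch assumes you have copied the sequence labels into a list of strings. The category names, the keyword rules, and the sample labels are all hypothetical. Because the text "class II" contains "class I", the patterns require a word boundary after the numeral, so that class II labels are not counted as class I.

```python
import re
from collections import Counter

# A word boundary after the numeral keeps "class II" from matching "class I".
CLASS_II = re.compile(r"class\s+(ii|2)\b", re.IGNORECASE)
CLASS_I = re.compile(r"class\s+(i|1)\b", re.IGNORECASE)

def categorize(label):
    """Assign a sequence label to a coarse category (assumed keyword rules)."""
    if CLASS_II.search(label):
        return "class II"
    if CLASS_I.search(label):
        return "class I"
    if "uncharacterized" in label.lower():
        return "uncharacterized"
    if label.upper().startswith("UPI"):
        return "UPI cluster"
    return "other"

def tally(labels):
    """Count labels per category."""
    return Counter(categorize(label) for label in labels)

# Hypothetical labels, for illustration only.
labels = [
    "MHC class I antigen",
    "H-2 class 1 heavy chain",
    "HLA class II histocompatibility antigen",
    "Zinc-alpha-2-glycoprotein",
    "Uncharacterized protein",
]
counts = tally(labels)
for category, n in counts.most_common():
    print(f"{category}: {n} ({100 * n / len(labels):.0f}%)")
```

Keyword matching only reflects how sequences are annotated. As noted above, an uncharacterized sequence may still be a class I protein.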
Distribution of Residues
FirstGlance in Jmol shows the distribution of amino acids across the 9 conservation grades. This helps to alert you to problems with the MSA. Some examples follow.
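If you want to inspect such a distribution outside of FirstGlance, the tally itself is simple. The Python sketch below assumes you already have one conservation grade (1 through 9, with 9 the most conserved) per residue in a list. How you extract the grades from a ConSurf result is not covered here, and the function names and sample grades are hypothetical.

```python
from collections import Counter

GRADES = range(1, 10)  # ConSurf conservation grades 1..9

def grade_distribution(grades):
    """Return the number of residues at each grade, including empty grades."""
    counts = Counter(grades)
    return {g: counts.get(g, 0) for g in GRADES}

def print_histogram(distribution):
    """Print a plain-text bar per grade with its count and percentage."""
    total = sum(distribution.values())
    for grade, n in distribution.items():
        pct = 100 * n / total if total else 0.0
        print(f"grade {grade}: {n:4d} ({pct:5.1f}%) {'#' * n}")

# Hypothetical per-residue grades, for illustration only.
sample_grades = [9, 9, 1, 5, 5, 5, 7, 3, 8]
print_histogram(grade_distribution(sample_grades))
```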
IMPORTANT (December, 2021): For the steps below, use the unreleased beta-test version FirstGlance 3.8 Beta2. The publicly available version 3.7 does not display the distribution of residues. At your finished ConSurf Job Status page, under the heading PDB Files, right-click on PDB File with ConSurf Results in its Header, for FirstGlance in Jmol and select Copy Link Address. Then, at FirstGlance 3.8 Beta2, click enter a molecule's URL, paste the address into the slot, and click Submit. (You cannot upload the PDB file to 3.8 Beta2 because the upload mechanism always goes to version 3.7.)
Good Distributions: 150 Sequences
Here are some examples of distributions for satisfactory ConSurf results.
Good Distributions: <100 Sequences
Satisfactory results are sometimes obtained when fewer than 100 unique sequences are found. In the case of 1sy7, similar results were obtained for 43 sequences (the total number of unique sequences found) vs. 150 sequences (sampled from 22,855 unique sequences). The APD values were close, and 75% of the residues with the highest conservation grade, 9, were in common.
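A comparison of grade 9 residues between two runs can be sketched in Python as follows. The sketch assumes each run is available as a mapping from residue number to conservation grade. The text above does not say which denominator was used for the 75% figure, so the sketch reports the shared fraction relative to each run as well as the intersection over the union. All names and sample values are hypothetical.

```python
def top_grade_residues(grades_by_residue, top=9):
    """Set of residue numbers assigned the top conservation grade."""
    return {res for res, grade in grades_by_residue.items() if grade == top}

def grade9_overlap(run_a, run_b, top=9):
    """Compare the top-grade residues of two ConSurf runs."""
    a = top_grade_residues(run_a, top)
    b = top_grade_residues(run_b, top)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "fraction_of_a": len(shared) / len(a) if a else None,
        "fraction_of_b": len(shared) / len(b) if b else None,
        "jaccard": len(shared) / len(union) if union else None,
    }

# Hypothetical residue -> grade mappings for two runs on the same chain.
run_small = {10: 9, 11: 9, 12: 9, 13: 9, 14: 5}
run_large = {10: 9, 11: 9, 12: 9, 13: 8, 15: 9}
print(grade9_overlap(run_small, run_large))
```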
Here are more cases of satisfactory results from <100 sequences:
Poor Distributions: Too Few Sequences
When the number of sequences falls below roughly 25, the result is unlikely to be satisfactory. The distribution alerts you to the problem.