Interpreting ConSurf Results: Difference between revisions

<font color="red">This page is under construction and incomplete.</font> Eric Martz [[User:Eric Martz|Eric Martz]] 01:24, 24 December 2021 (UTC)
This page discusses how to decide whether a [http://consurf.tau.ac.il ConSurf] result is optimal for the questions you wish to ask about a protein. It assumes that you already have one or more completed ConSurf results. For background principles and instructions on how to get a ConSurf result, please see [[ConSurf/Index]].


This page does not go into detail about the changes in settings needed to optimize a ConSurf result. The options for increasing (or decreasing) the diversity and number of sequences in the underlying multiple sequence alignment (MSA) are evident in the job submission forms of ConSurf. You are encouraged to try various options, and the information below will help you to decide which options give the most satisfactory result for your purposes.


==Diversity in the MSA==
A ConSurf result depends crucially on the sequences included in the multiple sequence alignment (MSA). The optimal diversity in those sequences depends on your goal. The diversity in an MSA is quantitated by the [[#Average Pairwise Distance]] (APD).


*If you want to know which residues are important for the '''specific function of one protein''', then the MSA should not include proteins with different functions. See [[ConSurfDB_vs._ConSurf#Limiting_ConSurf_Analysis_to_Proteins_of_a_Single_Function|Limiting ConSurf Analysis to Proteins of a Single Function]].


==Average Pairwise Distance==
The ''average pairwise distance'' (APD) in a multiple sequence alignment (MSA) is a measure of the evolutionary diversity in the sequences included. The APD is "The average number of replacements between any two sequences in the alignment; A distance of 0.01 means that on average, the expected replacement for every 100 positions is 1." (quoted from the ConSurf Server).
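To make the idea of an average over all sequence pairs concrete, here is a minimal sketch. It assumes a simple p-distance (the fraction of differing positions in ungapped columns) as a stand-in. ConSurf's APD is an estimate of expected replacements, which can exceed 1.0, so this sketch will not reproduce ConSurf's values. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations

GAP_CHARS = {"-", "."}


def p_distance(seq_a, seq_b):
    """Fraction of differing positions, ignoring columns where either sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # skip gapped columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def average_pairwise_p_distance(aligned_seqs):
    """Mean p-distance over every unordered pair of sequences in the alignment."""
    pairs = list(combinations(aligned_seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy alignment, for illustration only.
toy_msa = ["MKTAYIAK", "MKTAYLAK", "MRTA-IAK"]
print(average_pairwise_p_distance(toy_msa))
```

A p-distance saturates at 1.0, which is one reason model-based distances such as the one ConSurf reports are preferred for divergent sequences.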


Generally, an APD of 0.25 to 0.5 is consistent with an MSA whose sequences are limited to proteins with one specific function. As the APD approaches or exceeds 1.0, it is more likely that proteins of multiple functions are included in the MSA.
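The guideline ranges can be written as a small helper for triaging several ConSurf jobs. This is only a sketch: the cutoffs are taken from the rough ranges stated above (0.25 to 0.5, and approaching or exceeding 1.0), and the function name and category labels are hypothetical.

```python
def interpret_apd(apd):
    """Map an APD value onto the rough guideline ranges; the cutoffs are approximate."""
    if apd < 0:
        raise ValueError("APD cannot be negative")
    if apd < 0.25:
        return "low diversity"
    if apd <= 0.5:
        return "consistent with one specific function"
    if apd < 1.0:
        return "intermediate: inspect the sequence labels"
    return "multiple functions likely"


# APD values mentioned on this page for 2VAA chain A.
for apd in (0.30, 0.99, 1.62):
    print(apd, "->", interpret_apd(apd))
```

The categories are a prompt to inspect the labels under ''Sequences Used'', not a substitute for doing so.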


===Example===
At the ConSurf Server, click on ''Gallery'', then ''MHC Class I heavy chain'' (2VAA). In the finished results for chain A of 2VAA, under the subheading ''Sequence Data'', click on '''Sequences Used'''.


====APD 0.99====
The APD for the ConSurf Server 2VAA result with default settings in the Gallery is '''0.99'''. The MSA has 150 sequences, largely limited to sequences for major histocompatibility complex class I proteins. The labels of 101 sequences (67% of 150) contain "class I" or "class 1". There is only one class II protein sequence. Three sequences are labeled "zinc-alpha-2-glycoprotein", clearly a different function. There are 22 sequences labeled "uncharacterized protein" which nevertheless have high similarity to the query. 19 sequences are labeled "UPI000... related cluster". If the uncharacterized and "UPI000..." sequences are in fact class I sequences, then '''up to 142/150 (95%) of the sequences could be MHC-I'''.
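Counting labels by hand is error-prone because "class II" contains "class I" as a substring. The sketch below shows one way to tally class I labels while excluding class II. It assumes you have copied the sequence descriptions into a list of strings. The labels shown are hypothetical examples, not the actual 2VAA list.

```python
import re

# Matches "class I" or "class 1" as a whole token, so "class II" is not counted.
CLASS_I_PATTERN = re.compile(r"class (?:I|1)\b", re.IGNORECASE)


def count_class_i(labels):
    """Return (count, percent) of labels that mention class I but not only class II."""
    if not labels:
        raise ValueError("no labels supplied")
    count = sum(1 for label in labels if CLASS_I_PATTERN.search(label))
    return count, 100.0 * count / len(labels)


# Hypothetical labels, for illustration only.
example_labels = [
    "MHC class I antigen",
    "H-2 class 1 histocompatibility antigen",
    "HLA class II histocompatibility antigen",
    "Zinc-alpha-2-glycoprotein",
    "Uncharacterized protein",
]
print(count_class_i(example_labels))

# The upper bound quoted above: 101 labeled + 22 uncharacterized + 19 "UPI000...".
print(101 + 22 + 19, round(100 * (101 + 22 + 19) / 150))
```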


However, conservation of key functional residues was revealed only when custom ConSurf Server jobs achieved an APD of around 0.30; see Case #1 at [[ConSurfDB_vs._ConSurf#Examples]].
====APD 1.62====
In contrast, ConSurfDB used 300 sequences for its 2VAA chain A result. The APD is '''1.62''', suggesting that a number of non-MHC-I proteins were included in the MSA. '''Only 146/300 sequences (49% of 300 total) in the MSA have labels that include "class I"''' (excluding the count with "class II"). The MSA includes 62 sequences labeled "Ig-like domain-containing protein", 20 "T-cell surface glycoprotein" sequences of the CD1 family, 17 apparently unrelated proteins (one or a few each), 14 histocompatibility class II proteins, 8 sequences for "hereditary hemochromatosis protein", 8 for "zinc-alpha-2-glycoprotein", and 11 uncharacterized proteins. Excluding the uncharacterized proteins, that leaves '''129 (43% of 300) that do not or may not function as MHC I proteins'''.
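The arithmetic behind the 129 and 43% figures can be checked with a short tally. The counts are the ones quoted in the paragraph above; the dictionary and variable names are illustrative.

```python
TOTAL_SEQUENCES = 300
CLASS_I_LABELED = 146

# Counts quoted above for sequences that do not, or may not, function as MHC I.
other_categories = {
    "Ig-like domain-containing protein": 62,
    "T-cell surface glycoprotein (CD1 family)": 20,
    "apparently unrelated proteins": 17,
    "histocompatibility class II": 14,
    "hereditary hemochromatosis protein": 8,
    "zinc-alpha-2-glycoprotein": 8,
}

non_mhc_i = sum(other_categories.values())
percent_non_mhc_i = round(100 * non_mhc_i / TOTAL_SEQUENCES)
percent_class_i = round(100 * CLASS_I_LABELED / TOTAL_SEQUENCES)

print(non_mhc_i, percent_non_mhc_i, percent_class_i)
```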




<table style="background-color:#ffe0e0"><tr><td>
At the ConSurf Server results page, '''download the PDB file''' by opening ''High Resolution Figures and PDB Files'', and then clicking ''Download ConSurf PDB File for FirstGlance in Jmol''. Then [http://www.bioinformatics.org/firstglance/fgij/where.htm#u upload it to FirstGlance]. By downloading the PDB file, you will have it after the results '''disappear''' from the ConSurf Server. PDB files downloaded from the ConSurf Database (ConSurfDB) '''do not work''' in FirstGlance.
</td></tr></table>


[[Image:1n73-56seq-APD1.01.png]]
[[Image:4mkm-39seq-APD0.85.png]]
===Poor Distributions: Too Few Sequences===
When the number of sequences falls below roughly 25, the result is unlikely to be satisfactory. The distribution alerts you to the problem. In such cases, the percentage of residues with insufficient data (uncertain conservation grades) rises, and the average conservation grade for residues assigned grades 1-9 tends to be '''&gt;5.5'''.
<br>
[[Image:6t3x-18seq-APD0.77.png]]
[[Image:2PNL-B-13seq-APD0.21.png]]
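The warning signs described in this section can be checked programmatically if you have the per-residue grades in hand. The sketch below assumes the grades are supplied as a list of integers 1-9, with <code>None</code> marking residues that have insufficient data. That input format and the function name are assumptions, not a ConSurf file format. The cutoffs of 25 sequences and an average grade of 5.5 are the rough values given above.

```python
from collections import Counter

MIN_SEQUENCES = 25       # rough lower limit stated in the text
MAX_MEAN_GRADE = 5.5     # average grade above this suggests a poor distribution


def summarize_grades(grades, n_sequences):
    """Summarize a grade distribution and flag the warning signs described above."""
    if not grades:
        raise ValueError("no grades supplied")
    graded = [g for g in grades if g is not None]
    if any(g < 1 or g > 9 for g in graded):
        raise ValueError("grades must be between 1 and 9")

    percent_insufficient = 100.0 * (len(grades) - len(graded)) / len(grades)
    mean_grade = sum(graded) / len(graded) if graded else None

    warnings = []
    if n_sequences < MIN_SEQUENCES:
        warnings.append("too few sequences in the MSA")
    if mean_grade is not None and mean_grade > MAX_MEAN_GRADE:
        warnings.append("average grade above 5.5")

    return {
        "distribution": dict(sorted(Counter(graded).items())),
        "percent_insufficient": percent_insufficient,
        "mean_grade": mean_grade,
        "warnings": warnings,
    }


# Hypothetical grades for a ten-residue fragment, for illustration only.
example = summarize_grades([9, 9, 8, 7, 6, 5, None, None, 9, 8], n_sequences=18)
print(example)
```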
===Poor Distributions: Short Chains===
Protein [[chains]] with fewer than 50 residues may give unsatisfactory results. The collagen in [[6vzx]] has only 24 amino acids/chain. Increasing the number of sequences in the MSA did little to improve the result.
<br>
[[Image:6vzx-collagen-APD0.75.png]]
[[Image:6vzx-collagen-250seq-APD0.74.png]]
===Too Many Residues With Insufficient Data===
Amino acids with insufficient data (uncertainty in conservation grade) are colored yellow. Here are two cases where yellow residues were a problem, with solutions.
====More Sequences Needed====
If a residue of interest has insufficient data, increasing the number of sequences in the MSA may give it a reliable conservation grade. This happened for sequence-identical chains C and F in [[1n73]]. With 150 sequences (default job settings), Lys401, participating in an isopeptide bond, had insufficient data. When 300 sequences were used in the MSA, Lys401 acquired a reliable conservation grade of 1. Its partner in the isopeptide bond, Gln397, dropped from conservation grade 8 to 7, although the APD did not increase.
<blockquote>
FirstGlance automatically reports six types of [[protein crosslinks]], including isopeptide bonds. Other examples of protein crosslinks colored by evolutionary conservation are [[FirstGlance/Evaluating_Protein_Crosslinks#Conservation_of_Crosslinking_Residues|a thioether crosslink in catalase]] and [[FirstGlance/Visualizing_Conservation#Conservation_of_Protein_Crosslinks|an isopeptide in poly-ubiquitin]].
</blockquote>
[[Image:1n73-APD1.05.png]]
[[Image:1n73-isopeptide-conservation-yellow.png]]
<br>
[[Image:1n73-300seqs-APD 1.04.png]]
[[Image:1n73-isopeptide-conservation.png]]
====Entire Domain With Insufficient Data====
In some cases, the MSA fails to cover an entire domain adequately, so the entire domain is yellow. With default job settings and automatic sequence selection, this happened for [[2yev]] chain A. As the ConSurf server explains, the solution to this is to run each domain as a separate ConSurf job (not shown). Here are [[ConSurfDB_vs._ConSurf#Too_Many_Yellow_Residues|instructions for separating domains]] into different PDB files.
<br>
[[Image:2yev-APD1.12.png]]
[[Image:2yev-consurf-chainA.png]]
==See Also==
*[[ConSurf/Index]]: Links to explanations of the principles of evolutionary conservation, as well as practical guidance.
*[[FirstGlance/Visualizing Conservation]]: Demonstrates the conveniences offered by FirstGlance for easily seeing conservation of salt bridges, cation-pi interactions, residues that bind ligand, substrate, or inhibitor, residues in covalent protein crosslinks, or any residues that you specify.

Proteopedia Page Contributors and Editors

Eric Martz