ConSurfDB vs. ConSurf: Difference between revisions
Eric Martz (talk | contribs) created page by splitting Conservation, Evolutionary |
<table align="right" class="wikitable" width=430><tr><td>
[[Image:2vaa-APD0.31-40degslow.gif]]
</td></tr><tr><td>
{{Template:ColorKey_ConSurf_NoYellow_NoGray}}
Conservation of amino acids non-covalently interacting with a peptide ({{Template:ColorKey_Element_C}} {{Template:ColorKey_Element_N}} {{Template:ColorKey_Element_O}}) in the groove of [https://www.youtube.com/watch?v=2ZakngfbHSo Major Histocompatibility Protein] Class I ([[2vaa]]). Conservation was '''not revealed''' until an [[#Average Pairwise Distance]] of 0.31 was achieved in a customized ConSurf Server job. [[#Examples|DETAILS BELOW]].
</td></tr></table>
Evolutionary Conservation is introduced at [[Introduction to Evolutionary Conservation]], and treated in somewhat greater depth in the article [[Conservation, Evolutionary]]. These describe how conservation patterns in 3D can help to identify functional sites in proteins. Proteopedia displays conservation patterns pre-calculated by [http://consurfdb.tau.ac.il ConSurfDB], when available. These are usually based on broad protein families that include sequences of proteins with multiple functions. Consequently, they usually '''obscure conservation''' present in a family of proteins with a single function (see [[Conservation%2C_Evolutionary#Caveats|Caveats]]).
The present article explains '''how to use options available at the ConSurf Server to reveal conservation within a group limited to proteins with a single function'''. Several [[#Examples|examples are presented]]. The mechanisms utilized by ConSurfDB and ConSurf Servers are also summarized.
==The Two ConSurf Servers==
There are two ConSurf servers:
*[http://consurfdb.tau.ac.il ConSurfDB]:
**{{Yelspan| Since 2022, ConSurfDB results are NOT compatible with [[FirstGlance in Jmol]].}} Visualizing and analyzing conservation results in FirstGlance has [http://firstglance.jmol.org/notes.htm#consurffg many advantages]. Use the ConSurf Server for compatibility with FirstGlance.
**{{Yelspan| In July, 2024, ConSurfDB had not been updated with new entries in the [[Protein Data Bank]] since mid-2022.}} To look for more recent updates, at [http://consurfdb.tau.ac.il ConSurfDB Home], click ''More''. <!-- In December, 2021: ConSurfDB had not been updated with new entries in the [[Protein Data Bank]] since November 4, 2019. -->
**Has pre-calculated results for every chain in the [[PDB]].
**Proteopedia's '''Evolutionary Conservation''' resource displays results from ConSurfDB.
**Results typically obscure some conservation related to a protein's function because the analysis typically included proteins of multiple functions (see [[Evolutionary_Conservation#ConSurf-DB_Often_Obscures_Some_Functional_Sites|ConSurfDB Often Obscures Some Functional Sites]]).
*[http://consurf.tau.ac.il ConSurf Server]:
**Results can be visualized and analyzed in [[FirstGlance in Jmol]], which has [http://firstglance.jmol.org/notes.htm#consurffg many advantages].
**You submit proteins of interest and wait for the analysis to be completed.
**Completely automated analysis typically gives excellent results.<!--Enables you to pick the sequences used in the analysis from a list with checkboxes.-->
**Optionally, highly flexible with many configurable parameters and several sequence database options.
**You can upload your own multiple sequence alignment, or phylogenetic tree, for use in the analysis.
Both servers use state-of-the-art methods that are published in peer-reviewed journals. For comparisons with other methods, see [[Conservation%2C_Evolutionary#Other_Evolutionary_Conservation_Servers|Other Evolutionary Conservation Servers]].
Both servers permit you to '''download results'''. This is a good idea since the continual growth of sequence databases and improvements in analysis algorithms will give at least slightly different results for the same jobs run several months or more apart. Also, results are periodically deleted from the ConSurf server to conserve disk space.
==Examining Functions of Proteins in ConSurf-DB's MSA==
:{{Yelspan| Since 2022, ConSurfDB results are NOT compatible with [[FirstGlance in Jmol]].}} Visualizing and analyzing conservation results in FirstGlance has [http://firstglance.jmol.org/notes.htm#consurffg many advantages]. Use the ConSurf Server for compatibility with FirstGlance.
:{{Yelspan| In July, 2024, ConSurfDB had not been updated with new entries in the [[Protein Data Bank]] since mid-2022.}} To look for more recent updates, at [http://consurfdb.tau.ac.il ConSurfDB Home], click ''More''. <!-- In December, 2021: ConSurfDB had not been updated with new entries in the [[Protein Data Bank]] since November 4, 2019. -->
As explained [[#ConSurf-DB Often Obscures Some Functional Sites|above]], ConSurf-DB typically includes proteins with more than one function in its conservation analysis. Before deciding whether to do a ConSurf Server job that [[#Limiting ConSurf Analysis to Proteins of a Single Function|limits the analysis to proteins of a single function]], you may want to see what proteins ConSurf-DB included in its analysis. Here is how to see the names (which hopefully reveal the functions) of the proteins included in ConSurf-DB's analysis of a protein chain. (The following steps are based on ConSurfDB as of July, 2024.)
# Go to [http://consurfdb.tau.ac.il consurf'''db'''.tau.ac.il] (the DB, distinct from the ConSurf Server).
# Enter the [[PDB code]] (PDB ID) for the protein of interest. If you get "ERROR: No chains found", it means that this PDB entry was released after the most recent update of ConSurfDB, so you should [[ConSurf Quick Analysis Procedure|use the ConSurf Server]].
# Select the chain of interest. If you're not sure, get familiar with the structure using [http://firstglance.jmol.org FirstGlance].
# Click the button '''Apply''', and wait for the results to load.
# Scroll down and click ''Homologues, Alignment, and Phylogeny''.
# Notice the number of sequences in the MSA (the number of "hits" upon which "calculations were conducted").
# '''To view the list of sequences in the multiple sequence alignment (MSA)''' from which the conservation pattern was determined, click on the count of hits upon which the calculations were conducted. <!--(If fewer than 50 hits were used, it will probably be useful to do another run using a larger database: UniProt and NR are larger than the default UniProt90. Use the button "Select Run Parameters Manually".-->
# '''Get the Average Pairwise Distance (APD)''' by clicking "View Alignment Details". A value close to, or larger than 1.0 suggests that proteins with multiple functions were included in the multiple sequence alignment, which obscures conservation related to the function of your query protein. See [[Interpreting ConSurf Results]], where the APD is explained.
If the list includes proteins whose functions differ from that of the protein of interest, their inclusion tends to obscure patches of conservation that exist among proteins with the same function as the query protein.
====Example With Multiple Functions====
[[2vaa]] is a [[major histocompatibility complex]] '''class I''' protein (MHC I). ConSurfDB used 300 sequences in its calculation, and the MSA had an '''APD of 1.63'''. Starting about halfway down the list of sequences is the first '''MHC II''' protein, which has a different function. Early in the list is the first of 14 sequences for hemochromatosis proteins (unrelated in function). The list includes 62 "Ig-like domain-containing" proteins of unknown function. The inclusion of these and many other sequences for proteins with functions that differ from the query will obscure conservation of residues critical for the specific functions of MHC I proteins. Therefore, you may prefer to run a ConSurf job in which you <!--limit the MSA to MHC I proteins by manual selection--> limit the APD to a value below 1.0. [[#Examples|See such examples below]].
==Limiting ConSurf Analysis to Proteins of a Single Function==
''This section was updated in July, 2024 to correspond to changes in the ConSurf Server.''
As explained [[#ConSurf-DB Often Obscures Some Functional Sites|above]], ConSurf'''DB''' results, and the ConSurf-DB ''Evolutionary Conservation'' scene available in Proteopedia, often include proteins with multiple functions. However, the best way to find all functional sites by conservation analysis is to limit the analysis to proteins with a single function.
When the [[Interpreting ConSurf Results|Average Pairwise Distance]] (APD) in the multiple sequence alignment (MSA) approaches or exceeds approximately 1.0, it is likely that proteins with multiple functions have been included in the MSA. To see conservation that reflects the function of the query protein, it is best to use an MSA with an APD in roughly the range 0.3-0.6. Sometimes, the ConSurf Server's result with default settings may give such a result. If not, you may wish to do additional ConSurf runs with the goal of reducing the APD.
Prior to 2022, the ConSurf Server enabled manual selection of sequences. Unfortunately, after a 2022 update to the ConSurf Server, this is no longer practical. Hence we are limited to adjusting run parameters by trial and error "in the dark" until satisfactory results are obtained.
#Go to [http://consurf.tau.ac.il consurf.tau.ac.il], the ConSurf Server (distinct from ConSurf-DB).
#Enter your PDB ID in the slot (or upload your PDB file).
#'''IGNORE''' the "pre-calculated ConSurfDB analysis" if offered. (Presumably you've already decided it has an APD value too high, and it will not work in FirstGlance in Jmol.)
#Enter a title. Best if the title includes abbreviations for the custom settings you plan for this run. For example, if you are running 2vaa chain A with the UniRef90 database, 150 sequences sampled from the unique hits, with maximal %ID 99% and minimal %ID 60%, the title might be "2vaa u90 150s 99-60".
#Enter your email address.
#Click '''Select Run Parameters Manually'''.
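If you plan several runs, it may help to generate the job titles consistently. The following is only a minimal sketch of the naming convention suggested in the title step above; the function name <code>job_title</code> and every abbreviation other than "u90" are hypothetical choices, not anything the ConSurf Server requires.

```python
def job_title(pdb_id, database, n_seqs, max_id, min_id):
    """Build a compact job title such as '2vaa u90 150s 99-60'."""
    # Only "u90" comes from the example in the text; the other
    # abbreviations are made up here for illustration.
    db_abbrev = {
        "UniRef90": "u90",
        "UniProt": "up",
        "SwissProt": "sp",
        "NR": "nr",
    }
    return f"{pdb_id} {db_abbrev[database]} {n_seqs}s {max_id}-{min_id}"


# Reproduces the example title given in the procedure above.
print(job_title("2vaa", "UniRef90", 150, 99, 60))
```

Keeping the settings in the title makes it easier to match each downloaded result to the parameters that produced it.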
Below are suggestions for selecting the run parameters.
<!--A procedure for doing this follows.
===Procedure===
#Go to [http://consurf.tau.ac.il consurf.tau.ac.il], the ConSurf Server (distinct from ConSurf-DB).
#Fill out the form. For your first run, all options can be left at their default settings. When you get to the section ''Select homologs for ConSurf analysis'', be sure to check '''manually'''.
#Enter your email address and click the ''Submit'' button.
#After a few minutes, a <font color="#00d000"><b>green</b></font> message will appear <font color="#00d000"><b>SELECT SEQUENCES</b></font>. The job cannot continue until you select the sequences.
#Look at the names of the proteins in the list that has checkboxes, under the header "Sequences producing significant alignments:". Find the first case where the function of the protein is not the same as the protein of interest. Usually you will want to exclude sequences for proteins of different functions.
#Just below the large red line <font color="red">Please choose which sequences you want to use for ConSurf calculation</font> is a form. Put the number of the last sequence having the same function as the protein of interest in the box "Select the first [ .... ] sequences". Then click on the "Update selection" button.
## ConSurf will not accept >500 sequences. 200-250 sequences are plenty. Using more sequences simply loads the server unnecessarily and delays returning your result. If the number of the last sequence having the same function as the query protein is higher than 250, use the radio buttons labeled "only every 2nd, 3rd, ..." to reduce the total number of sequences selected while sampling the full diversity of the desired sequences.
##The form is confusingly labeled. If you check an "only every" number, then you will need to divide the number in the slot labeled "Select the first" by the "only every" number you selected. For example, if the first 472 sequences have the same function as the query protein, check "only every 2nd" and enter 236 (namely, 472/2) in the "Select the first" slot.
#Examine the list of sequences to make sure that only the desired sequences are checked. (Of course you may check or uncheck individual sequences if you wish.)
#When you are satisfied, scroll to the very bottom of the page and click the ''Submit'' button.
-->
===APD Is Too High===
As explained above and under [[Interpreting ConSurf Results]], when the APD approaches or exceeds 1.0, conservation related to the function of the query protein may be obscured due to inclusion in the MSA of proteins with multiple functions. There are several ways to get less diversity in the MSA:
#Increase the "Minimal %ID" from the default 35% to perhaps 60%-75%. This will exclude the more distantly related sequences.
#Increase the "Maximal %ID" from the default 95% to perhaps 98%. This will include more sequences closely related to the query.
The above two actions are usually sufficient. Actions with larger effects are:
#The default is to choose the 150 sequences in the MSA by uniform '''sampling''' of the unique hits. This sample includes both the closest and farthest sequences from the query. Change "sampling" to '''closest''', so that the MSA uses the hits closest to the query. This typically produces a '''large drop in the APD'''. You could reduce the drop by increasing the number of sequences used from 150 to perhaps 250. The more sequences you use, the longer the job will take to complete.
#Try getting sequences from SwissProt instead of UniRef90. SwissProt has roughly 100-fold fewer sequences than does UniRef90. This tends to reduce the APD, but you may have difficulty getting enough sequences to avoid a large number of residues with "insufficient data" (colored yellow).
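The effect of the %ID limits and of the "sampling" versus "closest" choice can be pictured as a two-stage selection. The sketch below is not the ConSurf Server's code: the hit list of <code>(name, percent_identity)</code> pairs is hypothetical, and evenly spaced picking from a list ranked by identity is an assumed reading of "uniform sampling of the unique hits".

```python
def filter_by_identity(hits, min_id=35.0, max_id=95.0):
    """Keep hits whose %identity to the query lies within [min_id, max_id]."""
    return [h for h in hits if min_id <= h[1] <= max_id]


def select_hits(hits, n=150, mode="sampling"):
    """Pick up to n hits, either evenly spread or closest to the query."""
    # Rank from most to least similar to the query.
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    if len(ranked) <= n:
        return ranked
    if mode == "closest" or n < 2:
        return ranked[:n]
    # "sampling": evenly spaced indices that span the whole ranked list,
    # so both the closest and the farthest hits are represented.
    last = len(ranked) - 1
    return [ranked[i * last // (n - 1)] for i in range(n)]


# Hypothetical hits with identities from 100% down to 31%.
hits = [(f"seq{i}", 100 - i) for i in range(70)]
narrow = filter_by_identity(hits, min_id=60, max_id=98)
print(len(narrow), [h[1] for h in select_hits(narrow, n=5, mode="closest")])
```

In this toy picture, raising the minimal %ID removes the distant tail of the list, and "closest" discards the distant part of whatever remains, which is consistent with both actions lowering the APD.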
===APD Is Too Low===
When the APD drops close to, or falls below, about 0.2, conservation will be exaggerated because of limited diversity in the MSA. You could increase diversity:
#Decrease the "Maximal %ID" below the default of 95%, perhaps to 80% or 60%. This will exclude the most closely related sequences.
#Search a larger database of sequences. [https://uniprot.org UniProt] has more sequences than the default of [https://www.uniprot.org/help/uniref UniRef90], although the additional sequences may not add much diversity. The [https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/databases.html NR (Non-Redundant) database] is the largest (twice as many sequences as UniProt), but contains more errors.
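The APD guidance in this article can be condensed into a rough rule of thumb. The sketch below simply encodes the approximate thresholds stated in the text (about 1.0 as too high, about 0.2 as too low, roughly 0.3-0.6 as the target); the function name <code>apd_advice</code> and the hard cut-offs are illustrative assumptions, since the text describes these boundaries only loosely.

```python
def apd_advice(apd):
    """Return a rough suggestion for the next ConSurf run, given an APD."""
    if apd >= 1.0:
        # Likely includes proteins with multiple functions.
        return "too high: raise Minimal %ID, raise Maximal %ID, or use closest"
    if apd <= 0.2:
        # Too little diversity; conservation will be exaggerated.
        return "too low: lower Maximal %ID or search a larger database"
    if 0.3 <= apd <= 0.6:
        return "in the suggested range"
    # Between the stated ranges the text gives no firm guidance.
    return "borderline: inspect the result and consider another run"


for value in (1.63, 1.1, 0.51, 0.31, 0.15):
    print(value, "->", apd_advice(value))
```

Treat the output as a prompt for another trial run rather than a verdict, since parameters are adjusted by trial and error.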
<!--===Too Many Sequences===
ConSurf will list up to 2,000 sequences from which to select. In some cases, these sequences are all too similar. Some proteins will retrieve >5,000 sequences with an expectation value (E value) < 1.0e-4 (1.0 times ten to the -4), the default threshold. Then the 2,000th sequence listed may still be very similar to the first sequence listed. This would be true if the 2,000th sequence has a very small E value, such as 1.0e-100. In such a case, you may wish to try searching the Swiss-Prot database, which is much smaller than the default Uniref-90 database. Start a new job, with the only difference being the database searched.-->
===Too Few Sequences===
If the default search for sequence homologs fails to find the minimum of 5 sequences, or if you have more than a few yellow residues (yellow means insufficient data to assign a meaningful conservation value):
*Under "Choose parameters for homolog search algorithm", change the ''Protein Database'' to UniProt or NR (larger databases than the default Uniref90).
* Increase the "Maximal %ID between sequences" from the default of 95% to perhaps 98%.
If the larger database does not give enough sequences, you can use other options to widen the search for sequences, knowing that you will be retrieving sequences less related to the query sequence, likely including proteins with functions differing from that of the query:
* Increase the number of iterations in the search to more than the default of one. Each iteration generates a sequence profile that is used as a query for the next iteration.
* Increase the default E cutoff of 0.0001, for example, to 0.001 or 0.01.
===Too Many Yellow Residues===
If more than a few residues are yellow, it means that the MSA had insufficient data to assign meaningful conservation values to the yellow residues. Try to increase the number of sequences in the MSA: See [[#Too Few Sequences]].
If an entire [[domain]] of your query protein is <span style="color: yellow; background-color: black;"><b> yellow (insufficient data) </b></span>, it is because the multiple sequence alignment (MSA) has poor coverage of the yellow domain. In this case, it is best to do separate ConSurf runs for each domain in your protein.
<blockquote>
#Determine sequence numbers for the linkers between domains. This can be done easily by inspection in [http://firstglance.jmol.org FirstGlance in Jmol].
#Using a [[plain text editor]], delete everything in your [[PDB file]] except the lines that begin with ATOM or HETATM. (It's OK to leave the lines ANISOU if present.)
#Based on sequence numbers, separate the domains into different PDB files.
#Upload each domain's PDB file to ConSurf as a separate job.
</blockquote>
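The editing steps above can also be scripted instead of done by hand in a text editor. The following is a minimal sketch, assuming a standard fixed-column PDB file in which the residue sequence number occupies columns 23-26, and assuming the file holds only the chain of interest; the function names and the domain boundaries in the usage lines are hypothetical.

```python
def coordinate_lines(pdb_text):
    """Keep only ATOM, HETATM and ANISOU records."""
    keep = ("ATOM", "HETATM", "ANISOU")
    return [ln for ln in pdb_text.splitlines() if ln.startswith(keep)]


def residue_number(line):
    """Residue sequence number from PDB columns 23-26."""
    return int(line[22:26])


def split_domains(pdb_text, domains):
    """Group coordinate lines by domain.

    domains maps a domain name to an inclusive (first, last) residue range.
    Lines outside every range (for example, linkers) are dropped.
    """
    parts = {name: [] for name in domains}
    for ln in coordinate_lines(pdb_text):
        num = residue_number(ln)
        for name, (first, last) in domains.items():
            if first <= num <= last:
                parts[name].append(ln)
    return parts


def write_domains(pdb_path, domains):
    """Write one PDB file per domain, named after the domain."""
    with open(pdb_path) as fh:
        parts = split_domains(fh.read(), domains)
    for name, lines in parts.items():
        with open(f"{name}.pdb", "w") as out:
            out.write("\n".join(lines) + "\nEND\n")


# Hypothetical usage: two domains separated by a linker at residues 181-185.
# write_domains("query.pdb", {"domain1": (1, 180), "domain2": (186, 274)})
```

Each resulting file can then be uploaded to the ConSurf Server as a separate job.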
==Examples==
<StructureSection load='' size='350' side='right' caption='' scene='39/399854/2vaa_consurf_halos_w274_y159/4'>
With default parameters, the ConSurf Server results have an average [[#Average Pairwise Distance]] (APD) of 1.00<ref name="APD">Tested with 20 arbitrarily selected proteins, mostly enzymes. Average of the average pairwise distance (APD) values: 1.00; range 0.82-1.42.</ref>, and an average of only a few "yellow" residues with insufficient data.<ref name="ISD">Tested with 20 arbitrarily selected proteins, mostly enzymes. Average number of amino acids with insufficient data ("yellow" in ConSurf): 3.5; range 0 to 16.</ref> For the examples below, it was necessary to [[#Limiting ConSurf Analysis to Proteins of a Single Function|customize the ConSurf Server job parameters]] in order to '''reveal conservation of key residues''' in proteins with the same function as the query.
These molecular scenes were obtained in [[FirstGlance in Jmol]], which offers many conveniences for analyzing ConSurf Server results. See [[Help:How to Insert a ConSurf Result Into a Proteopedia Green Link|How to Insert a ConSurf Result Into a Proteopedia Green Link]].
===Case #1: MHC===
The alpha chain of [https://www.youtube.com/watch?v=2ZakngfbHSo Major Histocompatibility Complex (MHC)] Class I protein has a groove that binds a wide range of peptides, and a small loop that binds CD8. Our example is [[2vaa]] (mouse H-2Kb).
<span style="float:right;">{{Template:ColorKey_ConSurf}}</span>
====ConSurf Server Default APD 1.1====
[[2vaa]] contains three chains. Here, (<scene name='39/399854/2vaa_consurf_halos_w274_y159/4'>restore initial scene, ConSurf Server default settings, APD 1.1</scene>)<ref name="consurfdefaults">Default ConSurf Server settings: UniRef90 database, excluding sequences with > 95% or < 35% identity with the query, MSA has 150 sequences sampled evenly from all unique sequence hits.</ref> ConSurf colors are applied only to the alpha chain (chain A), while the beta chain (chain B = β-2 microglobulin) and the 8 amino acid peptide (chain P) are shown as gray backbone traces.
Conservation of important residues in the groove is obscured by inclusion in the MSA of proteins with different functions ([[#Example With Multiple Functions|see analysis above]]). The sides of the groove are variable due to many alleles that enable it to bind a wide range of peptide sequences. The only groove residue that is conserved at greater than level 7 is '''Tyr159''' (level 8), whose sidechain hydrogen bonds to the main-chain oxygen of the amino-terminal peptide residue. Only a handful of surface residues are highly conserved (level 9), including '''Trp274''', which is involved in binding CD8.
====ConSurfDB APD 1.63====
ConSurfDB has a result (NOT SHOWN) with an '''APD of 1.63''', much higher than the APD 1.1 for the ConSurf Server with default settings. As expected, nothing in the contacts between the peptide and the groove shows high conservation in the ConSurfDB result, but Trp274 (the CD8 binding site) remains highly conserved.
====ConSurf Server Custom APD 0.51====
A custom ConSurf job resulting in an APD of 0.51<ref name="apd0.51">Custom ConSurf Server settings for APD 0.51 with 2vaa: UniRef90 database, excluding sequences with > 95% or '''< 50%''' identity with the query, MSA has 150 sequences sampled evenly from all unique sequence hits.</ref> (NOT SHOWN) had '''NO groove residues with conservation levels > 6'''. Trp274 was level 9.
====ConSurf Server Custom APD 0.31====
<span style="float:right;">{{Template:ColorKey_ConSurf_NoYellow_NoGray}}</span>
By default, ConSurf Server excludes from the multiple sequence alignment sequences with >95% identity, or <35% identity with the query sequence. Changing those limits to >98% and <70% reduced the default APD of 1.1 to 0.31<ref name="apd0.31">Custom ConSurf Server settings for APD 0.31 with 2vaa: UniRef90 database, excluding sequences with '''> 98% or < 70%''' identity with the query, MSA has 150 sequences sampled evenly from all unique sequence hits.</ref>. <scene name='39/399854/2vaa_apd_point31/3'>This result reveals high conservation of the following 4 key residues in the groove</scene> ({{Yelspan|yellow halos}}). With spin OFF, touch a residue to identify it.
* <span style="background-color:#961d54;color:white;padding:0.2em 0.4em 0.1em 0.4em;">Level 9:</span>
**Tyr7: hydrogen bonds to the amino terminus of the peptide. (Floor of the groove, hard to see.)
**Lys146: salt bridges to the carboxy terminus of the peptide.
* <span style="background-color:#ec6d96;color:white;padding:0.2em 0.4em 0.1em 0.4em;">Level 8:</span>
**Tyr84: hydrogen bonds to the peptide C-terminus.
**Thr143: hydrogen bonds to the peptide C-terminus. (Buried, hard to see.)
(Tyr159 was level 6. CD8 binding site Trp274 remains level 9.)
<scene name='39/399854/2vaa_peptide_contacts/1'>Here are all the polar residues contacting the peptide</scene>. Use the '''POPUP BUTTON''' to see details! (This scene is easily obtained in [http://firstglance.jmol.org FirstGlance]: Tools tab, click Contacts, check Label Contacts, and [[Help:How to Insert a ConSurf Result Into a Proteopedia Green Link|made into a Green Link]].)
Another custom ConSurf Server job<ref name="apd0.30">Custom ConSurf Server settings for APD 0.30 with 2vaa: UniRef90 database, excluding sequences with > 95% or < 35% identity with the query, MSA has '''250''' sequences '''closest''' to the query.</ref> gave an '''APD of 0.30''', but levels for the above 4 groove residues were 7-8. These lower levels can be accounted for by the highest expectation value<ref name="evalue" /> in the MSA, which was 10<sup>-141</sup>. In contrast, for the job with APD 0.31, the highest expectation value was 10<sup>-84</sup>.
===Case #2: UV Resistance Protein===
<scene name='39/399854/4dnw_consurf_apd-point48/1'>''Arabidopsis'' UVB-Resistance Protein UVR8</scene> [[4dnw]] is a homodimer with an <scene name='39/399854/4dnw_consurf_apd-point48/2'>unusual number of between-chain salt bridges</scene>. '''Are the between-chain salt bridges more conserved than the within-chain salt bridges?'''
[[FirstGlance in Jmol]] displays <scene name='39/399854/4dnw_consurf_apd-point48/2'>all salt bridges</scene> with one click (Tools tab), colored by conservation (if pre-processed by the ConSurf Server), and can list them, '''spreadsheet-ready, including conservation level numbers, and marking those between chains'''.
With the default ConSurf Server result '''APD 1.42''', and with a custom ConSurf Server result '''APD 0.91''', the salt-bridged residues have about '''average''' conservation. With a custom result '''APD 0.48''', the between-chain salt bridges have '''above-average''' conservation (7.6 vs. 6.8), while the within-chain salt bridges have below-average conservation (6.3 vs. 6.8). In conclusion, when the multiple sequence alignment is limited to sequences closely related to the query (APD 0.48), '''between-chain salt bridged residues are more conserved than are within-chain salt bridged residues.''' The difference is '''statistically significant''' (p < 0.01<ref name="stats">With APD 0.48, mean conservation of between-chain salt bridged atoms is 7.57 ± 0.13 SEM. Subtracting 3 SEM (99% confidence limit) gives 7.18. This does not overlap with either 7.16 (the all-salt-bridged atoms mean + 3 SEM) or 6.82 (the mean for within-chain salt-bridged atoms + 3 SEM).</ref>).
<table class="wikitable" style="text-align:center;">
<tr><td colspan=5>Salt Bridges in [[4dnw]]</td></tr>
<tr><td rowspan=2>ConSurf [[#Average Pairwise Distance|APD]]</td><td rowspan=2>Level 9:<br>% of All Residues</td><td colspan=3><center>Mean Conservation Levels ± SEM</center></td></tr>
<tr><td>All Residues</td><td>Salt Bridges Between Chains</td><td>Salt Bridges Within Chains</td></tr>
<tr><td>1.42<ref name="consurfdefaults" /></td><td>14%</td><td>3.7</td><td>3.5</td><td>3.8</td></tr>
<tr><td>0.91<ref name="apd0.91">ConSurf settings for APD 0.91 with 4dnw: Clean UniProt, 35-95%, 200 sequences closest to query.</ref></td><td>16%</td><td>5.4</td><td>6.0</td><td>5.0</td></tr>
<tr><td>0.48<ref name="apd0.48">ConSurf settings for APD 0.48 with 4dnw: Clean UniProt, 35-95%, 125 sequences closest to query.</ref></td><td>18%</td><td>6.8 ± 0.12*</td><td>7.6 ± 0.13*</td><td>6.3 ± 0.17*</td></tr>
</table>
* *Averages are per atom for 88 between-chain salt-bridged atoms and 140 within-chain salt-bridged atoms. SEMs were calculated as the standard deviation divided by the square root of the atom count. Differences for APD 0.48 are statistically significant, p < 0.01<ref name="stats" />.
*Salt bridges are Lys or Arg sidechain nitrogens within 4.0 Å of Asp or Glu sidechain oxygens.
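The significance argument rests on simple arithmetic that is easy to re-check. The sketch below recomputes the 3 SEM comparison from the figures quoted in the footnote; the helper names are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text says only "standard deviation".

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def lower_limit(mean, sem_value, k=3):
    """Mean minus k standard errors."""
    return mean - k * sem_value


# Figures quoted in the footnote for the APD 0.48 job.
between_lower = lower_limit(7.57, 0.13)   # between-chain mean - 3 SEM
all_upper = 7.16                          # all-salt-bridged mean + 3 SEM
within_upper = 6.82                       # within-chain mean + 3 SEM

# The intervals do not overlap if the lower limit for between-chain
# atoms lies above both upper limits.
print(round(between_lower, 2), between_lower > all_upper, between_lower > within_upper)
```

With per-atom conservation levels exported from FirstGlance, <code>sem()</code> could be applied directly to each group of atoms to rebuild the ± values in the table.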
Examples of conserved patches on other proteins, revealed by ConSurf, can be found in the articles on
*[[Lac repressor]]
*[[Avian Influenza Neuraminidase, Tamiflu and Relenza]]
*[[Mechanosensitive channels: opening and closing]]
</StructureSection>
==Conclusion==
In order to discover key functional residues, it is important to inspect multiple ConSurf Server jobs for highly conserved residues, including multiple jobs with [[#Average Pairwise Distance]]s (APD) in the range 0.25-0.5 using the [[#Limiting ConSurf Analysis to Proteins of a Single Function|above methods]]. Jobs with APD higher than 0.5 may obscure conservation of residues crucial for the function of the query protein. Residues conserved in the broader family of more distantly related proteins with different functions will typically be revealed with default ConSurf Server settings (APD ~ 1.0), or even in the ConSurf'''DB''' result.
==The ConSurf-DB Mechanism== | ==The ConSurf-DB Mechanism== | ||
Because results from the ConSurf DataBase server, [http://consurfdb.tau.ac.il ConSurf-DB]<ref name="consurfdb">PMID: 18971256</ref> are displayed within Proteopedia as ''Evolutionary Conservation'', an overview of its methods is provided here. ConSurf-DB pre-calculates conservation levels for each amino acid in every protein chain in the [[Protein Data Bank]]. It went into service in 2008. It uses state-of-the-art methods, all published in peer-reviewed journals<ref name="consurfdb" /> | :{{Yelspan| In January 2018: ConSurfDB had not been updated with new entries in the [[Protein Data Bank]] since January, 2013. }} | ||
Because results from the ConSurf DataBase server, [http://consurfdb.tau.ac.il ConSurf-DB]<ref name="consurfdb">PMID: 18971256</ref> are displayed within Proteopedia as ''Evolutionary Conservation'', an overview of its methods is provided here. ConSurf-DB '''pre-calculates''' conservation levels for each amino acid in every protein chain in the [[Protein Data Bank]]. It went into service in 2008. It uses state-of-the-art methods, all published in peer-reviewed journals<ref name="consurfdb" />. | |||
===ConSurf-DB Process===
#A list of unique protein chains is extracted from the [[Protein Data Bank]]. Chains shorter than 30 amino acids are not processed because they do not contain enough information for reliable phylogenetic tree construction. Certain non-standard residues are converted to the closest standard amino acids, for example, [[selenomethionine]] MSE is converted to MET. Chains that still have more than 15% non-standard residues are not processed. Chains that could not be processed are colored gray in Proteopedia -- see the color key at the top of this page.
#The amino acid sequence of each protein chain is submitted to [http://hmmer.org HMMER] for collection of related sequences from the UniRef90 database. By default, one iteration is performed using an expectation value<ref name="evalue">'''Expectation Value (E value):''' When searching a sequence database with a query sequence, e.g. using BLAST or PSI-BLAST, each found sequence can be characterized by an E value. It is the number of hits expected by chance with the sequence matching level observed, taking into account the size of the sequence database and length of the query sequence. Low values of E (much less than one) mean increasing significance of the match.</ref> cutoff of 10<sup>-4</sup>.
#The found sequences are then filtered ([[#Filtering|see below]]) using a scheme that attempts a balance between limiting the sequences to close homologues, and including distant sequences that do not share structure or function.
#The filtered sequence set is multiply aligned with [https://mafft.cbrc.jp MAFFT] (a multiple sequence alignment algorithm that out-performs older algorithms such as MUSCLE and CLUSTALW).
#A phylogenetic tree is constructed from the multiple sequence alignment (MSA) using the Rate4Site program developed by the ConSurf team.
#Rate4Site then calculates an evolutionary rate for each position in the MSA using a [http://en.wikipedia.org/wiki/Bayesian_inference Bayesian] approach shown by the ConSurf team to be superior<ref>PMID: 15201400</ref>. "The amino acid evolution is traced using the JTT<ref> PMID: 1633570</ref> substitution model. High evolutionary rate represents a variable position while low rate represents an evolutionarily conserved position."<ref name="consurfdb" />
#Colors mapped to the nine conservation levels, from <font color="#0fC7CF"><b>turquoise (1)</b></font> to <font color="#A01F5F"><b>burgundy (9)</b></font> are applied to the 3D protein structure visualized in [[FirstGlance in Jmol]]. A coloring script for [[RasMol]] is also provided.
<center>{{Template:ColorKey_ConSurf}}</center>
#A confidence interval for the conservation level is calculated for each amino acid position in the MSA. When this indicates low reliability, the position is colored {{Yelspan|yellow}}<!--<font color="#c0c000"><b>yellow</b></font>-->, signifying that the data were insufficient to assign a meaningful conservation level.
<ol start="11">
<li>An ''Average Pairwise Distance'' (APD) is calculated to describe the diversity of sequences in the MSA ([[#Average Pairwise Distance|see below]]).
#Redundant sequences (>95% identical) are removed using CD-HIT<ref>PMID: 16731699</ref>.
#A maximum of 300 sequences meeting the above criteria is used (the 300 with the lowest expectation values<ref name="evalue" />, that is, most closely related to the query sequence).
#If the above process yields fewer than 5 sequence homologs, no calculation is performed due to insufficient data. In February, 2008, this occurred for 1,348 chains out of 30,918 (4%).
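The three filtering rules listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the ConSurf-DB implementation: the <code>Hit</code> record, <code>percent_identity</code>, and <code>filter_hits</code> are hypothetical names, and a simple greedy comparison of pre-aligned sequences stands in for CD-HIT, which clusters sequences with a different, much faster algorithm.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """Hypothetical record for one homolog found by the sequence search."""
    name: str
    evalue: float
    aligned_seq: str  # assumed to be pre-aligned; '-' marks a gap


def percent_identity(a: str, b: str) -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def filter_hits(hits, max_identity=95.0, max_sequences=300,
                min_sequences=5) -> Optional[list]:
    """Apply the three rules from the text; return None if too few hits remain."""
    kept = []
    # Visit hits from lowest to highest E value, so that of any redundant
    # pair the hit more closely related to the query is the one retained.
    for hit in sorted(hits, key=lambda h: h.evalue):
        # Greedy stand-in for CD-HIT: drop a hit that is more than
        # max_identity percent identical to any hit already kept.
        if all(percent_identity(hit.aligned_seq, k.aligned_seq) <= max_identity
               for k in kept):
            kept.append(hit)
    kept = kept[:max_sequences]      # at most 300, lowest E values first
    if len(kept) < min_sequences:    # fewer than 5 homologs: no calculation
        return None
    return kept
```

Sorting by E value before removing redundancy is a design choice of this sketch; the text above does not say in which order ConSurf-DB applies the two operations.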
===Average Pairwise Distance===
The ''Average Pairwise Distance'' (APD) is an important measure of the diversity in the multiple sequence alignment. To learn how to interpret it, please see above, and [[Interpreting ConSurf Results]].
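To make the idea of an average over all sequence pairs concrete, here is a minimal Python sketch. It assumes a simple ''p-distance'' (the fraction of aligned positions that differ) as the distance between two sequences. This section does not state which distance measure ConSurf uses, so values from this sketch should not be compared with APD values reported by ConSurf. The function names are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("sequences share no aligned residues")
    return sum(x != y for x, y in pairs) / len(pairs)


def average_pairwise_distance(msa: list) -> float:
    """Mean distance over every unordered pair of sequences in the alignment."""
    if len(msa) < 2:
        raise ValueError("need at least two sequences")
    distances = [p_distance(a, b) for a, b in combinations(msa, 2)]
    return sum(distances) / len(distances)
```

With this measure, an alignment of identical sequences has an average distance of zero, and the value grows as the sequences in the alignment become more diverse.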
==The ConSurf Server==
The [http://consurf.tau.ac.il ConSurf Server], first available in 2001<ref>PMID: 11243830</ref><ref>PMID: 12499312</ref><ref>PMID: 15980475</ref> with many subsequent enhancements, can calculate and display the conservation pattern for 3D structures '''completely automatically'''. It should be used whenever the pre-calculated result at the [[#The ConSurf-DB Mechanism|ConSurf-DB]] is unavailable, or does not meet your needs (for example, see [[#Limiting ConSurf Analysis to Proteins of a Single Function|above]]), or if you have your own multiple sequence alignment (MSA) that you wish to use. The default settings of ConSurf may need to be adjusted in order to get an optimally informative result. The main adjustment needed is to gather an adequate number of sequences for proteins of the same function as your protein of interest (see [[#Limiting ConSurf Analysis to Proteins of a Single Function|above]]).
[[ConSurf Quick Analysis Procedure|ConSurf Server job submission instructions]].
Like ConSurf-DB, the ConSurf Server uses the same state-of-the-art methods, all of which are published in peer-reviewed journal articles. Unlike ConSurf-DB's pre-calculated results, the ConSurf Server permits considerable customization. For example, the user may specify the number of sequences to use, choose the database from which sequences are obtained, set the Expectation cutoff<ref name="evalue" />, set the number of HMMER iterations, or submit their own multiple sequence alignment or phylogenetic tree. You can also upload your own PDB file, which enables you to process unpublished data, theoretical models, or "trimmed" chains, e.g. a [[domain]] of interest from a multiple-domain chain.
In brief, the [http://consurf.tau.ac.il ConSurf Server] uses the following process by default:
# Obtains the protein sequence for the specified PDB code (or uploaded PDB file) and chain.
# Gathers closely related sequences from UNIREF90 (or another database that you specify) with an HMMER search (or other algorithm that you specify). E value cutoff<ref name="evalue" />, number of iterations, and number of sequences to use are configurable.
# Filters the sequences, by default eliminating those redundant at 95% or higher identity with each other, and those with less than 35% sequence identity to the query sequence. These percentages are adjustable.
<!--#Optionally enables the user to manually select which sequences will be used, from a list with checkboxes. In particular, this enables users to limit the analysis to proteins having the same function as the protein of interest (see [[#Limiting ConSurf Analysis to Proteins of a Single Function|above]]).-->
# Does a multiple sequence alignment with MAFFT. (Or you can choose a different algorithm or upload your own MSA.)
# Constructs a phylogenetic tree using neighbor joining with ML distance. (Or you can choose a different algorithm or upload your own tree.)
# Calculates a conservation score with confidence interval for each amino acid. Classifies the conservation scores into nine levels, and maps them to standard conservation level colors (see color key at the top of this page). Marks residues for which the conservation score confidence interval is too large, hence the conservation score is unreliable ("insufficient data").
# Displays the protein, colored by conservation, in interactive 3D, using the NGL Viewer, [[FirstGlance in Jmol]], [[Chimera]], or [[PyMOL]].
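The classification step in the list above (nine levels plus an "insufficient data" flag) can be illustrated with a short Python sketch. Everything specific in it is an assumption made for illustration: equal-width bins between a lowest and highest rate, and a flag raised when the two ends of the confidence interval fall four or more levels apart. The binning and reliability rules actually used by ConSurf are not given in this section, and the names <code>grade</code> and <code>classify</code> are hypothetical.

```python
def grade(rate: float, lo: float, hi: float, n_levels: int = 9) -> int:
    """Map an evolutionary rate to a level 1..n_levels.

    Assumes equal-width bins between lo and hi. A low rate (conserved
    position) maps to a high level, matching the color key where 9 is
    the most conserved and 1 the most variable.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clamped = min(max(rate, lo), hi)
    fraction = (clamped - lo) / (hi - lo)            # 0 = slowest, 1 = fastest
    bin_index = min(int(fraction * n_levels), n_levels - 1)
    return n_levels - bin_index


def classify(rate: float, ci_low: float, ci_high: float,
             lo: float, hi: float, min_unreliable_spread: int = 4):
    """Return (level, insufficient_data) for one alignment position.

    insufficient_data is True when the levels of the two ends of the
    confidence interval differ by min_unreliable_spread or more
    (an assumed threshold, not taken from the text).
    """
    level = grade(rate, lo, hi)
    spread = grade(ci_low, lo, hi) - grade(ci_high, lo, hi)
    return level, spread >= min_unreliable_spread
```

A position whose interval is narrow keeps its level and color; a position whose interval stretches across much of the scale is reported with the flag set, which corresponds to the yellow "insufficient data" coloring described above.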
==See Also==
*[[ConSurf/Index]] provides links to all pages about evolutionary conservation and ConSurf in Proteopedia.
==Notes & References==
{{Reflist}}