AlphaFold2 examples from CASP 14: Difference between revisions

Eric Martz (talk | contribs)
No edit summary
 
(26 intermediate revisions by the same user not shown)
Line 1: Line 1:
<span class="text-red">'''This page is under construction.''' [[User:Eric Martz|Eric Martz]] 01:03, 22 February 2021 (UTC)</span>
Prediction of protein structures from amino acid sequences, [[theoretical modeling]], has been extremely challenging. In 2020, breakthrough success was achieved by AlphaFold2<ref name="af2">PMID:31942072</ref>, a project of [http://deepmind.com DeepMind]. '''For an overview of this breakthrough''', documented by the biennial prediction competition [[Theoretical_models#CASP|CASP]], please see [[Theoretical_models#2020:_CASP_14|2020: CASP 14]]. Two examples of predictions from that competition are illustrated below.
Prediction of protein structures from amino acid sequences, [[theoretical modeling]], has been extremely challenging. In 2020, breakthrough success was achieved by AlphaFold2<ref name="af2">PMID:31942072</ref>, a project of [http://deepmind.com DeepMind]. '''For an overview of this breakthrough''', documented by the biennial prediction competition [[Theoretical_models#CASP|CASP]], please see [[Theoretical_models#2020:_CASP_14|2020: CASP 14]]. Two examples of predictions from that competition are illustrated below.


Line 9: Line 7:
First, SARS-CoV-2 ORF8<ref name="7jtl" />, a 92-residue FM domain where '''AlphaFold2's GDT_TS was 87, and the second best was 43''' (by the group of Xian Ming Pan)<ref name="t1064">For SARS-CoV-2 ORF8, at the [https://predictioncenter.org/casp14/results.cgi?view=tb-sel CASP 14 Table Browser], check T1064-D1 and press ''Show Results''.</ref>, the largest difference between 1st and 2nd predictions among the FM targets. It is further unusual because two independently-determined X-ray crystallographic structures were subsequently published. Inspiration for this case came from the discussion by Rubiera<ref name="rubiera">[https://www.blopig.com/blog/2020/12/casp14-what-google-deepminds-alphafold-2-really-achieved-and-what-it-means-for-protein-folding-biology-and-bioinformatics/ CASP14: what Google DeepMind’s AlphaFold 2 really achieved, and what it means for protein folding, biology and bioinformatics], a blog post by Carlos Outeiral Rubiera, December 3, 2020.</ref>.
First, SARS-CoV-2 ORF8<ref name="7jtl" />, a 92-residue FM domain where '''AlphaFold2's GDT_TS was 87, and the second best was 43''' (by the group of Xian Ming Pan)<ref name="t1064">For SARS-CoV-2 ORF8, at the [https://predictioncenter.org/casp14/results.cgi?view=tb-sel CASP 14 Table Browser], check T1064-D1 and press ''Show Results''.</ref>, the largest difference between 1st and 2nd predictions among the FM targets. It is further unusual because two independently-determined X-ray crystallographic structures were subsequently published. Inspiration for this case came from the discussion by Rubiera<ref name="rubiera">[https://www.blopig.com/blog/2020/12/casp14-what-google-deepminds-alphafold-2-really-achieved-and-what-it-means-for-protein-folding-biology-and-bioinformatics/ CASP14: what Google DeepMind’s AlphaFold 2 really achieved, and what it means for protein folding, biology and bioinformatics], a blog post by Carlos Outeiral Rubiera, December 3, 2020.</ref>.


Second, the '''longest domain in the FM category, 404 residues'''. This domain is part of the 2,180-residue RNA polymerase of a bacteriophage, some of whose group members are prevalent in the human gut<ref name="6vr4">PMID: 33208949</ref>. Eight of the CASP 14 FM target domains are parts of this protein, [[6vr4]]. For the 404-residue domain T1037, AlphaFold2 achieved GDT_TS of 88, and the second best prediction, 63 (by Seok-refine). Among the 14 FM targets, the second-longest has 276 residues, the median 132, and the shortest, 92.
Second, the '''longest domain in the FM category, 404 residues'''. This domain is part of the 2,180-residue RNA polymerase of a bacteriophage, some of whose group members are prevalent in the human gut<ref name="6vr4">PMID: 33208949</ref>. Eight of the CASP 14 FM target domains are parts of this protein, [[6vr4]]. For the 404-residue domain T1037, AlphaFold2 achieved GDT_TS of 88, and the second best prediction, 63 (by Seok-refine)<ref name="t1037">For the phage RNA polymerase target, at the [https://predictioncenter.org/casp14/results.cgi?view=tb-sel CASP 14 Table Browser], check T1037-D1 and press ''Show Results''.</ref>. Among the 14 FM targets, the second-longest has 276 residues, the median 132, and the shortest, 92.


==SARS-CoV-2 ORF8==
==SARS-CoV-2 ORF8==
Our first example is [[SARS-CoV-2 protein ORF8]], a protein that contributes to virulence in COVID-19<ref name="7jtl">PMID: 33361333</ref>. CASP 14 classified ORF8 as a "free modeling" (FM) target<ref name="casp14domains">[https://predictioncenter.org/casp14/domains_summary.cgi Summary and Classifications of Domains for CASP 14].</ref>, meaning that there were no adequate empirical templates for [[homology modeling]]. This was easily confirmed. When the [https://www.uniprot.org/uniprot/P0DTC8 amino acid sequence of ORF8] is submitted to [https://swissmodel.expasy.org/ Swiss Model], it reports the best templates for homology modeling. When the two [[empirical models]] that were not available during CASP 14 are excluded ([[7jtl]] and [[7jx6]]), the best template offered, chain B of [[3afc]], covers only 36% of the length of ORF8 at 13.2% sequence identity, with a 4-residue untemplated gap in the sequence alignment. This template would not be adequate for constructing a useful model.
Our first example is [[SARS-CoV-2 protein ORF8]], a protein that contributes to virulence in COVID-19<ref name="7jtl">PMID: 33361333</ref>. CASP 14 classified ORF8 as a "free modeling" (FM) target<ref name="casp14domains">[https://predictioncenter.org/casp14/domains_summary.cgi Summary, Definitions and Classifications of Domains for CASP 14].</ref>, meaning that there were no adequate empirical templates for [[homology modeling]]. This was easily confirmed. When the [https://www.uniprot.org/uniprot/P0DTC8 amino acid sequence of ORF8] is submitted to [https://swissmodel.expasy.org/ Swiss Model], it reports the best templates for homology modeling. When the two [[empirical models]] that were not available during CASP 14 are excluded ([[7jtl]] and [[7jx6]]), the best template offered, chain B of [[3afc]], covers only 36% of the length of ORF8 at 13.2% sequence identity, with a 4-residue untemplated gap in the sequence alignment. This template would not be adequate for constructing a useful model.


===X-Ray Structures for ORF8===
===X-Ray Structures for ORF8===
Line 34: Line 32:
! Model || GDT_TS || Disulfide<br>Bonds || C&alpha; [https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions RMSD], Å || C&alpha; Superposed || [https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions RMSD] Including<br>Sidechains, Å || Atoms Superposed
! Model || GDT_TS || Disulfide<br>Bonds || C&alpha; [https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions RMSD], Å || C&alpha; Superposed || [https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions RMSD] Including<br>Sidechains, Å || Atoms Superposed
|-
|-
| [[7jtl]]:A || 88<ref name="gdt_ts">See [[#GDT_TS Calculations]].</ref> || 3 ||  4.02<br>'''0.66''' || 102/102 (100%)<br>'''87/102 (85%)''' || 4.3<br>'''1.58''' || 829/829 (100%)<br>'''709/829 (86%)'''
| [[7jtl]]:A || 96<ref name="gdt_ts">See [[#GDT_TS Calculations]].</ref> || 3 ||  4.02<br>'''0.66''' || 102/102 (100%)<br>'''87/102 (85%)''' || 4.3<br>'''1.58''' || 829/829 (100%)<br>'''709/829 (86%)'''
|-
|-
| AlphaFold2 || 87 || 3 || 2.58<br>'''1.25''' || 92/92 (100%)<br>'''83/92* (90%)''' || 3.23<br>'''1.91''' || 747/748 (100%)<br>'''679/748 (91%)'''
| AlphaFold2 || 87 || 3 || 2.58<br>'''1.25''' || 92/92 (100%)<br>'''83/92* (90%)''' || 3.23<br>'''1.91''' || 747/748 (100%)<br>'''679/748 (91%)'''
|-
|-
| Dali top hit<ref name="nnf">See [[#ORF8 is not a novel fold]].</ref> [[5a2f]] || 53<ref name="gdt_ts" /> || na || 3.2<br>'''1.95''' || 92/92 (100%)<br>'''48/92 (52%)''' || na || na
| Dali top hit<ref name="nnf">See [[#ORF8 is not a novel fold]].</ref> [[5a2f]] || 60<ref name="gdt_ts" /> || na || 3.2<br>'''1.95''' || 92/92 (100%)<br>'''48/92 (52%)''' || na || na
|-
|-
| 2nd Best* || 43 || 0 || 5.33<br>'''<span class="text-gray">1.71</span>''' || 92/92 (100%)<br>'''<span class="text-gray">38/92 (41%)</span>''' || 6.54<br>'''<span class="text-gray">5.86</span>''' || 747/748 (100%)<br>'''<span class="text-gray">324/748 (43%)</span>'''
| 2nd Best* || 43 || 0 || 5.33<br>'''<span class="text-gray">1.71</span>''' || 92/92 (100%)<br>'''<span class="text-gray">38/92 (41%)</span>''' || 6.54<br>'''<span class="text-gray">5.86</span>''' || 747/748 (100%)<br>'''<span class="text-gray">324/748 (43%)</span>'''
Line 69: Line 67:
===Baker Rosetta Server Prediction for ORF8===
===Baker Rosetta Server Prediction for ORF8===
Among predictions for all ~100 CASP 14 targets, the group of David Baker [https://predictioncenter.org/casp14/zscores_final.cgi ranked second]. The Rosetta Server of the Baker group ranked 18th overall, but was the 4th ranked server<ref name="serverranks">For all targets in CASP 14, the top two servers were QUARK and Zhang-server (which were not significantly different at a Z-score sum of 62.9), followed by Zhang-CEthreader (55.9) and BAKER-ROSETTASERVER (55.3).</ref>. [https://predictioncenter.org/casp14/results.cgi?view=tables&target=T1064-D1&model=1&groups_id= For ORF8, the Rosetta Server prediction GDT_TS was 26], a bit better than the median of 23. The Rosetta Server's prediction for ORF8 has '''the two termini far apart''' (C&alpha; 13 Å or farther apart), a substantial difference from the X-ray structure (C&alpha; mostly ~5 Å apart). It predicts '''two disulfide bonds, but neither matches''' the pairs of Cys residues in the actual disulfide bonds.  The '''salt bridge''' Arg86:Asp98 is correctly predicted, along with one incorrectly predicted salt bridge. The structural superposition is very poor and is not shown.  
Among predictions for all ~100 CASP 14 targets, the group of David Baker [https://predictioncenter.org/casp14/zscores_final.cgi ranked second]. The Rosetta Server of the Baker group ranked 18th overall, but was the 4th ranked server<ref name="serverranks">For all targets in CASP 14, the top two servers were QUARK and Zhang-server (which were not significantly different at a Z-score sum of 62.9), followed by Zhang-CEthreader (55.9) and BAKER-ROSETTASERVER (55.3).</ref>. [https://predictioncenter.org/casp14/results.cgi?view=tables&target=T1064-D1&model=1&groups_id= For ORF8, the Rosetta Server prediction GDT_TS was 26], a bit better than the median of 23. The Rosetta Server's prediction for ORF8 has '''the two termini far apart''' (C&alpha; 13 Å or farther apart), a substantial difference from the X-ray structure (C&alpha; mostly ~5 Å apart). It predicts '''two disulfide bonds, but neither matches''' the pairs of Cys residues in the actual disulfide bonds.  The '''salt bridge''' Arg86:Asp98 is correctly predicted, along with one incorrectly predicted salt bridge. The structural superposition is very poor and is not shown.  
===ORF8 Sidechain Prediction Accuracy===


Jump below to [[#ORF8 Sidechain Accuracy]]
Jump below to [[#ORF8 Sidechain Accuracy]]
Line 82: Line 82:


<scene name='87/875686/T1037_length_404/1'>The X-ray structure of CASP 14 domain T1037</scene> (length 404 residues) consists of residues 337-369 + 531-901 of [[6vr4]] (taken from chain B). It is an <scene name='87/875686/T1037_length_404/2'>alpha/beta domain with secondary structure</scene> <span style="color:#ff0080;font-weight:bold;">45% helices</span>, <span style="color:#ffc800;background-color:black;font-weight:bold;">&nbsp;19% beta strands&nbsp;</span>, and 37% loops and turns. The N- and C-termini are 10 Å apart, and there are no cysteines (thus no disulfide bonds).
<scene name='87/875686/T1037_length_404/1'>The X-ray structure of CASP 14 domain T1037</scene> (length 404 residues) consists of residues 337-369 + 531-901 of [[6vr4]] (taken from chain B). It is an <scene name='87/875686/T1037_length_404/2'>alpha/beta domain with secondary structure</scene> <span style="color:#ff0080;font-weight:bold;">45% helices</span>, <span style="color:#ffc800;background-color:black;font-weight:bold;">&nbsp;19% beta strands&nbsp;</span>, and 37% loops and turns. The N- and C-termini are 10 Å apart, and there are no cysteines (thus no disulfide bonds).
===T1037 contains several known fold fragments===
The X-ray structure of T1037 (404 residues from 6vr4) was submitted to Dali<ref name="dali2020" /> in March, 2021. Among the ~1,000 hits with Z ≥ 2.0, there were 152 with lengths ≥ 400 residues, and 224 with lengths ≥ 300, long enough that a superposition with the majority of T1037 would not be precluded. Among all hits, the largest number of aligned residues was 140/404 (35%) with RMSD 11.7 Å. The second largest was 127/404 (31%), RMSD 7.7 Å. Thus, no single structure in the PDB superposed with more than 35% of T1037.
However, several of the Dali hits superposed with non-overlapping core fragments of [[6vr4]]<ref name="lholm">These non-overlapping core fragments were kindly pointed out by Liisa Holm, March, 2021.</ref>:
*[[2j7n]] chain A, RNA-dependent RNA polymerase
**length 934, aligned residues '''115, RMSD 4.3 Å''', Z=4.0, structural alignment 9 %id.
*[[4ncj]] chain A, DNA double-strand break repair RAD50 ATPase
**length 311, aligned residues '''109, RMSD 4.7 Å''', Z=3.4, structural alignment 11 %id.
*[[5vfk]] chain A, Uncharacterized protein
**length 146, aligned residues '''61, RMSD 7.8 Å''', Z=3.3, structural alignment 11 %id.
Liisa Holm<ref name="dali2020" /><ref name="holmquote">Quoted with permission from Liisa Holm, March, 2021.</ref> stated: "T1037 has a homologous template in the PDB. The parent structure of T1037, phage RNA polymerase (6vr4, 2166 amino acids), is homologous to the RNAi polymerase from Neurospora crassa (2j7n chain A, 934 amino acids)<ref name="6vr4" />. Dali aligns them over 564 residues with an RMSD of 4.8 A. 115 residues of the common core are in the T1037 substructure. Several long insertions in T1037/6vr4 relative to 2j7n (chain A) form subdomains, which point outwards from the common core. Similar massive adaptation of the common core is seen, for example, in the glucosyltransferase 1 family<ref>PMID: 7729407</ref>."
The [https://fatcat.godziklab.org/ FATCAT Server] reported that in order to superpose 150 residues (37% of 404) of T1037 with the closest structure in the PDB, 3 twists at hinges were required, after which an RMSD of 3.1 Å was achieved. For a 200-residue superposition (50% of 404), the best results after 3 twists had an RMSD of 5.4 Å.


===AlphaFold2 prediction for T1037===
===AlphaFold2 prediction for T1037===
<scene name='87/875686/Morph_lin_6vr4_to_af2/1'>AlphaFold2 predicted the structure of T1037 with high accuracy</scene><ref name="imf" /> (GDT_TS 88; see Table II below for details).
<scene name='87/875686/Morph_lin_6vr4_to_af2/1'>AlphaFold2 predicted the structure of T1037 with very high accuracy</scene><ref name="imf" />. 91% of the 404 alpha carbons can be aligned with RMSD 1.0 Å. (GDT_TS 88; see Table II below for details).
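Table II below reports each RMSD twice: over all atoms and, in bold, over a more closely superposing subset. The distinction can be sketched as follows, assuming the per-atom deviations (in Å) between the two superposed structures are already in hand. The function names and the 2.0 Å cutoff are hypothetical choices for illustration only. The procedure actually used to select the subsets in the table is not reproduced here, and a real tool would typically re-superpose the structures after discarding outliers.

```python
from math import sqrt


def rmsd(deviations):
    """Root-mean-square of per-atom deviations (in Å)."""
    if not deviations:
        raise ValueError("at least one deviation is required")
    return sqrt(sum(d * d for d in deviations) / len(deviations))


def rmsd_all_and_core(deviations, cutoff=2.0):
    """Return (RMSD over all atoms, RMSD over the core, core size).

    The 'core' here is simply every atom whose deviation is at or below
    the cutoff. This is an assumed, simplified selection rule, with no
    re-superposition after outliers are removed.
    """
    core = [d for d in deviations if d <= cutoff]
    core_rmsd = rmsd(core) if core else None
    return rmsd(deviations), core_rmsd, len(core)
```

A few badly placed atoms dominate the all-atom value because deviations are squared, which is why the subset RMSD can be much lower than the overall RMSD while still covering most of the structure.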


{| style="text-align:center;" class="wikitable"
{| style="text-align:center;" class="wikitable"
Line 91: Line 106:
! Model || GDT_TS || C&alpha; [https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions RMSD], Å || C&alpha; Superposed || [https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions RMSD] Including<br>Sidechains, Å || Atoms Superposed
! Model || GDT_TS || C&alpha; [https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions RMSD], Å || C&alpha; Superposed || [https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions RMSD] Including<br>Sidechains, Å || Atoms Superposed
|-
|-
| T1037 of<br>[[6vr4]]:A || XX<ref name="gdt_ts" /> || 0.25<br>'''0.25''' || 404/404 (100%)<br>'''404/404 (100%)''' || 0.58<br>'''0.24''' || 3157/3157 (100%)<br>'''1616/3157 (51%)'''
| T1037 of<br>[[6vr4]]:A || 99.9<ref name="gdt_ts" /> || 0.25<br>'''0.25''' || 404/404 (100%)<br>'''404/404 (100%)''' || 0.58<br>'''0.24''' || 3157/3157 (100%)<br>'''1616/3157 (51%)'''
|-
|-
| AlphaFold2 || 88 || 1.68<br>'''0.98''' || 404/404 (100%)<br>'''368/404 (91%)''' || 2.28<br>'''<span class="text-gray">1.01</span>''' || 3157/3157 (100%)<br>'''<span class="text-gray">1472/3157 (47%)</span>'''
| AlphaFold2 || 88 || 1.68<br>'''0.98''' || 404/404 (100%)<br>'''368/404 (91%)''' || 2.28<br>'''<span class="text-gray">1.01</span>''' || 3157/3157 (100%)<br>'''<span class="text-gray">1472/3157 (47%)</span>'''
Line 104: Line 119:
:<span style="color:#b0b0b0;">Superpositions involving ≤ 25% of each structure.</span><br>
:<span style="color:#b0b0b0;">Superpositions involving ≤ 25% of each structure.</span><br>
:&#42;Second best by Seok-refine: Group of Chaok Seok, Seoul National University.<br>
:&#42;Second best by Seok-refine: Group of Chaok Seok, Seoul National University.<br>
:§Prediction by Seder2020 (one of the predictions with GDT_TS 53, arbitrarily 10 less than the 2nd best with GDT_TS 63): Group of Andrzej Kloczkowski, Colombus, Ohio.<br>
:§Prediction by Seder2020 (one of the predictions with GDT_TS 53, arbitrarily 10 less than the 2nd best with GDT_TS 63): Group of Andrzej Kloczkowski, Columbus, Ohio. '''Superposition not shown'''.<br>
:†Close superposition of the three longest alpha helices.
:†Close superposition of the three longest alpha helices.


===Second Best Prediction for T1037===
===Second Best Prediction for T1037===
Despite its impressive GDT_TS of 63, <scene name='87/875686/Morf_t1037_6vr4_to_2nd_cao/1'>the second best prediction for T1037 was far less accurate</scene><ref name="imf" /> than the prediction of AlphaFold2. (The second best prediction was by Seok-refine, from the group of Chaok Seok, Seoul National University.)
Despite its impressive GDT_TS of 63, <scene name='87/875686/Morf_t1037_6vr4_to_2nd_cao/2'>the second best prediction for T1037 was far less accurate</scene><ref name="imf" /> than the prediction of AlphaFold2. (The second best prediction was by Seok-refine, from the group of Chaok Seok, Seoul National University.)
 
==Calculating GDT_TS==
Please see [[#GDT_TS Calculations]].


<!--########################################################-->
<!--########################################################-->
Line 165: Line 183:


====Visualization of Surface Charge Distributions====
====Visualization of Surface Charge Distributions====
The distributions of surface charges are in good agreement between AlphaFold2's prediction and the two crystal structures, which agree with each other. The distribution in the 2nd best prediction has several discrepancies with the other three models.
[[Image:Orf8-casp14-charges.png]]
[[Image:Orf8-casp14-charges.png]]


==GDT_TS Calculations==
==GDT_TS Calculations==
GDT_TS values for predictions are taken from CASP 14 results. GDT_TS values for 7JTL and 5A2F vs. 7JX6 chain A were calculated using the [http://linum.proteinmodel.org/ AS2TS server] of Adam Zemla<ref name="zemla">PMID: 12824330</ref>. See instructions for [[Calculating GDT_TS]]. CASP 14 reported GDT_TS 86.96 for the AlphaFold2 prediction, while the AS2TS server calculated GDT_TS 86.41 vs. 7jx6 chain A, and 88.59 vs. 7JTL chain A.
GDT_TS values for predictions are taken from CASP 14 results. The reference structure for the CASP 14 GDT_TS values was 92 alpha carbons of 7JTL<ref name="casp14domains" />, since the CASP 14 target had only 92 residues<ref name="casp14domains" />.
 
GDT_TS values for 7JTL and 5A2F vs. 7JX6 chain A were calculated using the [http://linum.proteinmodel.org/ AS2TS server] of Adam Zemla<ref name="zemla">PMID: 12824330</ref>. See instructions for [[Calculating GDT_TS]]. GDT_TS values were corrected for 92 residues (not 104) because the CASP 14 target had only 92 residues<ref name="casp14domains" />.
 
For comparison, CASP 14 reported GDT_TS 86.96 for the AlphaFold2 prediction, while the AS2TS server calculated GDT_TS 85.87 vs. 7JX6 chain A, and 88.32 vs. 7JTL chain A. (These results were corrected for 90/92 and 91/92 residues, respectively.) Thus, there appears to be a minor, unidentified discrepancy between the GDT_TS calculations of CASP 14 and the method detailed at [[Calculating GDT_TS]].
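
The role of the residue count in these corrections can be sketched in Python. GDT_TS is the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of target alpha carbons that lie within the cutoff of their positions in the reference structure. The sketch below is illustrative only: the function name <code>gdt_ts</code> and its arguments are hypothetical, and it scores a single, already-computed superposition, whereas the full procedure searches for the best superposition at each cutoff. It is therefore not a substitute for the AS2TS server or for the CASP 14 values.

```python
# Distance cutoffs (in Å) used by GDT_TS.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(distances, n_target):
    """Approximate GDT_TS for one fixed superposition.

    distances: alpha-carbon distances (in Å) between model and reference,
               one per aligned residue pair (hypothetical input).
    n_target:  number of residues in the target, used as the denominator.
               Residues with no aligned partner count as misses.
    """
    if n_target <= 0:
        raise ValueError("n_target must be positive")
    if len(distances) > n_target:
        raise ValueError("more aligned pairs than target residues")
    # Count residues within each cutoff, then average the four percentages.
    within = sum(
        sum(1 for d in distances if d <= cutoff)
        for cutoff in GDT_TS_CUTOFFS
    )
    return 100.0 * within / (len(GDT_TS_CUTOFFS) * n_target)
```

With the same list of distances, <code>gdt_ts(distances, 92)</code> exceeds <code>gdt_ts(distances, 104)</code> by the factor 104/92, which is why the choice of 92 rather than 104 residues as the denominator changes the reported score.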
 
==See Also==
*[[AlphaFold/Index]], a list of pages in Proteopedia about Alphafold.


==References & Notes==
==References & Notes==
<references />
<references />

Proteopedia Page Contributors and Editors

Eric Martz