Theoretical models: Difference between revisions

Proteopedia Page Contributors and Editors

Eric Martz, Jaime Prilusky, Wayne Decatur

@@ Line 1: / Line 1: @@
 <table style="background-color:#ffffb0;border:1px solid black;"><tr><td>
-'''BREAKTHROUGH!''' In 2020, a machine learning artificial intelligence system called ''AlphaFold2'' became able to predict the structures of a large subset of single protein chains successfully from their amino acid sequences. See [[#2020: CASP 14|CASP 14]].
+'''BREAKTHROUGH!''' In 2020, a machine learning artificial intelligence system called ''AlphaFold2'' became able to predict the structures of a large subset of single protein chains successfully from their amino acid sequences. See [[#2020: CASP 14|CASP 14]]. For the AlphaFold database of predictions, and AlphaFold-based servers that will predict structure from sequence, see [[AlphaFold]], and for practical guidance, [[How to predict structures with AlphaFold]].
 </td></tr></table>
 <center></center>
@@ Line 8: / Line 8: @@
 The distinction between theoretical and empirical models is important because when theoretical models are compared with empirical models, the theoretical models often contain significant errors. In contrast, when the structure of a particular macromolecule is determined using [[empirical models|empirical methods]] by different laboratories, or both by crystallography and NMR, the agreement is usually quite good.
-,390 theoretical models were historically deposited in the [[Protein Data Bank]] but [http://www.pdb.org/pdb/static.do?p=download/theoretical_models/index.html removed from the main database in 2002]. The structure displayed in the pages automatically generated in Proteopedia for these theoretical models should be interpreted with caution (see [[:Category:Theoretical Model]]).
+,390 theoretical models were historically deposited in the [[Protein Data Bank]] but [http://www.pdb.org/pdb/static.do?p=download/theoretical_models/index.html removed from the main database in 2002]. The structure displayed in the pages automatically generated in Proteopedia for these theoretical models should be interpreted with caution (see [[:Category:Theoretical Model]]). One database that does accept theoretical models is the [https://modelarchive.org/ ModelArchive], supported by the Swiss Institute of Bioinformatics.
 ==Empirical Models==
@@ Line 52: / Line 52: @@
 There are also competitions to predict protein-protein docking interactions<ref>[http://www.ebi.ac.uk/msd-srv/capri/ CAPRI: Critical Assessment of PRediction of Interactions].</ref>
+===2022: CASP 15===
+Overall, AlphaFold2 continued to "convincingly outperform all other methods" when various methods were compared using "fully automated mode with default parameter settings, without any manual interventions"<ref name="bhattacharya">PMID: 37523536</ref>. AlphaFold2 predictions had a mean [[Calculating GDT TS|GDT-TS]] score of 73 (100 meaning perfect, and 0, meaningless). ESMFold, which is not based upon multiple sequence alignments, ranked second for backbone positioning (mean GDT-TS 61.6), outperforming RoseTTAFold (which is MSA-based) in >80% of cases<ref name="bhattacharya" />. Individual domains were reliably predicted in the 19 multidomain targets, but predictions of domain orientations were less successful<ref name="bhattacharya" />. As an example, AlphaFold2 achieved the best prediction for one large multi-domain target, T1154, but the GDT-TS was only 24<ref name="bhattacharya" />. There is considerable room for improvement in prediction of side-chain positioning: while AlphaFold2 was most successful, its mean GDC-SC score fell short of 50<ref name="bhattacharya" />. Targets in CASP 15 (2022) included several new categories: 12 with RNA<ref name="rna">PMID: 37162955</ref><ref name="rna2">PMID: 37466021</ref>, some ligand-protein complexes, and 41 quaternary assembly protein complexes<ref name="casp15new">PMID: 37306011</ref>. "... for the vast majority of proteins and protein complexes, AlphaFold can produce a model close to experimental quality."<ref name="elofsson">PMID: 37060758</ref>. The success rate for overall fold and interface prediction in complexes was 90%, vs. 31% in CASP 14<ref name="assemblies">PMID: 37503072</ref>. This was "largely due to the incorporation of DeepMind's AF2-Multimer approach into custom-built prediction pipelines"<ref name="assemblies" />.
 ===2020: CASP 14===
@@ Line 79: / Line 83: @@
 <center></center>
-'''AlphaFold2 Methods''': AlphaFold2 uses deep machine learning from the [[Protein Data Bank]] and sequence databases, and relies heavily on distances between beta-carbons<ref name="senior202001" />. AlphaFold2 is trained from data in the PDB to predict "the distances between pairs of residues, which convey more information about the structure than contact predictions."<ref name="senior202001" /> By one estimate<ref name="rubiera" />, "the DeepMind team had roughly two orders of magnitude more computational resources" than did academic groups competing in CASP 14. Further information about methods was provided by AlQuraishi<ref name="alquraishi" />.
+====AlphaFold2 Methods====
+AlphaFold2 uses deep machine learning from the [[Protein Data Bank]] and sequence databases, and relies heavily on distances between beta-carbons<ref name="senior202001" />, using co-evolution rates determined from multiple sequence alignments. AlphaFold2 is trained from data in the PDB to predict "the distances between pairs of residues, which convey more information about the structure than contact predictions."<ref name="senior202001" /> By one estimate<ref name="rubiera" />, "the DeepMind team had roughly two orders of magnitude more computational resources" than did academic groups competing in CASP 14. Further information about methods was provided by AlQuraishi<ref name="alquraishi" />.
 ====CASP 14 Global Distance Test Results====
-Performance was judged overall, in large part, by the ''global distance test total score'' or GDT_TS<ref name="gdtcasp">[https://predictioncenter.org/casp13/doc/LCS_GDT.README GDT description] at the CASP website.</ref><ref name="gdtwikipedia">[https://en.wikipedia.org/wiki/Global_distance_test Global distance test] at Wikipedia.</ref>.  GDT_TS values range from 0 (a meaningless prediction) to 100 (a perfect prediction). "Random predictions give around 20; getting the gross topology right gets one to ~50; accurate topology is usually around 70; and when all the little bits and pieces, including side-chain conformations, are correct, GDT_TS begins to climb above 90."<ref name="alquraishi">[https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/ AlphaFold2 @ CASP14: “It feels like one’s child has left home.”] by Mohammed AlQuraishi, December 8, 2020.</ref>.
+Performance was judged overall, in large part, by the ''global distance test total score'' or [[Calculating GDT TS|GDT_TS]]<ref name="gdtcasp">[https://predictioncenter.org/casp13/doc/LCS_GDT.README GDT description] at the CASP website.</ref><ref name="gdtwikipedia">[https://en.wikipedia.org/wiki/Global_distance_test Global distance test] at Wikipedia.</ref>.  GDT_TS values range from 0 (a meaningless prediction) to 100 (a perfect prediction; see [[Calculating GDT TS]]). "Random predictions give around 20; getting the gross topology right gets one to ~50; accurate topology is usually around 70; and when all the little bits and pieces, including side-chain conformations, are correct, GDT_TS begins to climb above 90."<ref name="alquraishi">[https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/ AlphaFold2 @ CASP14: “It feels like one’s child has left home.”] by Mohammed AlQuraishi, December 8, 2020.</ref>.
 A '''GDT_TS value of ~90''' means that the prediction is as close to an [[empirical model]] as would be an independently obtained second empirical model. GDT_TS gives an overall average measure of how close each amino acid in the predicted model is to those in the empirical model, taking into account many different superpositions of the two models. It is less sensitive to outlier regions than is the root mean square deviation (RMSD)<ref name="rmsdwikipedia">[https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions Root mean square deviation] at Wikipedia.</ref>. "RMSD uses the actual distances between alpha carbons, where GDT works with the percentage of alpha carbons that are found within certain cutoff distances of each other."<ref name="gdtfoldit">[https://foldit.fandom.com/wiki/GDT GDT in the Foldit Wiki].</ref>
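The contrast between GDT_TS and RMSD described above can be illustrated with a minimal Python sketch. It is a simplification, not the CASP implementation: it assumes the two models are already superposed and scores a single superposition, whereas the real GDT procedure searches over many superpositions. It also assumes the commonly cited cutoffs of 1, 2, 4 and 8 Å. The function names <code>gdt_ts</code> and <code>rmsd</code> and the toy coordinates are hypothetical, chosen only for illustration.

```python
import math

def gdt_ts(pred, ref, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for two pre-superposed lists of alpha-carbon coordinates.

    Assumption: only one superposition is scored; the real GDT procedure
    searches many superpositions and keeps the best count per cutoff.
    """
    if len(pred) != len(ref) or not ref:
        raise ValueError("models must be non-empty and the same length")
    dists = [math.dist(p, r) for p, r in zip(pred, ref)]
    # Count residues within each cutoff, then average the four percentages.
    within = sum(1 for c in cutoffs for d in dists if d <= c)
    return 100.0 * within / (len(cutoffs) * len(dists))

def rmsd(pred, ref):
    """Root mean square deviation over the same pre-superposed atom pairs."""
    squared = [math.dist(p, r) ** 2 for p, r in zip(pred, ref)]
    return math.sqrt(sum(squared) / len(squared))

# Toy example (made-up coordinates): ten alpha carbons spaced 3.8 A apart,
# with the prediction identical except for one residue displaced by 20 A.
ref = [(3.8 * i, 0.0, 0.0) for i in range(10)]
pred = list(ref)
pred[-1] = (ref[-1][0], 20.0, 0.0)

print(f"GDT_TS = {gdt_ts(pred, ref):.1f}")
print(f"RMSD   = {rmsd(pred, ref):.2f}")
```

In this toy case a single badly placed residue leaves the simplified GDT_TS at 90 while pushing the RMSD above 6 Å, which is the sense in which GDT_TS is less sensitive to outlier regions.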
@@ Line 122: / Line 127: @@
 :5. AlphaFold2 does not predict the '''trajectory of how a protein folds''', only the final structure<ref name="callaway" />.
-====AlphaFold Published July, 2021====
+====AlphaFold Servers and Database====
-AlphaFold was published in July, 2021<ref name="af2021">PMID: 34265844</ref>. Methods were described in considerable detail. The source code, trained weights, and inference script were made available under an open-source license. Structure prediction required about one GPU (Graphics Processing Unit) minute per model of about 384 amino acids.
-Impressively, AlphaFold had remarkable success predicting a set of 10,795 protein chain structures (filtered for high reliability, lengths restricted to 80-1,400 residues) published in the [[PDB]] after AlphaFold's initial training set. (The training set cutoff was 2018/04/30. The test set was obtained between then and 2021/02/15.)
+In July 2021, the methods and open-source code were published, and free servers offering to predict structures, along with a huge database of predictions, became available. Please see [[AlphaFold]].
 ===2018: CASP 13===
@@ Line 143: / Line 147: @@
 ==See Also==
+*[[AlphaFold/Index]], a list of pages in Proteopedia about AlphaFold.
+*[[Calculating GDT TS]]
 * Theoretical models displayed in Proteopedia must be clearly identified: see [[Proteopedia:Policy#Theoretical Models]] using methods explained at [[Proteopedia:Cookbook#Theoretical Models]].
 * [[:Category:Theoretical Model]]