Theoretical models: Difference between revisions
<table style="background-color:#ffffb0;border:1px solid black;"><tr><td>
'''BREAKTHROUGH!''' In 2020, a machine learning artificial intelligence system called ''AlphaFold2'' became able to predict the structures of a large subset of single protein chains successfully from their amino acid sequences. See [[#2020: CASP 14|CASP 14]]. For the AlphaFold database of predictions, and AlphaFold-based servers that will predict structure from sequence, see [[AlphaFold]], and for practical guidance, [[How to predict structures with AlphaFold]].
</td></tr></table>
<center></center>
The term ''theoretical model'' refers to a molecular model obtained using theory or artificial intelligence, such as [[homology modeling]], energy minimization, molecular mechanics, molecular dynamics, or a machine learning system. Such theoretical models are distinguished from [[empirical models]], which are usually obtained by [[X-ray crystallography]], [[NMR Ensembles of Models|nuclear magnetic resonance]] (NMR), or [[cryo-electron microscopy]].
The distinction between theoretical and empirical models is important because when theoretical models are compared with empirical models, the theoretical models often contain significant errors. In contrast, when the structure of a particular macromolecule is determined using [[empirical models|empirical methods]] by different laboratories, or both by crystallography and NMR, the agreement is usually quite good.
1,390 theoretical models were historically deposited in the [[Protein Data Bank]] but [http://www.pdb.org/pdb/static.do?p=download/theoretical_models/index.html removed from the main database in 2002]. The structure displayed in the pages automatically generated in Proteopedia for these theoretical models should be interpreted with caution (see [[:Category:Theoretical Model]]). One database that does accept theoretical models is the [https://modelarchive.org/ ModelArchive], supported by the Swiss Institute of Bioinformatics.
==Empirical Models==
[[Empirical models]] are not ''theoretical models'', but are mentioned here for the sake of completeness. Empirical models, usually determined by [[X-ray crystallography]], [[NMR|nuclear magnetic resonance]] or [[cryo-electron microscopy]], are the most reliable and accurate models available. Methods for judging the reliability and quality of empirical models are discussed at [[Quality assessment for molecular models]]. Independent determinations of the same protein by empirical methods generally agree within <1.0 Å root mean square deviation (RMSD) for alpha carbon atoms (reference needed).
==Homology Models==
==Ab Initio Models==
When there is no template with sufficient sequence identity to use for homology modeling, one can use ''ab initio'' or ''de novo'' folding theory, or ''machine learning artificial intelligence'', to predict the structure of a target protein sequence.
===CASP===
The success of structure prediction methods is assessed every two years in the ''Critical Assessment of techniques for protein Structure Prediction'' ([[CASP]]) competitions<ref>[http://predictioncenter.gc.ucdavis.edu/ Critical Assessment of techniques for protein Structure Prediction (CASP)].</ref>. Crystallographers submit the sequences of proteins whose structures they have solved but not yet published. Modelers predict the structures, which are then compared with the subsequently published structures. Beginning in CASP5 (2002), the ability to predict [[Intrinsically Disordered Protein|intrinsic disorder]] was included<ref>PMID: 19774619</ref>. Assessment of CASP results is done in a '''double-blind''' manner: the predictors do not have access to the empirical structures, and the assessors do not know the identities of the predictors, which are coded.
There are also competitions to predict protein-protein docking interactions<ref>[http://www.ebi.ac.uk/msd-srv/capri/ CAPRI: Critical Assessment of PRediction of Interactions].</ref>.
===2022: CASP 15===
Overall, AlphaFold2 continued to "convincingly outperform all other methods" when various methods were compared using "fully automated mode with default parameter settings, without any manual interventions"<ref name="bhattacharya">PMID: 37523536</ref>. AlphaFold2 predictions had a mean [[Calculating GDT TS|GDT-TS]] score of 73 (100 meaning perfect, and 0, meaningless). ESMFold, which is not based upon multiple sequence alignments (MSAs), ranked second for backbone positioning (mean GDT-TS 61.6), outperforming RoseTTAFold (which is MSA-based) for >80% of cases<ref name="bhattacharya" />. Individual domains were reliably predicted in the 19 multidomain targets, but predictions of domain orientations were less successful<ref name="bhattacharya" />. As an example, AlphaFold2 achieved the best prediction for one large multi-domain target, T1154, but the GDT-TS was only 24<ref name="bhattacharya" />. There is considerable room for improvement in prediction of side-chain positioning: while AlphaFold2 was most successful, its mean GDC-SC score fell short of 50<ref name="bhattacharya" />. Targets in CASP 15 (2022) included several new categories: 12 with RNA<ref name="rna">PMID: 37162955</ref><ref name="rna2">PMID: 37466021</ref>, some ligand-protein complexes, and 41 quaternary assembly protein complexes<ref name="casp15new">PMID: 37306011</ref>. "... for the vast majority of proteins and protein complexes, AlphaFold can produce a model close to experimental quality."<ref name="elofsson">PMID: 37060758</ref> The success rate for overall fold and interface prediction in complexes was 90%, vs. 31% in CASP 14<ref name="assemblies">PMID: 37503072</ref>. This was "largely due to the incorporation of DeepMind's AF2-Multimer approach into custom-built prediction pipelines"<ref name="assemblies" />.
===2020: CASP 14===
The best predictions at CASP 13 (2018) correctly predicted "folds" and the topology of secondary structure elements (helices and beta strands), but fell short of correctly predicting entire structures in detail.
In CASP 14 (2020), the '''AlphaFold2'''<ref name="senior202001">PMID: 31942072</ref><ref name="alphafoldwikipedia">[https://en.wikipedia.org/wiki/AlphaFold AlphaFold] at Wikipedia.</ref> system of [http://deepmind.com DeepMind]<ref name="deepmindblog">[https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology AlphaFold: a solution to a 50-year-old grand challenge in biology], DeepMind Blog, November 30, 2020.</ref><ref name="deepmindwikipedia">[https://en.wikipedia.org/wiki/DeepMind DeepMind] at Wikipedia.</ref> demonstrated a '''major breakthrough'''<ref name="alquraishi" /><ref name="casppressrelease">[https://predictioncenter.org/casp14/doc/CASP14_press_release.html Artificial intelligence solution to a 50-year-old science challenge could ‘revolutionise’ medical research], CASP Press Release, November 30, 2020.</ref><ref name="callaway" /><ref name="helliwell">[https://www.iucr.org/news/newsletter/volume-28/number-4/deepmind-and-casp14 DeepMind and CASP14] by John R. Helliwell, International Union of Crystallography Newsletter, December 4, 2020.</ref>. AlphaFold2 was far better able, among over 100 competing groups, to predict structures, including sidechain positions, so close to the subsequently revealed X-ray crystallographic structures as to differ by little more than the differences between two independently-determined X-ray structures of the same molecule. It did this for about two-thirds of the targets in the competition. AlphaFold2 has been hailed as '''largely solving the protein structure prediction problem for single-chain proteins'''<ref name="alquraishi" /><ref name="casppressrelease" /><ref name="callaway">PMID: 33257889</ref><ref name="helliwell" />. "Never in my life had I expected to see a scientific advance so rapid," said Mohammed AlQuraishi of Columbia University<ref name="alquraishi" />.
<table style="background-color:#ffffb0;"><tr><td width="160">
<span style="font-size:120%;">
See [[AlphaFold2 examples from CASP 14]] for some detailed comparisons.
</span>
</td><td>
<imagemap>
Image:Alphafold2-orf8-vs-7jx6-114x124.gif
default [[AlphaFold2 examples from CASP 14]]
</imagemap>
</td><td width="20">
</td><td>
<imagemap>
Image:AlphaFold-snapshot-DeepMind-h124px.png
default [https://www.youtube.com/watch?v=gg7WjuFs8F4]
</imagemap>
</td><td width="160">
Visit the DeepMind AlphaFold2 team and hear commentary by luminaries such as John Moult at '''[https://www.youtube.com/watch?v=gg7WjuFs8F4 YouTube]'''.
</td></tr></table>
<center></center>
====AlphaFold2 Methods====
AlphaFold2 uses deep machine learning from the [[Protein Data Bank]] and sequence databases, and relies heavily on distances between beta-carbons<ref name="senior202001" />, using co-evolution rates determined from multiple sequence alignments. AlphaFold2 is trained from data in the PDB to predict "the distances between pairs of residues, which convey more information about the structure than contact predictions."<ref name="senior202001" /> By one estimate<ref name="rubiera" />, "the DeepMind team had roughly two orders of magnitude more computational resources" than did academic groups competing in CASP 14. Further information about methods was provided by AlQuraishi<ref name="alquraishi" />.
====CASP 14 Global Distance Test Results====
Performance was judged overall, in large part, by the ''global distance test total score'' or [[Calculating GDT TS|GDT_TS]]<ref name="gdtcasp">[https://predictioncenter.org/casp13/doc/LCS_GDT.README GDT description] at the CASP website.</ref><ref name="gdtwikipedia">[https://en.wikipedia.org/wiki/Global_distance_test Global distance test] at Wikipedia.</ref>. GDT_TS values range from 0 (a meaningless prediction) to 100 (a perfect prediction; see [[Calculating GDT TS]]). "Random predictions give around 20; getting the gross topology right gets one to ~50; accurate topology is usually around 70; and when all the little bits and pieces, including side-chain conformations, are correct, GDT_TS begins to climb above 90."<ref name="alquraishi">[https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/ AlphaFold2 @ CASP14: “It feels like one’s child has left home.”] by Mohammed AlQuraishi, December 8, 2020.</ref>
A '''GDT_TS value of ~90''' means that the prediction is as close to an [[empirical model]] as would be an independently obtained second empirical model. GDT_TS gives an overall average measure of how close each amino acid in the predicted model is to those in the empirical model, taking into account many different superpositions of the two models. It is less sensitive to outlier regions than is the root mean square deviation (RMSD)<ref name="rmsdwikipedia">[https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions Root mean square deviation] at Wikipedia.</ref>. "RMSD uses the actual distances between alpha carbons, where GDT works with the percentage of alpha carbons that are found within certain cutoff distances of each other."<ref name="gdtfoldit">[https://foldit.fandom.com/wiki/GDT GDT in the Foldit Wiki].</ref>
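The contrast between GDT_TS and RMSD described above can be illustrated with a minimal Python sketch. It is a simplification, not the CASP implementation: it assumes the two models have already been superposed and that alpha-carbon coordinates are given as plain (x, y, z) tuples, whereas the real GDT procedure searches over many superpositions. The function names <code>gdt_ts</code> and <code>rmsd</code> are illustrative only; the cutoffs of 1, 2, 4 and 8 Å are the ones conventionally averaged for GDT_TS.

```python
import math

# Distance cutoffs (in angstroms) conventionally averaged for GDT_TS.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(predicted, reference, cutoffs=GDT_TS_CUTOFFS):
    """Simplified GDT_TS for two ALREADY SUPERPOSED alpha-carbon traces.

    Assumption: a single fixed superposition is used; the real GDT
    procedure searches many superpositions for each cutoff.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    # Percentage of alpha carbons within each cutoff distance.
    percentages = [
        100.0 * sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return sum(percentages) / len(percentages)


def rmsd(predicted, reference):
    """Root mean square deviation over the same paired alpha carbons."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = [math.dist(p, r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical 10-residue trace: nine residues match exactly,
    # one "outlier" residue is displaced by 20 angstroms.
    reference = [(float(i), 0.0, 0.0) for i in range(10)]
    predicted = list(reference)
    predicted[-1] = (9.0, 20.0, 0.0)
    print("GDT_TS:", gdt_ts(predicted, reference))
    print("RMSD:  ", rmsd(predicted, reference))
```

In this made-up example a single badly placed residue costs GDT_TS only that residue's share of the percentage, while the squared 20 Å error dominates the RMSD.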
Based on GDT_TS, the most successful predictions were by '''AlphaFold2, which achieved a median GDT_TS of 92.4'''<ref name="alquraishi" />. The second most successful predictions were by BAKER (David Baker group). A group with median success was CAO-QA1 (Renzhi Cao, Kyle Hippe, & Mikhail Korovnik).
{| style="text-align:center;" class="wikitable"
|-
! Group Name !! Rank !! % of predictions with GDT_TS ≥ 90 !! <span class="text-red">% of predictions with GDT_TS ≥ 87</span> !! GDT_TS High !! GDT_TS Median !! GDT_TS Low
|-
| AlphaFold2 || 1 || 55% || <span class="text-red">'''68%'''</span> || 99 || 92 || 45
|-
| BAKER || 2 || 5% || <span class="text-red">'''8%'''</span> || 96 || 70 || 25
|-
| CAO-QA1 || 73 || 1% || <span class="text-red">'''1%'''</span> || 91 || 36 || 4
|}
Each of the three groups in the above table submitted 92 predictions. Data are for FM (Free Modeling) and TBM (Template Based Modeling) targets<ref name="fm+tbm">Data from [https://predictioncenter.org/casp14/results.cgi?view=tb-sel CASP 14 "Table Browser"]. '''Caution:''' A maximum of 1,200 results are shown. To see all results for a given group, you must select that group alone. If you select all groups, only the subset of predictions with the highest GDT_TS scores is shown for the subset of groups listed.</ref>.
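The per-group columns in the table above (percentages at or above a threshold, plus high, median and low) are simple summaries of one group's list of GDT_TS scores. A minimal Python sketch follows, assuming the scores for a group have already been collected into a list. The function name <code>summarize_gdt_ts</code> and the sample scores are hypothetical and are not CASP data.

```python
import statistics


def summarize_gdt_ts(scores, thresholds=(90, 87)):
    """Summarize one group's GDT_TS scores in the style of the table above.

    `scores` is a non-empty list of GDT_TS values (0-100), one per prediction.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    summary = {
        # Percentage of predictions at or above each threshold.
        f"pct_ge_{t}": 100.0 * sum(s >= t for s in scores) / len(scores)
        for t in thresholds
    }
    summary["high"] = max(scores)
    summary["median"] = statistics.median(scores)
    summary["low"] = min(scores)
    return summary


if __name__ == "__main__":
    # Hypothetical scores for illustration only (not taken from CASP 14).
    print(summarize_gdt_ts([95, 88, 60, 91]))
```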
====CASP 14 Rankings====
AlphaFold2 ranked first, by a wide margin, for all categories of targets. Groups making predictions were ranked by the sums of the Z-scores for their predictions<ref name="zscores">[https://predictioncenter.org/casp14/zscores_final.cgi TS Analysis: Group performance based on combined z-scores] for CASP 14 at PredictionCenter.Org.</ref>. A Z-score is the GDT_TS score for one prediction minus the mean of all GDT_TS scores for the target in question, divided by the standard deviation of all GDT_TS scores for that target. For 92 single-domain targets, AlphaFold2's Z-score sum was 2.7-fold higher than the second best, which was the group of David Baker. It was 14-fold higher than the median. For 10 multi-domain targets, AlphaFold2's Z-score sum was 3.6-fold higher than the second best (again, the David Baker group), and 23-fold higher than the median.
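The Z-score definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CASP assessors' code: it uses the population standard deviation, gives every group a Z-score of 0 when all scores for a target are identical, and does not model any further conventions the assessors may apply. The function names and group labels are hypothetical.

```python
import statistics


def z_scores(scores_by_group):
    """Z-score of each group's GDT_TS for ONE target.

    `scores_by_group` maps a group name to that group's GDT_TS for the target.
    Assumption: population standard deviation over all submitted scores.
    """
    values = list(scores_by_group.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # All groups scored the same; no group stands out on this target.
        return {group: 0.0 for group in scores_by_group}
    return {group: (score - mean) / sd for group, score in scores_by_group.items()}


def z_score_sums(targets):
    """Sum each group's Z-scores over a list of per-target score dicts."""
    totals = {}
    for scores_by_group in targets:
        for group, z in z_scores(scores_by_group).items():
            totals[group] = totals.get(group, 0.0) + z
    return totals


if __name__ == "__main__":
    # Hypothetical GDT_TS values for three groups on two targets.
    targets = [
        {"group_A": 90.0, "group_B": 50.0, "group_C": 70.0},
        {"group_A": 85.0, "group_B": 65.0, "group_C": 45.0},
    ]
    ranking = sorted(z_score_sums(targets).items(), key=lambda kv: kv[1], reverse=True)
    for group, total in ranking:
        print(group, round(total, 3))
```

Because each score is normalized against the other predictions for the same target, a group gains most on targets where it does far better than the field, which is why sums of Z-scores separate groups more sharply than mean GDT_TS alone.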
====AlphaFold2 Pros and Cons====
As of February 2021, AlphaFold2 was '''not yet available''' to the general scientific community, and it was not clear how much computing power would be needed once it, or another system based on the same principles, did become available. By one estimate<ref name="rubiera">[https://www.blopig.com/blog/2020/12/casp14-what-google-deepminds-alphafold-2-really-achieved-and-what-it-means-for-protein-folding-biology-and-bioinformatics/ CASP14: what Google DeepMind’s AlphaFold 2 really achieved, and what it means for protein folding, biology and bioinformatics], a blog post by Carlos Outeiral Rubiera, December 3, 2020.</ref>, "the DeepMind team had roughly two orders of magnitude more computational resources" than did academic groups competing in CASP 14.
AlphaFold2 was developed by [http://deepmind.com DeepMind], a for-profit company whose parent company is Alphabet, Inc., the parent company of Google<ref name="deepmindwikipedia" />. '''An ethical question arises''', since AlphaFold2 was trained on public datasets, largely funded by public money<ref name="callaway" />. "Big as DeepMind's war chest might be, the taxpayers' investment that has made their achievement possible is several orders of magnitude larger."<ref name="rubiera" />
Once the new technology is widely available, X-ray crystallographers may often be able to '''skip solving the phase problem''', since they can solve their diffraction data by '''molecular replacement''', using predicted structures -- at least for single chain and single domain structures. This has already occurred: the group of Henning Tidow had toiled away for over a year on a structure which they were able to solve in less than a day using a prediction from DeepMind<ref name="rubiera" />.
'''Neither [[empirical methods]] nor theoretical methods are obsolete'''.
:1. AlphaFold2 does very well for about 2/3 of 92 single chain domains targeted in CASP 14, but '''less well for the remaining third'''. Its performance for sequence families not represented in CASP 14, and not well represented in the [[Protein Data Bank]] training set, remains to be seen.
:2. AlphaFold2 does not predict '''interactions between chains'''<ref name="callaway" /> forming functional [[biological assemblies]].
:3. AlphaFold2 does not predict '''ligand binding''', including the positions of '''metals''' in the one-third of proteins that are metalloproteins<ref name="callaway" />.
:4. AlphaFold2 does not predict protein '''kinetics and allostery''', often crucial for function<ref name="callaway" />.
:5. AlphaFold2 does not predict the '''trajectory of how a protein folds''', only the final structure<ref name="callaway" />.
====AlphaFold Servers and Database====
In July, 2021, the methods and open-source code were published, free servers offering to predict structures appeared, and a huge database of predictions became available. Please see [[AlphaFold]].
===2018: CASP 13===
Excerpts from the conclusions: "... the ability of predicting hard protein folds at the tertiary level has increased enormously ..." "On the other hand, important global and local features of prediction models are still seldom as accurate as in the experimental structure. This is the case of enzyme active sites and ligand binding sites, where the predicted arrangement of the amino acids side chains involved in ligand binding and substrate specificity has not achieved the level of accuracy required to confidently infer their function .... Accurate prediction of loops is still a challenging task*. As they are often involved in protein interactions, their incorrect prediction can compromise the accuracy of the interacting surface and overall structure of the complex." "... the ability of current methods in modeling the correct quaternary structure of proteins remains rudimentary and shows little progress compared to what observed at the tertiary level."<ref>Lepore <i>et al.</i>, in press in <i>Proteins: Structure, Function, and Bioinformatics</i>, 2019. DOI: [http://doi.org/10.1002/prot.25805 10.1002/prot.25805]</ref>
<blockquote>
"The most recent experiment (CASP13 held in 2018) saw dramatic progress in structure modeling without use of structural templates (historically 'ab initio' modeling). Progress was driven by the successful application of deep learning techniques to predict inter-residue distances. In turn, these results drove dramatic improvements in three-dimensional structure accuracy: With the proviso that there are an adequate number of sequences known for the protein family, the new methods essentially solve the long-standing problem of predicting the fold topology of monomeric proteins."<ref name="casp13b">PMID: 31589781</ref>
</blockquote>
<nowiki>*</nowiki>Fig. 4 in Kryshtafovych ''et al.''<ref name="casp13b" /> illustrates how, in the case of [[6cci]] (~350 residues), the core of the protein is well-predicted, while the surface loops are poorly predicted. Surfaces of folded proteins are generally critical in their functions.
===2008: CASP 8===
In CASP 8 (2008), there were 13 "template free" targets, that is, sequences for which no significant sequence identity occurred for any empirically solved entry in the [[PDB]]. These are the most difficult to predict, as they must be predicted by ''ab initio'' methods. 102 groups submitted predictions. Assessing the quality of a prediction is not simple, given that even "good" predictions can have high root mean square (RMS) deviations for alpha carbon alignment, e.g. due to a hinge<ref name="casp8" />. Several assessment methods were used, each emphasizing different qualities. A number of groups submitted good predictions for six of the thirteen targets<ref name="casp8">PMID: 19774550</ref>. None of the submitted models was judged to be satisfactory for four of the thirteen targets<ref name="casp8" />.
===2004: CASP 6===
In 2004, for about one out of four cases of small domains of less than 85 amino acids, the best predictions were within about 1.5 Å (RMS for carbon alphas) of the true structure<ref>PMID: 16166519</ref>. (Independent determinations of the same protein by empirical methods generally agree within <1.0 Å RMS for carbon alphas.)
==See Also==
*[[AlphaFold/Index]], a list of pages in Proteopedia about AlphaFold.
*[[Calculating GDT TS]]
* Theoretical models displayed in Proteopedia must be clearly identified: see [[Proteopedia:Policy#Theoretical Models]] using methods explained at [[Proteopedia:Cookbook#Theoretical Models]].
* [[:Category:Theoretical Model]]
* [[Homology modeling]]
* [[Practical Guide to Homology Modeling]]
* [[Homology modeling servers]]
* [[Structural genomics]]
* [[CASP]]
* [[CAMEO]]
* [http://folding.stanford.edu/ Folding@Home] and [http://boinc.bakerlab.org/rosetta/ Rosetta@home] use distributed computing to fold proteins and address structural biology issues.
* [http://fold.it/portal/info/science Protein Folding Game called Foldit]: Are humans good at folding? Can crowdsourcing help the folding problem? The game assesses how well you folded a solved protein. <!--Anyone tried the Foldit game and know if it could be recommended on the High School page for students with great interest in exploring protein structure more?--> Structures given as folding puzzles include solved structures and are not necessarily derived from theory; however, the goal of users is to produce a theoretical model that matches the known structure without referring to it. Check out [http://www.nature.com/nature/journal/v466/n7307/full/nature09304.html an article on it in Nature]<ref>PMID: 20686574</ref> and [http://www.biotechniques.com/news/Online-videogame-solves-protein-folding-problems/biotechniques-302088.html this coverage] of the project.
==References & Links==
<references />
==Acknowledgements==
[[User:Eric Martz|Eric Martz]] thanks Roman Sloutsky, Can Özden, Jeanne Hardy, Scott Garman, Thomas Sawyer, Katie Wahlbeck, Erik Nordquist, Nathaniel Kuzio (University of Massachusetts, Amherst) and Woody Sherman (Silicon Therapeutics) for introducing him to CASP 14 and AlphaFold2.