Quality assessment for molecular models: Difference between revisions
Eric Martz (talk | contribs) |
fix a typo |
(31 intermediate revisions by 2 users not shown) | |||
Line 5: | Line 5: | ||
Generally, crystallographic models are reliable in most details when they have [[Resolution|resolutions]] of 2.0 Å or better (the lower the number the better), [[R value|R values]] of 0.20 or less, and [[Free R]] values of 0.25 or less. However, new and important structural insights are often provided by models with much lower resolution. Interestingly, the quality of published molecular models is inversely related to the impact of the journals in which they are published<ref>Brown EN, Ramaswamy S. 2007. Quality of protein crystal structures. [http://www.blackwell-synergy.com/doi/full/10.1107/S0907444907033847 Biol. Crystallography 63:941-950].</ref>.
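The rule of thumb above can be written as a small check. This is only a sketch: the three limits are the ones quoted in the text, while the function name <code>failed_criteria</code> and the sample values are hypothetical and chosen for illustration. Failing a criterion does not by itself mean a model is uninformative, since lower-resolution models often provide important insights.

```python
# Rule-of-thumb limits quoted in the text above (not a formal standard).
MAX_RESOLUTION = 2.0  # in Å; lower numbers are better
MAX_R_VALUE = 0.20
MAX_FREE_R = 0.25


def failed_criteria(resolution, r_value, free_r):
    """Return the names of the rule-of-thumb criteria that a model fails."""
    failed = []
    if resolution > MAX_RESOLUTION:
        failed.append("resolution")
    if r_value > MAX_R_VALUE:
        failed.append("R value")
    if free_r > MAX_FREE_R:
        failed.append("Free R")
    return failed


# Hypothetical values, for illustration only.
print(failed_criteria(1.8, 0.19, 0.23))  # → []
print(failed_criteria(3.2, 0.24, 0.23))  # → ['resolution', 'R value']
```

An empty list means that all three values fall within the limits given above.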
===Validation By The PDB===
Detailed validation reports are available from the PDB for all entries, including those deposited before the validation process was implemented by the PDB. An example is the [https://files.rcsb.org/pub/pdb/validation_reports/zg/6zgg/6zgg_full_validation.pdf Full Validation Report for 6ZGG] (SARS-CoV-2 spike protein in an open conformation). To access such validation reports at RCSB, go to the page about the structure (for example [https://www.rcsb.org/structure/6zgg 6ZGG]), and scroll down to the section ''Experimental Data & Validation''.
====History====
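The report link for 6ZGG suggests a regular URL layout. The sketch below builds such a link for any four-character [[PDB code]]. It assumes, based only on the single example above, that the directory name is the two middle characters of the lower-case code; the function name <code>validation_report_url</code> is hypothetical, and the resulting link should be checked before relying on it.

```python
def validation_report_url(pdb_id):
    """Build a full-validation-report URL for a four-character PDB code.

    Assumption (inferred from the 6ZGG example link): the directory is
    named after the second and third characters of the lower-case code.
    """
    pdb_id = pdb_id.strip().lower()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("expected a four-character PDB code such as 6zgg")
    middle = pdb_id[1:3]
    return (
        "https://files.rcsb.org/pub/pdb/validation_reports/"
        f"{middle}/{pdb_id}/{pdb_id}_full_validation.pdf"
    )


print(validation_report_url("6ZGG"))
```

For 6ZGG this reproduces the link given above.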
In 2011, the Validation Task Force of the worldwide [[Protein Data Bank]] recommended that state-of-the-art crystallographic validation tools be used to generate succinct reports, understandable to non-experts, at the time a [[PDB code]] is assigned, and made available to the authors, reviewers, and users of the model<ref name="vtf" />. Their report<ref name="vtf" /> discusses these validation tools in some detail, including:
Line 14: | Line 18: | ||
*All-atom contacts (clash score, Asn/Gln/His flips), analyses available from [[MolProbity]].
*Underpacking (holes in the core). Analysis available from RosettaHoles2<ref>PMID: 20665689</ref> (not available as a server).
===Validation of SARS-CoV-2 Structures===
In 2021, Grabowski ''et al.'' analyzed ~1,000 recently deposited models of SARS-CoV-2 proteins, comparing the models with the deposited X-ray diffraction data (structure factors)<ref name="rapid">PMID: 33953926</ref>. They emphasized the importance of this comparison, rather than relying on the metadata in the PDB file [[Header of PDB file|header section]]:
<blockquote>
"Metadata that are only contained in the PDB itself can be unreliable because they are supplied by the researcher who made the deposition. Inexperience or haste may lead to information being submitted to the wrong field, to inappropriate values being entered or to data items being skipped. First-time depositors make as many as 20% of all PDB depositions (assuming that the first author of a structure is responsible for the deposition); therefore, mistakes are not uncommon."
</blockquote>
They found minor to moderate quality issues in about 100 structures, and serious issues in nine, two of which are presented in case studies. They provided a database of "validated SARS-CoV-2 related structural models of potential drug targets" at [http://covid19.bioreproducibility.org covid19.bioreproducibility.org], which includes diagnostic tools<ref name="covid19repro">PMID: 32981130</ref>.
==NMR Models==
Models resulting from [[NMR Ensembles of Models|solution NMR experiments]] account for about 15% of those published in the [[Protein Data Bank]]. These are generally less reliable than crystallographic models because the method yields less detailed information. For NMR, there are no widely reported global error estimates equivalent to the crystallographic [[R value]] and [[Free R]]. Unlike with crystallographic results, it is not possible to distinguish reliable from unreliable NMR models from information included in the PDB files. NMR models are more likely to contain major errors<ref>Traditional biomolecular structure determination by NMR spectroscopy allows for major errors. Sander B. Nabuurs, Chris A. E. M. Spronk, Geerten W. Vuister, and Gert Vriend. (2006). PLoS Computational Biology 2: [http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0020009 Open Access Full Text] [http://proteinexplorer.org/favlit/nmr.htm Precis]. DOI: 10.1371/journal.pcbi.0020009</ref> than are crystallographic models that have good [[Resolution]] and [[Free R]] values. In 2012, an X-ray crystallographic structure of integral membrane diacylglycerol kinase, [[3ze4]], revealed functionally important domain swapping<ref>PMID: 23676672</ref><ref>PMID: 23676677</ref> that was not present in an earlier NMR structure, [[2kdc]]<ref>PMID: 19556511</ref>. At least one rapid approach<ref>PMID: 23779148</ref> has been introduced to avoid misassignments, as summarized [http://www.rsc.org/chemistryworld/2013/06/nmr-misassignments-spreadsheet-artificial-neural-networks here]. In 2020 a "useful addition to existing measures of accuracy" was proposed in [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7749147/ 'A method for validating the accuracy of NMR protein structures' by Fowler et al.]<ref>PMID: 33339822</ref>. [https://github.com/nickjf/ANSURR The software repository related to that method] has not been updated since 2021.
==Global vs. Local Quality==
Line 28: | Line 40: | ||
The orientation of the sidechains of Asn, Gln, and His cannot be determined from the electron density in a crystallographic experiment at typical resolution, because of the similarity in electron densities of carbon vs. nitrogen. It is usually straightforward to determine the correct orientation by examining the local environment and optimising hydrogen bonding. Unfortunately, it is common for these determinations not to be made in published crystallographic models. Fortunately, [[MolProbity]] does these determinations automatically, and corrects the model by flipping the sidechains of Asn, Gln and His when this is warranted.
==Local Quality Scores of Protein Models in Cryo-EM Maps==
The [https://daqdb.kiharalab.org/ DAQ-Score Database] provides pre-computed residue-wise local quality scores for structure models in the [[PDB]] that were derived from [[cryo-EM]] maps<ref>PMID: 35953671</ref>.
Recently, all cryo-EM models in the PDB with 3-5 Å resolution were scanned for likely register errors by looking for inconsistencies between the residue contacts and distances observed in the deposited model and those computationally predicted by methods such as AlphaFold 2<ref>https://www.biorxiv.org/content/10.1101/2024.07.19.604304</ref>.
==Improving Published Models==
There are several free automated servers that can improve most published models. See [[Improving published models]].
Recently, new methods for protein structure validation have been introduced that are based on the compatibility of a structure with the inter-residue distances and contacts predicted by methods such as AlphaFold 2<ref>PMID: 36458613</ref>. All 3-5 Å resolution cryo-EM and crystallographic structures in the PDB were then scanned with that method, identifying thousands of likely register errors in locally incorrect PDB structures<ref>https://www.biorxiv.org/content/10.1101/2024.07.19.604304</ref>.
==Further Reading==
Wlodawer ''et al.'' (2008) explain how non-crystallographers can judge model quality<ref name="wlodawer-best">PMID: 18034855</ref>. Laskowski<ref>Laskowski, Roman A. 2003. Structural quality assurance. Chapter 14 in ''Structural Bioinformatics'' (2003) edited by Philip E. Bourne and Helge Weissig, Wiley-Liss, 649 pages. Complete contents at [http://www.structuralbioinformaticsbook.com structuralbioinformaticsbook.com].</ref> has provided an outstandingly clear and succinct overview of how to assess model quality. See also the 2007 overview by Kleywegt<ref>Kleywegt, GJ. 2007. Quality control and validation. Methods Mol. Biol. 364:255-72. [http://www.ncbi.nlm.nih.gov/pubmed/17172770 PubMed].</ref>. For examples of published crystallographic errors, see Laskowski; Kleywegt, 2000<ref>Kleywegt, GJ. 2000. Validation of protein crystal structures. Acta Crystallogr. D Biol. Crystallogr. 56:249-265</ref>; and Kleywegt and Brünger, 1996<ref>Kleywegt, GJ, AT Brünger. 1996. Checking your imagination: applications of the free R value. Structure 4:897-904. [http://www.ncbi.nlm.nih.gov/pubmed/8805582 PubMed].</ref>. Kleywegt has also provided an excellent on-line tutorial on model validation<ref>[http://xray.bmc.uu.se/gerard/embo2001/modval/index.html Practical Model Validation] by Gerard Kleywegt, University of Uppsala, Sweden</ref>.
See also the publications cited at [[Retractions and Fraud]], where you will find links to sites where you can search for retractions or expressions of concern.
==See Also==
Line 42: | Line 64: | ||
*[[R value]]
*[[Free R]]
*[[Clashes]]
*[[Water in macromolecular models]]: Too many or too few water molecules per amino acid suggest unreliability.
*[[Temperature value]]
*[[NMR Ensembles of Models]]
*[[Hydrogen in macromolecular models]]
*[[Improving published models]]
*[[Retractions and Fraud]], which includes links to sites where you can search for retractions or expressions of concern.
*[[Anisotropic refinement]]
*[[Molecular modeling and visualization software]]
==Content Donors==