Practical Guide to Homology Modeling

From Proteopedia



Revision as of 22:38, 28 December 2014

Terminology

  • Query sequence: The amino acid sequence for which a 3D model is wanted. More commonly called the target sequence, but talking about target vs. template gets confusing.
  • Template: An empirically determined 3D protein structure with significant sequence similarity to the query.
  • Structure will be used in this article to mean three-dimensional structure.

What Is A Homology Model?

Homology models, also called comparative models, are obtained by folding a query protein sequence (also called the target sequence) to fit an empirically-determined template model. The registration between residues in the query and template is determined by an amino acid sequence alignment between the query and template sequences.

Imagine that the template’s polypeptide backbone is a folded glass tube. Now imagine that the query sequence is a thin metal chain that can be pulled through the tube. The chain (query) will adopt the same fold as the tube (template). The sequence alignment specifies how far the chain should be pulled into the tube; that is, how the residues in the query sequence match up with the structure of the template.

Errors or uncertainties in the sequence alignment result in errors or uncertainties in the homology model. Portions of the query sequence cannot be modeled reliably where the sequence alignment has gaps due to insertions/deletions ("indels"), or where the template lacks coordinates due to crystallographic disorder. Provided there is sufficient sequence identity between the query and template (at least 30%), the main chain in homology models is usually mostly correct. However, the positions of sidechain rotamers in homology models are usually unreliable.
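The role of the alignment can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular modeling server: the function name `map_query_to_template` and the two short aligned sequences are hypothetical, and gaps are assumed to be written as `-`. It maps each query residue to the template residue it is aligned with, and lists query residues that face a gap in the template and therefore have no template coordinates to follow.

```python
def map_query_to_template(aligned_query: str, aligned_template: str):
    """Map query positions to template positions from a gapped pairwise alignment.

    Returns (mapping, unmodeled):
      mapping   -- dict of query position -> template position (both 1-based)
      unmodeled -- query positions aligned to a gap in the template
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")

    mapping = {}
    unmodeled = []
    q_pos = 0  # residues of the query consumed so far
    t_pos = 0  # residues of the template consumed so far
    for q, t in zip(aligned_query, aligned_template):
        if q != "-":
            q_pos += 1
        if t != "-":
            t_pos += 1
        if q != "-" and t != "-":
            mapping[q_pos] = t_pos      # query residue follows this template residue
        elif q != "-":
            unmodeled.append(q_pos)     # insertion in the query: no template to follow
    return mapping, unmodeled


# Hypothetical toy alignment (not real protein sequences).
mapping, unmodeled = map_query_to_template("MKT-AYIAK", "MRTLA--AK")
print(mapping)
print(unmodeled)
```

In this toy case, query residues 5 and 6 fall opposite a gap in the template, so they are the ones that could not be modeled reliably.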

Nevertheless, homology models are useful for seeing low-resolution features, such as which residues are on the surface or buried, which are close to other features of interest (such as a putative active site), and the overall distribution of charges and evolutionary conservation.

Do you need a homology model?

You don’t need a homology model if the amino acid sequence of interest (the query sequence) already has an empirically determined 3D structure. Structures determined empirically, by X-ray crystallography or (much less often) by solution NMR, will almost always be more accurate than a homology model.

Is there an empirical model?

All published, empirically-determined, atomic-resolution, macromolecular 3D structures are available in the Protein Data Bank (PDB, pdb.org).

Each model in the PDB has a unique 4-character identification code (PDB ID) that begins with a numeral, and has letters or numerals for the last 3 characters. Examples are 1d66, 4mdh, 9ins.
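The PDB ID format just described is simple enough to check with a regular expression. The sketch below encodes only the rule stated above (one numeral followed by three letters or numerals). The helper name `looks_like_pdb_id` is hypothetical, and passing the check says nothing about whether an entry with that ID actually exists.

```python
import re

# One numeral, then three characters that are each a letter or a numeral.
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def looks_like_pdb_id(text: str) -> bool:
    """Return True if text has the 4-character shape of a PDB ID."""
    return PDB_ID_PATTERN.fullmatch(text) is not None


for candidate in ["1d66", "4mdh", "9ins", "abcd", "1d6"]:
    print(candidate, looks_like_pdb_id(candidate))
```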

Here are two methods for finding out if your query amino acid sequence, or parts of it, have empirically-determined 3D structures in the PDB.

Simple search for empirical models (via PIR)

At UniProt.Org, find your protein and click on Structure.

  • If there is a column labeled “Entry” with 4-character PDB IDs, these are empirical structures for your protein. Pay attention to the “Positions” column, which gives the sequence number range covered by each model.
  • If there is no “Entry” column, then there are no sequence-identical empirical structures for your protein. Then try the Advanced search method below.
  • Some proteins have no Structure section (e.g. K4QDG1_SACBA). Then try the Advanced search method below.

If empirical structures exist, see sections below for guidance on how to explore them. If they are satisfactory, then you don't need a homology model.
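When several entries each cover only part of the protein, it can help to work out which residues are left without any empirical structure. The following is a minimal sketch, assuming you have copied the ranges from the “Positions” column by hand as (start, end) pairs. The function name `uncovered_segments` and the example numbers are hypothetical.

```python
def uncovered_segments(query_length, position_ranges):
    """List the (start, end) stretches of the query not covered by any structure.

    query_length    -- number of residues in the query sequence
    position_ranges -- iterable of (start, end) pairs, 1-based and inclusive
    """
    covered = [False] * (query_length + 1)  # index 0 is unused
    for start, end in position_ranges:
        for pos in range(max(start, 1), min(end, query_length) + 1):
            covered[pos] = True

    segments = []
    seg_start = None
    for pos in range(1, query_length + 1):
        if not covered[pos] and seg_start is None:
            seg_start = pos                      # an uncovered stretch begins
        elif covered[pos] and seg_start is not None:
            segments.append((seg_start, pos - 1))  # the stretch ended at pos - 1
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, query_length))
    return segments


# Hypothetical protein of 200 residues with two overlapping structures.
print(uncovered_segments(200, [(1, 100), (50, 120)]))
```

Any stretch reported here is a candidate for homology modeling, provided a suitable template can be found for it.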

Advanced search for empirical models (RCSB PDB)

This method takes more time but gives you more information. It will find empirical structures that have sequence similarity to the query. Such hits enable a high-quality homology model.

For example, if your query is calmodulin from the lancelet (Q9UB37, CALM2_BRALA), no empirical structures are listed at UniProt. However, the query is 97% identical in sequence to human calmodulin (P62158, CALM_HUMAN) and to calmodulins from other taxa, for which there are numerous full-length empirical structures. A very high-quality homology model can therefore be constructed.

Advanced search procedure:

  1. Copy the FASTA format sequence for your protein, for example, from UniProt.Org.
  2. Note the length of your sequence.
  3. At pdb.org, go to Advanced Search.
  4. Click on “Choose a query type” and select Sequence under “Sequence Features”.
  5. Paste your query sequence into the large box, and click the “Submit Query” button at the lower right of the search interface box.
  6. The best hits will be listed first, starting below “Showing 1-25 of XXX Results”. Notice that each hit starts with a large, bold PDB ID. In the “Alignment” section of the first hit, click on “Display for All Results”. Also in the “Compound” section, click “Display for All Results”.
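Steps 1 and 2 of the procedure above (obtaining the FASTA sequence and noting its length) can be done by hand, or with a short script such as the following sketch. The function name `read_fasta_sequence` and the example record are hypothetical. The code assumes plain FASTA text in which the first line starting with `>` is the header, and it reads only the first record.

```python
def read_fasta_sequence(fasta_text: str):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header = None
    residues = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):
            if header is not None:
                break                # a second record begins: stop reading
            header = line[1:]
        else:
            residues.append(line)    # sequence lines are joined without spaces
    return header, "".join(residues)


# Hypothetical record, not a real protein.
example = ">example_query hypothetical record\nMKTAYIAKQR\nQISFVKSHFS\n"
header, sequence = read_fasta_sequence(example)
print(header)
print(len(sequence))
```

The printed length is the number to keep in mind when judging how much of the query each hit covers.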

For each hit, notice the “Identities” above the sequence alignment box. The denominator tells you the length of the sequence alignment. The percentage tells you the sequence identity of the alignment.

For example, “Identities: 355/1045 (34%)” means that 1,045 residues of your query sequence align to the hit with 34% sequence identity (355 identical residues in the alignment). Knowing that my query had length 1,170 residues, I can see that this potential template for a homology model would enable me to model 1,045/1,170 = 89% of my query sequence. Quite often the alignment would span a much smaller portion of the full-length sequence.
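The arithmetic in this example can be reproduced with a small helper. This is a sketch that assumes the “Identities” text has exactly the form quoted above. The names `summarize_hit` and `IDENTITIES_PATTERN` are hypothetical, and the report format at pdb.org may differ from the one described here.

```python
import re

# Captures the two integers in text such as "Identities: 355/1045 (34%)".
IDENTITIES_PATTERN = re.compile(r"Identities:\s*(\d+)\s*/\s*(\d+)")


def summarize_hit(identities_line: str, query_length: int):
    """Compute percent identity and percent of the query covered by one hit."""
    match = IDENTITIES_PATTERN.search(identities_line)
    if match is None:
        raise ValueError("no 'Identities: n/m' text found")
    identical = int(match.group(1))
    aligned_length = int(match.group(2))
    return {
        "identity_percent": 100 * identical / aligned_length,
        "coverage_percent": 100 * aligned_length / query_length,
    }


summary = summarize_hit("Identities: 355/1045 (34%)", query_length=1170)
print(round(summary["identity_percent"]))   # 34
print(round(summary["coverage_percent"]))   # 89
```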

BEWARE! The sequence identity percentage may be underestimated at pdb.org. This happens when pdb.org deems segments of the query sequence to be of low complexity. Such segments are marked with X’s in the sequence alignment, and excluded from the calculation of sequence identity. For example, for Saccharomyces gal4 (UniProt P04386), for the top hit (3coq), pdb.org reports “Identities: 71/89 (80%)”, while in fact the sequence identity is 100%. Note this in the sequence alignment at pdb.org:

[Image: Seq-algn-lo-complexity.png]

The 18 residues marked X were not included in the identity calculation. In contrast, when the same sequence search is performed at PDB-Europe, 100% sequence identity is reported. However, other aspects of the report at PDB-Europe are less satisfactory (e.g. the length of the alignment is not stated; the sequences are not numbered) and hence we recommend using pdb.org despite its misleading sequence identity percentages.
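One way to guard against this underestimate is to count the X’s in the alignment and compute the range within which the true identity must lie. The sketch below is an assumption-laden illustration, not how pdb.org computes anything: the function name `identity_bounds` is hypothetical, and it takes the masked residues to be part of the stated alignment length, as in the gal4 example above.

```python
def identity_bounds(identical: int, aligned_length: int, masked: int):
    """Return (reported_percent, upper_bound_percent) for a masked alignment.

    identical      -- numerator of the reported "Identities" fraction
    aligned_length -- denominator of the reported "Identities" fraction
    masked         -- number of residues shown as X in the alignment
    """
    if identical + masked > aligned_length:
        raise ValueError("identical + masked cannot exceed the alignment length")
    reported = 100 * identical / aligned_length
    # Upper bound: every masked residue is in fact identical to the hit.
    upper = 100 * (identical + masked) / aligned_length
    return reported, upper


reported, upper = identity_bounds(identical=71, aligned_length=89, masked=18)
print(round(reported))  # 80
print(upper)            # 100.0
```

If the two numbers differ substantially, inspect the masked segments yourself before dismissing a potential template.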


Proteopedia Page Contributors and Editors

Eric Martz, Juergen Haas, Jaime Prilusky