Practical Guide to Homology Modeling

Terminology

Query sequence: The amino acid sequence for which a 3D model is wanted. It is more commonly called the target sequence, but this article uses "query" because "target" and "template" are easily confused.
Template: An empirically determined 3D protein structure with significant sequence similarity to the query.
Structure will be used in this article to mean three-dimensional structure.

What Is A Homology Model?

Homology models, also called comparative models, are obtained by folding a query protein sequence (also called the target sequence) to fit an empirically-determined template model. The registration between residues in the query and template is determined by an amino acid sequence alignment between the query and template sequences.

Imagine that the template’s polypeptide backbone is a folded glass tube. Now imagine that the query sequence is a thin metal chain that can be pulled through the tube. The chain (query) will adopt the same fold as the tube (template). The sequence alignment specifies how far the chain should be pulled into the tube; that is, how the residues in the query sequence match up with the structure of the template.
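The registration described above can be made concrete with a small sketch. It assumes the alignment is available as two equal-length gapped strings; the sequences and the function name `map_query_to_template` are made up for illustration. Each query residue is assigned the template residue it is aligned to, or nothing where it faces a gap and so has no template backbone to follow.

```python
def map_query_to_template(aligned_query, aligned_template, gap="-"):
    """Return one entry per query residue: the 0-based index of the
    template residue it is aligned to, or None if it faces a gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    mapping = []
    template_index = 0  # counts template residues consumed so far
    for q_char, t_char in zip(aligned_query, aligned_template):
        if q_char != gap:
            # A query residue opposite a template gap has no template position.
            mapping.append(template_index if t_char != gap else None)
        if t_char != gap:
            template_index += 1
    return mapping


# Toy alignment (hypothetical sequences, not real proteins).
query = "MKT-AYLLG"
template = "MRTQA--LG"
print(map_query_to_template(query, template))
# [0, 1, 2, 4, None, None, 5, 6]
```

The two `None` entries are query residues inserted relative to the template; template residue 3 (Q) is skipped because the query has a deletion there.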

Errors or uncertainties in the sequence alignment result in errors or uncertainties in the homology model. Portions of the query sequence cannot be modeled reliably where the sequence alignment has gaps due to insertions/deletions ("indels"), or where the aligned portion of the template lacks coordinates due to crystallographic disorder. Provided there is sufficient sequence identity between the query and template (at least 30%), the main chain in homology models is usually mostly correct. However, side-chain conformations (rotamers) in homology models are usually unreliable.

Nevertheless, homology models are useful for seeing low-resolution features, such as which residues are on the surface or buried, which are close to other features of interest (such as a putative active site), and the overall distribution of charges and evolutionary conservation.

Rationale for homology modeling

Predicting the structure of a protein from its sequence using theory alone is generally unsuccessful, despite decades of work by some very bright people and the real progress that has been made (see Theoretical models).

Structure is more conserved than sequence. This conclusion is supported by many examples of proteins that have similar structures, yet no discernible sequence identity. An example is the ftsZ cell division protein in bacteria, which shares structure with mammalian tubulin despite only 12-15% sequence identity [1]. The customary interpretation is that modern proteins with very similar structures have a common ancestor, and that their sequences diverged while maintaining the ancestral 3D structure.

Thus, if the query sequence has significant identity with an empirically determined protein structure (the template), there is a very high probability that they have similar structures. Folding the query sequence identically to the template, guiding the registration by the sequence alignment, produces a homology model.

Do you need a homology model?

You don’t need a homology model if the amino acid sequence of interest (the query sequence) already has an empirically determined 3D structure. Structures determined empirically, by X-ray crystallography or (much less often) by solution NMR, will almost always be more accurate than a homology model.

Is there an empirical model?

All published, empirically-determined, atomic-resolution, macromolecular 3D structures are available in the Protein Data Bank (PDB, pdb.org).

Each model in the PDB has a unique 4-character identification code (PDB ID) that begins with a numeral, and has letters or numerals for the last 3 characters. Examples are 1d66, 4mdh, 9ins.
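The ID format just described (one numeral followed by three letters or numerals) is easy to check mechanically. The following is a minimal sketch; the helper name `looks_like_pdb_id` is hypothetical, and a string that passes the check is only well-formed, not necessarily an entry that exists in the PDB.

```python
import re

# One digit, then exactly three letters or digits (case-insensitive).
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def looks_like_pdb_id(text):
    """Return True if text has the 4-character PDB ID format."""
    return PDB_ID_PATTERN.fullmatch(text) is not None


for candidate in ["1d66", "4mdh", "9ins", "d166", "1d666"]:
    print(candidate, looks_like_pdb_id(candidate))
```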

Here are two methods for finding out if your query amino acid sequence, or parts of it, have empirically-determined 3D structures in the PDB.

Simple search for empirical models (via PIR)

At UniProt.Org, find your protein and click on Structure.

If there is a column labeled “Entry” with 4-character PDB IDs, these are empirical structures for your protein. Pay attention to the “Positions” column, which gives the sequence number range covered by each model.
If there is no “Entry” column, then there are no sequence-identical empirical structures for your protein. Then try the Advanced search method below.
Some proteins have no Structure section (e.g. K4QDG1_SACBA). Then try the Advanced search method below.
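When several empirical structures each cover part of the protein, the “Positions” ranges can be combined to see which residues have no empirical coordinates at all. The sketch below assumes the ranges have been copied by hand into (start, end) pairs using 1-based, inclusive numbering; the function name, the protein length, and the ranges are all made up for illustration.

```python
def uncovered_ranges(query_length, covered):
    """Return the 1-based inclusive ranges of the query that are not
    covered by any of the given (start, end) ranges."""
    gaps = []
    next_needed = 1  # first residue not yet known to be covered
    for start, end in sorted(covered):
        if start > next_needed:
            gaps.append((next_needed, start - 1))
        next_needed = max(next_needed, end + 1)
    if next_needed <= query_length:
        gaps.append((next_needed, query_length))
    return gaps


# Hypothetical 400-residue protein with two overlapping structures.
print(uncovered_ranges(400, [(94, 312), (100, 290)]))
# [(1, 93), (313, 400)]
```

Any ranges reported here are the portions for which a homology model, or a disorder prediction, may still be worth pursuing.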

If empirical structures exist, see sections below for guidance on how to explore them. If they are satisfactory, then you don't need a homology model.

Advanced search for empirical models (RCSB PDB)

This method takes more time but gives you more information. It will find empirical structures that have sequence similarity to the query. Such hits enable a high-quality homology model.

For example, if your query is calmodulin from the lancelet (Q9UB37, CALM2_BRALA), no empirical structures are listed at UniProt. However, the query is 97% identical in sequence to human calmodulin (P62158, CALM_HUMAN) and to calmodulins from other taxa, for which there are numerous full-length empirical structures. A very high-quality homology model can therefore be constructed.

Advanced search procedure:

Copy the FASTA format sequence for your protein, for example, from UniProt.Org.
Note the length of your sequence.
At pdb.org, go to Advanced Search.
Click on “Choose a query type” and select Sequence under “Sequence Features”.
Paste your query sequence into the large box, and click the “Submit Query” button at the lower right of the search interface box.
The best hits will be listed first, starting below “Showing 1-25 of XXX Results”. Notice that each hit starts with a large, bold PDB ID. In the “Alignment” section of the first hit, click on “Display for All Results”. Also in the “Compound” section, click “Display for All Results”.
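For the first two steps of the procedure, the sequence length can be read directly from the FASTA text rather than counted by hand. This is a minimal sketch that assumes a single-record FASTA string (a header line starting with ">" followed by sequence lines); the record shown is made up, and `read_fasta_sequence` is a hypothetical helper name.

```python
def read_fasta_sequence(fasta_text):
    """Return (header, sequence) for the first record in a FASTA string."""
    lines = [line.strip() for line in fasta_text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    header = lines[0][1:]
    sequence_parts = []
    for line in lines[1:]:
        if line.startswith(">"):
            break  # stop at the start of a second record
        sequence_parts.append(line)
    return header, "".join(sequence_parts)


# Made-up record for illustration only.
example = """>demo|made-up protein
MKTAY
LLG
"""
header, sequence = read_fasta_sequence(example)
print(header, len(sequence))
# demo|made-up protein 8
```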

For each hit, notice the “Identities” above the sequence alignment box. The denominator tells you the length of the sequence alignment. The percentage tells you the sequence identity of the alignment.

For example, “Identities: 355/1045 (34%)” means that 1,045 residues of your query sequence align to the hit with 34% sequence identity (355 identical residues in the alignment). Knowing that my query had length 1,170 residues, I can see that this potential template for a homology model would enable me to model 1,045/1,170 = 89% of my query sequence. Quite often the alignment spans a much smaller portion of the full-length sequence.
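The arithmetic in this example can be repeated for every hit with a short sketch. It assumes the “Identities” line is pasted as plain text in the form shown above, and it follows this article in treating the denominator as the number of query residues aligned; the function name and dictionary keys are hypothetical.

```python
import re

IDENTITIES_PATTERN = re.compile(r"Identities:\s*(\d+)\s*/\s*(\d+)")


def alignment_summary(identities_line, query_length):
    """Return percent identity and percent of the query covered."""
    match = IDENTITIES_PATTERN.search(identities_line)
    if match is None:
        raise ValueError("no 'Identities: N/M' found in the line")
    identical = int(match.group(1))
    aligned = int(match.group(2))
    return {
        "percent_identity": 100.0 * identical / aligned,
        "percent_coverage": 100.0 * aligned / query_length,
    }


summary = alignment_summary("Identities: 355/1045 (34%)", 1170)
print(round(summary["percent_identity"]), round(summary["percent_coverage"]))
# 34 89
```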

BEWARE! The sequence identity percentage may be underestimated at pdb.org. This happens when pdb.org deems segments of the query sequence to be of low complexity. Such segments are marked with X’s in the sequence alignment and excluded from the calculation of sequence identity. For example, for Saccharomyces gal4 (UniProt P04386), for the top hit (3coq), pdb.org reports “Identities: 71/89 (80%)”, while in fact the sequence identity is 100%.

In the sequence alignment at pdb.org, the 18 residues marked X were not included in the identity calculation. In contrast, when the same sequence search is performed at PDB-Europe, 100% sequence identity is reported. However, other aspects of the report at PDB-Europe are less satisfactory (e.g. the length of the alignment is not stated; the sequences are not numbered) and hence we recommend using pdb.org despite its misleading sequence identity percentages. But you may certainly want to run the sequence search at PDB-Europe to compare the reported identity percentages.
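The effect of masking can be checked on any alignment by computing the identity twice: once over the full alignment length, as reported, and once over only the unmasked columns. This is a sketch with made-up toy sequences and a hypothetical function name. In the gal4 case the same arithmetic gives 71/89 = 80% as reported, but 71/(89 - 18) = 100% over the unmasked columns. Whether the masked residues themselves match can only be confirmed against the unmasked sequence.

```python
def identity_with_and_without_mask(aligned_query, aligned_hit, mask="X", gap="-"):
    """Return (percent identity over all columns,
    percent identity over unmasked columns only)."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    total = len(aligned_query)
    identical = 0
    masked = 0
    for q_char, h_char in zip(aligned_query, aligned_hit):
        if q_char == mask or h_char == mask:
            masked += 1  # low-complexity column, never counted as identical
        elif q_char == h_char and q_char != gap:
            identical += 1
    unmasked = total - masked
    reported = 100.0 * identical / total if total else 0.0
    over_unmasked = 100.0 * identical / unmasked if unmasked else 0.0
    return reported, over_unmasked


# Toy alignment: three query residues masked as low complexity.
print(identity_with_and_without_mask("MKXXXLLG", "MKSSSLLG"))
# (62.5, 100.0)
```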

Are parts (or all) of the query protein intrinsically disordered?

Attempts to determine structure for intrinsically disordered protein will be futile. Therefore, before considering homology modeling or crystallization experiments, it is important to predict whether portions of the query protein are likely to be intrinsically disordered.

Although a defined fold is required for the function of most proteins, some proteins are intrinsically disordered (natively unstructured) and do not fold, at least by themselves. Often, an intrinsically disordered protein transitions to an ordered state when it binds to a folded partner protein. However, some proteins remain disordered while performing their functions.

By some estimates, 10% of proteins are intrinsically disordered for their full lengths, and about 40% of eukaryotic proteins have at least one loop 50 residues or longer that is intrinsically disordered. These disordered loops are typically missing from X-ray crystallographic structures because the disorder blurs that portion of the electron density map.
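Given a per-residue disorder prediction, long disordered stretches such as the 50-residue threshold mentioned above can be located with a simple run-length scan. The sketch assumes the prediction has already been reduced to one True/False flag per residue (how that is done depends on the predictor); the function name and the example flags are made up.

```python
def long_disordered_segments(disorder_flags, min_length=50):
    """Return 1-based inclusive (start, end) ranges of consecutive
    disordered residues that are at least min_length long."""
    segments = []
    run_start = None  # start position of the current disordered run
    for position, is_disordered in enumerate(disorder_flags, start=1):
        if is_disordered and run_start is None:
            run_start = position
        elif not is_disordered and run_start is not None:
            if position - run_start >= min_length:
                segments.append((run_start, position - 1))
            run_start = None
    # Close a run that extends to the C terminus.
    if run_start is not None and len(disorder_flags) - run_start + 1 >= min_length:
        segments.append((run_start, len(disorder_flags)))
    return segments


# Hypothetical 330-residue protein: disordered termini, short internal loop.
flags = [True] * 60 + [False] * 200 + [True] * 10 + [False] * 5 + [True] * 55
print(long_disordered_segments(flags))
# [(1, 60), (276, 330)]
```

The 10-residue internal run is ignored because it falls below the threshold.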

Examples:

Folded: Pyruvate kinase (length 531; e.g. P11979, KPYM_FELCA) has no disordered regions. The crystal structure (1pkm) lacks only 11 residues at the C terminus.
Partially folded: The tumor suppressor protein p53 (length 393; e.g. P04637, P53_HUMAN) is intrinsically disordered at both the N and C termini. There are many crystallographic structures for the folded mid-region (~200 residues), which lack coordinates for 90-some residues at the N terminus, and 90-some at the C terminus. Some solution NMR structures of the N terminus illustrate the disorder (e.g. 2ly4).
Unfolded: Caldesmon from chicken gizzard (length 771; P12957, CALD1_CHICK) has no crystal structures, and is predicted to be disordered for essentially its full length.

Prediction of intrinsic disorder

MobiDB via UniProt

At UniProt.Org, find your protein, then click on “Structure”. At the bottom of this section there is usually a link to MobiDB’s report for the query protein. There, the section Detailed Disorder Annotations contains graphics showing experimental evidence for disorder (if available) and, under the heading Predictors, results from several servers designed to predict intrinsic disorder.

The Examples above are linked to MobiDB.

FoldIndex

The FoldIndex server is a useful adjunct to the MobiDB report, since it is not included in that report.
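For orientation, here is a rough sketch of the kind of calculation FoldIndex is understood to perform. Everything in it is an assumption drawn from the published description of the method rather than from this article: the charge/hydropathy relation 2.785⟨H⟩ − |⟨R⟩| − 1.151, the Kyte-Doolittle hydrophobicity scale rescaled to 0-1, and net charge counted as +1 for K and R and −1 for D and E. The actual server also scans the sequence with a sliding window, whereas this sketch scores a whole sequence at once, so its output should not be expected to match the server's.

```python
# Kyte-Doolittle hydropathy values (range -4.5 to 4.5).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def fold_index(sequence):
    """Whole-sequence score: positive suggests folded, negative suggests
    intrinsically unfolded (assumed formula, see the text above)."""
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    # Mean hydrophobicity, rescaled from [-4.5, 4.5] to [0, 1].
    mean_hydrophobicity = sum(
        (KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in residues
    ) / len(residues)
    # Mean net charge per residue (assumed: K, R = +1; D, E = -1).
    net_charge = sum((aa in "KR") - (aa in "DE") for aa in residues)
    mean_net_charge = net_charge / len(residues)
    return 2.785 * mean_hydrophobicity - abs(mean_net_charge) - 1.151


# Extreme toy sequences: all leucine versus all glutamate.
print(fold_index("L" * 20) > 0, fold_index("E" * 20) < 0)
# True True
```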

References

[1] A 3D structure similarity search gives tubulin as one of the closest matches to ftsZ, with an RMSD (alpha carbons) of <2.6 Å.

Proteopedia Page Contributors and Editors

Eric Martz, Juergen Haas, Jaime Prilusky