Unusual sequence numbering: Difference between revisions

Proteopedia Page Contributors and Editors

Eric Martz

@@ Line 1: / Line 1: @@
-The numbering of protein and nucleic acid sequences is arbitrary in structure files from the [[PDB|World Wide Protein Data Bank]] (PDB). That is, authors are free to number sequences as they wish.
+The numbering of protein and nucleic acid sequences is arbitrary in structure files from the [[PDB|World Wide Protein Data Bank]] (PDB). That is, authors are free to number sequences as they wish. If you need to change the numbering in a published [[PDB file]], please see [[Renumbering PDB files]].
-'''Straightforward numbering''' assigns 1 to the amino-terminal amino acid, and counts up sequentially and monotonically to the carboxy-terminal amino acid. An example is [http://firstglance.jmol.org/fg.htm?mol=1pgb 1pgb] ([[1pgb]]). The crystallized protein is numbered 1-56, despite it being a fragment of a [http://www.uniprot.org/uniprot/P06654#sequences 448-residue full length sequence] that begins (after adding an N-terminal Met) at full-length sequence number 228.
+'''Straightforward numbering''' assigns 1 to the amino-terminal amino acid (or 5' nucleotide), and counts up sequentially and monotonically to the carboxy-terminal amino acid (or 3' nucleotide). An example is [http://firstglance.jmol.org/fg.htm?mol=1pgb 1pgb] ([[1pgb]]). The crystallized protein is numbered 1-56, despite it being a fragment of a [http://www.uniprot.org/uniprot/P06654#sequences 448-residue full length sequence] that begins (after adding an N-terminal Met) at full-length sequence number 228.
 Below are some examples of '''unusual sequence numbering'''. The 3D structures of these PDB entries are not shown here. To explore them in 3D, the links below will display them in [[FirstGlance in Jmol]] (link with arrow) or in Proteopedia (link in parentheses).
 ==Numbering Does Not Start With One==
+===Arbitrary Numbering===
+[[1bsz]] contains three sequence-identical chains numbered 1-168, 501-668, and 1001-1168.
 ===N-Terminal Residues Missing Coordinates===
 Probably the most common reason that the first residue with coordinates is not numbered 1 is that the N-terminal (or 5'-terminal) residues are missing coordinates due to crystallographic disorder (fuzzy electron density map). An example is [http://firstglance.jmol.org/fg.htm?mol=1d66 1d66] ([[1d66]]). The first 7 residues of chain A are missing, so the first residue with coordinates is numbered 8. Residues 1-7 were present in the crystallized protein, but could not be resolved in the electron density map.
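The cases above (numbering that starts at an arbitrary offset, or at a residue other than 1 because N-terminal residues lack coordinates) can be spotted by listing the first and last residue number of each chain. The following is a minimal sketch, not part of the original page. It assumes the fixed-column PDB format, where the chain ID is in column 22 and the residue number in columns 23-26. The names <code>atom_line</code> and <code>numbering_ranges</code> are hypothetical helpers, and the ATOM records are synthetic, not taken from any PDB entry.

```python
def atom_line(serial, chain, res_seq, icode=" ", res_name="GLY"):
    """Build one synthetic fixed-column ATOM record (illustration only)."""
    return (f"ATOM  {serial:5d}  CA  {res_name:>3s} {chain}{res_seq:4d}{icode}"
            f"   {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


def numbering_ranges(pdb_lines):
    """Map each chain ID to (first, last) residue number, in file order."""
    ranges = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]          # column 22: chain identifier
        res_seq = int(line[22:26])   # columns 23-26: residue sequence number
        first, _ = ranges.get(chain_id, (res_seq, res_seq))
        ranges[chain_id] = (first, res_seq)
    return ranges


# Made-up records: chain A starts at 8, chain B is numbered from 501.
sample = [atom_line(i + 1, "A", n) for i, n in enumerate([8, 9, 10])]
sample += [atom_line(i + 4, "B", n) for i, n in enumerate([501, 502, 503])]

for chain_id, (first, last) in numbering_ranges(sample).items():
    note = "" if first == 1 else " (does not start at 1)"
    print(f"chain {chain_id}: {first}-{last}{note}")
```

Note that "last" here means the last residue number encountered in file order, which is not necessarily the largest number when numbering is non-monotonic. The sketch also cannot tell whether a start other than 1 comes from missing coordinates or from the authors' choice of numbering.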
@@ Line 24: / Line 27: @@
 ===Insertion Codes===
 [[Image:Sequence-insertion-codes-1igy.png|frame|Excerpt from PDB file 1igy showing insertion codes.]]
-Sometimes the residues of a protein are numbered according to a different ''reference sequence''. When there are insertions relative to the reference sequence, the additional residues may all be given the same sequence number, but marked with alphabetic insertion codes. This is frequently done in antibodies, where the reference sequence is the germline sequence, but the antibody has been somatically mutated, especially in complementarity-determining region (CDR) 3. An example is [http://firstglance.jmol.org/fg.htm?mol=1igy 1igy] ([[1igy]]). Four residues in chain B all have sequence number 82. They are distinguished by insertion codes: 82, 82A, 82B, 82C. At right is this part of the PDB file. Below are residues 81-83 showing their sequence numbers in [[FirstGlance in Jmol]]. Insertion codes are given following a caret "^". (How? See note <ref name="how1">Click ''Find'' and enter ''chain=B and 81-83. Click ''Isolate'' and check ''Atoms with Halos''. Zoom in. In the left center after "Halos around:" click ''Change'', and then ''Clear Halos''. Check ''Sequence numbers'' (near the bottom of the upper left panel).</ref>)
+Sometimes the residues of a protein are numbered according to a different ''reference sequence''. When there are insertions relative to the reference sequence, the additional residues may all be given the same sequence number, but marked with alphabetic insertion codes. This is frequently done in antibodies, where the reference sequence is the germline sequence, but the antibody has been somatically mutated, especially in complementarity-determining region (CDR) 3. An example is [http://firstglance.jmol.org/fg.htm?mol=1igy 1igy] ([[1igy]]). Four residues in chain B all have sequence number 82. They are distinguished by insertion codes: 82, 82A, 82B, 82C. At right is this part of the PDB file. Below are residues 81-83 showing their sequence numbers in [[FirstGlance in Jmol]]. Insertion codes are given following a caret "^". (How? See note <ref name="how1">Display 1igy in FirstGlance in Jmol. Click ''Find'' and enter ''chain=B and 81-83''. Click ''Isolate'' and check ''Atoms with Halos''. Zoom in. In the left center after "Halos around:" click ''Change'', and then ''Clear Halos''. Check ''Sequence numbers'' (near the bottom of the upper left panel).</ref>)
 <center>
 <table width=200><tr><td>[[Image:1igy-chain-b-81-83.png|200px|center]]</td></tr><tr><td>1igy residues 81-83 displayed with sequence numbers in FirstGlance in Jmol.<ref name="how1" /></td></tr></table>
@@ Line 41: / Line 44: @@
 ===Missing Residues===
 [[Image:Sequence-missing-loop-2ace.png|frame|Excerpt from PDB file 2ace showing gap in sequence numbering due to a missing loop.]]
-It is not uncommon for a surface loop of the crystallized protein to be disordered. Often such loops are [[Intrinsically Disordered Protein|intrinsically disordered]]. The disorder blurs the electron density map for that loop, and the loop residues are not given coordinates in the model: they are missing in the model. However, they were not missing in the crystallized protein. This causes a gap in the sequence numbers in the PDB file. An example is [http://firstglance.jmol.org/fg.htm?mol=2ace 2ace] ([[2ace]]). Residues 485-489 are missing in the 3D crystallographic model due to disorder in the crystal. Also missing are 3 N-terminal and 2 C-terminal residues. FirstGlance in Jmol tabulates missing residues and marks regions of the 3D model where residues are missing with "empty baskets".
+It is not uncommon for a surface loop of the crystallized protein to be disordered. Often such loops are [[Intrinsically Disordered Protein|intrinsically disordered]]. The disorder blurs the electron density map for that loop, and the loop residues are not given coordinates in the model: they are [[Missing residues and incomplete sidechains|missing in the model]]. However, they were not missing in the crystallized protein. This causes a gap in the sequence numbers in the PDB file. An example is [http://firstglance.jmol.org/fg.htm?mol=2ace 2ace] ([[2ace]]). Residues 485-489 are missing in the 3D crystallographic model due to disorder in the crystal. Also missing are 3 N-terminal and 2 C-terminal residues. FirstGlance in Jmol tabulates missing residues and marks regions of the 3D model where residues are missing with "empty baskets".
+<table width=550><tr><td>[[Image:2ace-empty-basket.png|center]]</td><td>&quot;Empty Basket&quot;: Closeup of the region of [[2ace]] where residues 485-489 are missing. In [[FirstGlance in Jmol]], empty baskets alert the user to missing residues. (&quot;S-&quot; labels residues with missing sidechain atoms.)
+<br><br>
+See also [[Missing residues and incomplete sidechains]].</td></tr></table>
 {{clear}}
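A gap in sequence numbers of the kind described above can be located by scanning the residue numbers of one chain in file order. The following is a minimal sketch, not part of the original page. It assumes the residue numbers have already been read from the ATOM records into a list of integers. The function name <code>numbering_gaps</code> is hypothetical, and the example numbers are synthetic, patterned on a loop missing between 484 and 490.

```python
def numbering_gaps(residue_numbers):
    """Return (before, after) pairs where the number jumps forward by more than 1.

    residue_numbers: residue numbers of ONE chain, in file order.
    Repeated numbers (several atoms of the same residue) are tolerated.
    """
    gaps = []
    previous = None
    for current in residue_numbers:
        if previous is not None and current - previous > 1:
            gaps.append((previous, current))
        previous = current
    return gaps


# Synthetic chain: residues 485-489 have no coordinates.
numbers = [482, 483, 484, 484, 490, 491, 492]

for before, after in numbering_gaps(numbers):
    missing = after - before - 1
    print(f"gap after {before}: next residue is {after} ({missing} numbers skipped)")
```

A forward jump in numbering only suggests missing residues. It could equally reflect an arbitrary numbering choice, so a real check would compare against the full crystallized sequence or the missing-residue table reported by FirstGlance in Jmol.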
 ==Not Monotonic==
 [[Image:Sequence-not-monotonic-4zwj.png|frame|Excerpt from PDB file 4zwj showing non-monotonic sequence numbering in chain A.]]
-Rarely, sequence numbers do not increase monotonically from N to C terminus. An example<ref>Thanks to Rachel Kramer Green of [[RCSB]] for this example.</ref> is [http://firstglance.jmol.org/fg.htm?mol=4zwj 4zwj] ([[4zwj]]). In this chimeric protein, chain A is numbered 1002-1161 continuing 1-326 continuing 2012-2361. That is, there are sudden jumps in numbering of consecutive amino acids: 1161 to 1, and 326 to 2012. At right is an excerpt from the ATOM records of the [[PDB file]] for 4zwj chain A.
+Rarely, sequence numbers do not increase monotonically from N to C terminus. An example<ref>Thanks to Rachel Kramer Green of [[RCSB]] for this example.</ref> is [http://firstglance.jmol.org/fg.htm?mol=4zwj 4zwj] ([[4zwj]]). In this chimeric protein, chain A is numbered 1002-1161 continuing 1-326 continuing 2012-2361. That is, there are sudden jumps in numbering of consecutive amino acids: 1161 to 1, and 326 to 2012. At right is an excerpt from the ATOM records of the [[PDB file]] for 4zwj chain A. Below is a snapshot of the non-monotonic numbering.
+<center>
+<table width=350><tr><td>[[Image:Not-monotonic-3sn6.png]]</td></tr><tr><td>Eight amino acids from 4zwj displayed with sequence numbers in FirstGlance in Jmol.<ref name="how2">Display 4zwj in FirstGlance in Jmol. Click ''Find'' and enter ''chain=A and (1-3,1160-1161,281-283)''. Click ''Isolate'' and check ''Atoms with Halos''. Zoom in. In the left center after "Halos around:" click ''Change'', and then ''Clear Halos''. Check ''Sequence numbers'' (near the bottom of the upper left panel).</ref> Tyr 1161 is peptide-bonded N-terminal to Met 1. Cys 2 is disulfide-bonded to Cys 282.</td></tr></table>
+</center>
-== References ==
+Other examples:
+*[http://firstglance.jmol.org/fg.htm?mol=1nsa 1nsa] ([[1nsa]]) is numbered 7A-95A ("A" being an insertion code) continuing 4-308. There is also 188A inserted between 188 and 189.
+*Chain R in [http://firstglance.jmol.org/fg.htm?mol=3sn6 3sn6] ([[3sn6]]) is numbered 1002-1164 continuing 30-365. However, the model lacks a bond between 1164 and 30 because amino acids 1161-1164 are missing due to crystallographic disorder.
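Non-monotonic numbering, including cases that involve insertion codes, can be detected by comparing each residue identifier with the one before it. The following is a minimal sketch, not part of the original page. It assumes each residue is represented as a (number, insertion code) pair in file order, with an empty string when there is no insertion code, and it assumes insertion codes follow alphabetical order after the uncoded residue, which is common but not guaranteed in every entry. The function name <code>non_monotonic_steps</code> is hypothetical, and the residue lists are synthetic, patterned on the numbering described above.

```python
def non_monotonic_steps(residue_ids):
    """Return (previous, current) pairs where numbering fails to increase.

    residue_ids: list of (number, insertion_code) tuples for ONE chain,
    one entry per residue, in file order. Tuples compare by number first,
    then by insertion code, so (82, "") < (82, "A") < (82, "B") < (83, "").
    """
    steps = []
    for previous, current in zip(residue_ids, residue_ids[1:]):
        if current <= previous:   # not strictly greater than its predecessor
            steps.append((previous, current))
    return steps


def label(residue_id):
    """Format a residue identifier such as (82, "A") as '82A'."""
    number, icode = residue_id
    return f"{number}{icode}"


# Synthetic examples patterned on the cases described in the text.
insertion_only = [(81, ""), (82, ""), (82, "A"), (82, "B"), (82, "C"), (83, "")]
chimera_like = [(1160, ""), (1161, ""), (1, ""), (2, ""), (3, "")]
coded_then_plain = [(94, "A"), (95, "A"), (4, ""), (5, "")]

for name, ids in [("insertion_only", insertion_only),
                  ("chimera_like", chimera_like),
                  ("coded_then_plain", coded_then_plain)]:
    jumps = [f"{label(a)} -> {label(b)}" for a, b in non_monotonic_steps(ids)]
    print(name, jumps if jumps else "monotonic")
```

Under these assumptions, a run of insertion codes on one sequence number is treated as increasing, while a drop such as 1161 to 1 is reported. Forward jumps (such as 326 to 2012) are not reported by this sketch, since they are still increasing.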
+== Notes ==
 <references/>
+==See Also==
+*[[Renumbering PDB files]]
+*[[Missing residues and incomplete sidechains]]