PDB identification code: Difference between revisions

Proteopedia Page Contributors and Editors

Eric Martz, Wayne Decatur, Jaime Prilusky

@@ Line 1: / Line 1: @@
-Every molecular model ([[Atomic coordinate file|atomic coordinate file]]) in the [[Protein Data Bank]] (PDB) has a unique accession or identification code. These codes are always 4 characters in length. The first character is a numeral, while the last three characters can be either numerals or letters. In the past, the first character was always a numeral in the range 1-9. Although there appear to be no entries beginning with "0", its exclusion [http://www.rcsb.org/robohelp/index.htm#search_database/pdb_identifier.htm may have been relaxed].
+Every molecular model ([[Atomic coordinate file|atomic coordinate file]]) in the [[Protein Data Bank]] (PDB) has a unique accession or identification code. These codes are always 4 characters in length. The first character is a numeral in the range 1-9, while the last three characters can be either numerals (in the range 0-9) or letters (in the range A-Z in the [http://en.wikipedia.org/wiki/Latin_alphabet#Classical_Latin_alphabet Latin alphabet]). Plans for an expanded identification code system that handles more entries [[#Future Plans for Expanded PDB Codes|have been announced]].
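The format rule above can be checked mechanically. The following is a minimal sketch, not an official validator; the function name <code>is_classic_pdb_id</code> is hypothetical, and accepting lower-case letters as well as upper-case ones is an assumption made here for convenience.

```python
import re

# Classic 4-character code as described above:
# first character 1-9, then three characters that are numerals or letters.
# Assumption: both upper- and lower-case ASCII letters are accepted.
CLASSIC_ID_RE = re.compile(r"[1-9][0-9A-Za-z]{3}")


def is_classic_pdb_id(code: str) -> bool:
    """Return True if `code` has the shape of a 4-character PDB code."""
    return CLASSIC_ID_RE.fullmatch(code) is not None


if __name__ == "__main__":
    for candidate in ["1ace", "3LUW", "0abc", "1ac", "12345"]:
        print(candidate, is_classic_pdb_id(candidate))
```

This only tests the shape of the code. It does not check whether an entry with that code actually exists in the PDB.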
 ==Lower vs. Upper Case==
@@ Line 21: / Line 21: @@
 For many years, depositors of models could request an available PDB code that represented an acronym for the molecule represented. All the above examples are such cases. With the increase in number of new entries each week, the PDB no longer permits this option. In recent years, all PDB codes are assigned by the PDB from the pool of available codes, in sequential ascending order, without reference to the name of the molecule.
-==PDB codes are permanently associated with a single structure==
+==PDB codes have been permanently associated with a single structure==
-Once a PDB code is assigned to a given structure, it's forever, even in those cases when a structure is withdrawn (retired from the database), like [[3luw]], or superseded by a newer or corrected structure, like [[1ace]]. If requesting a page for a superseded structure, like [[1aak]], Proteopedia will automatically display the newest structure [[2aak]]. Look for the explanation in the 'About this Structure' section of each page.
+Once a PDB code is assigned to a given structure, it's forever, even in those cases when a structure is withdrawn (retired from the database), like [[3luw]], or superseded by a newer or corrected structure, like [[1ace]]. When a page is requested for a superseded structure, like [[1aak]], Proteopedia will automatically display the newest structure [[2aak]]. Look for the explanation in the 'Structural Highlights' section of each page.
-==Limited Number of PDB Codes==
+In May, 2017, the PDB announced plans for a versioning system. This went into effect in July, 2019<ref name="news2" />. It allows multiple versions of the same entry to keep a single PDB code. See [[#Future Plans for Expanded PDB Codes|below]].
-There are over 400,000 possible 4-character PDB identification codes (419,904 or 466,560 if "0" is allowed as the first character). Thus, the ~78,000 entries in early 2012 have used up less than 19% of the available codes. Someday a scheme that can accomodate more entries will be required, requiring revision of macromolecular visualization and modeling software programs that obtain data online, all of which, of necessity, currently require 4-character PDB codes.
+==Limited Number of 4-Character PDB Codes==
+There are 419,904 possible 4-character PDB identification codes<ref>Ten numerals plus 26 letters = 36. The first character is 1-9. (9)(36<sup>3</sup>) = 419,904.</ref>. This could be increased to 466,560 if the numeral "0" were allowed as the first character<ref>In April, 2013, according to Rachel Kramer Green of the RCSB in Rutgers, NJ, there were no plans to use PDB codes beginning with 0. However, in July, 2017, the WWPDB FAQ stated "The four-letter PDB identifier currently consists of a number (0-9) followed by 3 letters or numbers."</ref>. Thus, the ~170,000 entries in mid 2017 (plus withdrawn and superseded entries) had used up nearly half of the available codes. After approximately 2027<ref name="spring2024">[https://cdn.rcsb.org/rcsb-pdb/general_information/news_publications/newsletters/2024q2/deposit.html#two Resources for Supporting the Extended PDB ID Format (pdb_00001abc)], Spring 2024 Issue of the RCSB PDB Newsletter.</ref>, a scheme that can accommodate more entries will be required. This will require revising macromolecular visualization and modeling software that obtains data online, all of which, of necessity, currently requires 4-character PDB codes. See plans for an expanded system in the following section.
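The counts quoted above follow from simple arithmetic, sketched here. The variable names are illustrative only, and the 170,000 figure is the approximate mid-2017 entry count given in the text, which excludes withdrawn and superseded entries.

```python
# 10 numerals + 26 letters are available for each of the last three characters.
ALPHABET_SIZE = 10 + 26

# First character restricted to 1-9 (nine choices).
codes_first_1_to_9 = 9 * ALPHABET_SIZE ** 3

# First character allowed to be 0-9 (ten choices).
codes_first_0_to_9 = 10 * ALPHABET_SIZE ** 3

# Approximate number of current entries in mid 2017, as quoted in the text.
entries_mid_2017 = 170_000
fraction_used = entries_mid_2017 / codes_first_1_to_9

print(codes_first_1_to_9)   # 419904
print(codes_first_0_to_9)   # 466560
print(round(fraction_used, 2))
```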
+==Future Plans for Expanded PDB Codes==
+In May, 2017, the [[Protein Data Bank]] announced plans to introduce, later in 2017, an expanded PDB accession code with versioning<ref name="news1">PDB News May 17, 2017: [https://www.wwpdb.org/news/news?year=2017#5910c8d8d3b1d333029d4ea8 Revise Your Structure Without Changing the PDB Accession Code and Related Changes to the FTP Archive].</ref>. The new codes will have the format
+<blockquote>
+'''pdb_00001abc'''
+</blockquote>
+where the 5 characters "00001" may each be a numeral, 0-9, and the 3 trailing characters "abc" may each be a numeral or a letter.
+In addition to increasing the number of possible accession codes from ~4 x 10<sup>5</sup> to >10<sup>9</sup>, this will facilitate "text mining detection of PDB entries in the published literature"<ref name="news1" />. The PDB also promises "For as long as practicable, the
+wwPDB will continue assigning PDB codes that can be truncated losslessly
+to the current four-character style."<ref name="news1" /> When 4-character codes are exhausted, new entries will be available in [[Atomic_coordinate_file#mmCIF_Data_Format|mmCIF format]] only, since the legacy [[PDB format]] will not accommodate 12-character IDs.
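The relationship between the two formats can be sketched as follows, assuming that "truncated losslessly" means that an extended code whose first four numerals are zeros maps back to a classic 4-character code. The function names <code>to_extended</code> and <code>to_classic</code> are hypothetical, and lower-casing the result is an assumption based on the example "pdb_00001abc".

```python
import re
from typing import Optional

# Classic code: first character 1-9, then three numerals or letters.
CLASSIC_RE = re.compile(r"[1-9][0-9A-Za-z]{3}")
# Extended code as described above: "pdb_" + 5 numerals + 3 numerals or letters.
EXTENDED_RE = re.compile(r"pdb_([0-9]{5})([0-9A-Za-z]{3})")

# Number of distinct extended codes: 10^5 * 36^3, which exceeds 10^9.
EXTENDED_CAPACITY = 10 ** 5 * 36 ** 3


def to_extended(classic: str) -> str:
    """Pad a 4-character code to the extended form, e.g. 1abc -> pdb_00001abc."""
    if CLASSIC_RE.fullmatch(classic) is None:
        raise ValueError(f"not a 4-character PDB code: {classic!r}")
    return "pdb_0000" + classic.lower()


def to_classic(extended: str) -> Optional[str]:
    """Return the 4-character form, or None if truncation would lose information."""
    match = EXTENDED_RE.fullmatch(extended)
    if match is None:
        raise ValueError(f"not an extended PDB code: {extended!r}")
    numerals, tail = match.groups()
    # Only codes of the form pdb_0000N..., with N in 1-9, fit the classic style.
    if numerals.startswith("0000") and numerals[4] != "0":
        return (numerals[4] + tail).lower()
    return None


if __name__ == "__main__":
    print(to_extended("1abc"))          # pdb_00001abc
    print(to_classic("pdb_00001abc"))   # 1abc
    print(to_classic("pdb_00011abc"))   # None
```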
+As of 2024, the [[wwPDB]] planned to make a beta 12-character ID archive available in 2026<ref name="spring2024" />, and estimated that the 4-character IDs would be consumed in 2029<ref name="spring2024" />.
+===Versioning===
+Along with the expanded accession codes, a versioning system was introduced in mid-2019<ref name="news1" /><ref name="news2">PDB News July 31, 2019: [https://www.rcsb.org/news?year=2019&article=5d3ef68aea7d0653b99c87fd Improve your previously released coordinates AND keep your original PDB ID with OneDep]</ref>.
+<blockquote>
+At present, revised atomic coordinates for an existing released PDB
+entry are assigned a new accession code, and the prior entry is
+obsoleted. This long-standing wwPDB policy had the unintended
+consequence of breaking connections with publications and usage of the
+prior set of atomic coordinates ....<ref name="news1" />
+</blockquote>
+The version of an accession will be included in its filename thus:
+<blockquote>
+'''pdb_00001abc_xyz_v1-2.cif.gz'''
+</blockquote>
+where "v1" designates a major version, and "-2" a minor version.<ref name="news1" /> "xyz" is a constant that signifies an atomic coordinate file. Other types of data files might use the same PDB accession codes in future.
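A versioned filename of the form shown above could be split into its parts as follows. This is a sketch based only on the single example given here: the pattern, the <code>VersionedFile</code> type, and the function name are hypothetical, and it assumes that the content-type token is lower-case letters and that the extension is always ".cif.gz".

```python
import re
from typing import NamedTuple, Optional


class VersionedFile(NamedTuple):
    accession: str     # e.g. "pdb_00001abc"
    content_type: str  # e.g. "xyz" (atomic coordinates)
    major: int         # major version, the "1" in "v1-2"
    minor: int         # minor version, the "2" in "v1-2"


# Assumed pattern, inferred from the example pdb_00001abc_xyz_v1-2.cif.gz.
FILENAME_RE = re.compile(
    r"(pdb_[0-9]{5}[0-9a-z]{3})_([a-z]+)_v([0-9]+)-([0-9]+)\.cif\.gz"
)


def parse_versioned_filename(name: str) -> Optional[VersionedFile]:
    """Parse a versioned filename, returning None if it does not match."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        return None
    accession, content_type, major, minor = match.groups()
    return VersionedFile(accession, content_type, int(major), int(minor))


if __name__ == "__main__":
    print(parse_versioned_filename("pdb_00001abc_xyz_v1-2.cif.gz"))
```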
+==Digital Object Identifiers (DOI) for PDB Entries==
+[http://www.wwpdb.org/news/news?year=2021#607760112786e73a79c76f9d Each PDB entry is accessible through a DOI]. For example, 6ef8 is accessible as [http://doi.org/10.2210/pdb6ef8/pdb doi.org/10.2210/pdb6ef8/pdb].
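The DOI link in the example above can be built from a 4-character code as sketched below. The pattern is generalized from the single example (6ef8) and should be treated as an assumption. The function name <code>pdb_doi_url</code> is hypothetical, and lower-casing the code is also an assumption.

```python
import re

# Classic 4-character code: first character 1-9, then three numerals or letters.
CLASSIC_RE = re.compile(r"[1-9][0-9A-Za-z]{3}")


def pdb_doi_url(code: str) -> str:
    """Build a DOI link in the style of doi.org/10.2210/pdb6ef8/pdb."""
    if CLASSIC_RE.fullmatch(code) is None:
        raise ValueError(f"not a 4-character PDB code: {code!r}")
    # Assumption: the DOI uses the lower-case form of the code.
    return f"http://doi.org/10.2210/pdb{code.lower()}/pdb"


if __name__ == "__main__":
    print(pdb_doi_url("6ef8"))  # http://doi.org/10.2210/pdb6ef8/pdb
```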
 ==See Also==
 * [[PDB file format]]
-*[http://pdbwiki.org/index.php/PDB_code PDB code at pdbwiki.org]
+* [[User:Eric Martz/Entertaining PDB codes|Entertaining PDB codes]]
+==References==
+<references />