Gene naming for peanut
Draft, 2014-05-23

Objective: replace peanut MAKER names with more usable ones, e.g. first column to second column:

  snap_masked-Araip.B10-processed-gene-955.6-mRNA-1    Araip.Q49I.K30076.a1.M1
  snap_masked-Araip.B10-processed-gene-96.35-mRNA-1    Araip.8QIG.K30076.a1.M1

Naming scheme:

The naming scheme will basically follow the pattern that JGI and Phytozome have been using recently on genomes produced and annotated by JGI, with a couple of modifications.

The JGI scheme has:
  o A species prefix, e.g. Glyma. for Glycine max
  o A central unique identifier, e.g. 10g000100 or 01g000100
  o Optionally, a splice variant number, e.g. .1 .2 .3
  o A namespace, conveying genotype, assembly, and annotation versions, e.g. .Wm82.a2.v1 . This may be joined to the central ID, or carried along on the side.

For example:
  o Glyma.01g000100.1    ID=Glyma.01g000100.1.Wm82.a2.v1
  o Glyma.10g000100.1    ID=Glyma.10g000100.1.Wm82.a2.v1

For peanut, we will make two changes:

1) Instead of a central unique identifier with chromosome and ordinal number (01g000100 is the first gene on chr 1; 10g298900 is the last gene on chr 10), we will use a short random string that doesn't convey chromosome or position. The reason is that the Arachis assemblies are quite fragmented, and the relative locations of many genes will change in subsequent versions. The chromosomal locations of some genes will also change - for example, where scaffold breaks are not precisely correct or where new scaffolds are incorporated.

2) In the annotation version (v1), we will include a short mnemonic to indicate annotation method or source, e.g. M1 for MAKER or vB1 for BGI.

The unique gene/locus identifier will be a randomly generated five-character string composed from digits 0-9 and letters A-Z (omitting letter O to avoid confusion with zero; and including at least one number internally, to avoid words).

Splice variants will be indicated with an integer suffix on the identifier (1,2,3, ...).
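
The identifier rules above could be implemented along the following lines. This is only a sketch, not the procedure actually used: the function names (is_valid_core_id, new_core_id, assign_core_ids) are made up for illustration, and "at least one number internally" is assumed here to mean a digit in one of the three interior positions (not the first or last character).

```python
import random
import string

# Alphabet from the text: digits 0-9 plus A-Z without the letter O.
ALPHABET = string.digits + string.ascii_uppercase.replace("O", "")
ID_LENGTH = 5


def is_valid_core_id(core_id):
    """Check length, alphabet, and the interior-digit rule.

    Assumption: "internally" means positions 2..4 of the five characters.
    """
    return (
        len(core_id) == ID_LENGTH
        and all(ch in ALPHABET for ch in core_id)
        and any(ch.isdigit() for ch in core_id[1:-1])
    )


def new_core_id(used, rng=random):
    """Draw random strings until one is valid and not already in `used`."""
    while True:
        candidate = "".join(rng.choice(ALPHABET) for _ in range(ID_LENGTH))
        if is_valid_core_id(candidate) and candidate not in used:
            used.add(candidate)
            return candidate


def assign_core_ids(n, seed=None):
    """Return n distinct identifiers; a seed makes the run repeatable."""
    rng = random.Random(seed)
    used = set()
    return [new_core_id(used, rng) for _ in range(n)]
```

Keeping the set of already-issued identifiers is what guarantees uniqueness; in practice that set would also need to hold identifiers from earlier annotation rounds so that retired IDs are not reused.
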

The namespace of a set of gene models will be distinguished with a string consisting of the genotype, assembly version, and annotation version (the annotation version including a software abbreviation and a number), e.g. V14167.a1.M1 where genotype=V14167, assembly=a1, annotation=M1 (M for MAKER, B for BGI, etc.; mnemonics assigned first-come, first-served; could be longer than one character if need be).

The prefix will be the species name abbreviated in a 3-2 pattern (three letters from the genus, two from the species), e.g. Aradu or Araip for Arachis duranensis and Arachis ipaensis, respectively. Following the pattern of gene IDs from JGI, a period will separate the species prefix and the unique locus identifier: "Aradu." or "Araip."

The namespace string should be associated with the locus or gene model identifier to avoid uncertainty about genotype, assembly, or annotation. This association could be made by concatenating the gene model or locus and namespace, e.g. Aradu.F0U1A.V14167.a1.M1, or the namespace string could be reported in a paper or a genome browser track, as long as there is no ambiguity about namespace.

Examples:
  o Aradu.G88RX                   Locus - associated contextually with annotation namespace V14167.a1.M1
  o Aradu.G88RX.V14167.a1.M1      Locus - associated with namespace by concatenation
  o Araip.7JI3A.1                 Gene model, splice variant 1, associated contextually with annotation namespace V14167.a1.M1
  o Araip.7JI3A.1.V14167.a1.M1    Gene model, splice variant 1, associated by concatenation with annotation namespace

Efforts will be made to keep the core locus/gene ID stable across assemblies and annotations, but the scheme also allows for addition, removal, or renaming where warranted. In general, distinct species will receive distinct IDs; no attempt will be made to assign the same IDs to orthologs, because orthology relationships may be complex due to private duplications and losses.
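
To make the layout of a full name concrete, here is a small sketch that composes and parses names of the form prefix.locus[.splice][.genotype.assembly.annotation]. The names GeneName, NAME_RE, format_name, and parse_name are hypothetical, and the pattern encodes assumptions that go beyond the text: the assembly version is taken to be "a" plus digits, the annotation version letters plus digits, and the interior-digit rule for locus IDs is not checked.

```python
import re
from typing import NamedTuple, Optional


class GeneName(NamedTuple):
    prefix: str                 # 3-2 species prefix, e.g. Aradu
    core_id: str                # five-character locus identifier
    splice: Optional[int]       # splice variant number, if present
    genotype: Optional[str]     # e.g. V14167
    assembly: Optional[str]     # e.g. a1
    annotation: Optional[str]   # e.g. M1


# Assumed grammar; only the overall field order comes from the text.
NAME_RE = re.compile(
    r"^(?P<prefix>[A-Z][a-z]{4})"
    r"\.(?P<core_id>[0-9A-NP-Z]{5})"      # digits and A-Z without O
    r"(?:\.(?P<splice>[0-9]+))?"
    r"(?:\.(?P<genotype>[A-Za-z0-9]+)"
    r"\.(?P<assembly>a[0-9]+)"
    r"\.(?P<annotation>[A-Za-z]+[0-9]+))?$"
)


def format_name(prefix, core_id, splice=None, namespace=None):
    """Join the parts with periods; splice and namespace are optional."""
    parts = [prefix, core_id]
    if splice is not None:
        parts.append(str(splice))
    if namespace:
        parts.append(namespace)
    return ".".join(parts)


def parse_name(name):
    """Split a name into its fields, or raise ValueError if it does not fit."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError("not a recognized gene name: " + name)
    fields = match.groupdict()
    splice = int(fields["splice"]) if fields["splice"] is not None else None
    return GeneName(fields["prefix"], fields["core_id"], splice,
                    fields["genotype"], fields["assembly"], fields["annotation"])


print(parse_name("Araip.7JI3A.1.V14167.a1.M1"))
print(format_name("Aradu", "G88RX", namespace="V14167.a1.M1"))
```

When the namespace is carried contextually rather than by concatenation, the genotype, assembly, and annotation fields come back as None, and the caller has to supply them from the paper or browser track.
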