A standard set of annotation and assembly files released as part of a Phytozome release.

Organism = Glycine max Wm82 ISU-01
NCBI-Taxonomy-ID = 3847
Assembly-version = v2.0
Annotation-version = v2.1

All FASTA and GFF3 files are compressed with gzip to reduce file size for faster downloads.

Note: the number 724 in all file names is a Phytozome internal identifier for the current release of this genome annotation and can be safely ignored.

Notes on the JGI locus/gene naming convention: starting in 2013, the JGI plant genome group began to use the following naming patterns:
a) Prefix.NGn for a stable chromosome-scale genome assembly, example: Glyma.01G000100;
b) Prefix.Zn for a chromosome-scale genome assembly, example: Eucgr.A00001, Pavir.Aa00001;
c) Prefix.Nsn for a fragmented genome assembly, example: Pahal.0001s0001
where N is the chromosome number in a) or the scaffold/contig number in c), Z is a letter representing a chromosome number (A for 1, B for 2, and so on) in b), and n is the locus/gene number on a chromosome or scaffold/contig. In both a) and b), the letter after the last chromosome represents all scaffolds/contigs that are not mapped to a chromosome, for example Glyma.U045300 (soybean has 20 chromosomes, and the 21st letter of the alphabet is U). In a polyploid genome, the chromosome number can carry a letter representing the subgenome, so N and Z may have one more letter, as in Pavir.Aa00001. The digit width of N in a) varies from 1 to 3, the dot (period) is optional, and Prefix is derived from the organism's genus and species or chosen by the community.

A transcript name is the locus name plus "." plus the transcript number, for example Glyma.01G000100.1. Initially the transcript numbered 1 is the longest, but in subsequent gene annotations transcript 1 can be lost or no longer be the longest. The longest transcript should be looked up in the GFF3 file using the attribute longest=1.
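
The naming patterns a) to c) above can be approximated with regular expressions. The following Python sketch is only an illustration, not an official JGI parser: the function name parse_name, the pattern labels, and the assumption that Prefix consists of letters only are all hypothetical choices made for this example. Note that an unmapped locus in scheme a), such as Glyma.U045300, has the same shape as a scheme b) name and cannot be told apart by name alone.

```python
import re

# Assumption for this sketch: Prefix is made of letters only.
# a) Prefix.NGn, e.g. Glyma.01G000100 (N = 1-3 digits, optional subgenome letter)
STABLE = re.compile(
    r"^(?P<prefix>[A-Za-z]+)\.?(?P<unit>\d{1,3}[A-Za-z]?)G(?P<number>\d+)$"
)
# c) Prefix.Nsn, e.g. Pahal.0001s0001 (N = scaffold/contig number)
FRAGMENTED = re.compile(
    r"^(?P<prefix>[A-Za-z]+)\.?(?P<unit>\d+)s(?P<number>\d+)$"
)
# b) Prefix.Zn, e.g. Eucgr.A00001 or Pavir.Aa00001 (Z = chromosome letter,
# optional lowercase subgenome letter). Unmapped loci such as Glyma.U045300
# also match this shape.
LETTERED = re.compile(
    r"^(?P<prefix>[A-Za-z]+)\.?(?P<unit>[A-Z][a-z]?)(?P<number>\d+)$"
)

PATTERNS = [("stable", STABLE), ("fragmented", FRAGMENTED), ("lettered", LETTERED)]

# A transcript name is the locus name plus "." plus digits.
TRANSCRIPT_SUFFIX = re.compile(r"^(?P<locus>.+)\.(?P<transcript>\d+)$")


def parse_name(name):
    """Split a locus or transcript name into parts; return None if nothing fits."""
    transcript = None
    suffix = TRANSCRIPT_SUFFIX.match(name)
    if suffix:
        name = suffix.group("locus")
        transcript = int(suffix.group("transcript"))
    for kind, pattern in PATTERNS:
        match = pattern.match(name)
        if match:
            return {
                "kind": kind,
                "locus": name,
                "prefix": match.group("prefix"),
                "unit": match.group("unit"),      # chromosome or scaffold part
                "number": match.group("number"),  # locus/gene number, kept as text
                "transcript": transcript,         # None for a bare locus name
            }
    return None
```

For example, parse_name("Glyma.01G000100.1") yields the locus Glyma.01G000100 with transcript number 1. The locus number is kept as a string so that leading zeros survive.
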
Initially, locus/gene numbers are ordered and increase by 100 from left to right along the chromosome in a), but this is very likely to change when the genome assembly and/or gene annotation is updated. Please refer to external sources for the naming patterns of third-party gene sets in Phytozome.

Files in the annotation subdirectory:

1) GmaxWm82ISU_01_724_v2.1.annotation_info.txt
   A summary of annotation details available in Phytozome. This is a tab-delimited file with the following columns (note: columns are blank if no corresponding data is available):
    1: Phytozome internal transcript ID (potentially useful to connect to biomart datasets)
    2: Phytozome gene locus name
    3: Phytozome transcript name
    4: Phytozome protein name (often same as transcript name, but this can vary)
    5: PFAM
    6: Panther
    7: KOG
    8: KEGG ec
    9: KEGG Orthology
   10: Gene Ontology terms (NOTE: these are automated results from interpro2go in most genomes, *not* empirically derived)
   The rest of the columns are for best hits (please read the file header).
   "Best hits" are defined as the top result returned from a BLASTP alignment of this species' proteome to the target (A. thaliana, O. sativa, or C. reinhardtii listed above). This was run with blast+ 2.2.26 with parameters:
   blastall -p blastp -F "mS" -b 1500 -v 1500 -e 0.001 -M BLOSUM45
   and further filtered with a 1E-3 e-value cutoff.
   For newer gene set releases, "best hits" may instead be obtained from inparanoid computed against the target(s) rather than from BLASTP.

2) GmaxWm82ISU_01_724_v2.1.cds.fa.gz and GmaxWm82ISU_01_724_v2.1.cds_primaryTranscriptOnly.fa.gz
   Nucleotide FASTA format file of all gene coding sequences, with or without alternative splice variants

3) GmaxWm82ISU_01_724_v2.1.protein.fa.gz and GmaxWm82ISU_01_724_v2.1.protein_primaryTranscriptOnly.fa.gz
   Amino acid FASTA format file of all translated gene coding sequences, with or without alternative splice variants

4) GmaxWm82ISU_01_724_v2.1.transcript.fa.gz and GmaxWm82ISU_01_724_v2.1.transcript_primaryTranscriptOnly.fa.gz
   Nucleotide FASTA format file of spliced mRNA transcripts (UTR, exons), with or without alternative splice variants

5) GmaxWm82ISU_01_724_v2.1.gene.gff3.gz
   GFF3 format representation of all mRNA sequences (UTR, CDS). Genomic coordinates are relative to the reference sequence in column 1

6) GmaxWm82ISU_01_724_v2.1.gene_exons.gff3.gz
   GFF3 format representation of all mRNA sequences as above, but with exon subfeatures. Genomic coordinates are relative to the reference sequence in column 1

7) GmaxWm82ISU_01_724_v2.1.synonym.txt
   Tab-delimited list of all gene symbols/synonyms for the Phytozome transcript in the first column.

8) GmaxWm82ISU_01_724_v2.1.defline.txt
   Tab-delimited list of all deflines, or provisional deflines (pdef type in second column), for the Phytozome transcript in the first column. Provisional deflines (aka auto deflines) are generated by an automated method that prioritizes KEGG/EC over Panther over KOG over PFAM annotations, and then chooses the one with the lowest multiplicity (M), where M indicates how many other genes in this genome have the same annotation.

9) GmaxWm82ISU_01_724_v2.1.locus_transcript_name_map.txt
   Tab-delimited list of locus names and transcript names for the Phytozome gene models (column headers in file).

10) GmaxWm82ISU_01_724_v2.1.repeatmasked_assembly_v2.0.gff3.gz
   Repeat annotation in GFF3 format, mostly from RepeatMasker, some from MerMasking, and some derived from the masked genome FASTA

-----

Files in the assembly subdirectory:

1) GmaxWm82ISU_01_724_v2.0.fa.gz
   Nucleotide FASTA format of the current genomic assembly

2) GmaxWm82ISU_01_724_v2.0.softmasked.fa.gz
   GmaxWm82ISU_01_724_v2.0.hardmasked.fa.gz
   Nucleotide FASTA format of the current genomic assembly, masked for repetitive sequence by RepeatMasker (softmasked sequence is in lower case; hardmasked replaces masked sequence with Ns).

-----

Files in the additional subdirectories (expression, diversity, etc.) are releases of data related to the current annotation, and are not always available for all organisms.
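
As an illustration of working with the gzip-compressed GFF3 files listed above, the following Python sketch collects the transcripts flagged with the attribute longest=1. It is a minimal example under stated assumptions, not a Phytozome tool: the function names are hypothetical, and it assumes that transcripts appear as rows of type mRNA carrying Parent and Name (or ID) attributes, which should be checked against the actual file.

```python
import gzip


def parse_attributes(column9):
    """Turn a GFF3 attribute string such as 'ID=x;longest=1' into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def longest_transcripts(gff3_gz_path):
    """Map each parent gene ID to the name of its transcript with longest=1."""
    longest = {}
    # "rt" opens the gzip file in text mode, so lines arrive as str.
    with gzip.open(gff3_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip directives and comments
            columns = line.rstrip("\n").split("\t")
            # Assumption: transcripts are stored as feature type "mRNA".
            if len(columns) != 9 or columns[2] != "mRNA":
                continue
            attributes = parse_attributes(columns[8])
            if attributes.get("longest") == "1":
                name = attributes.get("Name", attributes.get("ID"))
                longest[attributes.get("Parent")] = name
    return longest
```

Calling longest_transcripts("GmaxWm82ISU_01_724_v2.1.gene.gff3.gz") would return a dictionary keyed by parent gene ID. Reading line by line keeps memory use low even for large annotation files.
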