---
file_transformation:
  - # Simplify gene deflines - RNA
  - perl -pe 's/^>(\S+)\s+.+OriSeqID=(\S+)\s+OriID=(\S+)\s+OriGeneID=(\S+)/>$3 $4 $2 $1/' GWHAAEV00000000.RNA.fasta > glyma.Zh13.gnm2.ann1.FJ3G.cds.fna
  - # Simplify gene deflines - protein
  - perl -pe 's/^>(\S+)\s+.+OriTrascriptID=(\S+)\s+OriGeneID=(\S+)\s+OriSeqID=(\S+)/>$2 $3 $4 $1/' GWHAAEV00000000.Protein.faa > glyma.Zh13.gnm2.ann1.FJ3G.protein.faa
  - # Get a hash of GWH IDs and DataStore names from assembly
  - grep '>' ../Zh13.gnm2.LV9P/glyma.Zh13.gnm2.LV9P.genome_main.fna | perl -pe 's/>glyma.Zh13.gnm2.(\S+)\s+(\S+)/$2\t$1/' > hsh.ref_GWH_DS
  - # Hash DataStore names into GFF
  - perl getGffID.pl hsh.ref_GWH_DS GWHAAEV00000000.1.gff | perl -pe 's/;Accession.+//; s/;Parent_Accession=.+//i' > glyma.Zh13.gnm2.ann1.FJ3G.gff
  - # Create GFF3 file for repeat sequences
  - awk 'NR>3 {print}' 11427_2019_9822_MOESM6_ESM.txt | perl convertRepeatPostionToGFF.pl | perl -pe 's/^/glyma.gnm2./ if (! /##/)' > glyma.Zh13.gnm2.ann1.FJ3G.transposon_repeat.gff

changes:
  - 2021-03-26 Initial repository creation
  - 2021-05-25 remove snoRNA, tRNA and miRNA from the cds file
  - gzip -dc glyma.Zh13.gnm2.ann1.FJ3G.cds.fna.gz | fasta_to_zero_lines.awk | grep -v "RNA" | perl -pe 's/(>\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/$1 $2 $3 $4\n$5/' > glyma.Zh13.gnm2.ann1.FJ3G.cds.fna
  - gzip -dc glyma.Zh13.gnm2.ann1.FJ3G.cds_primaryTranscript.fna.gz | fasta_to_zero_lines.awk | grep -v "RNA" | perl -pe 's/(>\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/$1 $2 $3 $4\n$5/' > glyma.Zh13.gnm2.ann1.FJ3G.cds_primaryTranscript.fna
  - 2021-07-27 adf: added transcript and transcript_primaryTranscript files (via gffread).
      Note that since the gene models file contains ncRNA, these have also gone into these
      files (in addition to mRNA); so the counts of sequence entries in transcript files will
      not match protein/cds, which correspond only to mRNA.
      (fixes https://github.com/legumeinfo/datastore-issues/issues/46)
  - 2021-07-27 adf: :g/Name=\([^;]*\);Name=/s//Name=\1;Alias=/ (fixes https://github.com/legumeinfo/datastore-issues/issues/47)
  - 2022-01-29 adf: manually adjusted first bp of several ShortStack genes so that they span
      their children and can obey sorting rules. Also :g/_miRNA_\([0-9]*\)_mRNA/s//_miRNA_\1/g
      due to orphaned children which seemed to be following different naming rules than their
      presumed parents (I hope fixes: https://github.com/legumeinfo/datastore-issues/issues/81)
  - 2022-12-22 sc: strip length strings from gene IDs in bed file, e.g. SoyZH13_06G073300.m1.567 --> SoyZH13_06G073300.m1
  - 2023-02-12 adf: fixed non-unique CDS IDs per https://github.com/legumeinfo/datastore-issues/issues/153
      using datastore-specifications/scripts/add_IDs_to_gff_features.pl --clobber CDS
  - 2023-06-08 adf: add AHRD with GO/IPR in descriptors
  - 2023-07-18 adf: re-sort to fix https://github.com/legumeinfo/datastore-issues/issues/172