---
file_transformation: 
  - # Simplify gene deflines - RNA
  - perl -pe 's/^>(\S+)\s+.+OriSeqID=(\S+)\s+OriID=(\S+)\s+OriGeneID=(\S+)/>$3 $4 $2 $1/' GWHAAEV00000000.RNA.fasta > glyma.Zh13.gnm2.ann1.FJ3G.cds.fna
  - # Simplify gene deflines - protein
  - perl -pe 's/^>(\S+)\s+.+OriTrascriptID=(\S+)\s+OriGeneID=(\S+)\s+OriSeqID=(\S+)/>$2 $3 $4 $1/' GWHAAEV00000000.Protein.faa > glyma.Zh13.gnm2.ann1.FJ3G.protein.faa
  - # Get a hash of GWH IDs and DataStore names from assembly
  - grep '>' ../Zh13.gnm2.LV9P/glyma.Zh13.gnm2.LV9P.genome_main.fna | perl -pe 's/>glyma.Zh13.gnm2.(\S+)\s+(\S+)/$2\t$1/' > hsh.ref_GWH_DS
  - # Hash DataStore names into GFF
  - perl getGffID.pl hsh.ref_GWH_DS GWHAAEV00000000.1.gff |perl -pe 's/;Accession.+//; s/;Parent_Accession=.+//i' > glyma.Zh13.gnm2.ann1.FJ3G.gff
  - # Create GFF3 file for repeat sequences
  - awk 'NR>3 {print}' 11427_2019_9822_MOESM6_ESM.txt |perl convertRepeatPostionToGFF.pl |perl -pe 's/^/glyma.gnm2./ if (! /##/)' >glyma.Zh13.gnm2.ann1.FJ3G.transposon_repeat.gff

changes:
  - 2021-03-26 Initial repository creation
  - 2021-05-25 remove snoRNA, tRNA and miRNA from the cds file
  - gzip -dc glyma.Zh13.gnm2.ann1.FJ3G.cds.fna.gz |fasta_to_zero_lines.awk |
    grep -v "RNA" |perl -pe 's/(>\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/$1 $2 $3 $4\n$5/' >glyma.Zh13.gnm2.ann1.FJ3G.cds.fna
  - gzip -dc glyma.Zh13.gnm2.ann1.FJ3G.cds_primaryTranscript.fna.gz |fasta_to_zero_lines.awk |
    grep -v "RNA" |perl -pe 's/(>\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/$1 $2 $3 $4\n$5/' >glyma.Zh13.gnm2.ann1.FJ3G.cds_primaryTranscript.fna
  - 2021-07-27 adf: added transcript and transcript_primaryTranscript files (via gffread). Note that because the gene models file contains ncRNA, the transcript files include ncRNA as well as mRNA, so the counts of sequence entries in the transcript files will not match those in the protein/cds files, which correspond only to mRNA. (fixes https://github.com/legumeinfo/datastore-issues/issues/46)
  - 2021-07-27 adf: applied :g/Name=\([^;]*\);Name=/s//Name=\1;Alias=/ so that a duplicated Name attribute becomes Alias (fixes https://github.com/legumeinfo/datastore-issues/issues/47)
  - 2022-01-29 adf: manually adjusted the first bp of several ShortStack genes so that they span their children and obey sorting rules. Also applied :g/_miRNA_\([0-9]*\)_mRNA/s//_miRNA_\1/g to fix orphaned children, which seemed to follow different naming rules than their presumed parents (hopefully fixes https://github.com/legumeinfo/datastore-issues/issues/81)
  - 2022-12-22 sc: strip length strings from gene IDs in bed file, e.g. SoyZH13_06G073300.m1.567 --> SoyZH13_06G073300.m1
  - 2023-02-12 adf: fixed non-unique CDS IDs per https://github.com/legumeinfo/datastore-issues/issues/153 using datastore-specifications/scripts/add_IDs_to_gff_features.pl --clobber CDS
  - 2023-06-08 adf: add AHRD with GO/IPR in descriptors
  - 2023-07-18 adf: re-sort to fix https://github.com/legumeinfo/datastore-issues/issues/172


  - 2025-04-16 adf: add glyma.Zh13.gnm2.ann1.FJ3G.legume.fam3.VLMQ.gfa.tsv.gz