MaizeSequence.org FTP Site
==========================

This site provides access to the latest sequenced maize data. The site is
part of the NSF-funded Maize Genome Sequencing Project.

+----------------------------------------------+
|                  Release 5b                  |
+--------------------------+-------------------+
| Assembly Date            | March 8, 2010     |
| Assembly Version         | RefGen_v2         |
| BAC clones               | 16,084            |
| BAC contigs              | 172,445           |
| Genome Length            | 2,066,432,718 bp  |
| Total DNA                | 3,232,254,451 bp  |
+--------------------------+-------------------+
| Working Gene Set*        | 110,028           |
| Transcripts              | 136,770           |
+--------------------------+-------------------+
| Filtered Gene Set        | 39,656            |
| Transcripts              | 63,540            |
+--------------------------+-------------------+

* - The Working Gene Set: This set merges new annotations performed on
    RefGen_v2 with 4a gene models mapped onto the new reference. New
    annotations were achieved by an evidence-based method (Gramene
    GeneBuilder) and complemented with de novo Fgenesh models performed
    on masked DNA. Where new gene models overlapped with mapped 4a
    models the annotations were compared by length and homology to
    choose the best gene. In all, 11661 models were improved (but still
    inherited the 4a gene ID). A total of 9718 gene models are novel,
    sharing no overlap with previous models.

--------
CONTENTS

All sequence dumps or other large files have been compressed (GZip) both
for space constraints and for faster downloads.

Genome Sequence
---------------

The maize genome can be found in:

  assembly/

    ZmB73_RefGen_v2.fasta.gz
        The entire archive of the maize genome sequence.

    ZmB73_RefGen_v2_chr*.fasta.gz
        Individual chromosome files (unmasked). Note that chrUNKNOWN is
        an artificial collection of unanchored BAC clones.

    ZmB73_RefGen_v2.agp.gz
        The accessioned golden path (AGP) describing how the genome is
        assembled through sequenced BACs.

    ZmB73_RefGen_v2_chr*.agp.gz
        The AGP data for individual chromosomes.

  masked/

    ZmB73_RefGen_v2.masked.fasta.gz
        The genome assembly masked by TE repeats.

    ZmB73_RefGen_v2_chr*.masked.fasta.gz
        Individual chromosomes masked by TE repeats.

Gene Sequences
--------------

The Working Gene Set is represented in various dumps. It is composed of
both evidence-based genes and ab initio genes (predicted by Fgenesh).
Gene IDs have been projected from the previous version (Build 4a) when
possible. Projected gene IDs have a mixed format based on the prediction
method, while all genes unassociated with previous predictions conform
to a uniform ID naming convention. The ID formats are:

+--------------+------------------+-------------------------+
|              | GeneBuilder      | Fgenesh                 |
+--------------+------------------+-------------------------+
| Genes        | GRMZMXXXXXXX     | [CloneName]_FGXXX       |
| Transcripts  | GRMZMXXXXXXX_TYY | [CloneName]_FGTXXX      |
| Translations | GRMZMXXXXXXX_PYY | [CloneName]_FGPXXX      |
| Exons        | GRMZMXXXXXXX_EYY | [CloneName]_FGXXX.exonY |
+--------------+------------------+-------------------------+

* GRMZM2GXXXXXXXX are Gramene evidence-based genes projected from
  Build 4a.
* GRMZM5GXXXXXXXX are newly built genes in Build 5a.

The Working Gene Set:

  working-set/

    ZmB73_5a_WGS_info.txt
        A tab-delimited table describing genes, transcripts, and various
        bits of useful information, including location and
        classification.

    ZmB73_5a_WGS.gff.gz
        A GFF3 dump of the genes along with their underlying structure.

    ZmB73_5a_WGS_genes.fasta.gz
        Genomic sequences of the Working Gene Set.

    ZmB73_5a_WGS_genes_500.fasta.gz
        Genomic sequences of the Working Gene Set with 500 bp of
        flanking context on both sides.

    ZmB73_5a_WGS_cdna.fasta.gz
        cDNA sequences of the Working Gene Set.

    ZmB73_5a_WGS_cds.fasta.gz
        CDS sequences of the Working Gene Set.

    ZmB73_5a_WGS_pre_mrna.fasta.gz
        Genomic sequences of the Working Gene Set with annotated
        exon-intron structure (introns are soft-masked, i.e., lowercase)
        for each alternate transcript.

    ZmB73_5a_WGS_translations.fasta.gz
        Peptide sequences of the Working Gene Set.

    4a_discontinued_ids.txt
        Table listing the 4a gene IDs that are discontinued in the 5a
        working gene set. Such genes either failed to map to RefGen_v2
        or received a new ID because their annotation was merged with
        another gene model. In the latter case the 4a gene model can be
        mapped to a new or pre-existing gene ID in the 5a working set.
        Mapping was achieved either by direct coordinate projection or
        by alignment. Alignment mappings require at least 99% identity
        and 98% coverage, and the aligned position on RefGen_v2 must be
        on the same BAC as in RefGen_v1 or on an adjacent one. Since the
        alignment method relied to some extent on manual curation, we
        applied it only to filtered set genes that could not be mapped
        by coordinates. In all, 2207 4a gene IDs are not accounted for
        in RefGen_v2, of which 165 were in the filtered set.

  filtered-set/

    ZmB73_5b_FGS_info.txt
        A tab-delimited table describing genes, transcripts, and various
        bits of useful information, including location and
        classification.

    ZmB73_5b_WGS_to_FGS.txt
        A description of the fate of all the genes in the Working Gene
        Set. Column 2 describes a gene's primary reason for inclusion in
        ('syntelog', 'ortholog', or 'other') or exclusion from
        ('probable_pseudogene', 'possible_transposon', 'contamination',
        or 'low_confidence') the Filtered Gene Set (FGS). Columns 4 and
        5 state whether a filtered gene is possibly transposable or a
        pseudogene, respectively. Finally, Column 6 states whether a
        gene was in the Filtered Gene Set of Release 4a (RefGen_v1).

    ZmB73_5b_FGS.gff.gz
        A GFF3 dump of the genes along with their underlying structure.

    ZmB73_5b_FGS_genes.fasta.gz
        Genomic sequences of the Filtered Gene Set.

    ZmB73_5b_FGS_genes_500.fasta.gz
        Genomic sequences of the Filtered Gene Set with 500 bp of
        flanking context on both sides.

    ZmB73_5b_FGS_cdna.fasta.gz
        cDNA sequences of the Filtered Gene Set.

    ZmB73_5b_FGS_cds.fasta.gz
        CDS sequences of the Filtered Gene Set.

    ZmB73_5b_FGS_pre_mrna.fasta.gz
        Genomic sequences of the Filtered Gene Set with annotated
        exon-intron structure (introns are soft-masked, i.e., lowercase)
        for each alternate transcript.

    ZmB73_5b_FGS_translations.fasta.gz
        Peptide sequences of the Filtered Gene Set.

The format of each sequence comment within the gene files is:

  >OBJECT_ID CLONE:START:END:STRAND:ANALYSIS:CLASSIFICATION

  OBJECT_ID
      The unique ID of the gene, transcript, translation, or exon.

  CLONE, START, END, STRAND
      Locus information about the specific object.

  ANALYSIS
      The prediction method used to annotate this object.

  CLASSIFICATION
      The assigned class based on homology to peptides in the NR
      database:

      protein_coding
          Significant homology to a known non-TE protein
      transposon_pseudogene
          Significant homology to a known TE protein
      protein_coding_unsupported
          No significant homology, i.e., hypothetical

Functional Annotations
----------------------

InterPro and Gene Ontology (GO) annotations for genes are provided in
various file dumps.

  functional-annotations/

    ZmB73_5a_xref.tar.gz
        Xref mappings for various external data sets, including InterPro
        ID assignments.

    ZmB73_5a_gene_descriptors.txt.gz
        High-level gene descriptors based on high confidence Xref
        mappings.

Known Gene Loci
---------------

Gene models associated with external databases through the Xref system
were matched against a curated list of known gene loci maintained by
MaizeGDB. For more information, visit
http://maizesequence.org/info/docs/namedgenes.html

  known-genes/

    maizegdb_loci.txt
        Known gene loci associated with GenBank accessions.

    ZmB73_5a_named_genes.txt
        Known maize gene names associated with the Working Gene Set.

Repeats
-------

Repeat annotations were generated by RepeatMasker using two libraries,
MIPS/REcat and the TE Consortium repeat family exemplars.

  repeats/

    ZmB73_5a_MIPS_repeats.gff.gz
        GFF3 dump of the MIPS/REcat repeat annotations

    ZmB73_5a_MTEC_repeats.gff.gz
        GFF3 dump of the TE Consortium repeat annotations

    ZmB73_5a_MTEC+LTR_repeats.gff.gz
        GFF3 dump of the TE Consortium repeats, with new LTR exemplars.

Our website is located at:
http://maizesequence.org

For more information, or to make requests for specific file dumps,
please contact us at:
info@maizesequence.org

2010-12-02: Added list of known gene loci (see above).
2011-02-07: Added the Filtered Gene Set.
2011-02-08: Added a table describing the fate of the 5a Working Gene Set.
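
Example: reading a sequence comment
-----------------------------------

As an illustration of the sequence comment format described under Gene
Sequences, the following Python sketch splits one comment line into its
fields. It is not part of the site's tooling. The names GeneHeader and
parse_header and the example line are hypothetical, and the sketch
assumes that no field contains a colon and that START and END are
integers. STRAND is kept as text because its encoding is not stated
here.

```python
from typing import NamedTuple


class GeneHeader(NamedTuple):
    """Fields of '>OBJECT_ID CLONE:START:END:STRAND:ANALYSIS:CLASSIFICATION'."""
    object_id: str
    clone: str
    start: int
    end: int
    strand: str          # kept as text; the encoding is not documented here
    analysis: str
    classification: str


def parse_header(line: str) -> GeneHeader:
    """Split one FASTA comment line into its documented fields.

    Assumes no field contains ':' and that START and END are integers.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA comment line: %r" % line)
    # OBJECT_ID is separated from the locus block by whitespace.
    object_id, _, locus = line[1:].strip().partition(" ")
    fields = locus.strip().split(":")
    if len(fields) != 6:
        raise ValueError("expected 6 colon-separated fields, got %d" % len(fields))
    clone, start, end, strand, analysis, classification = fields
    return GeneHeader(object_id, clone, int(start), int(end),
                      strand, analysis, classification)


if __name__ == "__main__":
    # Made-up comment line, for illustration only (not a real record).
    example = ">EXAMPLE_T01 CLONE1:100:2500:1:example_analysis:protein_coding"
    print(parse_header(example))
```

Returning a named tuple keeps the field names aligned with the format
description, so a caller can filter on, for example,
header.classification without remembering field positions.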