RepetDB Data Submission

RepetDB provides two entry points to REPET-generated datasets.

The first entry point is the RepetDB InterMine web application, which provides a search and visualization interface centered on the repeat consensuses, their classification and their features.

The second (optional) entry point is the RepetDB Track Hub which provides repeat consensus copies on a genome browser.

Each of these entry points requires metadata and data files to be prepared and then integrated.

I. RepetDB InterMine dataset

To prepare a RepetDB dataset, gather all the data files in one folder and write a project.xml file that lists these files and adds metadata about the dataset.

The project.xml file should follow this template:

<source name="[DATASET NAME]" type="repetdb-main">
      <property name="src.data.dir" location="[DATA DIR]"/>

      <!-- Dataset metadata -->
      <property name="file.submission" value="[RELATIVE PATH TO FILE]"/>

      <!-- Consensus fasta sequence -->
      <property name="file.fasta" value="[RELATIVE PATH TO FILE]"/>

      <!-- Consensus classification -->
      <property name="file.classif" value="[RELATIVE PATH TO FILE]"/>

      <!-- Consensus statistics -->
      <property name="file.stat" value="[RELATIVE PATH TO FILE]"/>

      <!-- Consensus copies -->
      <property name="file.gff3.copies" value="[RELATIVE PATH TO FILE]"/>

      <!-- Similarity evidences -->
      <property name="file.gff3.blrn" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.blrtx" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.blrx" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.rdnablrn" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.profiles" value="[RELATIVE PATH TO FILE]"/>

      <!-- Structural evidences -->
      <property name="file.gff3.orf" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.polya" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.ssr" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.tr" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.trna" value="[RELATIVE PATH TO FILE]"/>
</source>

This template shows all possible properties of a RepetDB dataset, but most of them are optional.

The mandatory fields include:

The source name ([DATASET NAME]) which should be a unique identifier for your RepetDB dataset
The dataset data folder ([DATA DIR] in property src.data.dir) which should contain the absolute path to the dataset data folder
The dataset metadata (see “Dataset metadata” section) which can be either
- The minimum metadata (properties taxonomy.id, genome.assembly and contact.name)
- OR the extended metadata file (property file.submission) which should point to the Excel submission file
The mandatory dataset data files (see “Mandatory files” section)
- The consensus fasta sequences file (property file.fasta)
- The consensus classification file (property file.classif)

All [RELATIVE PATH TO FILE] properties should contain a relative path to a data file inside the dataset data folder.

All the other properties point to optional data files.
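
The mandatory fields listed above can be checked before submission. The following Python sketch is only an illustration, not part of RepetDB: the function name check_source is hypothetical, and it assumes the <source> element is available as a string. The property names come from the template and the list above.

```python
import xml.etree.ElementTree as ET

# Property names taken from the template and the mandatory-field list.
MANDATORY_FILES = ("file.fasta", "file.classif")
MINIMUM_METADATA = ("taxonomy.id", "genome.assembly", "contact.name")


def check_source(source_xml: str) -> list:
    """Return the problems found in one <source> element (empty list if none)."""
    source = ET.fromstring(source_xml)
    props = {p.get("name"): p for p in source.findall("property")}
    problems = []

    if not source.get("name"):
        problems.append("missing source name")
    if "src.data.dir" not in props or not props["src.data.dir"].get("location"):
        problems.append("missing src.data.dir location")
    for name in MANDATORY_FILES:
        if name not in props:
            problems.append(f"missing mandatory property {name}")

    # Metadata is either the Excel submission file or the three minimum properties.
    has_submission = "file.submission" in props
    has_minimum = all(name in props for name in MINIMUM_METADATA)
    if not (has_submission or has_minimum):
        problems.append("missing metadata: file.submission or minimum metadata")
    return problems
```

An empty list means that the source name, the data folder, both mandatory files and one of the two metadata options are declared. The sketch does not check that the files exist on disk.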

I.1. Dataset metadata

To submit data to RepetDB, you must also fill in an Excel file providing the metadata about your dataset:

The genome taxonomy identifier from NCBI
The genome assembly name
(optionally) a genome JBrowse link (used to link consensus copies directly on the genome JBrowse)
A contact name
A list of software used (along with a link to the software website)
A list of publications
Comments

All of these metadata must be provided in the Excel submission file using the RepetDB_Submission_Template.xls template file.

For an example of a submission file, check the RepetDB_Sample_Submission.xls file.

Once the submission file is filled in, place it in the dataset data directory and add the following property to the project.xml file:

  <property name="file.submission" value="[SUBMISSION FILE PATH]"/>

Replace [SUBMISSION FILE PATH] with the relative path of the submission file inside the dataset data directory.

I.2. Mandatory files

Two files are mandatory to integrate a new RepetDB dataset:

Consensus multi-fasta file: a multi-fasta file listing all consensus sequences.
Consensus classification file: a TSV (tab separated values) file generated by the REPET package

I.2.a. Consensus multi-fasta file

This file should follow the multi-fasta standard with the consensus identifier as a sequence identifier.

This file should be added to the project.xml file as follows:

  <property name="file.fasta" value="[FASTA FILE PATH]"/>

I.2.b. Consensus classification file

RepetDB integration uses five specific columns from the consensus classification file (their order does not matter):

“Seq_name”: Consensus identifier (same as in the multi-fasta file)
“status”: Miscellaneous consensus classification (“PotentialChimeric”, “PotentialHostGene”, “Virus”, etc.). This field is taken as-is in the “Miscellaneous” field of the consensus.
“class_classif”: Consensus Wicker class.
- Supported values: “I”, “II” and chimeric classes (ex: I|II)
- Unknown values (“NA”, “noCat”) will mark the consensus as “Unclassified”
- Other values are also ignored, but cause a warning in the logs (see logs/integrate.log)
“order_classif”: Consensus Wicker order
- Supported values: “LTR”, “DIRS”, “PLE”, “LINE”, “SINE”, “TIR”, “Crypton”, “Helitron”, “Maverick”, “MITE”, “LARD”, “TRIM” and chimeric orders (ex: “SINE|TRIM”)
- Ignored values: “NA”, “noCat”
- Other values are appended to the “Miscellaneous” field of the consensus.
“superFamily”: Consensus Wicker super family
- Supported values: “Copia”, “Gypsy”, “Bel-Pao”, “Retrovirus”, “ERV”, “DIRS”, “Ngaro”, “VIPER”, “Penelope”, “Crypton”, “R2”, “RTE”, “Jockey”, “L1”, “I”, “tRNA”, “7SL”, “5S”, “Tc1-Mariner”, “hAT”, “Mutator”, “Merlin”, “Transib”, “P”, “PiggyBac”, “PIF-Harbinger”, “CACTA”, “Helitron”, “Maverick” and chimeric super families (ex: “Copia|Gypsy”)
- Ignored values: “NA”, “noCat”
- Other values are appended to the “Miscellaneous” field of the consensus.

Any other columns are ignored by the integration process.
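
The column rules above can be checked on a classification file before integration. The following Python sketch is only an illustration under stated assumptions: both function names are hypothetical, the file is assumed to start with a header line, and the handling of chimeric classes (values joined with “|”) is an interpretation of the “I|II” example.

```python
import csv
import io

# The five column names used by RepetDB, as listed above.
USED_COLUMNS = {"Seq_name", "status", "class_classif", "order_classif", "superFamily"}


def missing_classif_columns(tsv_text: str) -> set:
    """Return the used columns that are absent from the TSV header line."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return USED_COLUMNS - set(reader.fieldnames or [])


def describe_class(value: str) -> str:
    """Describe how a class_classif value is expected to be treated."""
    if value in ("NA", "noCat"):
        return "unclassified"
    # Assumption: a chimeric class is a "|"-separated list of supported classes.
    if all(part in ("I", "II") for part in value.split("|")):
        return "supported"
    return "ignored with warning"
```

Because the header is read by name, the check works whatever the column order is, and any extra columns are simply not reported.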

This file should be added to the project.xml file as follows:

  <property name="file.classif" value="[CLASSIF FILE PATH]"/>

I.3. Optional files

I.3.a. Consensus copy statistics

This TSV file is generated by the REPET software package and should contain the headers as a first line.

The first column should have the “TE” header and contain consensus identifiers (as in the classification file or multi-fasta file). The remaining columns may appear in any order, because RepetDB identifies each one by its header.

26 optional statistical value columns are parsed in RepetDB:

Consensus statistics
- “covg”: The cumulative coverage of the copies on the genome (in base pair)
- “frags”: The number of fragments
- “fullLgthFrags”: The number of full-length fragments
- “copies”: The number of copies
- “fullLgthCopies”: The number of full-length copies
Copies statistics
- Identity: minimum (“minId”), maximum (“maxId”), mean (“meanId”), median (“medId”), 25th percentile (“q25Id”), 75th percentile (“q75Id”) and standard deviation (“sdId”)
- Copies length: minimum (“minLgth”), maximum (“maxLgth”), mean (“meanLgth”), median (“medLgth”), 25th percentile (“q25Lgth”), 75th percentile (“q75Lgth”) and standard deviation (“sdLgth”)
- Copies coverage over consensus: minimum (“minLgthPerc”), maximum (“maxLgthPerc”), mean (“meanLgthPerc”), median (“medLgthPerc”), 25th percentile (“q25LgthPerc”), 75th percentile (“q75LgthPerc”) and standard deviation (“sdLgthPerc”)

Any of the columns above may be omitted; missing columns are simply ignored.
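
The header of a statistics file can be inspected against the 26 column names above. The following Python sketch is only an illustration: the function name inspect_stat_header is hypothetical, and the sketch looks at the header line only, not at the values.

```python
# The 26 statistic column names listed above.
CONSENSUS_COLUMNS = ["covg", "frags", "fullLgthFrags", "copies", "fullLgthCopies"]
COPY_COLUMNS = [
    prefix + suffix
    for suffix in ("Id", "Lgth", "LgthPerc")
    for prefix in ("min", "max", "mean", "med", "q25", "q75", "sd")
]
KNOWN_COLUMNS = CONSENSUS_COLUMNS + COPY_COLUMNS


def inspect_stat_header(header_line: str) -> tuple:
    """Return (first column is "TE", known statistic columns that are present)."""
    columns = header_line.rstrip("\n").split("\t")
    present = [name for name in columns[1:] if name in KNOWN_COLUMNS]
    return columns[0] == "TE", present
```

The second element of the result shows which statistics would be picked up; a short or empty list is not an error, since every statistic column is optional.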

This file should be added to the project.xml file as follows:

  <property name="file.stat" value="[STATISTICS FILE PATH]"/>

I.3.b. GFF3 files

RepetDB can integrate eleven consensus feature files following the GFF3 standard:

Consensus copies (positioned on the consensus)
5 Similarity features (BlastTX, BlastX, BlastN, protein profiles and ribosomal DNA BlastN)
5 Structural features (ORF, SSR, TR, tRNA and PolyA)

For all these files, the following columns are read by RepetDB:

Column 1: Consensus identifier
Column 4: Feature start
Column 5: Feature end
Column 6: Feature score (optional; can contain blast e-value score for example)
Column 7: Feature strand
Column 9: Feature attributes
- “ID” attribute
- “Parent” attribute
- Other attributes depending on the feature type. (See following sub sections)

Warning on data volume: The larger the consensus copies GFF3 file, the longer the data integration will take. Consider filtering out consensus copies whose identity to the consensus is below a chosen threshold.
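
The columns listed above can be extracted from a GFF3 line as follows. This Python sketch is only an illustration of which fields are read, not the RepetDB parser: the function name read_gff3_feature is hypothetical, and it assumes tab-separated lines with attributes written as key=value pairs separated by semicolons.

```python
def read_gff3_feature(line: str) -> dict:
    """Extract the GFF3 columns that RepetDB reads (1, 4, 5, 6, 7 and 9)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")

    # Column 9: key=value pairs separated by ";" (a trailing ";" is tolerated).
    attributes = {}
    for item in cols[8].split(";"):
        if item.strip():
            key, _, value = item.strip().partition("=")
            attributes[key] = value

    return {
        "consensus": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        # The score is optional; GFF3 writes "." for an undefined value.
        "score": None if cols[5] == "." else float(cols[5]),
        "strand": cols[6],
        "attributes": attributes,
    }
```

The “ID” and “Parent” attributes are then available in the returned "attributes" dictionary, together with any feature-specific attribute such as “Target”.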

I.3.b.a. GFF3 column 9 specifics: blast features files

For blast features (BlastX, BlastTX, BlastN) on the consensus, RepetDB will extract the target Repbase element name and classification stored in the “Target” field of the feature attributes (column #9). This field must have the following format:

Target=[REPBASE ID]:[WICKER CLASSIF] [START] [END];

The components are:

[REPBASE ID]: Repbase element identifier
[WICKER CLASSIF]: Wicker classification notation, written as [Class]:[Order]:[Super Family]
[START]: Match start
[END]: Match end

Examples:

Target=EnSpm-5_VV:ClassII:TIR:CACTA 2158 2231;
Target=VHARB-N1_VV:ClassII:?:? 59 234;

RepetDB will also extract the identity score from the “Identity” field of the feature attributes column and the e-value score from the standard GFF3 score (column #6).
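
The “Target” format described above can be checked with a regular expression. The following Python sketch is only an illustration: the function name parse_blast_target is hypothetical, it takes the text that follows “Target=”, and it assumes that the Repbase identifier contains neither a colon nor whitespace.

```python
import re

# [REPBASE ID]:[Class]:[Order]:[Super Family] [START] [END]
BLAST_TARGET_RE = re.compile(
    r"^(?P<repbase_id>[^:\s]+):(?P<wicker_class>[^:\s]+):(?P<order>[^:\s]+)"
    r":(?P<super_family>[^:\s]+)\s+(?P<start>\d+)\s+(?P<end>\d+)$"
)


def parse_blast_target(value: str) -> dict:
    """Split the value of a blast "Target" attribute into its components."""
    match = BLAST_TARGET_RE.match(value.strip().rstrip(";").strip())
    if match is None:
        raise ValueError(f"unexpected Target format: {value!r}")
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    return fields
```

With the first example above, the sketch separates “EnSpm-5_VV” from the classification “ClassII”, “TIR”, “CACTA” and the match coordinates 2158 and 2231. Unknown levels written as “?” are returned unchanged.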

I.3.b.b. GFF3 column 9 specifics: protein profile feature file

For protein profiles, RepetDB will extract the target GyDB or PFAM profile name stored in the “Target” field of the feature attributes (column #9). This field must have the following format:

Target=[PFAM ID][REPET MISC] [START] [END];
[OR]
Target=[GyDB ID][REPET MISC] [START] [END];

The components are:

[PFAM ID]: PFAM profile identifier
[GyDB ID]: GyDB profile identifier
[REPET MISC]: Miscellaneous info concatenated by REPET
[START]: Match start
[END]: Match end

Examples:

Target=PF12799.2_LRR_4_NA_OTHER_27.0 1 40;
Target=_GAG_lentiviridae_NA_GAG_NA 396 400;
Target=PF04434.12_SWIM_NA_OTHER_5.0 9 36;
Target=_RT_cavemovirus_NA_RT_NA 251 268;

RepetDB will also extract the identity score from the “Identity” field of the feature attributes column and the e-value score from the standard GFF3 score (column #6).
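
The profile “Target” values above can be split into a profile name and match coordinates. The following Python sketch is only an illustration: the function name parse_profile_target is hypothetical, and it takes the text that follows “Target=”. It recognizes a PFAM accession by the usual “PF” plus five digits pattern, with an optional version suffix; treating every other name as a non-PFAM profile is an assumption based on the examples, not a documented RepetDB rule.

```python
import re

# PFAM accessions look like PF12799, optionally followed by a version (".2").
PFAM_ACCESSION_RE = re.compile(r"^PF\d{5}(\.\d+)?")


def parse_profile_target(value: str) -> dict:
    """Split a profile "Target" value into name, PFAM accession, start and end."""
    # The last two whitespace-separated tokens are the match coordinates.
    name, start, end = value.strip().rstrip(";").rsplit(None, 2)
    pfam = PFAM_ACCESSION_RE.match(name)
    return {
        "name": name,
        "pfam_accession": pfam.group(0) if pfam else None,
        "start": int(start),
        "end": int(end),
    }
```

The miscellaneous information concatenated by REPET stays inside "name", because the text above does not define a separator for it.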

I.3.b.c. GFF3 column 9 specifics: ribosomal DNA BlastN

Warning on generation of this file by REPET: Please make sure this file is really generated as described in this section. We have encountered problems with the Target field in many rDNA BlastN GFF files generated by the REPET package (see GNP-4949).

For rDNA, RepetDB will extract the target EMBL rDNA accession stored in the “Target” field of the feature attributes (column #9). This field must have the following format:

Target=embl|[ACCESSION ID] [START] [END];

The components are:

[ACCESSION ID]: EMBL accession identifier
[START]: Match start
[END]: Match end

Examples:

Target=embl|L28107 1 817;
Target=embl|M38450 162 194;

RepetDB will also extract the target rDNA description from the target_desc or the target_description attribute from column #9.

Examples:

target_desc=L28107 Trichoderma reesei 25S ribosomal RNA.;
target_description=L28817 Candida albicans internal transcribed spacer 1 (ITS1) - 5.8S ribosomal RNA - internal transcribed spacer 2 (ITS2).;
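
Given the warning at the start of this section, it can be worth checking rDNA features before integration. The following Python sketch is only an illustration: the function name parse_rdna_attributes is hypothetical, and it assumes the column #9 attributes are already available as a dictionary of attribute names and values.

```python
import re

# Target=embl|[ACCESSION ID] [START] [END]
RDNA_TARGET_RE = re.compile(r"^embl\|(?P<accession>\S+)\s+(?P<start>\d+)\s+(?P<end>\d+)$")


def parse_rdna_attributes(attributes: dict) -> dict:
    """Extract accession, coordinates and description from rDNA attributes."""
    match = RDNA_TARGET_RE.match(attributes.get("Target", "").strip())
    if match is None:
        raise ValueError("Target does not follow 'embl|ACCESSION START END'")
    # The description may be stored under either attribute name.
    description = attributes.get("target_desc") or attributes.get("target_description")
    return {
        "accession": match.group("accession"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "description": description,
    }
```

A ValueError on any feature suggests that the file was not generated in the expected form and should be fixed before integration.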

II. RepetDB Track Hub dataset

Track hubs are web-accessible directories of genomic data that can be viewed on genome browsers. You can generate a Track Hub from the copies of REPET consensuses placed on a genome; the Track Hub references the corresponding RepetDB consensus cards.

Generating a RepetDB Track Hub is separate from the classic RepetDB data integration. To generate a Track Hub for a dataset, you will need:

Genome fasta file: a multi-fasta file of all of the chromosomes/scaffolds used in the REPET dataset
Consensus classification file: a TSV (tab separated values) file generated by the REPET package, used to create tracks for each Wicker group
GFF of consensus copies on the genome: a standard GFF3 file with chromosomes/scaffolds as reference sequence and consensus copies as features

In addition, the dataset must provide more metadata on the genome assembly so that it can be correctly referenced in the UCSC Genome Browser and the EMBL Track Hub Registry.

II.1. Track Hub dataset metadata

To add a RepetDB dataset to the RepetDB Track Hub, you must provide additional metadata in the project.xml file as follows:

      <!-- Track Hub properties -->
      <property name="genome.scientificName" value="Vitis vinifera"/>

      <property name="genome.assembly.identifier" value="PN40024_12X"/>
      <property name="genome.assembly.GCA" value="GCA_000003745.2"/>
      <property name="genome.assembly.description" value="PN40024 12X"/>

      <property name="file.annotation.gff" value="ANNOTATION/allChr_PAST12Xv2Man2_2.gff3"/>
      <property name="file.annotation.fasta" value="ANNOTATION/allChr_PAST12Xv2Man2_2.fa"/>

The properties are:

The genome organism scientific name (property genome.scientificName, ex: “Vitis vinifera”)
The genome assembly unique identifier (property genome.assembly.identifier, ex: “PN40024_12X”)
The genome assembly GenBank accession (property genome.assembly.GCA, ex: “GCA_000003745.2”)
The genome assembly description (property genome.assembly.description, ex: “Vitis vinifera PN40024 12X assembly”)
The genome assembly multi-fasta sequence file (property file.annotation.fasta)
The consensus copies annotation on genome GFF3 file (property file.annotation.gff)