RepetDB Data Submission

RepetDB provides two entry points to REPET-generated datasets.

The first entry point is the RepetDB InterMine web application, which provides a search and visualization interface centered on the repeat consensus sequences, their classification and their features.

The second (optional) entry point is the RepetDB Track Hub, which displays repeat consensus copies in a genome browser.

Each of these entry points requires metadata and data files to be prepared and then integrated.

I. RepetDB InterMine dataset

A RepetDB dataset must be prepared in a folder that gathers all the data files, together with a project.xml file that lists these files and adds metadata about the dataset.

The project.xml file should follow this template:

<source name="[DATASET NAME]" type="repetdb-main">
      <property name="src.data.dir" location="[DATA DIR]"/>

      <!-- Dataset metadata -->
      <property name="file.submission" value="[RELATIVE PATH TO FILE]"/>

      <!-- Consensus fasta sequence -->
      <property name="file.fasta" value="[RELATIVE PATH TO FILE]"/>

      <!-- Consensus classification -->
      <property name="file.classif" value="[RELATIVE PATH TO FILE]"/>

      <!-- Consensus statistics -->
      <property name="file.stat" value="[RELATIVE PATH TO FILE]"/>

      <!-- Consensus copies -->
      <property name="file.gff3.copies" value="[RELATIVE PATH TO FILE]"/>

      <!-- Similarity evidences -->
      <property name="file.gff3.blrn" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.blrtx" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.blrx" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.rdnablrn" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.profiles" value="[RELATIVE PATH TO FILE]"/>

      <!-- Structural evidences -->
      <property name="file.gff3.orf" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.polya" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.ssr" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.tr" value="[RELATIVE PATH TO FILE]"/>
      <property name="file.gff3.trna" value="[RELATIVE PATH TO FILE]"/>
</source>

This template shows all possible properties of a RepetDB dataset, but most of them are optional.

The mandatory fields include:

All [RELATIVE PATH TO FILE] properties should contain a relative path to a data file inside the dataset data folder.

All the other properties point to optional data files.
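Before submitting, it can help to check that every relative path declared in project.xml actually exists in the data folder. The following is a minimal Python sketch and not part of RepetDB: the function name `missing_files` is hypothetical, and it assumes that every property whose name starts with `file.` holds a path relative to the `src.data.dir` location, as in the template above.

```python
import os
import xml.etree.ElementTree as ET


def missing_files(source_xml, base_dir=None):
    """Return the names of file.* properties whose target file does not exist.

    source_xml: text of one <source> element following the template.
    base_dir: optional override of the src.data.dir location.
    """
    source = ET.fromstring(source_xml)
    props = {p.get("name"): p for p in source.findall("property")}
    if base_dir is None:
        # Assumption: src.data.dir carries the folder in its "location" attribute.
        base_dir = props["src.data.dir"].get("location")
    missing = []
    for name, prop in props.items():
        if name.startswith("file."):
            path = os.path.join(base_dir, prop.get("value"))
            if not os.path.isfile(path):
                missing.append(name)
    return missing
```

An empty list means that all declared files were found; any name returned points to a property whose path should be corrected or removed.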

I.1. Dataset metadata

To submit data to RepetDB, you must also fill in an Excel file providing the following metadata about your dataset:

All of these metadata must be provided in the Excel submission file using the RepetDB_Submission_Template.xls template file.

For an example of a submission file, check the RepetDB_Sample_Submission.xls file.

Once it is filled in, place the submission file in the dataset data directory and add the following property to the project.xml file:

  <property name="file.submission" value="[SUBMISSION FILE PATH]"/>

Replace [SUBMISSION FILE PATH] with the relative path of the submission file inside the dataset data directory.

I.2. Mandatory files

Two files are mandatory to integrate a new RepetDB dataset:

I.2.a. Consensus multi-fasta file

This file should follow the multi-fasta standard, with the consensus identifier as the sequence identifier.

This file should be added to the project.xml file as follows:

  <property name="file.fasta" value="[FASTA FILE PATH]"/>
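To inspect which consensus identifiers a multi-fasta file exposes, a short Python sketch such as the one below may be useful. It is only an illustration: the helper names `fasta_ids` and `duplicated_ids` are hypothetical, the identifier is assumed to be the first whitespace-delimited token of each header line, and treating repeated identifiers as a problem is an assumption rather than a documented RepetDB rule.

```python
def fasta_ids(lines):
    """Collect sequence identifiers from the lines of a multi-fasta file."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the identifier is the first whitespace-delimited token.
            ids.append(header.split()[0] if header else "")
    return ids


def duplicated_ids(ids):
    """Return identifiers that occur more than once, in first-seen order."""
    seen, dups = set(), []
    for ident in ids:
        if ident in seen and ident not in dups:
            dups.append(ident)
        seen.add(ident)
    return dups
```

The same list of identifiers can then be compared with the identifiers used in the other dataset files.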

I.2.b. Consensus classification file

From the consensus classification file, five specific columns are used in RepetDB integration (order does not matter):

Any other columns are ignored by the integration process.

This file should be added to the project.xml file as follows:

  <property name="file.classif" value="[CLASSIF FILE PATH]"/>

I.3. Optional files

I.3.a. Consensus copy statistics

This TSV file is generated by the REPET software package and should contain the column headers as its first line.

The first column should have the “TE” header and should contain consensus identifiers (as in the classification file or multi-fasta file). The other columns may appear in any order, because RepetDB identifies each column by its name in the header line.

26 optional statistical value columns are parsed in RepetDB:

Any of the previous columns may be omitted; RepetDB simply ignores the missing ones.

This file should be added to the project.xml file as follows:

  <property name="file.stat" value="[STATISTICS FILE PATH]"/>
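Header-driven parsing of the statistics file can be illustrated with the Python sketch below. It is not the RepetDB parser: the function name `read_stats` is hypothetical, and the only rule taken from this section is that the first header must be “TE”, with the remaining columns looked up by name.

```python
import csv
import io


def read_stats(tsv_text):
    """Map each consensus identifier to its row of statistics.

    Columns are looked up by header name, so their order does not matter.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if not reader.fieldnames or reader.fieldnames[0] != "TE":
        raise ValueError('the first column header must be "TE"')
    return {row["TE"]: row for row in reader}
```

Because each row is a dictionary keyed by column name, a missing optional column simply never appears in the row.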

I.3.b. GFF3 files

RepetDB can integrate nine consensus feature files that follow the GFF3 standard:

For all these files, the following columns are read by RepetDB:

Warning on data volume: The larger the consensus copies GFF file is, the longer the data integration will take. You might consider filtering out consensus copies that fall below a threshold of identity to the consensus.
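One possible way to apply such an identity filter is sketched below in Python. This is an illustration under explicit assumptions: the function name `filter_by_identity` is hypothetical, and the sketch assumes that each copy carries its identity in an `Identity` attribute of column 9, which should be checked against your own copies file.

```python
import re

IDENTITY_RE = re.compile(r"(?:^|;)\s*Identity=([0-9]+(?:\.[0-9]+)?)")


def filter_by_identity(gff_lines, min_identity):
    """Yield comment lines and the features that pass the identity filter."""
    for line in gff_lines:
        if line.startswith("#"):
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # skip blank or malformed feature lines
        match = IDENTITY_RE.search(columns[8])
        # Assumption: features without an Identity attribute are kept as-is.
        if match is None or float(match.group(1)) >= min_identity:
            yield line
```

The threshold uses the same scale as the values stored in the file, so it has to be chosen after looking at the actual Identity values.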

I.3.b.a. GFF3 column 9 specifics: blast features files

For BLAST features (BlastX, BlastTX, BlastN) on the consensus, RepetDB will extract the target Repbase element name and classification stored in the “Target” field of the feature attributes (column #9). This field must use the following format:

Target=[REPBASE ID]:[WICKER CLASSIF] [START] [END];

The components are:

Examples:

Target=EnSpm-5_VV:ClassII:TIR:CACTA 2158 2231;
Target=VHARB-N1_VV:ClassII:?:? 59 234;

RepetDB will also extract the identity score from the “Identity” field of the feature attributes column and the e-value from the standard GFF3 score column (column #6).
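The Target format for BLAST features can be checked with a small regular expression. The Python sketch below is an illustration, not the RepetDB implementation: the function name `parse_blast_target` is hypothetical, and it assumes that the Repbase identifier contains no colon, so the first colon separates it from the classification.

```python
import re

# Target=[REPBASE ID]:[WICKER CLASSIF] [START] [END];
BLAST_TARGET_RE = re.compile(r"Target=([^:\s;]+):(\S+) (\d+) (\d+)")


def parse_blast_target(attributes):
    """Return (repbase_id, classification, start, end) or None if absent."""
    match = BLAST_TARGET_RE.search(attributes)
    if match is None:
        return None
    repbase_id, classif, start, end = match.groups()
    return repbase_id, classif, int(start), int(end)
```

Applied to the first example above, the sketch separates `EnSpm-5_VV` from `ClassII:TIR:CACTA` and reads the coordinates as integers; a return value of None flags an attribute column that does not follow the format.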

I.3.b.b. GFF3 column 9 specifics: protein profile feature file

For protein profiles, RepetDB will extract the target GyDB or PFAM profile name stored in the “Target” field of the feature attributes (column #9). This field must use the following format:

Target=[PFAM ID][REPET MISC] [START] [END];
[OR]
Target=[GyDB ID][REPET MISC] [START] [END];

The components are:

Examples:

Target=PF12799.2_LRR_4_NA_OTHER_27.0 1 40;
Target=_GAG_lentiviridae_NA_GAG_NA 396 400;
Target=PF04434.12_SWIM_NA_OTHER_5.0 9 36;
Target=_RT_cavemovirus_NA_RT_NA 251 268;

RepetDB will also extract the identity score from the “Identity” field of the feature attributes column and the e-value from the standard GFF3 score column (column #6).
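A comparable check can be sketched for protein profile targets. The Python code below is deliberately conservative and is not the RepetDB parser: the function name `parse_profile_target` is hypothetical, it does not try to split the profile identifier from the REPET-specific suffix, and it assumes that a name starting with a PFAM accession (PF followed by digits) is a PFAM profile while any other name is a GyDB profile.

```python
import re

# Target=[PFAM ID or GyDB ID][REPET MISC] [START] [END];
PROFILE_TARGET_RE = re.compile(r"Target=(\S+) (\d+) (\d+)")
PFAM_ACCESSION_RE = re.compile(r"PF\d+(?:\.\d+)?")


def parse_profile_target(attributes):
    """Return (source, raw_name, start, end) or None if there is no Target."""
    match = PROFILE_TARGET_RE.search(attributes)
    if match is None:
        return None
    raw_name, start, end = match.groups()
    # Assumption: names starting with a PFAM accession are PFAM profiles;
    # everything else is treated as a GyDB profile.
    source = "PFAM" if PFAM_ACCESSION_RE.match(raw_name) else "GyDB"
    return source, raw_name, int(start), int(end)
```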

I.3.b.c. GFF3 column 9 specifics: ribosomal DNA BlastN

Warning on generation of this file by REPET: Please make sure this file is really generated as described in this section. We have encountered problems with the Target field in many rDNA BlastN GFF files generated by the REPET package (see GNP-4949).

For rDNA, RepetDB will extract the target EMBL rDNA accession stored in the “Target” field of the feature attributes (column #9). This field must use the following format:

Target=embl|[ACCESSION ID] [START] [END];

The components are:

Examples:

Target=embl|L28107 1 817;
Target=embl|M38450 162 194;

RepetDB will also extract the target rDNA description from the target_desc or the target_description attribute from column #9.

Examples:

target_desc=L28107 Trichoderma reesei 25S ribosomal RNA.;
target_description=L28817 Candida albicans internal transcribed spacer 1 (ITS1) - 5.8S ribosomal RNA - internal transcribed spacer 2 (ITS2).;
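Given the warning above about malformed Target fields, it may be worth checking rDNA BlastN attributes before submission. The following Python sketch is an illustration only: the function name `parse_rdna_attributes` is hypothetical, and it assumes that the description value contains no semicolon.

```python
import re

# Target=embl|[ACCESSION ID] [START] [END];
RDNA_TARGET_RE = re.compile(r"Target=embl\|(\S+) (\d+) (\d+)")
# Accepts both target_desc= and target_description=
RDNA_DESC_RE = re.compile(r"target_desc(?:ription)?=([^;]*)")


def parse_rdna_attributes(attributes):
    """Return the accession, coordinates and description, or None if the
    Target field does not follow the expected format."""
    target = RDNA_TARGET_RE.search(attributes)
    if target is None:
        return None
    desc = RDNA_DESC_RE.search(attributes)
    return {
        "accession": target.group(1),
        "start": int(target.group(2)),
        "end": int(target.group(3)),
        "description": desc.group(1).strip() if desc else None,
    }
```

Running this over column 9 of every feature line and listing the lines that return None gives a quick view of the records affected by the Target problem.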

II. RepetDB Track Hub dataset

Track hubs are web-accessible directories of genomic data that can be viewed on genome browsers. You can generate a Track Hub from the copies of REPET consensus sequences placed on a genome, with each copy referencing the corresponding RepetDB consensus card.

The generation of a RepetDB Track Hub is separate from the standard RepetDB data integration. To generate a Track Hub for a dataset, you will need:

In addition, the dataset must provide more metadata about the genome assembly so that it can be referenced correctly in the UCSC Genome Browser and the EMBL Track Hub Registry.

II.1. Track Hub dataset metadata

To add a RepetDB dataset to the RepetDB Track Hub, you must provide additional metadata in the project.xml file as follows:

      <!-- Track Hub properties -->
      <property name="genome.scientificName" value="Vitis vinifera"/>

      <property name="genome.assembly.identifier" value="PN40024_12X"/>
      <property name="genome.assembly.GCA" value="GCA_000003745.2"/>
      <property name="genome.assembly.description" value="PN40024 12X"/>

      <property name="file.annotation.gff" value="ANNOTATION/allChr_PAST12Xv2Man2_2.gff3"/>
      <property name="file.annotation.fasta" value="ANNOTATION/allChr_PAST12Xv2Man2_2.fa"/>

The list of properties contains