RepetDB provides two entry points to REPET-generated datasets.
The first entry point is the RepetDB InterMine web application which provides a search and visualization interface centered on the repeat consensus, their classification and their features.
The second (optional) entry point is the RepetDB Track Hub which provides repeat consensus copies on a genome browser.
Each of these entry points requires metadata and data files to be prepared and then integrated.
A RepetDB dataset must be prepared in a folder that gathers all the data files, together with a project.xml file that lists these files and adds metadata on the dataset.
The project.xml file should follow this template:
<source name="[DATASET NAME]" type="repetdb-main">
<property name="src.data.dir" location="[DATA DIR]"/>
<!-- Dataset metadata -->
<property name="file.submission" value="[RELATIVE PATH TO FILE]"/>
<!-- Consensus fasta sequence -->
<property name="file.fasta" value="[RELATIVE PATH TO FILE]"/>
<!-- Consensus classification -->
<property name="file.classif" value="[RELATIVE PATH TO FILE]"/>
<!-- Consensus statistics -->
<property name="file.stat" value="[RELATIVE PATH TO FILE]"/>
<!-- Consensus copies -->
<property name="file.gff3.copies" value="[RELATIVE PATH TO FILE]"/>
<!-- Similarity evidences -->
<property name="file.gff3.blrn" value="[RELATIVE PATH TO FILE]"/>
<property name="file.gff3.blrtx" value="[RELATIVE PATH TO FILE]"/>
<property name="file.gff3.blrx" value="[RELATIVE PATH TO FILE]"/>
<property name="file.gff3.rdnablrn" value="[RELATIVE PATH TO FILE]"/>
<property name="file.gff3.profiles" value="[RELATIVE PATH TO FILE]"/>
<!-- Structural evidences -->
<property name="file.gff3.orf" value="[RELATIVE PATH TO FILE]"/>
<property name="file.gff3.polya" value="[RELATIVE PATH TO FILE]"/>
<property name="file.gff3.ssr" value="[RELATIVE PATH TO FILE]"/>
<property name="file.gff3.tr" value="[RELATIVE PATH TO FILE]"/>
<property name="file.gff3.trna" value="[RELATIVE PATH TO FILE]"/>
</source>
This template shows all possible properties of a RepetDB dataset, but most of them are optional.
The mandatory fields include:
- [DATASET NAME]: a unique identifier for your RepetDB dataset
- [DATA DIR] (in property src.data.dir): the absolute path to the dataset data folder
- taxonomy.id, genome.assembly and contact.name
- file.submission: the path to the Excel submission file
- file.fasta
- file.classif
All [RELATIVE PATH TO FILE] properties should contain a relative path to a data file inside the dataset data folder.
All the other properties point to optional data files.
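The mandatory fields above lend themselves to a quick check before integration. The following Python sketch is not part of RepetDB: the function name check_source and its messages are hypothetical, it assumes POSIX-style paths, and it only looks at the properties visible in the template (taxonomy.id, genome.assembly and contact.name are not checked).

```python
import xml.etree.ElementTree as ET
from pathlib import Path, PurePosixPath

# File properties that the text above lists as mandatory.
MANDATORY_FILE_PROPERTIES = ("file.submission", "file.fasta", "file.classif")


def check_source(source_xml, check_files=False):
    """Return a list of problems found in a <source> element (empty if none).

    Hypothetical helper: this is an illustration, not RepetDB's own validation.
    """
    source = ET.fromstring(source_xml)
    problems = []
    if not source.get("name"):
        problems.append("missing dataset name")

    # src.data.dir uses a 'location' attribute, the other properties use 'value'.
    props = {}
    for prop in source.findall("property"):
        props[prop.get("name")] = prop.get("value") or prop.get("location")

    data_dir = props.get("src.data.dir")
    if not data_dir:
        problems.append("missing src.data.dir")
    elif not PurePosixPath(data_dir).is_absolute():
        problems.append("src.data.dir is not an absolute path")

    for name in MANDATORY_FILE_PROPERTIES:
        rel_path = props.get(name)
        if not rel_path:
            problems.append(f"missing {name}")
        elif PurePosixPath(rel_path).is_absolute():
            problems.append(f"{name} should be a relative path")
        elif check_files and data_dir and not (Path(data_dir) / rel_path).is_file():
            problems.append(f"{name} not found: {rel_path}")
    return problems
```

Calling check_source with check_files=True additionally verifies that each mandatory file exists under the data folder.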
To submit data to RepetDB, you must also fill in an Excel file providing the metadata about your dataset, containing:
All of this metadata must be provided in the Excel submission file using the RepetDB_Submission_Template.xls template file.
For an example of a submission file, check the RepetDB_Sample_Submission.xls file.
Once it is filled in, place the submission file in the dataset data directory and add the following property to the project.xml file:
<property name="file.submission" value="[SUBMISSION FILE PATH]"/>
Replace [SUBMISSION FILE PATH] with the relative path of the submission file inside the dataset data directory.
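Two files are mandatory to integrate a new RepetDB dataset: the consensus fasta sequence file (file.fasta) and the consensus classification file (file.classif).
The consensus fasta file should follow the multi-fasta standard, with the consensus identifier as the sequence identifier.
This file should be added to the project.xml file as follows:
<property name="file.fasta" value="[RELATIVE PATH TO FILE]"/>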
From the consensus classification file, five specific columns are used by the RepetDB integration (order does not matter):
Any other columns are ignored by the integration process.
This file should be added to the project.xml file as follows:
<property name="file.classif" value="[RELATIVE PATH TO FILE]"/>
The consensus statistics file is a TSV file generated by the REPET software package and should contain the column headers as its first line.
The first column should have the “TE” header and contain consensus identifiers (as in the classification file or the multi-fasta file). RepetDB identifies the other columns by their header, so their order does not matter.
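Since the statistics file is read by header and its “TE” column must hold the same consensus identifiers as the multi-fasta file, the two can be cross-checked before integration. The sketch below is a hypothetical helper, not RepetDB code; it takes file contents as strings, and it assumes the fasta identifier is the first word after ">".

```python
import csv
import io


def fasta_identifiers(fasta_text):
    """Return the set of sequence identifiers (first word of each '>' line)."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def stat_identifiers(tsv_text):
    """Return the consensus identifiers from the 'TE' column of a statistics TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # The text above requires the first header to be "TE".
    if not reader.fieldnames or reader.fieldnames[0] != "TE":
        raise ValueError("first column header must be 'TE'")
    return [row["TE"] for row in reader]


def unknown_consensus(tsv_text, fasta_text):
    """List identifiers present in the statistics file but absent from the fasta."""
    known = fasta_identifiers(fasta_text)
    return [te for te in stat_identifiers(tsv_text) if te not in known]
```

An empty list from unknown_consensus means every statistics row refers to a consensus present in the fasta file.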
26 optional statistical value columns are parsed by RepetDB:
Any of these columns may be omitted; RepetDB simply ignores the missing ones.
This file should be added to the project.xml file as follows:
<property name="file.stat" value="[RELATIVE PATH TO FILE]"/>
RepetDB can integrate nine consensus feature files following the GFF3 standard:
For all these files, the following columns are read by RepetDB:
Warning on data volume: the larger the consensus copies GFF file, the longer the data integration will take. Consider filtering out consensus copies whose identity to the consensus falls below a chosen threshold.
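One way to apply such a filter is sketched below in Python. It is an illustration under assumptions, not a REPET or RepetDB tool: it assumes that each copy stores its identity in an attribute named "Identity" in column 9 (the name can be changed through the hypothetical attribute parameter), and that the threshold is given on the same scale as the file.

```python
def parse_attributes(column9):
    """Parse a GFF3 attributes column ('key=value;key=value') into a dict."""
    attributes = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def filter_by_identity(gff_lines, min_identity, attribute="Identity"):
    """Yield the GFF3 lines whose identity attribute is >= min_identity.

    Comment lines, malformed lines and features without a numeric identity
    attribute are passed through unchanged.
    """
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) != 9:
            yield line
            continue
        raw = parse_attributes(fields[8]).get(attribute)
        try:
            identity = float(raw)
        except (TypeError, ValueError):
            # Attribute missing or not a number: keep the feature.
            yield line
            continue
        if identity >= min_identity:
            yield line
```

Because the function works on any iterable of lines, it can be fed an open file handle and its output written straight to a new GFF3 file.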
For blast features (BlastX, BlastTX, BlastN) on the consensus, RepetDB extracts the target Repbase element name and classification stored in the “Target” field of the feature attributes (column #9). This field must use the following format:
Target=[REPBASE ID]:[WICKER CLASSIF] [START] [END];
The components are:
- [REPBASE ID]: Repbase element identifier
- [WICKER CLASSIF]: Wicker classification notation composed of [Class]:[Order]:[Super Family]
- [START]: Match start
- [END]: Match end
Examples:
Target=EnSpm-5_VV:ClassII:TIR:CACTA 2158 2231;
Target=VHARB-N1_VV:ClassII:?:? 59 234;
RepetDB will also extract the identity score from the “Identity” field of the feature attributes column and the e-value score from the standard GFF3 score (column #6).
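To check that a blast GFF3 file respects this “Target” format, the attribute can be parsed with a small regular expression. The Python sketch below is an assumption about how such a check could look, not RepetDB's actual parser; BlastTarget and parse_blast_target are hypothetical names.

```python
import re
from typing import NamedTuple


class BlastTarget(NamedTuple):
    repbase_id: str
    te_class: str
    order: str
    superfamily: str
    start: int
    end: int


# Target=<name> <start> <end>, either first in the column or after a ';'.
BLAST_TARGET_RE = re.compile(
    r"(?:^|;)\s*Target=(?P<name>\S+) (?P<start>\d+) (?P<end>\d+)(?:;|$)"
)


def parse_blast_target(attributes):
    """Parse the Target attribute of a blast feature (GFF3 column 9)."""
    match = BLAST_TARGET_RE.search(attributes)
    if match is None:
        raise ValueError("no well-formed Target attribute")
    # Split from the right so the last three parts are Class, Order, Super Family.
    parts = match.group("name").rsplit(":", 3)
    if len(parts) != 4:
        raise ValueError("expected [REPBASE ID]:[Class]:[Order]:[Super Family]")
    repbase_id, te_class, order, superfamily = parts
    return BlastTarget(
        repbase_id, te_class, order, superfamily,
        int(match.group("start")), int(match.group("end")),
    )
```

For instance, parse_blast_target("Target=EnSpm-5_VV:ClassII:TIR:CACTA 2158 2231;") yields the Repbase identifier EnSpm-5_VV with class ClassII, order TIR and super family CACTA.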
For protein profiles, RepetDB extracts the target GyDB or PFAM profile name stored in the “Target” field of the feature attributes (column #9). This field must use the following format:
Target=[PFAM ID][REPET MISC] [START] [END];
[OR]
Target=[GyDB ID][REPET MISC] [START] [END];
The components are:
- [PFAM ID]: PFAM profile identifier
- [GyDB ID]: GyDB profile identifier
- [REPET MISC]: Miscellaneous info concatenated by REPET
- [START]: Match start
- [END]: Match end
Examples:
Target=PF12799.2_LRR_4_NA_OTHER_27.0 1 40;
Target=_GAG_lentiviridae_NA_GAG_NA 396 400;
Target=PF04434.12_SWIM_NA_OTHER_5.0 9 36;
Target=_RT_cavemovirus_NA_RT_NA 251 268;
RepetDB will also extract the identity score from the “Identity” field of the feature attributes column and the e-value score from the standard GFF3 score (column #6).
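The three values read from a profile feature (target, identity, e-value) can be pulled out of one GFF3 line as sketched below in Python. This is a hedged illustration, not RepetDB's parser: parse_profile_feature is a hypothetical name, and the PFAM/GyDB distinction is a heuristic based only on the examples above (names starting with "PF" followed by digits are treated as PFAM, everything else as GyDB).

```python
import re

PROFILE_TARGET_RE = re.compile(
    r"(?:^|;)\s*Target=(?P<name>\S+) (?P<start>\d+) (?P<end>\d+)(?:;|$)"
)
IDENTITY_RE = re.compile(r"(?:^|;)\s*Identity=([^;]+)")
PFAM_RE = re.compile(r"PF\d+")  # heuristic, see the lead-in


def parse_profile_feature(gff_line):
    """Extract target, identity and e-value from one profile GFF3 line."""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line has 9 tab-separated columns")
    target = PROFILE_TARGET_RE.search(fields[8])
    if target is None:
        raise ValueError("no well-formed Target attribute")
    identity = IDENTITY_RE.search(fields[8])
    name = target.group("name")
    return {
        "profile": name,
        "source": "PFAM" if PFAM_RE.match(name) else "GyDB",
        "start": int(target.group("start")),
        "end": int(target.group("end")),
        "identity": float(identity.group(1)) if identity else None,
        # Column 6 holds the e-value; '.' means no score in GFF3.
        "evalue": None if fields[5] == "." else float(fields[5]),
    }
```

The [REPET MISC] suffix is deliberately left attached to the profile name, since the format gives no separator that would allow a reliable split.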
Warning on generation of this file by REPET: please make sure this file is really generated as described in this section. We have encountered problems with the Target field in many rDNA BlastN GFF files generated by the REPET package (see GNP-4949).
For rDNA, RepetDB extracts the target EMBL rDNA accession stored in the “Target” field of the feature attributes (column #9). This field must use the following format:
Target=embl|[ACCESSION ID] [START] [END];
The components are:
- [ACCESSION ID]: EMBL accession identifier
- [START]: Match start
- [END]: Match end
Examples:
Target=embl|L28107 1 817;
Target=embl|M38450 162 194;
RepetDB will also extract the target rDNA description from the target_desc or the target_description attribute from column #9.
Examples:
target_desc=L28107 Trichoderma reesei 25S ribosomal RNA.;
target_description=L28817 Candida albicans internal transcribed spacer 1 (ITS1) - 5.8S ribosomal RNA - internal transcribed spacer 2 (ITS2).;
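Because the Target field of rDNA files is a known source of problems, it can be worth validating the attributes column before integration. The following Python sketch shows one possible check; parse_rdna_attributes is a hypothetical helper, not part of REPET or RepetDB, and it rejects any Target that lacks the embl| prefix.

```python
import re

RDNA_TARGET_RE = re.compile(
    r"(?:^|;)\s*Target=embl\|(?P<accession>\S+) (?P<start>\d+) (?P<end>\d+)(?:;|$)"
)
# Accept both spellings of the description attribute.
RDNA_DESC_RE = re.compile(r"(?:^|;)\s*target_desc(?:ription)?=(?P<desc>[^;]*)")


def parse_rdna_attributes(attributes):
    """Parse the rDNA Target and description from a GFF3 attributes column."""
    target = RDNA_TARGET_RE.search(attributes)
    if target is None:
        raise ValueError("Target must look like 'embl|[ACCESSION ID] [START] [END]'")
    desc = RDNA_DESC_RE.search(attributes)
    return {
        "accession": target.group("accession"),
        "start": int(target.group("start")),
        "end": int(target.group("end")),
        "description": desc.group("desc").strip() if desc else None,
    }
```

Running this over every feature line and collecting the ValueError cases gives a quick list of entries that need fixing before submission.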
Track hubs are web-accessible directories of genomic data that can be viewed on genome browsers. You can generate a Track Hub from copies of REPET consensus placed on genomes that references the RepetDB consensus card.
The generation of the RepetDB Track Hub is separate from the classic RepetDB data integration. To generate a Track Hub for a dataset, you will need:
In addition, the dataset must provide more metadata on the genome assembly to make it correctly referenceable in the UCSC Genome Browser and the EMBL Track Hub Registry.
To add a RepetDB dataset to the RepetDB Track Hub, you must provide additional metadata in the project.xml file as follows:
<!-- Track Hub properties -->
<property name="genome.scientificName" value="Vitis vinifera"/>
<property name="genome.assembly.identifier" value="PN40024_12X"/>
<property name="genome.assembly.GCA" value="GCA_000003745.2"/>
<property name="genome.assembly.description" value="PN40024 12X"/>
<property name="file.annotation.gff" value="ANNOTATION/allChr_PAST12Xv2Man2_2.gff3"/>
<property name="file.annotation.fasta" value="ANNOTATION/allChr_PAST12Xv2Man2_2.fa"/>
The list of properties contains:
- genome.scientificName (e.g. “Vitis vinifera”)
- genome.assembly.identifier (e.g. “PN40024_12X”)
- genome.assembly.GCA (e.g. “GCA_000003745.2”)
- genome.assembly.description (e.g. “Vitis vinifera PN40024 12X assembly”)
- file.annotation.fasta
- file.annotation.gff
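The Track Hub properties can be checked in the same spirit before generating the hub. The Python sketch below is hypothetical and rests on two assumptions: that all six properties are required, and that they sit as property elements inside the dataset's source element. The accession pattern follows the usual GenBank assembly form (GCA_, nine digits, a dot and a version number).

```python
import re
import xml.etree.ElementTree as ET

# Assumed to be all required; the text above does not state which are optional.
TRACK_HUB_PROPERTIES = (
    "genome.scientificName",
    "genome.assembly.identifier",
    "genome.assembly.GCA",
    "genome.assembly.description",
    "file.annotation.gff",
    "file.annotation.fasta",
)
GCA_RE = re.compile(r"GCA_\d{9}\.\d+")


def check_track_hub_properties(source_xml):
    """Return a list of problems with the Track Hub properties (empty if none)."""
    source = ET.fromstring(source_xml)
    values = {p.get("name"): p.get("value") for p in source.iter("property")}
    problems = [
        f"missing {name}" for name in TRACK_HUB_PROPERTIES if not values.get(name)
    ]
    gca = values.get("genome.assembly.GCA")
    if gca and not GCA_RE.fullmatch(gca):
        problems.append(f"unexpected GCA accession: {gca}")
    return problems
```

An empty list means the source element carries every Track Hub property and a plausible GCA accession.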