TEannot tuto

Tutorial for TEannot included in REPET package v2.5

We advise running the TEdenovo pipeline first, but it is not compulsory. We assume that you begin by running the TEannot pipeline on the example provided in the directory "db/" rather than directly on your own genomic sequences. Thus, from now on, the project name is "DmelChr4".

Set up your working environment

If you already ran the TEdenovo pipeline, you do not have to repeat all of the following tasks: skip to the step where the TE library file is named <project_name>_refTEs.fa.

Set environment variables. 

REPET_PATH gives the absolute path to the directory in which REPET has been installed (e.g. "$HOME/src/repet_pipe/").

  • export REPET_PATH=$HOME/src/repet_pipe/

Add the path to the REPET programs to your PATH:

  • export PATH=$REPET_PATH/bin:...:$PATH

If you want to use tools from the REPET package, you will have to set some other variables. In this case, you can set the variables in the file "$REPET_PATH/config/setEnv.sh" and source it.
 

Create your project directory (for instance "DmelChr4_TEannot/") and go into it:

  • cd $HOME/work/
  • mkdir DmelChr4_TEannot
  • cd DmelChr4_TEannot

Copy (or link) the input fasta file containing the genomic sequences (it has to be named <project_name>.fa):

  • ln -s $REPET_PATH/db/DmelChr4.fa .

Format your fasta file so that each sequence has at most 60 bp per line. It is highly advised to write the sequence headers as ">XX_i", with XX standing for letters and i standing for numbers. Please avoid spaces (" ") and symbols such as "=", ";", ":", "|"...
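
The formatting advice above can be checked with a short script. The following is only a minimal sketch in Python, not a tool from the REPET package: the function names "wrap_fasta" and "suspicious_headers" are hypothetical, and the regular expression is one possible reading of the advised ">XX_i" header form.

```python
import re

# One possible reading of the advised ">XX_i" form: letters, underscore, digits.
HEADER_PATTERN = re.compile(r"[A-Za-z]+_[0-9]+")


def wrap_fasta(text, width=60):
    """Return the FASTA text with every sequence re-wrapped at `width` characters."""
    out = []
    seq_parts = []

    def flush():
        # Write the sequence collected so far, `width` characters per line.
        seq = "".join(seq_parts)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
        seq_parts.clear()

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            out.append(line)
        else:
            seq_parts.append(line)
    flush()
    return "\n".join(out) + "\n"


def suspicious_headers(text):
    """Return the headers that do not follow the advised ">XX_i" form."""
    headers = [l[1:].strip() for l in text.splitlines() if l.startswith(">")]
    return [h for h in headers if HEADER_PATTERN.fullmatch(h) is None]
```

Headers reported by "suspicious_headers" are not necessarily rejected by the pipeline; they simply deviate from the advised form.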

Rename your input TE library fasta file to <project_name>_refTEs.fa

The <project_name> must contain only letters, numbers and underscores ('_'), and be at most 15 characters long.
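
The naming rule for the project name can be verified before launching anything. This is a minimal sketch in Python; the function name "is_valid_project_name" is hypothetical and is not part of REPET.

```python
import re

# Letters, digits and underscore only, 1 to 15 characters.
PROJECT_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{1,15}")


def is_valid_project_name(name):
    """Return True if `name` respects the project name rule stated above."""
    return PROJECT_NAME_PATTERN.fullmatch(name) is not None
```
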
If you already ran the TEdenovo pipeline, retrieve the output fasta file containing the TE library you want.

Several TEdenovo output files can be chosen according to the detection type you used and the steps you launched. Please read the file "TEdenovo_tuto.txt".

 

  • ln -s $HOME/work/DmelChr4_TEdenovo/DmelChr4_Blaster_GrpRecPil_Struct_Map_TEclassif_Filtered_Clustered/DmelChr4_denovoLibTEs_filtered_clustered.fa DmelChr4_refTEs.fa

 

Copy the configuration file:

  • cp $REPET_PATH/config/TEannot.cfg .

Edit the configuration file "TEannot.cfg" in order to adapt it to your personal situation.

  • In the section "repet_env", indicate (ask your system administrator):
    • the host name of your MySQL database
    • your MySQL login
    • your MySQL password
    • the name of your MySQL database
    • the name of the job manager running on the computing cluster you are using ("SGE" or "TORQUE")
  • In the section "project", indicate:
    • the name of your project (here: DmelChr4)
    • the absolute path to your project directory (here: $HOME/work/DmelChr4_TEannot)
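
Before running the pipeline, it can help to check that the two sections above are present and filled in. The following is a minimal sketch using Python's standard "configparser" module; the function name "check_config" is hypothetical, and the sketch deliberately makes no assumption about the option names inside each section.

```python
import configparser

# Section names taken from the list above.
REQUIRED_SECTIONS = ("repet_env", "project")


def check_config(path):
    """Return (missing sections, options left empty) for the required sections."""
    # strict=False tolerates duplicated options; interpolation is disabled
    # so that '%' characters in values (e.g. in passwords) are read as-is.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    with open(path) as handle:
        parser.read_file(handle)
    missing = [s for s in REQUIRED_SECTIONS if not parser.has_section(s)]
    empty = [
        (section, option)
        for section in REQUIRED_SECTIONS
        if parser.has_section(section)
        for option, value in parser.items(section)
        if not value.strip()
    ]
    return missing, empty
```

Calling check_config("TEannot.cfg") from the project directory returns two lists, both of which should be empty once the file has been edited.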

 

Run the pipeline

The standard output is largely self-explanatory. The programs from REPET almost always begin with the sentence "beginning of ..." and end with the sentence "... finished successfully". Each program launching another one goes on only when EXIT_SUCCESS (usually "0") is returned. Otherwise the sentence "*** Error: 'program X' returned 256" is written and the whole pipeline stops.
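
The value 256 in the error message is consistent with a raw POSIX wait status, in which the exit code of the child program is stored in the high byte. That REPET reports such a raw status is an assumption, not something stated in this tutorial. Under that assumption, the following Python sketch (the function name is hypothetical) decodes the value:

```python
def exit_code_from_wait_status(status):
    """Decode a POSIX wait status.

    Returns the exit code of the program, or minus the signal number
    if the program was killed by a signal.
    """
    signal_number = status & 0x7F
    if signal_number:
        return -signal_number
    return (status >> 8) & 0xFF
```

With this reading, "returned 256" means that 'program X' exited with code 1.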

To avoid killing the main process of the pipeline when you disconnect from your session, it is highly advised to use the Unix command "nohup". This program keeps a command running even if the session is disconnected or the user logs out. For more details, read the manual ("$ man nohup"). Here is an example:

  • nohup TEannot.py -P ... -S 1 >& step1.txt &

To speed up the process, jobs are launched in parallel. In each section of the configuration file, you can set the following options:

  • Resources (optional): according to your data, you may need some specific resources (e.g. "mem_free=8G" if you need 8G of memory per job).
  • tmpDir (optional): according to the computing cluster, give the name of the temporary directory of the nodes (e.g. "/scratch"). WARNING: if you leave this parameter empty (the default), don't use 'yes' for the copy parameter described in step 2.
  • clean {yes|no} (default: yes): temporary files cleaning

 

Introduction

TEannot annotates a genome using a library of DNA sequences. This library can be a predicted TE library built by TEdenovo. Please have a look at the step descriptions below for command examples. Quick description of TEannot's steps:

  • Step 1 : Genomic sequences and data banks preparation
  • Step 2 : Align the reference TE sequences on each chunk of the genome via Blaster and/or CENSOR and/or RepeatMasker
  • Step 3 : Combine and filter the HSPs obtained at step 2
  • Step 4 : Search for satellites (SSR) on the chunks via TRF, Mreps and RepeatMasker
  • Step 5 : Merge the SSR annotations from the 3 programs used at the step 4
  • Step 6 : Comparison with data banks (nucleotides or amino-acids, in fasta format, e.g. Repbase Update)
  • Step 7 : Remove spurious HSPs and join procedure
  • Step 8 : TE annotation export

Description of steps and command lines

Methodological advice

To obtain the best TE annotation of a genome, it is highly advised to proceed as follows:

  • First, run a quick TEannot, using only steps 1-2-3-7. You can use the output multifasta file from the TEdenovo pipeline as the TE library (cf. "doc/TEdenovo_tuto.txt").
  • Then, run the "PostAnalyzeTELib.py" script with analysis 3 (-a 3) on the annotation table obtained after this TEannot run (cf. doc/HelpFrom-commons-tools.txt).
  • Then, use the "GetSpecificTELibAccordingToAnnotation.py" script on the "*.annotStatsPerTE.tab" file from PostAnalyzeTELib.py.
    The output file suffixed by "FullLengthFrag.fa" is a TE library whose sequences have at least one perfect match in the genome. Thus, these TEs are validated.
  • Finally, run another complete TEannot (steps 1 to 8) on your original genome using this validated TE library.

Step 1 Genomic sequences and data banks preparation

  • Cut the input genomic sequences into chunks and load them in MySQL tables ("DmelChr4_chr_seq", "DmelChr4_chk_seq" and "DmelChr4_chk_map")
  • Randomize the chunks (shuffle but preserve both mono- and di-symbol composition) and load them in a MySQL table
  • Rename the headers of the reference TEs, load the reference TE library (e.g. from the TEdenovo pipeline) in a MySQL table ("DmelChr4_refTEs_seq", "DmelChr4_refTEs_map") and prepare it for Blaster (blastn)

Adjustable parameters :

If you need to change the default parameters, edit the [prepare_data] section of the "TEannot.cfg" file. The input genomic sequences are cut into chunks (threshold at 200kb with a 10kb overlap), but only if their length is above the threshold; a chunk will never be a concatenation of two different input sequences. If you have a very high number of small sequences (e.g. 70000 input sequences of mean size 100kb), it is still advised to keep the threshold at 200kb: putting several chunks into the same batch (the batches being launched in parallel) keeps the number of jobs reasonable.

  • length threshold ("chunk_length: 200000")
  • overlap length ("chunk_overlap: 10000")
  • number of chunks per batch launched in parallel ("nb_seq_per_batch: 5")
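
To get a feeling for how these three parameters interact, here is a minimal sketch in Python. It is not the code used by REPET: the exact way the last chunk of a sequence is handled is an assumption, and the function names are hypothetical. Coordinates are 1-based and inclusive.

```python
def chunk_coords(seq_length, chunk_length=200_000, chunk_overlap=10_000):
    """Return the (start, end) coordinates of the chunks of one sequence."""
    if seq_length < 1:
        raise ValueError("seq_length must be positive")
    if not 0 <= chunk_overlap < chunk_length:
        raise ValueError("chunk_overlap must be smaller than chunk_length")
    if seq_length <= chunk_length:
        # Short sequences are kept whole: one chunk, never merged with others.
        return [(1, seq_length)]
    step = chunk_length - chunk_overlap
    coords = []
    start = 1
    while True:
        end = min(start + chunk_length - 1, seq_length)
        coords.append((start, end))
        if end == seq_length:
            return coords
        start += step


def count_batches(nb_chunks, nb_seq_per_batch=5):
    """Return the number of batches (i.e. parallel jobs) for `nb_chunks` chunks."""
    return -(-nb_chunks // nb_seq_per_batch)  # ceiling division
```

Under these assumptions, a 450kb sequence gives three chunks, and 70000 sequences shorter than the threshold give 70000 chunks grouped into 14000 batches with "nb_seq_per_batch: 5".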

In order to remove false positives, we apply an empirical statistical filter by comparing the reference TE library with the genomic sequences that have been randomized. To use it, set "make_random_chunks: yes". If you don't want to use this filter, set "make_random_chunks: no"; you will then use your own filtering values at step 3 (see below).

You may need to change parameters in the [align_refTEs_with_genome] section too, because the reference TE library will be prepared according to the blast program you choose for step 2 (see below).

The reference TE library usually comes from the TEdenovo pipeline (i.e. formatted as "classification_cluster_name", e.g. "DTX-incomp_Blc10_DmelChr4-B-G8-Map20"). If not, no sequence header should be longer than 50 characters.

When you are ready, launch the following command:

  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 1

The "DmelChr4_db/" directory is created. It contains all the prepared data and two subdirectories, "batches/" and "batches_rnd/". In the database, the tables "DmelChr4_chr_seq", "DmelChr4_chk_seq", "DmelChr4_chk_map", "DmelChr4_refTEs_seq" and "DmelChr4_refTEs_map" are also created.

Step 2 Align the reference TE sequences on each chunk

The second step aligns the reference TE sequences on each genomic chunk via BLASTER (high sensitivity, followed by MATCHER) and/or REPEATMASKER (cutoff at 200) and/or CENSOR (high sensitivity). For each program, you can do the same on the randomized chunks (option "-r").

Adjustable parameters:

Edit the [align_refTEs_with_genome] section of the configuration file TEannot.cfg and set some parameters:

  • If you want to use BLASTER with WU-BLAST, write "BLR_blast: wu".
  • If you want to use BLASTER with NCBI-BLAST, write "BLR_blast: ncbi".
  • If you want to use BLASTER with NCBI-BLAST-PLUS, write "BLR_blast: blastplus".
  • You can decrease the sensitivity from 4 to 1 by changing the value of "BLR_sensitivity".
  • To use the blast default values, write "BLR_sensitivity: 0".
  • If you want to use REPEATMASKER with WU-BLAST, write "RM_engine: wu".
  • If you want to use REPEATMASKER with CROSS_MATCH, write "RM_engine: cm".
  • If you want to use REPEATMASKER with NCBI-BLAST, write "RM_engine: ncbi".
  • You can decrease the sensitivity from "RM_sensitivity: s" to "RM_sensitivity: q" or even "RM_sensitivity: qq".
  • If you don't specify any sensitivity ("RM_sensitivity: "), the default one will be used.
  • If you want to use CENSOR with WU-BLAST, write "CEN_blast: wu".
  • If you want to use CENSOR with NCBI-BLAST, write "CEN_blast: ncbi".
  • This step generates many files, so it is advised to keep only the useful ones (option "clean: yes" in the configuration file).

When you are ready, launch one or more of the following commands, according to the alignment programs you want to use:

  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a BLR
  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a RM
  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a CEN

In order to compute a statistical filter in step 3, you can use the randomized chunks (optional) by launching step 2 again with "-r" option:

  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a BLR -r
  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a RM -r
  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a CEN -r

This step generates lots of files (up to tens of GB, depending on the size of the input data bank). Two directories are created, "DmelChr4_TEdetect/" and "DmelChr4_TEdetect_rnd/", each containing three subdirectories (one per alignment program) where the results are stored.

Step 3 Filter and combine HSPs

The third step filters and combines the HSPs obtained at step 2, i.e. the TE annotations.

First, for each alignment program specified with option "-c" (by default, the 3 programs used at step 2), it determines the highest score obtained on the randomized chunks (this requires that step 2 has been launched with option "-r"). More precisely, it uses the 95th percentile of the distribution of the highest scores obtained on each chunk. Then it filters the HSPs obtained on the "natural" chunks by keeping only the ones having a score higher than the threshold determined previously.

For short input sequences, it may happen that a program (Blaster, Censor and/or RepeatMasker) doesn't find any HSP on the randomized chunks. In that case, a "Warning" is raised, a default value is given (from the configuration file) and the program "TEannot.py" goes on. If you don't want to use the filter values found on the randomized chunks, you can force the usage of your own values in the configuration file ("force_default_values: yes" in the [filter] section).

Next, for each batch, the 3 files (each from a different program) are concatenated and MATCHER is used to remove overlapping HSPs and make connections with the "join" procedure.
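
The filtering principle described above can be illustrated with a minimal sketch in Python. It is not REPET's implementation: the nearest-rank definition of the percentile is an assumption (REPET may compute it differently), and all names are hypothetical.

```python
def percentile_threshold(max_scores, percent=95, default=None):
    """Return the `percent`-th percentile (nearest-rank) of `max_scores`.

    `max_scores` holds the highest score found on each randomized chunk.
    If it is empty, `default` is returned, mimicking the fallback to the
    value given in the configuration file.
    """
    if not max_scores:
        if default is None:
            raise ValueError("no score on randomized chunks and no default value")
        return default
    ordered = sorted(max_scores)
    # Integer ceiling of percent * n / 100, to avoid floating-point rounding.
    rank = max((percent * len(ordered) + 99) // 100, 1)
    return ordered[rank - 1]


def filter_hsps(hsps, threshold):
    """Keep only the HSPs whose score is strictly higher than `threshold`."""
    return [hsp for hsp in hsps if hsp["score"] > threshold]
```

For instance, if the highest scores on 20 randomized chunks are 1, 2, ..., 20, the threshold is 19 and only HSPs scoring 20 or more on the natural chunks are kept.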

When you are ready, launch the following command:     

  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 3 -c BLR+RM+CEN

A subdirectory "Comb/" is created in "DmelChr4_TEdetect/". This step also creates the MySQL tables "DmelChr4_chk_allTEs_path" and "DmelChr4_chr_allTEs_path".

Step 4 Search for SSR

The fourth step searches for satellites on the genomic sequences via TRF, Mreps and RepeatMasker (looking only for simple repeats). The SSR annotations are loaded into a MySQL table. If you are not interested in satellite detection, you can skip steps 4 and 5. In the [SSR_detect] section, you can set:

  • RMSSR_engine: "wu" or "cm" (crossmatch)
  • TRFmaxPeriod: maximum tandem repeat period size to be reported by TRF

When you are ready, launch the following command:

  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 4 -s TRF
  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 4 -s Mreps
  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 4 -s RMSSR

A directory is created, "DmelChr4_SSRdetect/", containing three subdirectories (one per program) with the results that are also loaded into MySQL tables called "DmelChr4_chk_TRF", "DmelChr4_chk_Mreps" and "DmelChr4_chk_RMSSR".

Step 5 Merge SSR annotations

This step merges the SSR annotations from the 3 programs used at the previous step. For instance, an SSR detected by TRF with coordinates (100,500) and another detected by Mreps with coordinates (80,450) are merged into a single SSR with coordinates (80,500).
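
The merge shown in this example is a classical interval union. Here is a minimal sketch in Python, assuming that annotations are merged as soon as they overlap on the same sequence; the function name is hypothetical and REPET's own criteria may differ in details.

```python
def merge_ssr_annotations(intervals):
    """Merge overlapping (start, end) intervals located on the same sequence."""
    merged = []
    # Normalize each interval so that start <= end, then sort by start.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if merged and start <= merged[-1][1]:
            # Overlap with the previous interval: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Applied to the two annotations above, merge_ssr_annotations([(100, 500), (80, 450)]) gives a single interval (80, 500).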

When you are ready, launch the following command:  

  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 5

A new MySQL table is created, called "DmelChr4_chk_allSSRs_set".

Step 6 Comparison with data banks

This step compares a data bank (nucleotides or amino-acids, in fasta format, e.g. Repbase Update) with each input genomic sequence via BLASTER with tblastx or blastx, followed by MATCHER.

  • If you want to use a nucleotide data bank, set "bankBLRtx: <nucleotide_sequences_bank_name>" in the [align_other_banks] section.
  • If you want to use an amino-acid data bank, set "bankBLRx: <amino-acids_sequences_bank_name>" in the [align_other_banks] section.

When you are ready, launch the following command:

  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 6 -b tblastx
  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 6 -b blastx

A subdirectory "bankBLR(t)x/" is created in "DmelChr4_TEdetect/", that contains the results, along with MySQL tables ("DmelChr4_chk_bankBLR(t)x_path", "DmelChr4_chr_bankBLR(t)x_path", "DmelChr4_bankBLR(t)x_(nt/prot)_seq").

Step 7 Remove spurious HSPs and long join procedure

This step performs successive procedures on the MySQL tables, such as removal of duplicated TE annotations, removal of SSR annotations included in TE annotations, and the "long join procedure" (described below).

Because the input genomic sequences may contain large regions of heterochromatin, some TEs are expected to be nested. As a given copy can be interrupted by several other TEs inserted more recently, we expect to find distant fragments belonging to the same copy. MATCHER is used at step 3, not only to filter overlapping HSPs, but also to join them. However, it relies on a scoring scheme that, in some extreme cases (deep nesting, distant fragmentation), appears to be insufficient. Therefore we implemented a "long join procedure" aimed at recovering the joins of the fragments that are sometimes missed by MATCHER.

Fragments involved in nesting patterns must respect the three following constraints: (i) be co-linear, (ii) have the same age, and (iii) be separated by younger TE insertions. The identity percentage with a reference consensus sequence is used to estimate the age of a copy. Consecutive fragments on both the genome and the same reference TE are automatically joined if they respect these constraints. We call this a "nest join".

Sometimes large non-TE sequence insertions can be observed in a TE copy. They are suspected to appear by gene conversion. In order to deal with these cases, we also join fragments if they are separated by an insert of less than 5kb and/or less than 500bp of mismatches, and have the same age. We call this a "simple join".

Young copies are expected to keep longer fragments than old copies, because deletions accumulate with time. This gives a final control of nested patterns, based on a different assumption than the consensus nucleotide identity percentage (see above). Thus, at the end, nested TEs are split if the inner TE fragments are longer than the outer joined fragments. They are reported as "split".

Based on the Drosophila melanogaster genome (release 4), we chose conservative parameter settings in order to join only unambiguous cases (Bergman et al., Genome Biology 2006,7:R112).
A "deny long join" occurs when the age of the fragments differs by more than 2% ("join_id_tolerance" parameter). This rejection is frequent compared to other events, highlighting the importance of this constraint (i.e. considering the age of the fragments to join). A "too long join" occurs when the fragments to be joined are more than 100kb apart. This appears to be very marginal.
A "deny nest join" occurs when either the TE coverage of the insert is not high enough (>95%, "join_TEinsert_cov" parameter) or older TEs are inserted. This appears to occur rarely.
Some "simple joins" are performed, but their number remains low compared to the number of fragments treated. This is a consequence of the efficiency of the MATCHER join, indicating that a "simple join" is needed only rarely. The same conclusion can be drawn for "splits". One could have set the parameters to less conservative values and thus obtained more "long joins", but we felt that these cases could be too ambiguous and we preferred to keep our results conservative.
Below is an explanation of the parameter values found in the "[annot_processing]" section of TEannot.cfg:

  • min_size, default 20 : copies with length below "min_size" bp are removed.
  • join_max_gap_size, default 5000 : if the distance between two fragments exceeds "join_max_gap_size", the fragments are not connected.
  • join_max_mismatch_size, default 500 : if the mismatch length (bp) between two fragments (in the dynamic programming algorithm, see Quesneville et al. 2005) exceeds "join_max_mismatch_size", the fragments are not connected.
  • join_id_tolerance, default 2 : if the age difference between two fragments (identity percentage) exceeds "join_id_tolerance", the fragments are not connected.
  • join_TEinsert_cov, default 0.95 : if the distance between two fragments exceeds "join_max_gap_size" and at least a fraction "join_TEinsert_cov" of the genome sequence between the fragments is composed of younger TEs, the fragments are connected.
  • join_overlap, default 15 : if the size (bp) of the overlap between two fragments exceeds "join_overlap", the fragments are not connected.
  • join_minlength_split, default 100 : if a nested TE is older than the flanking fragments but its size exceeds "join_minlength_split", the fragments are not connected.
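
To make the role of these thresholds more concrete, here is a minimal sketch in Python of the "simple join" test only. It is an illustration built from the parameter descriptions above, not REPET's implementation: the "Fragment" class and the function name are hypothetical, the mismatch size is supposed to be computed elsewhere, and the "nest join" and "split" cases are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    start: int       # genomic start (1-based, inclusive)
    end: int         # genomic end (inclusive)
    identity: float  # identity percentage with the reference TE (proxy for age)


def can_simple_join(left, right, mismatch_size,
                    join_max_gap_size=5000, join_max_mismatch_size=500,
                    join_id_tolerance=2.0, join_overlap=15):
    """Return True if two consecutive fragments pass all 'simple join' thresholds."""
    gap = right.start - left.end - 1  # negative when the fragments overlap
    if gap < 0 and -gap > join_overlap:
        return False  # overlap too large
    if gap > join_max_gap_size:
        return False  # too distant (a "nest join" would be needed here)
    if mismatch_size > join_max_mismatch_size:
        return False  # too many mismatches between the fragments
    if abs(left.identity - right.identity) > join_id_tolerance:
        return False  # ages differ too much ("deny long join")
    return True
```

For example, two fragments 2kb apart with identities of 95% and 94% pass the test, whereas the same fragments with identities of 95% and 90% are rejected.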

When you are ready, launch the following command:

  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 7

In the database, several tables are created: "DmelChr4_chk_allTEs_nr_path", "DmelChr4_chk_allTEs_nr_noSSR_path" and finally "DmelChr4_chk_allTEs_nr_noSSR_join_path".

Step 8 TE annotation export

This step exports the annotations from the final MySQL table to gameXML or GFF3 format. These two annotation formats can be imported in Apollo and GBrowse respectively.

Further details are available on the web:

  • gameXML: http://www.fruitfly.org/annot/gamexml.dtd.txt
  • GFF3: http://www.sequenceontology.org/gff3.shtml
  • Apollo: http://gmod.org/wiki/index.php/Apollo
  • GBrowse: http://gmod.org/wiki/index.php/Gbrowse

Edit the configuration file "TEannot.cfg" if you need to change the default parameters in [export] section.

  • To export the annotations on the input sequences, set "sequences: chromosomes"; to export them on the chunks, set "sequences: chunks".
  • To add the SSR annotations, set "add_SSRs: yes". To add the annotations found via tblastx, set "add_tBx: yes"; for those found via blastx, set "add_Bx: yes" (assuming you launched step 6 before).
  • To keep the gff3 files corresponding to the input genomic sequences without TE annotations, set "keep_gff3_files_without_annotations: yes". In this case, the corresponding files will be empty unless "gff3_with_genomic_sequence" is set to "yes".
  • In the gff3 file, to merge redundant matches (same start, same end, same score and on the same sequence), set "gff3_merge_redundant_features: yes". The names of the other TEs are tagged 'other targets' in the attributes field (ninth field).
  • To generate a match part for each match, set "gff3_compulsory_match_part: yes".
  • To add the annotated genomic sequence at the end of the gff3 files, set "gff3_with_genomic_sequence: yes".
  • To add the TE length in the attributes field (ninth field) for each match, set "gff3_with_TE_length: yes".
  • To add the TE classification information in the attributes field (ninth field) with tag "TargetDescription" for each match, set "gff3_with_classif_info: yes" and give the name of the TE table by setting "classif_table_name: <name_of_TEs_table>" (default if empty: "<project_name>_consensus_classif" from TEdenovo).
  • To get gff3 files compatible with a chado database, set "gff3_chado: yes"
  • If you set "drop_tables: yes", be careful because all the MySQL tables will be deleted. Do it only if you are sure you don't need them anymore. 

When you are ready, launch one of the following commands:

  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 8 -o gameXML
  • TEannot.py -P DmelChr4 -C TEannot.cfg -S 8 -o GFF3

A directory is created, "DmelChr4_gameXML" or "DmelChr4_GFF3", containing the annotation files, one per sequence (chromosome or chunk).
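
The exported GFF3 files can be inspected with a few lines of code. The following minimal sketch in Python relies only on the generic GFF3 layout (nine tab-separated fields, the ninth holding "tag=value" attributes separated by ";"); the function names are hypothetical, and no assumption is made about the exact tags written by TEannot.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for blank or comment lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line must have 9 tab-separated fields")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for item in record["attributes"].split(";"):
        if "=" in item:
            tag, value = item.split("=", 1)
            attributes[tag.strip()] = value.strip()
    record["attributes"] = attributes
    return record


def count_features_by_type(lines):
    """Count the features per type, stopping at the optional ##FASTA section."""
    counts = {}
    for line in lines:
        if line.startswith("##FASTA"):
            break  # genomic sequence appended at the end of the file
        record = parse_gff3_line(line)
        if record is not None:
            counts[record["type"]] = counts.get(record["type"], 0) + 1
    return counts
```

Passing an open file handle on one of the exported files to "count_features_by_type" gives a quick overview of what was exported.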

Update: 10 Apr 2017
Creation date: 10 Sep 2013