
TEdenovo tuto

Tutorial for TEdenovo included in REPET package v2.5

We suppose you begin by running the TEdenovo pipeline on the example provided in the directory "db/" rather than directly on your own genomic sequences.
Thus, from now on, the project name is "DmelChr4".

Set up your working environment

Set environment variables
REPET_PATH gives the absolute path to the directory in which REPET has been installed (e.g. "$HOME/src/repet_pipe/").

  • export REPET_PATH=$HOME/src/repet_pipe/

Add the path towards REPET programs to your path:

  • export PATH=$REPET_PATH/bin:...:$PATH

If you want to use tools from REPET package, you will have to set some other variables.
In this case, you can set the variables in the file "$REPET_PATH/config/setEnv.sh", and source it.

Create your project directory (for instance "DmelChr4_TEdenovo/") and go into it:

  • cd $HOME/work/
  • mkdir DmelChr4_TEdenovo
  • cd DmelChr4_TEdenovo

Copy the input fasta file recording the genomic sequences (it has to be named <project_name>.fa):

  • ln -s $REPET_PATH/db/DmelChr4.fa .

Format your fasta file so that each sequence line contains at most 60 bp.
It is highly advised to write the sequence headers as ">XX_i", with XX standing for letters and i standing for digits.
Please avoid spaces (" ") and symbols such as "=", ";", ":", "|"...
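
The formatting advice above can be checked or applied with a few lines of Python. This is only a minimal sketch and not part of REPET: the function names `wrap_fasta` and `header_is_recommended` are hypothetical, and the header pattern is one possible reading of ">XX_i" (letters, an underscore, then digits).

```python
import re

# One possible reading of the advised ">XX_i" form: letters, underscore, digits.
HEADER_PATTERN = re.compile(r"^[A-Za-z]+_[0-9]+$")


def wrap_fasta(text, width=60):
    """Return FASTA text with each sequence re-wrapped to at most `width` characters per line."""
    records = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))

    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


def header_is_recommended(header):
    """True if the header (without '>') matches the advised 'XX_i' form."""
    return HEADER_PATTERN.match(header) is not None
```

For instance, `wrap_fasta(open("DmelChr4.fa").read())` returns the re-wrapped text, which you can write to a new file before launching step 1.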

Copy the configuration file:

  • cp $REPET_PATH/config/TEdenovo.cfg .

Edit the configuration file "TEdenovo.cfg" in order to adapt it to your personal situation.

  • In the section "repet_env", indicate (ask your system administrator)
    • the host name of your MySQL database
    • your MySQL login
    • your MySQL password
    •  the name of your MySQL database
    • the name of your jobs manager running on the computing cluster you are using ("SGE" or "TORQUE")
  • In the section "project", indicate:
    •  the name of your project (here: DmelChr4)
    • the absolute path to your project directory (here: $HOME/work/DmelChr4_TEdenovo)

Run the pipeline

The standard output is rather self-explanatory. The programs from REPET almost always begin with the sentence "beginning of ..." and end with the sentence "... finished successfully". Each program launching another one goes on only when EXIT_SUCCESS (usually "0") is returned. Otherwise the sentence "*** Error: 'program X' returned 256" is written and the whole pipeline stops.

To avoid killing the main process of the pipeline when you disconnect from your session, it is highly advised to use the Unix command "nohup". This program keeps a command running even if the session is disconnected or the user logs out. For more details, read the manual ("$ man nohup"). Here is an example:

  • nohup TEdenovo.py -P ... -S 1 >& step1.txt &

To speed up the process, jobs are launched in parallel. In each section of the configuration file, you can set the following options:

  • Resources (optional): according to your data, you may need some specific resources (e.g. "mem_free=8G" if you need 8G of memory per job).
  • tmpDir (optional): according to the computing cluster, give the name of the temporary directory of the nodes (e.g. "/scratch"). WARNING: if you leave this parameter empty (the default), do not set the copy parameter described in step 2 to 'yes'.
  • clean {yes|no} (default: yes): temporary files cleaning

Introduction

The first three steps of the TEdenovo pipeline follow this philosophy:

  • Detection of repeated sequences (potential TE)
  • Clustering of these sequences
  • Generation of consensus sequences for each cluster, representing the ancestral TE

After that, other processes are launched:

  • Each consensus sequence is classified using Wicker's TE classification
  • The consensus bank is filtered on several criteria
  • The TE are grouped by families

TEdenovo is able to look for repeated sequences by similarity and/or for LTR retrotransposons by structural search. You can run either one or both means of detection. Please have a look at the step descriptions below for command examples. Quick description of TEdenovo's steps:

  • Step 1 : Genomic sequences are cut into batches
  • Step 2 : The genome is aligned to itself using Blast
  • Step 2 structural : LTR retrotransposons are searched in each batch using LTRharvest
  • Step 3 : The repetitive HSPs from BLAST are clustered by Recon, Grouper and/or Piler
  • Step 3 structural : The predictions from LTRharvest are clustered using Blastclust or MCL
  • Step 4 : A multiple alignment is computed for each cluster, and a consensus sequence is derived from each multiple alignment
  • Step 5 : Particular features are detected on each consensus, such as structural features or homology with known TE, HMM profiles or host genes
  • Step 6 : The consensus are classified using Wicker's classification
  • Step 7 : SSR and under-represented unclassified consensus are filtered
  • Step 8 : The consensus are clustered into families to facilitate manual curation using Blastclust or MCL

Description of steps and command lines

Step 1 genomic sequence preparation

In this step, the input genomic sequences are cut into chunks (threshold at 200 kb with a 10 kb overlap), but only if their length is above the threshold; a chunk will never be a concatenation of two different input sequences. If you have a very high number of small sequences (e.g. 70000 input sequences of mean size 100 kb), it is still advised to keep the threshold at 200 kb: several chunks can be put into the same batch (the batches being launched in parallel), which keeps the number of jobs reasonable. See the section [prepare_batches] of the TEdenovo.cfg config file:

  • length threshold ("chunk_length: 200000")
  • overlap length ("chunk_overlap: 10000")
  • number of chunks per batch launched in parallel ("nb_seq_per_batch: 5")
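
To get a feeling for how these three parameters interact, the sketch below computes chunk coordinates for one sequence and the resulting number of batches. It is an illustration under assumptions, not REPET's code: the function names are hypothetical, coordinates are 0-based and half-open, and the exact boundaries chosen by the pipeline may differ.

```python
import math


def chunk_coordinates(seq_length, chunk_length=200000, chunk_overlap=10000):
    """Return (start, end) pairs covering one sequence with overlapping chunks.

    A sequence no longer than chunk_length is kept as a single chunk,
    so a chunk never spans two input sequences.
    """
    if not 0 <= chunk_overlap < chunk_length:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_length")
    if seq_length <= chunk_length:
        return [(0, seq_length)]
    step = chunk_length - chunk_overlap
    coords = []
    start = 0
    while True:
        end = min(start + chunk_length, seq_length)
        coords.append((start, end))
        if end == seq_length:
            break
        start += step
    return coords


def count_batches(nb_chunks, nb_seq_per_batch=5):
    """Number of batches (hence parallel jobs) needed for nb_chunks chunks."""
    return math.ceil(nb_chunks / nb_seq_per_batch)
```

With the default values, a 450 kb sequence gives three chunks, and twelve chunks are grouped into three batches.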

When you are ready, launch the following command:

TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 1

In the example, the directory DmelChr4_db is created, in which the fasta file containing the chunks is written. See OutPuts_RepetPipelines.xlsx.

Step 2 all by all alignment

The second step aligns the genomic sequences of interest (in the example "DmelChr4.fa") with themselves in order to identify high-scoring segment pairs (HSPs) corresponding to repeats.

Adjustable parameters:

You can specify an option that may improve the computing time:

  • copy {yes|no} (default: no): if 'yes', the genomic sequence is copied in the tmpDir specified previously.

WARNING: specifying 'yes' improves computing performance ONLY if you specified a tmpDir, and if this tmpDir is a directory on the computing nodes (e.g. "/scratch"). You also have to make sure that neither a password nor a passphrase is required to connect to the computing nodes from the submission node. Please ask your system administrator about these two crucial points before using this option.

The program BLASTER is used with stringent parameters. See TEdenovo.cfg section [self_align]:

  • you can choose the blast program between NCBI-BLAST, NCBI-BLAST+ and WU-BLAST ("blast: ncbi", "blast: blastplus", "blast: wu").
  • BLAST returns only HSPs having an E-value below 1e-300
  • BLAST returns only HSPs having a length above 100 (bp)
  • BLAST returns only HSPs having an identity percentage above 90 (in %)

After BLASTER has run, HSPs can be filtered ("filter_HSP: yes"). Even if thresholds have already been defined above, you may want to be more stringent after the BLAST. Moreover, it is not possible during the BLAST to filter on a maximal HSP size (e.g. to remove matches corresponding to segmental duplications). See TEdenovo.cfg section [self_align]:

  • keep only HSPs having an E-value below 1e-300
  • keep only HSPs having an identity percentage above 90 (%)
  • keep only HSPs having a length above 100 (bp)
  • keep only HSPs having a length below 20000 (bp)
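
As an illustration of this post-BLAST filter, here is a minimal sketch. It assumes that 100 bp is the minimal length and 20000 bp the maximal one (as the remark on segmental duplications suggests), that the bounds are inclusive, and that each HSP is a dictionary with hypothetical keys "evalue", "identity" and "length"; BLASTER's actual implementation and file format are not shown here.

```python
def keep_hsp(hsp, max_evalue=1e-300, min_identity=90.0,
             min_length=100, max_length=20000):
    """Return True if an HSP passes all four thresholds (bounds assumed inclusive)."""
    return (
        hsp["evalue"] <= max_evalue
        and hsp["identity"] >= min_identity
        and min_length <= hsp["length"] <= max_length
    )


def filter_hsps(hsps, **thresholds):
    """Keep only the HSPs that pass keep_hsp with the given thresholds."""
    return [hsp for hsp in hsps if keep_hsp(hsp, **thresholds)]
```

The upper length bound is the one that cannot be applied during the BLAST itself, which is why this filter is a separate pass.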

Step 2 generates lots of files (by 'lots' we mean up to tens of GB, depending of course on the size of the input data bank). Thus it is advised to keep only the useful files ("clean: yes"). To see the difference, launch step 2 on the example with and without this option.

When you are ready, launch the following command:

  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 2 -s Blaster

In the example, the directory DmelChr4_Blaster is created where all the results (list of HSPs) are stored, usually in a tabulated file called "DmelChr4.align.not_over.filtered" (HSPs due to chunk overlaps were removed, and filter applied). See files and directories tree. 

Step 2 Structural detection

The second step with the structural option '--struct' launches the LTRharvest tool on each batch created at step 1. To use this step, you have to install the Genome Tools package first. It is advised to download and install the latest stable version from the Hamburg University website.

This step allows you to search for LTR retrotransposons in a structural manner, enriching the TE detection in your input genome. Moreover, no filter on the number of repetitions is applied, so you may find rare LTR retrotransposons (fewer than 3 copies in the genome).

The structural parameters used to define an LTR retrotransposon are described below. See TEdenovo.cfg section [structural_search]:

  • Long terminal repeats from 100 bp to 1000 bp
  • Distance between the starting positions of the long terminal repeats from 1000 bp to 16000 bp
  • Minimum similarity between the two long terminal repeats of 90% by default
  • Target site duplications (TSD) of 4 bp to 20 bp required within 60 bp on both sides of the two terminal repeats
  • In the case of overlapping or nested predictions, only the one with the most conserved terminal repeats is kept.

Adjustable parameters in [structural_search] section:

  • "minLTRSize: 100" by default. This is the minimal size of the long terminal repeats.
  • "maxLTRSize: 1000" by default. This is the maximal size of the long terminal repeats.
  • "minElementSize: 1100" by default. This is the minimal size of the whole element.
  • "maxElementSize: 16000" by default. This is the maximal size of the whole element.
  • "LTR_similarity: 90" by default. Changing this parameter is useful when you want to search for younger LTR retrotransposons, for example for a first pass on a highly repeated genome. Indeed, the younger the TE, the more similar its LTRs.
  • "overlaps_handling: best" by default. Possible values are 'no', 'best' and 'all'. This parameter is used in the case of overlapping or nested predictions.
    • With 'best', only the prediction with the most conserved terminal repeats is kept. This prediction may be the youngest TE, whose terminal repeats have diverged the least. It is also the prediction with the highest probability of not having sequences inserted in it.
    • With 'no', none of the nested predictions is returned. This is the most stringent value, leading to a lower sensitivity and a higher specificity.
    • With 'all', all the nested predictions are returned. This is the least stringent value, leading to a higher sensitivity and a lower specificity.
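
The three values of "overlaps_handling" can be illustrated with a toy example. The sketch below is not LTRharvest's algorithm: it assumes that predictions are dictionaries with hypothetical keys "start", "end" and "similarity" (similarity between the two LTRs), and that predictions form a group as soon as they overlap transitively.

```python
def resolve_overlaps(predictions, mode="best"):
    """Apply a 'no' / 'best' / 'all' policy to overlapping or nested predictions."""
    if mode not in ("no", "best", "all"):
        raise ValueError("mode must be 'no', 'best' or 'all'")
    ordered = sorted(predictions, key=lambda p: (p["start"], p["end"]))
    if mode == "all":
        return ordered

    # Group predictions that overlap (directly or through a chain of overlaps).
    groups = []
    group_end = None
    for pred in ordered:
        if groups and pred["start"] <= group_end:
            groups[-1].append(pred)
            group_end = max(group_end, pred["end"])
        else:
            groups.append([pred])
            group_end = pred["end"]

    if mode == "no":
        # Most stringent: drop every prediction involved in an overlap.
        return [group[0] for group in groups if len(group) == 1]
    # 'best': keep, per group, the prediction with the most similar LTRs.
    return [max(group, key=lambda p: p["similarity"]) for group in groups]
```

On a toy set with one nested pair and one isolated prediction, 'all' returns three predictions, 'best' two and 'no' one, which matches the sensitivity and specificity trade-off described above.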

Step 2 and step 2 'structural' are independent, so they can be launched at the same time. Please note that the jobs may compete against each other on the computing cluster if you do so.

  When you are ready, launch the following command:

TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 2 --struct

In the example, the directory DmelChr4_LTRharvest is created in which the fasta file containing LTRharvest predictions is written. See files and directories tree. 

Step 3 HSPs clustering

The third step clusters the HSPs from step 2 to build clusters of repeats. Three clustering methods are available: GROUPER, RECON and PILER. It is better to launch all three methods in order to be able to combine the results afterwards.

This step is not parallelized: you have to launch the same command three times, once for Grouper, once for Recon and once for Piler. As these programs have different running times, this allows you to launch the next step as soon as one program is finished. This is especially useful as Recon (and sometimes Grouper too) usually takes much longer than Piler. Still, as the clustering programs usually require large resources, they are launched on a cluster node within the pipeline (the group ID is "DmelChr4_TEdenovo_Grouper" for instance).

Adjustable parameters: edit the configuration file TEdenovo.cfg section [cluster_HSPs] if you need to change the default parameters.

For the Grouper clustering program, the parameters are:

  • Grouper_coverage, default 0.95 : coverage between all sequences in a group is at least "Grouper_coverage".
  • Grouper_join, default yes: join fragments before clustering.
  • Grouper_include, default 2 : keep groups where at least "Grouper_include" members are not included in other groups.
  • Grouper_maxJoinLength, default 30000 : maximum length of a join. If distance between 2 TEs is above "Grouper_maxJoinLength", TEs will not be joined.

For all clustering programs (Grouper, Recon and Piler), parameters are:

  • minNbSeqPerGroup, default 3 : minimum number of sequences per group.
  • nbLongestSeqPerGroup, default 20 : select the "nbLongestSeqPerGroup" longest sequences of each group.
  • maxSeqLength, default 20000 : max sequence length (bp) in groups.
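
The effect of these three common parameters can be sketched as follows. This is an assumption-laden illustration, not REPET's code: the function name is hypothetical, clusters are given as a dictionary mapping a cluster identifier to a list of sequences, and the order in which the pipeline applies the three criteria may differ.

```python
def select_cluster_members(clusters, min_nb_seq_per_group=3,
                           nb_longest_seq_per_group=20, max_seq_length=20000):
    """Keep, for each sufficiently large cluster, its longest sequences.

    Assumed order: drop sequences longer than max_seq_length, discard
    clusters left with too few members, then keep the longest ones.
    """
    selected = {}
    for cluster_id, sequences in clusters.items():
        kept = [seq for seq in sequences if len(seq) <= max_seq_length]
        if len(kept) < min_nb_seq_per_group:
            continue
        kept.sort(key=len, reverse=True)
        selected[cluster_id] = kept[:nb_longest_seq_per_group]
    return selected
```

The default values correspond to the "3elem_20seq" part of the output file names produced by this step.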

When you are ready, launch the following commands:         

  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Grouper
  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Recon
  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Piler

Results: in the example, for each clustering method, the directory "DmelChr4_Blaster_<method_name>" is created and contains several files. See files and directories tree:

  • "DmelChr4<XX>_filtered.log" contains some statistics about this step, where XX depends on the clustering method settings.
  • "DmelChr4_Blaster_<method_name>_3elem_20seq.fa" contains the sequences; the headers indicate to which cluster each sequence belongs.

Step 3 clustering of structural detection

The third step clusters the LTRharvest predictions from step 2 structural to form clusters of potential LTR retrotransposons. The clustering tool used here is either the single-linkage NCBI Blastclust or Markov clustering with MCL. Neither minimal identity nor minimal coverage is required for Blastclust.

Adjustable parameters:

Edit the configuration file TEdenovo.cfg at section [structural_search_clustering] if you need to change the default parameters.

  • type "Blastclust" (with 95% identity, 90% coverage and without requiring coverage on both neighbours), or "MCL".
  • MCL_inflation is 1.5 and MCL_coverage is 0 when you choose MCL.

Step 3 'structural' and step 3 are independent, so they can be launched at the same time. Please note that the jobs may compete against each other on the computing cluster if you do so.

When you are ready, launch the following command:      

  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 3 --struct

Results: in the example, the directory DmelChr4_LTRharvest_Blastclust is created and contains several files. See files and directories tree:

  • "DmelChr4XX_filtered.log" records some statistics about this step, where XX depends on the clustering method settings.
  • "DmelChr4_LTRharvest_Blaster_1elem_20seq.fa" contains the sequences; the headers indicate to which cluster each sequence belongs.

Step 4 Build consensus

This step computes a multiple alignment for each cluster. The available multiple sequence alignment (MSA) program is Map. This program implements a global multiple alignment algorithm that specifically takes long gaps into account. Thus it always completes on clusters from Recon, whereas MUSCLE sometimes never finishes. Moreover, it seems to give better alignments than MAFFT. Note that, although the Map algorithm described in Huang (1994) remains unchanged, the program has been slightly improved to manage fasta files with several sequences more efficiently. Thus, on the command line, it is now called "rpt_map" instead of "map". Launch the command "TEdenovo.py -h" to see how to specify an MSA program on the command line.

Once the MSA is built, a consensus is derived by taking the most frequent base at each site. Moreover, if only one sequence has a base at a specific site, all the others having a gap (case of a unique insertion for instance), then the site is not taken into account for the consensus (the minimal number of bases required at a site is the minBasesPerSite parameter in TEdenovo.cfg section [build_consensus]).
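
The consensus rule described above (most frequent base per site, sites with too few bases skipped) can be sketched in a few lines. This is an illustration only: the function name is hypothetical, the value 2 for the minimal number of bases is an assumption matching the "unique insertion" case, and ties between equally frequent bases are resolved arbitrarily here.

```python
from collections import Counter


def build_consensus(aligned, min_bases_per_site=2, gap="-"):
    """Derive a majority-rule consensus from equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        bases = [base for base in column if base != gap]
        if len(bases) < min_bases_per_site:
            # Too few sequences have a base here (e.g. a unique insertion).
            continue
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)
```

For the alignment "ACGT-A", "ACGTTA", "ACCT-A", the fifth site is supported by a single sequence and is therefore skipped, giving "ACGTA".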

When you are ready, launch the following command:  

  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map
  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map 
  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map

If you ran the 'structural search' and want to use the results, launch the following command:

  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 4 --struct -m Map

These commands are independent, they can be launched at the same time. Please note that the jobs may compete against each other on the computing cluster if you do so.

Results: in the example, for each clustering method, the directory DmelChr4_Blaster_<method_name>_Map is created with several files. See files and directories tree:

  • DmelChr4_Blaster_<method_name>_Map_consensus.fa, consensus fasta file.

If you ran the structural search, a directory "DmelChr4_LTRharvest_Blastclust_Map" is created too, with DmelChr4_LTRharvest_Blastclust_Map_consens.fa consensus fasta file. 

Step 5 Consensus detect features

Then, we launch the first step of the PASTEClassifier, i.e. the detection of features on the consensus.

Adjustable parameters: edit the configuration file TEdenovo.cfg section [detect_feature] if you need to change the default parameters.

Several programs can be launched to look for:

  • terminal repeats with TRsearch by setting "term_rep: yes"
  • tandem repeats with TRF by setting "tand_rep: yes"
  • open reading frames with dbORF.py by setting "orf: yes"
  • poly-A tails with polyAtail by setting "polyA: yes"
  • TEclass: please do not change this option as it is experimental

You can choose the blast program between NCBI-BLAST "blast: ncbi", NCBI-BLAST+ "blast: blastplus" and WU-BLAST "blast: wu".

You can also use RepeatScout to generate additional input consensuses for this step. To do so, you need:

  • to use the provided LaunchRepeatScout tool;
  • to set "RepScout: yes" and "RepScout_bank: <bank_of_RepeatScout>" in the configuration file. Make sure that either the <bank_of_RepeatScout> file is in your root project directory (copy or soft link), or that you provide a valid absolute path to it.

To use this feature, you have to install RepeatScout first. It is advised to download and install the latest stable version from the UCSD website ("http://bix.ucsd.edu/repeatscout/").

PASTEClassifier also looks for matches between the consensus and known TEs (e.g. Repbase Update). Repbase Update (Jurka J. et al., Cytogenetic and Genome Research, 2005) is a well-known databank of known repeats. To use it, you will have to register on "www.girinst.org". Once you are registered, you can download a compressed archive with Repbase Update specifically formatted for REPET. The archive contains two fasta files, one with nucleotide sequences given to BLASTER with tblastx ("TE_BLRtx: yes") and the other with amino acid sequences given to BLASTER with blastx ("TE_BLRx: yes"). If you have your own databank of known repeats, you can use it instead of Repbase or concatenate it at the end of Repbase. Take care of the way the sequence headers are formatted. Furthermore, you can provide other databanks:

  • HMM profiles ("TE_HMMER: yes"): it is possible to search for HMM profiles in the consensus via hmmer2 or hmmer3. This is very useful for PASTEC (see below). You can download the profile bank for REPET, ProfilesBankForREPET_Pfam27.0_GypsyDB.hmm, which comes from the Pfam database (M. Punta, et al., Nucleic Acids Research, 2012) and is formatted for REPET. WARNING: this bank can only be used with hmmer3. You can also use your own bank, but each profile name has to be well formatted (<ACC>_<NAME>_<type according to key words found in DESC>_<GA>).
  • cDNA from the host genome ("HG_BLRn: yes"), it is possible to compare them with the consensus via BLASTER with blastn.
  • rDNA ("rDNA_BLRn: yes"), it is possible to compare them with the consensus via BLASTER with blastn.

Make sure you put the databanks in your root project directory (copy or soft link) and indicate the name of each databank in TEdenovo.cfg section [detect_feature]. You can also adjust "TRFmaxPeriod": the maximum period size of the tandem repeats to be reported by TRF.

The programs listed above are launched in parallel. It can launch up to 1500 jobs (if there are 15000 consensus, each job will deal with 100 consensus).

When you are ready, launch the following command:

  • If you first want to generate additional consensuses using RepeatScout, please use the following command: LaunchRepeatScout.py -i <inputFastaFileName>

        In the example, a file called <inputFastaFileName>_RepeatScoutConsensus.fa, containing the correctly formatted consensuses, is created in the current directory. Please refer to the step 5 explanations above to see how to feed this file to step 5 as an additional source of consensuses.

  • If you want to use only detection by similarity, you must have run the corresponding previous steps. Please launch the following command: TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map

        In the example, the directory "DmelChr4_Blaster_GrpRecPil_Map_TEclassif/detectFeatures" is created, containing the output files from the different programs that have been launched. See files and directories tree.

  • If you want to use only structural detection, you must have run the corresponding previous steps (--struct option). Please launch the following command: TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 5 -m Map --struct

        In the example, the directory DmelChr4_LTRharvest_Blastclust_Map_TEclassif/detectFeatures is created, containing the output files from the different programs that have been launched. See files and directories tree.

  • If you want to combine results from both means of detection and have run the corresponding previous steps, please launch the following command: TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map --struct

        In the example, the directory DmelChr4_Blaster_GrpRecPil_Struct_Map_TEclassif/detectFeatures is created, containing the output files from the different programs that have been launched. See files and directories tree. Several MySQL tables have also been created and will be used in the following step. See tables list.

Step 6 Consensus classification

This step classifies the consensus according to the features detected in step 5. The classification is made by the PASTEClassifier (or PASTEC).
For each consensus, PASTEC retrieves its features from the MySQL tables: "structural" features (LTR, TIR, polyA tails, SSR-like tails) and "coding" features (matches with known TEs, host genes, rDNA or HMM profiles).
PASTEC classifies elements based on Wicker's classification (Wicker et al., Nat.Rev.Genet., 2007).
After that, the consensus headers follow this general format (with a few variations detailed below):

  • {classification code}-{completeness(optional)}-{potential chimeric(optional)}_{consensus name}

If the element is a TE :

  • {Wicker's classification code}_{consensus name}
  • {Wicker's classification code}-{completeness}_{consensus name}
  •  {Wicker's classification code}-{completeness}-{potential chimeric}_{consensus name}
  • Completeness can only be specified when the order has been found
  • potential chimeric TE is an element that has characteristics of more than one classification; the selected classification is the most likely one. If there is no majority classification, 'XXX' is indicated with the "chim" tag.

The Wicker's classification code has three letters:

  • the first one represents the TE class,
  • the second one represents the TE order
  • the third one represents the TE superfamily.

An 'X' indicates an unknown classification at this level. When no classification is found, the tag is "noCat".

If the element is an SSR or a Host Gene:

  • SSR_{consensus name}
  • PotentialHostGene_{consensus name}

If the element is not classified:

  • noCat_{consensus name}

Classification Examples:

RLX-comp_ProjectName-L-B270-Map20         Complete Copia retrotransposon
RXX-LARD_ProjectName-B-R270-Map3          LARD
DTX-incomp_ProjectName-B-P350-Map5        Incomplete TIR
RXX_ProjectName-L-B28-Map1                Retrotransposon
DTX-comp-chim_ProjectName-B-G78-Map6      TIR potentially chimeric
PotentialHostGene_ProjectName-B-R52-Map3  Host gene
noCat_ProjectName-B-R878-Map8             Not classified
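
If you need to process these headers in your own scripts, the general format can be split as in the sketch below. It is not part of REPET: the function name and the returned keys are hypothetical, and it assumes that the classification prefix never contains an underscore, as in all the examples above.

```python
def parse_consensus_header(header):
    """Split a classified consensus header into its documented parts."""
    # The prefix (code and optional tags) ends at the first underscore.
    prefix, _, name = header.partition("_")
    code, *tags = prefix.split("-")
    completeness = next((tag for tag in tags if tag in ("comp", "incomp")), None)
    return {
        "code": code,                    # e.g. "DTX", "SSR", "noCat"
        "completeness": completeness,    # "comp", "incomp" or None
        "chimeric": "chim" in tags,
        "other_tags": [t for t in tags if t not in ("comp", "incomp", "chim")],
        "name": name,
    }
```

For "DTX-comp-chim_ProjectName-B-G78-Map6" this gives the code "DTX", the completeness "comp" and the chimeric flag set, with "ProjectName-B-G78-Map6" as the consensus name.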

When all the consensus have been classified, a procedure to remove redundancy starts, using Blaster and Matcher. Sequences included in others are discarded. To prevent the loss of well-classified sequences, only incomplete sequences included in complete sequences are removed.

Adjustable parameters:

The following settings can be tuned in TEdenovo.cfg section [classif_consensus] . The default parameters are defined from our experience with Drosophila melanogaster genome and from the paper "A unified classification system for eukaryotic transposable elements", Wicker et al., Nat.Rev.Genet., 2007.

  • min_redundancy_identity, default 0.95 <- minimal identity beyond which two consensus are considered identical
  • min_redundancy_coverage, default 0.98 <- minimal coverage beyond which two consensus are considered identical
  • max_profiles_evalue, default 1e-3 <- only matches on profiles bank below this e-value are kept
  • max_blastn_evalue, default 1e-3 <- only matches on reference TEs below this e-value are kept (search for Helitron extremities)
  • min_blast_coverage, default 5 <- minimal coverage between consensus and reference TEs (for BlastX and TBlastx)
  • min_profiles_coverage, default 19 <- minimal coverage between consensus and profiles
  • min_SSR_coverage, default 0.75 <- minimal percentage of SSR in the consensus
  • limit_job_nb, default 0 <- limits the number of jobs for PASTEC. Each job is a PASTEC process, hence one connection to the database, and at the beginning each PASTEC process retrieves its results from the database.

So, depending on the amount of data in the database and on your computing cluster configuration (allowing, for example, 700 jobs to run at the same time), the MySQL server can be overloaded. You may want to limit the number of simultaneous connections to the MySQL server for PASTEC (0 = no limit).

When you are ready, launch the following command:

  • If you want to use only detection by similarity, you must have run the corresponding previous steps. Please launch the following command: TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map
  • If you want to use only structural detection, you must have run the corresponding previous steps (--struct option). Please launch the following command: TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 6 -m Map --struct
  • If you want to combine results from both means of detection and have run the corresponding previous steps, please launch the following command: TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map --struct
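
The three variants above differ only in the detection options, so if you drive the pipeline from a script, the argument list can be assembled as in the sketch below. The helper name `classification_command` is hypothetical; it only uses the options shown in this tutorial ("-P", "-C", "-S", "-s", "-c", "-m", "--struct") and does not launch anything by itself.

```python
def classification_command(step, similarity=True, structural=False,
                           project="DmelChr4", config="TEdenovo.cfg"):
    """Build the TEdenovo.py argument list for the steps that take these options."""
    if not (similarity or structural):
        raise ValueError("choose at least one means of detection")
    cmd = ["TEdenovo.py", "-P", project, "-C", config, "-S", str(step)]
    if similarity:
        cmd += ["-s", "Blaster", "-c", "GrpRecPil"]
    cmd += ["-m", "Map"]
    if structural:
        cmd.append("--struct")
    return cmd
```

The returned list can be passed to `subprocess.run`, or joined with spaces to print the command before launching it with "nohup".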

In the example, the directory "DmelChr4_Blaster_GrpRecPil_struct_Map_TEclassif/classifConsensus" is created, containing the output files from the different programs that have been launched. Details about the classification are in "DmelChr4_withoutRedundancy.classif_stats.txt" and "DmelChr4_withoutRedundancy_WickerH.classif". See files and directories tree. MySQL tables were also created along the way. See tables list.

The library of de novo consensus is available in each "classifConsensus" directory under the name "DmelChr4_denovoLibTEs.fa". This file can be used directly in the TEannot pipeline as the library of "reference TEs".

Before using the TEannot pipeline, please read the file "TEannot_tuto.txt". 

Step 7 Filtering

This step filters out the SSRs, and the consensus classified as "noCat" only when they were built from fewer than 10 sequences. Indeed, before using the consensus databank in the TEannot pipeline, you may want to filter the consensus sequences. For instance, you may want to remove the consensus classified as SSR, HostGene, chimeric and noCat.

To filter the consensus classified as "noCat" only when they were built from fewer than 10 sequences, we use the "MSA program number". This number, found in the header of each consensus after the name of the MSA program, corresponds to the number of sequences in the multiple alignment from which the consensus was derived.

Adjustable parameters:

Edit the configuration file TEdenovo.cfg section [filter_consensus] if you need to change the default parameters.

  • "filter_SSR: yes" <- if set to yes, filter SSRs using parameter below
  • "length_SSR: 0" <- length below which a SSR is filtered (e.g. 300, default=0 : all SSR are filtered)
  • "filter_noCat: yes" <- if set to yes, filter consensus classified as noCat using parameter below
  • "filter_noCat_max_fragments: 10" <- minimum number of sequences in the MSA from which the noCat consensus has been built (default=0 :avoid)
  • "filter_host_gene: no" <- if set to yes, filter host genes
  • "filter_potential_chimeric: no" <- if set to yes, filter consensus classified as potentially chimeric
  • "filter_incomplete: no" <- if set to yes, filter consensus classified as incomplete
  • "filter_rDNA: no" <- if set to yes, filter consensus classified as rDNA
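
The default behaviour (SSRs and small noCat consensus removed) can be sketched as below. This is a hedged illustration, not the pipeline's code: the function names are hypothetical, only the SSR and noCat filters are covered, and it assumes that the number of aligned sequences is the number following "-Map" in the header and that a noCat consensus is removed when this number is strictly below the threshold.

```python
import re


def msa_sequence_count(header, msa_program="Map"):
    """Return the number following the MSA program name in a header, or None."""
    matches = re.findall("-" + re.escape(msa_program) + r"(\d+)", header)
    return int(matches[-1]) if matches else None


def is_filtered(header, length, filter_ssr=True, length_ssr=0,
                filter_nocat=True, nocat_max_fragments=10):
    """Return True if a consensus would be removed by the SSR or noCat filter."""
    code = header.split("_", 1)[0].split("-")[0]
    if code == "SSR" and filter_ssr:
        # length_ssr == 0 means that all SSRs are filtered.
        return length_ssr == 0 or length < length_ssr
    if code == "noCat" and filter_nocat:
        count = msa_sequence_count(header)
        return count is not None and count < nocat_max_fragments
    return False
```

For example, "noCat_ProjectName-B-R878-Map8" is removed (8 sequences), whereas a noCat consensus built from 20 sequences is kept.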

When you are ready, launch the following command:

TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 7 -s Blaster -c GrpRecPil -m Map

If step 6 has been launched with the "--struct" option, please add "--struct" to your command line.

In the example, the directory "DmelChr4_Blaster_*_Map_TEclassif_Filtered" is created, containing the output file "DmelChr4_denovoLibTEs_filtered.fa", which can be used directly in the TEannot pipeline. See files and directories tree.

Before using the TEannot pipeline, please READ the file "TEannot_tuto.txt".

Step 8 Consensus clustering

For the last step, it is useful to investigate the relationships among the de novo consensus that have been built, by grouping them into clusters (i.e. "TE families"). This step launches Blastclust or MCL according to the "-f" option.

Adjustable parameters: Edit the configuration file TEdenovo.cfg section [cluster_consensus] if you need to change the default parameters.

  • "Blastclust_identity: 0" <- Score coverage threshold (bit score / length if < 3.0, percentage of identities otherwise)
  • "Blastclust_coverage: 80" <- length coverage threshold above which consensus are regrouped in the same cluster   
  • "MCL_inflation: 1.5" <-  Low inflation leads to coarser clusterings, high inflation leads to fine-grained clusterings.
  • "MCL_coverage: 0.0" <- length coverage threshold above which consensus are regrouped in the same cluster

When you are ready, launch the following command:  

  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f Blastclust
  • TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f MCL

If the step 7 has been launched with the "--struct" option, please add "--struct" to your command line.

In the example, either the directory "DmelChr4_Blaster_*_Map_TEclassif_Filtered_Blastclust" is created, containing the output file "DmelChr4_denovoLibTEs_filtered_Blastclust.fa", or the directory "DmelChr4_Blaster_*_Map_TEclassif_Filtered_MCL" is created, containing the output file "DmelChr4_denovoLibTEs_filtered_MCL.fa". The cluster number of each consensus appears in the sequence headers of this file.

Example: DTX-comp-chim_Blc10_ProjectName-B-G78-Map6_reversed. This consensus name means:

  • DTX-comp-chim : it is a complete TIR, potentially chimeric (in step 6, PASTEC found another possible classification: see "DmelChr4_*_withoutRedundancy_WickerH.classif" for details)
  • Blc10 : it belongs to the cluster 10 from blastclust (step 8)
  • ProjectName-B-G78-Map6 : it comes from Blaster (step 2), group 78 from Grouper (step 3), and was built from 6 sequences aligned by Map (step 4)
  • reversed : this sequence was "reverse-complemented" (step 6), because the strand detected by PASTEC was "reverse" ("-").
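
The decomposition above can be automated, for instance to tabulate families. The sketch below is an illustration under assumptions: the function name and the returned keys are hypothetical, the cluster tag is assumed to be letters followed by a number (as in "Blc10"), and the last three dash-separated fields of the name are assumed to be the detection program, the group and the MSA tag.

```python
import re


def decode_family_header(header):
    """Decode a step 8 header such as the example discussed above."""
    reversed_strand = header.endswith("_reversed")
    if reversed_strand:
        header = header[: -len("_reversed")]
    classification, cluster_tag, origin = header.split("_", 2)
    cluster = re.fullmatch(r"([A-Za-z]+)(\d+)", cluster_tag)
    project, detection, group, msa_tag = origin.rsplit("-", 3)
    msa = re.fullmatch(r"([A-Za-z]+)(\d+)", msa_tag)
    if cluster is None or msa is None:
        raise ValueError("unexpected header format: " + header)
    return {
        "classification": classification,
        "cluster_method": cluster.group(1),
        "cluster_number": int(cluster.group(2)),
        "project": project,
        "detection": detection,
        "group": group,
        "msa_program": msa.group(1),
        "nb_sequences": int(msa.group(2)),
        "reversed": reversed_strand,
    }
```

Applied to the example header, it returns cluster number 10, group "G78" and 6 aligned sequences, with the "reversed" flag set.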

WARNING: the "reversed" tag does not appear in the consensus names of the file "DmelChr4_*_withoutRedundancy_WickerH.classif". This file can be used as the library of "reference TEs" in the TEannot pipeline. See files and directories tree.

Before using the TEannot pipeline, please READ "TEannot_tuto.txt".

Update: 17 Aug 2017
Creation date: 12 Mar 2013