
REPET practical course

Contents

 

  • 1 Practical course: Transposable Elements identification with The REPET package
    • 1.1 Run the REPET pipelines
      • 1.1.1 Setup The REPET package environment
      • 1.1.2 Start TEdenovo pipeline
        • 1.1.2.1 Alternatively, you can launch the TEdenovo pipeline step by step:
      • 1.1.3 Post TEdenovo pipeline
        • 1.1.3.1 Parse MCL clustering results (TEdenovo step 8): create a list (tabulated file) with 2 columns "Cluster_id TE_id"
        • 1.1.3.2 Get all the annotations done by PASTEC (TEdenovo, step 5) on the Consensus
        • 1.1.3.3 Get the multiple-alignment used to build the consensus
      • 1.1.4 Start TEannot pipeline
        • 1.1.4.1 Alternatively, you can launch the TEannot.py pipeline step by step:
      • 1.1.5 Post TEannot pipeline
        • 1.1.5.1 TEdenovo consensus library classification corresponding to Chig_refTEs.fa
        • 1.1.5.2 Concatenate all gff files of genome annotation in one
        • 1.1.5.3 Compute statistics of TE genome annotation
        • 1.1.5.4 Compute and plot the consensuses coverage
        • 1.1.5.5 Select consensus for the second round of TEannot
    • 1.2 Results analysis
      • 1.2.1 TEdenovo (and post TEdenovo) most interesting output files
        • 1.2.1.1 TEdenovo output directories
        • 1.2.1.2 TEdenovo consensus library
        • 1.2.1.3 TEdenovo consensus library after filtering of “noCat” consensus built using less than 10 copies and consensus classified as SSR – This library is used as input of TEannot pipeline
        • 1.2.1.4 Classification of TEdenovo consensus library (All consensuses including SSR and noCat built with less than 10 HSPs) according to Wicker classification nomenclature
        • 1.2.1.5 Classification statistics (All consensuses including SSR and noCat built with less than 10 HSPs)
        • 1.2.1.6 MCL clustering output files
      • 1.2.2 TEannot (and post TEannot) most interesting output files
        • 1.2.2.1 TEannot output directories
        • 1.2.2.2 Genome annotation file
        • 1.2.2.3 Classification of TEdenovo consensus library corresponding to Chig_refTEs.fa
        • 1.2.2.4 Genome annotation global statistics file
        • 1.2.2.5 TE annotation statistics per consensus
    • 1.3 Annexes
      • 1.3.1 Additional commands
  • 2 Practical course: Manual curation of the transposable elements library
    • 2.1 Compilation of consensus information : classification, genome annotation statistics, MCL clustering
    • 2.2 Consensus annotation (from PASTEC classifier) using IGV genome browser
    • 2.3 Display the multiple alignment of HSPs used to build the consensus using Jalview
    • 2.4 Plot genome copies related to a consensus

Practical course: Transposable Elements identification with The REPET package

Use case: 4th Chromosome of Arabidopsis thaliana

Run the REPET pipelines

Setup The REPET package environment

  • Connect to the virtual machine containing the REPET installation:
ssh -XY -p $port centos@localhost
  • Your home directory is "/home/centos" by default.
  • To start a new project, create a folder named after the project, here « ThalChr4 »:
mkdir ThalChr4
  • Move into the project folder:
cd ThalChr4
  • Check the database parameters in the « setEnv.sh » configuration file:
more ~/data/setEnv.sh

export REPET_HOST="localhost"
export REPET_USER="orepet"
export REPET_PW="repet_pw"
export REPET_DB="repet"
export REPET_PORT="3306"
export REPET_PATH="/usr/local/REPET_linux-x64-2.5"
export PYTHONPATH=$REPET_PATH
export REPET_JOBS=MySQL
export REPET_JOB_MANAGER=slurm
export REPET_QUEUE=slurm
export SMART_PATH=$REPET_PATH/SMART/Java/Python
export PATH=$SMART_PATH:$REPET_PATH/bin:$PATH
...

  • Source the environment before launching a REPET pipeline:
. ~/data/setEnv.sh
  • Test the connection to the MySQL database:
mysql -h $REPET_HOST -u $REPET_USER -p$REPET_PW $REPET_DB
  • Exit the database:
quit
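
Before opening the MySQL session, it can help to confirm that sourcing « setEnv.sh » actually exported the variables listed above. The following is a minimal sketch, not part of REPET: the function name check_repet_env is hypothetical, and the list of variables is simply the one shown in the setEnv.sh excerpt.

```shell
# Hypothetical helper (not shipped with REPET): report REPET variables
# that are missing from the environment after sourcing setEnv.sh.
check_repet_env() {
    status=0
    for name in REPET_HOST REPET_USER REPET_PW REPET_DB REPET_PORT \
                REPET_PATH REPET_JOBS REPET_JOB_MANAGER; do
        # printenv only sees exported variables, i.e. the ones that
        # child processes such as the REPET scripts will inherit
        if [ -z "$(printenv "$name")" ]; then
            echo "not set: $name"
            status=1
        fi
    done
    return "$status"
}
```

Typical use would be to run ". ~/data/setEnv.sh" and then "check_repet_env"; the function prints nothing and returns 0 when every variable is set.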

Start TEdenovo pipeline

  • Create a directory to launch TEdenovo
mkdir TEdenovo; cd TEdenovo
  • Make a symbolic link (ln -s) to the input fasta file of the genomic sequences. The genome fasta file must be named “project_name.fa”, here ThalChr4.fa:
ln -s ~/data/TA_Chr4.fa ThalChr4.fa
  • Make symbolic links (ln -s) to the databanks used in similarity-based classification:
ln -s ~/data/ProfilesBankForREPET_Pfam27.0_GypsyDB.hmm
ln -s ~/data/repbase20.05_aaSeq_cleaned_TE.fsa
ln -s ~/data/repbase20.05_ntSeq_cleaned_TE.fsa
ln -s ~/data/rRNA_Eukaryota.fsa
  • Copy the configuration file « TEdenovo.cfg » into your TEdenovo working directory (the original TEdenovo.cfg is available at “$REPET_PATH/config/TEdenovo.cfg”):
cp ~/data/TEdenovo.cfg ./

  • Check that the configuration file is properly filled in before launching TEdenovo:

gedit TEdenovo.cfg >/dev/null 2>&1 &

[repet_env]
repet_version: 2.5
repet_host: localhost
repet_user: orepet
repet_pw: repet_pw
repet_db: repet
repet_port: 3306
repet_job_manager: slurm
[project]
project_name: ThalChr4
project_dir: /home/centos/ThalChr4/TEdenovo

[detect_features]

TE_BLRn: yes
TE_BLRtx: yes
TE_nucl_bank: repbase20.05_ntSeq_cleaned_TE.fsa
TE_BLRx: yes
TE_prot_bank: repbase20.05_aaSeq_cleaned_TE.fsa
TE_HMMER: yes
TE_HMM_profiles: ProfilesBankForREPET_Pfam27.0_GypsyDB.hmm

rDNA_BLRn: yes
rDNA_bank: rRNA_Eukaryota.fsa
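
The configuration above refers to the databanks by file name, so each name has to match a file or link in the working directory, and the genome file has to be named after project_name. A small sketch of such a check follows; cfg_get and check_tedenovo_inputs are hypothetical helper names, and the sketch assumes the simple "key: value" layout shown in the excerpt and that it is run from the TEdenovo working directory.

```shell
# Hypothetical helper: print the value of "key: value" from a config file.
# $1 = config file, $2 = key
cfg_get() {
    sed -n "/^$2:/{s/^$2:[[:space:]]*//;s/[[:space:]]*\$//;p;}" "$1" | head -n 1
}

# Hypothetical helper: check that the genome file and the banks named in
# the config exist in the current directory.
check_tedenovo_inputs() {
    cfg="$1"
    status=0
    project=$(cfg_get "$cfg" project_name)
    # The genome fasta must be named <project_name>.fa
    set -- "$project.fa"
    for key in TE_nucl_bank TE_prot_bank TE_HMM_profiles rDNA_bank; do
        value=$(cfg_get "$cfg" "$key")
        if [ -n "$value" ]; then
            set -- "$@" "$value"
        fi
    done
    for file in "$@"; do
        # -e follows symbolic links, so a dangling link counts as missing
        if [ ! -e "$file" ]; then
            echo "missing: $file"
            status=1
        fi
    done
    return "$status"
}
```

For this course the call would be "check_tedenovo_inputs TEdenovo.cfg"; any line starting with "missing:" points at a link to recreate or a file name to correct in the configuration.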

  • The TEdenovo pipeline consists of 8 steps that can be launched with a single command line:
nohup launch_TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -f MCL >& TEdenovo.log &

-P: project name
-C: configuration file
-f: clustering program used to group consensuses into families

  • Useful commands to follow the progress of steps

- Job status (under Slurm):

squeue

- Log files, for example:

more TEdenovo.log
tail TEdenovo.log
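
Since the pipeline runs in the background, a quick scan of the log can complement more and tail. The sketch below is only an illustration: scan_log is a hypothetical helper, and the keywords it searches for are an assumption about how failures appear in the log, not a documented REPET log format.

```shell
# Hypothetical helper: print suspicious lines, with their line numbers,
# from a log file. The keyword list is an assumption; adjust it to what
# your logs actually contain.
scan_log() {
    grep -n -i -E 'error|exception|traceback' "$1"
}
```

For example, "scan_log TEdenovo.log" prints the matching lines, and returns a non-zero status when none of the keywords is found.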

Alternatively, you can launch the TEdenovo pipeline step by step:

nohup TEdenovo.py -P name -C config.cfg -S step -[specific-step-param]

[Figure: TEdenovo steps 1 and 2]

  • Step 1: Genomic sequences are cut and grouped into batches
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 1 >& runS1.log &
  • Step 2: The genome is aligned to itself using BLAST
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 2 -s Blaster >& runS2.log &

 

[Figure: TEdenovo step 3]

  • Step 3: The repetitive HSPs (high-scoring segment pairs) from BLAST are clustered by Recon, Grouper and/or Piler
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Grouper >& runS3G.log &
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Recon >& runS3R.log &
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Piler >& runS3P.log &

[Figure: TEdenovo step 4]

  • Step 4: A multiple alignment is computed for each cluster, and a consensus sequence is derived from each multiple alignment
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map >& runS4G.log &
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map >& runS4R.log &
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map >& runS4P.log &

[Figure: TEdenovo steps 5, 6 and 7]

  • Step 5: Specific features are detected on each consensus, such as structural features or homology with known TEs, HMM profiles or host genes
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map >& runS5.log &

MySQL tables are created; they contain the evidence for consensus annotation used by the PASTEC classifier.

  • Step 6: The consensuses are classified according to Wicker's TE classification
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map >& runS6.log &
  • Step 7: SSRs and under-represented unclassified ("noCat") consensuses are filtered out
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 7 -s Blaster -c GrpRecPil -m Map >& runS7.log &
  • Step 8: The consensuses are clustered into families to facilitate manual curation using Blastclust or MCL
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f MCL >& runS8.log &
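
The step-specific options used in the commands above follow a regular pattern, which can be summarised as a small dry-run script. This is a sketch only: tedenovo_step_opts and tedenovo_plan are hypothetical helper names, the options are copied from the step-by-step commands of this course, and the script prints the commands rather than running them.

```shell
# Hypothetical helper: print the options that follow "-S <step>" for one
# TEdenovo step, as used in the step-by-step commands of this course.
# $1 = step number, $2 = clustering method (only used for steps 3 and 4)
tedenovo_step_opts() {
    case "$1" in
        1)     echo "" ;;
        2)     echo "-s Blaster" ;;
        3)     echo "-s Blaster -c $2" ;;
        4)     echo "-s Blaster -c $2 -m Map" ;;
        5|6|7) echo "-s Blaster -c GrpRecPil -m Map" ;;
        8)     echo "-s Blaster -c GrpRecPil -m Map -f MCL" ;;
        *)     echo "unknown step: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print (dry run) one command per step and, for
# steps 3 and 4, one command per clustering method.
# $1 = project name, $2 = configuration file
tedenovo_plan() {
    project="$1"
    cfg="$2"
    for step in 1 2 3 4 5 6 7 8; do
        case "$step" in
            3|4) methods="Grouper Recon Piler" ;;
            *)   methods="-" ;;   # placeholder, ignored by the other steps
        esac
        for method in $methods; do
            opts=$(tedenovo_step_opts "$step" "$method")
            # ${opts:+ $opts} adds a leading space only when opts is not empty
            echo "TEdenovo.py -P $project -C $cfg -S $step${opts:+ $opts}"
        done
    done
}
```

Calling "tedenovo_plan ThalChr4 TEdenovo.cfg" lists the twelve commands in order, which makes it easy to see which step to relaunch after an interruption. Each step depends on the output of the previous one, so the commands should be run one after another, not all at once.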
Update: 22 Jul 2021
Creation date: 13 Jul 2021