
PASTEClassifier tuto

This tutorial describes PASTEC (Pseudo Agent System of Transposable Element Classification), helps users set its parameters, and gives the command lines. It is included in the PASTEClassifier-1.0 delivery.

Note: PASTEC is the Transposable Elements classifier of the REPET package, used in the TEdenovo pipeline; the corresponding standalone tool is PASTEClassifier.py.
A parallelized version for computing clusters also exists as the standalone tool PASTEClassifier_parallelized.py.

The PASTEClassifier (also PASTEClassifier_parallelized) follows two main steps:

  • step 1: features detection
  • step 2: classification

After that, other processes are launched:

  • compute some statistics about the classification
  • compute reverse complement of the nucleotide sequences, if the classification process considers these sequences in negative strand (-r option)
  • rename headers according to Wicker's code (-w option)


From now on, the project name is "ProjectName".

Setup your working environment

Set environment variables
REPET_PATH gives the absolute path to the srcDirectory where PASTEC has been installed.
    export REPET_PATH=$HOME/<srcDirectory>/repet_pipe/
Add the path towards REPET programs to your path:
    export PATH=$REPET_PATH/bin:...:$PATH

The config directory contains a template shell script, setEnv.sh, in which these 2 environment variables can be set; edit it, then source it.
   
Create your project directory (for instance "ProjectName") and go into it:

  •   cd $HOME/work/
  •   mkdir ProjectName
  •   cd ProjectName

Copy into it the input fasta file containing the genomic sequences suspected to be transposable elements (TEs).
Format your fasta file with at most 60 bp per line for each sequence.
Sequence headers should preferably be written like this: ">XX_i", with XX standing for letters and i standing for numbers.
Avoid spaces (" ") and symbols such as "=", ";", ":", "|"...
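
The formatting advice above can be applied with a small script. The following is only a sketch and not part of the PASTEClassifier delivery: the function names and the header pattern (letters, underscore, digits) are assumptions made for illustration. It rewraps each sequence at 60 bp per line and reports the headers that do not follow the advised ">XX_i" form.

```python
import re

# Assumed reading of the advised ">XX_i" form: letters, underscore, digits.
HEADER_PATTERN = re.compile(r"^[A-Za-z]+_[0-9]+$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reformat_fasta(lines, width=60):
    """Return the rewrapped fasta text and the list of unadvised headers."""
    out, suspicious = [], []
    for header, seq in read_fasta(lines):
        if not HEADER_PATTERN.match(header):
            suspicious.append(header)
        out.append(">" + header)
        # Cut the sequence into pieces of at most `width` characters.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n", suspicious
```

The function takes any iterable of lines, so an open file object can be passed directly; headers are only reported, never modified, so that renaming stays a deliberate choice.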

Copy the configuration file:

  •   cp $HOME/<srcDirectory>/config/PASTEClassifier.cfg .
  •   for the parallelized version: cp $HOME/<srcDirectory>/config/PASTEClassifier_parallelized.cfg .

Edit the configuration file "PASTEClassifier.cfg" (or PASTEClassifier_parallelized.cfg) in order to adapt it to your personal situation.

  • In the section [repet_env] , indicate (ask your system administrator):
    • the host name of your MySQL database
    • your MySQL login
    • your MySQL password
    •  the name of your MySQL database
  • In the section [project] , indicate:
    • the name of your project (here: ProjectName)
    • the absolute path to your project directory (here: $HOME/work/ProjectName)
  • In the sections [detect_features] and [classif_consensus] , indicate:
    • 'clean: yes' to remove some temporary files.
    • For the parallelized version, 'resources' (optional): depending on your data, you may need specific resources (e.g. "mem_free=8G" if you need 8G of memory per job).
    • For the parallelized version, 'tmpDir' (optional): depending on the computing cluster, give the name of the temporary directory of the nodes (e.g. "/scratch").
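
Before launching a run, it can help to check that the edited file still contains the four sections named above. This is a minimal sketch, assuming the configuration file can be read by Python's standard configparser module (the "name: value" syntax shown in this tutorial is compatible with it); the function name is hypothetical and no option name inside the sections is assumed.

```python
import configparser

# Section names taken from this tutorial.
REQUIRED_SECTIONS = ["repet_env", "project", "detect_features", "classif_consensus"]


def missing_sections(config_text):
    """Return the required sections that are absent from the configuration text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(config_text)
    return [name for name in REQUIRED_SECTIONS if not parser.has_section(name)]
```

For a file on disk, the text can be obtained with open("PASTEClassifier.cfg").read(); an empty list means that all four sections are present.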

Run PASTEClassifier

The standard output is largely self-explanatory.
To avoid killing the main process of the pipeline when you disconnect from your session, it is highly advised to use the Unix command "nohup".
This program keeps a command running even if the session is disconnected or the user logs out.
For more details, read the manual ("$ man nohup").

Command Line :
To launch all steps of PASTEClassifier.py

  • nohup PASTEClassifier.py -i inputFile.fa -C PASTEClassifier.cfg >& PASTEC.log &

It is possible to launch the 2 PASTEClassifier steps one by one:

  • nohup PASTEClassifier.py -i inputFile.fa -C PASTEClassifier.cfg -S 1 >& PASTEC_step1.log &
  • and then  nohup PASTEClassifier.py -i inputFile.fa -C PASTEClassifier.cfg -S 2 >& PASTEC_step2.log &

Note: step 2 needs the results of step 1, so launching step 2 without having run step 1 causes an error.

For the parallelized version:
To launch all steps of PASTEClassifier_parallelized.py

  • nohup PASTEClassifier_parallelized.py -i inputFile.fa -C PASTEClassifier_parallelized.cfg >& PASTEC.log &

It is possible to launch the 2 PASTEC steps one by one:

  • nohup PASTEClassifier_parallelized.py -i inputFile.fa -C PASTEClassifier_parallelized.cfg -S 1 >& PASTEC_step1.log &
  • and then  nohup PASTEClassifier_parallelized.py -i inputFile.fa -C PASTEClassifier_parallelized.cfg -S 2 >& PASTEC_step2.log &

Note: step 2 needs the results of step 1, so launching step 2 without having run step 1 causes an error.
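
Because step 2 depends on step 1, the two steps can also be chained from a small wrapper that stops as soon as one step fails. The sketch below is not part of the delivery: the function names are hypothetical, only the -i, -C and -S options shown above are used, and it assumes that the programs return a non-zero exit status on failure, which this tutorial does not state.

```python
import subprocess


def build_command(input_fasta, config_file, step=None, parallelized=False):
    """Build the argument list for one PASTEClassifier call."""
    program = "PASTEClassifier_parallelized.py" if parallelized else "PASTEClassifier.py"
    command = [program, "-i", input_fasta, "-C", config_file]
    if step is not None:
        command += ["-S", str(step)]
    return command


def run_steps(input_fasta, config_file, parallelized=False):
    """Run step 1 then step 2, each with its own log file."""
    for step in (1, 2):
        command = build_command(input_fasta, config_file, step, parallelized)
        with open("PASTEC_step%d.log" % step, "w") as log:
            # check=True raises CalledProcessError on a non-zero exit status,
            # so step 2 is never started after a failed step 1 (assumption above).
            subprocess.run(command, stdout=log, stderr=subprocess.STDOUT, check=True)
```

Such a wrapper would itself be launched with nohup, in the same way as the commands above.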

Step 1 : "Features Detection"

PASTEClassifier looks for structural features, such as terminal repeats (termR), polyA tails, tandem repeats (to detect SSR and polyA tails) and open reading frames (ORF).
It also looks for homology with known TEs, host genes and rDNA using blast, and with HMM profiles using hmmer.
You can choose the programs to launch in the configuration file PASTEClassifier.cfg (see below), but be careful about their dependencies (blast, hmmer, banks).
It is highly advised to read the "Banks" part, at the end of the tutorial, before using any bank.
Make sure you copy (or soft link) the databanks into your root project directory (here: $HOME/work/ProjectName) and indicate the file name of each databank in the configuration file.
Adjustable parameters in the [detect_features] section:

  • terminal repeats with TRsearch, set "term_rep: yes"
  • tandem repeats with TRF, set "tand_rep: yes"
  • open reading frames with dbORF.py, set "orf: yes"
  • poly-A tails with polyAtail, set "polyA: yes"
  • homology using blast, select the program by setting:
    • "blast: ncbi" for NCBI-BLAST
    •  "blast: blastplus" for NCBI-BLAST+
    •  "blast: wu" for WU-BLAST
  • helitron extremities, set "TE_BLRn: yes" and write the file name of the known TE bank in "TE_nucl_bank:" (nucleotide sequences)
  • homology with known TEs using tblastx, set "TE_BLRtx: yes" and write the file name of the known TE bank in "TE_nucl_bank:" (nucleotide sequences)
  • homology with known TEs using blastx, set "TE_BLRx: yes" and write the file name of the known TE bank in "TE_prot_bank:" (amino-acid sequences)
  • homology with host genes, set "HG_BLRn: yes" and write the file name of the cDNA bank in "HG_nucl_bank:"
  • homology with rDNA, set "rDNA_BLRn: yes" and write the file name of the rDNA bank in "rDNA_bank:"
  • homology with HMM profiles, set "TE_HMMER: yes" and write the file name of the HMM profiles bank in "TE_HMM_profiles:". You can also change the E-value threshold used by hmmer.
  • tRNA with TRNAscanSE, set "tRNA_scan: yes". WARNING: these results are not used for the classification. This functionality is still in exploration.
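
Each homology search in the list above depends on a bank parameter. As an illustration only, the sketch below checks that every search set to "yes" has its bank parameter filled in. It assumes that the configuration file can be read by Python's configparser module and that an unset bank is either empty or still holds a template placeholder starting with "<"; the function name is hypothetical.

```python
import configparser

# Search option -> bank option, as listed in this tutorial.
BANK_FOR_SEARCH = {
    "TE_BLRn": "TE_nucl_bank",
    "TE_BLRtx": "TE_nucl_bank",
    "TE_BLRx": "TE_prot_bank",
    "HG_BLRn": "HG_nucl_bank",
    "rDNA_BLRn": "rDNA_bank",
    "TE_HMMER": "TE_HMM_profiles",
}


def searches_without_bank(config_text):
    """Return (search, bank) pairs where the search is enabled but the bank is unset."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # option names such as TE_BLRn are case-sensitive
    parser.read_string(config_text)
    section = parser["detect_features"]
    problems = []
    for search, bank in BANK_FOR_SEARCH.items():
        enabled = section.get(search, "no").strip().lower() == "yes"
        bank_name = section.get(bank, "").strip()
        # Assumption: a value starting with "<" is an unedited template placeholder.
        if enabled and (not bank_name or bank_name.startswith("<")):
            problems.append((search, bank))
    return problems
```

The check does not verify that the bank files exist in the project directory; that test could be added with os.path.exists once the project path is known.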

Results:
A directory containing the result files is created for each selected feature.
In the database, a table is also created for each selected feature; these tables are used in the following step.

Step 2 : "classification"

This step classifies the sequences according to the features detected previously. The classification is made by the PASTEC tool.
For each sequence, PASTEC retrieves its features from the MySQL tables: "structural" features (LTR, TIR, polyA tails, SSR-like tails) and "coding" features (matches with known TEs, host genes, rDNA or HMM profiles).
PASTEC classifies elements based on Wicker's classification (Wicker et al., Nat.Rev.Genet., 2007).
Adjustable parameters in the [classif_consensus] section:
The default values of the following parameters come from our experience with the Drosophila melanogaster genome and from the paper "A unified classification system for eukaryotic transposable elements", Wicker et al., Nat.Rev.Genet., 2007.

  • "max_profiles_evalue: 1e-3" <- only matches on profiles bank below this e-value are kept
  • "min_TE_profiles_coverage: 20"  <- minimal coverage between query sequence and profiles for TEs
  • "min_HG_profiles_coverage: 75"  <- minimal coverage between query sequence and "OTHER" profiles (if no other classification was found)
  • "max_helitron_extremities_evalue: 1e-3" <- matches above this e-value are not considered for the helitron classification
  • "min_TE_bank_coverage: 5" <- min coverage below which a match gets disregarded
  • "min_HG_bank_coverage: 95" <- min coverage above which a sequence is considered as host gene
  • "min_HG_bank_identity: 90" <- min identity above which a sequence is considered as host gene (used in conjunction with the coverage threshold above)
  • "min_rDNA_bank_coverage: 95" <- min coverage above which a sequence is considered as rDNA
  • "min_rDNA_bank_identity: 90" <- min identity above which a sequence is considered as rDNA (used in conjunction with the coverage threshold above)
  • "min_SSR_coverage: 0.75" <- minimal percentage of SSR to consider sequence as SSR
  • "max_SSR_size: 100" <- max size to consider sequence as SSR

    For the parallelized version:

  •     "limit_job_nb: 0" <- limits the number of jobs for PASTEC (0 = no limit). Each job is a PASTEC process, hence one MySQL connection, and each process starts by retrieving results from the database. Depending on the amount of data in the database and on your computing cluster configuration (allowing, for example, 700 jobs to run at the same time), the MySQL server can be overloaded, so you may want to limit the number of simultaneous connections to the MySQL server.
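
The host gene and rDNA thresholds above are each described as a coverage threshold used in conjunction with an identity threshold. The sketch below shows one way to read that rule; it is not PASTEC's actual code. The function name is hypothetical, and treating a value equal to the threshold as passing is an assumption.

```python
def passes_bank_thresholds(coverage, identity, min_coverage, min_identity):
    """Return True when both the coverage and the identity thresholds are met."""
    # Assumption: a value equal to the threshold is accepted.
    return coverage >= min_coverage and identity >= min_identity


# Default host gene thresholds from the list above:
# min_HG_bank_coverage = 95 and min_HG_bank_identity = 90.
strong_match = passes_bank_thresholds(97.0, 92.5, 95, 90)
weak_identity = passes_bank_thresholds(97.0, 85.0, 95, 90)
```

With these defaults, strong_match is True and weak_identity is False: high coverage alone is not enough when the identity stays below its threshold.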

Results:
Details about the classification are in ProjectName.classif (see "Classification output" for details).
Some statistics can be found in ProjectName.classif_stats.txt.
With the option "-r", a fasta file with the reverse-complemented sequences is created: <input file name without extension>_negStrandReversed.fa. The headers of the reverse-complemented sequences have the tag "_reversed" at the end.
With the option "-w", a fasta file with renamed headers is created: <input file name without extension>_WickerH.fa (see "Rename headers" part for details).
The corresponding classification file is also created: ProjectName_WickerH.classif.
If both options are selected, the fasta file name is "<input file name without extension>_negStrandReversed_WickerH.fa".

Classification output
The classification output file has 8 fields:

  • sequence name
  • sequence length
  • strand : "+" or "-" or "."
  • "ok" or "PotentialChimeric":
    • "ok" means that only one classification was found
    • "PotentialChimeric" means that several classifications are possible for this sequence. In this case, the best classification is given according to the confidence index. If no decision is possible, all the classifications are returned in the "order" field (separated by "|").
  • class classification : "I" or "II" or "noCat" or "NA"
  • order classification ("LTR" and/or "TIR" and/or "LINE" and/or "Crypton",...) or "PotentialHostGene" or "rDNA" or "SSR" or "noCat"
  • completeness : "complete" or "incomplete" or "NA"
  • confidence index ("CI=") and evidences. The confidence index is computed according to the evidence found for this classification (the best CI is 100). The evidences are separated in 2 types : structural ("struct=") and homology ("coding="). The evidences unused for the considered classification are in the "other=" section.

In these fields:

  • "noCat" means that no classification was found at this level.
  • "NA" means "not available", according to the information in the "order" field.
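
The 8 fields can be read back with a few lines of Python. This is a sketch under assumptions: the field separator is taken to be a tab, which this tutorial does not state, and the field names and the example line in the usage note are made up for illustration.

```python
from collections import namedtuple

# Hypothetical field names following the order of the 8 fields listed above.
ClassifRecord = namedtuple(
    "ClassifRecord",
    ["name", "length", "strand", "status", "te_class", "order", "completeness", "evidence"],
)


def parse_classif_line(line, sep="\t"):
    """Split one line of a .classif file into its 8 fields (tab separator assumed)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 8:
        raise ValueError("expected 8 fields, found %d" % len(fields))
    record = ClassifRecord(*fields)
    # Several orders can be listed, separated by "|", for potential chimeras.
    return record._replace(length=int(record.length), order=record.order.split("|"))
```

For a made-up line such as "TE_1<tab>1200<tab>+<tab>PotentialChimeric<tab>II<tab>TIR|Helitron<tab>NA<tab>CI=50", the record has length 1200 and two candidate orders. If the real files use another separator, pass it through the sep argument.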

Rename headers
PASTEC can rename the headers according to the classification found for each sequence, using Wicker's code.
The headers follow this general format (with a few variations detailed below):
    {classification code}-{completeness(optional)}-{potential chimeric(optional)}_{consensus name}
Depending on the category:
If the element is a TE:
    {Wicker's classification code}_{consensus name}
    {Wicker's classification code}-{completeness}_{consensus name}
    {Wicker's classification code}-{completeness}-{potential chimeric}_{consensus name}
   
Notes: Completeness can only be specified when the order has been found.
A potential chimeric TE is an element that has characteristics of more than one classification; the selected classification is the most likely one. If there is no majority classification, 'XXX' is indicated with the "chim" tag.
Wicker's classification code has three letters: the first one represents the TE class, the second one the TE order and the third one the TE superfamily.
An 'X' indicates an unknown classification at this level. When no classification is found, the tag is "noCat".

If the element is an SSR or a Host Gene
    SSR_{consensus name}
    PotentialHostGene_{consensus name}
   
If the element is not classified
    noCat_{consensus name}

Classification Examples:
    Complete LTR retrotransposon (superfamily unknown) : RLX-comp_ProjectName-L-B270-Map20
    LARD : RXX-LARD_ProjectName-B-R270-Map3
    Incomplete TIR: DTX-incomp_ProjectName-B-P350-Map5
    Retrotransposon: RXX_ProjectName-L-B28-Map1
    TIR potentially chimeric: DTX-comp-chim_ProjectName-B-G78-Map6
    Host gene: PotentialHostGene_ProjectName-B-R52-Map3
    Not classified: noCat_ProjectName-B-R878-Map8
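
The renamed headers can be split back into their parts. The sketch below follows the general format given above and is not PASTEC code: the function name is hypothetical, and it assumes that the classification prefix never contains "_", so that the first "_" separates it from the consensus name.

```python
def parse_renamed_header(header):
    """Split a renamed header into classification code, tags and consensus name."""
    # Everything before the first "_" is the classification prefix (assumption above).
    prefix, _, consensus_name = header.lstrip(">").partition("_")
    code, *tags = prefix.split("-")
    return {
        "code": code,                    # e.g. "DTX", "SSR", "noCat"
        "tags": tags,                    # e.g. ["comp", "chim"] or ["LARD"]
        "chimeric": "chim" in tags,
        "consensus_name": consensus_name,
    }
```

Applied to the example "DTX-comp-chim_ProjectName-B-G78-Map6", it returns the code "DTX", the tags "comp" and "chim", and the consensus name "ProjectName-B-G78-Map6".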
   
Banks

Known TEs:

You can use Repbase Update, which is a widely used databank of known repeats (Jurka J. et al., Cytogenetic and Genome Research, 2005).
To use it, you will have to register on "www.girinst.org".
Once you are registered, you can download a compressed archive with Repbase Update specifically formatted for REPET.
The archive contains two fasta files. The first one holds nucleotide sequences and is given to BLASTER for blastn (parameter "TE_BLRn: yes") and/or tblastx (parameter "TE_BLRtx: yes"). This bank file name should replace "<bank_of_TE_nucleotide_sequences_such_as_Repbase>" in the "TE_nucl_bank:" parameter.
The other file holds amino-acid sequences and is given to BLASTER for blastx (parameter "TE_BLRx: yes"). This bank file name should replace "<bank_of_TE_amino-acid_sequences_such_as_Repbase>" in the "TE_prot_bank:" parameter.
If you have your own databank of known repeats, you can use it instead of Repbase or concatenate it at the end of Repbase. Take care of the way the sequence headers are formatted.
   
FORMAT : fasta with additional information in headers
Wicker's classification should be added at the end of each header, using ":" as separator. The pattern is "original header", "ClassI" or "ClassII", "order name", "superfamily name". For example, if the original header is ">TEName" and the TE is classified as L1, the new header is:
    >TEName:ClassI:LINE:L1
When the classification is not available at one level, a "?" can be specified (e.g. ">TEName:?:?:?" or ">TEName:ClassI:?:?").
WARNING : the original header should not have any ":".
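
When preparing your own bank, the header rule above can be applied with a helper like the following. This is only a sketch and the function name is hypothetical: it appends the three classification levels, uses "?" for any level that is not available, and rejects original headers containing ":".

```python
def format_bank_header(name, te_class="?", order="?", superfamily="?"):
    """Build a bank header of the form >name:class:order:superfamily."""
    if ":" in name:
        raise ValueError("the original header must not contain ':' (got %r)" % name)
    return ">" + ":".join([name, te_class, order, superfamily])
```

For instance, format_bank_header("TEName", "ClassI", "LINE", "L1") gives ">TEName:ClassI:LINE:L1", and format_bank_header("TEName") gives ">TEName:?:?:?".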

HMM profiles:

You can download the bank ProfilesBankForREPET_Pfam26.0_GypsyDB.hmm at http://urgi.versailles.inra.fr/download/repet/. It comes from the Pfam database (Punta M. et al., Nucleic Acids Research, 2012) and from GypsyDB.
This version is specially formatted for REPET.
WARNING : this bank can only be used with hmmer3.
Set the "TE_HMM_profiles:" parameter to the file name of this bank. HMM profiles can be searched with hmmer2 or hmmer3, depending on the format of the bank. You can also use your own bank, provided that each profile name is formatted as described below.

FORMAT : hmmer (.hmm) with additional information in NAME
According to the Pfam field, the original NAME is replaced by: <ACC>_<NAME>_<type according to key words found in DESC>_<GA>
For example, if the Pfam format is:
    //
    HMMER3/b [3.0 | March 2010]
    NAME  DUF3701
    ACC   PF12482.3
    DESC  Phage integrase protein
    LENG  96
    ALPH  amino
    RF    no
    CS    no
    MAP   yes
    DATE  Tue Sep 27 23:45:10 2011
    NSEQ  38
    EFFN  2.017822
    CKSUM 1455990630
    GA    21.50 21.50;
    ...
the new NAME is:
    PF12482.3_DUF3701_INT_21.5
Details about the TE profile types (and their abbreviations) can be found in the paper of Wicker et al. (see above).
When no information is found in DESC (or NAME) to determine the profile type, the profile is considered not TE specific and <type> is replaced by OTHER. These profiles can be used to classify sequences as "potential host genes".
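
The renaming rule can be sketched as follows for your own profiles. This is not the script used to build the REPET bank: the function name is hypothetical, the profile type is passed in by hand because the key words searched in DESC are not listed here, and using the first GA value without trailing zeros is an assumption based on the single example above.

```python
def formatted_profile_name(acc, name, profile_type, ga_value):
    """Build <ACC>_<NAME>_<type>_<GA> from the fields of one HMM profile."""
    # The GA field looks like "21.50 21.50;": keep the first number (assumption).
    ga = float(ga_value.replace(";", " ").split()[0])
    return "%s_%s_%s_%s" % (acc, name, profile_type, ga)
```

With the Pfam entry shown above, formatted_profile_name("PF12482.3", "DUF3701", "INT", "21.50 21.50;") returns "PF12482.3_DUF3701_INT_21.5". How the bank writes a GA value without decimals is not shown in this tutorial, so that case should be checked against the downloaded bank.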

Host genes:
If you have a bank of cDNA from the host genome, you can specify the file name in "HG_nucl_bank:" parameter.
FORMAT : fasta

rDNA:
If you have a bank of rDNA from eukaryota, you can specify the file name in "rDNA_bank:" parameter.
FORMAT : fasta

Update: 08 Oct 2013
Creation date: 07 Oct 2013