
How to use URGI high-performance computing cluster

HPC Cluster

GNU/Linux system: Rocks Clusters

Job scheduler: Sun/Oracle Grid Engine

Power: 864 Intel Xeon cores


Cluster basics

The INRA URGI lab owns a computer cluster named "Saruman". Unlike standalone computers, a cluster is composed of many nodes:

  • 1 submission host: this is where you log in
  • 75 execution hosts: they run your jobs
  • 1 master host: sauron, which rules them all

On a cluster, the best practice is to write a job file and submit it to the cluster engine: Grid Engine.

Jobs wait for computing resources to become available, then they are scheduled to run on one execution host. Each job has a unique ID number.

When a job ends, you will find new files in your current directory (see the example after this list):

  • the .o file: it contains the results that would have been printed on your screen on a standalone computer
  • the .e file: usually empty, it may contain error messages explaining why a job has failed
  • other files: depending on the job, the process can create other result files
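For example, if a hypothetical job file named myjob were given the job ID 12345, you would find something like this once it ends (the job name and ID here are made up for illustration; a real example follows below):

[sreboux@saruman ~]$ ls
myjob  myjob.e12345  myjob.o12345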

Connect to saruman

To obtain a command-line shell on saruman, please connect with SSH to saruman.versailles.inra.fr

Remember that you must send us the IP address of your computer beforehand.
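For example, assuming your login is jdoe (replace it with your own account name), from a terminal on your computer:

[jdoe@mycomputer ~]$ ssh jdoe@saruman.versailles.inra.fr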

 

Sample NCBI BLAST job

First create a sample query sequence:

[sreboux@saruman ~]$ vi seq.fsa
>sample
TTATCCACAGATTTGTTCTTTACTAATAATAATAGTAATTATTATTTTTTATTTTTTTTA

Choose a database from our database list, for example UniVec.

Write the command line in a file, just as you would have typed it on a standalone computer:

[sreboux@saruman ~]$ vi myblastjob
blastall -p blastn -i seq.fsa -d UniVec

Submit your job with qsub:

[sreboux@saruman ~]$ qsub myblastjob
Your job 7687195 ("myblastjob") has been submitted

Monitor your job with qstat: the first state is qw (queue wait). In qw, your job waits for computing resources to become free. If all slots are used by other jobs, you have to wait.

[sreboux@saruman ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
7687195 0.00000 myblastjob sreboux      qw    04/20/2011 16:19:04                                    1

When enough resources (RAM, CPU, slots, etc.) are free for your job, Grid Engine routes it to an appropriate execution host, where it runs. This is the r state:

[sreboux@saruman ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
7687195 0.06250 myblastjob sreboux      r     04/20/2011 16:19:12 all.q@compute-2-1.local            1

You can disconnect from saruman and reconnect later if your job takes a long time to run.
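If you want more details about a job after reconnecting (for example, why it is still waiting), you can ask Grid Engine with qstat -j followed by the job ID:

[sreboux@saruman ~]$ qstat -j 7687195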

Your job is finished when qstat no longer shows its job ID:

[sreboux@saruman ~]$ qstat
[sreboux@saruman ~]$ 

Check for errors in the .e file (an empty .e file means no error messages, which is good):

[sreboux@saruman ~]$ cat myblastjob.e7687195
[sreboux@saruman ~]$

Get the blast results in the .o file:

[sreboux@saruman ~]$ cat myblastjob.o7687195
BLASTN 2.2.21 [Jun-14-2009]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= sample
         (60 letters)

Database: UniVec
           2861 sequences; 660,151 total letters

Searching..................................................done



                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value
gnl|uv|L08860.1:1-2910-49 pC194 cloning vector for Bacillus subt...    34   0.002
[...]
>gnl|uv|L08860.1:1-2910-49 pC194 cloning vector for Bacillus subtillis
          Length = 2959

 Score = 34.2 bits (17), Expect = 0.002
 Identities = 17/17 (100%)
 Strand = Plus / Plus


Query: 25   aataataatagtaatta 41
            |||||||||||||||||
Sbjct: 1728 aataataatagtaatta 1744

[...]

 

Grid Engine can schedule and run many jobs at once; this is the power of an HPC cluster.
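For example, instead of submitting many nearly identical job files, you can use a standard Grid Engine array job: with -t, a single qsub creates several tasks of the same job, and each task finds its own index in the SGE_TASK_ID environment variable. The sketch below assumes query files named seq.1.fsa to seq.10.fsa; adapt it to your own data:

[sreboux@saruman ~]$ vi myarrayjob
blastall -p blastn -i seq.$SGE_TASK_ID.fsa -d UniVec
[sreboux@saruman ~]$ qsub -t 1-10 myarrayjob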

 

Quick test jobs

You can run quick test jobs by requesting the test resource with -l:

[sreboux@saruman ~]$ qsub -l test test.sh
Your job 6126482 ("test.sh") has been submitted
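Here test.sh is an ordinary job file, just like myblastjob above; for instance it could simply contain a short command such as (the content below is only an illustration):

[sreboux@saruman ~]$ cat test.sh
blastall -p blastn -i seq.fsa -d UniVec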

Test jobs have a higher priority than other jobs: they are scheduled at the top of the wait queue, which means they will start with little or no waiting.

Test jobs must run for less than 10 minutes. Longer test jobs will be forcibly terminated by Grid Engine.

 

Memory-intensive jobs

On the saruman cluster, some nodes have more memory (RAM) than others: 16 GB, 48 GB and 96 GB. Memory is a resource: just request the memory you need and the cluster will automatically schedule your job on an appropriate node.
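If you are curious about the execution hosts and their memory, the standard Grid Engine command qhost lists every node with its total and used memory:

[sreboux@saruman ~]$ qhost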

For example, consider an NCBI BLAST job that needs 30 GB of RAM to complete successfully. You can request the mem_free resource with -l:

[sreboux@saruman ~]$ qsub -l mem_free=30G bigmem.sh
Your job 6126483 ("bigmem.sh") has been submitted

By requesting the resource mem_free=30G, you are sure that your job will run on a node that has at least 30 GB of free RAM. NB: mem_free does not limit the amount of RAM your job can consume.
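If you prefer, the same request can be written inside the job file itself with the standard Grid Engine #$ directive syntax, so you do not have to remember the qsub option; the content of bigmem.sh below is only an illustration:

[sreboux@saruman ~]$ cat bigmem.sh
#$ -l mem_free=30G
blastall -p blastn -i bigquery.fsa -d UniVec
[sreboux@saruman ~]$ qsub bigmem.sh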

There are other interesting requestable resources in Grid Engine; use qconf -sc to list them.

 

Long processes

On the saruman cluster, jobs cannot run more than 48 hours. Jobs that run longer than 48h are automatically killed.

Longer jobs cannot use Grid Engine; you are invited to run your very long process directly on the saruman host, without Grid Engine and without qsub. You can monitor such a process with top.
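A common way to do this is to start the process with nohup so that it keeps running after you log out, then watch it with top (nohup is a standard Linux command, not a Grid Engine feature; mylongscript.sh is a placeholder for your own program):

[sreboux@saruman ~]$ nohup ./mylongscript.sh > mylongscript.log 2>&1 &
[sreboux@saruman ~]$ top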

The saruman host has 96 GB of memory (RAM) and 24 CPUs; please remember that you are not alone and that you have to share these resources with other users. We monitor saruman usage and may have to kill abnormal processes.

Please note that running a process without qsub limits you to a single server; you will not benefit from the power of the cluster this way.

 

Interactive or graphical processes

If your process is interactive or has a graphical user interface, we also advise you to run it directly on saruman.
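For a graphical user interface, you can forward the display to your own screen with standard SSH X11 forwarding (assuming an X server runs on your computer; replace jdoe with your own login):

[jdoe@mycomputer ~]$ ssh -X jdoe@saruman.versailles.inra.fr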

 

More info

There is a lot to learn about Grid Engine; please refer to the manual pages. A good starting point is:

[sreboux@saruman ~]$ man sge_intro
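Other manual pages worth reading on a standard Grid Engine installation:

[sreboux@saruman ~]$ man qsub
[sreboux@saruman ~]$ man qstat
[sreboux@saruman ~]$ man qconf
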
Update: 01 Oct 2018
Creation date: 20 Apr 2011