Fgate
Getting Started with Fgate
Fgate can be accessed only via ssh. Log in to the master node with ssh username@fgate.chemistry.gatech.edu. The internal name of the master node is control. From the master node, the cluster nodes can be reached using ssh. Nodes are named c0-0 through c0-15 (1st rack) and c1-0 through c1-31. Note that numbering starts from 0, unlike on Egate. The Fgate nodes are as follows:
| Host Names | Type | Specs | Type of use |
|---|---|---|---|
| control | Head node | Two 2.66 GHz Woodcrests, 4 GB RAM, 145 GB disk | Job submission |
| c0-0 -- c0-7 | Fat nodes | Two 2.66 GHz Woodcrests, 16 GB RAM, 720 GB disk | LSF |
| c0-8 -- c0-15, c1-0 -- c1-31 | Thin nodes | Two 2.66 GHz Woodcrests, 4 GB RAM, 225 GB disk | LSF |
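The node naming scheme above lends itself to a small loop. The following is a minimal sketch in POSIX sh (the helper name fgate_nodes is hypothetical, not an installed command) that prints every compute-node name, which can be handy for checking nodes one by one from the master node.

```sh
#!/bin/sh
# Print every Fgate compute-node name, one per line, following the naming
# scheme described above: c0-0..c0-15 and c1-0..c1-31 (numbering starts at 0).
# fgate_nodes is a hypothetical helper name used only for this sketch.
fgate_nodes() {
    i=0
    while [ "$i" -le 15 ]; do echo "c0-$i"; i=$((i + 1)); done
    i=0
    while [ "$i" -le 31 ]; do echo "c1-$i"; i=$((i + 1)); done
}

# Show the first few names as a quick check.
fgate_nodes | head -n 3

# On the master node, the list could drive a loop such as:
#   for n in $(fgate_nodes); do ssh "$n" uptime; done
```

The ssh loop is left as a comment because it only makes sense when run from the master node.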
Running Jobs with LSF
Fgate uses the Platform LSF batch system, and LSF is the only permitted way to run jobs on Fgate. See ~evaleev/LSF/test.cmd for a sample LSF command file. Note that your priority in the queue is inversely proportional to the computer time you have used recently.
The most useful LSF commands (see the man pages for more information):
- lsload: List all nodes along with a summary of their current state. A status of "lockU" indicates a lockup of the node, but is not always serious; it may just be that the node is very busy. r15s, r1m, and r15m give the 15-second, 1-minute, and 15-minute average run-queue lengths (roughly, the number of threads ready to run). ut is the CPU utilization, and swp and mem are the amounts of swap and memory available.
- bsub: Submit a job to the queue. When using a command file, the file must be redirected to standard input, i.e. bsub < test.cmd
- bjobs: Monitor your own jobs in the queue. To see everyone's jobs, use bjobs -u all
- blist: A nicer version of bjobs (Perl script by David Sherrill); takes most bjobs arguments.
- bkill: Kill a job in the queue, given its job number.
- bdone: List recently completed jobs (Python script by Sam Chill); takes most bacct arguments.
- bcwd [jobnumber]: Print the full directory path for a given job number. A particular user can also be specified with -u user. It can be useful in conjunction with an alias such as alias bcd 'cd `bcwd \!*`'
- bmod: Change the resource requirements of a job after it has been submitted to a queue. E.g. bmod -W 200:00 12345 would change the time limit of job ID 12345 to 200 hours.
- bhpart: Show a usage summary of the whole cluster by user, sorted by priority. If you notice that your jobs are not starting, it is likely that users with higher priority have jobs pending.
- bhist: Show a summary of the amount of time your recent jobs have spent in various states (waiting, running, etc.).
- bacct: Show a summary of your recently completed jobs. The -l switch gives long (verbose) output.
Here is a handy reference card of LSF commands: Image:Lsf user qrefcard 60.pdf.
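The commands above are often chained: submit, note the job number, then monitor or kill. Here is a minimal sketch in POSIX sh of a helper that pulls the job number out of the confirmation line bsub prints. The helper name lsf_jobid is hypothetical, and the sample confirmation line is an assumption about the usual bsub output format, so check it against what bsub actually prints on Fgate.

```sh
#!/bin/sh
# Extract the numeric job ID from a bsub confirmation line read on stdin.
# lsf_jobid is a hypothetical helper name used only for this sketch.
lsf_jobid() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Offline demonstration with a sample confirmation line (assumed format).
echo 'Job <12345> is submitted to default queue <normal>.' | lsf_jobid

# On the cluster, the helper could be used like this:
#   jobid=$(bsub < test.cmd | lsf_jobid)
#   bjobs "$jobid"     # check the job's state
#   bkill "$jobid"     # remove it from the queue if something went wrong
```

The cluster commands are left as comments so the snippet can be tried anywhere without LSF installed.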
A sample LSF script
    #!/bin/csh
    # This specifies a job name
    #BSUB -J s_mp2_qz
    # This specifies a stdout logfile name
    #BSUB -o s_mp2_qz.stdout
    # This specifies max runtime in hours:minutes
    #BSUB -W 24:00
    # This reserves one scratch and 3300 MB per processor
    # Memory specifications are mandatory, use Scratch only if you need it.
    #BSUB -R "rusage[Scratch=1:Memory=3300]"
    runmolprop s.in
Large-memory jobs
The so-called fat nodes are available for jobs that require very large memory or more disk space. To request these resources, submit to the fat node queue using the -q fat_nodes directive. This queue is reserved for jobs that need large memory or disk.
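As an illustration of the -q fat_nodes directive, the sketch below writes a csh command file modeled on the sample LSF script and submits it only if bsub is available. The job name, the input file big.in, and the Memory figure of 12000 MB are placeholders chosen for this example, not site recommendations.

```sh
#!/bin/sh
# Write a csh command file that targets the fat-node queue.
# Job name, memory figure, and input file are placeholders for illustration.
cat > fat_job.cmd <<'EOF'
#!/bin/csh
#BSUB -J big_mem_job
#BSUB -o big_mem_job.stdout
#BSUB -W 24:00
#BSUB -q fat_nodes
#BSUB -R "rusage[Memory=12000]"
runmolprop big.in
EOF

# Submit only where LSF is installed; elsewhere just report what was written.
if command -v bsub >/dev/null 2>&1; then
    bsub < fat_job.cmd
else
    echo "bsub not found; wrote fat_job.cmd without submitting"
fi
```

Keeping the queue directive in the command file, rather than on the bsub command line, makes the resource request visible to anyone who later reads the file.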
ADF
The ADF license for Fgate permits up to 64 tasks to run at a time (reminder: a parallel job has several tasks). So that the total number of running ADF tasks can be tracked, you must submit all ADF jobs to the queue "q_adf". To do that, add
#BSUB -q q_adf
to the command file.
The following directives start 1 job with 4 parallel tasks:
    #BSUB -q q_adf
    #BSUB -R "span[ptile=4] rusage[Memory=200]"
To examine the number of ADF tasks currently running, do
bhosts -s ADF
Jaguar
In order to run Jaguar, your job has to request Jaguar licenses. Here's how:
#BSUB -R "select[defined(Jaguar)] rusage[Memory=200,Jaguar=1]"
The first part selects hosts that have the Jaguar license resource defined; the second part requests 200 MB and 1 Jaguar license.
In order to look up how many Jaguar licenses are available, do
bhosts -s Jaguar
A sample Jaguar command file:
    #!/bin/csh
    #BSUB -J jaguar_test.test
    #BSUB -o %J.out
    #BSUB -R "select[defined(Jaguar)] rusage[Memory=200,Jaguar=1]"
    #BSUB -c 3:10
    source /export/apps/etc/cshrc.schrodinger
    jaguar run -WAIT test.in
NB: The "source" command points to a file on Fgate and, thus, the directory differs from that on Egate!
NAMD
To run the serial version of NAMD, your script should look like this:
    #!/bin/csh
    # This specifies a job name
    #BSUB -J myJobName
    # This specifies an error logfile name
    #BSUB -e %J.err
    # This specifies a stdout logfile name
    #BSUB -o %J.out
    # This specifies max runtime in hours:minutes
    #BSUB -W 24:00
    # This reserves 200 MB per processor
    # Memory specifications are MANDATORY.
    #BSUB -R "rusage[Memory=200]"
    /export/apps/NAMD_2.6_Linux-i686-TCP/namd2 config_file(s)
This script starts a parallel version of NAMD:
    # This specifies a job name
    #BSUB -J myJobName
    # This specifies an error logfile name
    #BSUB -e %J.err
    # This specifies a stdout logfile name
    #BSUB -o %J.out
    # This specifies max runtime in hours:minutes
    #BSUB -W 24:00
    # This reserves 200 MB per processor
    # Memory specifications are MANDATORY.
    #BSUB -R "rusage[Memory=200]"
    # This specifies a type of mpi
    #BSUB -a mpich2
    # This specifies a number of processors
    #BSUB -n 4
    /export/apps/NAMD_2.6_Linux-i686-TCP/namd2mpi config_file(s)
Molpro Versions Available
The latest version of Molpro 2006.1 was installed on 17 Dec 2007 and is located in /export/apps/molpro/2006.1.12-17-07. This version was compiled to support being invoked by SAPT2006.
Installation layout (create each directory with mkdir first): binaries go under /export/apps/molpro/2006.1.12-17-07/bin, the auxiliary directory under /export/apps/molpro/2006.1.12-17-07/lib, and the documentation under /export/apps/molpro/2006.1.12-17-07/doc. The HTML CGI files also go under /export/apps/molpro/2006.1.12-17-07/doc.
Configured with ./configure -mpp
Attempted compiling with Intel compiler (picked up by default), 8-byte integers, Intel MKL library for BLAS and LAPACK, and BLAS level 4 (use MOLPRO routines when necessary, otherwise use 32-bit integer routines from MKL).
After compiling, but before installing with make install, one needs to edit bin/molpro.rc to change the default location of scratch files: replace tmp with scratch in the -d and -I options. Otherwise, calculations will be massively slowed down and/or run out of room.
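The molpro.rc edit can be scripted with sed. The sketch below works on a stand-in file rather than the real bin/molpro.rc: the file name molpro.rc.example and its single line of options are assumptions made for illustration, and the actual file may be laid out differently, so inspect it before applying anything similar.

```sh
#!/bin/sh
# Hypothetical stand-in for bin/molpro.rc; the real file's contents may differ.
rc=molpro.rc.example
printf '%s\n' '-d /tmp/$USER -I /tmp/$USER -W $HOME/wfu' > "$rc"

# Rewrite only the directory given to -d and -I, leaving other options alone.
sed -e 's#\(-[dI] \)/tmp#\1/scratch#g' "$rc" > "$rc.new"

# Show the result so it can be reviewed before replacing the original.
cat "$rc.new"
```

Writing to a separate .new file keeps the original intact until the change has been checked by eye.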
The interface with SAPT could not be made to work, although the new version may be fine for non-SAPT use.
Parallel LSF jobs
Here is a sample LSF command file that runs PSI with 2 threads:
    #!/bin/csh
    #BSUB -J pd.3.2_0.2_n2
    #BSUB -o pd.3.2-0.2_n2.stdout
    #BSUB -W 200:0
    # The Memory below is the new way to do the accounting. *Per processor*
    #BSUB -R "rusage[Memory=1650] span[ptile=2]"
    #BSUB -n 2
    setenv NUM_THREADS 2
    psi3 pd.3.2-0.2.sto.in pd.3.2-0.2.sto.out
    psi3 pd.3.2-0.2.aDZ.in pd.3.2-0.2.aDZ.out
    psi3 pd.3.2-0.2.in pd.3.2-0.2.out
For MOLPRO, substitute NUM_COMPUTE_THREADS for NUM_THREADS. I think it might not be possible to request scratch for I/O-heavy jobs with multiple threads, because LSF might interpret the request as NUM_THREADS scratch disks (and there is only one per node). [CDS]
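Because Memory is accounted per processor, the value has to shrink as the processor count grows if the job's total is to stay the same. The sketch below shows that arithmetic in POSIX sh. The helper name mem_per_proc is hypothetical, and the 3300 MB total is borrowed from the serial sample script; treating it as the budget for a multi-threaded job is an inference from the samples on this page, not a stated policy.

```sh
#!/bin/sh
# Per-processor Memory value for an N-way job, given a total budget in MB.
# Integer division rounds down, so the total never exceeds the budget.
# mem_per_proc is a hypothetical helper name used only for this sketch.
mem_per_proc() {
    echo $(( $1 / $2 ))
}

nproc=2
mem=$(mem_per_proc 3300 "$nproc")

# Print the directives that would go into the command file.
echo "#BSUB -n $nproc"
echo "#BSUB -R \"rusage[Memory=$mem] span[ptile=$nproc]\""
```

With a 3300 MB total and 2 processors this reproduces the Memory=1650 request used in the sample above.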
