Fgate

Getting Started with Fgate

Fgate can be accessed only via ssh. To use the cluster, log into the master node: ssh username@fgate.chemistry.gatech.edu. The internal name of the master node is control. Cluster nodes can be reached from the master node using ssh. Nodes are named c0-0 through c0-15 (1st rack) and c1-0 through c1-31 (2nd rack). Note that numbering starts from 0, unlike on Egate. The Fgate nodes are:

Host Names                    | Type       | Specs                                            | Type of use
control                       | Head node  | Two 2.66 GHz Woodcrests, 4 GB RAM, 145 GB disk   | Job submission
c0-0 -- c0-7                  | Fat nodes  | Two 2.66 GHz Woodcrests, 16 GB RAM, 720 GB disk  | LSF
c0-8 -- c0-15, c1-0 -- c1-31  | Thin nodes | Two 2.66 GHz Woodcrests, 4 GB RAM, 225 GB disk   | LSF
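
The naming scheme can be spelled out with a short loop. This is only a sketch: the function name list_nodes and the output file fgate_nodes.txt are made up for illustration, and the fat/thin split simply follows the host ranges listed for Fgate.

```sh
#!/bin/sh
# Print "<node name> <fat|thin>" for every node in one rack.
# Usage: list_nodes <rack number> <last node index>
list_nodes() {
    rack=$1
    last=$2
    i=0
    while [ "$i" -le "$last" ]; do
        # Only c0-0 through c0-7 are fat nodes; everything else is thin.
        if [ "$rack" -eq 0 ] && [ "$i" -le 7 ]; then
            kind=fat
        else
            kind=thin
        fi
        echo "c$rack-$i $kind"
        i=$((i + 1))
    done
}

list_nodes 0 15 > fgate_nodes.txt    # c0-0 through c0-15
list_nodes 1 31 >> fgate_nodes.txt   # c1-0 through c1-31
```

Each name in fgate_nodes.txt can be used as an ssh target from the master node, e.g. ssh c0-3.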

Running Jobs with LSF

Platform LSF is the batch system on Fgate, and it is the only permitted way to run jobs there. See ~evaleev/LSF/test.cmd for a sample LSF command file. Note that your priority in the queue is inversely proportional to the computer time you have used recently.

The most useful LSF commands (see the man pages for more information):

lsload
List all nodes along with a summary of their current state. A status of "lockU" indicates a lockup of the node, but is not always serious; the node may just be very busy. r15s, r1m, and r15m give the 15-second, 1-minute, and 15-minute average number of threads running. ut is the fractional CPU utilization, and swp and mem are the amounts of swap and memory available.
bsub
Submit a job to the queue. This must be used with the redirect when using a command file, i.e. bsub < test.cmd
bjobs
Monitor your own jobs in the queue. To see everyone's jobs use bjobs -u all
blist
Nicer version of bjobs (Perl script by David Sherrill); takes most bjobs arguments.
bkill
Kill a job in the queue, given its job number.
bdone
List recently completed jobs (Python script by Sam Chill); takes most bacct arguments.
bcwd [jobnumber]
Print the full directory path for a given job number. A particular user can also be specified with -u user. This is handy in conjunction with a csh alias such as alias bcd 'cd `bcwd \!*`'
bmod
Changes the resource requirements of a job after it has been submitted to a queue. E.g., bmod "-W 200:00" 12345 would change the time limit of job ID 12345 to 200 hours.
bhpart
Shows a per-user usage summary for the whole cluster, sorted by user priority. If you notice that your jobs are not starting, it is likely that people with higher priority have jobs pending.
bhist
Shows a summary of the amount of time your recent jobs have spent in various states (waiting, running, etc).
bacct
Shows summary of your recently completed jobs. The -l switch gives long (verbose) output.
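
A typical submit-and-check cycle can be wrapped in a small sh helper. This is a sketch, not a site-provided tool: the function names parse_jobid and submit_and_watch are hypothetical, and it assumes bsub reports a submission on standard output with a line of the form "Job <12345> is submitted to queue <normal>.".

```sh
#!/bin/sh
# Extract the numeric job ID from the confirmation line printed by bsub,
# e.g. "Job <12345> is submitted to queue <normal>."
parse_jobid() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Submit a command file (using the required redirect) and show the job's status.
# Assumes the LSF commands are on PATH, as on the Fgate master node.
submit_and_watch() {
    cmdfile=$1
    [ -r "$cmdfile" ] || { echo "cannot read $cmdfile" >&2; return 1; }
    command -v bsub >/dev/null 2>&1 || { echo "bsub not found" >&2; return 1; }
    jobid=$(bsub < "$cmdfile" | parse_jobid)
    [ -n "$jobid" ] || { echo "submission failed" >&2; return 1; }
    echo "submitted job $jobid"
    bjobs "$jobid"
}
```

After sourcing these definitions, submit_and_watch test.cmd submits the file and prints the job number, which can then be passed to bkill, bmod, or bcwd.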

Here is a handy reference card of LSF commands: Image:Lsf user qrefcard 60.pdf.


A sample LSF script

#!/bin/csh
# This specifies a job name
#BSUB -J s_mp2_qz
# This specifies a stdout logfile name
#BSUB -o s_mp2_qz.stdout
# This specifies max runtime in hours:minutes
#BSUB -W 24:00
# This reserves one scratch disk and 3300 MB of memory per processor.
# Memory specifications are mandatory; request Scratch only if you need it.
#BSUB -R "rusage[Scratch=1:Memory=3300]"

runmolprop s.in

Large-memory jobs

The so-called fat nodes are available for jobs that require very large memory or disk space. To request these resources, submit to the fat node queue using the -q fat_nodes directive. This queue is reserved for jobs that need large memory or disk; do not use it for anything else.
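
As a sketch, a fat-node job could be prepared and submitted as follows. The job name, the 8000 MB memory figure, and the reuse of runmolprop s.in from the sample script are placeholders chosen for illustration; only the -q fat_nodes directive and the mandatory Memory request come from this page.

```sh
#!/bin/sh
# Write a command file for the fat-node queue.
# Job name, memory value, and input file are placeholders.
cat > fat_job.cmd <<'EOF'
#!/bin/csh
#BSUB -J big_mem_job
#BSUB -o big_mem_job.stdout
#BSUB -W 24:00
#BSUB -q fat_nodes
#BSUB -R "rusage[Memory=8000]"

runmolprop s.in
EOF

# Submit with the redirect, but only where LSF is installed.
if command -v bsub >/dev/null 2>&1; then
    bsub < fat_job.cmd
else
    echo "bsub not found; fat_job.cmd was written but not submitted" >&2
fi
```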

ADF

The ADF license for Fgate permits up to 64 tasks to run at a time (reminder: a parallel job consists of several tasks). To keep track of the total number of tasks running ADF, you must submit all ADF jobs to the queue "q_adf". To do that, add

#BSUB -q q_adf

to the command file.

The following directives start 1 job with 4 parallel tasks:

#BSUB -q q_adf
#BSUB -R "span[ptile=4] rusage[Memory=200]"

To see the number of ADF tasks currently running, run

bhosts -s ADF

Jaguar

In order to run Jaguar, your job has to request Jaguar licenses. Here's how:

#BSUB -R "select[defined(Jaguar)] rusage[Memory=200,Jaguar=1]"

The first part selects hosts that have the Jaguar license resource defined; the second part requests 200 MB of memory and 1 Jaguar license.

To look up how many Jaguar licenses are available, run

bhosts -s Jaguar

A sample Jaguar command file:

#!/bin/csh
#BSUB -J jaguar_test.test
#BSUB -o %J.out
#BSUB -R "select[defined(Jaguar)] rusage[Memory=200,Jaguar=1]"
#BSUB -c 3:10
source /export/apps/etc/cshrc.schrodinger
jaguar run -WAIT test.in

NB: The "source" command points to a file on Fgate and, thus, the directory differs from that on Egate!


NAMD

To run the serial version of NAMD, use a script like this one:

#!/bin/csh
# This specifies a job name
#BSUB -J myJobName
# This specifies an error logfile name
#BSUB -e %J.err
# This specifies a stdout logfile name
#BSUB -o %J.out
# This specifies max runtime in hours:minutes
#BSUB -W 24:00
# This reserves 200 MB per processor
# Memory specifications are MANDATORY.
#BSUB -R "rusage[Memory=200]"
/export/apps/NAMD_2.6_Linux-i686-TCP/namd2 config_file(s)


This script starts the parallel version of NAMD:

#!/bin/csh
# This specifies a job name
#BSUB -J myJobName
# This specifies an error logfile name
#BSUB -e %J.err
# This specifies a stdout logfile name
#BSUB -o %J.out
# This specifies max runtime in hours:minutes
#BSUB -W 24:00
# This reserves 200 MB per processor
# Memory specifications are MANDATORY.
#BSUB -R "rusage[Memory=200]"
# This specifies the type of MPI
#BSUB -a mpich2
# This specifies the number of processors
#BSUB -n 4
/export/apps/NAMD_2.6_Linux-i686-TCP/namd2mpi config_file(s)

Molpro Versions Available

The latest version of Molpro 2006.1 was installed on 17 Dec 2007 and is located in /export/apps/molpro/2006.1.12-17-07. This version was compiled to support being invoked by SAPT2006.

The following directories were created during installation: /export/apps/molpro/2006.1.12-17-07/bin for the executables, /export/apps/molpro/2006.1.12-17-07/lib for the auxiliary directory, and /export/apps/molpro/2006.1.12-17-07/doc for the documentation and the HTML CGI files.

The build was configured with ./configure -mpp.

Compilation was attempted with the Intel compiler (picked up by default), 8-byte integers, the Intel MKL library for BLAS and LAPACK, and BLAS level 4 (use MOLPRO routines when necessary; otherwise use the 32-bit integer routines from MKL).

After compiling, but before installing with make install, edit bin/molpro.rc to change the default location of scratch files: replace tmp with scratch in the -d and -I options. Otherwise calculations will be massively slowed down and/or run out of room.
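
The edit can also be scripted. The sketch below works on a stand-in file, molpro.rc.sample, whose single line is hypothetical; the real bin/molpro.rc may look different, and /scratch is assumed here to be the scratch directory. Check the result by hand before running make install.

```sh
#!/bin/sh
# Hypothetical stand-in for bin/molpro.rc; the real file's contents may differ.
cat > molpro.rc.sample <<'EOF'
-d /tmp/$USER -I /tmp/$USER -W $HOME/wfu
EOF

# Keep a backup, then point only the -d and -I options at /scratch.
# Other options (such as -W here) are left untouched.
cp molpro.rc.sample molpro.rc.sample.orig
sed 's#\(-[dI] *\)/tmp#\1/scratch#g' molpro.rc.sample.orig > molpro.rc.sample

# Show the edited file for inspection.
cat molpro.rc.sample
```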

The interface with SAPT could not be made to work. The new version may still be fine for non-SAPT calculations, though.

Parallel LSF jobs

Here is a sample LSF command file for a PSI job using 2 threads:

#!/bin/csh
#BSUB -J pd.3.2_0.2_n2
#BSUB -o pd.3.2-0.2_n2.stdout
#BSUB -W 200:0
# Memory is now accounted *per processor*, so the value below is for each of the 2 processors.
#BSUB -R "rusage[Memory=1650] span[ptile=2]"
#BSUB -n 2

setenv NUM_THREADS 2

psi3 pd.3.2-0.2.sto.in pd.3.2-0.2.sto.out
psi3 pd.3.2-0.2.aDZ.in pd.3.2-0.2.aDZ.out
psi3 pd.3.2-0.2.in pd.3.2-0.2.out

For MOLPRO, substitute NUM_COMPUTE_THREADS for NUM_THREADS. I think it might not be possible to run I/O-heavy jobs with multiple threads, because LSF might interpret the request as asking for NUM_THREADS scratch disks (and there is only one per node). [CDS]
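
A sketch of the MOLPRO variant, under stated assumptions: the job name, the input file h2o.com, and the bare molpro invocation are hypothetical (the executable may need to be called by its full path under /export/apps/molpro). Only the NUM_COMPUTE_THREADS substitution and the #BSUB directives are taken from the PSI example.

```sh
#!/bin/sh
# Write a 2-thread MOLPRO command file modeled on the PSI example.
# Job name, input file, and the molpro command itself are placeholders.
cat > molpro_n2.cmd <<'EOF'
#!/bin/csh
#BSUB -J molpro_n2
#BSUB -o molpro_n2.stdout
#BSUB -W 200:0
#BSUB -R "rusage[Memory=1650] span[ptile=2]"
#BSUB -n 2

setenv NUM_COMPUTE_THREADS 2

molpro h2o.com
EOF

# Submit with the redirect, but only where LSF is installed.
if command -v bsub >/dev/null 2>&1; then
    bsub < molpro_n2.cmd
else
    echo "bsub not found; molpro_n2.cmd was written but not submitted" >&2
fi
```

No Scratch request is included, in keeping with the caveat above about scratch disks and multi-threaded jobs.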
