Fgate
Getting Started with Fgate
Fgate can be accessed only through ssh. Log in to the master node with ssh username@fgate.chemistry.gatech.edu. The internal name of the master node is control. Cluster nodes can be reached from the master node using ssh. Nodes are named c0-0 through c0-15 (1st rack) and c1-0 through c1-31. Note that numbering starts from 0, unlike on Egate. Here's the information on Fgate nodes:
| Host Names | Type | Specs | Type of use |
|---|---|---|---|
| control | Head node | Two 2.66 GHz Woodcrests, 4 GB RAM, 145 GB disk | Job submission |
| c0-0 -- c0-7 | Fat nodes | Two 2.66 GHz Woodcrests, 16 GB RAM, 720 GB disk | LSF |
| c0-8 -- c0-15, c1-0 -- c1-31 | Thin nodes | Two 2.66 GHz Woodcrests, 4 GB RAM, 225 GB disk | LSF |
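The naming scheme in the table can be expanded into a full host list, which is convenient when looping over nodes. The following is a minimal bash sketch; the function name `list_fgate_nodes` is made up for illustration, and the ranges are simply those given above.

```shell
#!/bin/bash
# Print every compute node name on Fgate, one per line.
# Ranges come from the table above: c0-0..c0-15 and c1-0..c1-31.
# list_fgate_nodes is a hypothetical helper name, not an installed command.
list_fgate_nodes() {
    local i
    for ((i = 0; i <= 15; i++)); do echo "c0-${i}"; done
    for ((i = 0; i <= 31; i++)); do echo "c1-${i}"; done
}

# Show the first few names as a quick check.
list_fgate_nodes | head -n 3
```

From the master node, a loop such as `for n in $(list_fgate_nodes); do ssh "$n" uptime; done` would then visit each node in turn.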
Running Jobs with LSF
If you need to run something very small (such as a few tests to make sure a program compiled correctly), do NOT run it interactively on the head node. Instead, ask for an interactive shell on one of the nodes using "bsub -Is csh" (that's a capital I followed by an s, as in "Interactive shell"). Make sure that you do not consume much memory or disk when you do this, or else you may kill someone else's job.
Platform LSF is a batch system. LSF is the only legal way to run jobs on Fgate. Look at http://www.ccmst.gatech.edu/wiki/index.php/Fgate#A_sample_LSF_script for a sample LSF command file. Note that your priority in the queue is inversely proportional to the computer time you have used recently.
Most useful LSF commands: (See man pages for more information)
- lsload
- List all nodes along with a summary of their current state. A status of "lockU" indicates a lockup of the node, but is not always serious; it may just be that the node is very busy. r15s, r1m, and r15m give the 15 second, 1 minute, and 15 minute average number of threads running. ut is the fractional CPU utilization, and swp and mem are the amount of swap and memory available.
- bsub
- Submit a job to the queue. This must be used with the redirect when using a command file, i.e. bsub < test.cmd
- bjobs
- Monitor your own jobs in the queue. To see everyone's jobs use bjobs -u all
- blist
- Nicer version of bjobs (Perl script by David Sherrill); takes most bjobs arguments.
- bkill
- Kill a job in the queue, given its job number.
- bdone
- List recently completed jobs (Python script by Sam Chill); takes most bacct arguments.
- bcwd [jobnumber]
- Print full directory path for a given jobnumber. Can also specify a particular user by -u user. Can be useful to use this in conjunction with an alias like alias bcd 'cd `bcwd \!*`'
- bmod
- Changes job resource requirements once a job has been submitted to a queue. E.g., bmod "-W 200:00" 12345 would change the time limit of job id 12345 to 200 hours.
- bhpart
- Shows the usage summary of the whole cluster by user, sorted according to priority. If you notice that your jobs are not starting, it is likely that people with higher priority have jobs pending.
- bhist
- Shows a summary of the amount of time your recent jobs have spent in various states (waiting, running, etc).
- bacct
- Shows summary of your recently completed jobs. The -l switch gives long (verbose) output.
Here is a handy reference card of LSF commands: Image:Lsf user qrefcard 60.pdf.
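The commands above are usually chained: submit a job, note its job number, then monitor or modify it. The sketch below, in bash, pulls the job number out of the confirmation line that bsub prints. It assumes the usual `Job <N> is submitted to ...` wording, and `lsf_jobid` is a hypothetical helper name rather than an installed command.

```shell
#!/bin/bash
# Extract the numeric job ID from bsub's confirmation message on stdin.
# Assumes the usual "Job <12345> is submitted to ..." wording.
lsf_jobid() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Demonstration on a canned message, so this runs without LSF installed.
echo 'Job <12345> is submitted to default queue <normal>.' | lsf_jobid

# On Fgate one might then write (not executed here):
#   jobid=$(bsub < test.cmd | lsf_jobid)
#   bjobs "$jobid"               # check its state
#   bmod -W 200:00 "$jobid"      # raise the time limit to 200 hours
#   bkill "$jobid"               # remove it from the queue
```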
A sample LSF script
#!/bin/csh
# This specifies a job name
#BSUB -J s_mp2_qz
# This specifies a stdout logfile name
#BSUB -o s_mp2_qz.stdout
# This specifies max runtime in hours:minutes
#BSUB -W 24:00
# This reserves one scratch and 3300 MB per processor
# Memory specifications are mandatory, use Scratch only if you need it.
#BSUB -R "rusage[Scratch=1:Memory=3300]"
runmolprop s.in
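Since memory specifications are mandatory, a quick check before submission can catch a command file that lacks one. The following bash sketch looks for a `Memory=` entry inside an `rusage[...]` request. The helper name `has_memory_request` is an assumption for illustration, and the pattern only covers the directive style used on this page.

```shell
#!/bin/bash
# Return success if the command file contains a #BSUB -R line whose
# rusage[...] section includes a Memory=<number> entry.
# has_memory_request is a hypothetical helper, not part of LSF.
has_memory_request() {
    grep -Eq '^#BSUB[[:space:]]+-R[[:space:]].*rusage\[[^]]*Memory=[0-9]+' "$1"
}

# Demonstration on a throwaway command file.
cmdfile=$(mktemp)
cat > "$cmdfile" <<'EOF'
#!/bin/csh
#BSUB -J s_mp2_qz
#BSUB -W 24:00
#BSUB -R "rusage[Scratch=1:Memory=3300]"
runmolprop s.in
EOF

if has_memory_request "$cmdfile"; then
    echo "memory request found"
else
    echo "no memory request: add rusage[Memory=...] before submitting"
fi
rm -f "$cmdfile"
```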
Large-memory jobs
The so-called fat nodes are available to run jobs requiring very large memory or larger disk space. To request these resources, submit to the fat node queue using the -q fat_nodes directive. This queue is reserved for those jobs requiring large memory or disk only.
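As an illustration, a command file for the fat node queue might look like the sketch below. The job name, runtime, and the 12000 MB figure are made-up values chosen only to fit within the 16 GB of a fat node, and the final line is a placeholder for the real program.

```shell
#!/bin/bash
# Sketch of a fat-node command file. The job name, runtime, and the
# 12000 MB figure are illustrative assumptions, not site recommendations.
#BSUB -J big_mem_job
#BSUB -o big_mem_job.%J.out
#BSUB -q fat_nodes
#BSUB -W 24:00
#BSUB -R "rusage[Memory=12000]"

# Placeholder for the real program; reports where the job landed.
echo "running on $(uname -n)"
```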
ADF
The ADF license for Fgate permits up to 64 tasks to run at a time (reminder: a parallel job has several tasks). To keep track of the total number of tasks running ADF, you must submit all ADF jobs to the queue "q_adf". To do that, add
#BSUB -q q_adf
to the command file.
The following command starts 1 job with 4 parallel tasks:
#BSUB -q q_adf
#BSUB -R "span[ptile=4] rusage[Memory=200]"
To examine the number of ADF tasks currently running, do
bhosts -s ADF
Jaguar
In order to run Jaguar, your job has to request Jaguar licenses. Here's how:
#BSUB -R "select[defined(Jaguar)] rusage[Memory=200,Jaguar=1]"
The first part selects hosts that have the Jaguar license resource defined; the second part requests 200 MB and 1 Jaguar license.
In order to look up how many Jaguar licenses are available, do
bhosts -s Jaguar
A sample Jaguar command file:
#!/bin/csh
#BSUB -J jaguar_test.test
#BSUB -o %J.out
#BSUB -R "select[defined(Jaguar)] rusage[Memory=200,Jaguar=1]"
#BSUB -c 3:10
source /export/apps/etc/cshrc.schrodinger
jaguar run -WAIT test.in
NB: The "source" command points to a file on Fgate and, thus, the directory differs from that on Egate!
NAMD
- NAMD 2.7 is now available (NEW!)
To run a serial version of NAMD, your script should be like this one:
#!/bin/csh
# This specifies a job name
#BSUB -J myJobName
# This specifies an error logfile name
#BSUB -e %J.err
# This specifies a stdout logfile name
#BSUB -o %J.out
# This specifies max runtime in hours:minutes
#BSUB -W 24:00
# This reserves 200 MB per processor
# Memory specifications are MANDATORY.
#BSUB -R "rusage[Memory=200]"
/export/apps/NAMD_2.7b1_Linux-x86_64-TCP/namd2 config_file(s)
This script starts a parallel calculation with NAMD 2.7:
#!/bin/bash
#
# job name (enter your job name here)
#
#BSUB -J job_name
#
# stdout logfile name (edit with your logfile name)
#
#BSUB -o job_name.log
#
# error file name (edit with your error file name)
#
#BSUB -e job_name.err
#
# Number of processors (enter your desired number of processors)
#
#BSUB -n 4
#
# max runtime in hours:minutes (enter your limit, or delete for unlimited time)
#
#BSUB -W 00:20
#
# memory request (MB) per processor (substitute your estimate here. NOTE: this is mandatory!)
#
#BSUB -R "rusage[Memory=100]"
#
# Job proper starts here
#
# Configure NAMD environment
#
NAMDDIR=/export/apps/NAMD_2.7b1_Linux-x86_64-TCP
#
# use ssh for remote execution (the default would be rsh, which does not work)
export CONV_RSH=ssh
#
# set number of processors and hosts file. Some bash mumbo jumbo is used here
# (there should be no need of editing this part)
#
nprocs=0
echo "group main" > ${LSB_JOBID}_hosts
for host in ${LSB_HOSTS}; do
((nprocs= ${nprocs} + 1))
echo "host ${host}" >> ${LSB_JOBID}_hosts
done
echo "number of processes = ${nprocs}"
cat ${LSB_JOBID}_hosts
#
# run job
#
${NAMDDIR}/charmrun ${NAMDDIR}/namd2 +p${nprocs} ++nodelist ${LSB_JOBID}_hosts config_file(s)
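The host-file loop in the script above can be tried outside LSF by feeding it a made-up host list. The sketch below assumes, as the script does, that `LSB_HOSTS` names a host once per allocated slot. `build_nodelist` and the sample allocation are hypothetical.

```shell
#!/bin/bash
# Build a charmrun nodelist from a space-separated host list, in the same
# way as the loop above. Each host appears once per allocated slot.
# build_nodelist is a hypothetical helper name.
build_nodelist() {
    local host
    echo "group main"
    for host in $1; do
        echo "host ${host}"
    done
}

# Made-up allocation: two slots on c0-8 and two on c0-9.
sample_hosts="c0-8 c0-8 c0-9 c0-9"
build_nodelist "${sample_hosts}"

# Count the slots without an explicit loop.
set -- ${sample_hosts}
echo "number of processes = $#"
```

The resulting file has one `host` line per process, which is why the process count and the number of `host` lines agree.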
This script runs a parallel calculation using the (old) 2.6 version of NAMD:
# This specifies a job name
#BSUB -J myJobName
# This specifies an error logfile name
#BSUB -e %J.err
# This specifies a stdout logfile name
#BSUB -o %J.out
# This specifies max runtime in hours:minutes
#BSUB -W 24:00
# This reserves 200 MB per processor
# Memory specifications are MANDATORY.
#BSUB -R "rusage[Memory=200]"
# This specifies a type of mpi
#BSUB -a mpich2
# This specifies a number of processors
#BSUB -n 4
/export/apps/NAMD_2.6_Linux-i686-TCP/namd2mpi config_file(s)
ACES III
ACES III is available on Fgate. The program is installed under /export/apps/ACESII and is compiled with a special version of Open MPI installed in /export/apps/openmpi_ifort.
Here is a sample script for running a parallel job. ACES III is meant to be run in parallel. Do not attempt to run single-processor jobs for anything other than a simple Hartree-Fock calculation: the program will just sit there staring at you, consuming CPU cycles and doing nothing. The same will likely happen if you use fewer than 4 processes and do not provide your own *SIP input namelist to divide the processes between computing and I/O.
#!/bin/bash
#
# job name (enter your job name here)
#
#BSUB -J h2o_aces
#
# stdout logfile name (edit with your logfile name)
#
#BSUB -o h2o_aces.log
#
# Number of processors (enter your desired number of processors)
#
#BSUB -n 4
#
# max runtime in hours:minutes (enter your limit, or delete for unlimited time)
#
#BSUB -W 01:00
#
# memory request (MB) per processor (substitute your estimate here)
#
#BSUB -R "rusage[Memory=400]"
#
# Configure ACES III environment
#
mpi_home=/export/apps/openmpi_ifort
aces_home=/export/apps/ACESII
export PATH=${mpi_home}/bin:${PATH}
export LD_LIBRARY_PATH=${mpi_home}/lib:${LD_LIBRARY_PATH}
export ACES_EXE_PATH=${aces_home}/bin
# Temporary directory for the job
tmpdir=/scratch/malagoli/${LSB_JOBID}
jobroot=/home/malagoli/H2O
echo "job working directory = ${tmpdir}"
mkdir -p ${tmpdir}
cd ${tmpdir}
cp ${jobroot}/GENBAS .
cp ${jobroot}/ZMAT .
#
# set number of processors and hosts file. Some bash mumbo jumbo is used here
# (there should be no need of editing this part)
#
nprocs=0
for host in ${LSB_HOSTS}; do
((nprocs= ${nprocs} + 1))
echo ${host} >> ${LSB_JOBID}_hosts
done
echo "number of processes = ${nprocs}"
cat ${LSB_JOBID}_hosts
# run the job
mpirun --prefix ${mpi_home} -x LD_LIBRARY_PATH -np ${nprocs} -machinefile ${LSB_JOBID}_hosts ${aces_home}/bin/xaces3 > h2o.out
# copy back the results (include any restart files here)
cp ${tmpdir}/h2o.out ${jobroot}
# remove temporary directory
cd ..
rm -rf ${tmpdir}
--malagoli 12:39, 24 April 2009 (EDT)
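The stage-in, run, stage-out pattern used in the script above can be isolated and tried on any machine. The following bash sketch uses throwaway directories in place of /home and /scratch and a trivial command in place of mpirun. `run_staged` and its arguments are hypothetical names, not part of ACES III or LSF.

```shell
#!/bin/bash
# Stage-in / run / stage-out, following the ACES III script above.
# run_staged and its arguments are hypothetical; on Fgate the scratch
# base would be something like /scratch/<user> and the ID ${LSB_JOBID}.
run_staged() {
    local jobroot=$1 scratch_base=$2 jobid=$3
    local tmpdir="${scratch_base}/${jobid}"
    mkdir -p "${tmpdir}" || return 1
    (
        cd "${tmpdir}" || exit 1
        cp "${jobroot}/ZMAT" . || exit 1
        # Stand-in for the real mpirun line: just count the input lines.
        wc -l < ZMAT > job.out || exit 1
        cp job.out "${jobroot}/"
    )
    local status=$?
    # Clean up scratch whether or not the job succeeded.
    rm -rf "${tmpdir}"
    return "${status}"
}

# Demonstration with throwaway directories instead of /home and /scratch.
demo_root=$(mktemp -d)
demo_scratch=$(mktemp -d)
printf 'H2O\nO\nH 1 R\n' > "${demo_root}/ZMAT"
run_staged "${demo_root}" "${demo_scratch}" 12345 && echo "results copied back"
```

Running the job body in a subshell keeps the `cd` from affecting the caller and lets the cleanup step run even if a copy fails.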
Molpro Versions Available
The latest version of Molpro 2006.1 was installed on 17 Dec 2007 and is located in /export/apps/molpro/2006.1.12-17-07. This version was compiled to support being invoked by SAPT2006.
Created the install directories: binaries go under /export/apps/molpro/2006.1.12-17-07/bin, the auxiliary directory under /export/apps/molpro/2006.1.12-17-07/lib, and the documentation and HTML CGI under /export/apps/molpro/2006.1.12-17-07/doc.
Did ./configure -mpp
Attempted compiling with Intel compiler (picked up by default), 8-byte integers, Intel MKL library for BLAS and LAPACK, and BLAS level 4 (use MOLPRO routines when necessary, otherwise use 32-bit integer routines from MKL).
After compiling, but before installing with make install, one needs to edit bin/molpro.rc to change the default location of scratch files (replace tmp in -d and -I with scratch). Otherwise the calculations will be massively slowed down and/or run out of room.
Couldn't get the interface with SAPT to work. Maybe the new version is ok for non-SAPT case though.
Threaded LSF jobs
Here is a sample PSI input for 2 threads:
#!/bin/csh
#BSUB -J pd.3.2_0.2_n2
#BSUB -o pd.3.2-0.2_n2.stdout
#BSUB -W 200:0
# The Memory below is the new way to do the accounting. *Per processor*
#BSUB -R "rusage[Memory=1650] span[ptile=2]"
#BSUB -n 2
setenv NUM_THREADS 2
psi3 pd.3.2-0.2.sto.in pd.3.2-0.2.sto.out
psi3 pd.3.2-0.2.aDZ.in pd.3.2-0.2.aDZ.out
psi3 pd.3.2-0.2.in pd.3.2-0.2.out
For MOLPRO, substitute NUM_COMPUTE_THREADS for NUM_THREADS. I think it might not be possible to specify I/O-heavy jobs with multiple threads, because it might think you're asking for NUM_THREADS scratch disks (and there's only one per node). [CDS]
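Because the Memory figure is counted per processor, a two-processor job confined to one node reserves twice the stated amount on that node. The small bash sketch below works through that arithmetic for the threaded example, assuming LSF simply multiplies the per-processor request by the number of processors on the node. `total_mem_mb` is a hypothetical helper name.

```shell
#!/bin/bash
# Total memory reserved on a node when Memory is counted per processor.
# total_mem_mb is a hypothetical helper.
# Arguments: MB per processor, number of processors on the node.
total_mem_mb() {
    echo $(( $1 * $2 ))
}

# The threaded example above: 1650 MB per processor, 2 processors on one node.
echo "reserved on the node: $(total_mem_mb 1650 2) MB"
```

Under that assumption the example reserves 3300 MB, which still fits within the 4 GB of a thin node.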
General Parallel MPI LSF jobs
It can get a little tricky sorting out parallel MPI jobs when more than one MPI process might be running on a given node (and yet perhaps there are other nodes which do only run one MPI process). This script seems to handle all this:
#!/bin/sh
#BSUB -J mpi_test
#BSUB -o mpi_test.%J.out
#BSUB -R "rusage[Memory=100]"
#BSUB -W 50:0
#BSUB -n 5
export PATH=/export/apps/lib/mpich2/bin:$PATH
echo $LSB_MCPU_HOSTS | awk -F" " '{ for ( x=1; x<=NF; x=x+2 ) {print $x":"$(x+1)} }' > $LSB_JOBID
echo $LSB_MCPU_HOSTS | awk -F" " '{ for ( x=1; x<=NF; x=x+2 ) {print $x } }' > Machine$LSB_JOBID
NUM_HOSTS=`awk 'NF != 0 {++count} END {print count}' Machine$LSB_JOBID`
export NUM_HOSTS
echo "Number of unique hosts is $NUM_HOSTS"
echo "$LSB_MCPU_HOSTS"
mpdboot -f Machine$LSB_JOBID -n $NUM_HOSTS
mpiexec -machinefile $LSB_JOBID -n 5 hello_world > hello_world.out 2>&1
rm -f $LSB_JOBID
rm -f Machine$LSB_JOBID
mpdallexit
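The two awk lines in the script above rely on `LSB_MCPU_HOSTS` holding alternating host names and process counts. The following bash sketch applies the same transformation to a made-up value so the result can be inspected without LSF, and it also totals the process count. Both function names and the sample allocation are assumptions for illustration.

```shell
#!/bin/bash
# Turn an LSB_MCPU_HOSTS-style string ("host1 n1 host2 n2 ...") into
# "host:n" lines, as in the script above, and total the process count.
# Both function names are hypothetical helpers.
mcpu_to_machinefile() {
    echo "$1" | awk '{ for (x = 1; x < NF; x += 2) print $x ":" $(x + 1) }'
}

mcpu_total_procs() {
    echo "$1" | awk '{ n = 0; for (x = 2; x <= NF; x += 2) n += $x; print n }'
}

# Made-up allocation of 5 processes over 3 hosts.
sample="c0-8 2 c0-9 2 c1-3 1"
mcpu_to_machinefile "${sample}"
echo "total processes = $(mcpu_total_procs "${sample}")"
```

The computed total could stand in for the hard-coded `-n 5` on the mpiexec line, so that the process count always matches what `#BSUB -n` requested.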
