Fgate

Getting Started with Fgate

Fgate can be accessed only through ssh. To use the cluster, log into the master node: ssh username@fgate.chemistry.gatech.edu. The internal name of the master node is control. Cluster nodes can be reached from the master node using ssh. Nodes are named c0-0 through c0-15 (1st rack) and c1-0 through c1-31. Note that numbering starts from 0, unlike on Egate. The Fgate nodes are:

Host Names                   | Type       | Specs                                           | Type of use
control                      | Head node  | Two 2.66 GHz Woodcrests, 4 GB RAM, 145 GB disk  | Job submission
c0-0 -- c0-7                 | Fat nodes  | Two 2.66 GHz Woodcrests, 16 GB RAM, 720 GB disk | LSF
c0-8 -- c0-15, c1-0 -- c1-31 | Thin nodes | Two 2.66 GHz Woodcrests, 4 GB RAM, 225 GB disk  | LSF
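
The node naming scheme can be expanded in a short loop, which is handy when a command has to be repeated on every node. This is only a sketch: fgate_nodes is a hypothetical helper name, not a command installed on Fgate, and the ranges are simply the ones listed in the table.

```shell
#!/bin/bash
# Print every compute node name from the table above, one per line.
# fgate_nodes is a hypothetical helper, not an installed command.
fgate_nodes() {
    local i
    for i in $(seq 0 15); do echo "c0-${i}"; done   # c0-0 through c0-15
    for i in $(seq 0 31); do echo "c1-${i}"; done   # c1-0 through c1-31
}

# Count the compute nodes.
fgate_nodes | wc -l
```

From the master node, something like `for n in $(fgate_nodes); do ssh "$n" uptime; done` would then visit each node in turn.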

Running Jobs with LSF

If you need to run something very small (such as a few quick tests to make sure a program compiled correctly), do NOT run it interactively on the head node. Instead, ask for an interactive shell on one of the nodes, using "bsub -Is csh" (that's a capital I followed by an s, as in "Interactive shell"). Make sure that you do not consume much memory or disk when you do this, or else you may kill someone else's job.

Platform LSF is the batch system on Fgate, and it is the only permitted way to run jobs. See http://www.ccmst.gatech.edu/wiki/index.php/Fgate#A_sample_LSF_script for a sample LSF command file. Note that your priority in the queue is inversely proportional to the computer time you have used recently.

Most useful LSF commands: (See man pages for more information)

lsload
List all nodes along with a summary of their current state. A status of "lockU" indicates a lockup of the node, but is not always serious --- it may just be that the node is very busy. r15s, r1m, and r15m give the 15 second, 1 minute, and 15 minute average number of threads running. ut is the fractional CPU utilization, and swp and mem are the amount of swap and memory available.
bsub
Submit a job to the queue. This must be used with the redirect when using a command file, i.e. bsub < test.cmd
bjobs
Monitor your own jobs in the queue. To see everyone's jobs use bjobs -u all
blist
Nicer version of bjobs (Perl script by David Sherrill); takes most bjobs arguments.
bkill
Kill a job in the queue with associated job number
bdone
List recently completed jobs (Python script by Sam Chill); takes most bacct arguments.
bcwd [jobnumber]
Print full directory path for a given jobnumber. Can also specify a particular user by -u user. Can be useful to use this in conjunction with an alias like alias bcd 'cd `bcwd \!*`'
bmod
Changes a job's resource requirements after it has been submitted to a queue. E.g., bmod "-W 200:00" 12345 would change the time limit of job id 12345 to 200 hours.
bhpart
Shows the usage summary of the whole cluster by users. It sorts users according to their priority. If you notice that your jobs are not picking up, it is likely that people with higher priority have jobs pending.
bhist
Shows a summary of the amount of time your recent jobs have spent in various states (waiting, running, etc).
bacct
Shows summary of your recently completed jobs. The -l switch gives long (verbose) output.
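
The bcd alias suggested under bcwd is written in csh syntax. For bash users, a function can play the same role. The sketch below assumes only what is described above (bcwd prints the directory of a job); the exit-status check is a guess at how bcwd reports failure.

```shell
#!/bin/bash
# bash counterpart of the csh alias:  alias bcd 'cd `bcwd \!*`'
# All arguments (job number, optional -u user) are passed through to bcwd.
bcd() {
    local dir
    dir=$(bcwd "$@") || return 1      # assumption: bcwd exits non-zero on failure
    [ -n "$dir" ] && cd "$dir"
}
```

Placed in ~/.bashrc, this allows `bcd 12345` to jump to the working directory of job 12345.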

Here is a handy reference card of LSF commands: Image:Lsf user qrefcard 60.pdf.


A sample LSF script

#!/bin/csh
# This specifies a job name
#BSUB -J s_mp2_qz
# This specifies a stdout logfile name
#BSUB -o s_mp2_qz.stdout
# This specifies max runtime in hours:minutes
#BSUB -W 24:00
# This reserves one scratch and 3300 MB per processor
# Memory specifications are mandatory, use Scratch only if you need it.
#BSUB -R "rusage[Scratch=1:Memory=3300]"

runmolprop s.in
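
When many similar jobs are needed, a command file like the one above can be generated instead of edited by hand. The following is a minimal sketch, not an Fgate tool: make_lsf_script and the .cmd file name are made up for illustration, and the Scratch reservation is left out on purpose.

```shell
#!/bin/bash
# make_lsf_script is a hypothetical helper that writes an LSF command file
# modelled on the sample above.
# Arguments: job name, max runtime (hours:minutes), memory per processor (MB),
# followed by the command line to run.
make_lsf_script() {
    local name=$1 walltime=$2 mem_mb=$3
    shift 3
    cat > "${name}.cmd" <<EOF
#!/bin/csh
#BSUB -J ${name}
#BSUB -o ${name}.stdout
#BSUB -W ${walltime}
#BSUB -R "rusage[Memory=${mem_mb}]"

$*
EOF
}

# Recreate the sample job (without Scratch) and show the result.
make_lsf_script s_mp2_qz 24:00 3300 runmolprop s.in
cat s_mp2_qz.cmd
```

The generated file would then be submitted on Fgate with `bsub < s_mp2_qz.cmd`.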

Large-memory jobs

The so-called fat nodes are available to run jobs requiring very large memory or larger disk space. To request these resources, submit to the fat node queue using the -q fat_nodes directive. This queue is reserved for those jobs requiring large memory or disk only.

ADF

The ADF license for Fgate permits up to 64 tasks to run at a time (reminder: a parallel job has several tasks). So that the total number of running ADF tasks can be tracked, you must submit all ADF jobs to the queue "q_adf". To do that, add

#BSUB -q q_adf

to the command file.

The following directives start 1 job with 4 parallel tasks:

#BSUB -q q_adf
#BSUB -n 4
#BSUB -R "span[ptile=4] rusage[Memory=200]"

To examine the number of ADF tasks currently running, do

bhosts -s ADF

Jaguar

In order to run Jaguar, your job has to request Jaguar licenses. Here's how:

#BSUB -R "select[defined(Jaguar)] rusage[Memory=200,Jaguar=1]"

The first part selects hosts which have Jaguar license resource defined, the second part requests 200 MB and 1 Jaguar license.

In order to look up how many Jaguar licenses are available, do

bhosts -s Jaguar

A sample Jaguar command file:

#!/bin/csh
#BSUB -J jaguar_test.test
#BSUB -o %J.out
#BSUB -R "select[defined(Jaguar)] rusage[Memory=200,Jaguar=1]"
#BSUB -c 3:10
source /export/apps/etc/cshrc.schrodinger
jaguar run -WAIT test.in

NB: The "source" command points to a file on Fgate and, thus, the directory differs from that on Egate!


NAMD

  • NAMD 2.7 is now available (NEW!)

To run a serial version of NAMD, your script should be like this one:

#!/bin/csh
# This specifies a job name
#BSUB -J myJobName
# This specifies an error logfile name
#BSUB -e %J.err
# This specifies a stdout logfile name
#BSUB -o %J.out
# This specifies max runtime in hours:minutes
#BSUB -W 24:00
# This reserves 200 MB per processor
# Memory specifications are MANDATORY.
#BSUB -R "rusage[Memory=200]"
/export/apps/NAMD_2.7b1_Linux-x86_64-TCP/namd2 config_file(s)


This script starts a parallel calculation with NAMD 2.7:

#!/bin/bash
#
# job name (enter your job name here)
#
#BSUB -J job_name
#
# stdout logfile name (edit with your logfile name)
#
#BSUB -o job_name.log
#
# error file name (edit with your error file name)
#
#BSUB -e job_name.err
#
# Number of processors (enter your desired number of processors)
#
#BSUB -n 4
#
# max runtime in hours:minutes (enter your limit, or delete for unlimited time)
#
#BSUB -W 00:20
#
# memory request (MB) per processor (substitute your estimate here. NOTE: this is mandatory!)
#
#BSUB -R "rusage[Memory=100]"
#
# Job proper starts here
#
# Configure NAMD environment
#

NAMDDIR=/export/apps/NAMD_2.7b1_Linux-x86_64-TCP

#
# use ssh for remote execution (the default would be rsh, which does not work)

export CONV_RSH=ssh

#
# set number of processors and hosts file. Some bash mumbo jumbo is used here
# (there should be no need of editing this part)
#
nprocs=0
echo "group main" > ${LSB_JOBID}_hosts
for host in ${LSB_HOSTS}; do
   ((nprocs= ${nprocs} + 1))
   echo "host ${host}" >> ${LSB_JOBID}_hosts
done

echo "number of processes = ${nprocs}"
cat ${LSB_JOBID}_hosts

#
# run job
#

${NAMDDIR}/charmrun ${NAMDDIR}/namd2 +p${nprocs} ++nodelist ${LSB_JOBID}_hosts config_file(s)
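
To see what the host-file loop in the script above produces, it can be run outside LSF with stand-in values. Inside a real job LSF sets LSB_HOSTS (one host name per allocated slot) and LSB_JOBID itself; the values below are invented for illustration.

```shell
#!/bin/bash
# Stand-in values: LSF provides these inside a real job.
LSB_HOSTS="c1-3 c1-3 c1-7 c1-7"
LSB_JOBID=12345

# Same logic as the job script: count slots and write a charmrun nodelist.
nprocs=0
echo "group main" > "${LSB_JOBID}_hosts"
for host in ${LSB_HOSTS}; do
    nprocs=$((nprocs + 1))
    echo "host ${host}" >> "${LSB_JOBID}_hosts"
done

echo "number of processes = ${nprocs}"
cat "${LSB_JOBID}_hosts"
```

The nodelist gets one "host" line per slot, so a node that contributes two slots appears twice.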

This script runs a parallel calculation using the (old) 2.6 version of NAMD:

# This specifies a job name
#BSUB -J myJobName
# This specifies an error logfile name
#BSUB -e %J.err
# This specifies a stdout logfile name
#BSUB -o %J.out
# This specifies max runtime in hours:minutes
#BSUB -W 24:00
# This reserves 200 MB per processor
# Memory specifications are MANDATORY.
#BSUB -R "rusage[Memory=200]"
# This specifies a type of mpi
#BSUB -a mpich2
# This specifies a number of processors
#BSUB -n 4
/export/apps/NAMD_2.6_Linux-i686-TCP/namd2mpi config_file(s)

ACES III

ACES III is available on Fgate. The program is installed under /export/apps/ACESII, and is compiled using a special version of Open MPI installed in /export/apps/openmpi_ifort.

Here is a sample script for running a parallel job. ACES III is meant to be used in parallel. Do not attempt to run single-processor jobs for anything other than a simple Hartree-Fock calculation: the program will just sit there staring at you, consuming CPU cycles and doing nothing. The same will likely happen if you use fewer than 4 processes and do not provide your own *SIP input namelist to divide the processes between computing and I/O.

#!/bin/bash
#
# job name (enter your job name here)
#
#BSUB -J h2o_aces
#
# stdout logfile name (edit with your logfile name)
#
#BSUB -o h2o_aces.log
#
# Number of processors (enter your desired number of processors)
#
#BSUB -n 4
#
# max runtime in hours:minutes (enter your limit, or delete for unlimited time)
#
#BSUB -W 01:00
#
# memory request (MB) per processor (substitute your estimate here)
#
#BSUB -R "rusage[Memory=400]"

#
# Configure ACES III environment
#

mpi_home=/export/apps/openmpi_ifort
aces_home=/export/apps/ACESII

export PATH=${mpi_home}/bin:${PATH}
export LD_LIBRARY_PATH=${mpi_home}/lib:${LD_LIBRARY_PATH}
export ACES_EXE_PATH=${aces_home}/bin

# Temporary directory for the job

tmpdir=/scratch/malagoli/${LSB_JOBID}
jobroot=/home/malagoli/H2O

echo "job working directory = ${tmpdir}"

mkdir -p  ${tmpdir}
cd ${tmpdir}
cp ${jobroot}/GENBAS .
cp ${jobroot}/ZMAT .

#
# set number of processors and hosts file. Some bash mumbo jumbo is used here
# (there should be no need of editing this part)
#

nprocs=0
for host in ${LSB_HOSTS}; do
   ((nprocs= ${nprocs} + 1))
   echo ${host} >> ${LSB_JOBID}_hosts
done

echo "number of processes = ${nprocs}"
cat ${LSB_JOBID}_hosts

# run the job

mpirun --prefix ${mpi_home} -x LD_LIBRARY_PATH -np ${nprocs} -machinefile ${LSB_JOBID}_hosts ${aces_home}/bin/xaces3 > h2o.out

# copy back the results (include any restart files here)

cp ${tmpdir}/h2o.out  ${jobroot}

# remove temporary directory

cd ..
rm -rf  ${tmpdir}

--malagoli 12:39, 24 April 2009 (EDT)
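
The stage-in, run, copy-back, and clean-up pattern used in the ACES III script can be tried outside LSF. The sketch below uses temporary stand-in directories and a trivial placeholder command instead of mpirun and xaces3, and it adds a guard on cd so that a missing scratch directory stops the script before anything is copied or removed. None of the paths are real Fgate locations.

```shell
#!/bin/bash
# Stand-ins: on Fgate these would be your job directory and
# /scratch/<user>/${LSB_JOBID}.
jobroot=$(mktemp -d)
scratch=$(mktemp -d)
tmpdir=${scratch}/12345
echo "placeholder input" > "${jobroot}/ZMAT"

# Stage in.
mkdir -p "${tmpdir}"
cd "${tmpdir}" || exit 1          # stop if the scratch directory is unusable
cp "${jobroot}/ZMAT" .

# Placeholder for the "mpirun ... xaces3 > h2o.out" line.
wc -l < ZMAT > h2o.out

# Copy back the results, then remove the temporary directory.
cp "${tmpdir}/h2o.out" "${jobroot}"
cd "${jobroot}" || exit 1
rm -rf "${tmpdir}"
ls "${jobroot}"
```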

Molpro Versions Available

The latest version of Molpro 2006.1 was installed on 17 Dec 2007 and is located in /export/apps/molpro/2006.1.12-17-07. This version was compiled to support being invoked by SAPT2006.

Installation layout: the executables are in /export/apps/molpro/2006.1.12-17-07/bin, the auxiliary directory in /export/apps/molpro/2006.1.12-17-07/lib, and the documentation and HTML CGI in /export/apps/molpro/2006.1.12-17-07/doc. Each directory was created with mkdir beforehand.

The build was configured with ./configure -mpp.

Compilation was attempted with the Intel compiler (picked up by default), 8-byte integers, the Intel MKL library for BLAS and LAPACK, and BLAS level 4 (use MOLPRO routines when necessary, otherwise use 32-bit integer routines from MKL).

After compiling, but before installing with make install, edit bin/molpro.rc to change the default location of scratch files: replace tmp with scratch in the -d and -I options. Otherwise calculations will be massively slowed down and/or run out of room.

The interface with SAPT could not be made to work. The new version may still be fine for non-SAPT calculations.

Threaded LSF jobs

Here is a sample LSF script for a PSI job with 2 threads:

#!/bin/csh
#BSUB -J pd.3.2_0.2_n2
#BSUB -o pd.3.2-0.2_n2.stdout
#BSUB -W 200:0
# The Memory below is the new way to do the accounting.  *Per processor*
#BSUB -R "rusage[Memory=1650] span[ptile=2]"
#BSUB -n 2

setenv NUM_THREADS 2

psi3 pd.3.2-0.2.sto.in pd.3.2-0.2.sto.out
psi3 pd.3.2-0.2.aDZ.in pd.3.2-0.2.aDZ.out
psi3 pd.3.2-0.2.in pd.3.2-0.2.out

For MOLPRO, substitute NUM_COMPUTE_THREADS for NUM_THREADS. I think it might not be possible to specify I/O-heavy jobs with multiple threads, because it might think you're asking for NUM_THREADS scratch disks (and there's only one per node). [CDS]
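
Because the Memory value is counted per processor, it helps to work out what a request adds up to on one host. The arithmetic below is a rough check only; it assumes that the per-host total is simply Memory times ptile and that the 4 GB of a thin node corresponds to 4096 MB.

```shell
#!/bin/bash
mem_per_proc=1650   # MB, from rusage[Memory=1650]
ptile=2             # processors per host, from span[ptile=2]
node_ram=4096       # MB; assumption: 4 GB thin node taken as 4096 MB

total=$((mem_per_proc * ptile))
echo "memory requested on one host: ${total} MB"
if [ "$total" -gt "$node_ram" ]; then
    echo "more than a thin node offers"
else
    echo "within the RAM of a thin node"
fi
```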

General Parallel MPI LSF jobs

Parallel MPI jobs get a little tricky when some nodes run more than one MPI process while others run only one. This script seems to handle all of these cases:

#!/bin/sh
#BSUB -J mpi_test
#BSUB -o mpi_test.%J.out
#BSUB -R "rusage[Memory=100]"
#BSUB -W 50:0
#BSUB -n 5
export PATH=/export/apps/lib/mpich2/bin:$PATH
echo $LSB_MCPU_HOSTS | awk -F" " '{ for ( x=1; x<=NF; x=x+2 ) {print $x":"$(x+1)} }' > $LSB_JOBID
echo $LSB_MCPU_HOSTS | awk -F" " '{ for ( x=1; x<=NF; x=x+2 ) {print $x } }' > Machine$LSB_JOBID
NUM_HOSTS=`awk 'NF != 0 {++count} END {print count}' Machine$LSB_JOBID`
export NUM_HOSTS
echo "Number of unique hosts is $NUM_HOSTS"
echo "$LSB_MCPU_HOSTS"

mpdboot -f Machine$LSB_JOBID -n $NUM_HOSTS
mpiexec -machinefile $LSB_JOBID -n 5 hello_world > hello_world.out 2>&1
rm -f $LSB_JOBID
rm -f Machine$LSB_JOBID
mpdallexit
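
The two awk lines in the script can be checked outside LSF by feeding them a made-up LSB_MCPU_HOSTS value. LSF sets this variable to alternating host names and slot counts; the hosts and job number below are invented, and NUM_SLOTS is an extra variable added here for illustration.

```shell
#!/bin/bash
# Invented stand-ins in LSF's "host slots host slots ..." format.
LSB_MCPU_HOSTS="c1-3 2 c1-7 2 c1-9 1"
LSB_JOBID=12345

# host:slots pairs, as passed to mpiexec -machinefile
echo $LSB_MCPU_HOSTS | awk '{ for (x = 1; x <= NF; x += 2) print $x ":" $(x+1) }' > "$LSB_JOBID"
# one line per unique host, as passed to mpdboot -f
echo $LSB_MCPU_HOSTS | awk '{ for (x = 1; x <= NF; x += 2) print $x }' > "Machine$LSB_JOBID"

NUM_HOSTS=$(awk 'NF != 0 { ++count } END { print count }' "Machine$LSB_JOBID")
# Extra: total slot count, summed from the host:slots file.
NUM_SLOTS=$(awk -F: '{ total += $2 } END { print total }' "$LSB_JOBID")

cat "$LSB_JOBID"
echo "hosts=$NUM_HOSTS slots=$NUM_SLOTS"
```

A total computed this way could stand in for the hard-coded 5 in the mpiexec line, so that it always matches the #BSUB -n request.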