Kamiak Cluster at WSU¶
Here we document our experience using the Kamiak HPC cluster at WSU.
Resources¶
Kamiak Specific¶
- Kamiak Users Guide: Read this.
- Service Requests: Request access to Kamiak here, and use this for other service requests (software installation, issues with the cluster, etc.).
- Queue List: List of queues.
General¶
TL;DR¶
If you have read everything below, then you can use this job script.
Notes:
- Make sure that you can clone everything without an SSH agent (i.e., any pip-installable packages); see the check below.
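Batch jobs run without your SSH agent, so every dependency must be installable anonymously. A quick check, run in a shell with no agent (the package name and URL are hypothetical):
pip install hg+https://hg.example.com/mypkg   # anonymous HTTPS: works in batch jobs
#pip install hg+ssh://hg@example.com/mypkg   # SSH URLs like this need an agent and will fail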
Python on a Single Node¶
If you are running only on a single node, then it makes sense to create an environment that uses the /local scratch space, since this is the fastest sort of storage available. Here we create the environment in our SLURM script, storing the location in `my_workspace`.
#!/bin/bash
#SBATCH -n 1 # Number of cores
#SBATCH -t 0-00:10 # Runtime in D-HH:MM
# Local workspace for install environments.
# This will be removed at the end of the job.
my_workspace="$(mkworkspace --backend=/local --quiet)"
function clean_up { # Clean up. Remove temporary workspaces and the like.
rm -rf "${my_workspace}"
exit
}
trap 'clean_up' EXIT
# TODO: Why does hg-conda not work here?
module load conda mercurial
conda activate base
# TODO: Make this in /scratch for long-term use
export CONDA_PKGS_DIRS="${my_workspace}/.conda"
conda_prefix="${my_workspace}/current_conda_env"
#conda env create -f environment.yml --prefix "${conda_prefix}"
mamba env create -q -f environment.yml --prefix "${conda_prefix}"
conda activate "${conda_prefix}"
... # Do your work.
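Submit the script with `sbatch` and monitor it with `squeue` (the filename is hypothetical):
sbatch single_node.sh   # submit the job script above
squeue -u $USER         # check the job's state in the queue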
Overview¶
Using the cluster requires understanding the following components:
Obtaining Access¶
Request access by submitting a service request. Identify your advisor/supervisor.
Connecting¶
To connect to the cluster, use SSH. I recommend generating and installing an SSH key so you can connect without a password.
Jobs and Queues¶
All activity – including development, software installation, etc. – must be run on the compute nodes. You gain access to these by submitting a job to the appropriate job queue (scheduled with SLURM). There are two main types of jobs:
- Dedicated jobs: If you or your supervisor own nodes on the system, you can submit jobs to the appropriate queue and gain full access to these nodes, kicking anyone else off. Once you have access to your nodes, you can do what you like. An example is the CAS queue `cas`.
- Backfill jobs: The default is to submit a job to the backfill queue `kamiak`. These will run on whatever nodes are not occupied, but can be preempted by the owners of the nodes. For this reason, you must implement a checkpoint-restart mechanism in your code so you can pick up where you left off when you get preempted (see the sketch below).
On top of these, you can choose either background jobs (for computation) or interactive jobs (for development and testing).
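Here is a minimal checkpoint-restart sketch for a backfill job. It assumes your program (`my_program`, hypothetical) writes its own checkpoint file and accepts a `--restart` flag; `--signal` and `scontrol requeue` are standard SLURM features, but check that requeueing is enabled for your jobs:
#!/bin/bash
#SBATCH --partition=kamiak
#SBATCH --signal=B:USR1@60   # Signal the batch script 60s before preemption/time limit.
function requeue {           # Assumes my_program has already written checkpoint.dat.
    scontrol requeue "${SLURM_JOB_ID}"
    exit 0
}
trap 'requeue' USR1
if [ -f checkpoint.dat ]; then   # Hypothetical checkpoint file written by my_program.
    ./my_program --restart checkpoint.dat &
else
    ./my_program &
fi
wait   # Run in the background and wait so the trap fires promptly.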
Resources¶
When you submit a job, you must know:
- How many nodes you need.
- How many processes you will run.
- Roughly how much memory you will need.
- How long your job will take.
Make sure that your actual usage matches your request. To do this you must profile your code: understand the expected memory and time usage before you run, then actually test this to make sure your code is doing what you expect. If you exceed the requested resources, you may slow down the cluster for other users. For example, launching more processes than there are threads on a node will cause thread contention, significantly impacting the performance of your program and that of others.
Nodes are a shared resource - request only what you need and do not use more than you request.
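After a job finishes, you can compare requested against actual usage with SLURM's accounting command (replace `<jobid>` with your job's ID):
sacct -j <jobid> --format=JobID,Elapsed,NCPUS,MaxRSS,State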
Software¶
Much of the software on the system is managed by the Lmod module system. Custom software can be installed by sending service requests, or built in your own account. I maintain an up-to-date conda installation and various environments.
Preliminary¶
SSH¶
To connect to the cluster, I recommend configuring your local SSH client with something like this. (Change `m.forbes` to your username!)
# ~/.ssh/config
Host kamiak
  HostName kamiak.wsu.edu
  User m.forbes
  ForwardAgent yes

Host cn*
  ProxyCommand ssh kamiak nc %h %p
  User m.forbes
  ForwardAgent yes
  # The following are for jupyter notebooks. Run with:
  #   jupyter notebook --port 18888
  # and connect with
  #   https://localhost:18888
  ######## PORT FORWARDING TO NODES DOES NOT WORK.
  #LocalForward 10001 localhost:10001
  #LocalForward 10002 localhost:10002
  #LocalForward 10003 localhost:10003
  #LocalForward 18888 localhost:18888
  # The following is for snakeviz
  #LocalForward 8080 localhost:8080
This will allow you to connect with `ssh kamiak` rather than `ssh m.forbes@kamiak.wsu.edu`. Then use `ssh-keygen` to create a key and copy it to `kamiak:~/.ssh/authorized_keys`. The second entry allows you to connect directly to the compute nodes, forwarding ports so you can run Jupyter notebooks. Only do this for nodes over which you have been granted control through the scheduler.
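A typical key-setup session, assuming the `kamiak` alias above (`ssh-copy-id` ships with OpenSSH):
ssh-keygen -t ed25519   # Create a key pair; choose a passphrase.
ssh-copy-id kamiak      # Append the public key to kamiak:~/.ssh/authorized_keys.
ssh kamiak              # Should now log in without a password.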
Interactive Queue¶
Before doing any work, be sure to start an interactive session on one of the nodes. (Do not do work on the login nodes; this is a violation of the Kamiak user policy.) Once you have tested and profiled your code, run it with a non-interactive job in the batch queue.
$ idev --partition=kamiak -t 60
Home Setup¶
I have included the following setup. This will cause your `~/.bashrc` file to load some environment variables and create links to the data directory.
ln -s /data/lab/forbes ~/data
ln -s ~/data/bashrc.d/inputrc ~/.inputrc # Up-arrow history for commands
ln -s ~/data/bashrc.d/bash_alias ~/.bash_alias # Sets up environment
If you do not have a `.bashrc` file, then you can copy mine and similar related files.
cp ~/data/bashrc.d/bashrc ~/.bashrc
cp ~/data/bashrc.d/bash_profile ~/.bash_profile
cp ~/data/bashrc.d/hgrc ~/.hgrc
cp ~/data/bashrc.d/hgignore ~/.hgignore
If you do have one, then you can append these commands using `cat`:
cat >> ~/.bashrc <<EOF
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
# Source global definitions
if [ -f ~/.bash_alias ]; then
. ~/.bash_alias
fi
# Load the conda module which has mercurial and mr
module load conda mr
conda activate
EOF
In addition to this, you want to make sure that your `.bashrc` file loads any modules that might be needed by default. For example, if you want to be able to `hg push` code to Kamiak, you will need to ensure that a module providing mercurial is loaded. This can be done with the `conda` module, which is what I do above.
Make sure you add your username to the `.hgrc` file when you create it:
# Mercurial (hg) Init File; -*-Shell-script-*-
# dest = ~/.hgrc # Keep this as the 2nd line for mmf_init_setup
#
# Place site-specific customizations in the appropriate .hg_site file.
[ui]
######## Be sure to add a name here, or to your ~/.hgrc_personal file.
#username = Your Full Name <yourname@your.domain>
# Common global ignores
ignore.common = ~/.hgignore
[extensions]
graphlog =
extdiff =
rebase =
record =
histedit =
Conda¶
I do not have a good solution yet for working with Conda on Kamiak. Here are some goals and issues:
Goals

- Allow users to work with custom environments, ensuring reproducible computing.
- Allow users to install software using `conda`. (The other option is to use `pip`, but I am migrating to make sure all of my packages are available on my `mforbes` anaconda channel.)
Issues

- Working with conda in the user's home directory (the default) or on `/scratch` is very slow. For some timings, we install a minimal python3 two times in succession (so that the second time needs no downloads). We also compare the time required to copy the environment to the Home directory, and the time it takes to run `rm -r pkgs envs`:
| Location | Fresh Install | Second Install | Copy Home | Removal |
|----------|---------------|----------------|-----------|---------|
| Home     | 3m32s         | 1m00s          | N/A       | 1m03s   |
| Scratch  | 2m16s         | 0m35s          | 2m53s     | 0m45s   |
| Local    | 0m46s         | 0m11s          | 1m05s     | 0m00s   |
Recommendation

- If you need a custom environment, use the local drive `/local` and build it at the start of your job. A full anaconda installation takes about 5m24s on `/local`.
- If you need a persistent environment, build it in your Home directory, but keep the `pkgs` directory on Scratch or Local to avoid exceeding your quota. (Note: conda environments are not relocatable, so you can't just copy the one you built on Local to your home directory. With the copy speeds, it is faster just to build the environment again; see the sketch below.)
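A sketch of that persistent-environment setup, keeping the package cache off your Home quota (the `/scratch` path is an assumption; use your own workspace):
export CONDA_PKGS_DIRS="/scratch/${USER}_conda/pkgs"        # Keep the package cache off Home.
mkdir -p "${CONDA_PKGS_DIRS}"
conda env create -f environment.yml --prefix ~/envs/myenv   # Persistent environment in Home.
conda activate ~/envs/myenv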
Playing with Folders¶
We will need to manage our own environment so we can install appropriate versions of the python software stack. In principle this should be possible with Anaconda 4.4 (see this issue – Better support for conda envs accessed by multiple users – for example), but Kamiak does not yet have this version of Conda. Until then, we maintain our own stack.
Conda Root Installation¶
We do this under our lab partition /data/lab/forbes/apps/conda so that others in our group can share these environments. To use these, do the following:
- `module load conda`: This will allow you to use our conda installation.
- `conda activate`: This activates the base environment with `hg` and `git-annex`.
- `conda env list`: This will show you which environments are available. Choose the appropriate one and then:
- `conda activate --stack <env>`: This will activate the specified environment, stacking it on top of the base environment so that you can continue to use `hg` and `git-annex`.
- `conda deactivate`: Do this a couple of times when you are done to deactivate your environments.
- `module unload conda`: Optionally, unload the conda module.

Note: you do not need to use the undocumented `--stack` feature for just running code: `conda activate <env>` will be fine.
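Put together, a typical session looks like this (the environment name `work` is an assumption; use `conda env list` to see what actually exists):
module load conda
conda activate                # Base environment: hg, git-annex.
conda env list                # See which environments are available.
conda activate --stack work   # Stack a working environment on base.
...                           # Do your work.
conda deactivate
conda deactivate
module unload conda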
Primary Conda Environments (OLD)¶
conda create -y -n work2 python=2
conda install -y -n work2 anaconda
conda update -y -n work2 --all
conda install -y -n work2 accelerate
conda create -y -n work3 python=3
conda install -y -n work3 anaconda
conda update -y -n work3 --all
conda install -y -n work3 accelerate
for _e in work2 work3; do
    . activate $_e
    pip install ipdb \
                line_profiler \
                memory_profiler \
                snakeviz \
                uncertainties \
                xxhash \
                mmf_setup
done
module load cuda/8.0.44 # See below - install cuda and the module files first
for _e in work2 work3; do
    . activate $_e
    pip install pycuda \
                scikit-cuda
done
Once these base environments are installed, we lock the directories so that they cannot be changed accidentally.
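One way to do the locking (a sketch; the environment paths are assumptions based on the `-n` names above):
chmod -R a-w ~/.conda/envs/work2 ~/.conda/envs/work3    # Lock against accidental changes.
#chmod -R u+w ~/.conda/envs/work2 ~/.conda/envs/work3   # Restore write access to update.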
To use python, first load the module of your choice:
[cn14] $ module av
...
anaconda2/2.4.0
anaconda2/4.2.0 (D)
anaconda3/2.4.0
anaconda3/4.2.0
anaconda3/5.1.0 (D)
[cn14] $ module load anaconda3
Now you can create an environment in which to update everything.
[cn14] $ conda create -n work3 python=3
Solving environment: done
## Package Plan ##
environment location: /home/m.forbes/.conda/envs/work3
added / updated specs:
- python=3
The following packages will be downloaded:
package | build
---------------------------|-----------------
certifi-2018.11.29 | py37_0 146 KB
wheel-0.33.1 | py37_0 39 KB
pip-19.0.3 | py37_0 1.8 MB
python-3.7.2 | h0371630_0 36.4 MB
setuptools-40.8.0 | py37_0 643 KB
------------------------------------------------------------
Total: 39.0 MB
The following NEW packages will be INSTALLED:
ca-certificates: 2019.1.23-0
certifi: 2018.11.29-py37_0
libedit: 3.1.20181209-hc058e9b_0
libffi: 3.2.1-hd88cf55_4
libgcc-ng: 8.2.0-hdf63c60_1
libstdcxx-ng: 8.2.0-hdf63c60_1
ncurses: 6.1-he6710b0_1
openssl: 1.1.1b-h7b6447c_0
pip: 19.0.3-py37_0
python: 3.7.2-h0371630_0
readline: 7.0-h7b6447c_5
setuptools: 40.8.0-py37_0
sqlite: 3.26.0-h7b6447c_0
tk: 8.6.8-hbc83047_0
wheel: 0.33.1-py37_0
xz: 5.2.4-h14c3975_4
zlib: 1.2.11-h7b6447c_3
Proceed ([y]/n)? y
Downloading and Extracting Packages
certifi-2018.11.29 | 146 KB | ################################################################################################################################################################### | 100%
wheel-0.33.1 | 39 KB | ################################################################################################################################################################### | 100%
pip-19.0.3 | 1.8 MB | ################################################################################################################################################################### | 100%
python-3.7.2 | 36.4 MB | ################################################################################################################################################################### | 100%
setuptools-40.8.0 | 643 KB | ################################################################################################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use:
# > source activate work3
#
# To deactivate an active environment, use:
# > source deactivate
#
Now you can activate `work3` and update anaconda etc.
[cn14] $ . activate work3
(work3) [cn14] $ conda install anaconda
Solving environment: done
## Package Plan ##
environment location: /home/m.forbes/.conda/envs/work3
added / updated specs:
- anaconda
The following packages will be downloaded:
package | build
---------------------------|-----------------
anaconda-2018.12 | py37_0 11 KB
keyring-17.0.0 | py37_0 49 KB
dask-core-1.0.0 | py37_0 1.2 MB
...
------------------------------------------------------------
Total: 559.3 MB
The following NEW packages will be INSTALLED:
alabaster: 0.7.12-py37_0
anaconda: 2018.12-py37_0
anaconda-client: 1.7.2-py37_0
...
The following packages will be DOWNGRADED:
ca-certificates: 2019.1.23-0 --> 2018.03.07-0
libedit: 3.1.20181209-hc058e9b_0 --> 3.1.20170329-h6b74fdf_2
openssl: 1.1.1b-h7b6447c_0 --> 1.1.1a-h7b6447c_0
pip: 19.0.3-py37_0 --> 18.1-py37_0
python: 3.7.2-h0371630_0 --> 3.7.1-h0371630_7
setuptools: 40.8.0-py37_0 --> 40.6.3-py37_0
wheel: 0.33.1-py37_0 --> 0.32.3-py37_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
anaconda-2018.12 | 11 KB | ################################################# | 100%
...
(work3) $ du -sh .conda/envs/*
36M .conda
(work2) $ du -sh /opt/apps/anaconda2/4.2.0/
2.2G /opt/apps/anaconda2/4.2.0/
Some files are installed, but most are linked so this does not create much of a burden.
Issues¶
The currently recommended approach for setting up conda is to source the file `.../conda/etc/profile.d/conda.sh`. This does not work well with the module system, so I had to write a custom module file that does what this file does. This may get better in the future if the following issues are dealt with:
- #6820: Consider shell-agnostic activate.d/deactivate.d mechanism: This one even suggests using Lmod for activation.
- #7407: Some conda environment variables are not being unset when you deactivate the virtual environment: Closed, but references issue #7609.
- #7609: add conda deactivate --all flag: Might not help.
References¶
- Conda Docs: Multi-User support: It seems like the Kamiak installations do not use a top-level `.condarc` file.
- Issue 1329: Better support for conda envs accessed by multiple users.
- PR 5159: Support stacking environments.
- Constructor Issue 145: `conda --clone` surprised me by downloading a stack of files.
Inspecting the Cluster¶
Sometimes you might want to see what is happening with the cluster and various jobs.
Queue¶
To see what jobs have been submitted, use the `squeue` command:
squeue
Nodes¶
Suppose you are running on a node and performance seems to be poor. It might be that you are overusing the resources you have requested. To see this, you can log into the node and use the `top` command. For example:
$ squeue -u m.forbes
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
661259 kamiak idv4807 m.forbes R 2:41 1 cn94
$ squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %C %R" -w cn94
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
653445 kamiak SCR5 l... R 4-11:28:12 1 4 cn94
653448 kamiak SCR18 l... R 3-12:59:43 1 8 cn94
654674 kamiak SCR10 l... R 2-06:26:03 1 4 cn94
654675 kamiak SCR12 l... R 2-06:26:03 1 4 cn94
659459 kamiak meme1 e... R 2-06:26:03 1 1 cn94
660544 kamiak meme2 e... R 3-08:20:33 1 1 cn94
661259 kamiak idv4807 m... R 7:17 1 5 cn94
This tells us that I have one job running on node `cn94` which requested 5 CPUs, while user `l...` is running 4 jobs having requested a total of 20 CPUs, and user `e...` is running 2 jobs, having requested 1 CPU each. (Note: to see the number of CPUs, I needed to manually adjust the format string as described in the manual.)
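If you use that format string often, a shell alias saves retyping it (the alias name is my own):
alias squeue-cpus='squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %C %R"'
squeue-cpus -w cn94   # Per-job CPU counts on node cn94.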
Node Capabilities¶
To see what the compute capabilities of the node are, you can use the `lscpu` command:
[cn94] $ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 28
On-line CPU(s) list: 0-27
Thread(s) per core: 1
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Stepping: 1
CPU MHz: 2404.687
CPU max MHz: 3200.0000
CPU min MHz: 1200.0000
BogoMIPS: 3990.80
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
This tells us some information about the node, including that there are 14 cores per socket and 2 sockets, for a total of 28 cores on the node, so the 27 requested CPUs above should run fine.
Node Usage¶
To see what is actually happening on the node, we can log in and run top:
$ ssh cn94
$ top -n 1
Tasks: 772 total, 14 running, 758 sleeping, 0 stopped, 0 zombie
%Cpu(s): 46.5 us, 0.1 sy, 0.0 ni, 53.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13172199+total, 10478241+free, 23872636 used, 3066944 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 10730780+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20936 e... 20 0 1335244 960616 1144 R 3.6 0.7 4769:39 meme
30839 l... 20 0 1350228 0.995g 7952 R 3.6 0.8 236:38.54 R
30853 l... 20 0 1350228 0.993g 7952 R 3.6 0.8 236:38.75 R
30862 l... 20 0 1350228 0.995g 7952 R 3.6 0.8 236:37.37 R
122856 l... 20 0 1989708 1.586g 7988 R 3.6 1.3 1452:29 R
122865 l... 20 0 1989704 1.585g 7988 R 3.6 1.3 1452:25 R
124397 l... 20 0 1885432 1.514g 7988 R 3.6 1.2 1434:18 R
124410 l... 20 0 1885428 1.514g 7988 R 3.6 1.2 1434:17 R
124419 l... 20 0 1885428 1.514g 7988 R 3.6 1.2 1434:17 R
26811 l... 20 0 2710944 2.259g 7988 R 3.6 1.8 2595:41 R
26833 l... 20 0 2710940 2.262g 7988 R 3.6 1.8 2595:51 R
122847 l... 20 0 1989700 1.585g 7988 R 3.6 1.3 1452:29 R
170160 e... 20 0 1150992 776276 1140 R 3.6 0.6 3216:06 meme
50214 m.forbes 20 0 168700 3032 1612 S 0.0 0.0 0:02.60 top
Here I am just looking with `top`, but the other users are running 13 processes that are each using a full CPU on the node. The 3.6% ≈ 1/28, since the node has 28 CPUs. (To see this view, you might have to press "Shift-I" while running top to disable Irix mode. If you want to save this as the default, press "Shift-W", which will write the defaults to your `~/.toprc` file.)
Note: there are several keystroke commands you can use while running `top` to adjust the display. When two options are available, the lower-case version affects the per-process listing below, while the upper-case version affects the summary lines at the top:

- `e`/`E`: Changes the memory units.
- `I`: Irix mode – toggles between CPU usage as a % of node capability vs. as a % of a single CPU's capability.
Software¶
Modules¶
To find out which modules exist, run `module avail`:
[cn112] $ module avail
----------------------------------------- Compilers ------------------------------------------
StdEnv (L) gcc/6.1.0 intel/xe_2016_update3 (L,D)
gcc/4.9.3 gcc/7.3.0 (D) intel/16.2
gcc/5.2.0 intel/xe_2016_update2 intel/16.3
------------------------------- intel/xe_2016_update3 Software -------------------------------
bazel/0.4.2 espresso/5.3.0 (D) hdf5/1.10.2 nwchem/6.8 (D)
cmake/3.7.2 espresso/6.3.0 lammps/16feb16 octave/4.0.1
corset/1.06 fftw/3.3.4 mvapich2/2.2 siesta/4.0_mpi
dmtcp/2.5.2 gromacs/2016.2_mdrun netcdf/4 (D) stacks/1.44
eems/8ee979b gromacs/2016.2_mpi (D) netcdf/4.6.1 stacks/2.2 (D)
elpa/2016.05.003 hdf5/1.8.16 (D) nwchem/6.6
--------------------------------------- Other Software ---------------------------------------
anaconda2/2.4.0 git/2.6.3 python/2.7.10 (D)
anaconda2/4.2.0 (D) globus/6.0 python/2.7.15
anaconda3/2.4.0 google_sparsehash/4cb9240 python2/2.7.10 (D)
anaconda3/4.2.0 graphicsmagick/1.3.10 python2/2.7.15
anaconda3/5.1.0 (D) grass/6.4.6 python3/3.4.3
angsd/9.21 grass/7.0.5 python3/3.5.0
armadillo/8.5.1 grass/7.6.0 (D) python3/3.6.5 (D)
arpack/3.6.0 gsl/2.1 qgis/2.14.15
bamaddrg/1.0 hisat2/2.1.0 qgis/3.4.4 (D)
bamtools/2.4.1 htslib/1.8 qscintilla/2.9.4
bcftools/1.6 imagemagick/7.0.7-25 qscintilla/2.10 (D)
beagle/3.0.2 interproscan/5.27.66 r/3.2.2
beast/1.8.4 iperf/3.1.3 r/3.3.0
beast/1.10.0 (D) java/oracle_1.8.0_92 (D) r/3.4.0
bedtools/2.27.1 java/11.0.1 r/3.4.3
binutils/2.25.1 jellyfish/2.2.10 r/3.5.1
blast/2.2.26 jemalloc/3.6.0 r/3.5.2 (D)
blast/2.7.1 (D) jemalloc/4.4.0 (D) rampart/0.12.2
bonnie++/1.03e laszip/2.2.0 repeatmasker/4.0.7
boost/1.59.0 ldhot/1.0 rmblast/2.2.28
bowtie/1.1.2 libgeotiff/1.4.0 rmblast/2.6.0 (D)
bowtie2/2.3.4 libint/1.1.4 rsem/1.3.1
bowtie2/2.3.4.3 (D) libkml/1.3.0 salmon/0.11.3
bwa/0.7.17 liblas/1.8.0 samtools/1.3.1
canu/1.3 libspatialite/4.3.0a samtools/1.6
cast/dbf2ec2 libxsmm/1.4.4 samtools/1.9 (D)
ccp4/7.0 libzip/1.5.1 settarg/6.0.1
cellranger/2.1.0 lmod/6.0.1 shelx/2016.1
cellranger/3.0.2 (D) lobster/2.1.0 shore/0.9.3
centrifuge/1.0.4 matlab/r2018a shoremap/3.4
cp2k/4.1_pre_openmp matlab/r2018b (D) singularity/2.3.1
cp2k/4.1_pre_serial mercurial/3.7.3-1 singularity/2.4.2
cp2k/4.1 (D) mesa/17.0.0 singularity/3.0.0 (D)
cuda/7.5 migrate/3.6.11 smbnetfs/0.6.0
cuda/7.5.18 miniconda3/3.6 sqlite3/3.25.1
cuda/8.0.44 mocat2/2.0 sratoolkit/2.8.0
cuda/9.0.176 mothur/1.40.5 stringtie/1.3.5
cuda/9.1.85 (D) music/4.0 superlu/4.3_dist
cudnn/4_cuda7.0+ mysql/8.0.11 superlu/5.2.1
cudnn/5.1_cuda7.5 mzmine/2.23 superlu/5.4_dist (D)
cudnn/5.1_cuda8.0 namd/2.12_ib svn/2.7.10
cudnn/6.0_cuda8.0 namd/2.12_smp swig/3.0.12
cudnn/7.0_cuda9.1 namd/2.12 (D) tassel/3.0
cudnn/7.1.2_cuda9.0 netapp/5.4p1 tcl-tk/8.5.19
cudnn/7.1.2_cuda9.1 (D) netapp/5.5 (D) texinfo/6.5
cufflinks/2.2.1 octave/4.2.0 texlive/2018
dislin/11.0 octave/4.4.0 tiff/3.9.4
dropcache/master octave/4.4.1 (D) tophat/2.1.1
eigan/3.3.2 openblas/0.2.18_barcelona towhee/7.2.0
emboss/6.6.0 openblas/0.2.18_haswell trimmomatic/0.38
exonerate/2.2 openblas/0.2.18 trinity/2.2.0
exonerate/2.4 (D) openblas/0.3.0 (D) trinity/2.8.4 (D)
fastqc/0.11.8 orangefs/2.9.6 underworld/1.0
fastx_toolkit/0.0.14 parallel/3.22 underworld2/2.5.1
freebayes/1.1.0 parallel/2018.10.22 (D) underworld2/2.6.0dev (D)
freebayes/1.2.0 (D) parflow/3.2.0 valgrind/3.11.0
freetype/2.7.1 parmetis/4.0.3 vcflib/1.0.0-rc2
freexl/1.0.2 paxutils/2.3 vcftools/0.1.16
gatk/3.8.0 perl/5.24.1 (D) vmd/1.9.3
gdal/2.0.0 perl/5.28.0 workspace_maker/master (L,D)
gdal/2.1.0 pexsi/0.9.2 workspace_maker/1.1b
gdal/2.3.1 (D) phenix/1.13 workspace_maker/1.1
gdb/7.10.1 picard/2.18.6 workspace_maker/1.2
geos/3.5.0 proj/4.9.2 wrf/3.9.1
geos/3.6.2 (D) proj/5.1.0 (D) zlib/1.2.11
------------------------------------- Licensed Software --------------------------------------
amber/16 clc_genomics_workbench/8.5.1 (D) green/1.0
buster/17.1 dl_polly/4.08 stata/14
clc_genomics_workbench/6.0.1 gaussian/09.d.01 vasp/5.4.4
Where:
L: Module is loaded
D: Default Module
Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the
"keys".
You can also use `module spider` for searching. For example, to find all the modules related to conda you could run:
[cn112] $ module -r spider ".*conda.*"
----------------------------------------------------------------------------
anaconda2:
----------------------------------------------------------------------------
Description:
Anaconda is a freemium distribution of the Python programming
language for large-scale data processing, predictive analytics, and
scientific computing.
Versions:
anaconda2/2.4.0
anaconda2/4.2.0
----------------------------------------------------------------------------
For detailed information about a specific "anaconda2" module (including how to load the modules) use the module's full name.
For example:
$ module spider anaconda2/4.2.0
----------------------------------------------------------------------------
----------------------------------------------------------------------------
anaconda3:
----------------------------------------------------------------------------
Description:
Anaconda is a distribution of the Python programming language that
includes the Python interpeter, as well as Conda which is a package
and virtual environment manager, and a large collection of Python
scientific packages. Anaconda3 uses python3, which it also calls
python. Anaconda Navigator contains Jupyter Notebook and the Spyder
IDE.
Versions:
anaconda3/2.4.0
anaconda3/4.2.0
anaconda3/5.1.0
----------------------------------------------------------------------------
For detailed information about a specific "anaconda3" module (including how to load the modules) use the module's full name.
For example:
$ module spider anaconda3/5.1.0
----------------------------------------------------------------------------
----------------------------------------------------------------------------
conda: conda
----------------------------------------------------------------------------
Description:
Michael Forbes custom Conda environment.
This module can be loaded directly: module load conda
----------------------------------------------------------------------------
miniconda3: miniconda3/3.6
----------------------------------------------------------------------------
Description:
Miniconda is a distribution of the Python programming language that
includes the Python interpeter, as well as Conda which is a package
and virtual environment manager. Miniconda3 uses python3, which it
also calls python.
You will need to load all module(s) on any one of the lines below before the "miniconda3/3.6" module is available to load.
gcc/4.9.3
gcc/5.2.0
gcc/6.1.0
gcc/7.3.0
intel/16.2
intel/16.3
intel/xe_2016_update2
intel/xe_2016_update3
Help:
For further information, see:
https://conda.io/miniconda.html
To create a local environment using the conda package manager:
conda create -n myenv
To use the local environment:
source activate myenv
To install packages into your local environment:
conda install somePackage
To install packages via pip:
conda install pip
pip install somePackage
When installing, the "Failed to create lock" message can be ignored.
Miniconda3 uses python3, which it also calls python.
To use a different version for the name python:
conda install python=2
To inspect the actual module file (for example, if you would like to make your own based on this) you can use the `module show` command:
$ module show anaconda3
------------------------------------------------------
/opt/apps/modulefiles/Other/anaconda3/5.1.0.lua:
------------------------------------------------------
whatis("Description: Anaconda is a distribution of the Python programming language...")
help([[For further information...]])
family("conda")
family("python2")
family("python3")
prepend_path("PATH","/opt/apps/anaconda3/5.1.0/bin")
prepend_path("LD_LIBRARY_PATH","/opt/apps/anaconda3/5.1.0/lib")
prepend_path("LIBRARY_PATH","/opt/apps/anaconda3/5.1.0/lib")
prepend_path("CPATH","/opt/apps/anaconda3/5.1.0/include")
prepend_path("MANPATH","/opt/apps/anaconda3/5.1.0/share/man")
Running Jobs¶
Before you consider running a job, you need to profile your code to determine the following:
- How many nodes and how many cores-per-node do you need?
- How much memory do you need per node?
- How long will your program run?
- What modules do you need to load to run your code?
- What packages need to be installed to run your code?
Once you have this information, make sure that your code is committed to a repository, then clone this repository to Kamiak. Whenever you perform a serious calculation, you should make sure you are running from a clean checkout of a repository with a well-defined set of libraries installed so that your runs are reproducible. This information should be stored alongside your data so that you know exactly what version of your code produced the data.
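A minimal way to record this provenance next to your results (a sketch; `$run_dir` is hypothetical):
hg id > "${run_dir}/code_version.txt"            # Or: git rev-parse HEAD
conda env export > "${run_dir}/environment.yml"  # Exact package versions used.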
Here are my recommended steps.
- Run an interactive session.
- Log in directly to the node so your SSH agent gets forwarded.
- Check out your code into a repository:

  mkdir ~/repositories
  cd repositories
  hg clone ...

- Link your run folder to `~/now`.
- Make a SLURM file in `~/runs`.
#!/bin/bash
#SBATCH --partition=kamiak ### Partition (like a queue in PBS)
#SBATCH --job-name=HiWorld ### Job Name
#SBATCH --output=Hi.out ### File in which to store job output
#SBATCH --error=Hi.err ### File in which to store job error messages
#SBATCH --time=0-00:01:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1 ### Node count required for the job
#SBATCH --ntasks-per-node=1 ### Number of tasks to be launched per Node
./hello
Issues¶
Interactive Jobs do not ForwardAgent¶
Jupyter Notebook: Tunnel not working¶
For some reason, trying to tunnel to compute nodes is failing. It might be that administrative settings disallow TCP forwarding through tunnels, or it might be something with the multi-hop.
Mercurial and Conda¶
I tried the usual approach of putting mercurial in the conda `base` environment, but when running conda, mercurial cannot be found. Instead, one needs to load the mercurial module. I need to see if this will work with `mmfhg`.
Permissions¶
Building and Installing Software¶
The following describes how I have built and installed various pieces of software. You should not do this - just use the software as described above. However, this information may be useful if you need to install your own software.
#mkdir -p /data/lab/forbes # Provided by system.
ln -s /data/lab/forbes ~/data
mkdir -p ~/data/modules
ln -s ~/data/modules ~/.modules
mkdir -p ~/data/bashrc.d
cat > ~/data/bashrc.d/inputrc <<EOF
# Link to ~/.inputrc
"\M-[A": history-search-backward
"\M-[B": history-search-forward
"\e[A": history-search-backward
"\e[B": history-search-forward
EOF
cat > ~/data/bashrc.d/bash_alias <<EOF
# Link to ~/.bash_alias
# User specific aliases and functions
export INPUTRC=~/.inputrc
# Custom module files
export MODULEPATH="${HOME}/.modules:${HOME}/data/modules:/data/lab/forbes/modules:${MODULEPATH}"
# Load the conda module which has mercurial
module load conda
EOF
Conda¶
Our base conda environment is based on the mforbes/base environment and includes:
- Mercurial, with topics and the hg-git bridge.
- Black
- Anaconda Project
- Poetry
- mmf-setup
- nox and nox-poetry
chmod a+rx /data/lab/forbes/
module load intel/xe_2016_update3
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh -b -f -p /data/lab/forbes/apps/conda
rm Miniconda2-latest-Linux-x86_64.sh
cat > /data/lab/forbes/apps/conda/.condarc <<EOF
# System configuration override.
channels:
- mforbes
- defaults #- conda-forge # Don't do this by default -- too slow
create_default_packages:
  - ipykernel # No point until forwarding works.
EOF
To create and update environments:
module load conda # Requires conda.lua below
conda activate base
conda install anaconda-client
conda env update mforbes/base
conda deactivate
conda env update -n jupyter mforbes/jupyter
conda env update -n work mforbes/work
conda env create mforbes/_gpe
conda.lua¶
cat > ~/.modules/conda.lua <<EOF
-- -*- lua -*-
whatis("Description: Michael Forbes custom Conda environment.")
setenv("_CONDA_EXE", "/data/lab/forbes/apps/conda/bin/conda")
setenv("_CONDA_ROOT", "/data/lab/forbes/apps/conda")
setenv("CONDA_SHLVL", "0")
set_shell_function("_conda_activate", [[
if [ -n "${CONDA_PS1_BACKUP:+x}" ]; then
PS1="$CONDA_PS1_BACKUP";
\unset CONDA_PS1_BACKUP;
fi;
\local ask_conda;
ask_conda="$(PS1="$PS1" $_CONDA_EXE shell.posix activate "$@")" || \return $?;
\eval "$ask_conda";
\hash -r]],
""
)
set_shell_function("_conda_deactivate", [[
\local ask_conda;
ask_conda="$(PS1="$PS1" $_CONDA_EXE shell.posix deactivate "$@")" || \return $?;
\eval "$ask_conda";
\hash -r]],
""
)
set_shell_function("_conda_reactivate", [[
\local ask_conda;
ask_conda="$(PS1="$PS1" $_CONDA_EXE shell.posix reactivate)" || \return $?;
\eval "$ask_conda";
\hash -r]],
"")
set_shell_function("conda", [[
if [ "$#" -lt 1 ]; then
$_CONDA_EXE;
else
\local cmd="$1";
shift;
case "$cmd" in
activate)
_conda_activate "$@";
;;
deactivate)
_conda_deactivate "$@";
;;
install|update|uninstall|remove)
$_CONDA_EXE "$cmd" "$@" && _conda_reactivate;
;;
*)
$_CONDA_EXE "$cmd" "$@";
;;
esac
fi]],
"echo Conda C"
)
-- prepend_path("PATH", "/data/lab/forbes/apps/conda/bin")
-- prepend_path("LD_LIBRARY_PATH", "~/data/apps/conda/lib")
always_load("intel/xe_2016_update3")
family("conda")
family("python2")
family("python3")
--[[
Build:
# mkdir -p /data/lab/forbes/
# module load intel/xe_2016_update3
# wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
# bash Miniconda2-latest-Linux-x86_64.sh -b -f -p /data/lab/forbes/apps/conda
# rm Miniconda2-latest-Linux-x86_64.sh
--]]
EOF
MyRepos¶
data="/data/lab/forbes"
mkdir -p "${data}/repositories"
git clone git://myrepos.branchable.com/ "${data}/repositories/myrepos"
mkdir -p "${data}/apps/myrepos/bin"
ln -s "${data}/repositories/myrepos/mr" "${data}/apps/myrepos/bin/"
cat > ~/.modules/mr.lua <<EOF
-- -*- lua -*-
whatis("Description: myrepos (mr) Multiple repository management: https://myrepos.branchable.com.")
prepend_path("PATH", "/data/lab/forbes/apps/myrepos/bin/")
--[[
Build:
data="/data/lab/forbes"
mkdir -p "${data}/repositories"
git clone git://myrepos.branchable.com/ "${data}/repositories/myrepos"
mkdir -p "${data}/apps/myrepos/bin"
ln -s "${data}/repositores/myrepos/mr" "${data}/apps/myrepos/bin/"
--]]
EOF
mmfhg¶
data="/data/lab/forbes"
module load conda
module load mr
conda activate
mkdir -p "${data}/repositories"
hg clone ssh://hg@bitbucket.org/mforbes/mmfhg "${data}/repositories/mmfhg"
cd "${data}/repositories/mmfhg"
make install
cat >> "${data}/bashrc.d/bash_alias" <<EOF
export MMFHG=/data/lab/forbes/repositories/mmfhg
export HGRCPATH="\${HGRCPATH}:\${MMFHG}/hgrc"
. "\${MMFHG}/src/bash/completions.bash"
EOF
To Do¶
Get these working: mmfhg, mmfutils, mmf_setup, hgrc, mr, git-annex.
Questions¶
Kamiak¶
How to forward an SSH port to a compute node?¶
How to use a SLURM script to configure the environment and for interactive sessions?¶
Conda: best way to setup environments?¶
Some options:

- Install environments on a local scratch directory (only good for single-node jobs).
- Install into `~` but redirect the conda package dir to local or scratch. (Makes sure we can use current packages.)
- Install in global scratch, which is good for 2 weeks.
- Clone the base environment? In principle this should allow one to reuse much of the installed material, but in practice it seems like everything gets downloaded again.
  - First remove my conda stuff from my `.bashrc` file.
  - Initial attempt. Install a package that is not in the installed anaconda distribution:

        module load anaconda3
        conda install -c conda-forge uncertainties # Takes a long time...

  - Try creating a clone environment with `conda create -n mmf --clone base`. This is not a good option as it downloads a ton of stuff into `~/.conda/envs/mmf` and `~/.conda/pkgs`:
$ module load anaconda3
$ conda env list
# conda environments:
#
/data/lab/forbes/apps/conda
/data/lab/forbes/apps/conda/envs/_gpe
/data/lab/forbes/apps/conda/envs/jupyter
/data/lab/forbes/apps/conda/envs/work2
work3 /home/m.forbes/.conda/envs/work3
base * /opt/apps/anaconda3/5.1.0
$ conda create -n mmf --clone base
Source: /opt/apps/anaconda3/5.1.0
Destination: /home/m.forbes/.conda/envs/mmf
The following packages cannot be cloned out of the root environment:
- conda-env-2.6.0-h36134e3_1
- conda-4.5.12-py36_1000
- conda-build-3.4.1-py36_0
Packages: 270
Files: 4448
- Stack on top of another environment? This is an undocumented feature that allows you to stack environments. After playing with it a bit, however, it seems like it would only be useful for different applications, not for augmenting a python library.
$ conda install -c conda-forge -n mmf_stack uncertainties --no-deps
This fails because it does not install python. The previous python is used and it cannot see the new uncertainties package.
$ conda config --set max_shlvl 6 # Allows stacking
$ time conda create -n mmf_stack # Create environment for stacking.
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.5.12
latest version: 4.6.14
Please update conda by running
$ conda update -n base conda
## Package Plan ##
environment location: /home/m.forbes/.conda/envs/mmf_stack
Proceed ([y]/n)? y
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use:
# > source activate mmf_stack
#
# To deactivate an active environment, use:
# > source deactivate
#
$ . /opt/apps/anaconda3/5.1.0/etc/profile.d/conda.sh # Source since module does not install anaconda properly.
$ conda activate mmf_stack
(mmf_stack) $ conda install -c conda-forge uncertainties
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.5.12
latest version: 4.6.14
Please update conda by running
$ conda update -n base conda
## Package Plan ##
environment location: /home/m.forbes/.conda/envs/mmf_stack
added / updated specs:
- uncertainties
The following packages will be downloaded:
package | build
---------------------------|-----------------
libblas-3.8.0 | 8_openblas 6 KB conda-forge
tk-8.6.9 | h84994c4_1001 3.2 MB conda-forge
wheel-0.33.1 | py37_0 34 KB conda-forge
liblapack-3.8.0 | 8_openblas 6 KB conda-forge
setuptools-41.0.1 | py37_0 616 KB conda-forge
uncertainties-3.0.3 | py37_1000 116 KB conda-forge
libffi-3.2.1 | he1b5a44_1006 46 KB conda-forge
bzip2-1.0.6 | h14c3975_1002 415 KB conda-forge
numpy-1.16.3 | py37he5ce36f_0 4.3 MB conda-forge
zlib-1.2.11 | h14c3975_1004 101 KB conda-forge
pip-19.1 | py37_0 1.8 MB conda-forge
openblas-0.3.6 | h6e990d7_1 15.8 MB conda-forge
xz-5.2.4 | h14c3975_1001 366 KB conda-forge
sqlite-3.26.0 | h67949de_1001 1.9 MB conda-forge
openssl-1.1.1b | h14c3975_1 4.0 MB conda-forge
certifi-2019.3.9 | py37_0 149 KB conda-forge
libcblas-3.8.0 | 8_openblas 6 KB conda-forge
readline-7.0 | hf8c457e_1001 391 KB conda-forge
ncurses-6.1 | hf484d3e_1002 1.3 MB conda-forge
python-3.7.3 | h5b0a415_0 35.7 MB conda-forge
------------------------------------------------------------
Total: 70.2 MB
The following NEW packages will be INSTALLED:
bzip2: 1.0.6-h14c3975_1002 conda-forge
ca-certificates: 2019.3.9-hecc5488_0 conda-forge
certifi: 2019.3.9-py37_0 conda-forge
libblas: 3.8.0-8_openblas conda-forge
libcblas: 3.8.0-8_openblas conda-forge
libffi: 3.2.1-he1b5a44_1006 conda-forge
libgcc-ng: 8.2.0-hdf63c60_1
libgfortran-ng: 7.3.0-hdf63c60_0
liblapack: 3.8.0-8_openblas conda-forge
libstdcxx-ng: 8.2.0-hdf63c60_1
ncurses: 6.1-hf484d3e_1002 conda-forge
numpy: 1.16.3-py37he5ce36f_0 conda-forge
openblas: 0.3.6-h6e990d7_1 conda-forge
openssl: 1.1.1b-h14c3975_1 conda-forge
pip: 19.1-py37_0 conda-forge
python: 3.7.3-h5b0a415_0 conda-forge
readline: 7.0-hf8c457e_1001 conda-forge
setuptools: 41.0.1-py37_0 conda-forge
sqlite: 3.26.0-h67949de_1001 conda-forge
tk: 8.6.9-h84994c4_1001 conda-forge
uncertainties: 3.0.3-py37_1000 conda-forge
wheel: 0.33.1-py37_0 conda-forge
xz: 5.2.4-h14c3975_1001 conda-forge
zlib: 1.2.11-h14c3975_1004 conda-forge
Proceed ([y]/n)?
Presumably people can update software.

- Currently it seems I need to use my own conda installation (until anaconda 4.4.0):
$ module load conda
$ hg clone ssh://hg@bitbucket.org/mforbes/cugpe ~/work/mmfbb/cugpe
$ cd current
$ ln -s ~/work/mmfbb/cugpe cugpe
$ cd cugpe
$ module load cuda
$ conda env update -f environment.cugpe.yml -p /data/lab/forbes/apps/conda/envs/cugpe
Investigations¶
Here we include some experiments run on Kamiak to see how long various things take. These results may change as the system undergoes transformations, so this information may be out of date.
Conda¶
Here we investigate the timing of creating some conda environments using the user's home directory vs `/scratch` vs `/local`:
Home¶
$ time conda create -y -n mmf0 python=3 # Includes downloading packages
real 3m32.787s
$ time conda create -y -n mmf1 python=3 # Using downloaded packages
real 1m0.429s
$ time conda create -y -n mmf1c --clone mmf0
real 0m56.507s
$ du -sh ~/.conda/envs/*
182M /home/m.forbes/.conda/mmf0
59M /home/m.forbes/.conda/mmf1
59M /home/m.forbes/.conda/mmf1c
$ du -shl ~/.conda/envs/*
182M /home/m.forbes/.conda/mmf0
182M /home/m.forbes/.conda/mmf1
182M /home/m.forbes/.conda/mmf1c
$ du -sh ~/.conda/pkgs/
341M /home/m.forbes/.conda/pkgs/
From this we see that there is some space saving from the use of hard-links. Note that the packages also take up quite a bit of space.
$ time rm -r envs pkgs/
real 1m2.734s
Scratch¶
mkworkspace -n m.forbes_conda
mkdir /scratch/m.forbes_conda/envs
mkdir /scratch/m.forbes_conda/pkgs
ln -s /scratch/m.forbes_conda/envs ~/.conda/
ln -s /scratch/m.forbes_conda/pkgs ~/.conda/
$ time conda create -y -n mmf0 python=3 # Includes downloading packages
real 2m16.052s
$ time conda create -y -n mmf1 python=3 # Using downloaded packages
real 0m35.337s
$ time conda create -y -n mmf1c --clone mmf0
real 0m27.982s
$ time rm -r /scratch/m.forbes_conda/envs /scratch/m.forbes_conda/pkgs/
real 0m45.193s
Local¶
mkworkspace -n m.forbes_conda --backend=/local
mkdir /local/m.forbes_conda/envs
mkdir /local/m.forbes_conda/pkgs
ln -s /local/m.forbes_conda/envs ~/.conda/
ln -s /local/m.forbes_conda/pkgs ~/.conda/
$ time conda create -y -n mmf0 python=3 # Includes downloading packages
real 0m45.948s
$ time conda create -y -n mmf1 python=3 # Using downloaded packages
real 0m10.670s
$ time conda create -y -n mmf1c --clone mmf0
real 1m42.742s
$ time rm -r /local/m.forbes_conda/envs/ /local/m.forbes_conda/pkgs/
real 0m0.387s
Home/Local¶
mkworkspace -n m.forbes_conda --backend=/local
mkdir /local/scratch/m.forbes_conda/pkgs
ln -s /local/scratch/m.forbes_conda/pkgs ~/.conda/
$ time conda create -y -n mmf0 python=3 # Includes downloading packages
real 1m58.410s
$ time conda create -y -n mmf1 python=3 # Using downloaded packages
real 1m41.889s
real 1m39.003s
$ time conda create -y -n mmf1c --clone mmf0
real 1m42.742s
$ time rm -r /local/m.forbes_conda/envs/ /local/m.forbes_conda/pkgs/
real 0m0.387s
Local -> Home¶
$ my_workspace="$(mkworkspace -n m.forbes_conda --backend=/local --quiet)"
$ export CONDA_PKGS_DIRS="${my_workspace}/pkgs"
$ conda_prefix="${my_workspace}/current_conda_env"
$ time conda create -y --prefix "${conda_prefix}" python=3
real 0m16.295s
$ time conda create -y --prefix ~/clone_env --clone "${conda_prefix}"
real 0m49.573s
$ time conda create -y --prefix ~/clone_env2 python=3
real 0m44.628s
$ my_workspace="$(mkworkspace -n m.forbes_conda --backend=/local --quiet)"
$ export CONDA_PKGS_DIRS="${my_workspace}/pkgs"
$ conda_prefix="${my_workspace}/current_conda_env"
$ time conda env create --prefix "${conda_prefix}" mforbes/work
real 0m16.295s
$ time conda create -y --prefix ~/clone_env_work --clone "${conda_prefix}"
$ time conda env create --prefix ~/clone_env_work2 mforbes/work
real 14m21.985s
$ time conda create -y --prefix ~/clone_env --clone "${conda_prefix}"