Forbes Group Workflow

This post describes the recommended workflow for developing and documenting code using the Python ecosystem. The workflow uses Jupyter notebooks for documenting code development and usage, Python modules with comprehensive testing, version control, online collaboration, and Conda environments with carefully specified requirements for reproducible computing.

Prerequisites

Background

It is assumed that you have all of the prerequisites listed in the prerequisites post.

If you have any additional resources that you found useful while learning, please send me a note or add them to the prerequisites file.

Setting up your Machine

Most of our code relies on the scientific Python stack. To manage these dependencies I recommend Conda. Here is what you need to do on your machine (laptop, etc.) to start working productively.

  1. Install Miniconda. On my computers, I install it in /data/apps/anaconda, but you can install it in your home directory if you do not have permission (for example, on a cluster).
  2. Install some useful environments from my Anaconda Cloud environment collection. In particular:

    conda update anaconda-client  # Needed for the following
    conda env update mforbes/jupyter
    conda env update mforbes/work
    

    The jupyter environment provides a server for Jupyter notebooks with the NBConda extension, so that you can use any environment as a notebook kernel as long as you include the ipykernel package in that environment's requirements. The work environment is a rather complete environment that includes SciPy, NumPy, Matplotlib, etc. It contains the full anaconda package as well as some custom packages I use regularly. I use it for day-to-day work, but do not rely on it for specific projects, as that makes it difficult to reproduce dependencies.

    For work on a specific project, follow the instructions below to create a minimal environment specific to that project. This way, if you introduce code that needs a new package, you will be forced to update the corresponding environment so that you can reproduce your work later.
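
    For example, once a project's environment.yml exists (see below), the standard Conda commands create and maintain the environment (myproject is a hypothetical name here):

    conda env create -f environment.yml    # Create the environment named in the file
    conda activate myproject               # Work in it interactively
    conda env update -f environment.yml    # Update it after adding new dependencies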

  3. Install Mercurial, Git, and myrepos (mr). Generally, these should be installed at the system level with your package manager. Mercurial is a little special because it is a Python application, so I tend to install it using Conda; however, I do not include it in all of my Conda environments. Instead, I link the mercurial executable to a location on my path: ~/.local/bin/hg, for example.
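
    For example, if Mercurial lives in its own Conda environment (here called hg, an illustrative name) under the /data/apps/anaconda install mentioned above, the link might be:

    mkdir -p ~/.local/bin
    ln -s /data/apps/anaconda/envs/hg/bin/hg ~/.local/bin/hg   # Adjust paths for your machine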

Overview

Most projects will have the following components:

  • Code and related documentation. This should be stored in a repository on our Heptapod server, GitLab, GitHub, or the equivalent. Currently I recommend our Heptapod server because it allows you to manage private repositories (before you are ready to publish) and to use Mercurial, which is somewhat easier to use than Git. As these code repositories mature, useful components should be migrated to their own Python projects, which can be installed with Conda or Pip.
  • Papers. I generally recommend creating a separate repository for each paper. These will generally be consumers of the various code projects. LaTeX manuscripts should be kept here, as well as scripts/notebooks required to generate figures and data associated with the paper. The goal here is reproducibility: someone with access to this repository years down the road (likely you!) should be able to reproduce the results.
  • Data. Simulations can generate too much data to store in repositories. My current recommendation is to use Git Annex to store this data in a Git repo. This data can then be selectively backed up on various devices, in the cloud, etc. If you use Git repos, you can directly include the data in your project; otherwise, it should be managed separately.
  • For integrating these repositories, I rely quite heavily on symlinks. These are not well supported by automatic syncing software such as Dropbox (they seem to work but will eventually break), OneDrive, or Google Drive. Thus, we need to define a process for restoring them when needed.

TL;DR

  1. Create a new Mercurial repository on our Heptapod server.
  2. Clone it to your computer.
  3. Create README.md, .hgignore, environment.yml files etc.

    For example, minimal versions of these files might look like the following (a sketch only: adjust the name, channels, and dependencies for your project, and describe the project in README.md):
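
    # .hgignore
    syntax: glob
    _ext/*
    *.pyc
    *.pyo

    # environment.yml
    name: my_repo            # Placeholder name for this sketch
    channels:
      - defaults
      - conda-forge
    dependencies:
      - python=3
      - numpy
      - scipy
      - matplotlib
      - ipykernel            # So this environment can serve as a notebook kernel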

  4. Create an _ext folder: this will NOT be version controlled (add it to .hgignore). It should be a last-ditch way of providing dependencies: it is better to depend on specific versions in your environment.yml file, etc. The contents of this directory will need to be managed on each machine. Use symlinks into this folder and commit those symlinks to the repository.

    • _ext/my_repo_data: a link to a data directory (perhaps on an external hard-drive or HPC scratch space). This folder should mirror your repo, and you should include appropriate symlinks.

    • _ext/gpe: a link to a repository that you depend on and need to modify heavily. The idea here is that the project is in such a state of flux that you need to depend on a specific version-controlled revision (not just a release). Often we will symlink the underlying Python package to the top level so that we can import it without installing. This functionality is provided by import mmf_setup.setpath.hgroot (or import mmf_setup; mmf_setup.nbinit() in Jupyter notebooks), which adds the root level of your repo (the one containing the .hg folder) to your sys.path. In the following example, this allows you to import gpe without installing the gpe project because we include a symlink directly to the importable module.

    Example:

    my_repo
    |--.hg               # hg creates this.
    |--.hgignore         # Ignore _ext, *.pyc, *.pyo etc.
    |--README.md
    |--environment.yml
    |--setup.py          # If your project is a package, you should make it installable.
    |--my_repo           # This is your python code
    |  |--__init__.py
    |  \--my_module.py
    |--Docs
    |  |--README.ipynb
    |  |--Examples
    |  |     \--_data       # -> ../../_ext/my_repo_data/Docs/Examples
    |--runs
    |  \--_data          # -> ../_ext/my_repo_data/runs
    |--gpe               # -> _ext/gpe/gpe
    \--_ext
       |--my_repo_data   # This folder should be an incomplete mirror of my_repo/
       |  |--.hg         # Maybe... if the data is not too big
       |  |--Docs
       |  |  \--Examples # Symlink from my_repo/Docs/Examples/_data 
       |  \--runs        # Symlink from my_repo/runs/_data
       \--gpe
          |--.hg
          |--setup.py
          \--gpe         # Symlink from my_repo/gpe so it can be imported.
             |--__init__.py
             |--interfaces.py
             ...
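
    These symlinks can be created from the top level of my_repo with, for example (symlink targets are relative to the directory containing the link):

    ln -s _ext/gpe/gpe gpe
    ln -s ../_ext/my_repo_data/runs runs/_data
    ln -s ../../_ext/my_repo_data/Docs/Examples Docs/Examples/_data
    hg add gpe runs/_data Docs/Examples/_data    # Commit the symlinks themselves, not their contents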
    

Starting a New Project

Conda Environments

To enable a reproducible computing environment, we require that each project function and be tested in a well-specified environment. This environment should be specified in an environment.yml file. These environment files should be managed under version control with each project.

We consider two types of environments:

  • Complete environments. A complete environment specification can be rendered by running

    conda env export > environment.yml
    

    This will produce a complete rendering of the packages installed, i.e.:

    name: base
    channels:
      - defaults
      - conda-forge
      ...
    dependencies:
      - nbstripout=0.3.3=py_0
      - anaconda-client=1.7.2=py27_0
      - argcomplete=1.9.4=py27_0
      ...
      - pip:
        - dulwich==0.19.5
        - mmf-setup==0.1.12.dev0
        ...
    

    This includes specific version numbers, which is generally suitable for establishing a reproducible computing environment (but see the notes in Reproducible Computing). It is not very manageable as a specification of a project's requirements, however, since many of these packages appear only as dependencies of other packages.

  • Requirements. To specify environments, we instead generally specify the minimal set of dependencies, i.e.:

    name: base
    channels:
      - defaults
      - conda-forge
      - simplistix
    dependencies:
      - conda-build
      - anaconda-client
      - mercurial >= 4.7
      - picky-conda
      - docutils
    
      # https://conda.io/docs/user-guide/configuration/enable-tab-completion.html
      - argcomplete     # eval "$(register-python-argcomplete conda)"
    
      - pip:
        - python-hglib
        # - hg-git
        # Dev version needed to fix issue https://bitbucket.org/durin42/hg-git/
        #                         issues/244/typeerror-unexpected-keyword-argument
        - hg+https://mforbes@bitbucket.org/durin42/hg-git
    

    This file would be called environment.base.yml in a project and includes only the packages explicitly needed. Note that comments can be included to explain why packages might be needed, and explicit versions can be specified. This is not suitable for reproducible computing, however, since it is quite likely that, in the distant future, one or more of the packages will change their API, thereby breaking your code. In principle, one might look at the date of the commit and work back through the library versions to see which were current then, but users (like yourself!) should not be expected to do this. Thus, we advocate keeping both types of environment files. Specifically, when tests finally pass, the environment should be frozen, as shown below.
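
    For example, once the tests pass, the exact environment can be exported and committed alongside the requirements (the name environment.frozen.yml is just a convention suggested here):

    conda env export > environment.frozen.yml   # Exact versions of every installed package
    hg add environment.frozen.yml
    hg commit -m "Freeze environment with passing tests"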

To determine the minimal set of packages, one can use a dependency-tree tool such as conda-tree, which I have packaged for use with Conda:

conda install -c mforbes conda-tree
conda-tree.py -n base leafs

This might show a more manageable list like:

[u'conda-tree', 
 u'anaconda-client', 
 u'conda-verify', 
 u'conda-build', 
 u'pip', 
 u'docutils', 
 u'python.app', 
 u'picky-conda', 
 u'nbstripout', 
 u'argcomplete', 
 u'mercurial']

Packages

From time to time you might like to make a package of your code for others to use. Here we consider two options: a standard Python package hosted on PyPI, and a Conda package hosted on Anaconda Cloud.

Pip Packages

To make a package that is installable from PyPI with pip, you should first follow the tutorial Packaging Python Projects. This amounts to creating the following files:

  • setup.py: Provides meta-data such as the package name, version, and requirements.
  • README.md: Documentation for the user. At least explain what your package does and how to use it.
  • LICENSE: License file. (We typically use the MIT license.)
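
A minimal setup.py might look like the following sketch (the name, version, and requirements here are placeholders, not prescriptions):

    from setuptools import setup, find_packages

    setup(
        name="my_repo",              # Placeholder: your package name
        version="0.1",
        packages=find_packages(),
        install_requires=["numpy"],  # Placeholder: your runtime requirements
        license="MIT",
    )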

Conda Packages

Once you have a pip-installable package, you can make a corresponding conda-installable package.
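
A conda package is described by a recipe whose core is a meta.yaml file. A minimal sketch (with placeholder names and versions) might look like:

    package:
      name: my_repo
      version: "0.1"

    source:
      path: .                  # Build from the local checkout

    build:
      script: pip install --no-deps .

    requirements:
      host:
        - python
        - pip
      run:
        - python
        - numpy                # Placeholder: your runtime requirements

The package can then be built with conda build and uploaded to Anaconda Cloud with anaconda upload.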

References:

Pip/PyPI:

  • Packaging Python Projects: The authoritative tutorial on how to make a standard python package. You should do this first as most people will expect to be able to install your package using pip.

Conda/Anaconda Cloud:

Mercurial/Heptapod Workflow

We use a Heptapod server running on an AWS instance. (For details about this instance, see AWS Server on our Discourse site.) This setup has a few peculiarities: specifically, one should follow the Heptapod workflow, which has some limitations that we discuss below.

Heptapod

As mentioned above, the Heptapod workflow imposes a few limitations.

  • Named branches cannot have multiple heads: named branches are long-term names associated with development lines in Mercurial. By default, there is a single branch called default. You can start working on a named branch by issuing the hg branch <branchname> command. For example, you might use named branches for different versions of the software, allowing people to hg update <version> to get the latest revision of that version.
  • Use topics for work in progress: since named branches cannot have multiple heads, lightweight topics are the Heptapod-sanctioned way to work on features before they are published, and this is how I recommend you proceed (see the example after this list).
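
For example (topics are provided by the topic extension, which the Heptapod workflow uses; enable it in your ~/.hgrc to use topics locally):

    hg branch 2.0             # Long-term: a named branch for version 2.0
    hg topics my-feature      # Short-term: a topic for work in progress
    hg commit -m "Start my-feature"
    hg update 2.0             # Update to the latest revision of branch 2.0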

Improvements (To Do List)

Many aspects of these workflow procedures could use improvement. Here are some issues/suggestions.

Continuous Integration

Automatic continuous integration upon commit is an important tool for verifying code validity. A procedure for incorporating this is needed.
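
Since Heptapod derives from GitLab, GitLab CI (configured by a .gitlab-ci.yml file at the repository root) is a natural candidate. A hypothetical configuration might look like the following (the image and environment names are placeholders):

    image: condaforge/miniforge3           # Placeholder: any image providing Conda

    test:
      script:
        - conda env create -f environment.yml
        - conda run -n my_repo pytest      # Placeholder env name; assumes pytest is a dependency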

Reproducible Computing

  • Freeze versions of system libraries. Although we use Conda environments to manage our Python dependencies, we don't typically include system libraries in this specification. Specifically, the versions of libraries like FFTW and the CUDA toolkit could affect results. A convenient cross-platform procedure is needed for freezing these; a partial step is sketched below.
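
    One partial step in this direction is to install these libraries through Conda and pin them in the project's environment file, for example (the versions shown are illustrative only):

    dependencies:
      - fftw=3.3.8            # Pin the FFT library
      - cudatoolkit=10.1      # Pin the CUDA runtime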