Department of Physics and Astronomy

The Forbes Group

git-annex

git-annex is a tool for managing large data files with git. The idea is to store information about each file in a git repository that can be synchronized, but to store the actual data separately. The annex keeps track of where each file actually resides (which may be in a different repository, or on another computer) and allows you to manipulate the file (renaming, moving, etc.) without having the actual content present.

Here we explore git-annex as a mechanism for replacing and interacting with Dropbox, Google Drive, One Drive etc. with the following goals:

  1. Multiple users can share data.
  2. Data shared across many platforms: HPC clusters, laptops, desktops, Mac, Windows, Linux, CoCalc, etc.
  3. Allow only a subset of data to be stored on any particular device (esp. laptops) if storage on that device is limited.
  4. Utilize cloud storage options including Google Cloud, Dropbox, and Microsoft One Drive, both as redundant backups and as a mechanism for sharing data with others who are able to use only one of these services.
  5. Automatic and manual sync options.

My use-case is that I run a research group at WSU with ~10 collaborators. We need to share source code, experimental data, papers, plots, and simulation data on a regular basis. Some of the collaborators are used to using Dropbox or Google Drive, which work for syncing, but have issues (listed below). WSU provides 1TB of storage through Microsoft One Drive for all students and faculty, so this would be a natural storage tool, but few use it yet. We run simulations on our local machines, office desktops, a local HPC cluster (Kamiak), and online using CoCalc.

Overview of git-annex

Here is my current understanding of git-annex and the git-annex assistant.

  • Each "annex" is a git repository where the annex information is stored under the .git directory. Note: the annex information is separate from regular git information. Thus, if you do not initialize the git repository, git status will return fatal: This operation must be run in a work tree. You can use the annex as both a git-annex and a regular git repo if you choose. All of the annex information (log files, repository information, etc.) is stored in the .git directory.
  • For a GUI you can run git-annex-webapp. This seems to open the annex in the current folder or the last used annex. You can create other repositories from the GUI, but they will not appear in the Dashboard - only the repositories managed by the current "annex" (i.e. stored in the .git folder for that annex) appear in the Dashboard.
  • Configuration information is stored in ~/.config/git-annex and .git/config. The former includes autostart information about which repositories should be synced.
  • Look at the various workflows. This explains the difference between sync (which syncs meta-data but not the file contents) and sync --content.

Installation

Installation on major platforms is quite straightforward using the prebuilt binaries. Read the instructions carefully to see what you need to add to your path so you can access the git annex commands. (On my Mac OS X laptop, I added /Applications/git-annex.app/Contents/MacOS to my path using the GNU Environment Modules mechanism, which I have installed for keeping track of various paths.)

Building git-annex from source might be a challenge since it is written in Haskell. Fortunately, one can use remote computers without installing git-annex on them, and in many cases it can be installed using package managers.

Conda

This is now my preferred method on Linux servers:

conda install -c conda-forge git-annex
conda install -c conda-forge tenacity pydrive
pip install annexremote git-annex-remote-googledrive

Note: this requires Python 3, but it also includes the client required for using Google Drive (see below).

Alternatively, I install this in a conda environment using the following yml file:

# environment.app_gitannex.yml
name: app_gitannex
channels:
  - defaults
  - conda-forge
dependencies:
  - tenacity
  - pydrive
  - git-annex

  - pip:
    - annexremote
    - git-annex-remote-googledrive

by running

conda env create --file environment.app_gitannex.yml

then linking this to a bin/ directory on path:

ln -s "$(conda env list | awk '$1 == "app_gitannex" {print $NF}')/bin/git-annex-remote-googledrive" /usr/local/bin/

Mac OS X

Unfortunately, a conda git-annex binary is not maintained for OS X, so I instead use the pre-built binary for OS X.

  • git-annex-turtle might be useful for integration with Finder (I have not tried it yet).

Ubuntu

The standard Ubuntu install is old. I tried adding Justin Geibel's repo as discussed at the bottom of the Ubuntu install page

sudo add-apt-repository ppa:jtgeibel/ppa
sudo apt-get update
sudo apt-get install git-annex

but this version does not have the webapp, assistant, etc. I ended up resorting to Conda.

Walk-through (git-annex only)

Loosely following the walkthrough, I created a repository and initialized the annex:

In [1]:
%%bash
mkdir -p /tmp/annex_eg
cd /tmp/annex_eg
git init
git annex init "my laptop:annex_eg"
Initialized empty Git repository in /private/tmp/annex_eg/.git/
init my laptop:annex_eg ok
(recording state in git...)

Now add some "big" files:

In [2]:
%%file /tmp/annex_eg/data.dat
Tonnes of data!
Writing /tmp/annex_eg/data.dat

And now add them to the annex.

In [3]:
%%bash
cd /tmp/annex_eg
git annex add data.dat
git commit -a -m "Added data.dat"
add data.dat ok
(recording state in git...)
[master (root-commit) c815f59] Added data.dat
 1 file changed, 1 insertion(+)
 create mode 120000 data.dat

This is simply a symbolic link to the actual data file, which presently lives here in this repository, so we can view it.

In [4]:
!ls -la /tmp/annex_eg/data.dat
!cat /tmp/annex_eg/data.dat
lrwxr-xr-x  1 mforbes  wheel  188 Aug 18 23:22 /tmp/annex_eg/data.dat -> .git/annex/objects/WM/JK/SHA256E-s15--fa0ec06b25ff9438f9d594f07685954a57e05a90f5081ad51d85bcf551ab594b.dat/SHA256E-s15--fa0ec06b25ff9438f9d594f07685954a57e05a90f5081ad51d85bcf551ab594b.dat
Tonnes of data!
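
The symlink target above is a git-annex key: the backend name (SHA256E), the size in bytes (s15), the SHA-256 checksum of the content, and the original file extension. As a rough sketch (this is not git-annex's own implementation; the function name annex_key_sketch is hypothetical, and only the SHA256E backend with a single simple extension is assumed), a name of the same shape can be rebuilt with standard tools:

```shell
#!/bin/sh
# Sketch: rebuild a SHA256E-style key name for a file.
# Assumes sha256sum (Linux) or shasum (macOS) is available.
annex_key_sketch() {
    file=$1
    # Size in bytes; the arithmetic expansion strips padding from wc.
    size=$(( $(wc -c < "$file") ))
    if command -v sha256sum >/dev/null 2>&1; then
        hash=$(sha256sum < "$file" | awk '{print $1}')
    else
        hash=$(shasum -a 256 < "$file" | awk '{print $1}')
    fi
    # Keep the extension (if any), as the "E" in SHA256E suggests.
    base=${file##*/}
    case $base in
        *.*) ext=".${base##*.}" ;;
        *)   ext="" ;;
    esac
    printf 'SHA256E-s%s--%s%s\n' "$size" "$hash" "$ext"
}
```

Running annex_key_sketch on the real content of data.dat and comparing the result with the symlink target is a quick manual check that the content matches its key.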

We cannot modify it though: git-annex is very cautious about potential data loss!

In [19]:
%%bash
echo "More!" >> /tmp/annex_eg/data.dat
bash: line 1: /tmp/annex_eg/data.dat: Permission denied

If you want to modify it, you need to first unlock it. This replaces the symlink with the actual file and allows you to edit it.

In [20]:
%%bash
cd /tmp/annex_eg/
git annex unlock data.dat
ls -lah data.dat
unlock data.dat (copying...) ok
-rw-r--r--  1 mforbes  wheel    15B May 19 23:36 data.dat

In [21]:
%%bash
echo "Revised!" >> /tmp/annex_eg/data.dat

If you want to discard the changes, you could just lock the file again. Of course, safety checks are in place to make sure you do not lose data:

In [22]:
%%bash
cd /tmp/annex_eg/
git annex lock data.dat
lock data.dat 
git-annex: Locking this file would discard any changes you have made to it. Use 'git annex add' to stage your changes. (Or, use --force to override)

You can force this:

In [23]:
%%bash 
cd /tmp/annex_eg/
git annex lock --force data.dat
lock data.dat ok
(recording state in git...)

If you want to save the changed data, you need to add and commit that data:

In [24]:
%%bash
cd /tmp/annex_eg/
git annex unlock data.dat
echo "Revised!" >> /tmp/annex_eg/data.dat
git annex add data.dat
git commit -a -m "Modified data.dat"
ls -la data.dat
unlock data.dat (copying...) ok
add data.dat ok
(recording state in git...)
[master 7bf9548] Modified data.dat
 1 file changed, 1 insertion(+), 1 deletion(-)
lrwxr-xr-x  1 mforbes  wheel  188 May 19 23:36 data.dat -> .git/annex/objects/m9/8k/SHA256E-s24--4b6a60412e8ef0f79851a1c922880e1311f9ded6e3acdd8ca9b9241cb8ef1f82.dat/SHA256E-s24--4b6a60412e8ef0f79851a1c922880e1311f9ded6e3acdd8ca9b9241cb8ef1f82.dat

Note that git-annex keeps both versions of the file:

In [5]:
%%bash
cd /tmp/annex_eg/
git log data.dat
commit c815f594dda628d1437be8c7e38a69251acc23db
Author: Michael McNeil Forbes <michael.forbes+github@gmail.com>
Date:   Sat Aug 18 23:22:51 2018 -0700

    Added data.dat

Updating to the specified revision will link to the appropriate version (if available):

In [26]:
%%bash
cd /tmp/annex_eg/
cat data.dat
git checkout -q master^
cat data.dat
git checkout master
Tonnes of data!Revised!
Tonnes of data!
Previous HEAD position was bee37bf... Added data.dat
Switched to branch 'master'

Remotes: Directory

Now let's add a remote. We will start with a simple remote directory in another folder. This is called a special remote. Note that this is not a git repository - it is simply a directory that will hold the data.

In [6]:
%%bash
mkdir -p /tmp/annex_eg_data_dir
cd /tmp/annex_eg
git annex initremote data_dir \
    type=directory directory=/tmp/annex_eg_data_dir encryption=none
initremote data_dir ok
(recording state in git...)

We can move data to that remote:

In [7]:
%%bash
cd /tmp/annex_eg
git annex move data.dat --to data_dir
move data.dat (to data_dir...) 
ok
(recording state in git...)

Now the data is not here since it was moved rather than copied.

In [8]:
!cat /tmp/annex_eg/data.dat
cat: /tmp/annex_eg/data.dat: No such file or directory

... it is in the data directory:

In [9]:
%%bash
cd /tmp/annex_eg
git annex list
here
|data_dir
||web
|||bittorrent
||||
_X__ data.dat

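(Technical note: only the current version of the data was moved. The previous revision is still in /tmp/annex_eg.) If you want to move the old versions as well, first run git annex unused to find content that no branch or tag refers to, then move it, following the instructions here:

git annex unused
git annex move --unused --to data_dir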

If we need it, we can git annex get it:

In [10]:
%%bash
cd /tmp/annex_eg
git annex get data.dat
cat data.dat
get data.dat (from data_dir...) 
(checksum...) ok
(recording state in git...)
Tonnes of data!

Now it is in both places.

In [11]:
%%bash
cd /tmp/annex_eg
git annex list
here
|data_dir
||web
|||bittorrent
||||
XX__ data.dat
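
The location matrix printed by git annex list can be read mechanically: each character column corresponds to one repository in the header (the first column is "here"), X means the content is present there, and _ means it is not. A minimal sketch of a filter that lists files whose content is not available locally (the helper name missing_here is hypothetical, and the parsing assumes the plain-text layout shown above):

```shell
#!/bin/sh
# Sketch: read `git annex list` output on stdin and print the files
# whose first column ("here") is "_", i.e. content not present locally.
missing_here() {
    awk '/^[X_]+ / {
        if (substr($1, 1, 1) == "_")
            print substr($0, length($1) + 2)
    }'
}
```

Typical use would be git annex list | missing_here, for example to decide what to git annex get before going offline.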

Cheat-Sheet

  • git annex get <file>: Get a file or directory
  • git annex list: Shows which files are where.

Remotes

The utility of git-annex comes from storing data on remotes. These are managed within your repository but can point to many different places, such as folders, Google Drive, etc. which do not run git.

There are a couple of points to note:

  • Each remote has a name that is stored in the git repository. You must make sure that your remotes have useful names so that you and your collaborators can find them. For example, the default documentation uses names such as "google" which are fine for testing, but should be replaced by something more specific later. I use things like the following:

    • google-wsu: My generic git annex repo in My Drive/git-annex/ at WSU. I would use a prefix such as mmfbb.gpe_data to organize the data within a folder that points to where I host the master repo: https://bitbucket.org/mforbes/gpe_data/.
    • google-team-wsu: Typically I share data through a Google Team Drive. This would be in something like Team Drives/Project 1/git-annex/ hosted at WSU. To specify this folder, I would use the root_id option.
    • google-uw: A similar repository at UW.

Rclone

Now we try setting up some other remotes that are enabled with rclone. First one needs to install rclone:

Mac OS X:

cd ~/zips && curl -O https://downloads.rclone.org/rclone-current-osx-amd64.zip
unzip -a rclone-current-osx-amd64.zip && cd rclone-*-osx-amd64
mv rclone /usr/local/bin/
#mv rclone.1 /usr/local/man/man1/
rm -rf ~/zips/rclone-*-osx-amd64

Now one should install Daniel Dent's git-annex-remote-rclone project:

cd /usr/local/bin && curl -O https://raw.githubusercontent.com/DanielDent/git-annex-remote-rclone/master/git-annex-remote-rclone
chmod +x git-annex-remote-rclone

Microsoft One Drive

For my WSU business account I need to grant permission for rclone to access this. I could not find a way to do it without admin access, so I have filed a ticket with ITS.

Google Drive

Currently, to use Google Drive you must use the git-annex-remote-googledrive project. This requires Python 3 to run, but installed nicely from source. Follow the instructions to authenticate with your drive and you should be good to go.

I install this in a conda environment using the following yml file:

# environment.app_gitannex.yml
name: app_gitannex
channels:
  - defaults
  - conda-forge
dependencies:
  - tenacity
  - pydrive

  - pip:
    - annexremote
    - git-annex-remote-googledrive

by running

conda env create --file environment.app_gitannex.yml

then linking this to a bin/ directory on path:

ln -s "$(conda env list | awk '$1 == "app_gitannex" {print $NF}')/bin/git-annex-remote-googledrive" /usr/local/bin/

Initialize Remote

Once you have this environment installed, you can create and initialize a remote.

git-annex-remote-googledrive setup

This will provide a URL to follow, which you must use to connect to the specific account where the data will be stored. This will store an authentication token in a file token.json. Now add the remote using the token:

conda activate app_gitannex
# google-wsu: name you will use to refer to the remote
# prefix=...: path in your Google Drive
git annex initremote google-wsu                  \
    prefix=git-annex/mmfbb.gpe_data              \
    type=external externaltype=googledrive       \
    chunk=50MiB encryption=shared mac=HMACSHA512

Note, there are several ways of specifying where the data will be stored on the Google Drive:

  • prefix=: This allows you to specify a folder name, which will be located off of your My Drive. Simple, but not so versatile.
  • root_id=: When you connect to the drive online, the last part of the URL is this number. This allows you to connect with folders anywhere, such as in a Team Drive or one that is not included in your My Drive.
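
Since root_id is the last path component of the folder's URL, it can be pulled out with plain parameter expansion. A minimal sketch, assuming a URL of the form https://drive.google.com/drive/folders/<ID>, possibly followed by a query string; the function name and the ID in the usage line are made up for illustration:

```shell
#!/bin/sh
# Sketch: extract the folder ID (usable as root_id=) from a Drive folder URL.
drive_root_id() {
    url=${1%%[?]*}     # drop any ?query suffix
    url=${url%/}       # drop a trailing slash
    printf '%s\n' "${url##*/}"
}

# Hypothetical usage (the ID below is not a real folder):
#   drive_root_id "https://drive.google.com/drive/folders/0ABCdefGHIjklUk9PVA?usp=sharing"
```

The printed value is what would be passed as root_id=<root id> to git annex initremote.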

Team Drives (not working)

Team Drives support is not yet working: See this issue.

git annex initremote google-team-wsu             \
    root_id=<root id>                            \
    type=external                                \
    externaltype=googledrive                     \
    chunk=50MiB encryption=shared mac=HMACSHA512

Cleanup

To remove these repositories is a bit tricky because git-annex sets its permissions to be very restrictive.

In [14]:
%%bash
find /tmp/annex_eg -exec chmod u+rw {} \;
find /tmp/annex_eg_data_dir -exec chmod u+rw {} \;
rm -rf /tmp/annex_eg /tmp/annex_eg_data_dir
find: /tmp/annex_eg: No such file or directory
find: /tmp/annex_eg_data_dir: No such file or directory
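
The same cleanup can be wrapped in a small helper that restores write permission recursively before deleting. This is only a sketch (the function name force_remove is hypothetical); it uses chmod -R rather than find so that dangling annex symlinks do not produce errors, and it silently skips directories that are already gone:

```shell
#!/bin/sh
# Sketch: remove directory trees even if git-annex left parts read-only.
force_remove() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            # u+rwX: owner read/write everywhere, execute on directories.
            chmod -R u+rwX "$dir" && rm -rf "$dir"
        fi
    done
}

# Usage for the example repositories above:
#   force_remove /tmp/annex_eg /tmp/annex_eg_data_dir
```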

git-annex assistant

Without the assistant, one must explicitly synchronize between two repositories. This allows one explicit control of what is on any particular machine, but does not satisfy my desire for automatic synchronization.

Sync between two folders

First we create the two annex folders.

In [ ]:
%%bash
PREF="/tmp/"
cd "$PREF"
rm -rf annex1 annex2

cd "$PREF"
mkdir annex1
cd annex1
git init
git annex init "Annex 1"

cd "$PREF"
mkdir annex2
cd annex2
git init
git annex init "Annex 2"

Now we add each annex as a remote.

In [19]:
%%bash
PREF="/tmp/"
# Add annex2 as a remote to annex1
cd "$PREF"/annex1
git remote add annex2 ../annex2

# Add annex1 as a remote to annex2
cd "$PREF"/annex2
git remote add annex1 ../annex1
Initialized empty Git repository in /Users/mforbes/tmp/git/annex1/.git/
init Annex 1 ok
(recording state in git...)
Initialized empty Git repository in /Users/mforbes/tmp/git/annex2/.git/
init Annex 2 ok
(recording state in git...)

Now start the git-annex assistant.

Here I am following the 10-minute intro walk-through.

  1. Install git-annex on a remote server (I did this on my office desktop swan):

    ssh swan
    sudo apt-get install git-annex
    
  2. Run the git-annex webapp locally:

    git-annex-webapp
    
  3. Add a remote server.

    I had issues connecting with my SSH key. Password worked fine.

Issues and Gotchas

Git annex automatically merges files. If there is a conflict, then the conflicted files will appear with a name like file.AAA and file.BBB. You must then decide what to do with this. This mechanism is chosen so that git-annex never enters a conflicted merge state.
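
To see which files were left behind by automatic conflict resolution, one can search the work tree. A minimal sketch, assuming the .variant- suffix that git-annex documents for auto-resolved conflicts (the AAA/BBB names above are schematic); the helper name is hypothetical:

```shell
#!/bin/sh
# Sketch: list files produced by git-annex's automatic conflict resolution
# under a directory (default: current directory), skipping .git itself.
list_conflict_variants() {
    find "${1:-.}" -name .git -prune -o -name '*.variant-*' -print | sort
}
```

Each pair of variants found this way still needs a manual decision: keep one, keep both under new names, or merge the contents by hand.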

SSH

When I tried to configure a repository on a remote server with an "existing SSH key" the connection failed. Here is the log (enabled by going to Preferences in the webapp and enabling debug information in the log):

  • The first problem is that I tried to use an SSH alias I had setup (swan) which was translated to swan.home for some reason. The reason is explained here. The solution is to use the fully qualified host-name and manually correct after.
  • The second problem was that the assistant attempts to run the following checks on the remote host:

    cd <repo-name> && git config --list
    

    Thus, the repo must exist on the remote.

    [2018-05-19 13:38:55.564412] process done ExitFailure 2
    [2018-05-19 13:43:39.138347] read: ssh-keygen ["-F","swan.physics.wsu.edu"]
    [2018-05-19 13:43:39.163278] process done ExitSuccess
    [2018-05-19 13:43:39.163418] chat: ssh ["-oNumberOfPasswordPrompts=0","-oStrictHostKeyChecking=yes","-n","-p","22","mforbes@swan.physics.wsu.edu","sh -c 'echo '\"'\"'git-annex-probe loggedin'\"'\"';if which git-annex-shell; then echo '\"'\"'git-annex-probe git-annex-shell'\"'\"'; fi;if which git; then echo '\"'\"'git-annex-probe git'\"'\"'; fi;if which rsync; then echo '\"'\"'git-annex-probe rsync'\"'\"'; fi;if which ~/.ssh/git-annex-shell; then echo '\"'\"'git-annex-probe ~/.ssh/git-annex-shell'\"'\"'; fi;if which ~/.ssh/git-annex-wrapper; then echo '\"'\"'git-annex-probe ~/.ssh/git-annex-wrapper'\"'\"'; fi;cd '\"'\"'annex'\"'\"' && git config --list'"]
    [2018-05-19 13:43:40.021729] process done ExitFailure 2

ssh -oNumberOfPasswordPrompts=0 -oStrictHostKeyChecking=yes -n -p 22 mforbes@swan.physics.wsu.edu

Questions

  1. How does git-annex handle conflicts? Probably diverging branches?
  2. Can I symlink a directory for distribution with git-annex? (For example: suppose I want to sync an existing folder. Can I symlink it into a git-annex folder and have it go over automatically?)
  3. How do I remove files? The correct way is discussed here. Basically, one should use git annex drop $file; however, this will fail if there are not enough copies safely stored somewhere else. I am not sure how to quickly remove all files...

Cheat Sheet

  • Basic add, commit, and sync:

    git annex add data.dat
    git commit -a -m "Added data.dat"
    git annex sync
    
  • git annex unannex: Undo a previous addition of files.

  • git annex get <file>: Get a file or directory from some remote so it exists on the computer where the command is run.

  • git annex drop <file>: Drops the file or directory from the local computer. Only allowed if enough copies exist elsewhere.
  • git annex copy <file> --to <remote>
  • git annex move <file> --to <remote>: Move or copy the specified file or directory to the specified remote. This is manual syncing.

  • git annex repair: If things get corrupted, try this.

Inspecting a Repo

This section contains useful commands for inspecting a repository to understand its status and configuration.

  • Start with git-annex-webapp: This will open your browser, showing a list of remotes and their syncing status.

[Screenshot: GitAnnexStatus.png]

  • git remote: Lists the names of remotes that are git repositories. (Add -v if you want more detail.) This does not list special remotes.
  • git annex info: Information about the git repository, including a list of special remotes and summary information about disk usage.
  • git annex info --fast *: Gives information about the files and folders in the current directory such as their total size, and current size on disk.
  • git annex list: Shows which files are where.

  • du -L | sort -n, du -Lh | sort -h: Shows the disk usage of files in the annex, following the symlinks to the actual content (see this discussion).

Workflows

New Collaborator

To share data with a new collaborator do the following.

Prerequisites:

  1. Have a repository, annex, and data set up, with both the repository and the remotes accessible to the collaborator. (I.e. host the repo on Bitbucket or GitHub, and the data in a shared Google Team Drive.)
  2. Have the collaborator install git-annex and the appropriate drivers.

Here is the workflow for the collaborator to connect with and share the data:

  1. Clone the repository (not the data) and go into the cloned repo.

    git clone git@bitbucket.org:mforbes/gpe_data.git
    cd gpe_data
    
  2. Connect to the remote. Here we use the example of Google Drive, which requires several steps.

    1. Authenticate/connect to the google drive:

      git-annex-remote-googledrive setup
      

      This will provide a URL to follow, which you must use to connect to the specific account where the data will be stored. This will store an authentication token in a file token.json.

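    2. Add the remote using the token. (initremote creates a new special remote; if the remote is already recorded in the repository you cloned, run git annex enableremote <name> instead.)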

      git annex initremote google-wsu                  \
          prefix=git-annex/mmfbb.gpe_data              \
          type=external externaltype=googledrive       \
          chunk=50MiB encryption=shared mac=HMACSHA512
      
      git annex initremote google-team-wsu             \
          root_id=<ROOT ID>                            \
          type=external                                \
          externaltype=googledrive                     \
          chunk=50MiB encryption=shared mac=HMACSHA512
      

      Note, there are several ways of specifying where the data will be stored on the Google Drive:

      • prefix=: This allows you to specify a folder name, which will be located off of your My Drive. Simple, but not so versatile.
      • root_id=: When you connect to the drive online, the last part of the URL is this number. This allows you to connect with folders anywhere, such as in a Team Drive or one that is not included in your My Drive.

        Team Drives Not Working: See this issue.

git clone git@bitbucket.org:mforbes/gpe_data.git
git-annex-remote-googledrive setup

git annex initremote google type=external externaltype=googledrive prefix=git-annex chunk=50MiB encryption=shared mac=HMACSHA512

Rename a Remote

It turns out that renaming a special remote is a bit tricky. There is a todo request but it has not yet been implemented. Here is how to do this:

  • Regular git remotes: If the remote is another git repository, then you can just rename it using:

    git remote rename <old> <new>
    
  • Special remotes: Special remotes such as Google Drive, Folders, etc. (which are not git repositories) need some extra work as described here. The easiest way, however, is to run the webapp:

    git-annex-webapp
    

    When you rename a remote in the webapp, it will change everything appropriately.

Performance

Git and git-annex can become slow if you have lots of files. Here are some tips to help:

  • Repositories with large number of files:

    • Use a version 4 index:

      git update-index --index-version 4
      
    • If git count-objects returns a large number (>25000), run the garbage collector:

      git repack -ad
      git gc
      git prune
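
      The check-then-collect step can be scripted. A minimal sketch, assuming the default one-line output of git count-objects ("N objects, M kilobytes") and using the 25000 threshold from above as a default; the function names are hypothetical:

```shell
#!/bin/sh
# Sketch: run the garbage collector only when there are many loose objects.

# Succeeds if the count ($1) exceeds the threshold ($2, default 25000).
too_many_objects() {
    [ "$1" -gt "${2:-25000}" ]
}

# Run inside a repository: count loose objects and repack if needed.
# Optional argument: a custom threshold.
gc_if_needed() {
    count=$(git count-objects | awk '{print $1}')
    if too_many_objects "$count" "$1"; then
        git repack -ad && git gc && git prune
    else
        echo "only $count loose objects; nothing to do"
    fi
}
```

      Calling gc_if_needed from the top of the repository leaves small repositories untouched and only pays the repacking cost when the loose-object count is high.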
      

References

git-annex

Alternatives

Here is my current use-case. I run a research group at WSU with ~10 collaborators. We need to share source code, experimental data, papers, plots, and simulation data on a regular basis. Some of the collaborators are used to using Dropbox or Google Drive, which work for syncing, but have issues (listed below). WSU provides 1TB of storage through Microsoft One Drive for all students and faculty, so this would be a natural storage tool, but few use it yet. We run simulations on our local machines, office desktops, a local HPC cluster (Kamiak), and online using CoCalc.

Dropbox

Currently our preferred tool for sharing. The main issue is lack of space relative to Google Drive and One Drive, as well as issues with selective sync:

Pros:

  • Automatic syncing with conflict resolution.
  • Fast syncing.
  • Available Linux client (works on CoCalc).
  • Support symlinks. (Allows you to have your actual directory elsewhere on your computer rather than have to keep it in the drive.)
  • Selective syncing by folder.

Cons:

  • Poor selective syncing:
    • Cannot exclude based on filename.
    • Excluding a directory from syncing deletes it locally, so you have to go through a convoluted procedure of copying it (you can't move it or you won't be able to exclude the non-existent directory), excluding it, and then restoring the contents.
  • Potential issues with syncing .hg or .git directories (these should be typically excluded).
  • Limited space by default (~3GB).

Google Drive (Google Backup and Sync)

Because it offers more space, we use this for syncing data with experimentalists, but it also has a range of issues. Some of these may be resolved using Insync (commercial, one-time fee). This may be a more attractive solution once G-Suite for Education is set up at WSU.

Pros:

  • Automatic syncing with conflict resolution.
  • Reasonably fast syncing (slower than Dropbox).
  • Selective syncing by folder.
  • Larger default space (~15GB)

Cons:

  • No Linux client (cannot use on CoCalc).
  • No symlink support.
  • Poor selective syncing:
    • Cannot exclude based on filename.
    • Excluding a directory from syncing deletes it locally, so you have to go through a convoluted procedure of copying it (you can't move it or you won't be able to exclude the non-existent directory), excluding it, and then restoring the contents.
  • Potential issues with syncing .hg or .git directories (these should be typically excluded).

Microsoft One Drive

The main reason to consider this is that WSU offers all faculty and students a 1TB share. I have much less experience with this.

Pros:

  • Automatic syncing (with conflict resolution?).
  • (Selective syncing by folder?)
  • Larger default space (~1TB for us)

Cons:

  • No Linux client (cannot use on CoCalc).
  • No symlink support.
  • Poor selective syncing:
    • Cannot exclude based on filename.
    • Excluding a directory from syncing deletes it locally, so you have to go through a convoluted procedure of copying it (you can't move it or you won't be able to exclude the non-existent directory), excluding it, and then restoring the contents.
  • Potential issues with syncing .hg or .git directories (these should be typically excluded).

git-annex

git-annex works by replacing each large file with a symlink to the actual content, which is hidden inside the .git directory (or which may not be present at all). When you git annex get the file, it will be transferred from the closest repository that has a copy so you can use it locally. When you are finished, you can drop the local copy to save disk space, or move it offsite.

A really nice feature is that you can store the files in many places, even if they do not have git. For example, you can store the files on a remote server where you have ssh/rsync access, or in Amazon S3 storage, etc. My hope is that I can use git-annex to interact with the other three services. By itself, git-annex does not provide automatic syncing (everything must be done manually), but the git-annex assistant fills this role.
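
The symlink mechanism can be inspected with ordinary shell tests: an annexed file whose content is present is a symlink that resolves, while one whose content has been dropped or moved elsewhere is a dangling symlink. A minimal sketch (the helper name annex_state is hypothetical, and it assumes locked files, i.e. the symlink layout shown in the walk-through):

```shell
#!/bin/sh
# Sketch: classify a path the way a locked annexed file appears on disk.
annex_state() {
    if [ -L "$1" ]; then
        if [ -e "$1" ]; then
            echo present        # symlink resolves: content is here
        else
            echo missing        # dangling symlink: content lives elsewhere
        fi
    elif [ -e "$1" ]; then
        echo regular            # ordinary (or unlocked) file
    else
        echo absent             # no such path
    fi
}
```

A file reported as missing is exactly the case where git annex get is needed before the data can be read.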
