git-annex is a tool for managing large data files with git. The idea is to store the information about the file in a git repository that can be synchronized, but to store the actual data separately. The annex keeps track of where the file actually resides (which may be in a different repository, or on another computer) and allows you to manipulate the file (renaming, moving, etc.) without having to have the actual file present.
Here we explore git-annex as a mechanism for replacing and interacting with Dropbox, Google Drive, One Drive etc. with the following goals:
- Multiple users can share data.
- Data shared across many platforms: HPC clusters, laptops, desktops, Mac, Windows, Linux, CoCalc, etc.
- Allow only a subset of data to be stored on any particular device (esp. laptops) if storage on that device is limited.
- Utilize cloud storage options including Google Cloud, Dropbox, Microsoft One Drive both as redundant backups and as a mechanism for sharing data with others who need to be able to use only one of these services.
- Automatic and manual sync options.
My use-case is that I run a research group at WSU with ~10 collaborators. We need to share source code, experimental data, papers, plots, and simulation data on a regular basis. Some of the collaborators are used to using Dropbox or Google Drive, which work for syncing, but have issues (listed below). WSU provides 1TB of storage through Microsoft One Drive for all students and faculty, so this would be a natural storage option, but few use it yet. We run simulations on our local machines, office desktops, a local HPC cluster (Kamiak), and online using CoCalc.
Overview of git-annex¶
Here is my current understanding of git-annex and the git-annex assistant.
- Each "annex" is a git repository where the annex information is stored under the `.git` directory. Note: the annex information is separate from regular git information. Thus, if you do not initialize the git repository, `git status` will return `fatal: This operation must be run in a work tree`. You can use the annex as both a git-annex and a regular git repo if you choose. All of the annex information (log files, repository information, etc.) is stored in the `.git` directory.
- For a GUI you can run `git-annex-webapp`. This seems to open the annex in the current folder or the last used annex. You can create other repositories from the GUI, but they will not appear in the Dashboard: only the repositories managed by the current "annex" (i.e. stored in the `.git` folder for that annex) appear in the Dashboard.
- Configuration information is stored in `~/.config/git-annex` and `.git/config`. The former includes autostart information about which repositories should be synced.
- Look at the various workflows. This explains the difference between `sync` (which syncs meta-data but not the file contents) and `sync --content`.
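The difference between the two sync modes can be sketched as a small wrapper. This is only an illustration: the function name `annex_sync` and its `metadata`/`content` arguments are made up here, and the behaviour in the comments is my reading of the workflows documentation, so newer git-annex releases may have different defaults.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-annex): pick one of the two sync modes.
# Run it from inside an existing annex that already has at least one remote.
annex_sync() {
    case "$1" in
        metadata)
            # Exchange git history and location tracking only; no file contents.
            git annex sync
            ;;
        content)
            # As above, but also transfer file contents to and from remotes.
            git annex sync --content
            ;;
        *)
            echo "usage: annex_sync metadata|content" >&2
            return 2
            ;;
    esac
}
```

Calling `annex_sync metadata` is cheap and keeps every clone aware of where files live; `annex_sync content` is the one that actually moves data and can take a long time.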
Installation¶
Installation on major platforms is quite straightforward using the prebuilt binaries. Read the instructions carefully to see what you need to add to your path so you can access the `git annex` commands. (On my Mac OS X laptop, I added `/Applications/git-annex.app/Contents/MacOS` to my path using the GNU Environment Modules mechanism, which I have installed for keeping track of various paths.)
Building git-annex from source might be a challenge since it is written in Haskell. Fortunately, one can use remote computers without installing git-annex on them. In many cases, it can also be installed using package managers.
Conda¶
This is now my preferred method on Linux servers:
conda install -c conda-forge git-annex
conda install -c conda-forge tenacity pydrive
pip install annexremote git-annex-remote-googledrive
Note: this requires Python 3, but it also includes the client required for using Google Drive (see below).
Alternatively, I install this in a dedicated conda environment using the following yml file:
# environment.app_gitannex.yml
name: app_gitannex
channels:
  - defaults
  - conda-forge
dependencies:
  - tenacity
  - pydrive
  - git-annex
  - pip:
    - annexremote
    - git-annex-remote-googledrive
by running
conda env create --file environment.app_gitannex.yml
then linking this to a `bin/` directory on the path:
ln -s "$(conda env list | grep app_gitannex | awk '{print $2}')/bin/git-annex-remote-googledrive" /usr/local/bin/
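One caveat with the `grep | awk '{print $2}'` pipeline: when the environment is the active one, `conda env list` marks it with a `*` in the second column, so `$2` is no longer the path. A slightly more defensive sketch is below; the helper name `env_path` is hypothetical, and it assumes the environment path contains no spaces.

```shell
#!/bin/sh
# Hypothetical helper: read `conda env list` output on stdin and print the
# path of the environment whose name is given as $1.
env_path() {
    # Match the name exactly in column 1; the path is the last column,
    # even when the active environment is flagged with "*".
    awk -v name="$1" '$1 == name { print $NF }'
}

# Usage sketch (assumes conda is installed and /usr/local/bin is writable):
#   ln -s "$(conda env list | env_path app_gitannex)/bin/git-annex-remote-googledrive" /usr/local/bin/
```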
Mac OS X¶
Unfortunately, a conda git-annex binary is not maintained for OS X, so I use the pre-built binary for OS X instead.
- git-annex-turtle might be useful for integration with the file manager (Finder); I have not tried it yet.
Ubuntu¶
The standard Ubuntu package is old. I tried adding Justin Geibel's PPA, as discussed at the bottom of the Ubuntu install page:
sudo add-apt-repository ppa:jtgeibel/ppa
sudo apt-get update
sudo apt-get install git-annex
but this version does not have the webapp, assistant, etc. I ended up resorting to using Conda.
Walk-through (git-annex only)¶
Loosely following the walkthrough, I created a repository and initialized the annex:
%%bash
mkdir -p /tmp/annex_eg
cd /tmp/annex_eg
git init
git annex init "my laptop:annex_eg"
Now add some "big" files:
%%file /tmp/annex_eg/data.dat
Tonnes of data!
And now add them to the annex.
%%bash
cd /tmp/annex_eg
git annex add data.dat
git commit -a -m "Added data.dat"
The file is now simply a symbolic link to the actual data file, which presently lives here in this repository, so we can view it.
!ls -la /tmp/annex_eg/data.dat
!cat /tmp/annex_eg/data.dat
We cannot modify it though: git-annex is very cautious about potential data loss!
%%bash
echo "More!" >> /tmp/annex_eg/data.dat
If you want to modify it, you need to first unlock it. This replaces the symlink with the actual file and allows you to edit it.
%%bash
cd /tmp/annex_eg/
git annex unlock data.dat
ls -lah data.dat
%%bash
echo "Revised!" >> /tmp/annex_eg/data.dat
If you want to discard the changes, you could just lock the file again. Of course, safety checks are in place to make sure you do not lose data:
%%bash
cd /tmp/annex_eg/
git annex lock data.dat
You can force this:
%%bash
cd /tmp/annex_eg/
git annex lock --force data.dat
If you want to save the changed data, you need to add and commit that data:
%%bash
cd /tmp/annex_eg/
git annex unlock data.dat
echo "Revised!" >> /tmp/annex_eg/data.dat
git annex add data.dat
git commit -a -m "Modified data.dat"
ls -la data.dat
Note that git-annex keeps both versions of the file:
%%bash
cd /tmp/annex_eg/
git log data.dat
Updating to the specified revision will link to the appropriate version (if available):
%%bash
cd /tmp/annex_eg/
cat data.dat
git checkout -q master^
cat data.dat
git checkout master
Remotes: Directory¶
Now let's add a remote. We will start with a simple remote directory in another folder. This is called a special remote. Note that this is not a git repository - it is simply a directory that will hold the data.
%%bash
mkdir -p /tmp/annex_eg_data_dir
cd /tmp/annex_eg
git annex initremote data_dir \
type=directory directory=/tmp/annex_eg_data_dir encryption=none
We can move data to that repository:
%%bash
cd /tmp/annex_eg
git annex move data.dat --to data_dir
Now the data is not here since it was moved rather than copied.
!cat /tmp/annex_eg/data.dat
... it is in the data directory:
%%bash
cd /tmp/annex_eg
git annex list
(Technical note: only the current version of the data was moved. The previous revision is still in `/tmp/annex_eg`.) If you want to move all versions, follow the instructions here:
git annex move --unused --to data_dir
If we need it, we can `git annex get` it:
%%bash
cd /tmp/annex_eg
git annex get data.dat
cat data.dat
Now it is in both places.
%%bash
cd /tmp/annex_eg
git annex list
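Since a locked annexed file is just a symlink, a dangling link is a quick hint that the content is not present in this repository. The helper below is a sketch under that assumption (it does not apply to unlocked files); the name `annex_present` is made up for illustration, and `git annex list` remains the authoritative answer.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the symlink $1 points at existing content.
annex_present() {
    # -L: the path itself is a symlink; -e: its target exists (follows the link).
    [ -L "$1" ] && [ -e "$1" ]
}

# Usage sketch:
#   annex_present /tmp/annex_eg/data.dat && echo "content is here"
```

This is handy in scripts that should skip files whose content has been moved to a remote instead of failing halfway through.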
Cheat-Sheet¶
- `git annex get <file>`: Get a file or directory.
- `git annex list`: Shows which files are where.
Remotes¶
The utility of git-annex comes from storing data on remotes. These are managed within your repository but can point to many different places, such as folders, Google Drive, etc. which do not run git.
There are a couple of points to note:
- Each remote has a name that is stored in the git repository. You must make sure that your remotes have useful names so that you and your collaborators can find them. For example, the default documentation uses names such as "google" which are fine for testing, but should be replaced by something more specific later. I use things like the following:
  - `google-wsu`: My generic git annex repo in `My Drive/git-annex/` at WSU. I would use a prefix such as `mmfbb.gpe_data` to organize the data within a folder that points to where I host the master repo: https://bitbucket.org/mforbes/gpe_data/.
  - `google-team-wsu`: Typically I share data through a Google Team Drive. This would be in something like `Team Drives/Project 1/git-annex/` hosted at WSU. To specify this folder, I would use the `root_id` option.
  - `google-uw`: A similar repository at UW.
Rclone¶
Now we try setting up some other remotes that are enabled with rclone. First one needs to install rclone:
Mac OS X:
cd ~/zips && curl -O https://downloads.rclone.org/rclone-current-osx-amd64.zip
unzip -a rclone-current-osx-amd64.zip && cd rclone-*-osx-amd64
mv rclone /usr/local/bin/
#mv rclone.1 /usr/local/man/man1/
rm -rf ~/zips/rclone-*-osx-amd64
Now one should install Daniel Dent's git-annex-remote-rclone project:
cd /usr/local/bin && curl -O https://raw.githubusercontent.com/DanielDent/git-annex-remote-rclone/master/git-annex-remote-rclone && chmod +x git-annex-remote-rclone
Microsoft One Drive¶
For my WSU business account I need to grant permission for rclone to access this. I could not find a way to do it without admin access, so I have filed a ticket with ITS.
Google Drive¶
Currently, to use Google Drive you must use the git-annex-remote-googledrive project. This requires Python 3 to run, but installed nicely from source. Follow the instructions to authenticate with your drive and you should be good to go.
I install this in a conda environment using the following yml file:
# environment.app_gitannex.yml
name: app_gitannex
channels:
  - defaults
  - conda-forge
dependencies:
  - tenacity
  - pydrive
  - pip:
    - annexremote
    - git-annex-remote-googledrive
by running
conda env create --file environment.app_gitannex.yml
then linking this to a `bin/` directory on the path:
ln -s "$(conda env list | grep app_gitannex | awk '{print $2}')/bin/git-annex-remote-googledrive" /usr/local/bin/
Initialize Remote¶
Once you have this environment installed, you can create and initialize a remote.
git-annex-remote-googledrive setup
This will provide a URL to follow, which you must use to connect to the specific account where the data will be stored. This will store an authentication token in a file `token.json`. Now add the remote using the token:
conda activate app_gitannex
# google-wsu is the name you will use to refer to the remote;
# prefix is the path in your Google Drive.
git annex initremote google-wsu \
    prefix=git-annex/mmfbb.gpe_data \
    type=external externaltype=googledrive \
    chunk=50MiB encryption=shared mac=HMACSHA512
Note, there are several ways of specifying where the data will be stored on the Google Drive:
- `prefix=`: This allows you to specify a folder name, which will be located off of your `My Drive`. Simple, but not so versatile.
- `root_id=`: When you connect to the drive online, the last part of the URL is this ID. This allows you to connect with folders anywhere, such as in a Team Drive or one that is not included in your `My Drive`.
Team Drives (not working)¶
Team Drives support is not yet working: See this issue.
git annex initremote google-team-wsu \
root_id=<root id> \
type=external \
externaltype=googledrive \
chunk=50MiB encryption=shared mac=HMACSHA512
Cleanup¶
Removing these repositories is a bit tricky because git-annex sets very restrictive permissions on its files.
%%bash
find /tmp/annex_eg -exec chmod u+rw {} \;
find /tmp/annex_eg_data_dir -exec chmod u+rw {} \;
rm -rf /tmp/annex_eg /tmp/annex_eg_data_dir
git-annex assistant¶
Without the assistant, one must explicitly synchronize between two repositories. This gives one explicit control over what is on any particular machine, but does not satisfy my desire for automatic synchronization.
Sync between two folders¶
First we create the two annex folders.
%%bash
PREF="/tmp/"
cd "$PREF"
rm -rf annex1 annex2
cd "$PREF"
mkdir annex1
cd annex1
git init
git annex init "Annex 1"
cd "$PREF"
mkdir annex2
cd annex2
git init
git annex init "Annex 2"
Now we add each annex as a remote.
%%bash
PREF="/tmp/"
# Add annex2 as a remote to annex1
cd "$PREF"/annex1
git remote add annex2 ../annex2
# Add annex1 as a remote to annex2
cd "$PREF"/annex2
git remote add annex1 ../annex1
Now start the git-annex assistant.
Here I am following the 10-minute intro walk-through.
- Install git-annex on a remote server (I did this on my office desktop `swan`):
    ssh swan sudo apt-get install git-annex
- Run git-annex locally:
    git-annex-webapp
- Add a remote server.
Note: I had issues connecting with my SSH key. Password worked fine.
Issues and Gotchas¶
Git annex automatically merges files. If there is a conflict, then the conflicted files will appear with a name like `file.AAA` and `file.BBB`. You must then decide what to do with this. This mechanism is chosen so that git-annex never enters a conflicted merge state.
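To find such leftovers after a merge, something like the following sketch may help. It assumes the conflicting copies carry a `.variant-` marker in their names, which is what I believe current git-annex produces in place of the schematic `.AAA`/`.BBB` suffixes; the helper name `list_conflicts` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list files left behind by automatic conflict
# resolution under directory $1 (default: current directory).
# Assumes the copies are named like "file.variant-XXXX".
list_conflicts() {
    # Skip the .git directory entirely; print anything with a variant marker.
    find "${1:-.}" -name .git -prune -o -name '*.variant-*' -print
}

# Usage sketch:
#   list_conflicts /tmp/annex1
```

An empty result means nothing needs attention; otherwise each listed pair has to be compared and one of the copies removed or renamed by hand.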
SSH¶
When I tried to configure a repository on a remote server with an "existing SSH key" the connection failed. Here is what the log showed (enabled by going to `Preferences` in the webapp and enabling debug information in the log):
- The first problem is that I tried to use an SSH alias I had set up (`swan`), which was translated to `swan.home` for some reason. The reason is explained here. The solution is to use the fully qualified host-name and manually correct it afterwards.
- The second problem was that the assistant attempts to run the following checks on the remote host:
    cd <repo-name> && git config --list
  Thus, the repo must exist on the remote.

```
[2018-05-19 13:38:55.564412] process done ExitFailure 2
[2018-05-19 13:43:39.138347] read: ssh-keygen ["-F","swan.physics.wsu.edu"]
[2018-05-19 13:43:39.163278] process done ExitSuccess
[2018-05-19 13:43:39.163418] chat: ssh ["-oNumberOfPasswordPrompts=0","-oStrictHostKeyChecking=yes","-n","-p","22","mforbes@swan.physics.wsu.edu","sh -c 'echo '\"'\"'git-annex-probe loggedin'\"'\"';if which git-annex-shell; then echo '\"'\"'git-annex-probe git-annex-shell'\"'\"'; fi;if which git; then echo '\"'\"'git-annex-probe git'\"'\"'; fi;if which rsync; then echo '\"'\"'git-annex-probe rsync'\"'\"'; fi;if which ~/.ssh/git-annex-shell; then echo '\"'\"'git-annex-probe ~/.ssh/git-annex-shell'\"'\"'; fi;if which ~/.ssh/git-annex-wrapper; then echo '\"'\"'git-annex-probe ~/.ssh/git-annex-wrapper'\"'\"'; fi;cd '\"'\"'annex'\"'\"' && git config --list'"]
[2018-05-19 13:43:40.021729] process done ExitFailure 2
```
The ssh call in the log reduces to:
```bash
ssh -oNumberOfPasswordPrompts=0 \
    -oStrictHostKeyChecking=yes -n -p 22 mforbes@swan.physics.wsu.edu
```
Questions¶
- How does git-annex handle conflicts? Probably diverging branches?
- Can I symlink a directory for distribution with git-annex? (For example: suppose I want to sync an existing folder. Can I symlink it into a git-annex folder and have it go over automatically?)
- How do I remove files? The correct way is discussed here. Basically, one should use `git annex drop $file`; however, this will fail if there are no copies safely stored somewhere else. I am not sure how to quickly remove all files...
Cheat Sheet¶
- Basic add, commit, and sync:
    git annex add data.dat
    git commit -a -m "Added data.dat"
- `git annex unannex`: Undo a previous addition of files.
- `git annex get <file>`: Get a file or directory from some remote so it exists on the computer where the command is run.
- `git annex drop <file>`: Drops the file or directory from the local computer. Only allowed if enough copies exist elsewhere.
- `git annex copy <file> --to <remote>`, `git annex move <file> --to <remote>`: Copy or move the specified file or directory to the specified remote. This is manual syncing.
- `git annex repair`: If things get corrupted, try this.
Inspecting a Repo¶
This section contains useful commands for inspecting a repository to understand its status and configuration.
- Start with `git-annex-webapp`: This will open your browser, showing a list of remotes and their syncing status.
- `git remote`: List the names of remotes which are git repositories. (Add `-v` if you want more detail.) This does not list special remotes.
- `git annex info`: Information about the git repository, including a list of special remotes and summary information about disk usage.
- `git annex info --fast *`: Gives information about the files and folders in the current directory such as their total size, and current size on disk.
- `git annex list`: Shows which files are where.
- `du -L | sort -n`, `du -Lh | sort -h`: Shows the disk usage of files in the annex (see this discussion). Note that human-readable sizes need `sort -h` rather than `sort -n` to be ordered correctly.
Workflows¶
New Collaborator¶
To share data with a new collaborator do the following.
Prerequisites:
- Have a repository, annex, and data set up, with both the repository and the remotes accessible to the collaborator. (I.e. host the repo on bitbucket or github, and the data in a shared Google Team Drive.)
- Have the collaborator install git-annex and the appropriate drivers.
Here is the workflow for the collaborator to connect with and share the data:
- Clone the repository (not the data) and go into the cloned repo.
    git clone git@bitbucket.org:mforbes/gpe_data.git
    cd gpe_data
- Connect to the remote. Here we use the example of Google Drive, which requires several steps.
  - Authenticate/connect to the Google Drive:
      git-annex-remote-googledrive setup
    This will provide a URL to follow, which you must use to connect to the specific account where the data will be stored. This will store an authentication token in a file `token.json`.
  - Add the remote using the token:
      git annex initremote google-wsu \
          prefix=git-annex/mmfbb.gpe_data \
          type=external externaltype=googledrive \
          chunk=50MiB encryption=shared mac=HMACSHA512
      git annex initremote google-team-wsu \
          root_id=<ROOT ID> \
          type=external \
          externaltype=googledrive \
          chunk=50MiB encryption=shared mac=HMACSHA512
    Note, there are several ways of specifying where the data will be stored on the Google Drive:
    - `prefix=`: This allows you to specify a folder name, which will be located off of your `My Drive`. Simple, but not so versatile.
    - `root_id=`: When you connect to the drive online, the last part of the URL is this ID. This allows you to connect with folders anywhere, such as in a Team Drive or one that is not included in your `My Drive`.
    Team Drives Not Working: See this issue.

In short:
git clone git@bitbucket.org:mforbes/gpe_data.git
git-annex-remote-googledrive setup
git annex initremote google type=external externaltype=googledrive prefix=git-annex chunk=50MiB encryption=shared mac=HMACSHA512
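A caveat on the last step: if the special remote was already created by whoever set up the repository, my understanding is that a collaborator should enable it with `git annex enableremote` instead of initializing it again, since its configuration already travels with the repository. The sketch below assumes that situation; the helper name `enable_and_get` is hypothetical, and I have not verified this against the Google Drive backend.

```shell
#!/bin/sh
# Hypothetical helper: enable an existing special remote by name, then
# fetch the content of everything below the current directory.
enable_and_get() {
    if [ -z "$1" ]; then
        echo "usage: enable_and_get <remote-name>" >&2
        return 2
    fi
    git annex enableremote "$1" && git annex get .
}

# Usage sketch, run inside the cloned repository after authenticating:
#   enable_and_get google-wsu
```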
Rename a Remote¶
It turns out that renaming a special remote is a bit tricky. There is a todo request but it has not yet been implemented. Here is how to do this:
- Regular git remotes: If the remote is another git repository, then you can just rename it using:
    git remote rename <old> <new>
- Special remotes: Special remotes such as Google Drive, Folders, etc. (which are not git repositories) need some extra work as described here. The easiest way, however, is to run the webapp:
    git-annex-webapp
  When you rename a remote in the webapp, it will change everything appropriately.
Performance¶
Git and git-annex can become slow if you have lots of files. Here are some tips to help:
- Repositories with a large number of files:
  - Use a version 4 index:
      git update-index --index-version 4
  - If `git count-objects` returns a large number (>25000), run the garbage collector:
      git repack -ad
      git gc
      git prune
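The object-count check can be scripted. The sketch below parses the output of `git count-objects` (which looks like `N objects, M kilobytes`) and compares the count against the 25000 threshold mentioned above; the helper name `too_many_objects` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `git count-objects` output on stdin and succeed
# if the object count exceeds the threshold $1 (default 25000).
too_many_objects() {
    awk -v limit="${1:-25000}" '
        BEGIN   { rc = 1 }                         # fail on empty input
        NR == 1 { rc = !($1 + 0 > limit + 0) }     # 0 (success) if over limit
        END     { exit rc }
    '
}

# Usage sketch, run inside the repository:
#   git count-objects | too_many_objects && git gc
```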
References¶
git-annex¶
- Marking a remote dead: When you lose access to a remote, you may lose data, but you can mark that remote dead.
- Related Software: Has some references to potentially useful tools.
- Preferred Content: These are rules for specifying what gets automatically synced.
Alternatives¶
Here is my current use-case. I run a research group at WSU with ~10 collaborators. We need to share source code, experimental data, papers, plots, and simulation data on a regular basis. Some of the collaborators are used to using Dropbox or Google Drive, which work for syncing, but have issues (listed below). WSU provides 1TB of storage through Microsoft One Drive for all students and faculty, so this would be a natural storage option, but few use it yet. We run simulations on our local machines, office desktops, a local HPC cluster (Kamiak), and online using CoCalc.
Dropbox¶
Currently our preferred tool for sharing. The main issue is lack of space relative to Google Drive and One Drive, as well as issues with selective sync:
Pros:
- Automatic syncing with conflict resolution.
- Fast syncing.
- Available Linux client (works on CoCalc).
- Supports symlinks. (Allows you to have your actual directory elsewhere on your computer rather than have to keep it in the drive.)
- Selective syncing by folder.
Cons:
- Poor selective syncing:
- Cannot exclude based on filename.
- Excluding a directory from syncing deletes it locally, so you have to go through a convoluted procedure of copying it (you can't move it or you won't be able to exclude the non-existent directory), excluding it, and then restoring the contents.
- Potential issues with syncing `.hg` or `.git` directories (these should typically be excluded).
- Limited space by default (~3GB).
Google Drive (Google Backup and Sync)¶
Because it offers more space, we use this for syncing data with experimentalists, but it also has a range of issues. Some of these may be resolved using Insync (commercial, one-time fee). This may be a more attractive solution once G-Suite for Education is set up at WSU.
Pros:
- Automatic syncing with conflict resolution.
- Reasonably fast syncing (slower than Dropbox).
- Selective syncing by folder.
- Larger default space (~15GB)
Cons:
- No Linux client (cannot use on CoCalc).
- No symlink support.
- Poor selective syncing:
- Cannot exclude based on filename.
- Excluding a directory from syncing deletes it locally, so you have to go through a convoluted procedure of copying it (you can't move it or you won't be able to exclude the non-existent directory), excluding it, and then restoring the contents.
- Potential issues with syncing `.hg` or `.git` directories (these should typically be excluded).
References¶
- Insync is a commercial option for Google Drive that apparently alleviates many of the problems, adding a Linux client, symlink support, and ignore lists.
- Temporary fix for using symlinks with google drive (gdrive) on mac
- Symbolic link between files WITHIN google drive
Microsoft One Drive¶
The main reason to consider this is that WSU offers all faculty and students a 1TB share. I have much less experience with this.
Pros:
- Automatic syncing (with conflict resolution?).
- (Selective syncing by folder?)
- Larger default space (~1TB for us)
Cons:
- No Linux client (cannot use on CoCalc).
- No symlink support.
- Poor selective syncing:
- Cannot exclude based on filename.
- Excluding a directory from syncing deletes it locally, so you have to go through a convoluted procedure of copying it (you can't move it or you won't be able to exclude the non-existent directory), excluding it, and then restoring the contents.
- Potential issues with syncing `.hg` or `.git` directories (these should typically be excluded).
References¶
git-annex¶
git-annex works by having the large file actually be a symlink to the file's content, which is hidden (or which may not be present). When you `git annex get` the file, it will be transferred from the closest repository that has a copy so you can use it locally. When you are finished, you can remove the actual file to save disk space, or move it offsite.
A really nice feature is that you can store the files in many places, even if they do not have git. For example, you can store the files on a remote server where you have ssh/rsync access, or in Amazon S3 storage, etc. My hope is that I can use git-annex to interact with the other three services. By itself, git-annex does not provide automatic syncing (everything must be done manually), but the git-annex assistant fills this role.