Transferring files to and from the clusters¶

This section is about transferring data between your computer and a cluster.

These examples apply only to Linux and macOS computers, or if you are using WSL. The commands are executed on your computer. Your SSH client configuration file must be set up as explained in the Connecting from a UNIX/Linux or MacOS computer section. Replace cecicluster with the Host name of the cluster defined in your .ssh/config file.

Copying a file or directory¶

The simplest way to copy a file to or from a cluster is to use the scp command.

scp ./file.txt cecicluster:destination/path/

Copying it back is done with

scp cecicluster:path/to/file.txt .

If you want to copy a directory and its content, use the -r option, just like with cp.

scp -r cecicluster:path/to/folder .

Transferring a large number of small files¶

Transferring a lot of small files will take a very long time with scp because of the overhead of copying every file individually. In such a case, using the tar command will reduce the transfer time significantly. You can first create a tar archive, then scp it as a single file, and then 'untar' it on the other side. But the most efficient way is to do all three operations in one go, without creating an intermediate file, like this:

tar cz ./source_dir | ssh cecicluster 'tar xvz -C destination/path'

This sends all the small files as a single compressed stream, which removes the overhead of dealing with many small files individually. No intermediate archive file is written to disk on either side.

Copying it back from cluster to your computer is done with:

ssh cecicluster 'tar -C source/path -cz source_dir' | tar -xz
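To see what this pipeline does without involving a cluster, the same pattern can be tried between two local directories. The following is only a sketch: the temporary directory and the file names are made up for the illustration, and the ssh step is simply left out.

```shell
# Work in a throwaway directory so that none of your own files are touched.
demo=$(mktemp -d)
mkdir -p "$demo/source_dir" "$demo/destination"

# Create a handful of small files to transfer (illustrative names).
for i in 1 2 3 4 5; do
    echo "data $i" > "$demo/source_dir/file_$i.txt"
done

# Same pattern as above, minus ssh: the first tar writes a compressed
# archive to standard output, the second tar unpacks it from standard input.
tar -C "$demo" -cz source_dir | tar -xz -C "$demo/destination"

# Count the files that arrived on the destination side.
copied=$(find "$demo/destination/source_dir" -type f | wc -l | tr -d ' ')
echo "files copied: $copied"
```

In the real commands, ssh cecicluster sits on one side of the pipe, so the archive travels over the network instead of staying on your computer.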

Transferring large files¶

When transferring large files, it is often worthwhile to use the -C option of scp, which compresses the data while it is being sent and decompresses it on arrival. Use it like this:

scp -C ./large_file.txt cecicluster:destination/path/

Resuming interrupted transfers¶

If, for any reason, a transfer is interrupted, you might end up with only some of the files transferred. Rather than restarting the transfer from scratch, you should then use the rsync command. The rsync command will compare the source and destination directories and only transfer what needs to be transferred: missing files, modified files, etc.

Use it this way (assuming again that your SSH client is properly configured):

rsync -va ./source_dir cecicluster:destination/path

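Pay attention to trailing slashes in your path names, in particular on the source path: rsync copies the directory itself when given ./source_dir, but only its contents when given ./source_dir/. If the form you use does not match the layout of the partial copy, you might end up with a full copy of the directory inside the existing, partial, one. Use the -n (dry-run) option of rsync to check what will happen before you run the actual command.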
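The effect of a trailing slash on the source path, and of the -n option, can be tried safely between local directories before touching the cluster. This is a minimal sketch; the directory names with_dir and contents_only are made up for the illustration.

```shell
# Throwaway playground with one source directory and two destinations.
demo=$(mktemp -d)
mkdir -p "$demo/source_dir" "$demo/with_dir" "$demo/contents_only"
echo "hello" > "$demo/source_dir/file.txt"

# Dry run: -n lists what would be transferred without copying anything.
rsync -van "$demo/source_dir" "$demo/with_dir"

# No trailing slash on the source: the directory itself is copied,
# which gives with_dir/source_dir/file.txt.
rsync -a "$demo/source_dir" "$demo/with_dir"

# Trailing slash on the source: only the contents are copied,
# which gives contents_only/file.txt.
rsync -a "$demo/source_dir/" "$demo/contents_only"
```

The same rule applies when the destination is of the form cecicluster:destination/path.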

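If one large file is left half-transferred, you can resume it using the --partial option of rsync, which keeps a partially transferred file instead of deleting it so that a later run can build on it.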

Transferring code¶

Source code is a specific type of data and should be treated as such. The best way to transfer code from one computer to another is to host the code in a source code repository using a versioning system such as Git (more common) or Mercurial (easier to use), and to clone the repository on the cluster.

Synchronising with a local directory¶

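If you want to keep two directories (one on your laptop, and one on the cluster) in sync, you can do that with rsync using its --delete option. But that only works one way: the destination is made identical to the source, and files that exist only in the destination are deleted. You therefore need to think carefully about the direction in which you run it, and the approach does not scale beyond two synchronised directories.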
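Because --delete removes files, it is worth trying it between two local directories first. The sketch below uses made-up directory names (laptop_dir and cluster_dir are both local here) and previews the operation with -n before running it.

```shell
# Throwaway playground: one "source" and one "destination" directory.
demo=$(mktemp -d)
mkdir -p "$demo/laptop_dir" "$demo/cluster_dir"
echo "kept" > "$demo/laptop_dir/kept.txt"
echo "only on the destination" > "$demo/cluster_dir/extra.txt"

# Preview first: -n shows which files would be sent or deleted.
rsync -van --delete "$demo/laptop_dir/" "$demo/cluster_dir/"

# Actual run: cluster_dir becomes a mirror of laptop_dir, so extra.txt,
# which exists only on the destination side, is deleted.
rsync -a --delete "$demo/laptop_dir/" "$demo/cluster_dir/"
```

For a real synchronisation, the destination would be of the form cecicluster:path/to/folder/ instead of a local directory.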

A more robust option is to use Unison, a piece of software that synchronises in both directions and can detect and handle conflicts (incompatible changes made to the same file in the two directories that must be kept in sync).