Using the common filesystem

All CÉCI clusters are connected to a central storage system that is visible to all compute nodes of all clusters. This system runs on a fast, dedicated network. It will eventually become the users' home directory, but in this first phase it is set up as an additional home alongside the default, cluster-specific one.

This storage system is installed at two CÉCI locations and data are replicated synchronously between both locations to ensure data safety and a certain level of high availability. Moreover, on each site, a local cache is set up to mask network latencies and make the user experience as smooth as possible. Those caches are replicated asynchronously with the central storage, meaning that files written there will appear on the other clusters only after some delay.

Note

There is no point in creating a file on one cluster and then quickly going to another cluster and repeatedly running ls, waiting for the file to appear. The file is copied to the main storage only after 15 seconds, and the metadata are copied to the other clusters after up to 60 seconds.

It also means that if you modify the same file from two different clusters, the result is undefined.
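
As an illustration of those delays, a minimal sketch (the file name is purely illustrative; $CECIHOME, described below, points to your home on the common filesystem). On a first cluster:

touch $CECIHOME/sync_test.txt    # written to the local cache immediately

On another cluster, roughly a minute later:

ls -l $CECIHOME/sync_test.txt    # appears once the metadata have propagated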

Warning

Do not write to the same file from two different clusters at the same time. This would corrupt the file.

After some delay, files that are not read from a cluster are removed from the cache connected to that cluster. The metadata are kept, meaning that the file appears when you use ls, but the data are not. The data are synced back as soon as you try to read the file.

Note

If you run the du command on a cluster, you will only get the disk usage on the cache, not on the main central system. So you might be out of quota even if du says otherwise. And the result of du might very well be different on each cluster!
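
As a sketch of what this means in practice (the sizes shown are purely illustrative). On a first cluster:

du -sh $CECIHOME     # e.g. 12G, only the locally cached data are counted

On another cluster:

du -sh $CECIHOME     # e.g. 48G, a different subset of files is cached there

For the authoritative figure, use the quota command described below.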

The storage is split into four distinct directories:

  • home
  • trsf
  • proj
  • soft

You, as a user, have read and write access to the first two, maybe to the third one, and only read access to the fourth one.

Home

The /CECI/home part is designed to work like the home filesystems you currently have access to on the clusters. You have a 100GB quota to store your own software, configuration files, and small (input) data files or (output) result files. The advantage of using this home rather than the cluster-specific one is that however you configure your environment on one cluster, you will find the same environment on all clusters without the need to move and copy files around.

Note

The 100GB quota is enforced on the local caches connected to the clusters, but your quota on the central storage is NOT 5 x 100GB. Do not try to game the system and fill your quota on all clusters at the same time; you might be able to go beyond the quota limit, but it could result in file corruption and loss of synchronisation.

To get your current usage on the main central system, use the quota command from one of the clusters. If the quota command does not list the central storage, first list its contents with, for instance, ls $CECIHOME. It should then appear.

In the long term, this filesystem will be the default home on the clusters, but at the moment, you will need to ‘move’ there (using cd) explicitly. An environment variable named $CECIHOME points to your home on the common filesystem so you can issue cd $CECIHOME.
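
A minimal sketch of a typical session, putting the previous remarks together:

ls $CECIHOME      # makes the central storage appear in the quota listing
quota             # shows your usage on the main central system
cd $CECIHOME      # 'move' to your home on the common filesystem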

Note

If you have, in your CÉCI home, programs that you compiled yourself, make sure you have used the tips and tricks mentioned in Compiling for multiple CPU architectures so that they will be able to run on every CPU architecture present in the CÉCI clusters.

Trsf

The /CECI/trsf part is meant for large file transfers from one cluster to another. The user quota on that partition is 1TB soft and 10TB hard, with a grace period of 10 days. This means that you can use up to 10TB, but as soon as you go above the 1TB limit, you have 10 days to remove files and come back below the 1TB limit. If you fail to do so, you will not be able to write to the disk anymore until you comply. This directory is purged of old files regularly: every three months, all files created more than 6 months ago are irremediably removed.

Note

Do not use that partition as a storage partition. It is meant for transfers and can only hold temporary data.
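
To spot files of yours that are at risk of being purged, a minimal sketch ($CECITRSF, described below, points to this partition; the 180-day threshold mirrors the 6-month policy above):

find $CECITRSF -user $USER -mtime +180 -ls    # list your files older than roughly 6 months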

This partition can be used to transfer large files from the scratch space (see globalscratch) of one cluster to the scratch space of another one. It will be much faster than using scp or rsync directly. As this filesystem is optimized for large files, if you have a lot of small files, you must tar them, otherwise you will experience very poor performance. For instance, assuming you have a directory transfer_me on the scratch of Lemaitre2 and want to move it to the scratch of NIC4:

On Lemaitre2:

cd $GLOBALSCRATCH
tar cvzf $CECITRSF/transfer_me.tgz ./transfer_me

On NIC4, a bit later:

cd $GLOBALSCRATCH
tar xvzf $CECITRSF/transfer_me.tgz
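
Once you have checked that the extraction succeeded, you can remove the archive from the transfer area, since it is only meant to hold temporary data:

rm $CECITRSF/transfer_me.tgz    # free the transfer area once the data are on the destination scratch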

Everyone can write into the $CECITRSF directory, but the sticky bit is set so that you can only delete your own files.
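
You can check this with ls; the trailing t in the permission bits indicates the sticky bit (the output below is illustrative):

ls -ld $CECITRSF
# drwxrwxrwt ... /CECI/trsf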

Note

Even if this is a two-step operation, it will be much faster than using scp for large files as it benefits from the fast dedicated network. But you still have the option of using scp, of course.

An environment variable named $CECITRSF points to that location.
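
For comparison, a direct transfer with scp would look like the following sketch, where the login, front-end address and destination path are placeholders to adapt to your case:

scp -r $GLOBALSCRATCH/transfer_me <login>@<destination-frontend>:/path/to/destination/scratch/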

Proj

The /CECI/proj partition will be dedicated to “projects”, i.e. groups of researchers working towards a common goal and sharing files. Projects will be created upon motivated request, with a specific quota and a specific duration.

Note

Projects are not enabled yet. Details on how to proceed will be announced.

Soft

The /CECI/soft partition will hold all software available on all clusters. The software will be compiled for each processor type present in the CÉCI clusters. On each node, the module system will select the version specific to the hardware of that node.
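
In practice this will be transparent to the user; a minimal sketch of the intended usage (the package name is hypothetical):

module avail                 # list the software modules available on the node
module load somepackage/1.0  # the module system picks the build matching the node's CPU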

Behind the scenes

The CÉCI shared storage is based on two main storage systems hosted in Liège and Louvain-la-Neuve. Those storage systems are synchronously replicated, meaning that any file written to one of them is automatically written to the other one. They are connected to five smaller storage systems that serve as buffers/caches through a dedicated 10 Gbps network. Those caches are located on each site and are tightly connected to the cluster compute nodes.

[Figure: overview of the CÉCI shared storage architecture (cecigpfs.png)]