Using the common filesystem

All CÉCI clusters are connected to a central storage system that is visible to all compute nodes of all clusters. This system runs on a fast, dedicated network. It will become the users' home in the near future, but in this first phase, it is set up as an additional home besides the default, cluster-specific home.

This storage system is installed at two CÉCI locations (at ULiège and UCL) and data are replicated synchronously between both locations to ensure data safety and a certain level of high availability. Moreover, on each site, a local cache is set up to mask network latencies and make the user experience as smooth as possible. Those caches are replicated asynchronously with the central storage, meaning that files written there will appear on the other clusters after some delay.

Note

There is no point in creating a file on one cluster and then quickly logging in to another cluster and repeatedly running ls waiting for the file to appear. The file is copied to the main storage only after 15 seconds, and the metadata are propagated to the other clusters after up to 60 seconds.

It also means that if you modify the same file from two different clusters, the result is undefined.

Warning

Do not write to the same file from two different clusters at the same time. This would corrupt the file.

After some delay, files that are not read from a cluster are removed from the cache connected to that cluster. The metadata are kept, meaning that the files still appear when you use ls, but the data are no longer stored locally. The data will be synced back as soon as you try to access the file.

The common storage is split into four distinct directories: /CECI/home, /CECI/trsf, /CECI/proj, and /CECI/soft.

You, as a user, have read and write access to the first two, maybe to the third one, and only read access to the fourth one.
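
Assuming the common storage is mounted under /CECI on the cluster you are logged in to, as the paths above suggest, you can list the four directories directly; the listing below is illustrative:

mylogin@nic4: $ ls /CECI
home  proj  soft  trsf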

Note

If you run the du command on a cluster, you will only get the disk usage on the cache, not on the main central system! To find out the space used in the common filesystem folders, always add the --apparent-size option:

du --apparent-size -sh  /common/storage/path

For more details on this topic, keep reading the following section.

Home

The /CECI/home partition is designed to work like the current home filesystems you have access to on the clusters. You have a 100GB quota to store your own software, configuration files, and small (input) data files or (output) result files. The advantage of using this home rather than the cluster-specific one is that, no matter how you configure your environment on one cluster, you will find the same environment on all clusters without needing to move or copy files around.

Note

The 100GB quota is enforced on the local caches connected to the clusters, but your quota on the central storage is NOT 5 x 100GB. Do not try to play the system and fill your quota on all the clusters at the same time; you might be able to go beyond the quota limit, but it could result in file corruption and loss of synchronisation.

To get your current usage on the main central system, use the quota command from one of the clusters. If the quota command does not list the central storage, first try listing its contents with, for instance, ls $CECIHOME. It should then appear.
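
For instance, the minimal sequence below first lists the contents so the filesystem shows up, then queries the quota; the -s flag simply asks the standard quota command for human-readable sizes, and the exact output format depends on the cluster:

mylogin@nic4: $ ls $CECIHOME
mylogin@nic4: $ quota -s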

In the long term, this filesystem will be the default home on the clusters, but at the moment, you will need to move there (using cd) explicitly. An environment variable named $CECIHOME points to your home on the common filesystem so you can issue cd $CECIHOME.

Note

If you have, in your CÉCI home, programs that you compiled yourself, make sure you have followed the tips and tricks mentioned in Compiling for multiple CPU architectures so that they will be able to run on every CPU architecture present in the CÉCI clusters.

To make the previous explanations about the setup behind the common storage clearer, a special note is in order about using the du command to monitor the disk space taken by files and directories.

For example, if you copy a 6MB file from one of the clusters' home folders to $CECIHOME and then run du -sh, you will obtain:

mylogin@nic4: $ cp ~/file.dat $CECIHOME/
mylogin@nic4: $ du -sh $CECIHOME/file.dat
6.0M   $CECIHOME/file.dat

To avoid the problems inherent to wide-area network latencies, the caches at the other clusters are served only the metadata of files created elsewhere. That is to say, your file.dat file will actually be copied, synced, and stored on both main storage systems at ULiège and UCL, but the other caches will only have the information that the file exists on the common space; the actual contents will be transferred only on demand.

Then, if after the previous steps you log in to another CÉCI cluster and run du -sh, you will see:

mylogin@dragon1: $ du -sh $CECIHOME/file.dat
0     $CECIHOME/file.dat

If on this cluster you access the file, e.g. open it with an editor, or run cat or less on it, then the actual data contained in the file will be transferred from one of the main storage systems to the cache on this cluster; afterwards you should get:

mylogin@dragon1: $ du -sh $CECIHOME/file.dat
6.0M   $CECIHOME/file.dat

To summarize, you should never rely on du -sh to verify whether a file is properly stored or copied; this is true in general for any kind of filesystem. A more appropriate approach is to check file consistency with a hash tool and verify that you get the same output on different clusters:

mylogin@dragon1: $ md5sum $CECIHOME/file.dat
da6a0d097e307ac52ed9b4ad551801fc  $CECIHOME/file.dat

If you want to know the approximate space taken by a file or directory on $CECIHOME, add the --apparent-size option to du:

mylogin@dragon1: $ du -sh --apparent-size $CECIHOME/file.dat
6.0M   $CECIHOME/file.dat

The output should be nearly the same on all the clusters, regardless of whether the files were accessed or not. Notice that the man page for du describes the tool as "du - estimate file space usage".

If this information differs significantly among clusters, please submit a ticket on the CÉCI Support page.

Trsf

The /CECI/trsf partition is meant for large file transfers from one cluster to another. The user quota on that partition is 1TB soft and 10TB hard, with a grace period of 10 days. This means that you can use up to 10TB, but as soon as you go above the 1TB limit, you have 10 days to remove files and come back below the 1TB limit. If you fail to do so, you will not be able to write to the disk anymore until you comply. This directory is purged of old files regularly; each time the purge is performed, all files created more than 6 months ago are irremediably removed.
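
To spot your own files that are approaching the purge window, a GNU find one-liner such as the following can help; the 150-day threshold is only an illustrative margin before the 6-month limit, and find uses the modification time as an approximation of the file age:

find /CECI/trsf -user $USER -mtime +150 -ls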

Note

Do not use that partition as a storage partition. It is meant for transfers and can only hold temporary data.

This partition can be used to transfer large files from the scratch space (see globalscratch) of one cluster to the scratch space of another. It will be much faster than using scp or rsync directly. As this filesystem is optimized for large files, if you have a lot of small files, you must tar them, otherwise you will experience very poor performance. For instance, assuming you have a directory transfer_me on the scratch of Lemaitre2 and want to move it to the scratch of NIC4:

On Lemaitre2:

cd $GLOBALSCRATCH
tar cvzf $CECITRSF/transfer_me.tgz ./transfer_me

On NIC4, a bit later:

cd $GLOBALSCRATCH
tar xvzf $CECITRSF/transfer_me.tgz
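
Once the archive has been extracted on the destination cluster, it is good practice to remove it from the transfer area so that it does not count towards your quota until the next purge; this cleanup step is an addition to the example above:

rm $CECITRSF/transfer_me.tgz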

Everyone can write into the $CECITRSF directory, but the sticky bit is set so that you can only delete your own files.
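
You can check this yourself with ls -ld; when the sticky bit is set, the permission string ends with a t (the exact owner and mode depend on the system):

ls -ld $CECITRSF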

Note

Even if this is a two-step operation, it will be much faster than using scp for large files, as it benefits from the fast dedicated network. But you still have the option of using scp, of course.

An environment variable named $CECITRSF points to that location.

Proj

The /CECI/proj partition will be dedicated to projects, i.e. groups of researchers working towards a common goal and sharing files. Projects will be created upon motivated request, with a specific quota and a specific duration.

Note

Currently, if you want to enable a project for your group, please contact ceci-logist@lists.ulg.ac.be to make the request.

Soft

The /CECI/soft partition will hold all the software that will be available on all clusters. The programs will be compiled for each processor type present in the CÉCI clusters. On each node, the version specific to the hardware of that node will be selected by the module system.
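
In practice this should be transparent: you will list and load the software with the usual module commands, and the build matching the node's CPU will be picked automatically. The module name below is purely hypothetical:

module avail
module load mysoftware    # hypothetical module name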

Behind the scenes

The CÉCI shared storage is based on two main storage systems hosted in Liège and Louvain-la-Neuve using the GPFS filesystem developed by IBM. Those two storage systems are synchronously mirrored, meaning that any file written to one of them is automatically written to the other one.

In addition, to keep the data on the common partitions synchronized among the different geographical locations of the CÉCI clusters, the Active File Management (AFM) feature of GPFS is used. Within this setup, a global namespace is defined to share a single filesystem among different gateway nodes, which are physically deployed at each of the CÉCI universities.

The five smaller storage systems that serve as buffers/caches and the two main ones are interconnected through a dedicated 10 Gbps network. Those caches are located on each site and are tightly connected to the cluster compute nodes.

[Figure: cecigpfs.png, overview of the CÉCI shared GPFS storage with the site caches]