Using the common filesystem¶
All CÉCI clusters are connected to a central storage system that is visible to all compute nodes of all clusters. This system runs on a fast, dedicated network. It will become the home of the users in the near future, but in this first phase, it is set up as an additional home alongside the default, cluster-specific, home.
This storage system is installed at two CÉCI locations (at ULiège and UCL) and data are replicated synchronously on both locations to ensure data safety and a certain level of high availability. Moreover, on each site, a local cache is set up to mask the latencies of the network and make sure the user experience is as smooth as possible. Those caches are synchronized asynchronously with the central storage, meaning that files written there will appear on the other clusters only after some delay.
There is no point in creating a file on one cluster, then quickly logging in to another cluster and repeatedly running
ls, waiting for the file to appear. The file is copied to the main storage only after 15 seconds, and
the metadata are propagated to the other clusters after up to 60 seconds.
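Given these delays (up to 15 seconds to the central storage plus up to 60 seconds for the metadata to reach the other caches), a workflow that consumes a file on another cluster should poll with a timeout rather than assume instant propagation. A minimal sketch (the function name and default timeout are illustrative, not part of any CÉCI tooling):

```shell
# wait_for_file FILE [TIMEOUT_SECONDS]
# Polls until FILE exists or the timeout expires; returns non-zero on
# timeout. The 120 s default comfortably covers the ~75 s worst-case
# propagation delay described above.
wait_for_file() {
    local f=$1 timeout=${2:-120}
    while [ ! -e "$f" ] && [ "$timeout" -gt 0 ]; do
        sleep 5
        timeout=$((timeout - 5))
    done
    [ -e "$f" ]
}
```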
It also means that if you modify the same file from two different clusters, the result is undefined.
Do not write to the same file from two different clusters at the same time. This would corrupt the file.
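One common way to at least avoid readers seeing half-written files is to write to a unique temporary name and rename once the write is complete. This is a generic sketch, not an official CÉCI recommendation, and it does NOT make simultaneous writes from two clusters safe:

```shell
# Write to a unique temporary name first, then rename: a rename is
# atomic within the filesystem, so other processes only ever see the
# complete file. (This does NOT resolve concurrent writes from two
# different clusters, which remain undefined behaviour.)
out=${CECIHOME:-/tmp}/results.dat     # fallback path for illustration
tmp=$out.$(hostname).$$               # unique per host and process
echo "final results" > "$tmp"
mv "$tmp" "$out"
```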
After some delay, files that are not read from a cluster are evicted from
the cache connected to that cluster. The metadata are kept, meaning that the
files still appear when you run
ls, but the data are not there. The data are synced
back as soon as you try to access the file.
The common storage is split into four distinct directories: /CECI/home, /CECI/trsf, /CECI/proj, and /CECI/soft.
You, as a user, have read and write access to the first two, possibly to the third one, and read-only access to the fourth one.
If you run the
du command on a cluster, you will only get the disk
usage on the cache, not on the main central system!
To find out the space used in the common filesystem folders, always add the
--apparent-size option:
du --apparent-size -sh /common/storage/path
For more details on this topic, keep reading the following section.
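The difference between block usage and apparent size is easy to reproduce on any Linux machine with a sparse file, which mimics a cached file whose data blocks have not been fetched yet (the file name is just a temporary one for illustration):

```shell
# A sparse file has a 6 MB apparent size but (almost) no blocks on
# disk, much like a file in a CÉCI cache before its data is pulled in.
f=$(mktemp)
truncate -s 6M "$f"            # set the size without allocating blocks
du -sh "$f"                    # block usage: reports (close to) 0
du -sh --apparent-size "$f"    # apparent size: reports 6.0M
rm -f "$f"
```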
/CECI/home partition is designed to work like the current home filesystems you have access to on the clusters. You have a 100GB quota to store your own software, configuration files, and small (input) data files or (output) result files. The advantage of using this home rather than the cluster-specific one is that no matter how you configure your environment on one cluster, you will find the same environment on all clusters, without the need to move and copy files around.
The 100GB quota is enforced on the local caches connected to the clusters, but your quota on the central storage is NOT 5 x 100GB. Do not try to game the system by filling your quota on all the clusters at the same time; you might manage to go beyond the quota limit, but it could result in file corruption and loss of synchronisation.
To get your current usage on the main central system, use the
quota command from one of the clusters. If the
quota command does not list the central
storage, first list its contents with, for instance,
ls $CECIHOME. It
should then appear.
In the long term, this filesystem will be the default home on the clusters, but at the moment, you need to move there explicitly. An environment variable named
$CECIHOME points to your home on the common filesystem, so you can simply issue
cd $CECIHOME.
If you have programs in your CÉCI home that you compiled yourself, make sure you have followed the tips and tricks mentioned in Compiling for multiple CPU architectures so that they can run on every CPU architecture present in the CÉCI clusters.
To clarify the previous explanations about the caching setup behind
the common storage, this section takes a closer look at using the
du command to monitor the disk space taken by files and directories.
When you copy, for example, a 6MB file from one of the clusters' home folders to
$CECIHOME and run
du -sh, you will obtain:
mylogin@nic4: $ cp ~/file.dat $CECIHOME/
mylogin@nic4: $ du -sh $CECIHOME/file.dat
6.0M    $CECIHOME/file.dat
To avoid the problems inherent to wide-area interconnect
latencies, the caches at the other clusters are served only
the metadata of files created on one of them. That is to say, your
file.dat file will actually be copied, synced and stored on both main
storage systems at ULiège and UCL, but the other caches will only hold the
information that the file exists on the common space; the actual
contents are transferred only on demand.
Then, if after the previous steps you log in on another CÉCI cluster and run
du -sh, you will see:
mylogin@dragon1: $ du -sh $CECIHOME/file.dat
0       $CECIHOME/file.dat
If on this cluster you access the file, e.g. open it with an editor, run
less on it, etc., then the actual data contained in the file are
transferred from one of the main storage systems to the cache of this cluster,
after which you should get:
mylogin@dragon1: $ du -sh $CECIHOME/file.dat
6.0M    $CECIHOME/file.dat
To summarize, you should never rely on
du -sh to verify whether a file is
properly stored or copied; this is true in general for any kind of
filesystem. A more appropriate approach is to check file consistency
with a hash tool and verify that you get the same output on different clusters:
mylogin@dragon1: $ md5sum $CECIHOME/file.dat
da6a0d097e307ac52ed9b4ad551801fc  $CECIHOME/file.dat
If you want to know the approximate space taken by a
file or directory on
$CECIHOME, add the
--apparent-size option to
du:
mylogin@dragon1: $ du -sh --apparent-size $CECIHOME/file.dat
6.0M    $CECIHOME/file.dat
The output should be roughly the same on all the clusters, regardless of
whether the files were accessed or not. Notice that the man page for
du defines the
tool as "du - estimate file space usage".
If this information differs significantly among clusters, please submit a ticket on the CÉCI Support page.
/CECI/trsf partition is meant for large file transfers from one cluster to another. The user quota on that partition is 1TB soft and 10TB hard, with a grace period of 10 days. This means that you can use up to 10TB, but as soon as you go above the 1TB limit, you have 10 days to remove files and come back below the 1TB limit. If you fail to do so, you will not be able to write to the disk anymore until you comply. This directory is regularly purged of old files; each time the purge runs, all files older than 6 months are irremediably removed.
Do not use that partition as a storage partition. It is meant for transfers and can only hold temporary data.
This partition can be used to transfer large files from the scratch space (see globalscratch) of one cluster to the scratch space of another. It will be much faster than using
rsync directly. As this filesystem is optimized for large files, if you have a lot of small files, you must
tar them, otherwise you will experience very poor performance. For instance, assuming you have a directory
transfer_me on the scratch of Lemaitre2 and want to move it to the scratch of NIC4:
cd $GLOBALSCRATCH
tar cvzf $CECITRSF/transfer_me.tgz ./transfer_me
On NIC4, a bit later:
cd $GLOBALSCRATCH
tar xvzf $CECITRSF/transfer_me.tgz
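Since du -sh is unreliable on this filesystem, a checksum recorded on the source side and verified on the destination is a safer way to confirm the archive arrived intact before extracting it. A self-contained sketch (the mkdir and sample file only make the example runnable; on the clusters your transfer_me directory already exists):

```shell
# Source cluster: archive the directory and record its checksum next
# to it in the shared transfer area.
TRSF=${CECITRSF:-/tmp}                     # fallback for illustration
mkdir -p transfer_me                       # demo data, illustration only
date > transfer_me/sample.txt
tar czf "$TRSF/transfer_me.tgz" ./transfer_me
( cd "$TRSF" && md5sum transfer_me.tgz > transfer_me.tgz.md5 )

# Destination cluster, once the data has propagated: verify, then extract.
( cd "$TRSF" && md5sum -c transfer_me.tgz.md5 )
tar xzf "$TRSF/transfer_me.tgz" -C "${GLOBALSCRATCH:-.}"
```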
Everyone can write into the
$CECITRSF directory, but the sticky bit is set so that you can only delete your own files.
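You can check the sticky bit yourself: it shows up as a trailing t in the permission string, exactly as on /tmp (which is also used here as an illustrative fallback path):

```shell
# A trailing 't' in the mode string means the sticky bit is set:
# anyone can create files, but only the owner (or root) can delete them.
TRSF=${CECITRSF:-/tmp}          # /tmp has the same bit, for illustration
ls -ld "$TRSF" | cut -c1-10     # e.g. drwxrwxrwt
```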
Even if this is a two-step operation, it will be much faster than using
scp for large files, as it benefits from the fast dedicated network. But you still have the option of using
scp, of course.
An environment variable named
$CECITRSF points to that location.
/CECI/proj partition will be dedicated to projects, i.e. groups of researchers working towards a common goal and sharing files. Projects will be created upon motivated request, with a specific quota and a specific duration.
Currently, if you want to enable a project for your group, please contact email@example.com to make the request.
/CECI/soft partition will hold all the software that will be available on all clusters. The packages will be compiled for each processor type present in the CÉCI clusters. On each node, the version matching the hardware of the node will be selected by the module system.
Behind the scenes¶
The CÉCI shared storage is based on two main storage systems hosted in Liège and Louvain-la-Neuve using the GPFS filesystem developed by IBM. Those two storage systems are synchronously mirrored, meaning that any file written to one of them is automatically written to the other one.
In addition, to keep the data on the common partitions synchronized among the different geographical locations of the CÉCI clusters, the GPFS Active File Management (AFM) feature is used. In this setup, a global namespace is defined to share a single filesystem among different gateway nodes, which are physically deployed at each of the CÉCI universities.
Five smaller storage systems serve as buffers/caches; they and the two main systems are interconnected through a dedicated 10 Gbps network. The caches are located on each site and are tightly connected to the cluster compute nodes.