Glossary

HPC cluster

A high-performance computing (HPC) cluster is a collection of servers, called "nodes" in this context, interconnected by a fast network. A typical cluster consists of one or two management nodes, one or two login nodes ("frontends"), several storage nodes, and a large collection of compute nodes. Users of an HPC cluster must submit jobs to the managing software, called the resource manager or job scheduler, which decides when and where their computations run.

Compute node

A server in a cluster dedicated to performing the computations. Compute nodes host compute units (processors), volatile memory (RAM), and local persistent storage (disks).

Processor

A compute node can host one or multiple processors, fitted in the sockets of the motherboard. These are the chips that perform the actual computation. They are made of multiple cores and caches.

Core

One of the subsystems of a processor that can independently run instructions that are part of a software process. It comprises arithmetic and logic units (ALUs), floating-point units (FPUs), and some cache memory, and is often split into two hardware threads.

Hardware thread

Independent instruction processing channel inside a core. All threads in a core share some of the computation units of the core. See SMT and Hyper-Threading for more information.

Cache

Small and very fast working memory sitting on the processor, used to hold data transferred to or from the main volatile memory (RAM).

RAM

Acronym for "Random Access Memory", this is the main working memory of a node, holding the data produced or consumed by the computation process during its lifetime.

Disk

Persistent storage attached to a node. Some disks hold the operating system (OS), others hold the user data. Multiple technologies exist (HDD, SSD, NVMe), with different capacity/performance/cost tradeoffs.

Operating system

The piece of software that runs the base services on a node and interfaces the hardware resources with the software processes.

Process

Running instance of a program started by the user, the operating system, or the job scheduler. A process lives in RAM and can be made of multiple threads, each of which is ideally executed by a dedicated core.

Software thread

Execution context (flow) of a sequence of instructions that can run independently from the rest of the program (the other threads) while sharing the same address space. Threads in a program are created through mechanisms such as pthreads or OpenMP.

Program

Sequence of instructions in either human-readable format (source code) or machine-readable format (compiled binary). Instructions can refer to mathematical operations, memory transfers, disk accesses, network connections, etc.

Compiler

Program that produces a binary executable from human-readable source code written in a programming language, by contrast with an interpreter.

Interpreter

Program that executes human-readable source code written in a programming language directly by calling the corresponding functionalities in its own code, by contrast with a compiler.

Job

Sequence of instructions treated as a single and distinct unit. More specifically, a job is a sequence of steps, each step consisting of multiple parallel tasks. A job is described by a job submission script and submitted to the job scheduler.

Step

An invocation of a program along with its arguments through the srun Slurm command. The srun command will spawn as many tasks as requested, monitor them, report resource usage, and forward incoming UNIX signals. Each step of a job has a distinct entry in the accounting for that job.
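
For illustration, a minimal sketch of a job script that launches two steps (the program names are hypothetical):

```bash
#!/bin/bash
# Each srun invocation below becomes a distinct step in the job's accounting.
srun ./preprocess   # step 0: hypothetical data preparation program
srun ./simulate     # step 1: hypothetical main computation
```

Once the job has run, sacct -j <jobid> shows one accounting entry per step.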

Task

A running instance of a program started by srun (or mpirun). Multiple tasks of the same step run in parallel, possibly on distinct nodes, and each task can itself use multiple CPUs in parallel.

CPU

Central Processing Unit. In general, CPU is a synonym for processor, but in this context, it must be understood as a single allocatable unit for computation. Depending on the node configuration and on the parametrisation of the job scheduler, it will most often correspond to a core or a hardware thread.

Job submission script

Shell script (text file containing shell commands) that describes a job (what commands to run, which program to start, etc.) along with its resource requirements and additional parameters (such as a job name, an email address for notifications, etc.).
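
A minimal sketch of such a script for Slurm (the job name, email address, and program name are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=my_job             # job name, for easier identification
#SBATCH --mail-user=alice@example.org # address for email notifications
#SBATCH --mail-type=END,FAIL          # when to send notifications
#SBATCH --time=01:00:00               # wall time limit: one hour

srun ./my_program                     # the actual computation
```

Such a script would typically be submitted with: sbatch script.sh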

Resource requirements

A list of resources that are needed by a job, such as a number of CPUs, an amount of RAM, possibly GPUs, software licences, etc., and a maximum duration for the usage of those resources (wall time limit).
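
In a Slurm submission script, such requirements are expressed with #SBATCH directives; a sketch with arbitrary values:

```bash
#SBATCH --cpus-per-task=8   # number of CPUs
#SBATCH --mem=16G           # amount of RAM
#SBATCH --gres=gpu:1        # one GPU, if the cluster offers them
#SBATCH --time=2-00:00:00   # wall time limit: two days
```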

Serial job

Job consisting of a single step with a single task using one CPU.

Parallel job

Job whose steps consist of either multiple tasks (distributed-memory job) or a single task using multiple CPUs (shared-memory job). How well a job performs as more and more CPUs are made available to it is called scalability.
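
The two flavours translate into different Slurm requests; a sketch with arbitrary counts:

```bash
# Distributed-memory job: 16 tasks, possibly spread across several nodes
#SBATCH --ntasks=16

# Shared-memory job: a single task using 16 CPUs on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
```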

Wall time

The time during which a program runs, from start to finish, "as measured by the clock on the wall". Jobs on clusters are subject to a maximum wall time to allow fair sharing of the cluster.

Job scheduler

Piece of software that accepts, schedules, and dispatches jobs to resources based on their resource requirements and priority.

Scheduling

Action of deciding which job can use which resources and when, based on resource availability, job priorities, and backfill opportunities. Jobs that are currently allocated resources are said to be running; the others are pending.

Job priority

Number associated with a job that is used to decide the order in which jobs are considered for resource allocation. The priority can be computed based on multiple criteria such as fairshare, job size, queue time, quality of service, etc.

Backfill

Scheduling policy by which a job with a lower priority can start before a job with a higher priority, provided it does not delay that higher-priority job's start time. This can be either because it requires a completely distinct set of resources, or because it will free its resources before the predicted start time of the higher-priority job.

Fairshare

Fairshare is a measure of how far away from a fair usage of the resources a user or an account is, based on their cluster share. Users who have used the cluster a lot in the recent past (i.e. have consumed many CPU.hours) have a lower fairshare than users who have not. Fairshare is updated in real time as running jobs consume resources.

CPU.hour

Unit of computing resource usage corresponding to using a full CPU for one hour, or, equivalently, two CPUs for half an hour, etc. The concept is similar for node.hour, node.day, GPU.hour, etc.

Account

A Slurm account is an administrative concept that allows tracking and organising resource usage across levels of the user organisation (departments, units, etc.).

Share

Portion of the total resources a cluster can offer that is "promised" to a user or an account, often based on administrative considerations.

Quality of service

Functionality by which a user can request particular privileges for a job, such as a priority boost.
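
With Slurm, a quality of service is requested in the submission script with the --qos option (the QOS name here is hypothetical):

```bash
#SBATCH --qos=high-priority
```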

Shared-memory job

Job whose parallel threads or processes can all address the same memory, or at least a common portion of memory. Such jobs can only run on a single node.

Distributed-memory job

Job whose parallel tasks each have their own memory space. Such jobs can spread across multiple nodes provided they rely on a mechanism for data communication between tasks, which can be through the network or via the disks.

Shared-memory programming

Parallel programming paradigm where all threads or processes are able to read and/or write in the same memory space, for instance with threads spawned from the same process, or processes using a shared memory segment. Shared-memory programming is typically done in a shared-memory job, but can also happen in a distributed-memory job provided a PGAS library such as Unified Parallel C, Coarray Fortran, or OpenSHMEM is used.

Distributed-memory programming

Parallel programming paradigm where processes exchange messages rather than sharing memory. If the exchange mechanism is able to travel through networks, as is the case with MPI, the processes can live on distinct nodes; if it is not, as is the case with named pipes or UNIX sockets, all processes must reside on the same node.

OpenMP

Standard for shared-memory programming that abstracts away explicit threads and offers parallel programming constructs such as a parallel for loop. Many optimised libraries are built using OpenMP.
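
As a sketch, compiling and running an OpenMP program might look as follows (the program name is hypothetical; -fopenmp is the GCC flag):

```bash
gcc -fopenmp -O2 -o my_program my_program.c  # enable OpenMP at compile time
export OMP_NUM_THREADS=8                     # request 8 software threads
./my_program
```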

MPI

Message Passing Interface. De-facto standard for distributed-memory programming. Multiple implementations exist for various languages; some are open-source, others are sold by hardware manufacturers or compiler vendors.
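
A sketch of compiling and launching an MPI program on a cluster (the program name is hypothetical):

```bash
mpicc -O2 -o my_mpi_program my_mpi_program.c  # compile with the MPI wrapper
srun --ntasks=16 ./my_mpi_program             # run 16 tasks, possibly on several nodes
```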

GPU

Graphics Processing Unit. Hardware component initially designed to render images for the computer screen, which later evolved to off-load heavy computations from the processor. The compute power of a GPU is much larger than that of a processor, but the way it is programmed is very different. Specific programming frameworks such as CUDA, ROCm, OpenACC, or OpenCL must be used.

Optimised library

Software library that can be linked into a program and offers access to optimised functionalities for linear algebra, signal processing, statistics, etc. Open-source examples are OpenBLAS, BLIS, or ATLAS. Commercial libraries include for instance the MKL. Like compilers and other software, they are most often organised in environment modules, or system modules.

Environment module

Mechanism that allows modifying the environment (environment variables, aliases, etc.) in which programs are started by the shell. By modifying variables such as $PATH or $LD_LIBRARY_PATH, users can choose which of the installed software they want to use. Modules are often organised into releases.
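
Typical module commands (the module name and version are examples; actual names depend on the cluster):

```bash
module avail            # list the software available through modules
module load GCC/12.3.0  # make this compiler version the active one
module list             # show the currently loaded modules
module purge            # unload all modules
```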

Release

In the context of modules, a release is a set of modules related to software whose versions and toolchain have been carefully chosen so as to be compatible.

Toolchain

Set of compiler, optimised library, and MPI modules that are used together to compile software.

Shell

Command-line interface to the operating system that interprets user commands such as starting a program, copying a file, etc. Examples include Bash and Zsh.

CLI

Command-line interface. User interface that involves the keyboard to enter commands in response to a prompt in a terminal window, by contrast with a TUI (text user interface), which lets users interact with menu items or widgets in a terminal through the keyboard, or a GUI (graphical user interface), which allows users to click on buttons and menu items with the mouse.

Terminal

Piece of software that manages the interactions between the shell and the console of the user. The shell can be local to the machine, or run on a remote server connected to through SSH.

Console

Computer equipment consisting of a keyboard and a screen.

SSH

Secure Shell. Protocol by which users can connect to remote servers (for instance a login node of a cluster) using their login and a combination of authentication factors (password, key, hardware token, etc.).
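
A typical connection to a cluster frontend (the login and hostname are placeholders):

```bash
ssh alice@login.cluster.example.org                        # password or default key
ssh -i ~/.ssh/cluster_key alice@login.cluster.example.org  # with a specific key file
```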

File

A piece of data (text, number sequence, image, etc.) stored in a filesystem and primarily identified by its name. Along with the name, other metadata is stored by the filesystem, such as ownership, permissions, or creation date. Files are organised in a hierarchy of directories.

Directory

Cataloguing structure that contains files or other directories. Also called a folder in some contexts. On an HPC cluster, each user has a home directory where they can write files and create sub-directories. It is by default their current working directory when they connect to the cluster.

Filesystem

Data structure and associated mechanisms to store and retrieve files. A filesystem local to a (compute) node can only be accessed from that node. By contrast, a network filesystem (e.g. NFS) is hosted on a (storage) node and exported to all other nodes, and a parallel filesystem (e.g. BeeGFS) is hosted on multiple (storage) nodes and exported to all other nodes.

File permission

Metadata associated with a file that defines what type of access (read, write, execute) the owner, the group, and the others have on the file.
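
Permissions can be inspected with ls -l and changed with chmod; a sketch with hypothetical file, user, and group names:

```bash
$ ls -l results.txt
-rw-r----- 1 alice research 1024 Jan 10 12:00 results.txt
# owner (alice) can read and write, group (research) can read, others have no access
$ chmod g+w results.txt   # additionally grant write access to the group
```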

Profiler

Software that analyses other software to find out which functionality or operation takes the most time.

Debugger

Software that helps analyse other software to find bugs and problems.

Environment variable

A shell variable holding a value, which can be set by the user, by the environment module system, or by Slurm, and which can alter the way software behaves.
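
For instance, in a Bash shell (the values are illustrative):

```bash
export OMP_NUM_THREADS=4  # set by the user to control OpenMP programs
echo $PATH                # inspect where the shell searches for programs
echo $SLURM_JOB_ID        # set by Slurm inside a job
```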

Scalability

How well a job or program can use increasing amounts of computing resources. A program that scales strongly takes less and less time as more computing power is used. A program that scales weakly takes roughly the same time when the amount of data and the computing power are increased in the same proportion.
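
One crude way to probe strong scaling is to submit the same job with increasing CPU counts and compare run times (the script name is hypothetical):

```bash
for n in 1 2 4 8 16; do
  sbatch --cpus-per-task=$n job.sh  # same job, more and more CPUs
done
```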