Using Ollama inside a job¶

Contributed by francesco.ricci@uclouvain.be

Here is a step-by-step procedure to set up and run an LLM on the CECI clusters and use/query it from your laptop.

The workflow is similaro to using JupyterLab
Using Ollama here, but any other LLM server should work similarly
This tutorial is about Lyra, just make sure to use a cluster with GPUs

Setup procedure¶

Before starting, follow the documentation to configure ssh_config to easily connect to the clusters.
Connect to a cluster with GPUs available:
```
ssh lyra
```

Locate the ollama module and/or the ollama Singularity/Apptained file

ml spider ollama
ls /srv/apps/shared/containers/ | grep -i ollama

Either prepare a submission script that runs the Ollama server. Here is an example for Lyra, "module" version ...

Module version:

#!/bin/bash
#
#SBATCH --job-name=ollama
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch          # Lyra (or --partition=gpu for Hercules / Dragon2)

# --- resources ------------------------------------------------------
#SBATCH --gpus=1                   # one GPU per node on Lyra
#SBATCH --time=02:00:00            # walltime D‑HH:MM:SS
#SBATCH --mem=48G

# --- environment ----------------------------------------------------
module purge
module load releases/2024a ollama

# --- run ------------------------------------------------------------

export PORT=11434 # 11434 is the default Ollama port
                  # please chose another port if another
                  # Ollama job is already running on the
                  # node.
export OLLAMA_HOST="0.0.0.0:$PORT" 

echo "http://$(hostname -i):$PORT" > ollama.log

ollama serve &>> ollama.log

..or prepare a submission script that runs the Ollama server. Here is an example for Lyra, "Apptainer" version:

#!/bin/bash
#
#SBATCH --job-name=ollama
#SBATCH --output=%j_%x.out
#SBATCH --partition=batch          # Lyra (or --partition=gpu for Hercules / Dragon2)

# --- resources ------------------------------------------------------
#SBATCH --gpus=1                   # one GPU per node on Lyra
#SBATCH --time=02:00:00            # walltime D‑HH:MM:SS
#SBATCH --mem=48G

# --- environment ----------------------------------------------------
module purge

# --- run ------------------------------------------------------------

apptainer exec --nv /srv/apps/shared/containers/Ollama.sif bash -c '
export PORT=11434 # 11434 is the default Ollama port
                  # please chose another port if another
                  # Ollama job is already running on the
                  # node.
export OLLAMA_HOST="0.0.0.0:$PORT"
echo $(hostname -I) > ollama.log
ollama serve &>> ollama.log
'

Be mindful when setting the walltime. Avoid keeping this job running if you are not actually using the LLM. Set a small number of hours for testing, and/or kill the job when you stop using the LLM.

Submit the script, named submit_ollama.sh in this example:
```
sbatch submit_ollama.sh
```
- Mind that only when this job is actually running will you be able to use the LLM.
Look at the IP of the node that is running the Ollama server:
```
head -n1 ollama.log
```
This will output something like http://10.0.7.40:11434
On your local machine/laptop, install the sshuttle Python package. Most Linux distribution have a package you can install easily, and Brew can be used on MacOS. Otherwise you can try:
```
pip install sshuttle
```
Run sshuttle to open the SSH forward/tunnel:
```
sshuttle -r lyra 10.0.7.1/24
```
- Check in the table here which IPs to use depending on the cluster.
- Mind that you'll need to enter your local password and be among the sudo users to run this command.
- Also, the SSH key passphrase might be asked if your SSH agent is not running.
Alternative — SSH local port forwarding (works on Windows, macOS and Linux):

You can also forward a single port from the compute node to your laptop using SSH's local port forwarding. This often works on Windows where sshuttle may not be available.

Example (replace <IP> and <PORT> with the values from ollama.log, e.g. 10.0.7.40 and 11434):
```
ssh -L 8080:<IP>:<PORT> lyra
# example: ssh -L 8080:10.0.7.40:11434 lyra
```
Then open the server in your browser at http://localhost:8080 on your laptop.

You can leave this SSH session open while you use the service; use Ctrl+C to close the tunnel when done.
Now everything is set up, so you should be able to see the Ollama server from your laptop. You can test by opening the IP from ollama.log in your browser:

http://10.0.7.40:11434

The page should say "Ollama is running".

Mind: this will work correctly only when the job submitted above is actually running. Please check the status of your job with:
```
squeue --me --start
```
You can use Slurm's emails functionalitites to be notified when your job starts.

Now you can start the LLM model you want and be able to send and receive messages from it. Here is an example using the LangChain Python package:

from langchain_ollama import ChatOllama

ollama = ChatOllama(
    model="gpt-oss:20b",
    model_provider="ollama",
    base_url="http://10.0.7.40:11434/",
    temperature=0.1,
)

Note: you need to pull the LLM model you want from the Ollama repository before running it, for example:

ollama pull gpt-oss:20b

See Ollama documentation for more information.

Troubleshooting¶

To check if Ollama is finding the GPUs, look in ollama.log for something like the following:

name=CUDA0 description="NVIDIA RTX 6000 Ada Generation" libdirs=ollama,cuda_v12
driver=12.8 pci_id=01:00.0 type=discrete total="48.0 GiB" available="47.4 GiB"

Be sure your job is running on the cluster:

squeue --me

Security¶

Warning

This offers no protection; anybody who knows which port is used can connect to the running instance of Ollama. Ollama developers are working towards adding support for UNIX sockets which would allow better control but at the time of writing this document, it is not finalized yet. In the meantime, you can set #SBATCH --exclusive and define export OLLAMA_HOST="localhost:$PORT" but then you need to setup an SSH tunnel directly to the compute node, using sshuttle will not be sufficient.