Using Ollama inside a job¶
Contributed by francesco.ricci@uclouvain.be
Here is a step-by-step procedure to set up and run an LLM on the CECI clusters and use/query it from your laptop.
- The workflow is similar to using JupyterLab
- Using Ollama here, but any other LLM server should work similarly
- This tutorial uses Lyra as the example; any other cluster works as long as it has GPUs
Setup procedure¶
- Before starting, follow the documentation to configure ssh_config to easily connect to the clusters.
- Connect to a cluster with GPUs available:

      ssh lyra

- Locate the ollama module and/or the ollama Singularity/Apptainer file:

      ml spider ollama
      ls /srv/apps/shared/containers/ | grep -i ollama

- Either prepare a submission script that runs the Ollama server. Here is an example for Lyra, "module" version:

      #!/bin/bash
      #
      #SBATCH --job-name=ollama
      #SBATCH --output=%j_%x.out
      #SBATCH --partition=batch   # Lyra (or --partition=gpu for Hercules / Dragon2)
      # --- resources ------------------------------------------------------
      #SBATCH --gpus=1            # one GPU per node on Lyra
      #SBATCH --time=02:00:00     # walltime D-HH:MM:SS
      #SBATCH --mem=48G

      # --- environment ----------------------------------------------------
      module purge
      module load releases/2024a ollama

      # --- run ------------------------------------------------------------
      export PORT=11434           # 11434 is the default Ollama port
                                  # please choose another port if another
                                  # Ollama job is already running on the
                                  # node.
      export OLLAMA_HOST="0.0.0.0:$PORT"
      echo "http://$(hostname -i):$PORT" > ollama.log
      ollama serve &>> ollama.log

- ... or prepare a submission script that runs the Ollama server from the container. Here is an example for Lyra, "Apptainer" version:

  Be mindful when setting the walltime. Avoid keeping this job running if you are not actually using the LLM. Set a small number of hours for testing, and/or kill the job when you stop using the LLM.

      #!/bin/bash
      #
      #SBATCH --job-name=ollama
      #SBATCH --output=%j_%x.out
      #SBATCH --partition=batch   # Lyra (or --partition=gpu for Hercules / Dragon2)
      # --- resources ------------------------------------------------------
      #SBATCH --gpus=1            # one GPU per node on Lyra
      #SBATCH --time=02:00:00     # walltime D-HH:MM:SS
      #SBATCH --mem=48G

      # --- environment ----------------------------------------------------
      module purge

      # --- run ------------------------------------------------------------
      apptainer exec --nv /srv/apps/shared/containers/Ollama.sif bash -c '
      export PORT=11434           # 11434 is the default Ollama port
                                  # please choose another port if another
                                  # Ollama job is already running on the
                                  # node.
      export OLLAMA_HOST="0.0.0.0:$PORT"
      echo $(hostname -I) > ollama.log
      ollama serve &>> ollama.log
      '

- Submit the script, named submit_ollama.sh in this example:

      sbatch submit_ollama.sh

  - Mind that you will be able to use the LLM only while this job is actually running.
- Look at the IP of the node that is running the Ollama server:

      head -n1 ollama.log

  This will output something like

      http://10.0.7.40:11434

- On your local machine/laptop, install the sshuttle Python package. Most Linux distributions have a package you can install easily, and Brew can be used on macOS. Otherwise you can try:

      pip install sshuttle

- Run sshuttle to open the SSH forward/tunnel:

      sshuttle -r lyra 10.0.7.1/24

  - Check in the table here which IPs to use depending on the cluster.
  - Mind that you'll need to enter your local password and be among the sudo users to run this command.
  - Also, you might be asked for your SSH key passphrase if your SSH agent is not running.

  Alternative: SSH local port forwarding (works on Windows, macOS and Linux):

  You can also forward a single port from the compute node to your laptop using SSH's local port forwarding. This often works on Windows, where sshuttle may not be available.

  Example (replace <IP> and <PORT> with the values from ollama.log, e.g. 10.0.7.40 and 11434):

      ssh -L 8080:<IP>:<PORT> lyra
      # example:
      ssh -L 8080:10.0.7.40:11434 lyra

  Then open the server in your browser at http://localhost:8080 on your laptop.

  You can leave this SSH session open while you use the service; log out of it (exit or Ctrl+D) to close the tunnel when done.
- Now everything is set up, so you should be able to see the Ollama server from your laptop. You can test by opening the IP from ollama.log in your browser:

      http://10.0.7.40:11434

  The page should say "Ollama is running".
- Mind: this will work correctly only when the job submitted above is actually running. Please check the status of your job with:

      squeue --me --start

  You can use Slurm's email functionality to be notified when your job starts.

- Now you can start the model you want and exchange messages with it. Here is an example using the LangChain Python package:

      from langchain_ollama import ChatOllama

      ollama = ChatOllama(
          model="gpt-oss:20b",
          base_url="http://10.0.7.40:11434/",
          temperature=0.1,
      )

- Note: you need to pull the model you want from the Ollama repository before running it, for example:

      ollama pull gpt-oss:20b

  - See Ollama documentation for more information.
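The browser check and the model pull can also be verified from a script before starting a longer session. The following is a minimal sketch using only the Python standard library. It assumes that the server answers "Ollama is running" on its root URL (as described in the steps above) and that it exposes the `/api/tags` endpoint listing the pulled models. The function name `ollama_status` is made up for this illustration.

```python
import json
import urllib.request


def ollama_status(base_url, timeout=5.0):
    """Return (running, models) for the Ollama server at base_url.

    running is True when the root page contains "Ollama is running";
    models is the list of model names reported by /api/tags.
    """
    base_url = base_url.rstrip("/")
    try:
        # The root URL is the same page you would open in the browser.
        with urllib.request.urlopen(base_url + "/", timeout=timeout) as resp:
            text = resp.read().decode("utf-8", "replace")
        running = "Ollama is running" in text
        # /api/tags lists the models that have already been pulled.
        with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Typical causes: the job has not started yet, the tunnel is
        # closed, or the IP/port does not match ollama.log.
        return False, []
    return running, [m["name"] for m in payload.get("models", [])]
```

For example, `ollama_status("http://10.0.7.40:11434")` should return `True` together with a list containing `gpt-oss:20b` once the job is running, the tunnel is open, and the model has been pulled. A result of `(False, [])` usually means one of those three conditions is not met yet.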
Troubleshooting¶
- To check if Ollama is finding the GPUs, look in ollama.log for something like the following:
name=CUDA0 description="NVIDIA RTX 6000 Ada Generation" libdirs=ollama,cuda_v12
driver=12.8 pci_id=01:00.0 type=discrete total="48.0 GiB" available="47.4 GiB"
- Be sure your job is running on the cluster:
squeue --me
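The GPU check can be automated with a short script that scans ollama.log for the detection line. This is a sketch that assumes the log format matches the example above (`name=CUDA<n> description="..."` followed by `total="..."`). The format may differ between Ollama versions, and `find_gpus` is a hypothetical helper name.

```python
import re

# Patterns based on the example log excerpt; adjust them if your
# Ollama version logs the GPU information differently.
GPU_RE = re.compile(r'name=(?P<name>CUDA\d+)\s+description="(?P<description>[^"]*)"')
TOTAL_RE = re.compile(r'total="(?P<total>[^"]*)"')


def find_gpus(log_text):
    """Return one dict per GPU detection entry found in log_text."""
    gpus = []
    for match in GPU_RE.finditer(log_text):
        # The memory size follows the description, possibly on the next line.
        total = TOTAL_RE.search(log_text, match.end())
        gpus.append(
            {
                "name": match["name"],
                "description": match["description"],
                "total": total["total"] if total else None,
            }
        )
    return gpus
```

Calling `find_gpus(open("ollama.log").read())` in the directory where the job was submitted returns an empty list when no such line is present. In that case, check that the job requested a GPU and, for the Apptainer version, that `--nv` was passed.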
Security¶
Warning
This setup offers no protection; anybody who knows which port is used can connect to the running instance of Ollama.
Ollama developers are working towards adding support for UNIX sockets, which would allow better control, but at the time of writing this document it is not finalized. In the meantime, you can set #SBATCH --exclusive and define export OLLAMA_HOST="localhost:$PORT", but you then need to set up an SSH tunnel directly to the compute node; using sshuttle will not be sufficient.
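As an illustration of the tunnel needed in the localhost-only configuration, the sketch below builds an `ssh` command that jumps through the login node (`-J`) and forwards a local port to `localhost:<PORT>` on the compute node (`-L`). It assumes that your account is allowed to SSH to the compute node while your job runs there, which depends on the cluster configuration. The helper name `tunnel_command` and its defaults are hypothetical.

```python
def tunnel_command(node, port=11434, gateway="lyra", local_port=None):
    """Build an ssh command that tunnels to an Ollama server bound to
    localhost on the compute node `node`.

    Assumption: `node` (name or IP) accepts SSH connections from the
    login host `gateway` for the user running the job.
    """
    if local_port is None:
        local_port = port
    return [
        "ssh",
        "-N",            # do not run a remote command, only forward
        "-J", gateway,   # jump through the login node
        "-L", f"{local_port}:localhost:{port}",  # "localhost" is resolved on the node
        node,
    ]
```

For example, `" ".join(tunnel_command("10.0.7.40"))` gives `ssh -N -J lyra -L 11434:localhost:11434 10.0.7.40`. With that tunnel open, the server would be reachable on your laptop at http://localhost:11434.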