Deploying OpenAI’s GPT-OSS model on Kubernetes with Ollama

A step-by-step guide to self-hosting LLMs in a homelab

Following up on the previous article about setting up a home lab with AI capabilities, this article covers the practical steps of running the gpt-oss large language model on a Kubernetes cluster with Ollama, all running within a Proxmox environment.

Setting up Ollama

Deploying Ollama in a Kubernetes environment is simplified by the otwld/ollama Helm chart. The chart exposes the relevant parameters, including NVIDIA GPU enablement, ingress configuration, and pulling models from the Ollama library.

Configure the values.yml

First, create a values.yml file to define the configuration. It sets up Ollama, enables GPU support, and specifies which model to pull.

ollama:
  gpu:
    enabled: true
    type: "nvidia"
    number: 1
  models:
    pull:
      - gpt-oss:20b

ingress:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
  enabled: true
  className: "nginx"
  hosts:
    - host: ollama.local.com
      paths:
        - path: /
          pathType: Prefix
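
Note that ollama.gpu.enabled only requests a GPU for the pod; the worker node must already expose it to Kubernetes via the NVIDIA device plugin (or GPU Operator). A quick sanity check, assuming worker4 is the GPU node as in the output shown later in this article:

# Confirm the GPU worker advertises the NVIDIA device resource
# (substitute your GPU node's name for worker4)
kubectl describe node worker4 | grep -i "nvidia.com/gpu"

If nothing shows up under Capacity/Allocatable, fix the device plugin before installing the chart.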

Install the Helm Chart
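
If the otwld chart repository has not been added to Helm yet, add it and refresh the index first. The repository URL below is the GitHub Pages address from the otwld/ollama-helm project [1], and the otwld alias is chosen to match the chart reference used in the install command; verify both against the project page.

helm repo add otwld https://otwld.github.io/ollama-helm/
helm repo update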

With the values.yml file created, run the following helm command to install Ollama.

helm install ollama otwld/ollama \
    --namespace ollama \
    --create-namespace \
    -f values.yml
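
The first start can take a while because the gpt-oss:20b model is a multi-gigabyte download that begins once the server is up. If values.yml changes later, for example to pull an additional model, the same file can be reapplied with an upgrade:

helm upgrade ollama otwld/ollama --namespace ollama -f values.yml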

Verifying the Installation

After the installation completes, verify that all components are running by listing the resources in the ollama namespace.

kubectl get all -n ollama

You should see output similar to this:

NAME                        READY   STATUS    RESTARTS   AGE
pod/ollama-8699ddb5-c6n9q   1/1     Running   0          31m

NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/ollama   ClusterIP   10.233.62.132   <none>        11434/TCP   31m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ollama   1/1     1            1           31m

NAME                              DESIRED   CURRENT   READY   AGE
replicaset.apps/ollama-8699ddb5   1         1         1       31m
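
The chart also creates the Ingress resource defined in values.yml; it can be checked in the same namespace:

kubectl get ingress -n ollama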

To access the Ollama API using the hostname defined in the ingress, add the following entry to your local /etc/hosts file, replacing INGRESS_IP_ADDRESS with the IP address of your Kubernetes ingress controller.

INGRESS_IP_ADDRESS   ollama.local.com
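
The controller's address can usually be read from its Service. For a typical ingress-nginx installation something like the following works; the namespace and service name are the common defaults and may differ in your cluster:

kubectl get svc -n ingress-nginx ingress-nginx-controller

The EXTERNAL-IP column is the address to put in /etc/hosts.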

Testing the Deployed Model

Test the model by sending a request to the Ollama API endpoint using curl.

curl http://ollama.local.com/api/generate -d '{
    "model": "gpt-oss:20b",
    "prompt": "Why is the sky blue?",
    "stream": false
}'

If everything is configured correctly, the API returns a JSON response containing the model's answer.
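
If the request hangs or returns an error, the model may still be downloading; Ollama's /api/tags endpoint lists the models that are available locally:

curl http://ollama.local.com/api/tags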

Monitoring Resources

Checking the Pod Logs

The pod logs provide detailed information about the server's status, including GPU detection and model loading.
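
They can be followed with a standard kubectl logs invocation against the deployment created by the chart:

kubectl logs -n ollama deployment/ollama -f

The lines of interest for GPU detection and layer offloading look like this: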

time=2025-08-10T12:58:29.372Z level=INFO source=types.go:130 msg="inference compute" id=GPU-075d926f-8a44-1552-2cbc-276dc5bfd68d library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5070 Ti" total="15.5 GiB" available="15.3 GiB"
time=2025-08-10T13:02:21.739Z level=INFO source=ggml.go:378 msg="offloaded 22/25 layers to GPU"

Monitoring GPU Usage

While a query is being processed, SSH into the GPU-enabled worker node and monitor resource utilization in real time with the following command.

watch -n 1 nvidia-smi

Every 1.0s: nvidia-smi                                   worker4: Sun Aug 10 13:46:00 2025

Sun Aug 10 13:46:00 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   41C    P1             45W /  300W |   12678MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         2506171      C   /usr/bin/ollama                       12668MiB |
+-----------------------------------------------------------------------------------------+

The output above shows the Ollama process consuming most of the card's GPU memory.
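
For a lighter-weight view that is easier to log, nvidia-smi's query mode (see [3]) can print just the fields of interest at a fixed interval:

nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv -l 1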

* * * *

Originally published on Medium

References:

[1] https://github.com/otwld/ollama-helm

[2] https://github.com/openai/gpt-oss

[3] https://enterprise-support.nvidia.com/s/article/Useful-nvidia-smi-Queries-2