Deploy OpenClaw on Kubernetes
An enterprise-grade deployment with high availability, auto-scaling, and declarative infrastructure. This guide walks through the full Kubernetes resource stack: namespace, secrets, persistent storage, deployment, service, ingress, and horizontal pod autoscaling.
Estimated time: 2 hours | Cost: $50-200/month | Difficulty: Advanced
Prerequisites
Before you begin, make sure you have:
- A Kubernetes cluster -- any of the following:
- Managed: AWS EKS, Google GKE, or Azure AKS
- Self-hosted: k3s, kubeadm, or Rancher
- Local development: minikube or kind (for testing only)
- kubectl installed and configured to communicate with your cluster (kubectl cluster-info should succeed)
- Helm v3 installed (optional, for the Helm chart section)
- An API key for your LLM provider (Anthropic, OpenAI, etc.)
- An Ingress controller installed in the cluster (e.g., ingress-nginx) if you want external HTTP/HTTPS access
Step 1: Namespace and Secrets
Create a dedicated namespace to isolate OpenClaw resources:
kubectl create namespace openclaw
Store your API keys as a Kubernetes Secret. Never put API keys directly in Deployment manifests or ConfigMaps:
kubectl create secret generic openclaw-secrets \
--namespace openclaw \
--from-literal=ANTHROPIC_API_KEY='your-anthropic-api-key-here' \
--from-literal=OPENAI_API_KEY='your-openai-api-key-here'
Verify the secret was created:
kubectl get secrets -n openclaw
Tip: For production clusters, consider using an external secrets manager like AWS Secrets Manager, HashiCorp Vault, or the External Secrets Operator instead of plain Kubernetes Secrets. Kubernetes Secrets are base64-encoded (not encrypted) by default.
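For example, with the External Secrets Operator installed, an ExternalSecret resource keeps the openclaw-secrets Secret in sync with an external store. This is a minimal sketch; the ClusterSecretStore name and remote key path are placeholders for your own setup:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: openclaw-secrets
  namespace: openclaw
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # Placeholder ClusterSecretStore name -- adjust for your setup
    kind: ClusterSecretStore
  target:
    name: openclaw-secrets         # The Kubernetes Secret the operator creates and maintains
    creationPolicy: Owner
  data:
    - secretKey: ANTHROPIC_API_KEY
      remoteRef:
        key: openclaw/anthropic-api-key   # Placeholder path in your external secrets manager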
Step 2: Deployment Manifest
Create a file named openclaw-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: openclaw
namespace: openclaw
labels:
app: openclaw
spec:
replicas: 2
selector:
matchLabels:
app: openclaw
template:
metadata:
labels:
app: openclaw
spec:
containers:
- name: openclaw
image: openclaw/openclaw:latest # Pin to a specific tag in production
ports:
- containerPort: 3111
name: http
env:
- name: OPENCLAW_PORT
value: "3111"
- name: OPENCLAW_HOST
value: "0.0.0.0"
- name: OPENCLAW_LOG_LEVEL
value: "info"
envFrom:
- secretRef:
name: openclaw-secrets
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 3
volumeMounts:
- name: openclaw-data
mountPath: /home/openclaw/.openclaw
volumes:
- name: openclaw-data
persistentVolumeClaim:
claimName: openclaw-pvc
Key decisions in this manifest:
| Field | Value | Rationale |
|---|---|---|
| replicas | 2 | Baseline HA -- survives a single pod failure |
| resources.requests.memory | 256Mi | Minimum memory the scheduler reserves per pod |
| resources.limits.memory | 512Mi | Hard ceiling to prevent a runaway process from starving other workloads |
| resources.requests.cpu | 250m | One quarter of a CPU core guaranteed |
| resources.limits.cpu | 500m | Burst up to half a core |
| livenessProbe | /health, 20s interval | Restarts the pod if it becomes unresponsive |
| readinessProbe | /health, 10s interval | Removes the pod from Service endpoints until it is ready to accept traffic |
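Once the pods are running, compare these numbers against actual consumption and adjust the requests and limits if needed (requires the Metrics Server, covered in Step 6):
# Show current CPU and memory usage per pod
kubectl top pods -n openclaw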
Step 3: Persistent Storage
OpenClaw stores configuration, skill data, and local state on disk. A PersistentVolumeClaim ensures this data survives pod restarts.
Create a file named openclaw-pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: openclaw-pvc
namespace: openclaw
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: gp3 # Adjust for your cluster (gp2/gp3 on EKS, standard on GKE, managed-csi on AKS)
Note: ReadWriteOnce means the volume can be mounted by pods on a single node. If you need multi-node access (e.g., replicas on different nodes reading the same data), use ReadWriteMany with a storage class that supports it (EFS on AWS, Filestore on GCP), or redesign the application to use a shared database instead of local files.
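For example, on EKS with the EFS CSI driver a ReadWriteMany claim might look like the following sketch (the efs-sc StorageClass name is an assumption; it must already exist in your cluster):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-pvc
  namespace: openclaw
spec:
  accessModes:
    - ReadWriteMany            # Allows pods on different nodes to mount the same volume
  resources:
    requests:
      storage: 5Gi
  storageClassName: efs-sc     # Assumed EFS-backed StorageClass provided by the EFS CSI driver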
Step 4: Service and Ingress
ClusterIP Service
Expose the OpenClaw pods to other resources inside the cluster.
Create a file named openclaw-service.yaml:
apiVersion: v1
kind: Service
metadata:
name: openclaw
namespace: openclaw
labels:
app: openclaw
spec:
type: ClusterIP
selector:
app: openclaw
ports:
- port: 80
targetPort: http
protocol: TCP
name: http
Ingress with TLS
Route external traffic to the Service. This example assumes you are using the ingress-nginx controller and cert-manager for automatic TLS certificates.
Create a file named openclaw-ingress.yaml:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: openclaw
namespace: openclaw
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/proxy-read-timeout: "86400"
nginx.ingress.kubernetes.io/proxy-send-timeout: "86400"
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
# WebSocket support
nginx.ingress.kubernetes.io/configuration-snippet: |
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
spec:
ingressClassName: nginx
tls:
- hosts:
- openclaw.yourdomain.com # Replace with your domain
secretName: openclaw-tls
rules:
- host: openclaw.yourdomain.com # Replace with your domain
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: openclaw
port:
number: 80
No domain? Skip the Ingress and use kubectl port-forward svc/openclaw -n openclaw 3111:80 for local access during development.
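For example:
# Terminal 1: forward local port 3111 to the Service
kubectl port-forward svc/openclaw -n openclaw 3111:80
# Terminal 2: confirm the health endpoint responds
curl http://localhost:3111/health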
Step 5: Deploy
Apply all the manifests in order:
# Apply the PVC first so the volume is ready before the Deployment references it
kubectl apply -f openclaw-pvc.yaml
# Apply the Deployment, Service, and Ingress
kubectl apply -f openclaw-deployment.yaml
kubectl apply -f openclaw-service.yaml
kubectl apply -f openclaw-ingress.yaml # Skip if you have no Ingress controller
Verify everything is running:
# Check pod status
kubectl get pods -n openclaw
# Watch pods come up in real time
kubectl get pods -n openclaw -w
# Check the Service endpoints
kubectl get endpoints openclaw -n openclaw
# View logs from a specific pod
kubectl logs -n openclaw -l app=openclaw --tail=50
# Describe a pod if it is stuck in Pending or CrashLoopBackOff
kubectl describe pod -n openclaw -l app=openclaw
A healthy deployment looks like this:
NAME READY STATUS RESTARTS AGE
openclaw-6d4f8b7c9f-abc12 1/1 Running 0 2m
openclaw-6d4f8b7c9f-def34 1/1 Running 0 2m
Step 6: Scaling and High Availability
Horizontal Pod Autoscaler
Automatically scale the number of pods based on CPU utilization. Requires the Metrics Server to be installed in your cluster (most managed clusters include it by default).
Create a file named openclaw-hpa.yaml:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: openclaw
namespace: openclaw
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: openclaw
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
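Apply it and watch the current utilization and replica count as load changes:
kubectl apply -f openclaw-hpa.yaml
# TARGETS shows current vs. target CPU utilization; REPLICAS shows the scaling decision
kubectl get hpa openclaw -n openclaw --watch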
Pod Disruption Budget
Ensure at least one pod is always available during voluntary disruptions (node drains, cluster upgrades):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: openclaw
namespace: openclaw
spec:
minAvailable: 1
selector:
matchLabels:
app: openclaw
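Save the manifest (for example as openclaw-pdb.yaml -- the filename is your choice), apply it, and check the budget:
kubectl apply -f openclaw-pdb.yaml
# ALLOWED DISRUPTIONS shows how many pods may be evicted voluntarily right now
kubectl get pdb openclaw -n openclaw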
Anti-affinity rules
Spread replicas across different nodes so a single node failure does not take down all pods. Add this to the Deployment's spec.template.spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- openclaw
topologyKey: kubernetes.io/hostname
We use preferredDuringScheduling (soft rule) instead of requiredDuringScheduling so the scheduler can still place pods if you have fewer nodes than replicas.
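Topology spread constraints are a newer way to express the same intent; a sketch you could add to spec.template.spec alongside (or instead of) the anti-affinity rule:
topologySpreadConstraints:
  - maxSkew: 1                          # At most one more pod on any node than on the least-loaded node
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # Soft constraint, equivalent in spirit to preferredDuringScheduling
    labelSelector:
      matchLabels:
        app: openclaw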
Step 7: Helm Chart (Optional)
For teams that deploy OpenClaw across multiple environments (dev, staging, production), a Helm chart parameterizes the manifests above.
A minimal values.yaml:
# values.yaml
replicaCount: 2
image:
repository: openclaw/openclaw
tag: "1.0.0" # Pin a specific version
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 80
ingress:
enabled: false
hostname: openclaw.yourdomain.com
tls: true
clusterIssuer: letsencrypt-prod
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
persistence:
enabled: true
size: 5Gi
storageClass: gp3
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilization: 70
secrets:
anthropicApiKey: "" # Set via --set or a secrets file, never commit to Git
openaiApiKey: ""
Install with environment-specific overrides:
helm install openclaw ./charts/openclaw \
--namespace openclaw \
--create-namespace \
--values values.yaml \
--set secrets.anthropicApiKey='your-key-here'
Upgrade after a configuration or image change:
helm upgrade openclaw ./charts/openclaw \
--namespace openclaw \
--values values.yaml
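Inside the chart, the templates simply substitute these values into the manifests from the earlier steps. There is no official OpenClaw chart assumed here; a fragment of a hypothetical charts/openclaw/templates/deployment.yaml could look like this:
# charts/openclaw/templates/deployment.yaml (fragment, illustrative only)
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: openclaw
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}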
Monitoring
Prometheus ServiceMonitor
If you run the Prometheus Operator (kube-prometheus-stack), create a ServiceMonitor to automatically scrape OpenClaw metrics:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: openclaw
namespace: openclaw
labels:
release: kube-prometheus-stack # Must match your Prometheus Operator's label selector
spec:
selector:
matchLabels:
app: openclaw
endpoints:
- port: http
path: /metrics
interval: 30s
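Before wiring up Prometheus, it is worth confirming that the metrics endpoint actually responds (this assumes OpenClaw serves Prometheus metrics at /metrics on its HTTP port):
# Terminal 1: forward a local port to the Deployment's pods
kubectl port-forward -n openclaw deploy/openclaw 3111:3111
# Terminal 2: the endpoint should return Prometheus text-format metrics
curl http://localhost:3111/metrics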
Grafana dashboard
Import or create a Grafana dashboard that visualizes:
- Request rate and latency (from OpenClaw's /metrics endpoint)
- Pod CPU and memory usage (from Kubernetes metrics; see the example queries below)
- Restart counts and probe failures
- HPA scaling events
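Example PromQL queries for the CPU/memory and restart panels, using standard cAdvisor and kube-state-metrics series (OpenClaw-specific request metrics depend on what its /metrics endpoint actually exports):
# Memory working set per pod
sum(container_memory_working_set_bytes{namespace="openclaw", container="openclaw"}) by (pod)
# Container restarts over the last hour
increase(kube_pod_container_status_restarts_total{namespace="openclaw"}[1h])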
Log aggregation with Loki
If you run Grafana Loki (part of the Grafana stack), logs from OpenClaw pods are automatically collected by Promtail or the Grafana Agent. Query them in Grafana with:
{namespace="openclaw", app="openclaw"}
Filter for errors:
{namespace="openclaw", app="openclaw"} |= "error"
For clusters without Loki, use kubectl logs:
# Tail logs from all OpenClaw pods simultaneously
kubectl logs -n openclaw -l app=openclaw -f --tail=100
# Logs from a specific pod
kubectl logs -n openclaw openclaw-6d4f8b7c9f-abc12 --tail=200
Cost Considerations
Kubernetes is powerful but not cheap. Here is a realistic cost breakdown for managed clusters:
| Component | EKS (AWS) | GKE (Google) | AKS (Azure) |
|---|---|---|---|
| Control plane | $73/month | $73/month (Standard) | Free (Basic) |
| Worker nodes (2x t3.medium / e2-medium / B2s) | ~$60/month | ~$50/month | ~$60/month |
| Load balancer | ~$18/month | ~$18/month | ~$18/month |
| Persistent storage (5 GB) | ~$1/month | ~$1/month | ~$1/month |
| Total estimate | ~$152/month | ~$142/month | ~$79/month |
Is Kubernetes overkill for you? If you are a single user or a small team running one OpenClaw instance, the answer is probably yes. The Ubuntu Server guide above gives you the same result at a fraction of the cost. Kubernetes makes sense when you need multi-tenant isolation, auto-scaling across many agents, zero-downtime deployments, or integration with an existing Kubernetes-based platform.
For lower-cost Kubernetes, consider:
- k3s on a single VPS: A lightweight Kubernetes distribution that runs on a $6-12/month VPS. You lose HA but gain the Kubernetes API and ecosystem.
- GKE Autopilot: Pay only for pod resources, no node management. Can be cheaper for bursty workloads.
- AKS with the free control plane: Azure does not charge for the Kubernetes control plane on the Basic tier, saving ~$73/month versus EKS or GKE Standard.
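If you go the k3s route, installation is a single command run on the VPS; it sets up k3s as a systemd service and writes a kubeconfig to /etc/rancher/k3s/k3s.yaml:
# Install single-node k3s (review the script and the k3s docs before using in production)
curl -sfL https://get.k3s.io | sh -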
Troubleshooting
Pods stuck in Pending
kubectl describe pod -n openclaw -l app=openclaw
Common causes:
- Insufficient resources: The cluster does not have a node with enough free CPU or memory. Scale up your node pool or reduce the resource requests.
- PVC not bound: The PersistentVolumeClaim cannot find a matching PersistentVolume. Check kubectl get pvc -n openclaw and verify the storageClassName matches an available StorageClass (kubectl get sc).
Pods in CrashLoopBackOff
kubectl logs -n openclaw -l app=openclaw --previous
Common causes:
- Missing secrets: The openclaw-secrets Secret does not exist or is missing expected keys. Verify with kubectl get secret openclaw-secrets -n openclaw -o yaml.
- Invalid API key: OpenClaw starts but immediately fails authentication with the LLM provider.
- Health check failure: If the /health endpoint is not implemented or returns an error, the liveness probe kills the pod repeatedly. Temporarily remove the probes to debug.
Ingress returns 404 or 503
# Verify the Ingress resource exists and has an address
kubectl get ingress -n openclaw
# Check that the Service has endpoints (backing pods)
kubectl get endpoints openclaw -n openclaw
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50
Common causes:
- The host in the Ingress does not match the domain you are requesting
- The Service selector does not match the pod labels
- The Ingress controller is not installed or is in a different namespace
HPA not scaling
kubectl get hpa -n openclaw
If the TARGETS column shows <unknown>/70%, the Metrics Server is not installed or not reporting metrics. Install it:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
On some clusters (especially local ones like minikube), you may need to add --kubelet-insecure-tls to the Metrics Server deployment args.
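One way to add the flag is a JSON patch against the Metrics Server Deployment (verify that the container args path matches your manifest before applying):
kubectl patch deployment metrics-server -n kube-system --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'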