Building a Self-Healing Kubernetes Cluster with Custom Operators
Our Kubernetes cluster used to wake me up at 3 AM about twice a week. Pod crashed. Database connection pool exhausted. Memory leak in the API service. Every time, the fix was the same: restart the pod, clear some cache, adjust a resource limit.
After the fifth time fixing the exact same issue by hand, I built a custom Kubernetes operator to do it for me. That was eight months ago. I haven’t been paged for that issue since.
Self-healing infrastructure isn’t about AI or machine learning. It’s about codifying the boring, repetitive fixes you do manually and letting the cluster handle them automatically. Custom operators are how you do this in Kubernetes.
Why Built-in Kubernetes Self-Healing Isn’t Enough
Kubernetes already has self-healing features:
- The kubelet restarts crashed containers, and ReplicaSets replace pods that fail or disappear
- Liveness probes detect unhealthy containers
- Resource limits cap runaway processes
These handle the simple cases. But production systems fail in complex ways that basic probes can’t detect:
Connection pool exhaustion. Your app is running, health check returns 200, but it can’t handle new requests because all database connections are stuck. Kubernetes thinks everything is fine.
Memory leaks. Your pod slowly consumes more memory over days. It hasn’t hit the limit yet, so Kubernetes doesn’t restart it. But it’s getting slower and will eventually OOM.
Cascading failures. One service fails, causing increased load on another, which fails, causing a cascade. By the time Kubernetes notices, half your cluster is down.
Configuration drift. Someone manually scaled a deployment during an incident. The HPA is fighting with the manual replica count. Resources are wasted but nothing is “broken.”
External dependency failures. Your database is slow. Your app keeps timing out and restarting. Kubernetes keeps trying, making the problem worse instead of backing off.
These problems need custom logic. That’s what operators provide.
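Connection pool exhaustion only becomes fixable once something measures it. Here is a minimal sketch of how a Go service might compute pool saturation, assuming it uses database/sql. The helper names and the 0.95 threshold are illustrative, not taken from the API service described here.

```go
package main

import (
	"database/sql"
	"fmt"
)

// utilization returns the fraction of the pool that is in use.
// database/sql treats maxOpen <= 0 as "no limit", so saturation is
// undefined in that case and 0 is reported.
func utilization(inUse, maxOpen int) float64 {
	if maxOpen <= 0 {
		return 0
	}
	return float64(inUse) / float64(maxOpen)
}

// poolSaturated reports whether the pool described by stats is at or
// above the given threshold.
func poolSaturated(stats sql.DBStats, threshold float64) bool {
	return utilization(stats.InUse, stats.MaxOpenConnections) >= threshold
}

func main() {
	// In a real service these numbers would come from db.Stats().
	stats := sql.DBStats{MaxOpenConnections: 20, InUse: 19}
	fmt.Println(poolSaturated(stats, 0.95))
}
```

Exporting this ratio as a metric gives an operator a signal that a plain HTTP 200 health check never will.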
What a Kubernetes Operator Actually Is
An operator is just a control loop that watches Kubernetes resources and takes actions based on what it sees.
The pattern is simple:
- Watch for events (pod crashed, metric threshold exceeded, resource created)
- Evaluate if action is needed (is this the third crash in 5 minutes?)
- Take action (restart pod, scale up, send alert)
- Repeat
Kubernetes itself is built this way. The ReplicaSet controller watches pods and maintains the desired replica count. The HPA controller watches metrics and adjusts replicas. Your custom operator does the same thing for your specific failure modes.
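The watch, evaluate, act cycle can be sketched without any Kubernetes dependency. The following is a toy model, not real controller code: the cluster type is an in-memory stand-in, and reconcile mimics what a ReplicaSet-style controller does on each pass.

```go
package main

import "fmt"

// cluster is an in-memory stand-in for observed state.
type cluster struct{ pods int }

// reconcile compares the desired pod count with the observed one,
// corrects the difference, and describes the action it took.
func reconcile(c *cluster, desired int) string {
	diff := desired - c.pods
	switch {
	case diff > 0:
		c.pods += diff
		return fmt.Sprintf("created %d", diff)
	case diff < 0:
		c.pods += diff
		return fmt.Sprintf("deleted %d", -diff)
	default:
		return "no-op"
	}
}

func main() {
	c := &cluster{pods: 1}
	for i := 0; i < 3; i++ {
		fmt.Println(reconcile(c, 3))
		if i == 0 {
			c.pods-- // simulate a pod crashing between passes
		}
	}
}
```

The loop never needs to know why the state drifted. It only compares desired with observed state and closes the gap, which is what makes the pattern robust.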
Real Example: The Database Connection Pool Operator
Let me show you a real operator I built that solved an actual production problem.
The problem: Our API service would occasionally exhaust its database connection pool. Health checks passed (the pod was running), but it couldn’t serve requests. Manual fix: restart the pod.
The operator: Watch for this specific failure pattern and restart automatically.
Here’s the core logic using the Kubernetes Go client:
package main
import (
"context"
"fmt"
"time"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
type ConnectionPoolOperator struct {
clientset *kubernetes.Clientset
namespace string
}
func (o *ConnectionPoolOperator) Run(ctx context.Context) {
ticker := time.NewTicker(30 * time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
o.checkAndHeal()
}
}
}
func (o *ConnectionPoolOperator) checkAndHeal() {
// Get all pods with our app label
pods, err := o.clientset.CoreV1().Pods(o.namespace).List(
context.TODO(),
metav1.ListOptions{
LabelSelector: "app=api-service",
},
)
if err != nil {
fmt.Printf("Error listing pods: %v\n", err)
return
}
for _, pod := range pods.Items {
// Check if pod has connection pool exhaustion symptoms
if o.hasConnectionPoolIssue(pod) {
fmt.Printf("Detected connection pool issue in pod %s, restarting\n", pod.Name)
// Delete the pod (ReplicaSet will recreate it)
err := o.clientset.CoreV1().Pods(o.namespace).Delete(
context.TODO(),
pod.Name,
metav1.DeleteOptions{},
)
if err != nil {
fmt.Printf("Error deleting pod: %v\n", err)
continue // Don't record a restart event for a pod we failed to delete
}
// Record event for visibility
o.recordEvent(pod, "ConnectionPoolExhausted", "Automatically restarted pod due to connection pool exhaustion")
}
}
}
func (o *ConnectionPoolOperator) hasConnectionPoolIssue(pod corev1.Pod) bool {
// Check multiple signals:
// 1. Pod is running (not already crashed)
if pod.Status.Phase != corev1.PodRunning {
return false
}
// 2. Restart count is low (not in a crash loop).
// Note: RestartCount is cumulative for the container's lifetime, not a recent window.
for _, status := range pod.Status.ContainerStatuses {
if status.RestartCount > 3 {
return false // Already being restarted, don't interfere
}
}
// 3. Check custom metrics (this is where you'd integrate with Prometheus)
// For example: active_connections / max_connections > 0.95
metrics := o.getMetricsForPod(pod)
if metrics.ConnectionPoolUtilization > 0.95 {
return true
}
return false
}
func (o *ConnectionPoolOperator) getMetricsForPod(pod corev1.Pod) PodMetrics {
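// Query Prometheus for pod metrics.
// This is a placeholder: a real implementation would use a Prometheus client.
// As written it returns a fixed value above the threshold, so every running
// pod would be restarted. Replace it before deploying.
return PodMetrics{
ConnectionPoolUtilization: 0.97, // Hard-coded example value
}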
}
func (o *ConnectionPoolOperator) recordEvent(pod corev1.Pod, reason, message string) {
event := &corev1.Event{
ObjectMeta: metav1.ObjectMeta{
Name: fmt.Sprintf("%s.%d", pod.Name, time.Now().Unix()),
Namespace: o.namespace,
},
InvolvedObject: corev1.ObjectReference{
Kind: "Pod",
Name: pod.Name,
Namespace: pod.Namespace,
UID: pod.UID,
},
Reason: reason,
Message: message,
Type: "Normal",
FirstTimestamp: metav1.Time{Time: time.Now()},
LastTimestamp: metav1.Time{Time: time.Now()},
}
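// Log a failure to record the event instead of silently dropping it
if _, err := o.clientset.CoreV1().Events(o.namespace).Create(
context.TODO(),
event,
metav1.CreateOptions{},
); err != nil {
fmt.Printf("Error recording event: %v\n", err)
}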
}
type PodMetrics struct {
ConnectionPoolUtilization float64
}
func main() {
// Create in-cluster config
config, err := rest.InClusterConfig()
if err != nil {
panic(err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
panic(err)
}
operator := &ConnectionPoolOperator{
clientset: clientset,
namespace: "production",
}
fmt.Println("Starting connection pool operator...")
operator.Run(context.Background())
}
This operator runs in the cluster, checks pods every 30 seconds, and automatically restarts any pod showing connection pool exhaustion symptoms.
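The getMetricsForPod function above is a stub. As a rough sketch of the missing piece, this is how the JSON body of a Prometheus instant query (the /api/v1/query endpoint) could be decoded with only the standard library. The pod name in the sample body is made up, and the HTTP call itself is left out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strconv"
)

// promResponse mirrors the subset of the instant-query response used
// here: a vector whose samples are [timestamp, "value"] pairs.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// firstSample extracts the first sample value from a query response body.
func firstSample(body []byte) (float64, error) {
	var r promResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, err
	}
	if r.Status != "success" {
		return 0, errors.New("query did not succeed")
	}
	if len(r.Data.Result) == 0 {
		return 0, errors.New("no samples returned")
	}
	s, ok := r.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, errors.New("sample value is not a string")
	}
	return strconv.ParseFloat(s, 64)
}

func main() {
	// Hypothetical response body for a pool-utilization query.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{"pod":"api-service-abc"},"value":[1700000000,"0.97"]}]}}`)
	v, err := firstSample(body)
	fmt.Println(v, err)
}
```

Returning an error for an empty result matters: a missing metric should make the operator skip the pod, not be read as 0% or 100% utilization.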
Key decisions in this design:
- Check multiple signals: Don’t restart based on one metric. We check that the pod is running, not already in a crash loop, and has high connection pool utilization.
- Record events: Every automated action creates a Kubernetes event. You can see what the operator did with kubectl get events.
- Simple is better: This runs every 30 seconds. That’s fine. You don’t need real-time event streaming for most self-healing scenarios.
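One way to harden the "don't act on a single reading" idea is to require several consecutive polls above the threshold before restarting. This is an optional extension rather than part of the operator shown above, and breachTracker with its parameters is a hypothetical helper.

```go
package main

import "fmt"

// breachTracker counts consecutive over-threshold samples per pod.
type breachTracker struct {
	threshold float64
	required  int
	streak    map[string]int
}

func newBreachTracker(threshold float64, required int) *breachTracker {
	return &breachTracker{threshold: threshold, required: required, streak: map[string]int{}}
}

// observe records one sample and reports whether the pod has now been
// over the threshold for the required number of consecutive samples.
// Any sample at or below the threshold resets the streak.
func (b *breachTracker) observe(pod string, value float64) bool {
	if value <= b.threshold {
		delete(b.streak, pod)
		return false
	}
	b.streak[pod]++
	return b.streak[pod] >= b.required
}

func main() {
	tracker := newBreachTracker(0.95, 3)
	for _, v := range []float64{0.97, 0.96, 0.90, 0.98, 0.99, 0.97} {
		fmt.Println(v, tracker.observe("api-service-abc", v))
	}
}
```

With a 30-second poll and three required samples, a pod has to stay saturated for about a minute and a half before it is restarted, which filters out brief spikes.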
The Memory Leak Detector Operator
Here’s another real operator that detects memory leaks before they cause OOM kills:
func (o *MemoryLeakOperator) detectMemoryLeak(pod corev1.Pod) bool {
// Get memory usage history for this pod
history := o.getMemoryHistory(pod)
// Need at least 12 data points (for example, 6 hours of samples taken every 30 minutes)
if len(history) < 12 {
return false
}
// Calculate trend: is memory consistently increasing?
var increases int
for i := 1; i < len(history); i++ {
if history[i] > history[i-1] {
increases++
}
}
// If memory increased in more than 80% of consecutive sample pairs, likely a leak.
// There are len(history)-1 comparisons, so that is the denominator.
if float64(increases)/float64(len(history)-1) > 0.8 {
// Check if we're approaching the limit
currentMemory := history[len(history)-1]
memoryLimit := o.getMemoryLimit(pod)
if currentMemory > memoryLimit*0.85 {
return true
}
}
return false
}
This detects slow memory leaks by tracking memory usage over time. If memory consistently increases and approaches the limit, restart the pod before it OOMs.
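Why this matters: OOM kills are abrupt. The kernel sends SIGKILL, so the process gets no chance to flush writes, close connections, or finish in-flight requests. That can corrupt data, leave connections hanging, and cause cascading failures. Deleting the pod sends SIGTERM first and honors the termination grace period, which is much safer.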
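Counting increases is one way to spot a trend. A different technique, shown here only as an alternative sketch, is to fit a least-squares line through the samples and estimate how many sample intervals remain before the limit is reached. The function names are hypothetical and the samples are assumed to be evenly spaced.

```go
package main

import "fmt"

// slope returns the least-squares slope of evenly spaced samples, in
// units per sample interval. It returns 0 for fewer than two samples.
func slope(samples []float64) float64 {
	n := float64(len(samples))
	if n < 2 {
		return 0
	}
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
}

// intervalsUntilLimit estimates how many more sample intervals remain
// before usage reaches limit at the current growth rate. The second
// result is false when usage is not growing.
func intervalsUntilLimit(samples []float64, limit float64) (float64, bool) {
	s := slope(samples)
	if s <= 0 {
		return 0, false
	}
	current := samples[len(samples)-1]
	if current >= limit {
		return 0, true
	}
	return (limit - current) / s, true
}

func main() {
	history := []float64{100, 200, 300, 400} // MiB, hypothetical samples
	fmt.Println(intervalsUntilLimit(history, 1000))
}
```

A slope tolerates noisy samples better than a pairwise count, and the projected time to the limit lets the restart be scheduled for a quiet period instead of happening as soon as the check fires.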
Using Kubebuilder for Complex Operators
For simple operators, writing Go directly against client-go is fine. But for complex logic, use Kubebuilder.
Kubebuilder generates the boilerplate for watching resources, handling events, and managing state. You just write the reconciliation logic.
Here’s a Kubebuilder operator that implements circuit breaking for external dependencies:
// api/v1/circuitbreaker_types.go
type CircuitBreaker struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec CircuitBreakerSpec `json:"spec,omitempty"`
Status CircuitBreakerStatus `json:"status,omitempty"`
}
type CircuitBreakerSpec struct {
// Target deployment to protect
TargetDeployment string `json:"targetDeployment"`
// Failure threshold (errors per minute)
FailureThreshold int `json:"failureThreshold"`
// How long to keep circuit open
OpenDuration metav1.Duration `json:"openDuration"`
// Action to take when circuit opens
Action string `json:"action"` // "scale-to-zero" or "route-to-fallback"
}
type CircuitBreakerStatus struct {
State string `json:"state"` // "closed", "open", "half-open"
ErrorRate int `json:"errorRate"`
LastTripped metav1.Time `json:"lastTripped,omitempty"`
}
// controllers/circuitbreaker_controller.go
func (r *CircuitBreakerReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
var cb v1.CircuitBreaker
if err := r.Get(ctx, req.NamespacedName, &cb); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
// Get current error rate from metrics
errorRate := r.getErrorRate(cb.Spec.TargetDeployment)
// Update status
cb.Status.ErrorRate = errorRate
// State machine logic
switch cb.Status.State {
case "", "closed": // A newly created resource has an empty state; treat it as closed
if errorRate > cb.Spec.FailureThreshold {
// Trip the circuit
cb.Status.State = "open"
cb.Status.LastTripped = metav1.Now()
// Take action
if cb.Spec.Action == "scale-to-zero" {
r.scaleDeployment(cb.Spec.TargetDeployment, 0)
}
r.recordEvent(cb, "CircuitTripped", "Error threshold exceeded")
}
case "open":
// Check if it's time to try half-open
openDuration := time.Since(cb.Status.LastTripped.Time)
if openDuration > cb.Spec.OpenDuration.Duration {
cb.Status.State = "half-open"
if cb.Spec.Action == "scale-to-zero" {
r.scaleDeployment(cb.Spec.TargetDeployment, 1)
}
}
case "half-open":
if errorRate < cb.Spec.FailureThreshold/2 {
// Success, close the circuit
cb.Status.State = "closed"
r.scaleDeployment(cb.Spec.TargetDeployment, -1) // -1 is this example's signal to restore the original scale, which must have been saved before scaling to zero
r.recordEvent(cb, "CircuitClosed", "Service recovered")
} else if errorRate > cb.Spec.FailureThreshold {
// Still failing, back to open
cb.Status.State = "open"
cb.Status.LastTripped = metav1.Now()
r.scaleDeployment(cb.Spec.TargetDeployment, 0)
}
}
// Update the status
if err := r.Status().Update(ctx, &cb); err != nil {
return ctrl.Result{}, err
}
// Requeue after 30 seconds
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
Deploy this with a custom resource:
apiVersion: healing.example.com/v1
kind: CircuitBreaker
metadata:
name: api-circuit-breaker
spec:
targetDeployment: api-service
failureThreshold: 50 # errors per minute
openDuration: 5m
action: scale-to-zero
Now when your API service starts failing (maybe the database is down), the circuit breaker automatically scales it to zero, preventing it from making the problem worse. After 5 minutes, it tries one replica to see if the issue is resolved.
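The state transitions are easier to test when they are pulled out of the reconciler into a pure function. This sketch uses the same states and thresholds as the controller above. The next function and its signature are hypothetical, and all side effects (scaling, events) are left out.

```go
package main

import (
	"fmt"
	"time"
)

type state string

const (
	closed   state = "closed"
	open     state = "open"
	halfOpen state = "half-open"
)

// next returns the state that follows cur, given the observed error
// rate, the failure threshold, and how long the circuit has been open.
// An empty state (a freshly created resource) is treated as closed.
func next(cur state, errorRate, threshold int, sinceTripped, openFor time.Duration) state {
	switch cur {
	case "", closed:
		if errorRate > threshold {
			return open
		}
		return closed
	case open:
		if sinceTripped > openFor {
			return halfOpen
		}
		return open
	case halfOpen:
		if errorRate < threshold/2 {
			return closed
		}
		if errorRate > threshold {
			return open
		}
		return halfOpen
	}
	return cur
}

func main() {
	s := state("")
	steps := []struct {
		errorRate    int
		sinceTripped time.Duration
	}{
		{80, 0},               // errors exceed the threshold
		{80, 6 * time.Minute}, // open duration has elapsed
		{10, 0},               // probe replica looks healthy
	}
	for _, st := range steps {
		s = next(s, st.errorRate, 50, st.sinceTripped, 5*time.Minute)
		fmt.Println(s)
	}
}
```

Keeping the decision separate from the side effects means the whole state machine can be covered by a table-driven unit test without a cluster.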
The Limitations Nobody Talks About
Custom operators are powerful, but they’re not magic. Here’s what can go wrong:
Operators can make things worse. I’ve seen operators get into fight loops where they keep restarting pods faster than they can recover. Always include backoff logic and circuit breakers in your operators themselves.
They’re another thing to maintain. You’re writing code that runs in production and has the power to delete pods. It needs tests, monitoring, and on-call support just like your applications.
They can hide problems. If your operator automatically restarts pods with memory leaks, you might never fix the actual leak. Self-healing shouldn’t replace root cause analysis.
They need good observability. You must be able to see what your operators are doing. Log every action. Create metrics. Record Kubernetes events.
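The backoff mentioned in the first point can be as simple as an exponentially growing cooldown per workload. The following is a minimal in-memory sketch under that assumption. restartLimiter is a hypothetical helper, and its state would be lost whenever the operator itself restarts.

```go
package main

import (
	"fmt"
	"time"
)

// restartLimiter enforces a cooldown between restarts of the same key
// that doubles with each restart, up to maxWait.
type restartLimiter struct {
	base, maxWait time.Duration
	attempts      map[string]int
	lastAt        map[string]time.Time
}

func newRestartLimiter(base, maxWait time.Duration) *restartLimiter {
	return &restartLimiter{
		base:     base,
		maxWait:  maxWait,
		attempts: map[string]int{},
		lastAt:   map[string]time.Time{},
	}
}

// cooldown returns the wait required after n previous restarts:
// base * 2^(n-1), capped at maxWait.
func (l *restartLimiter) cooldown(n int) time.Duration {
	d := l.base
	for i := 1; i < n; i++ {
		d *= 2
		if d >= l.maxWait {
			return l.maxWait
		}
	}
	if d > l.maxWait {
		return l.maxWait
	}
	return d
}

// allow reports whether key may be restarted at now, and records the
// restart if so.
func (l *restartLimiter) allow(key string, now time.Time) bool {
	n := l.attempts[key]
	if n > 0 && now.Sub(l.lastAt[key]) < l.cooldown(n) {
		return false
	}
	l.attempts[key] = n + 1
	l.lastAt[key] = now
	return true
}

func main() {
	l := newRestartLimiter(time.Minute, 10*time.Minute)
	start := time.Unix(0, 0)
	offsets := []time.Duration{0, 30 * time.Second, time.Minute, 2 * time.Minute, 3 * time.Minute}
	for _, offset := range offsets {
		fmt.Println(offset, l.allow("api-service", start.Add(offset)))
	}
}
```

Calling allow before every delete turns a potential restart storm into a handful of spaced-out attempts. A fuller version would also reset the attempt counter after a long healthy period.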
What You Should Actually Build
Don’t build operators for everything. Start with the problems that wake you up repeatedly:
Start here:
- Restart pods with specific error patterns
- Clear caches when they get too large
- Rotate credentials before they expire
- Scale down unused resources
Don’t start here:
- Complex ML-based anomaly detection
- Predicting failures before they happen
- Fully autonomous incident response
- Anything that requires perfect accuracy
Simple operators that handle well-understood failure modes are incredibly valuable. Complex operators that try to be too smart cause more problems than they solve.
My Actual Production Operators
Here’s what I run:
- Connection pool operator: Restarts pods with exhausted connection pools (shown above)
- Certificate rotation operator: Rotates TLS certificates 7 days before expiry
- Stale cache operator: Clears Redis caches when they exceed size thresholds
- Zombie pod operator: Deletes pods stuck in Terminating state for > 5 minutes
- Cost optimization operator: Scales down dev/staging deployments outside business hours
These five operators have eliminated about 70% of our routine manual interventions. They’re simple, focused, and they just work.
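Most of these operators reduce to a small time-based predicate plus an API call. The following sketch shows the decision logic for the certificate and zombie pod cases. The function names are hypothetical, and how the timestamps are read from the cluster is left out, so treat the mapping to real Kubernetes fields as an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is a fixed reference time used to keep the example deterministic.
var epoch = time.Unix(0, 0).UTC()

// needsRotation reports whether a certificate expiring at notAfter is
// within window of its expiry at time now.
func needsRotation(notAfter, now time.Time, window time.Duration) bool {
	return !now.Before(notAfter.Add(-window))
}

// stuckTerminating reports whether termination that began at since has
// lasted longer than limit. A nil since means the pod is not being deleted.
func stuckTerminating(since *time.Time, now time.Time, limit time.Duration) bool {
	if since == nil {
		return false
	}
	return now.Sub(*since) > limit
}

func main() {
	now := epoch.Add(24 * time.Hour)

	// Certificate expires in 3 days, rotation window is 7 days.
	expiry := now.Add(3 * 24 * time.Hour)
	fmt.Println(needsRotation(expiry, now, 7*24*time.Hour))

	// Pod has been terminating for 10 minutes, limit is 5 minutes.
	since := now.Add(-10 * time.Minute)
	fmt.Println(stuckTerminating(&since, now, 5*time.Minute))
}
```

Passing now in as a parameter, rather than calling time.Now() inside the predicate, keeps each rule trivially testable.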
The Bottom Line
Self-healing Kubernetes isn’t about building Skynet. It’s about codifying the boring fixes you do manually and letting the cluster handle them.
Start small. Pick one problem that wakes you up repeatedly. Write an operator that fixes it automatically. Test it thoroughly. Deploy it. Monitor it.
Then pick the next problem.
After a year of this, you’ll have a cluster that fixes most issues before you even notice them. You’ll sleep better. Your team will be more productive. And you’ll wonder why you spent so long fixing the same problems by hand.
The goal isn’t zero human intervention. The goal is human intervention only for the interesting problems, not the ones you’ve solved twenty times before.