devops-interview-handbook

Scenario: Kubernetes Troubleshooting

Problem Statement

A production Kubernetes cluster is experiencing issues: pods are not starting, services are unreachable, and nodes are reporting as NotReady. You need to systematically troubleshoot and resolve these issues.

Environment

• Kubernetes v1.28.0 running on AWS EC2 (node names such as ip-10-0-1-10.ec2.internal)
• Three worker nodes spread across three subnets

Symptoms

  1. Node Issues:
    • 2 of the 3 nodes showing NotReady status
    • kubectl get nodes shows:
      NAME                          STATUS     ROLES    AGE   VERSION
      ip-10-0-1-10.ec2.internal     Ready      <none>   30d   v1.28.0
      ip-10-0-2-20.ec2.internal     NotReady   <none>   30d   v1.28.0
      ip-10-0-3-30.ec2.internal     NotReady   <none>   30d   v1.28.0
      
  2. Pod Issues:
    • Multiple pods in Pending state
    • Some pods in CrashLoopBackOff
    • Pods cannot be scheduled
  3. Service Issues:
    • Services returning 503 errors
    • Endpoints not found
    • DNS resolution failing
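
Before working through the steps below, a quick first pass over cluster state helps scope the blast radius; a minimal triage sketch:

# Capture overall cluster state
kubectl get nodes -o wide
kubectl get pods -A -o wide | grep -Ev 'Running|Completed'
kubectl get events -A --sort-by='.lastTimestamp' | tail -30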

Step-by-Step Troubleshooting

Step 1: Check Node Status

# Get node status
kubectl get nodes

# Describe problematic nodes
kubectl describe node ip-10-0-2-20.ec2.internal

# Check node conditions
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, conditions: .status.conditions}'

What to Look For:

• The Ready condition's status, reason, and message in the describe output
• Pressure conditions: MemoryPressure, DiskPressure, PIDPressure
• NetworkUnavailable=True (often means the CNI plugin has failed)
• Stale heartbeat timestamps (suggest kubelet or network problems)

Common Issues:

• kubelet stopped or crash-looping on the node
• Container runtime down
• CNI plugin not initialized
• Node unreachable (network partition, stopped instance)

Step 2: Check Node Resources

# Check node resource usage
kubectl top nodes

# Check node capacity and allocatable
kubectl describe node ip-10-0-2-20.ec2.internal | grep -A 5 "Allocated resources"

# Check for resource pressure
kubectl get nodes -o custom-columns='NAME:.metadata.name,MEMORY-PRESSURE:.status.conditions[?(@.type=="MemoryPressure")].status,DISK-PRESSURE:.status.conditions[?(@.type=="DiskPressure")].status'

Actions:

• If a node is under memory or disk pressure, cordon and drain it before remediating (as shown below)
• Free disk space (prune images, vacuum logs) or add capacity
• Right-size pod requests/limits so nodes are not oversubscribed
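
If a node must be taken out of rotation, the usual cordon-and-drain sequence looks like this (node name taken from this scenario):

# Stop new pods from landing on the node
kubectl cordon ip-10-0-2-20.ec2.internal

# Evict running pods (DaemonSet pods stay; emptyDir data is lost)
kubectl drain ip-10-0-2-20.ec2.internal --ignore-daemonsets --delete-emptydir-data

# Once remediated, allow scheduling again
kubectl uncordon ip-10-0-2-20.ec2.internal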

Step 3: Check kubelet Status

# SSH to node (if possible)
ssh ec2-user@ip-10-0-2-20.ec2.internal

# Check kubelet status
sudo systemctl status kubelet

# Check kubelet logs
sudo journalctl -u kubelet -n 100 --no-pager

# Check kubelet configuration
cat /var/lib/kubelet/config.yaml

Common kubelet Issues:

• Service stopped or crash-looping (see the systemctl/journalctl output above)
• Expired kubelet client certificates
• cgroup driver mismatch between kubelet and the container runtime
• "network plugin is not ready: cni config uninitialized" (a CNI problem, not kubelet itself)
• Evictions triggered by disk or memory pressure
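
One check worth calling out: kubelet client certificate expiry. The path below assumes a kubeadm-provisioned node; other installers may place certificates elsewhere:

# When does the kubelet client certificate expire?
sudo openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem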

Step 4: Check Container Runtime

# On node, check container runtime
sudo systemctl status containerd
# or
sudo systemctl status docker

# Check runtime logs
sudo journalctl -u containerd -n 100

# Test container runtime
sudo crictl images
sudo crictl ps -a

Common Issues:

• containerd/Docker daemon stopped or crash-looping
• Image pulls failing (registry auth, blocked egress, disk full)
• cgroup driver mismatch with kubelet (systemd vs cgroupfs)
• Corrupted runtime state after an unclean node shutdown
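
A cgroup driver mismatch is easy to confirm by comparing both sides; the file paths below assume default install locations:

# kubelet's cgroup driver
grep cgroupDriver /var/lib/kubelet/config.yaml

# containerd's cgroup driver (SystemdCgroup = true means systemd)
sudo grep -i systemdcgroup /etc/containerd/config.toml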

Step 5: Check Pod Scheduling

# Check pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Describe pending pod
kubectl describe pod <pod-name> -n <namespace>

# Check events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

Common Scheduling Issues:

• Insufficient CPU or memory on all schedulable nodes ("0/N nodes are available: Insufficient cpu")
• nodeSelector or affinity rules that no node satisfies
• Taints with no matching tolerations
• Unbound PVCs or volume zone conflicts
• A requested hostPort already in use on every node
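
Taints are a frequent culprit and can be listed across all nodes in one shot:

# One-line view of taints on every node
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'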

Step 6: Check Network Issues

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test DNS from a throwaway pod (busybox:1.28 pinned; nslookup is unreliable in newer busybox images)
kubectl run -it --rm debug --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default

# Check service endpoints
kubectl get endpoints -A

# If endpoints are empty, compare the service selector with the pod labels
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
kubectl get pods -l <selector> --show-labels

Common Network Issues:

• CoreDNS pods down or crash-looping
• CNI plugin failing on some nodes (pods stuck in ContainerCreating)
• kube-proxy not running, so Service virtual IPs are never programmed
• NetworkPolicies blocking expected traffic
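
Beyond CoreDNS, verify the rest of the data path. The kube-proxy label is standard; the CNI DaemonSet name varies by plugin (aws-node below assumes the AWS VPC CNI):

# kube-proxy should be Running on every node
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide

# CNI plugin DaemonSet (name is plugin-specific)
kubectl get daemonset -n kube-system aws-node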

Step 7: Check API Server and Control Plane

# Check control plane component health (deprecated since v1.19 but often still informative)
kubectl get componentstatuses
# or
kubectl get cs

# Check API server logs (if on master)
kubectl logs -n kube-system kube-apiserver-<node-name>

# Test API server connectivity
kubectl cluster-info

Common Control Plane Issues:

• API server overloaded or unreachable (kubectl calls slow or failing)
• etcd unhealthy, out of space, or quorum lost
• Expired control plane certificates
• Controller manager or scheduler down, so nothing is being reconciled
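
Because componentstatuses is deprecated, the API server's structured health endpoints are the more reliable check on recent clusters:

# Per-check readiness and liveness of the API server
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'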

Step 8: Check Storage Issues

# Check PVCs
kubectl get pvc -A

# Check PVs
kubectl get pv

# Check storage classes
kubectl get storageclass

# Describe pending PVC
kubectl describe pvc <pvc-name> -n <namespace>

Common Storage Issues:

• PVC stuck Pending because no default StorageClass is set
• CSI driver or provisioner pods failing
• WaitForFirstConsumer volumes waiting on pod scheduling (or a zone mismatch)
• Node volume attach limits reached
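
If the root cause is a missing default StorageClass, one annotation fixes it; the class name gp2 below is an assumption, substitute your own:

# Mark a StorageClass as the cluster default
kubectl patch storageclass gp2 -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'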

Resolution Examples

Example 1: Node NotReady - Disk Full

Symptoms:

• Node NotReady with DiskPressure=True
• kubelet logs report "no space left on device"
• Pods on the node being evicted

Resolution:

# SSH to node
ssh ec2-user@ip-10-0-2-20.ec2.internal

# Check disk usage
df -h

# Clean up Docker/containerd images
sudo docker system prune -a --volumes
# or for containerd
sudo crictl rmi --prune

# Clean up old logs
sudo journalctl --vacuum-time=7d

# If still full, increase disk size (AWS)
# 1. Create snapshot
# 2. Resize EBS volume
# 3. Extend filesystem
sudo growpart /dev/nvme0n1 1
sudo resize2fs /dev/nvme0n1p1   # ext4 shown; for XFS use: sudo xfs_growfs /

# Restart kubelet
sudo systemctl restart kubelet

# Verify node status
kubectl get nodes

Example 2: Pods Pending - No Available Nodes

Symptoms:

• Pods stuck in Pending and never scheduled
• Pod events show "0/N nodes are available" with reasons such as Insufficient cpu, untolerated taints, or unmatched node selectors

Resolution:

# Check why pods can't be scheduled
kubectl describe pod <pod-name> | grep -A 10 "Events:"

# Common reasons:
# 1. Resource constraints
kubectl describe node | grep -A 5 "Allocated resources"

# 2. Node selectors
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeSelector}'

# 3. Taints
kubectl describe node | grep Taint

# Solutions:
# Option 1: Add more nodes
# Option 2: Remove node selector or add label
kubectl label node <node-name> <key>=<value>

# Option 3: Remove taint (if appropriate)
kubectl taint nodes <node-name> <key>:<effect>-

# Option 4: Add a toleration to the pod spec (see the sketch below)
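
A minimal toleration sketch for Option 4; the key, value, and effect are placeholders that must match the taint reported on the node:

# Added under the pod's spec
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "special"
  effect: "NoSchedule"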

Example 3: Service Unreachable - No Endpoints

Symptoms:

• Service returns 503s or connections are refused
• kubectl get endpoints <service-name> shows <none>

Resolution:

# Check service selector
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'

# Check if pods match selector
kubectl get pods -l app=web --show-labels

# Common issue: Selector doesn't match pod labels
# Fix: Update service or pod labels

# Update service selector
kubectl patch svc <service-name> -p '{"spec":{"selector":{"app":"web"}}}'

# Or update pod labels
kubectl label pods <pod-name> app=web

# Verify endpoints
kubectl get endpoints <service-name>

Example 4: DNS Not Working

Symptoms:

• nslookup/getent fail from inside pods
• Application logs show "no such host" errors or name resolution timeouts

Resolution:

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# If not running, check why
kubectl describe pod -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check CoreDNS config
kubectl get configmap coredns -n kube-system -o yaml

# Restart CoreDNS if needed
kubectl rollout restart deployment/coredns -n kube-system
# or delete the pods and let the Deployment recreate them
kubectl delete pod -n kube-system -l k8s-app=kube-dns

# Verify DNS (busybox:1.28 pinned, as above)
kubectl run -it --rm debug --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default

Prevention Strategies

1. Resource Management

# Set resource requests and limits
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
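
To apply sensible defaults to every container in a namespace rather than per manifest, a LimitRange works well; the namespace and values below are illustrative assumptions:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:           # used when a container omits limits
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:    # used when a container omits requests
      memory: "256Mi"
      cpu: "250m"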

2. Node Health Monitoring

Alert on node conditions (Ready, MemoryPressure, DiskPressure) and on kubelet health so a NotReady node pages someone before workloads pile up; a sample alert rule follows.
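
A Prometheus alert sketch, assuming kube-state-metrics is installed (it exports kube_node_status_condition):

groups:
- name: node-health
  rules:
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"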

3. Cluster Autoscaling

# Cluster Autoscaler adds nodes when pods are unschedulable and removes
# underutilized ones. On AWS it discovers node groups via ASG tags;
# typical documented flags:
#   --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
#   --balance-similar-node-groups
#   --skip-nodes-with-system-pods=false

4. Pod Disruption Budgets

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
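
Verify the budget is being tracked and shows the expected allowed disruptions:

kubectl get pdb web-pdb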

5. Regular Maintenance

• Keep Kubernetes versions and node OS patches current (respect the supported version skew)
• Rotate certificates before they expire
• Prune unused images and vacuum logs to avoid disk pressure
• Periodically review capacity, requests/limits, and autoscaler behavior
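
On kubeadm-managed control planes, certificate expiry can be audited directly (the command assumes kubeadm):

# List expiry dates for all control plane certificates
sudo kubeadm certs check-expiration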

Troubleshooting Checklist

• Nodes: Ready? Pressure conditions? kubelet and container runtime healthy?
• Scheduling: Pending pods? Do the events explain why?
• Workloads: CrashLoopBackOff? Check logs, probes, and resource limits
• Networking: CoreDNS running? Endpoints populated? kube-proxy and CNI healthy?
• Control plane: API server reachable? readyz/livez passing?
• Storage: PVCs bound? Default StorageClass present?

Key Takeaways

• Troubleshoot outward: node → kubelet/runtime → scheduling → network → control plane → storage
• kubectl describe and kubectl get events usually name the root cause directly
• An empty Endpoints object almost always means a selector/label mismatch
• Prevention (requests/limits, PDBs, autoscaling, monitoring) is cheaper than firefighting