Kubernetes Manifest Hardening Review
Review Kubernetes YAML manifests for security misconfigurations, resource footguns, and production readiness gaps.
Tags: Cursor, kubernetes, k8s, yaml, security, devops, containers, deployment, infrastructure
How to Use
Save to .cursor/rules/k8s-manifest-review.mdc with glob pattern: k8s/**/*.yaml, k8s/**/*.yml, manifests/**/*.yaml, deploy/**/*.yaml, charts/**/*.yaml. The rule activates automatically when editing Kubernetes manifests matching those paths. To verify, open any Kubernetes YAML file in a matched directory and ask Cursor to review it. Check Cursor Settings > Rules to confirm the rule is loaded.
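A minimal rule file might look like the following sketch. The frontmatter field names (description, globs, alwaysApply) assume Cursor's current .mdc format; adjust them if your version differs:

```yaml
---
description: Review Kubernetes manifests for security, resource, and reliability issues
globs: k8s/**/*.yaml,k8s/**/*.yml,manifests/**/*.yaml,deploy/**/*.yaml,charts/**/*.yaml
alwaysApply: false
---

# The agent definition below goes here, after the frontmatter.
```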
Agent Definition
## Purpose

Review Kubernetes manifests (Deployments, StatefulSets, Services, Ingresses, CronJobs, NetworkPolicies) for security misconfigurations, resource mismanagement, and reliability gaps that cause outages or vulnerabilities in production. Focus on problems that linters like kube-linter miss or under-report.

## When to apply

Apply to any .yaml or .yml file under k8s/, manifests/, deploy/, or charts/, or to any file containing apiVersion and kind fields matching Kubernetes resource types.

## Review rules

### Security context

- Every container must set securityContext explicitly. Omitting it inherits the node default, which is usually root.
- runAsNonRoot: true is required. If the image needs root, flag it as Critical and require justification in a comment.
- readOnlyRootFilesystem: true unless the workload writes to specific paths. If writes are needed, mount emptyDir or PVC volumes for those paths only.
- Drop ALL capabilities, then add back only what is needed. Always set allowPrivilegeEscalation: false.
- Never set privileged: true outside of system-level DaemonSets (CNI plugins, log collectors). Flag as Critical otherwise.
- hostNetwork, hostPID, hostIPC: flag as Critical unless the workload is a node-level agent.

### Resource limits

- Every container must set both requests and limits for cpu and memory. Missing requests makes scheduling unpredictable; missing limits lets a single pod OOM-kill its neighbors.
- CPU limits on latency-sensitive workloads cause throttling. For web servers and API pods, prefer setting only a CPU request with no CPU limit, but always set a memory limit. Flag CPU limits on latency-sensitive workloads as Warning with this explanation.
- Memory requests should equal memory limits. The Guaranteed QoS class requires requests to equal limits for every resource in every container; when memory requests are lower than limits, the pod is Burstable and more likely to be evicted under node memory pressure.
- Resource values should use explicit units: 100m not 0.1 for CPU, 256Mi not 268435456 for memory.
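Taken together, the security-context and resource rules above produce a pod spec fragment like this sketch (the workload name, image, and values are illustrative, not prescriptive):

```yaml
# Illustrative container spec applying the hardening rules above
containers:
  - name: api                              # hypothetical latency-sensitive service
    image: registry.example.com/api:1.4.2
    securityContext:
      runAsNonRoot: true
      runAsUser: 10001
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]                      # drop everything; add back only what is needed
    resources:
      requests:
        cpu: 250m                          # CPU request only -- no CPU limit, to avoid throttling
        memory: 256Mi
      limits:
        memory: 256Mi                      # memory limit equals request
    volumeMounts:
      - name: tmp
        mountPath: /tmp                    # writable scratch path despite read-only root
volumes:
  - name: tmp
    emptyDir: {}
```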
### Probes

- Every long-running workload (Deployment, StatefulSet) must have a readinessProbe and a livenessProbe.
- livenessProbe must not point to the same endpoint as readinessProbe if that endpoint checks downstream dependencies. A database outage should not cause liveness failures and restart loops. livenessProbe should check only that the process is alive (a /healthz that returns 200 without dependency checks).
- startupProbe is required for workloads with slow initialization (JVM apps, ML model loading). Without it, livenessProbe kills the container before it finishes starting.
- initialDelaySeconds on livenessProbe without startupProbe: flag as Warning if under 10 seconds for JVM or Python workloads.
- Probe timeoutSeconds defaults to 1 second. For endpoints that query a database, this is too short. Flag default timeoutSeconds on dependency-checking probes as Suggestion.

### Labels and selectors

- Every resource must have app.kubernetes.io/name and app.kubernetes.io/version labels. These are the Kubernetes recommended labels and are used by tooling (Prometheus, Istio, ArgoCD).
- Deployment spec.selector.matchLabels must exactly match spec.template.metadata.labels for the selector keys. A mismatch causes the Deployment to select the wrong pods, or none at all.
- Do not use matchLabels with only one generic label like app: myapp. Include at least app.kubernetes.io/name and app.kubernetes.io/instance to avoid cross-selecting pods from another release.

### Networking

- If the namespace runs more than one service, a default-deny NetworkPolicy should exist. Flag its absence as Warning.
- Service targetPort should reference a named port from the container spec, not a raw number. Named ports survive port renumbering.
- Ingress resources must set a TLS section for any host exposed externally. Plaintext ingress is Critical.

### Pod disruption and scheduling

- Production Deployments with replicas > 1 should have a PodDisruptionBudget. Without one, a node drain can take down all replicas simultaneously.
- Deployments with replicas > 1 should use podAntiAffinity to spread across nodes. At minimum, use preferredDuringSchedulingIgnoredDuringExecution with topologyKey kubernetes.io/hostname.
- topologySpreadConstraints are preferred over podAntiAffinity for clusters with 3+ zones. Flag podAntiAffinity without topology spread as Suggestion in multi-zone contexts.

### Secrets and config

- Never inline secret values in manifests. Use secretKeyRef, external-secrets, or sealed-secrets. Plaintext secrets in YAML are Critical.
- ConfigMap data referenced by envFrom or volumeMount must exist in the same namespace. Flag cross-namespace references as Warning (they silently fail).
- Environment variables from secrets should use secretKeyRef with a specific key, not envFrom on the entire secret. envFrom exposes all keys, increasing blast radius.

### CronJobs

- concurrencyPolicy should be set explicitly. The default (Allow) lets jobs pile up if one hangs.
- successfulJobsHistoryLimit and failedJobsHistoryLimit should be set. The defaults keep 3 and 1 respectively, which may be insufficient for debugging.
- activeDeadlineSeconds on the Job template prevents hung jobs from running forever.

### Image policy

- Never use the :latest tag. It defeats rollback and makes deployments non-reproducible. Flag as Critical.
- Use digest pinning (image@sha256:...) for production workloads. Tag-only references are Warning.
- imagePullPolicy should be IfNotPresent for tagged images, and Always only for mutable tags (which should not exist in production).

## Output format

For each finding, report:

- Severity: Critical, Warning, or Suggestion
- Location: file path, resource kind/name, and the specific field path (e.g., spec.template.spec.containers[0].securityContext)
- Problem: what is wrong and what will happen in production if unchanged
- Fix: the specific field or value to add or change
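A finding in this format might read as follows (the file path, resource, and image names are hypothetical):

```
Severity: Critical
Location: k8s/web/deployment.yaml, Deployment/web, spec.template.spec.containers[0].image
Problem: Image is pinned to :latest. Deploys are non-reproducible, and a rollback
redeploys whatever the tag currently points at rather than the previous build.
Fix: Pin to an immutable tag or digest, e.g. image: registry.example.com/web:1.8.3.
```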