
When Kubernetes Decides a Pod Has to Go: Resource Limits, Memory Pressure, and Eviction Internals

Why this matters

Resources are one of the most sensitive parts of Kubernetes in day-to-day operations. A surprising number of production incidents ultimately come down to bad resource settings, and the constant push for cost reduction usually ends up becoming a discussion about CPU and memory requests and limits.

How much CPU or memory a service should get is mostly an operational judgment built from testing and experience. The more useful question here is different:

When are those limits actually checked, who checks them, and what happens after a limit is crossed?

That matters for two reasons. First, when something goes wrong, you want to identify the cause quickly. Second, these limits affect more than runtime behavior—they also interact with scheduling and autoscaling decisions.

This is where pod eviction becomes worth studying.

A few things to know first

Before looking at the code path, it helps to already be familiar with:

  • cgroup
  • resource settings: limits and requests
  • epoll

Three practical questions

  1. When does Kubernetes check a pod's resource limits?
  2. Under what conditions does a pod get evicted?
  3. What strategy does Kubernetes use when choosing which pod to evict?

Starting from the eviction manager

A natural entry point in the kubelet code is pkg/kubelet/eviction/eviction_manager.go.

Instead of beginning with limits or requests, it makes more sense to start from eviction itself. Eviction is the action that directly affects the final state of the pod on the node. Resource limits are part of what triggers it, but eviction is the visible outcome.

The package and interface make its purpose obvious:

```go
// pkg/kubelet/eviction/types.go:57
// Manager evaluates when an eviction threshold for node stability has been met on the node.
type Manager interface {
	// Start starts the control loop to monitor eviction thresholds at specified interval.
	Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration)

	// IsUnderMemoryPressure returns true if the node is under memory pressure.
	IsUnderMemoryPressure() bool

	// IsUnderDiskPressure returns true if the node is under disk pressure.
	IsUnderDiskPressure() bool

	// IsUnderPIDPressure returns true if the node is under PID pressure.
	IsUnderPIDPressure() bool
}
```

From the method names alone, the core responsibilities are already visible. The most important one is Start, so the next step is to follow its implementation.

eviction_manager: two paths into the same decision loop

The interface is implemented by managerImpl. Its Start method shows how kubelet actually watches for resource pressure:

```go
// pkg/kubelet/eviction/eviction_manager.go:178
// Start starts the control loop to observe and response to low compute resources.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
	thresholdHandler := func(message string) {
		klog.InfoS(message)
		m.synchronize(diskInfoProvider, podFunc)
	}
	if m.config.KernelMemcgNotification {
		for _, threshold := range m.config.Thresholds {
			if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {
				notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)
				if err != nil {
					klog.InfoS("Eviction manager: failed to create memory threshold notifier", "err", err)
				} else {
					go notifier.Start()
					m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)
				}
			}
		}
	}
	// start the eviction manager monitoring
	go func() {
		for {
			if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
				klog.InfoS("Eviction manager: pods evicted, waiting for pod to be cleaned up", "pods", klog.KObjSlice(evictedPods))
				m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
			} else {
				time.Sleep(monitoringInterval)
			}
		}
	}()
}
```

A few details stand out immediately:

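  1. When m.config.KernelMemcgNotification is enabled, kubelet calls NewMemoryThresholdNotifier once for each configured memory-related threshold (SignalMemoryAvailable or SignalAllocatableMemoryAvailable).
  2. Each notifier runs in its own goroutine: go notifier.Start().
  3. A separate goroutine loops forever: it calls m.synchronize(...), waits for cleanup if pods were evicted, and otherwise sleeps for monitoringInterval.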

That means the eviction manager has two ways to react:

  • event-driven checks for memory pressure
  • periodic checks through the monitoring loop

At this point, one obvious question appears: NewMemoryThresholdNotifier is clearly about memory. What about CPU?

That question is worth parking for a moment instead of breaking the reading flow. The memory path is tightly connected to eviction notifications, so it makes sense to finish that branch first and come back to CPU later.

So far, the picture looks like this:

  • kubelet has a dedicated eviction manager
  • it starts asynchronous watchers for memory thresholds
  • it also periodically synchronizes node and pod state
  • if a pod needs to be evicted, cleanup follows

The next question is: how does memory monitoring actually work?

MemoryThresholdNotifier: where the event comes from

If you only have a rough idea of cgroups, it is easy to assume this must be implemented as periodic polling: keep checking memory usage, compare it to a threshold, and fire a notification if the value is exceeded.

That would work, but it would also be a poor design from a performance perspective.

The Start method of the memory notifier is surprisingly small:

```go
// pkg/kubelet/eviction/memory_threshold_notifier.go:73
func (m *memoryThresholdNotifier) Start() {
	klog.InfoS("Eviction manager: created memoryThresholdNotifier", "notifier", m.Description())
	for range m.events {
		m.handler(fmt.Sprintf("eviction manager: %s crossed", m.Description()))
	}
}
```

This code just listens on the events channel and invokes the handler whenever an event arrives.

So the real issue becomes: who writes into m.events?

That happens inside UpdateThreshold:

```go
// pkg/kubelet/eviction/memory_threshold_notifier.go:80
func (m *memoryThresholdNotifier) UpdateThreshold(summary *statsapi.Summary) error {
	// .....
	newNotifier, err := m.factory.NewCgroupNotifier(m.cgroupPath, memoryUsageAttribute, memcgThreshold.Value())
	if err != nil {
		return err
	}
	m.notifier = newNotifier
	go m.notifier.Start(m.events)
	return nil
}
```

Here the key object finally appears: NewCgroupNotifier.

The code uses a NotifierFactory, which is a straightforward factory-pattern abstraction around notifier creation.

One practical detail matters when reading this code: if you're browsing it on a non-Linux machine, your IDE may jump to the unsupported implementation. For example, on macOS it is easy to land in pkg/kubelet/eviction/threshold_notifier_unsupported.go. That is not the implementation you want. The real behavior lives in the Linux-specific file:

  • pkg/kubelet/eviction/threshold_notifier_linux.go

linuxCgroupNotifier: the most important piece

This part is the real core of the mechanism. Once you understand how linuxCgroupNotifier works, you also understand a reusable Linux technique for memory monitoring with cgroups.

The file is not large. Conceptually, it has three parts:

  • initialization
  • start
  • wait

Initialization: NewCgroupNotifier

A useful code-reading trick in Go: when a function is long, temporarily ignore the repetitive if err != nil branches and focus on the main path.

```go
// pkg/kubelet/eviction/threshold_notifier_linux.go:49
// NewCgroupNotifier returns a linuxCgroupNotifier, which performs cgroup control operations required
// to receive notifications from the cgroup when the threshold is crossed in either direction.
func NewCgroupNotifier(path, attribute string, threshold int64) (CgroupNotifier, error) {
	// ....
	var watchfd, eventfd, epfd, controlfd int
	var err error
	watchfd, err = unix.Open(fmt.Sprintf("%s/%s", path, attribute), unix.O_RDONLY|unix.O_CLOEXEC, 0)
	// ....
	controlfd, err = unix.Open(fmt.Sprintf("%s/cgroup.event_control", path), unix.O_WRONLY|unix.O_CLOEXEC, 0)
	// ....
	eventfd, err = unix.Eventfd(0, unix.EFD_CLOEXEC)
	// ....
	epfd, err = unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	// ....
	config := fmt.Sprintf("%d %d %d", eventfd, watchfd, threshold)
	_, err = unix.Write(controlfd, []byte(config))
	if err != nil {
		return nil, err
	}
	return &linuxCgroupNotifier{
		eventfd: eventfd,
		epfd:    epfd,
		stop:    make(chan struct{}),
	}, nil
}
```

Without the error handling noise, the main logic is very clear:

  1. create watchfd
  2. create controlfd
  3. create eventfd
  4. create epfd — this is where epoll enters the picture
  5. write eventfd watchfd threshold into controlfd

Once you see that, the design becomes fairly intuitive. The notifier registers an event relationship with the cgroup through cgroup.event_control. When memory usage crosses the configured threshold, the kernel signals back through eventfd, and userspace waits for that event through epoll. Note that cgroup.event_control belongs to the cgroup v1 memory controller, so this registration step is specific to cgroup v1.

Start: waiting for cgroup events

The Start method is equally direct:

```go
// pkg/kubelet/eviction/threshold_notifier_linux.go:110
func (n *linuxCgroupNotifier) Start(eventCh chan<- struct{}) {
	err := unix.EpollCtl(n.epfd, unix.EPOLL_CTL_ADD, n.eventfd, &unix.EpollEvent{
		Fd:     int32(n.eventfd),
		Events: unix.EPOLLIN,
	})
	// ...
	buf := make([]byte, eventSize)
	for {
		select {
		case <-n.stop:
			return
		default:
		}
		event, err := wait(n.epfd, n.eventfd, notifierRefreshInterval)
		// ...
		_, err = unix.Read(n.eventfd, buf)
		// ...
		eventCh <- struct{}{}
	}
}
```

This is a standard epoll usage pattern:

  • use EpollCtl to register the fd
  • wait for events
  • read from eventfd
  • send a signal into eventCh

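The remaining wait helper holds no surprises: it wraps the epoll wait call with a timeout (notifierRefreshInterval) and reports whether the eventfd became readable. Because the wait times out, each loop iteration also gets a chance to notice a close of n.stop.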

So the essence of linuxCgroupNotifier is this:

write eventfd watchfd threshold into cgroup.event_control, then use epoll to wait for notifications.

That is the mechanism kubelet uses to get notified when a memory threshold is crossed. It is also a good example of why reading production source code is valuable: sometimes you do not just learn how one product works—you pick up a practical systems-programming pattern you can reuse elsewhere.
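The eventfd-plus-epoll half of this pattern can be tried without root or a cgroup v1 hierarchy. The following is a minimal sketch, not the kubelet's code: it uses the standard syscall package instead of golang.org/x/sys/unix, the helper names (newEventfd, waitForEvent, signalAndWait) are made up for illustration, and a plain write to the eventfd stands in for the kernel signalling a crossed threshold. It is Linux-only and assumes a little-endian host.

```go
//go:build linux

package main

import (
	"encoding/binary"
	"fmt"
	"syscall"
	"time"
)

// newEventfd creates an eventfd with an initial counter of 0.
// The syscall package has no Eventfd wrapper, so the raw eventfd2 syscall is used.
func newEventfd() (int, error) {
	fd, _, errno := syscall.Syscall(syscall.SYS_EVENTFD2, 0, 0, 0)
	if errno != 0 {
		return -1, errno
	}
	return int(fd), nil
}

// waitForEvent blocks until eventfd is readable or the timeout expires.
// It returns true only if the eventfd became readable.
func waitForEvent(epfd, eventfd int, timeout time.Duration) (bool, error) {
	events := make([]syscall.EpollEvent, 1)
	for {
		n, err := syscall.EpollWait(epfd, events, int(timeout/time.Millisecond))
		if err == syscall.EINTR {
			continue // interrupted by a signal: retry
		}
		if err != nil {
			return false, err
		}
		if n == 0 {
			return false, nil // timed out
		}
		return int(events[0].Fd) == eventfd && events[0].Events&syscall.EPOLLIN != 0, nil
	}
}

// signalAndWait registers an eventfd with epoll, simulates the kernel
// signalling it once, and returns the counter value read back.
func signalAndWait() (uint64, error) {
	eventfd, err := newEventfd()
	if err != nil {
		return 0, err
	}
	defer syscall.Close(eventfd)

	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		return 0, err
	}
	defer syscall.Close(epfd)

	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(eventfd)}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, eventfd, &ev); err != nil {
		return 0, err
	}

	// Stand-in for the kernel: in the kubelet, the memory cgroup increments
	// this counter after "eventfd watchfd threshold" was written to cgroup.event_control.
	buf := make([]byte, 8)
	binary.LittleEndian.PutUint64(buf, 1) // assumes a little-endian host
	if _, err := syscall.Write(eventfd, buf); err != nil {
		return 0, err
	}

	ready, err := waitForEvent(epfd, eventfd, time.Second)
	if err != nil {
		return 0, err
	}
	if !ready {
		return 0, fmt.Errorf("timed out waiting for eventfd")
	}
	// Reading resets the counter, so the next wait blocks again.
	if _, err := syscall.Read(eventfd, buf); err != nil {
		return 0, err
	}
	return binary.LittleEndian.Uint64(buf), nil
}

func main() {
	n, err := signalAndWait()
	fmt.Println("events signalled:", n, "err:", err)
}
```

The read after the wait matters: it drains the eventfd counter, which is why the kubelet loop also reads from eventfd before forwarding a value on eventCh.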

If memory is event-driven, what about the rest?

Up to this point, the path has focused on memory. But eviction is not limited to memory alone. To see how Kubernetes decides a pod should actually be removed, we need to look at the periodic and decision-making path inside synchronize.

synchronize: how kubelet decides what to evict

The synchronize method is long, so the only sane way to read it is to follow the main flow rather than every branch.

A simple reading strategy helps:

  1. ignore nonessential conditionals on the first pass
  2. ignore debug logging
  3. read the method names first before diving into implementations
  4. summarize each block and connect the flow

At a high level, the mental model is not complicated. If you had to identify a pod to evict, you would probably do something like this:

  1. get current resource usage for all pods
  2. get the configured thresholds
  3. compare actual state to required state
  4. find the worst offender
  5. evict one pod and re-evaluate

That is basically what happens, though the real code includes several important details.

```go
// pkg/kubelet/eviction/eviction_manager.go:233
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	// ...
	// Get the usage summary for all pods.
	activePods := podFunc()
	updateStats := true
	summary, err := m.summaryProvider.Get(ctx, updateStats)
	if err != nil {
		klog.ErrorS(err, "Eviction manager: failed to get summary stats")
		return nil
	}
	// ...

	// The key pieces here are observations and thresholds: comparing the two
	// tells us whether a threshold has been met.
	// make observations and get a function to derive pod usage stats relative to those observations.
	observations, statsFunc := makeSignalObservations(summary)

	// determine the set of thresholds met independent of grace period
	thresholds = thresholdsMet(thresholds, observations, false)

	// node conditions report true if it has been observed within the transition period window
	nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
	if len(nodeConditions) > 0 {
		klog.V(3).InfoS("Eviction manager: node conditions - transition period not met", "nodeCondition", nodeConditions)
	}

	// determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
	// ...

	// A detail: local volume usage is checked first to see whether it already exceeds its limit.
	// evict pods if there is a resource usage violation from local volume temporary storage
	// If eviction happens in localStorageEviction function, skip the rest of eviction action
	if m.localStorageCapacityIsolation {
		if evictedPods := m.localStorageEviction(activePods, statsFunc); len(evictedPods) > 0 {
			return evictedPods
		}
	}

	// The key step: sort all thresholds by eviction priority to find out whether any is exceeded.
	// rank the thresholds by eviction priority
	sort.Sort(byEvictionPriority(thresholds))
	thresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)
	if !foundAny {
		return nil
	}

	// Another detail: a round of GC runs first; if that relieves the pressure, no eviction is needed.
	// check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
	if m.reclaimNodeLevelResources(ctx, thresholdToReclaim.Signal, resourceToReclaim) {
		return nil
	}

	// After GC, rank the pods; the biggest consumers are evicted first.
	// rank the running pods for eviction for the specified resource
	rank(activePods, statsFunc)

	// Finally, loop to the first pod that can be evicted, wrap it, and return.
	// we kill at most a single pod during each eviction interval
	for i := range activePods {
		pod := activePods[i]
		gracePeriodOverride := int64(0)
		if !isHardEvictionThreshold(thresholdToReclaim) {
			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
		}
		message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc, thresholds, observations)
		var condition *v1.PodCondition
		if utilfeature.DefaultFeatureGate.Enabled(features.PodDisruptionConditions) {
			condition = &v1.PodCondition{
				Type:    v1.DisruptionTarget,
				Status:  v1.ConditionTrue,
				Reason:  v1.PodReasonTerminationByKubelet,
				Message: message,
			}
		}
		if m.evictPod(pod, gracePeriodOverride, message, annotations, condition) {
			metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
			return []*v1.Pod{pod}
		}
	}
	klog.InfoS("Eviction manager: unable to evict any pods from the node")
	return nil
}
```

From this flow, several important behaviors become clear.

1. kubelet first gathers current usage

The method starts by getting the active pods and requesting fresh stats from the summary provider. That summary is the basis for later comparisons.

2. thresholds are evaluated against observations

The critical comparison is between:

  • observations: actual measured resource state
  • thresholds: configured eviction signals and limits

That is the heart of the decision process.
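To make the comparison concrete, here is a small sketch of the idea. The types (signal, observation, threshold) are simplified stand-ins invented for this example, not the kubelet's own; only the signal names and the function name thresholdsMet echo the real code. It assumes a threshold is expressed either as an absolute quantity or as a fraction of capacity, and that it fires when the available amount drops below that minimum.

```go
package main

import "fmt"

// signal names mirror kubelet eviction signals; the types are hypothetical.
type signal string

const (
	memoryAvailable signal = "memory.available"
	nodeFsAvailable signal = "nodefs.available"
)

// observation is what was measured for one signal.
type observation struct {
	available int64 // bytes currently available
	capacity  int64 // total bytes
}

// threshold fires when available drops below its minimum.
type threshold struct {
	sig        signal
	quantity   int64   // absolute minimum in bytes, used when percentage == 0
	percentage float64 // fraction of capacity, e.g. 0.10 for 10%
}

// minimum resolves the threshold to bytes for a given capacity.
func (t threshold) minimum(capacity int64) int64 {
	if t.percentage > 0 {
		return int64(t.percentage * float64(capacity))
	}
	return t.quantity
}

// thresholdsMet returns the thresholds whose signal was observed below the minimum.
func thresholdsMet(thresholds []threshold, obs map[signal]observation) []threshold {
	var met []threshold
	for _, t := range thresholds {
		o, ok := obs[t.sig]
		if !ok {
			continue // no observation for this signal, so it cannot be evaluated
		}
		if o.available < t.minimum(o.capacity) {
			met = append(met, t)
		}
	}
	return met
}

func main() {
	const gi = int64(1) << 30
	obs := map[signal]observation{
		memoryAvailable: {available: 400 << 20, capacity: 8 * gi},
		nodeFsAvailable: {available: 50 * gi, capacity: 100 * gi},
	}
	thresholds := []threshold{
		{sig: memoryAvailable, quantity: 500 << 20}, // memory.available<500Mi
		{sig: nodeFsAvailable, percentage: 0.10},    // nodefs.available<10%
	}
	for _, t := range thresholdsMet(thresholds, obs) {
		fmt.Println("threshold met:", t.sig)
	}
}
```

With these numbers only the memory threshold is met: 400Mi available is below the 500Mi minimum, while 50Gi of free disk is well above 10% of 100Gi.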

3. grace periods matter

Kubernetes distinguishes between thresholds that are merely observed and thresholds whose grace periods have already been satisfied. That difference affects whether the pressure should immediately drive eviction behavior.
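A rough sketch of that distinction follows. It assumes each threshold records when it was first observed as exceeded and only starts driving eviction once it has stayed exceeded for its full grace period. The type and function names are hypothetical, and plain Unix-second integers replace the time values the kubelet uses.

```go
package main

import "fmt"

// gracedThreshold is a hypothetical, simplified eviction threshold.
// gracePeriodSeconds == 0 models a hard threshold.
type gracedThreshold struct {
	name               string
	gracePeriodSeconds int64
}

// metGracePeriod returns the thresholds that have been exceeded for at least
// their grace period. firstObservedAt maps a threshold name to the Unix time
// (seconds) at which it was first seen exceeded; missing entries mean the
// threshold is not currently exceeded.
func metGracePeriod(thresholds []gracedThreshold, firstObservedAt map[string]int64, now int64) []gracedThreshold {
	var ready []gracedThreshold
	for _, t := range thresholds {
		first, ok := firstObservedAt[t.name]
		if !ok {
			continue // not exceeded at all
		}
		if now-first < t.gracePeriodSeconds {
			continue // exceeded, but not for long enough yet
		}
		ready = append(ready, t)
	}
	return ready
}

func main() {
	now := int64(1_700_000_000)
	thresholds := []gracedThreshold{
		{name: "memory.available<1Gi (soft)", gracePeriodSeconds: 90},
		{name: "memory.available<500Mi (hard)", gracePeriodSeconds: 0},
	}
	firstObservedAt := map[string]int64{
		"memory.available<1Gi (soft)":   now - 30, // exceeded for 30s of a 90s grace period
		"memory.available<500Mi (hard)": now,      // exceeded just now
	}
	for _, t := range metGracePeriod(thresholds, firstObservedAt, now) {
		fmt.Println("drives eviction:", t.name)
	}
}
```

Here the soft threshold is exceeded but still inside its grace period, so only the hard threshold drives eviction in this pass.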

4. local storage can short-circuit the rest

A notable detail: if local temporary storage has exceeded limits, localStorageEviction gets checked earlier, and if that path already evicts something, the rest of the eviction logic is skipped for that cycle.

5. thresholds are ranked by eviction priority

If multiple thresholds are violated, kubelet sorts them and selects the reclaimable one with the highest eviction priority.

6. kubelet tries node-level reclamation before evicting pods

This is an important operational detail. Kubernetes does not jump straight to killing user pods. It first tries node-level reclamation, such as garbage collection. If that relieves the pressure enough, no pod eviction is needed.

7. only then are pods ranked

If pressure remains after reclamation attempts, kubelet ranks the running pods for eviction according to the target resource.
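The ranking itself is not shown in the excerpt above. As an illustration only, the sketch below orders pods for memory pressure using the three criteria described in the Kubernetes node-pressure eviction documentation: whether usage exceeds the request, then Pod priority, then usage relative to the request. The podUsage type and rankForMemory function are hypothetical names, and the real comparison code may differ in detail.

```go
package main

import (
	"fmt"
	"sort"
)

// podUsage is a hypothetical flattened view of one pod's memory situation.
type podUsage struct {
	name     string
	priority int32
	request  int64 // memory request in MiB
	usage    int64 // memory working set in MiB
}

// rankForMemory sorts pods so that the first element is the first eviction candidate.
func rankForMemory(pods []podUsage) {
	sort.SliceStable(pods, func(i, j int) bool {
		a, b := pods[i], pods[j]
		aOver, bOver := a.usage > a.request, b.usage > b.request
		if aOver != bOver {
			return aOver // pods above their request go first
		}
		if a.priority != b.priority {
			return a.priority < b.priority // then lower priority first
		}
		return a.usage-a.request > b.usage-b.request // then the largest overshoot first
	})
}

func main() {
	pods := []podUsage{
		{name: "db", priority: 0, request: 1000, usage: 900},
		{name: "api", priority: 1000, request: 100, usage: 500},
		{name: "cache", priority: 0, request: 200, usage: 250},
		{name: "batch", priority: 0, request: 100, usage: 300},
	}
	rankForMemory(pods)
	for i, p := range pods {
		fmt.Printf("%d. %s\n", i+1, p.name)
	}
}
```

In this example "db" uses the most memory in absolute terms but stays within its request, so it ranks last. Setting realistic requests is what keeps a large but well-behaved pod away from the front of the queue.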

8. only one pod is evicted per eviction interval

This is one of the most practical takeaways from the code:

kubelet evicts at most one pod in each eviction interval.

It does not try to remove every noncompliant pod at once. That makes sense: evicting a single heavy offender may already free enough resources for the remaining pods to stay.
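The overall shape of one pass (reclaim first, then evict at most one pod, then re-evaluate on the next pass) can be sketched as follows. Everything here is a toy model: the node type, its fields, and syncOnce are invented for illustration, and freed memory is credited instantly, whereas the kubelet re-measures on the next pass.

```go
package main

import "fmt"

// node is a hypothetical stand-in for the state inspected on each pass.
type node struct {
	availableMi   int64
	thresholdMi   int64
	reclaimableMi int64            // what node-level cleanup could free
	pods          []string         // already ranked, first candidate first
	podUsageMi    map[string]int64 // memory freed when a pod is evicted
}

func (n *node) underPressure() bool { return n.availableMi < n.thresholdMi }

// syncOnce runs one pass and returns the evicted pod's name, or "" if none.
func (n *node) syncOnce() string {
	if !n.underPressure() {
		return ""
	}
	// Step 1: try node-level reclamation before touching any pod.
	n.availableMi += n.reclaimableMi
	n.reclaimableMi = 0
	if !n.underPressure() {
		return ""
	}
	// Step 2: evict at most one pod, the first in the ranking.
	if len(n.pods) == 0 {
		return ""
	}
	victim := n.pods[0]
	n.pods = n.pods[1:]
	n.availableMi += n.podUsageMi[victim]
	return victim
}

func main() {
	n := &node{
		availableMi:   100,
		thresholdMi:   500,
		reclaimableMi: 50,
		pods:          []string{"batch", "cache", "api"},
		podUsageMi:    map[string]int64{"batch": 200, "cache": 300, "api": 400},
	}
	for pass := 1; ; pass++ {
		victim := n.syncOnce()
		if victim == "" {
			fmt.Printf("pass %d: nothing to evict, %dMi available\n", pass, n.availableMi)
			break
		}
		fmt.Printf("pass %d: evicted %s, %dMi available\n", pass, victim, n.availableMi)
	}
}
```

Two passes evict "batch" and then "cache"; by the third pass the pressure is gone and "api" survives, which is the benefit of re-evaluating after every single eviction.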

What metrics are actually collected?

To understand what data synchronize relies on, it helps to look at the pod stats structure:

```go
// vendor/k8s.io/kubelet/pkg/apis/stats/v1alpha1/types.go:107
type PodStats struct {
	PodRef           PodReference     `json:"podRef"`
	StartTime        metav1.Time      `json:"startTime"`
	Containers       []ContainerStats `json:"containers" patchStrategy:"merge" patchMergeKey:"name"`
	CPU              *CPUStats        `json:"cpu,omitempty"`
	Memory           *MemoryStats     `json:"memory,omitempty"`
	Network          *NetworkStats    `json:"network,omitempty"`
	VolumeStats      []VolumeStats    `json:"volume,omitempty" patchStrategy:"merge" patchMergeKey:"name"`
	EphemeralStorage *FsStats         `json:"ephemeral-storage,omitempty"`
	ProcessStats     *ProcessStats    `json:"process_stats,omitempty"`
	Swap             *SwapStats       `json:"swap,omitempty"`
}
```

The summary includes the expected pod-level signals:

  • CPU
  • memory
  • network
  • volumes
  • ephemeral storage
  • process stats
  • swap

Node-level statistics are also part of the summary. Those metrics together provide the input for threshold evaluation and ranking.

Direct answers to the three original questions

When are pod resource limits checked?

There are two trigger modes:

  • event-based memory checking: when kernel memcg notification is enabled, a cgroup memory threshold notification triggers an immediate handler call, which leads into synchronize
  • periodic checking: kubelet also runs synchronize in a loop, sleeping for monitoringInterval (10 seconds by default) after any pass that evicts nothing

So in practice, some checks happen immediately on a memory event, while others happen on the regular monitoring cadence.

When is a pod evicted?

A pod is evicted once a check determines that eviction conditions are met and node-level reclamation cannot resolve the pressure first.

In real terms, the timing depends on the check path:

  • immediately after a memory-threshold event
  • or during the next monitoring cycle

What is the eviction strategy?

The kubelet does not evict every problematic pod in one pass.

Its behavior is closer to this:

  • identify the active pressure condition
  • rank candidates
  • choose the pod that is the worst fit under the current pressure
  • evict only one pod in that interval
  • re-evaluate afterward

Operationally, that means the biggest resource consumer under the relevant pressure is the one most likely to go first.

QoS is easy to overlook, but it matters

QoS is one of those Kubernetes concepts that many people skip early on and later regret skipping.

When a node comes under resource pressure, Kubernetes does not treat all pods equally. QoS class influences survival odds.

A simplified view is enough here:

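  1. BestEffort: no container sets any CPU or memory request or limit. No guarantees, easiest to kill.
  2. Guaranteed: every container sets CPU and memory limits, and its requests equal those limits. Strongest protection.
  3. Burstable: everything in between. At least one request or limit is set, but the pod does not qualify as Guaranteed. Some guarantees, but not as strong as Guaranteed.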
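The classification can be written down as a short function. This is a simplified sketch: the resources type and qosClass function are made up for illustration, zero stands for "not set", and only regular containers with CPU and memory are considered. It does model one detail that is easy to forget: a request that is omitted while a limit is set defaults to the limit.

```go
package main

import "fmt"

// resources holds one container's CPU (millicores) and memory (bytes) settings.
// Zero means "not set" in this sketch.
type resources struct {
	cpuRequest, cpuLimit int64
	memRequest, memLimit int64
}

// effectiveRequest applies the defaulting rule: an omitted request takes the limit's value.
func effectiveRequest(request, limit int64) int64 {
	if request == 0 {
		return limit
	}
	return request
}

// qosClass classifies a pod from the resource settings of its containers.
func qosClass(containers []resources) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		if c.cpuRequest != 0 || c.cpuLimit != 0 || c.memRequest != 0 || c.memLimit != 0 {
			anySet = true
		}
		cpuOK := c.cpuLimit != 0 && effectiveRequest(c.cpuRequest, c.cpuLimit) == c.cpuLimit
		memOK := c.memLimit != 0 && effectiveRequest(c.memRequest, c.memLimit) == c.memLimit
		if !cpuOK || !memOK {
			guaranteed = false // one container without matching limits is enough
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	fmt.Println(qosClass([]resources{{}}))
	fmt.Println(qosClass([]resources{{cpuRequest: 500, cpuLimit: 500, memRequest: 1 << 28, memLimit: 1 << 28}}))
	fmt.Println(qosClass([]resources{{cpuRequest: 250, cpuLimit: 500, memRequest: 1 << 27, memLimit: 1 << 28}}))
}
```

Note that a single container without limits makes the whole pod Burstable, even if every other container has requests equal to limits.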

A few practical rules follow from that:

  1. always set both requests and limits
  2. in many cases, Burstable is the most practical default, but avoid setting requests and limits too high
  3. if losing a service would be catastrophic, consider Guaranteed

In production, increasing resources is usually easier than reducing them. That is why oversized values often persist for a long time.

Soft vs hard eviction

This is another detail that is conceptually simple but easy to miss:

  • soft eviction: has a grace period
  • hard eviction: no grace period; the pod is terminated immediately

The synchronize code reflects this distinction through gracePeriodOverride: it stays 0 for a hard threshold and is set to m.config.MaxPodGracePeriodSeconds for a soft one.
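Reduced to its essentials, that decision looks like the sketch below. The evictionThreshold type and both function names are hypothetical, and a zero grace period is assumed to be what marks a threshold as hard.

```go
package main

import "fmt"

// evictionThreshold is a hypothetical, simplified threshold.
type evictionThreshold struct {
	signal             string
	gracePeriodSeconds int64 // 0 marks a hard threshold in this sketch
}

// isHard reports whether the threshold has no grace period.
func isHard(t evictionThreshold) bool { return t.gracePeriodSeconds == 0 }

// podGracePeriodOverride returns the termination grace period, in seconds,
// imposed on a pod evicted because of threshold t.
func podGracePeriodOverride(t evictionThreshold, maxPodGracePeriodSeconds int64) int64 {
	if isHard(t) {
		return 0 // hard eviction: terminate immediately
	}
	return maxPodGracePeriodSeconds // soft eviction: allow a bounded graceful shutdown
}

func main() {
	hard := evictionThreshold{signal: "memory.available", gracePeriodSeconds: 0}
	soft := evictionThreshold{signal: "memory.available", gracePeriodSeconds: 90}
	fmt.Println("hard:", podGracePeriodOverride(hard, 30))
	fmt.Println("soft:", podGracePeriodOverride(soft, 30))
}
```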

What this design is worth learning from

Two implementation ideas are especially useful beyond Kubernetes itself.

1. Combining asynchronous events with periodic checks

managerImpl.Start is a solid example of how to structure a manager that has to respond to both:

  • immediate asynchronous signals
  • scheduled background monitoring

That pattern appears in many systems outside kubelet too.
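Here is one way the pattern might look in a standalone program. It is a variant, not a copy, of what managerImpl.Start does: the kubelet uses separate goroutines and time.Sleep, while this sketch funnels both triggers through a single select loop with a ticker. The monitor type and its methods are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// monitor counts how often its check ran, per trigger.
type monitor struct {
	mu     sync.Mutex
	counts map[string]int
}

func newMonitor() *monitor { return &monitor{counts: map[string]int{}} }

// check stands in for an expensive state evaluation such as synchronize.
func (m *monitor) check(trigger string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[trigger]++
}

// count reports how many checks a trigger has caused so far.
func (m *monitor) count(trigger string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.counts[trigger]
}

// run reacts to events immediately and also checks on every tick,
// until stop is closed.
func (m *monitor) run(events <-chan struct{}, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-events:
			m.check("event")
		case <-ticker.C:
			m.check("periodic")
		}
	}
}

func main() {
	m := newMonitor()
	events := make(chan struct{})
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		m.run(events, 20*time.Millisecond, stop)
		close(done)
	}()

	events <- struct{}{} // an asynchronous signal arrives
	time.Sleep(100 * time.Millisecond)
	close(stop)
	<-done
	fmt.Println("event checks:", m.count("event"))
	fmt.Println("periodic checks:", m.count("periodic"))
}
```

The periodic path acts as a safety net: even if an event is never delivered, the state is still re-evaluated within one interval.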

2. Using cgroup notifications instead of wasteful polling

The cgroup memory notification path is elegant because it avoids turning everything into a constant loop. Register an event with the kernel, wait via epoll, and only react when something meaningful happens.

A small but clean use of the factory pattern

The NotifierFactory abstraction is also worth noting. It is a straightforward example of a factory interface used to hide platform-specific creation details.

```go
// NotifierFactory creates CgroupNotifer
type NotifierFactory interface {
	// NewCgroupNotifier creates a CgroupNotifier that creates events when the threshold
	// on the attribute in the cgroup specified by the path is crossed.
	NewCgroupNotifier(path, attribute string, threshold int64) (CgroupNotifier, error)
}
```
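One payoff of such an interface is that code depending on it can be exercised without a real cgroup. The sketch below shows the idea with trimmed-down, lower-case copies of the interfaces (the notifier here only has Start). The fakeNotifier, fakeFactory, and countEvents names are hypothetical and exist only for this example.

```go
package main

import "fmt"

// cgroupNotifier and notifierFactory are trimmed-down versions of the
// interfaces discussed above, reduced to what this example needs.
type cgroupNotifier interface {
	Start(eventCh chan<- struct{})
}

type notifierFactory interface {
	NewCgroupNotifier(path, attribute string, threshold int64) (cgroupNotifier, error)
}

// fakeNotifier sends a fixed number of events and then returns.
type fakeNotifier struct{ events int }

func (f *fakeNotifier) Start(eventCh chan<- struct{}) {
	for i := 0; i < f.events; i++ {
		eventCh <- struct{}{}
	}
}

// fakeFactory records what was requested and hands out fake notifiers.
type fakeFactory struct {
	events   int
	requests []string
}

func (f *fakeFactory) NewCgroupNotifier(path, attribute string, threshold int64) (cgroupNotifier, error) {
	f.requests = append(f.requests, fmt.Sprintf("%s/%s threshold=%d", path, attribute, threshold))
	return &fakeNotifier{events: f.events}, nil
}

// countEvents is the code under test: it depends only on the factory interface,
// so it never needs to know whether a real cgroup is behind it.
func countEvents(factory notifierFactory, path string, threshold int64) (int, error) {
	n, err := factory.NewCgroupNotifier(path, "memory.usage_in_bytes", threshold)
	if err != nil {
		return 0, err
	}
	ch := make(chan struct{})
	go func() {
		n.Start(ch)
		close(ch) // the fake is finite, so the consumer loop below terminates
	}()
	count := 0
	for range ch {
		count++
	}
	return count, nil
}

func main() {
	factory := &fakeFactory{events: 3}
	got, err := countEvents(factory, "/sys/fs/cgroup/memory/kubepods", 1<<30)
	fmt.Println("events:", got, "err:", err)
	fmt.Println("requested:", factory.requests)
}
```

The same seam is what lets a non-Linux build substitute an unsupported implementation without changing any caller.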

This is one of the more practical benefits of reading mature source code: if you are unsure how a pattern should look in real software, production code usually teaches faster than abstract explanations.

A note on CPU limits: they are not as straightforward as memory

CPU behaves differently from memory, and that is why it does not fit into the same notification story so neatly.

Memory is easier to reason about as a thresholded quantity. CPU limits are tied to scheduling behavior over time slices. Kubernetes enforces CPU limits through CFS bandwidth control, which grants a quota of CPU time per enforcement period (100 ms by default), so once a workload has used up its quota within a period, it is throttled until the next period begins.

That creates a few practical consequences:

  • CPU limits can be sensitive to short spikes
  • setting a CPU limit too low can cause frequent throttling
  • behavior can be affected by kernel-version quirks and historical bugs

So compared with memory, CPU limits tend to feel less intuitive and more fragile in practice.
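A little arithmetic shows why short spikes hurt. The sketch below assumes the default 100 ms period and the usual conversion of a CPU limit in millicores into a quota of microseconds per period. The function names are made up for illustration, and the model ignores multi-threading and kernel-specific details.

```go
package main

import "fmt"

// defaultPeriodMicros is the default CFS enforcement period: 100 ms.
const defaultPeriodMicros = 100000

// quotaMicros converts a CPU limit in millicores into the CPU time,
// in microseconds, that may be consumed per period.
func quotaMicros(milliCPU, periodMicros int64) int64 {
	return milliCPU * periodMicros / 1000
}

// deferredMicros returns how much of a burst's CPU demand does not fit into
// the current period and has to wait for a later one.
func deferredMicros(demandMicros, quota int64) int64 {
	if demandMicros <= quota {
		return 0
	}
	return demandMicros - quota
}

func main() {
	quota := quotaMicros(500, defaultPeriodMicros) // limit of 500m
	fmt.Println("quota per period (µs):", quota)

	// A request that needs 80 ms of CPU time in one go.
	fmt.Println("deferred to later periods (µs):", deferredMicros(80000, quota))
}
```

With a 500m limit the container gets 50 ms of CPU time per 100 ms period. A single-threaded burst needing 80 ms runs for 50 ms, sits throttled for the rest of the period, and finishes in the next one, even if the node has idle CPU the whole time.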

For many real workloads, especially when they are not strongly CPU-bound, the safer approach is often to start with a conservative experience-based value and then tune it with load testing.

If there is one simple impression to keep in mind, it is this: CPU limits are more complicated and less precise than they first appear.