Pinterest's CPU Zombie Hunt

The Peril of Default Configurations

This article is an excellent deep dive into a complex production issue at Pinterest, and it illustrates challenges in large-scale distributed systems that often go unseen. Its central insight is the identification of 'CPU zombies' caused by a crashlooping, unused AWS agent: a seemingly innocuous default in a base image set off a chain of failures that ran from memory cgroup management in the kernel to the network stack. The detailed breakdown of how saturated CPU cores starved network interrupt handling (the NAPI poll thread) and led to ENA device resets is technically rich and valuable for engineers facing similar network performance issues in Kubernetes environments. Two practical takeaways stand out: move beyond high-level dashboards to per-core analysis, and use tools like mpstat and perf, with the captures visualized in Flamescope. The root cause, a memory cgroup leak from a default agent, underscores a common pitfall: assuming that third-party components are benign or inactive when they are not explicitly configured. The fix was simple in retrospect but took significant diagnostic effort to reach, which reinforces the need for deep system understanding and robust observability.

However, the analysis is retrospective, and that is a limitation. Pinterest is rolling out advanced profiling tools like gProfiler and eBPF-based platforms, but the initial resolution relied on manual, time-consuming captures. This points to a broader industry challenge: achieving real-time, fleet-wide observability for such subtle, intermittent issues. The article would benefit from a more explicit discussion of how the new tools, once fully deployed, would have prevented this specific 'zombie' problem or sped up its diagnosis. Furthermore, although Pinterest did not use the ECS agent, its presence as a default on the AWS Deep Learning AMI implies a potentially widespread risk for other users of that image who may not investigate as thoroughly. The article implicitly calls for greater transparency about, and more granular control over, default agent configurations in cloud provider AMIs. The implications for AI/ML workloads are significant: training stability is paramount, and intermittent failures waste compute resources and extend development cycles.
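To make the per-core point concrete, here is a minimal Python sketch of the kind of check that mpstat performs: it compares two snapshots of the Linux /proc/stat counters and reports cores that are nearly fully busy even when the aggregate looks healthy. This is an illustration, not Pinterest's tooling. The function names and the 95% threshold are assumptions chosen for the example.

```python
from typing import Dict, List


def parse_proc_stat(text: str) -> Dict[str, List[int]]:
    """Map each 'cpu' / 'cpuN' label in /proc/stat text to its jiffy counters."""
    counters: Dict[str, List[int]] = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(value) for value in fields[1:]]
    return counters


def busy_fraction(before: List[int], after: List[int]) -> float:
    """Fraction of elapsed jiffies not spent in idle or iowait.

    Only the first eight fields (user, nice, system, idle, iowait, irq,
    softirq, steal) are summed, because guest time is already counted
    inside user time.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)
    return (total - idle) / total


def saturated_cores(
    before_text: str, after_text: str, threshold: float = 0.95
) -> Dict[str, float]:
    """Return per-core busy fractions at or above the (assumed) threshold."""
    before = parse_proc_stat(before_text)
    after = parse_proc_stat(after_text)
    hot: Dict[str, float] = {}
    for label, counters in after.items():
        if label == "cpu" or label not in before:
            continue  # skip the aggregate line and cores missing a baseline
        fraction = busy_fraction(before[label], counters)
        if fraction >= threshold:
            hot[label] = fraction
    return hot
```

On a Linux node, one would read /proc/stat, wait a second or so, read it again, and pass both texts to saturated_cores. Comparing that result with busy_fraction for the aggregate "cpu" line shows how one pinned core can hide behind a moderate fleet-style average.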

Key Points

Pinterest engineers resolved intermittent CPU starvation and production bottlenecks on their Kubernetes-based platform (PinCompute).
The issue was caused by 'CPU zombies': leaked memory cgroups left behind by a crashlooping, unused AWS ECS agent present in their default Deep Learning AMI.
The leaked cgroups inflated the list of cgroups that the kubelet had to process, saturating individual CPU cores and impacting network interrupt handling.
The root cause was hidden by healthy aggregate CPU utilization, requiring deep-dive per-core analysis using tools like mpstat and perf.
The resolution involved disabling the ECS agent systemd unit and rebooting affected nodes to purge the accumulated cgroups.
The experience highlights the critical importance of scrutinizing default configurations in base images and investing in advanced, continuous profiling for production observability.
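As a rough illustration of how a cgroup build-up like the one in the points above might be spotted, the following Python sketch parses the Linux /proc/cgroups table, which lists a num_cgroups count per controller, and flags an unusually large memory controller count. The source does not say how Pinterest measured the leak, so this approach, the function names, and the 10,000 limit are all assumptions made for the example.

```python
from typing import Dict


def parse_proc_cgroups(text: str) -> Dict[str, int]:
    """Map each controller name in /proc/cgroups text to its num_cgroups value.

    Expected columns: subsys_name, hierarchy, num_cgroups, enabled.
    """
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[0]] = int(fields[2])
    return counts


def memory_cgroups_suspicious(text: str, limit: int = 10_000) -> bool:
    """True when the memory controller count exceeds the (assumed) limit."""
    return parse_proc_cgroups(text).get("memory", 0) > limit
```

A node agent could read /proc/cgroups periodically and alert when memory_cgroups_suspicious returns True. A sensible limit depends on how many pods and containers a node normally runs, so the default here is only a placeholder.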

📖 Source: Pinterest Engineers Eliminate CPU Zombies to Resolve Production Bottlenecks

