Cloudflare's 10x Security Scan Leap

Alps Wang

Jun 13, 2026

Scaling Security Insights: A Deep Dive

Cloudflare's account of scaling Security Insights scanning capacity from 10 to over 100 scans per second is a case study in careful engineering. The article lays out the original problems (infrequent scans, opt-in policies, and system strain) and the reasoning behind each fix. Three changes stand out: batch processing in the checkers, a 'slow lane'/'fast lane' approach to mitigate head-of-line blocking in Kafka, and a hybrid database insert strategy (UNNEST/COPY). API timeouts were resolved by switching to an active-passive database connection model, which addressed the underlying latency directly. The scheduler was also re-architected to schedule zones independently, randomize scheduling times, and apply adaptive rate limiting, which spreads scans more uniformly. The emphasis on understanding the existing system before resorting to brute-force resource scaling is a valuable lesson for any engineering team facing similar challenges.
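
To make the batch-processing idea concrete, here is a minimal Go sketch of a checker that handles a batch of messages with a bounded number of parallel goroutines. It is an illustration under stated assumptions, not Cloudflare's implementation: `Message`, `Result`, `ErrCheckFailed`, `processBatch`, and the `maxParallel` parameter are all hypothetical names.

```go
package main

import (
	"errors"
	"sync"
)

// Message is a hypothetical stand-in for one scan request read from Kafka.
type Message struct {
	ZoneID string
}

// Result pairs a zone with the outcome of its check.
type Result struct {
	ZoneID string
	Err    error
}

// ErrCheckFailed is a hypothetical sentinel error used for illustration.
var ErrCheckFailed = errors.New("check failed")

// processBatch runs check on every message in the batch concurrently,
// keeping at most maxParallel goroutines in flight, and returns the
// results in the same order as the input.
func processBatch(batch []Message, maxParallel int, check func(Message) error) []Result {
	if maxParallel < 1 {
		maxParallel = 1 // avoid blocking forever on a zero-capacity semaphore
	}
	results := make([]Result, len(batch))
	sem := make(chan struct{}, maxParallel) // counting semaphore
	var wg sync.WaitGroup

	for i, msg := range batch {
		wg.Add(1)
		sem <- struct{}{} // wait for a free slot before starting a worker
		go func(i int, msg Message) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			// Each goroutine writes to its own index, so no lock is needed.
			results[i] = Result{ZoneID: msg.ZoneID, Err: check(msg)}
		}(i, msg)
	}
	wg.Wait()
	return results
}
```

A consumer would call `processBatch` once per fetched batch and commit offsets only after it returns. That is where the trade-off noted in the source comes from: a crash mid-batch means the whole batch is reworked.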

While the article is highly informative, it would have benefited from quantifying the impact of each optimization. For instance, knowing what share of the improvement came from batch processing versus the 'slow lane' approach would offer further clarity. More detail on the 'adaptive rate limiting' calculation, and on how it copes with a sudden influx of new accounts, would also strengthen the narrative. The article touches on the trade-offs of batch processing (rework on crash, increased memory), but a more explicit discussion of how these were managed or mitigated would be valuable. Nevertheless, the result is a stable, scalable system that enables enhanced security for millions of users, which makes this a highly recommended read for engineers and architects in the field.

Key Points

  • Cloudflare achieved a 10x increase in global security scanning capacity, from 10 to over 100 scans per second, by optimizing their Security Insights system.
  • Key technical challenges included infrequent scans, opt-in policies, system strain (backlogs, API timeouts, crashes), and head-of-line blocking in Kafka.
  • Solutions involved: batch processing messages with parallel goroutines, implementing a 'slow lane'/'fast lane' consumer group strategy for Kafka, optimizing database inserts with a hybrid UNNEST/COPY approach, and resolving API latency issues by switching to an active-passive database configuration.
  • The scheduler was re-architected for uniform scan distribution through independent zone scheduling, randomized 'last_scheduled_at' times, and adaptive rate limiting.
  • The improvements enabled automatic scanning for all free accounts and increased scanning frequency for all customer tiers, enhancing system stability and allowing for new features like granular on-demand scans.

📖 Source: Scaling Security Insights: how we achieved a 10x increase in global scanning capacity
