Airbnb's OpenTelemetry Metrics: 100M Samples/Sec
Alps Wang
Apr 14, 2026
OpenTelemetry Powers Airbnb's Metrics at Scale
Airbnb's detailed account of migrating its high-volume metrics pipeline to OpenTelemetry, OTLP, and VictoriaMetrics is a masterclass in modern observability infrastructure. The key insight is the strategic decision to migrate the collection layer first, which made the later transition of downstream tooling smoother. The drop in CPU time spent on metrics processing in JVM services, from 10% to under 1% after replacing StatsD over UDP with OTLP, is a compelling demonstration of the protocol's efficiency and reliability. Switching high-cardinality services to delta temporality to relieve memory pressure, despite the trade-off of potential data gaps, is a pragmatic response to an engineering problem at hyperscale. The careful evaluation of vmagent, chosen for its aggregation capabilities, horizontal sharding, and manageable codebase, reflects a well-researched and technically grounded decision-making process. Finally, the handling of counter resets through the 'zero injection' solution inside vmagent shows a deep understanding of Prometheus semantics and a commitment to correctness.
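
To make the delta temporality trade-off concrete, here is a minimal sketch of why it eases memory pressure in high-cardinality services. It is not Airbnb's code or the OpenTelemetry SDK; the class names `CumulativeCounterStore` and `DeltaCounterStore` are hypothetical, and series are identified by a plain tuple of label pairs for simplicity.

```python
from collections import defaultdict


class CumulativeCounterStore:
    """Hypothetical client-side store using cumulative temporality.

    Every series ever observed stays in memory, because each export
    must report the running total since process start.
    """

    def __init__(self):
        self.totals = defaultdict(int)

    def add(self, labels, n=1):
        self.totals[labels] += n

    def export(self):
        # State is retained after export: memory grows with cardinality.
        return dict(self.totals)


class DeltaCounterStore:
    """Hypothetical client-side store using delta temporality.

    Each export reports only the change since the previous export,
    so per-series state can be dropped once it has been flushed.
    """

    def __init__(self):
        self.pending = defaultdict(int)

    def add(self, labels, n=1):
        self.pending[labels] += n

    def export(self):
        # Hand off the batch and start from an empty map.
        batch, self.pending = dict(self.pending), defaultdict(int)
        return batch
```

After many export intervals with constantly changing label values, the cumulative store still holds every series it has ever seen, while the delta store holds only what arrived since the last flush. The cost is the one the article notes: if a delta batch is lost in transit, that increment is gone, whereas a cumulative total would be corrected by the next export.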
However, while the article highlights significant gains, the initial memory pressure and increased garbage collection that high-cardinality services saw with OTLP warrant further investigation. Knowing which configurations or metric types triggered the problem would help other organizations facing similar challenges. The article also points, implicitly, to the complexity of managing distributed systems and the ongoing evolution of observability tooling. The need for custom aggregation logic, even with a tool like vmagent, suggests that open source offers flexibility but still demands significant engineering effort and expertise to tailor to an organization's needs. The comparison with Flipkart and Shopify is useful context, but it could go deeper into their specific technical choices and challenges around aggregation and data modeling. Running the vmagent aggregators as Kubernetes StatefulSets is practical, but it ties the pipeline to Kubernetes orchestration. Overall, this is an exceptional case study for organizations working to scale their observability infrastructure, particularly those moving away from legacy, proprietary, or less efficient monitoring solutions.
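
The combination of horizontal sharding and StatefulSets can be illustrated with a short sketch. It assumes, and this is an assumption rather than something the article confirms in detail, that a routing layer hashes each output series to a fixed aggregator so that all samples for that series are summed in one place. The function names, the dropped `pod` label, and the `vmagent-aggregator` StatefulSet name are hypothetical.

```python
import hashlib


def series_key(name, labels, drop=("pod",)):
    """Build a stable key for the aggregated output series.

    Labels listed in `drop` (hypothetical per-instance labels) are removed
    first, so samples from different pods collapse into the same series.
    """
    kept = sorted((k, v) for k, v in labels.items() if k not in drop)
    return name + "{" + ",".join(f"{k}={v}" for k, v in kept) + "}"


def shard_for(key, num_shards):
    """Map a series key to a shard index with a process-independent hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def aggregator_host(key, num_shards, statefulset="vmagent-aggregator"):
    """Return the pod name that owns this series.

    StatefulSet pods are named <statefulset>-<ordinal>, which gives each
    shard the stable identity that hash-based routing relies on.
    """
    return f"{statefulset}-{shard_for(key, num_shards)}"
```

Because the hash depends only on the series name and the labels that survive aggregation, two samples that differ only by `pod` are routed to the same aggregator. A plain modulo is used here for brevity; it reshuffles most series when the shard count changes, so a production router would more likely use consistent hashing.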
Key Points
- Airbnb migrated its high-volume metrics pipeline from StatsD/Veneur to OpenTelemetry Protocol (OTLP), OpenTelemetry Collector, and VictoriaMetrics' vmagent.
- The new system ingests over 100 million samples per second in production.
- Key benefits include a significant reduction in CPU time spent on metrics processing in JVM services (from 10% to <1%) and the elimination of packet loss risk by using TCP instead of UDP.
- High-cardinality services experienced memory pressure, which was resolved by switching to delta temporality.
- VictoriaMetrics' vmagent was chosen for its built-in streaming aggregation, horizontal sharding, and manageable codebase, forming a two-layer aggregation architecture.
- A novel 'zero injection' technique was implemented in vmagent to address Prometheus counter reset semantics, ensuring accurate reporting of low-frequency events.
- The migration resulted in approximately an order of magnitude cost reduction compared to the previous vendor-based architecture.
- The centralized aggregation tier also serves as a general-purpose transformation layer for managing problematic metrics.
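
The 'zero injection' point can be illustrated with a small simulation. This is a sketch of the general idea under stated assumptions, not vmagent's implementation: `increase` is a simplified stand-in for Prometheus-style counter math (no extrapolation, no time windows), and `with_zero_injection` is a hypothetical helper that emits a zero sample before the first value of a newly seen series.

```python
def increase(samples):
    """Simplified counter increase over a list of consecutive samples.

    Sums the differences between neighbouring samples. A drop in value is
    treated as a counter reset, so the new value counts as the increase.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def with_zero_injection(samples):
    """Prepend a zero sample to a newly created series (hypothetical helper)."""
    return [0.0, *samples] if samples else []


# A low-frequency event fires once, so the series is born with value 1.
first_seen = [1.0]

# Without a previous sample there is nothing to diff against.
lost = increase(first_seen)

# With an injected zero, the first increment becomes visible.
recovered = increase(with_zero_injection(first_seen))
```

Here `lost` is 0.0 and `recovered` is 1.0: a counter that first appears with a non-zero value contributes nothing to the computed increase until a second sample arrives, which matters most for events that happen rarely.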

📖 Source: Airbnb Migrates High-Volume Metrics Pipeline to OpenTelemetry
