Time-Series Storage: Unlock Cost & Performance

Optimizing Time-Series Data Storage

The article gives a solid grounding in time-series storage design, demonstrating the trade-offs between flat and normalized relational schemas and introducing columnar storage with Parquet. Its emphasis on measurable results, obtained with common tools such as PostgreSQL and Parquet, is a significant strength and makes the concepts tangible. The discussions of schema evolution with JSONB and of the implications of cardinality are particularly relevant to modern data engineering. The article shows convincingly that database-agnostic design decisions largely determine cost and performance.

However, although the article explores PostgreSQL and Parquet, it would benefit from a fuller comparison with dedicated time-series databases (TSDBs) such as InfluxDB, TimescaleDB, or Prometheus, which are purpose-built for this domain and often offer more advanced optimizations out of the box. The article mentions InfluxDB and Prometheus only in passing, in connection with cardinality; a closer look at how these specialized systems handle the same design choices (their internal indexing, compression, and partitioning strategies) would give developers choosing a storage solution a more complete picture. The limitations of PostgreSQL for massive time-series workloads are left implicit; stating them explicitly would frame the advantages of columnar formats and dedicated TSDBs more clearly. The article also assumes a relatively static set of query patterns when deciding which JSONB fields to index, an assumption that may not hold in highly dynamic analytical environments.

Key Points

Normalizing series identity into a separate metadata table with compact IDs can reduce time-series storage by approximately 42% by avoiding repeating dimension strings.
High-cardinality fields like request IDs should be excluded from series identity; otherwise every new value creates a new series, and storage and indexing costs grow linearly with the number of distinct values.
Storing series dimensions as flexible JSONB with targeted indexes allows for schema evolution but requires careful indexing policies to prevent sprawl and type drift.
Time partitioning enables O(1) data expiration and partition pruning but can create write hotspots; a second axis (series identity) can distribute writes.
Downsampling significantly reduces row counts (e.g., 720x reduction from 5s to 1h resolution) by retaining full resolution only for recent data and using pre-aggregated rollups for older data.
Columnar storage, particularly with formats like Apache Parquet, offers substantial storage benefits through compression and dictionary encoding, even for flatter logical models.
Apache Iceberg provides a robust table format over Parquet on object storage, offering ACID transactions, schema evolution, and broad engine support without data copying.
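
Two of the key points above, normalizing series identity and downsampling, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the dimension names (host, region, metric), the synthetic 5-second samples, and the helper names normalize and rollup_hourly are hypothetical and do not come from the article. It does not attempt to reproduce the article's 42% storage figure; it only shows the shape of the two transformations and the 720x row-count arithmetic.

```python
from collections import defaultdict


def normalize(rows):
    """Split flat rows into a series metadata table and compact fact rows.

    Each distinct combination of dimension strings is stored once and
    mapped to a small integer ID; fact rows carry only that ID.
    """
    series_ids = {}  # (host, region, metric) -> compact integer ID
    facts = []       # (series_id, ts, value)
    for row in rows:
        key = (row["host"], row["region"], row["metric"])
        # Assign the next ID on first sight, reuse it afterwards.
        series_id = series_ids.setdefault(key, len(series_ids) + 1)
        facts.append((series_id, row["ts"], row["value"]))
    return series_ids, facts


def rollup_hourly(facts):
    """Pre-aggregate fact rows into one summary row per series per hour."""
    buckets = defaultdict(list)
    for series_id, ts, value in facts:
        hour_start = ts - ts % 3600  # truncate epoch seconds to the hour
        buckets[(series_id, hour_start)].append(value)
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in buckets.items()
    }


# Synthetic data (hypothetical): two hosts, one hour of 5-second samples each.
rows = [
    {"host": host, "region": "eu-west", "metric": "cpu",
     "ts": ts, "value": float(ts % 60)}
    for host in ("web-1", "web-2")
    for ts in range(0, 3600, 5)
]

series_ids, facts = normalize(rows)
rollups = rollup_hourly(facts)

print(len(series_ids))              # 2 metadata rows hold all dimension strings
print(len(facts), len(rollups))     # 1440 raw rows vs. 2 hourly rollup rows
print(len(facts) // len(rollups))   # 720, i.e. 3600 s / 5 s
```

In this sketch the dimension strings appear once per series in series_ids rather than once per sample, and each hourly rollup row keeps count, min, max, and average so that common aggregate queries over older data can still be answered without the raw samples.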

📖 Source: Article: Time-Series Storage: Design Choices That Shape Cost and Performance
