Yelp's S3 Log Management Blueprint: Scaling Access Logs Efficiently

Alps Wang

Dec 14, 2025

Deconstructing Yelp's Log Pipeline

Yelp's approach is a practical answer to a common cloud problem: managing large volumes of S3 server-access logs. Its innovation lies in how it combines established technologies (S3, Athena, and Glue) into a cost-effective, performant log-processing pipeline, and the key takeaway is the use of Parquet for compact storage and faster queries. The article says less about the operational side of running such a system: handling schema evolution, dealing with data corruption, and the extra latency that compaction may introduce. It highlights the benefits of the approach, but a fuller discussion of the trade-offs, especially the added complexity in the overall architecture, would strengthen it.

Key Points

  • Yelp built a scalable, cost-efficient pipeline for processing S3 server-access logs by converting them into compact, Parquet-formatted archives.
  • The architecture uses the AWS Glue Data Catalog, scheduled batch jobs, Lambda functions, and partition-projection-based tables for robust, automated log ingestion.
  • The system supports key operational use cases such as permission debugging, cost attribution, incident investigation, and data retention analysis.
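
To make the partition-projection point above more concrete, here is a minimal sketch of how an Athena table over Parquet log archives might be declared so that no partitions ever need to be registered by hand. It is not Yelp's actual schema: the function name `projection_table_ddl`, the table name, the bucket path, the column subset, and the choice of a daily `day` partition are all assumptions made for illustration. The `projection.*` and `storage.location.template` table properties are the ones Athena documents for partition projection.

```python
from datetime import date

# Hypothetical subset of columns; a real access-log table has many more fields.
COLUMNS = [
    ("bucket_name", "STRING"),
    ("requester", "STRING"),
    ("operation", "STRING"),
    ("object_key", "STRING"),
    ("http_status", "INT"),
    ("bytes_sent", "BIGINT"),
]


def projection_table_ddl(table: str, location: str, start: date) -> str:
    """Build Athena DDL for a Parquet table whose daily partitions are projected."""
    base = location.rstrip("/")
    props = {
        "projection.enabled": "true",
        # Assumed layout: one prefix per day, e.g. <location>/2024/01/31/
        "projection.day.type": "date",
        "projection.day.format": "yyyy/MM/dd",
        "projection.day.range": start.strftime("%Y/%m/%d") + ",NOW",
        "projection.day.interval": "1",
        "projection.day.interval.unit": "DAYS",
        "storage.location.template": base + "/${day}",
    }
    cols = ",\n".join(f"  `{name}` {typ}" for name, typ in COLUMNS)
    tbl_props = ",\n".join(f"  '{k}' = '{v}'" for k, v in props.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        "PARTITIONED BY (`day` STRING)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{base}/'\n"
        f"TBLPROPERTIES (\n{tbl_props}\n)"
    )


if __name__ == "__main__":
    # Placeholder names; substitute your own table and bucket.
    print(projection_table_ddl(
        "s3_access_logs", "s3://example-log-archive/parquet/", date(2024, 1, 1)
    ))
```

With a table declared this way, Athena computes partition locations from the date range at query time, so a filter such as `WHERE day = '2024/01/31'` reads only that day's prefix and the ingestion jobs do not have to add partitions to the catalog.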

📖 Source: Yelp Publishes Blueprint for Managing S3 Server-Access Logs at Massive Scale