Scribe: Transporting petabytes per hour via a distributed, buffered queueing system

Our hardware infrastructure comprises millions of machines, all of which generate logs that we need to process, store, and serve. The total size of these logs is several petabytes every hour. The o…

Manolis Karpathiotakis
16 min readadvanced
--
View Original

Overview

Scribe is a distributed, buffered queueing system designed to efficiently transport petabytes of logs generated by millions of machines at Facebook. The article details Scribe's architecture, its operational capabilities, and the evolution of its components to handle high input and output rates while ensuring low latency and high throughput.

What You'll Learn

1

How to implement a distributed logging system using Scribe

2

Why Scribe can handle input rates exceeding 2.5 terabytes per second

3

When to use the Write Service for log transport to improve latency

4

How to configure log retention periods in Scribe

Prerequisites & Requirements

  • Understanding of distributed systems and logging mechanisms
  • Familiarity with Thrift and streaming APIs(optional)

Key Questions Answered

How does Scribe ensure low latency and high throughput in log processing?
Scribe achieves low latency and high throughput by utilizing a distributed architecture that allows for high input rates exceeding 2.5 terabytes per second and output rates exceeding 7 terabytes per second. It employs a Write Service that dynamically balances loads and a buffered storage system using LogDevice to ensure durability and availability.
What are the main components of Scribe's architecture?
Scribe's architecture consists of several key components including the Producer library, Scribed daemon for local log handling, Write Service for distributing logs to back-end servers, and LogDevice for durable storage. Each component plays a role in ensuring efficient log transport and processing.
What challenges does Scribe face in log storage and retrieval?
Scribe faces challenges such as maintaining log order due to logs arriving from multiple machines and ensuring write availability during network outages. It addresses these by allowing logs to be stored in memory temporarily and by implementing a flexible read service that can handle partial availability.
How does Scribe manage multitenancy among its users?
Scribe manages multitenancy by enforcing rate limits on write operations and monitoring usage patterns to prevent any single user's behavior from degrading the service quality for others. This ensures that millions of producers and consumers can operate simultaneously without interference.

Key Statistics & Figures

Input rate
exceeds 2.5 terabytes per second
This rate reflects the volume of logs generated by millions of machines at Facebook.
Output rate
exceeds 7 terabytes per second
This indicates the capacity of Scribe to deliver processed logs to various downstream systems.
Log retention period
typically a few days
This retention period is configurable based on user needs for historical analysis.

Technologies & Tools

Backend
Scribe
A distributed queueing system for log transport.
Backend
Logdevice
A distributed store for logs providing durability and availability.
API
Thrift
Used for implementing the Read Service and enabling communication between components.

Key Actionable Insights

1
To optimize log processing, implement the Write Service for direct log writes from Producers to reduce latency.
Using the Write Service can significantly enhance performance by bypassing potential bottlenecks associated with local daemons, especially during high traffic periods.
2
Consider adjusting log retention settings based on your analysis needs to balance storage costs and data availability.
Scribe typically retains logs for a few days; understanding your data access patterns can help you configure retention periods that meet your operational needs.
3
Utilize the Producer and Consumer APIs to simplify log writing and reading processes in your applications.
These APIs provide straightforward methods for interacting with Scribe, making it easier for developers to integrate logging functionality into their systems.

Common Pitfalls

1
Failing to configure appropriate write quotas can lead to service degradation.
Without proper rate limiting, one user's excessive logging can monopolize resources, affecting the performance for others.
2
Neglecting to handle log order can result in confusion during log analysis.
Logs arriving out of order can complicate debugging and monitoring efforts; using stream processing systems can help maintain order.

Related Concepts

Distributed Systems
Logging Mechanisms
Data Processing Pipelines