Logarithm: A logging engine for AI training workflows and services

Systems and application logs play a key role in operations, observability, and debugging workflows at Meta. Logarithm is a hosted, serverless, multitenant service, used only internally at Meta, tha…

Partha Kanuparthy

Overview

Logarithm is a serverless, multitenant logging engine developed internally at Meta to support AI training workflows and services. It indexes over 100 GB/s of logs in real time and serves thousands of queries per second, while providing strong guarantees on log freshness, completeness, and query latency.

What You'll Learn

1. How to implement logging in AI training workflows using Logarithm
2. Why real-time log indexing is crucial for debugging AI models
3. How to leverage metadata APIs for enhanced log context

Prerequisites & Requirements

  • Understanding of logging mechanisms and AI training workflows
  • Familiarity with logging libraries such as the Google Logging Library (glog) (optional)

Key Questions Answered

How does Logarithm support AI training debugging?
Logarithm enhances AI training debugging by ingesting and indexing high-throughput logs from both systems and model layers. It allows for detailed telemetry capture, enabling engineers to analyze failures without needing to reproduce them, thus saving GPU resources and improving debugging efficiency.
What are the performance capabilities of Logarithm?
Logarithm can index over 100 GB/s of logs in real time and handle thousands of queries per second. It is designed to meet service-level guarantees on log freshness, completeness, durability, and query latency at that scale.
What is the architecture of Logarithm?
Logarithm's architecture includes application processes that emit logs, a host-side agent for parsing, distributed queues for durability, ingestion clusters for processing, and query clusters for handling interactive queries. This design ensures scalability and fault tolerance.
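
The flow described above (host-side agent, durable queue, ingestion, query) can be sketched as a toy, single-process model. This is only an illustration of how the stages hand data to each other, not Logarithm's implementation: the class name `Pipeline`, its methods, and the record layout are all hypothetical, and an in-memory `deque` stands in for the distributed queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Toy in-process stand-in for the stages described above (all names hypothetical)."""

    # Stands in for the distributed queue that provides durability.
    queue: deque = field(default_factory=deque)
    # Stands in for the storage that the ingestion cluster writes to.
    store: list = field(default_factory=list)

    def agent_collect(self, host: str, raw_line: str) -> None:
        """Host-side agent: tag the raw line with its origin and enqueue it."""
        self.queue.append({"host": host, "message": raw_line.rstrip("\n")})

    def ingest(self) -> int:
        """Ingestion: drain the queue into the queryable store; return lines ingested."""
        count = 0
        while self.queue:
            self.store.append(self.queue.popleft())
            count += 1
        return count

    def query(self, substring: str) -> list:
        """Query: return stored records whose message contains the substring."""
        return [record for record in self.store if substring in record["message"]]
```

In this sketch a line becomes visible to `query` only after `ingest` has run, which is the gap that a freshness guarantee bounds in a real system.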

Key Statistics & Figures

Log indexing speed
100+ GB/s
This is the rate at which Logarithm indexes logs in real time.
Query handling capacity
Thousands of queries per second
This reflects Logarithm's ability to support high-throughput query demands.

Technologies & Tools

Programming Language
C++20
Logarithm is implemented in C++20, utilizing modern programming patterns for performance and maintainability.
Logging Library
Google Logging Library (glog)
This library is commonly used at Meta for emitting logs.
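
To make the idea of parsing glog output concrete, here is a minimal Python sketch of splitting a glog-style line into structured fields. It assumes the commonly seen glog prefix layout (severity letter, date, time with microseconds, thread ID, `file:line]`, then the message); the exact prefix depends on the glog version and flags, and the function name `parse_glog_line` is hypothetical. It is not the parser that Logarithm's agent uses.

```python
import re
from typing import Optional

# Assumed prefix layout: "<I|W|E|F><date> <hh:mm:ss.uuuuuu> <thread> <file>:<line>] <msg>"
# The date is accepted as either yyyymmdd or mmdd, since glog versions differ.
GLOG_PREFIX = re.compile(
    r"^(?P<severity>[IWEF])"
    r"(?P<date>\d{8}|\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)

SEVERITIES = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def parse_glog_line(line: str) -> Optional[dict]:
    """Return the structured fields of a glog-style line, or None if the prefix does not match."""
    match = GLOG_PREFIX.match(line.rstrip("\n"))
    if match is None:
        return None  # free-form text; a caller might keep it as an unstructured message
    fields = match.groupdict()
    fields["severity"] = SEVERITIES[fields["severity"]]
    fields["thread"] = int(fields["thread"])
    fields["line"] = int(fields["line"])
    return fields
```

For example, `parse_glog_line("E20240115 10:30:45.123456 12345 trainer.cc:42] loss is NaN")` yields a dictionary whose `severity` is `"ERROR"`, `file` is `"trainer.cc"`, and `line` is `42`. The sample line is made up for illustration.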

Key Actionable Insights

1. Use Logarithm's metadata API to attach contextual information to logs, such as rank IDs in distributed training jobs.
This makes it easier to tell which rank emitted a given line and to spot discrepancies between ranks during model training.
2. Implement real-time log indexing to facilitate immediate access to log data for troubleshooting.
By indexing logs as they are generated, teams can respond to issues as they arise, reducing downtime and improving the reliability of AI training workflows.
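
Both insights can be illustrated with a small in-memory sketch: each line is indexed the moment it arrives, along with whatever key-value metadata accompanies it, so it is searchable immediately. This is a simplified inverted index written for illustration only. `LiveIndex`, `add`, and `search` are hypothetical names, not Logarithm's API, and a production index would also deal with persistence, sharding, and eviction.

```python
from collections import defaultdict


class LiveIndex:
    """Minimal inverted index that is updated on every incoming line (hypothetical sketch)."""

    def __init__(self):
        self.lines = []                    # line_id -> original message
        self.by_token = defaultdict(set)   # lowercased token -> line_ids
        self.by_meta = defaultdict(set)    # (key, value) -> line_ids

    def add(self, message, **metadata):
        """Index one line as it arrives; metadata is arbitrary key=value context."""
        line_id = len(self.lines)
        self.lines.append(message)
        for token in message.lower().split():
            self.by_token[token].add(line_id)
        for key, value in metadata.items():
            self.by_meta[(key, value)].add(line_id)
        return line_id

    def search(self, token, **metadata):
        """Return messages containing the token, restricted to lines matching all metadata."""
        ids = set(self.by_token.get(token.lower(), set()))
        for key, value in metadata.items():
            ids &= self.by_meta.get((key, value), set())
        return [self.lines[i] for i in sorted(ids)]
```

A call such as `index.search("nan", rank=3)` then narrows a keyword match to a single rank, which is the kind of lookup that attached metadata makes cheap.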

Common Pitfalls

1. Failing to attach sufficient metadata to log lines can lead to ambiguity during debugging.
Without clear context, engineers may struggle to identify the source of issues in distributed systems, particularly in complex AI training scenarios.
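
One way to avoid this pitfall is to bind the context once, at logger creation, so that no call site can forget it. The sketch below uses Python's standard `logging.LoggerAdapter` purely as an analogy. It is not Logarithm's metadata API, and the helper name `make_rank_logger`, the field names `job_id` and `rank`, and the sample values are assumptions made for illustration.

```python
import io
import logging


def make_rank_logger(rank: int, job_id: str, stream) -> logging.LoggerAdapter:
    """Return a logger that stamps every line with the job ID and rank (hypothetical helper)."""
    logger = logging.getLogger(f"trainer.rank{rank}")
    logger.setLevel(logging.INFO)
    logger.propagate = False   # keep output on our handler only
    logger.handlers.clear()    # avoid duplicate handlers if called twice
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s job=%(job_id)s rank=%(rank)s %(message)s")
    )
    logger.addHandler(handler)
    # LoggerAdapter injects this dict as `extra` on every record it emits.
    return logging.LoggerAdapter(logger, {"job_id": job_id, "rank": rank})


# Usage: the call site logs only the message; the context is added automatically.
buffer = io.StringIO()
log = make_rank_logger(rank=3, job_id="job-42", stream=buffer)
log.info("loss is NaN at step %d", 12)
# buffer now holds: "INFO job=job-42 rank=3 loss is NaN at step 12\n"
```

Because the rank and job ID travel with every record, lines from different ranks stay distinguishable even after they are merged into a single stream.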

Related Concepts

Real-time Logging
AI Model Debugging Techniques
High-throughput Data Ingestion