Your AI SRE needs better observability, not bigger models.

Most AI SRE tools sound smart but fall apart in real incidents. Manveer Chawla explains why the problem isn’t the model, it’s the observability foundation, and how fixing that changes everything.

Overview

The article argues that AI Site Reliability Engineering (SRE) needs better observability rather than larger models. It emphasizes a robust data foundation and advocates ClickHouse as the ideal database behind an AI SRE copilot, enabling better incident response and root cause analysis.

What You'll Learn

1. How to build an effective AI SRE copilot using ClickHouse
2. Why high-cardinality data is essential for root cause analysis
3. How to improve incident response times with a robust observability layer

Prerequisites & Requirements

  • Understanding of observability concepts and AI SRE roles
  • Familiarity with ClickHouse or similar databases (optional)

Key Questions Answered

What are the main reasons AI SRE tools fail in production?
AI SRE tools often fail due to short data retention, lack of high-cardinality data, and slow query performance. These limitations prevent effective root cause analysis and hinder the ability to learn from past incidents, making it difficult for AI systems to provide accurate insights.

How does ClickHouse improve observability for AI SRE?
ClickHouse enhances observability by providing long retention periods for full-fidelity logs, supporting high-cardinality data, and enabling sub-second query performance. This allows AI SRE tools to access comprehensive data quickly, facilitating better incident response and analysis.

What is the significance of a context layer in AI SRE?
A context layer is crucial for AI SRE as it provides the necessary background information that helps models understand incidents better. This includes deployment histories, service dependencies, and past incidents, which are essential for accurate root cause analysis.
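
To make the idea of a context layer concrete, here is a minimal sketch of one way to hold the three kinds of background mentioned above: deployment history, service dependencies, and past incidents. Every name in it (`ContextLayer`, `Deployment`, `PastIncident`, `context_for`) is hypothetical and chosen for illustration; the article does not prescribe a schema, and a real context layer would likely sit on a database rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Deployment:
    service: str
    version: str
    deployed_at: datetime


@dataclass
class PastIncident:
    service: str
    summary: str
    root_cause: str


@dataclass
class ContextLayer:
    """Hypothetical in-memory context store (illustration only)."""
    deployments: list[Deployment] = field(default_factory=list)
    # Maps a service to the services it depends on.
    dependencies: dict[str, list[str]] = field(default_factory=dict)
    incidents: list[PastIncident] = field(default_factory=list)

    def context_for(self, service: str, at: datetime, lookback: timedelta) -> dict:
        """Gather background for `service` and its direct dependencies."""
        related = {service, *self.dependencies.get(service, [])}
        # Deployments to any related service inside the lookback window.
        recent = [
            d for d in self.deployments
            if d.service in related and at - lookback <= d.deployed_at <= at
        ]
        # Archived incidents that touched any related service.
        history = [i for i in self.incidents if i.service in related]
        return {
            "service": service,
            "depends_on": self.dependencies.get(service, []),
            "recent_deployments": recent,
            "past_incidents": history,
        }
```

The returned dictionary is the sort of bundle that could be handed to a model before it starts querying telemetry, so that a recent deployment to a dependency is visible from the first step.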

Key Statistics & Figures

Retention period for logs in legacy systems: 7 to 14 days
This short retention limits the ability of AI SRE tools to learn from historical incidents.

Number of queries an AI SRE model issues during an investigation: 6 to 27
This highlights the need for fast query performance to maintain an effective feedback loop.
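
A quick back-of-the-envelope calculation shows why per-query latency matters at that query count. The sketch below assumes the queries run one after another, which is a simplification, and the two latency figures are hypothetical placeholders rather than measurements from the article; only the 6 to 27 range comes from the source.

```python
def investigation_wait_seconds(num_queries: int, seconds_per_query: float) -> float:
    """Total time spent waiting on queries when they run sequentially."""
    return num_queries * seconds_per_query


# Range of queries per investigation, as quoted in the article.
QUERY_RANGE = (6, 27)

# Hypothetical per-query latencies in seconds, for illustration only.
ASSUMED_LATENCIES = {
    "sub-second store": 0.5,
    "slow store": 30.0,
}


def summarize() -> dict[str, tuple[float, float]]:
    """Return (best case, worst case) total wait per assumed latency."""
    return {
        label: tuple(investigation_wait_seconds(n, latency) for n in QUERY_RANGE)
        for label, latency in ASSUMED_LATENCIES.items()
    }


if __name__ == "__main__":
    for label, (low, high) in summarize().items():
        print(f"{label}: {low:.1f}s to {high:.1f}s of waiting")
```

Under these assumed numbers the gap is between seconds and many minutes of waiting per investigation, which is the feedback-loop concern the statistic points at.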

Key Actionable Insights

1. Implement a ClickHouse-based observability stack to enhance your AI SRE capabilities.
   Using ClickHouse allows for long-term data retention and high-cardinality support, which are critical for effective incident analysis and response.
2. Focus on building a rich context layer that includes deployment history and incident archives.
   This context will help AI models make more informed decisions during incident investigations, improving overall response times.
3. Ensure your observability solution can handle high query volumes without incurring excessive costs.
   Many traditional observability tools charge based on query volume, which can become prohibitively expensive. ClickHouse's architecture mitigates this issue.
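
As a rough illustration of the first insight, the sketch below builds a ClickHouse `CREATE TABLE` statement for a logs table with a retention TTL and a couple of high-cardinality columns. The table name, column set, and 180-day default are assumptions made for this example, not a schema from the article; the helper `logs_table_ddl` is likewise hypothetical and only produces the SQL text, leaving execution to whichever ClickHouse client you use.

```python
def logs_table_ddl(table: str = "logs_sketch", retention_days: int = 180) -> str:
    """Return ClickHouse DDL for a logs table with a retention TTL.

    The schema is an illustrative assumption, not a recommended layout.
    """
    if not table.isidentifier():
        raise ValueError("table must be a simple identifier")
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return f"""
CREATE TABLE IF NOT EXISTS {table}
(
    timestamp DateTime64(3),
    service   LowCardinality(String),  -- few distinct values
    severity  LowCardinality(String),
    trace_id  String,                  -- high-cardinality, kept as-is
    user_id   String,                  -- high-cardinality, kept as-is
    body      String
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (service, timestamp)
TTL toDateTime(timestamp) + INTERVAL {retention_days} DAY
""".strip()


if __name__ == "__main__":
    print(logs_table_ddl(retention_days=365))
```

The `TTL` clause is what expresses retention here: rows older than the chosen number of days become eligible for removal, so extending retention is a matter of changing one value rather than re-architecting storage.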

Common Pitfalls

1. Relying on legacy observability tools that limit data retention and cardinality.
   These tools often lead to incomplete data, making it difficult for AI SRE models to perform effective root cause analysis.
2. Failing to build a context layer for AI SRE.
   Without a context layer, AI models may lack the necessary background to accurately interpret incidents, leading to incorrect conclusions.

Related Concepts

Observability Best Practices
AI in Incident Response
Data Retention Strategies