Building Production-Ready Agentic Systems: Lessons from Shopify Sidekick

How we evolved our AI assistant architecture and built robust evaluation frameworks for real-world deployment

Andrew McNamara
7 min readintermediate
--
View Original

Overview

The article discusses the development of Shopify's AI-powered assistant, Sidekick, focusing on its architecture, evaluation methodologies, and training techniques. It shares insights on overcoming challenges in building production-ready agentic systems and emphasizes the importance of modular design and robust evaluation frameworks.

What You'll Learn

1

How to implement Just-in-Time instructions for AI systems

2

Why using Ground Truth Sets improves LLM evaluation

3

How to detect and address reward hacking in AI training

4

When to use user simulations for testing AI systems

Prerequisites & Requirements

  • Understanding of AI/ML concepts and LLMs
  • Experience with AI system development(optional)

Key Questions Answered

What is the agentic loop in AI systems?
The agentic loop is a continuous cycle where a human provides input, an LLM processes that input and decides on actions, those actions are executed, feedback is collected, and the cycle continues until the task is complete. This design allows systems like Sidekick to effectively manage tasks through natural language interactions.
How does Shopify handle tool complexity in Sidekick?
Shopify identified a scaling challenge as Sidekick's tool inventory grew. They categorized tool complexity into three ranges: 0-20 tools are clear and easy to debug, 20-50 tools lead to unclear boundaries, and 50+ tools cause confusion and maintenance difficulties. This led to the concept of 'Death by a Thousand Instructions' as the system prompt became unwieldy.
What are the benefits of Just-in-Time instructions?
Just-in-Time instructions provide localized guidance, cache efficiency, and modularity, allowing relevant instructions to be delivered only when needed. This approach keeps the core system prompt focused and improves maintainability and performance across all metrics.
How does Shopify evaluate LLM performance?
Shopify moved from golden datasets to Ground Truth Sets, which reflect actual production distributions. They employ human evaluation, statistical validation using metrics like Cohen's Kappa, and benchmark human agreement levels to ensure their LLM judges are reliable.

Key Statistics & Figures

Cohen's Kappa correlation improvement
from 0.02 to 0.61
This improvement indicates the calibration of LLM judges against human judgment, achieving near-human performance.
Syntax validation accuracy
improved from ~93% to ~99%
This increase reflects the effectiveness of updates made to syntax validators after addressing reward hacking.
LLM judge correlation
increased from 0.66 to 0.75
This enhancement shows the iterative improvement of LLM judges to align better with human evaluators.

Key Actionable Insights

1
Implement Just-in-Time instructions to enhance AI system performance and maintainability.
This approach allows for more focused guidance and reduces the complexity of system prompts, which is crucial as the number of tools and functionalities increases.
2
Utilize Ground Truth Sets for evaluating AI systems instead of traditional golden datasets.
Ground Truth Sets provide a more accurate reflection of real-world interactions, leading to better evaluation criteria and improved system performance.
3
Prepare for reward hacking by designing robust detection mechanisms in your training process.
Understanding that models may find ways to game the reward system is essential for maintaining high-quality outputs and ensuring the integrity of the training process.
4
Incorporate user simulations into your testing strategy for AI systems.
Simulating real user interactions allows for comprehensive testing of system changes before deployment, helping catch regressions and validate improvements.

Common Pitfalls

1
Overcomplicating the system by adding too many tools without clear boundaries.
This can lead to unclear functionality and maintenance challenges, ultimately resulting in a system that is difficult to manage and debug.
2
Relying on vibe testing for evaluating LLM performance.
Vibe testing lacks the statistical rigor needed to ensure reliable performance, leading to a false sense of security in system outputs.

Related Concepts

Agentic Systems
Llm Evaluation Methodologies
Reinforcement Learning Techniques
User Simulation In AI Testing