Build better software to build software better

We manage the build pipeline that delivers Quip and Slack Canvas’s backend. A year ago, we were chasing exciting ideas to help engineers ship better code, faster. But we had one huge problem: builds took 60 minutes. With a build that slow, the whole pipeline gets less agile, and feedback doesn’t come to engineers until…

Overview

Slack's build pipeline team reduced build times for Quip and Slack Canvas from 60 minutes to as little as 10 minutes by applying classic software engineering principles—separation of concerns, caching, parallelization, and layering—to their Bazel-based build system. The article draws parallels between code performance optimization (caching with functools, threading) and build system optimization, demonstrating how decoupling frontend and backend builds, increasing cache granularity, and delegating parallelization to Bazel dramatically improved developer experience.
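The code-level half of that parallel can be sketched in a few lines. This is a minimal illustration, not code from the article: `render`, `render_all`, and `CALLS` are hypothetical names, and the sleep stands in for real work.

```python
import functools
import time
from concurrent.futures import ThreadPoolExecutor

CALLS = []  # records which inputs were actually computed (illustration only)


@functools.cache
def render(item: str) -> str:
    """Stand-in for an expensive, pure computation on one item."""
    CALLS.append(item)
    time.sleep(0.01)  # simulate work
    return item.upper()


def render_all(items: list[str]) -> list[str]:
    """Run independent items in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(render, items))
```

Calling `render_all(["a", "b", "c"])` and then `render_all(["a", "b", "c", "d"])` computes only `"d"` the second time, because the cache is keyed per item rather than per list. The same two ideas, small cache keys and independent units of work, are what the build changes rely on.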

What You'll Learn

1. How to apply code-level performance optimization principles (caching, parallelization) to build systems
2. Why separation of concerns between frontend and backend builds is critical for cache hit rates in Bazel
3. How to identify and fix layering violations where build scripts duplicate orchestration already handled by Bazel
4. How to design granular, composable build units that maximize caching and parallelization effectiveness
5. Why hermetic and idempotent build steps are prerequisites for effective Bazel caching

Prerequisites & Requirements

  • Understanding of build systems and dependency graphs (directed acyclic graphs)
  • Familiarity with caching concepts (cache keys, hit rates, hermeticity, idempotency)
  • Basic familiarity with Bazel build system concepts (targets, srcs, outs) (optional)
  • Understanding of Python concurrency patterns (functools.cache, ThreadPoolExecutor) (optional)

Key Questions Answered

How do you reduce build times from 60 minutes to 10 minutes with Bazel?
Slack reduced build times by applying three key principles: separating frontend and backend build graphs to eliminate unnecessary transitive dependencies, increasing build target granularity to improve cache hit rates, and removing custom parallelization code in favor of Bazel's native parallelization. This required decoupling Python backend builds from TypeScript frontend builds and rewriting build orchestration in Starlark.
Why does coupling frontend and backend builds destroy cache effectiveness?
When the entire Python backend is a transitive dependency of frontend builds, every Python change alters the cache key for TypeScript builds. This meant Slack's cache hit rate was effectively zero—like having a cached function with 100 parameters where 2-3 always change. Severing this dependency edge alone saved 35 minutes per build because Python changes no longer triggered full frontend rebuilds.
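
A rough way to see why the coupling hurts is to model a cache key as a hash over every declared input. The sketch below is an assumption-laden illustration, not Bazel's actual key computation; the file names and the `cache_key` helper are hypothetical.

```python
import hashlib


def cache_key(inputs: dict[str, bytes]) -> str:
    """Hash every declared input (path and content) into a single key."""
    digest = hashlib.sha256()
    for path in sorted(inputs):
        digest.update(path.encode())
        digest.update(b"\0")
        digest.update(hashlib.sha256(inputs[path]).digest())
    return digest.hexdigest()


# Hypothetical sources: one frontend file, one backend file in two versions.
frontend = {"app.ts": b"export const x = 1;"}
backend_v1 = {"server.py": b"print('v1')"}
backend_v2 = {"server.py": b"print('v2')"}

# Coupled graph: the backend is an input of the frontend build.
coupled_v1 = cache_key({**frontend, **backend_v1})
coupled_v2 = cache_key({**frontend, **backend_v2})

# Decoupled graph: the frontend build only declares frontend inputs.
decoupled_v1 = cache_key(frontend)
decoupled_v2 = cache_key(frontend)
```

With the backend among the inputs, a Python-only change produces a different key (`coupled_v1 != coupled_v2`) and the cached frontend result is unusable. Once the edge is removed, the key stays the same across backend changes.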
What is a layering violation in build systems and how do you fix it?
A layering violation occurs when build scripts bundle business logic, task orchestration, and parallelization into a single unit, cutting across architectural layers. Slack's frontend builder managed its own worker processes for parallelization, competing with Bazel for CPU resources. The fix was to strip the builder down to pure business logic—building one bundle at a time—and delegate orchestration and parallelization entirely to Bazel.
What properties must build steps have for Bazel caching to work?
Build steps must be hermetic (only using explicitly declared inputs to produce outputs) and idempotent (producing the same outputs for the same inputs every time). Bazel enforces hermeticity through sandboxed execution where commands can only access declared input files. Without these properties, caching is unsound and may produce incorrect build artifacts.
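
The idea behind sandboxed execution can be approximated in plain Python. This is only a sketch of the principle under simplifying assumptions: `run_sandboxed` and `concat_step` are hypothetical, and unlike a real sandbox this does not stop a step from reading absolute paths outside the temporary directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_sandboxed(step: Callable[[Path], bytes], workspace: Path, declared: list[str]) -> bytes:
    """Copy only the declared inputs into a fresh directory and run the step there."""
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp)
        for rel in declared:
            target = sandbox / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy(workspace / rel, target)
        return step(sandbox)


def concat_step(root: Path) -> bytes:
    """Hypothetical build step: joins a.txt and b.txt relative to its root."""
    return (root / "a.txt").read_bytes() + (root / "b.txt").read_bytes()
```

If `concat_step` is run with only `a.txt` declared, it fails with `FileNotFoundError` instead of silently picking up an undeclared file, which is the failure mode that keeps cache keys honest. Running it twice with the same declared inputs returns the same bytes, which is the idempotency half of the requirement.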
How does build target granularity affect cache hit rates?
Coarse-grained targets with many inputs create large cache keys that invalidate frequently—any single input change requires a full rebuild. Fine-grained targets with smaller, focused input sets create smaller cache keys that only invalidate when their specific inputs change. Slack improved hit rates by splitting their monolithic frontend builder into individual bundle builds where each TypeScript and CSS build step is cached independently.
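
The effect of granularity can be shown with a toy model that compares which targets must rebuild after a single file changes. The bundle and file names below are invented for illustration, and the key is simply the tuple of input contents rather than a real content hash.

```python
def cache_key(inputs: list[str], state: dict[str, str]) -> tuple:
    """Toy cache key: the (path, content) pairs of a target's declared inputs."""
    return tuple((path, state[path]) for path in sorted(inputs))


def rebuilt_targets(targets: dict[str, list[str]], before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return the targets whose cache key differs between two source states."""
    return [
        name
        for name, inputs in targets.items()
        if cache_key(inputs, before) != cache_key(inputs, after)
    ]


# Hypothetical source tree; only docs.less changes between the two states.
before = {"chat.ts": "v1", "chat.less": "v1", "docs.ts": "v1", "docs.less": "v1"}
after = {**before, "docs.less": "v2"}

# One monolithic target versus one target per bundle and per language.
coarse = {"all_bundles": list(before)}
fine = {
    "chat_ts": ["chat.ts"],
    "chat_css": ["chat.less"],
    "docs_ts": ["docs.ts"],
    "docs_css": ["docs.less"],
}
```

Here `rebuilt_targets(coarse, before, after)` reports the single monolithic target, meaning everything rebuilds, while `rebuilt_targets(fine, before, after)` reports only `docs_css`.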
How did Slack verify correctness when rewriting their build system?
Slack built a comparison tool in Rust that diffed artifacts produced by the existing build process against those produced by the new Bazel-based code. Since the original build code had no tests, the only correctness criterion was matching existing output under specific configurations. They used the differences iteratively to identify and fix logic discrepancies in the new implementation.
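
The article does not show the Rust tool itself, so the following is a Python sketch of the general technique only: hash every file in two output trees and report what is missing, unexpected, or different. `digest_tree` and `compare_trees` are hypothetical names, and a real tool would likely need format-aware comparison for artifacts that embed timestamps or ordering differences.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_trees(old_root: Path, new_root: Path) -> dict[str, list[str]]:
    """Classify differences between the old build's output and the new build's output."""
    old, new = digest_tree(old_root), digest_tree(new_root)
    return {
        "missing": sorted(old.keys() - new.keys()),      # produced only by the old build
        "unexpected": sorted(new.keys() - old.keys()),   # produced only by the new build
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }
```

An empty result in all three lists is the "matches existing output" criterion described above. Anything else points at a discrepancy to investigate in the new build logic.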
Why should you avoid custom parallelization inside Bazel build steps?
Custom parallelization within build scripts creates a work orchestrator inside a work orchestrator, where both Bazel and the script's worker processes contend for the same CPU resources. The script may even parallelize work that Bazel already knows is unnecessary. By delegating parallelization to Bazel, you enable parallelization across machines in a build cluster, not just local CPU cores, and eliminate resource contention.
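
A builder stripped down to business logic might look something like the sketch below: one invocation, one bundle, explicit inputs and output, and no worker pool. The flag names and the concatenation step are placeholders chosen for illustration, not Slack's actual builder interface.

```python
import argparse
from pathlib import Path


def main(argv: list[str]) -> int:
    """Build exactly one bundle from explicitly listed sources."""
    parser = argparse.ArgumentParser(description="Build exactly one bundle.")
    parser.add_argument("--src", action="append", required=True,
                        help="declared input file (repeatable)")
    parser.add_argument("--out", required=True, help="declared output file")
    args = parser.parse_args(argv)

    # Business logic only: no worker pool, no scheduling, no caching.
    # Concatenation is a placeholder for the real compile/bundle step.
    parts = [Path(src).read_text() for src in args.src]
    Path(args.out).write_text("\n".join(parts))
    return 0
```

Because each invocation is small and self-contained, the build system is free to decide how many to run at once, where to run them, and which ones to skip entirely because their outputs are already cached.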

Key Statistics & Figures

  • Original build time: 60 minutes (baseline for all cases before optimization)
  • Best-case build time after optimization: 10 minutes (builds are cached and parallelized)
  • Average-case build time after optimization: 12 minutes (mostly cached and parallelized builds)
  • Worst-case build time after optimization: 30 minutes (cache miss scenario)
  • Maximum speed improvement: 6x faster (best case compared with the original 60-minute builds)
  • Cost of frontend-backend coupling: 35 minutes per build (average time cost of the dependency edge between the Python backend and the TypeScript frontend, more than half the total build time)
  • Build time after severing frontend-backend coupling: 25 minutes (when the frontend was cached after decoupling from the backend)

Technologies & Tools

  • Bazel (build system): Primary build system providing caching, parallelization, and sandboxed execution of build targets
  • Python (programming language): Backend application language and original build orchestration scripts; used for code examples demonstrating caching and parallelization concepts
  • TypeScript (programming language): Frontend source language for Quip and Slack Canvas
  • Starlark (build configuration language): Bazel's build definition language, used to rewrite Python build orchestration with enforced constraints
  • Rust (programming language): Used to build an artifact comparison tool for validating correctness during build system migration
  • Webpack (build tool): Part of the original frontend build pipeline for bundling JavaScript
  • Cython (programming language): Used in backend build artifacts alongside Python
  • Protobuf (serialization): Protocol buffer compilation as part of the backend build pipeline
  • Less (CSS preprocessor): Frontend stylesheet source language compiled into CSS bundles

Key Actionable Insights

1. Model your build as a directed acyclic graph with exhaustively defined inputs and outputs for each step. This enables the build system to automatically determine what needs rebuilding and what can be cached. Think of each build target like a pure function with declared parameters: the more precisely you define dependencies, the better caching and parallelization will work.
   This is the foundational principle that enables all other build optimizations. Without well-defined dependency edges, neither caching nor parallelization can be applied effectively by tools like Bazel.
2. Increase the granularity of your build targets to improve cache hit rates. Instead of one monolithic target that takes all sources and produces all artifacts, break it into smaller targets that each handle a specific piece. This is directly analogous to caching at the per-item level rather than per-collection in application code.
   Slack's frontend builder originally took all TypeScript and CSS sources and produced all bundles. By splitting into per-bundle builds with independent TypeScript and CSS steps, they dramatically increased how often cached results could be reused.
3. Audit and sever unnecessary transitive dependencies between major subsystems in your build graph. When your frontend build depends on your entire backend, every backend change invalidates the frontend cache. Map out the actual data flow to identify which dependency edges are truly required versus artifacts of historical coupling.
   Slack discovered that the dependency edge between their Python backend and TypeScript frontend was costing 35 minutes per build, more than half the total, because it forced full frontend rebuilds on any Python change.
4. Remove custom parallelization and orchestration code from your build scripts when using a build system like Bazel that handles these concerns. Strip your build scripts down to pure business logic that transforms specific inputs into specific outputs, and let the build system handle scheduling, caching, and resource allocation.
   This avoids layering violations where your code and the build system compete for resources. It also makes build steps more composable and allows the build system to parallelize across machines, not just local cores.
5. Build a comparison tool to validate correctness when migrating build systems, especially when the original build code lacks tests. Diff the artifacts produced by old and new systems to iteratively find and fix discrepancies, building confidence in the migration.
   Slack built a Rust tool for this purpose because the complexity of their original build code made it impossible to define correct behavior from first principles. The iterative comparison approach served as an effective substitute for unit tests.
6. Rewrite build orchestration in a constrained language like Starlark rather than in your application language. The deliberate limitations of such languages enforce separation between build logic and application code, preventing the re-entanglement of concerns that caused the original problems.
   Slack's Python build scripts had deep dependencies on backend application code. Rewriting in Starlark and standard-library-only Python scripts enforced a clean boundary between build and application concerns.
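
The first insight, modeling the build as a directed acyclic graph, can be made concrete with Python's standard `graphlib` module. The graph and the `stale_nodes` helper are hypothetical and far simpler than what a real build system tracks. They only illustrate how declared edges let a tool work out what needs rebuilding.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each node maps to the set of nodes it depends on.
DEPS = {
    "chat.js": {"chat.ts"},
    "chat.css": {"chat.less"},
    "chat_bundle": {"chat.js", "chat.css"},
    "server.pyc": {"server.py"},
}


def stale_nodes(deps: dict[str, set[str]], changed: set[str]) -> list[str]:
    """Return every derived node that must be rebuilt, in a valid build order."""
    stale = set(changed)
    order = []
    # static_order() yields dependencies before the nodes that depend on them.
    for node in TopologicalSorter(deps).static_order():
        if node in stale:
            continue  # a changed source file, not something to build
        if deps.get(node, set()) & stale:
            stale.add(node)
            order.append(node)
    return order
```

Changing only `chat.less` marks `chat.css` and then `chat_bundle` as stale and leaves `chat.js` and `server.pyc` untouched, so their cached outputs remain valid.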

Common Pitfalls

1. Creating coarse-grained cache keys that invalidate too frequently. When a build target takes in all source files as inputs, changing any single file invalidates the entire cache entry. This is like caching a function that takes a list of 100 items: adding one item means recalculating everything from scratch.
   Break build targets into smaller units with focused input sets. Cache at the most granular level possible (individual bundle builds, not all bundles at once) to maximize cache hit rate.
2. Implementing custom parallelization inside build scripts that run within a build system like Bazel. This creates competing orchestrators fighting over the same CPU resources, and the build script may parallelize work that the build system already knows it doesn't need to do.
   Delegate parallelization to the build system and keep build scripts focused on single-concern business logic. This also enables the build system to scale parallelization across machines rather than being limited to local cores.
3. Coupling frontend and backend build graphs through transitive dependencies when they could be independent. Slack's Python backend was a transitive source for every frontend bundle, meaning any Python change triggered a full frontend rebuild even though the frontend didn't actually depend on the backend output.
   Audit your build graph for unnecessary dependency edges between major subsystems. Just because code historically evolved together doesn't mean the builds need to be coupled.
4. Assuming that simply adopting Bazel will automatically speed up your build. Without well-defined dependency graphs, hermetic build steps, and granular cache keys, Bazel's cache will rarely hit and its parallelization adds little over existing ad-hoc approaches.
   Invest in engineering work to prepare your codebase first: define clear inputs and outputs, ensure hermeticity and idempotency, and separate concerns before expecting the build system to deliver speedups.
5. Mixing build orchestration code with application code, making it impossible to reason about build dependencies or refactor the build independently. Slack's build code used Python multiprocessing, async routines from their core codebase, and direct dependencies on backend modules.
   Use constrained build-specific languages (like Starlark) and ensure build scripts depend only on standard libraries, enforcing a clean separation between build and application concerns.

Related Concepts

Directed Acyclic Graphs
Build System Optimization
Cache Invalidation Strategies
Separation Of Concerns
Hermetic Builds
Idempotent Build Steps
Remote Build Execution
Bazel Remote Caching
Monorepo Build Strategies
Developer Experience
CI/CD Pipeline Optimization
Build Graph Analysis
Starlark Build Rules
Software Layering Architecture
Composable Build Units