Build better software to build software better

David Reed

We manage the build pipeline that delivers Quip and Slack Canvas’s backend. A year ago, we were chasing exciting ideas to help engineers ship better code, faster. But we had one huge problem: builds took 60 minutes. With a build that slow, the whole pipeline gets less agile, and feedback doesn’t come to engineers until…

Slack

19 min read • advanced

Tags: Caching, Chef, CSS, Cython, JavaScript, Jenkins, Less, Python, Rust, TypeScript

Overview

Slack's build pipeline team reduced build times for Quip and Slack Canvas from 60 minutes to as little as 10 minutes by applying classic software engineering principles—separation of concerns, caching, parallelization, and layering—to their Bazel-based build system. The article draws parallels between code performance optimization (caching with functools, threading) and build system optimization, demonstrating how decoupling frontend and backend builds, increasing cache granularity, and delegating parallelization to Bazel dramatically improved developer experience.

What You'll Learn

1. How to apply code-level performance optimization principles (caching, parallelization) to build systems
2. Why separation of concerns between frontend and backend builds is critical for cache hit rates in Bazel
3. How to identify and fix layering violations where build scripts duplicate orchestration already handled by Bazel
4. How to design granular, composable build units that maximize caching and parallelization effectiveness
5. Why hermetic and idempotent build steps are prerequisites for effective Bazel caching

Prerequisites & Requirements

Understanding of build systems and dependency graphs (directed acyclic graphs)
Familiarity with caching concepts (cache keys, hit rates, hermeticity, idempotency)
Basic familiarity with Bazel build system concepts (targets, srcs, outs) (optional)
Understanding of Python concurrency patterns (functools.cache, ThreadPoolExecutor) (optional)

Key Questions Answered

How do you reduce build times from 60 minutes to 10 minutes with Bazel?

Slack reduced build times by applying three key principles: separating frontend and backend build graphs to eliminate unnecessary transitive dependencies, increasing build target granularity to improve cache hit rates, and removing custom parallelization code in favor of Bazel's native parallelization. This required decoupling Python backend builds from TypeScript frontend builds and rewriting build orchestration in Starlark.

Why does coupling frontend and backend builds destroy cache effectiveness?

When the entire Python backend is a transitive dependency of frontend builds, every Python change alters the cache key for TypeScript builds. This meant Slack's cache hit rate was effectively zero—like having a cached function with 100 parameters where 2-3 always change. Severing this dependency edge alone saved 35 minutes per build because Python changes no longer triggered full frontend rebuilds.
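The cached-function analogy can be made concrete with a minimal sketch. The function names, the digest strings, and the five-commit history below are all hypothetical; the point is only that a parameter which changes on every call drives the hit rate to zero even when the output never uses it.

```python
from functools import cache

# Hypothetical stand-ins: each "digest" is a short string summarizing a source tree.
@cache
def build_frontend_coupled(ts_digest: str, py_digest: str) -> str:
    # py_digest is part of the cache key even though the output ignores it.
    return f"bundle({ts_digest})"

@cache
def build_frontend_decoupled(ts_digest: str) -> str:
    return f"bundle({ts_digest})"

# Five commits: Python changes every time, TypeScript changes once (4th commit).
commits = [("ts1", "py1"), ("ts1", "py2"), ("ts1", "py3"), ("ts2", "py4"), ("ts2", "py5")]

for ts, py in commits:
    build_frontend_coupled(ts, py)
    build_frontend_decoupled(ts)

coupled = build_frontend_coupled.cache_info()
decoupled = build_frontend_decoupled.cache_info()
print(coupled.hits, coupled.misses)      # 0 5
print(decoupled.hits, decoupled.misses)  # 3 2
```

Dropping the unused parameter is the function-level equivalent of severing the dependency edge: the key now changes only when the TypeScript sources do.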

What is a layering violation in build systems and how do you fix it?

A layering violation occurs when build scripts bundle business logic, task orchestration, and parallelization into a single unit, cutting across architectural layers. Slack's frontend builder managed its own worker processes for parallelization, competing with Bazel for CPU resources. The fix was to strip the builder down to pure business logic—building one bundle at a time—and delegate orchestration and parallelization entirely to Bazel.
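As an illustration of what "pure business logic" might look like, here is a sketch of a builder that handles exactly one bundle. The function name, the in-memory source mapping, and the concatenation step are assumptions made for the example, not Slack's actual builder.

```python
import hashlib

def build_bundle(name: str, sources: dict[str, str]) -> str:
    """Build exactly one bundle from its declared sources.

    No worker pools, no scheduling, no knowledge of other bundles:
    the caller (the build system) decides when and where this runs.
    """
    # Concatenate sources in a fixed order so the output is deterministic.
    body = "\n".join(sources[path] for path in sorted(sources))
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
    return f"/* {name} {digest} */\n{body}"

# The same declared inputs give the same output, whatever order they arrive in.
first = build_bundle("editor", {"b.ts": "export const b = 2;", "a.ts": "export const a = 1;"})
second = build_bundle("editor", {"a.ts": "export const a = 1;", "b.ts": "export const b = 2;"})
print(first == second)  # True
```

Because the function knows nothing about other bundles or about concurrency, an outer orchestrator is free to cache, skip, or schedule each call independently.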

What properties must build steps have for Bazel caching to work?

Build steps must be hermetic (only using explicitly declared inputs to produce outputs) and idempotent (producing the same outputs for the same inputs every time). Bazel enforces hermeticity through sandboxed execution where commands can only access declared input files. Without these properties, caching is unsound and may produce incorrect build artifacts.
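To see why both properties matter, consider a simplified cache key derived only from the command line and the declared inputs. This is a sketch of the general idea, not Bazel's actual key scheme; the `action_key` helper and the `lessc` command line are illustrative assumptions.

```python
import hashlib

def action_key(command: list[str], inputs: dict[str, bytes]) -> str:
    """Derive a cache key from the command line and declared inputs only."""
    h = hashlib.sha256()
    for arg in command:
        h.update(arg.encode("utf-8") + b"\0")
    for path in sorted(inputs):  # sorted so dict ordering cannot change the key
        h.update(path.encode("utf-8") + b"\0")
        h.update(hashlib.sha256(inputs[path]).digest())
    return h.hexdigest()

cmd = ["lessc", "app.less", "app.css"]  # illustrative command line
key_a = action_key(cmd, {"app.less": b"a { color: red; }"})
key_b = action_key(cmd, {"app.less": b"a { color: red; }"})
key_c = action_key(cmd, {"app.less": b"a { color: blue; }"})
print(key_a == key_b)  # True
print(key_a == key_c)  # False
```

If a step also read an undeclared file or the system clock, nothing in this key would capture that, so a cache lookup could return output that no longer matches what a fresh run would produce.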

How does build target granularity affect cache hit rates?

Coarse-grained targets with many inputs create large cache keys that invalidate frequently—any single input change requires a full rebuild. Fine-grained targets with smaller, focused input sets create smaller cache keys that only invalidate when their specific inputs change. Slack improved hit rates by splitting their monolithic frontend builder into individual bundle builds where each TypeScript and CSS build step is cached independently.
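The effect of granularity can be sketched by counting which targets a single file change invalidates. The bundle names and file layout below are hypothetical.

```python
# Hypothetical layout: which source files feed each bundle.
bundle_inputs = {
    "editor": {"editor.ts", "shared.ts", "editor.less"},
    "sidebar": {"sidebar.ts", "shared.ts", "sidebar.less"},
    "settings": {"settings.ts", "settings.less"},
}

def invalidated_fine(changed: str) -> set[str]:
    """Per-bundle targets: only bundles that list the changed file rebuild."""
    return {name for name, srcs in bundle_inputs.items() if changed in srcs}

def invalidated_coarse(changed: str) -> set[str]:
    """One monolithic target: any changed input rebuilds every bundle."""
    all_srcs = set().union(*bundle_inputs.values())
    return set(bundle_inputs) if changed in all_srcs else set()

print(sorted(invalidated_coarse("settings.less")))  # ['editor', 'settings', 'sidebar']
print(sorted(invalidated_fine("settings.less")))    # ['settings']
print(sorted(invalidated_fine("shared.ts")))        # ['editor', 'sidebar']
```

Shared files still invalidate every bundle that uses them, which is correct; the gain comes from changes that touch only one bundle's inputs.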

How did Slack verify correctness when rewriting their build system?

Slack built a comparison tool in Rust that diffed the artifacts produced by the existing build process against those produced by the new Bazel-based code. Because the original build code had no tests, the only correctness criterion was matching the existing output under specific configurations. The team worked through the differences iteratively to identify and fix logic discrepancies in the new implementation.
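The idea behind such a comparison tool can be sketched in a few lines. Slack's tool was written in Rust and its design is not described here; the Python version below, including the `tree_digests` and `compare_trees` helpers and the sample files, is an assumed minimal equivalent that compares two output directories by content hash.

```python
import hashlib
import tempfile
from pathlib import Path

def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def compare_trees(old_root: Path, new_root: Path) -> dict[str, list[str]]:
    """Report files that are missing, extra, or different in the new output."""
    old, new = tree_digests(old_root), tree_digests(new_root)
    return {
        "missing": sorted(old.keys() - new.keys()),  # in old output only
        "extra": sorted(new.keys() - old.keys()),    # in new output only
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

# Demo with throwaway directories standing in for the two build outputs.
with tempfile.TemporaryDirectory() as tmp:
    old_dir, new_dir = Path(tmp, "old"), Path(tmp, "new")
    old_dir.mkdir()
    new_dir.mkdir()
    (old_dir / "app.js").write_text("console.log(1);")
    (new_dir / "app.js").write_text("console.log(2);")
    (old_dir / "app.css").write_text("a{}")
    (new_dir / "app.css").write_text("a{}")
    (new_dir / "extra.map").write_text("{}")
    report = compare_trees(old_dir, new_dir)

print(report)  # {'missing': [], 'extra': ['extra.map'], 'changed': ['app.js']}
```

An empty report for every configuration of interest is the stopping condition; each non-empty entry points at a file whose build logic still differs.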

Why should you avoid custom parallelization inside Bazel build steps?

Custom parallelization within build scripts creates a work orchestrator inside a work orchestrator, where both Bazel and the script's worker processes contend for the same CPU resources. The script may even parallelize work that Bazel already knows is unnecessary. By delegating parallelization to Bazel, you enable parallelization across machines in a build cluster, not just local CPU cores, and eliminate resource contention.
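A back-of-the-envelope sketch of the contention, assuming (hypothetically) a 16-core machine on which both the build system and the build script size their worker pools to the core count:

```python
def peak_processes(outer_jobs: int, inner_workers: int) -> int:
    """Worst-case busy processes when each outer job spawns its own workers."""
    return outer_jobs * inner_workers

cores = 16  # hypothetical build machine

nested = peak_processes(outer_jobs=cores, inner_workers=cores)
delegated = peak_processes(outer_jobs=cores, inner_workers=1)

print(nested, nested / cores)        # 256 16.0
print(delegated, delegated / cores)  # 16 1.0
```

With nested pools the machine can be oversubscribed many times over; with single-purpose build steps the outer scheduler alone decides how many run at once.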

Key Statistics & Figures

Original build time: 60 minutes. Baseline build time for all cases before optimization.
Best-case build time after optimization: 10 minutes. When builds are cached and parallelized.
Average-case build time after optimization: 12 minutes. Mostly cached and parallelized builds.
Worst-case build time after optimization: 30 minutes. Cache miss scenario.
Maximum speed improvement: 6x faster. Best-case improvement compared to the original 60-minute builds.
Cost of frontend-backend coupling: 35 minutes per build. Average time cost of the dependency edge between the Python backend and the TypeScript frontend, more than half the total build time.
Build time after severing frontend-backend coupling: 25 minutes. When the frontend was cached after decoupling from the backend.
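The derived figures follow directly from the reported build times, as this short calculation shows. It uses only the numbers listed above.

```python
baseline = 60  # minutes, original build time
after = {"best": 10, "average": 12, "worst": 30}

# Speedup relative to the original build for each case.
speedup = {case: baseline / minutes for case, minutes in after.items()}
print(speedup)  # {'best': 6.0, 'average': 5.0, 'worst': 2.0}

coupling_cost = 35  # minutes attributed to the frontend-backend edge
print(baseline - coupling_cost)            # 25
print(round(coupling_cost / baseline, 2))  # 0.58
```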

Technologies & Tools


Bazel (Build System): Primary build system providing caching, parallelization, and sandboxed execution of build targets
Python (Programming Language): Backend application language and original build orchestration scripts; used for code examples demonstrating caching and parallelization concepts
TypeScript (Programming Language): Frontend source language for Quip and Slack Canvas
Starlark (Build Configuration Language): Bazel's build definition language used to rewrite Python build orchestration with enforced constraints
Rust (Programming Language): Used to build an artifact comparison tool for validating correctness during build system migration
Webpack (Build Tool): Part of the original frontend build pipeline for bundling JavaScript
Cython (Programming Language): Used in backend build artifacts alongside Python
Protobuf (Serialization): Protocol buffer compilation as part of the backend build pipeline
Less (CSS Preprocessor): Frontend stylesheet source language compiled into CSS bundles

Key Actionable Insights

1. Model your build as a directed acyclic graph with exhaustively defined inputs and outputs for each step. This enables the build system to automatically determine what needs rebuilding and what can be cached. Think of each build target as a pure function with declared parameters: the more precisely you define dependencies, the better caching and parallelization will work.
This is the foundational principle that enables all other build optimizations. Without well-defined dependency edges, neither caching nor parallelization can be applied effectively by tools like Bazel.

2. Increase the granularity of your build targets to improve cache hit rates. Instead of one monolithic target that takes all sources and produces all artifacts, break it into smaller targets that each handle a specific piece. This is directly analogous to caching at the per-item level rather than per-collection in application code.
Slack's frontend builder originally took all TypeScript and CSS sources and produced all bundles. By splitting into per-bundle builds with independent TypeScript and CSS steps, they dramatically increased how often cached results could be reused.

3. Audit and sever unnecessary transitive dependencies between major subsystems in your build graph. When your frontend build depends on your entire backend, every backend change invalidates the frontend cache. Map out the actual data flow to identify which dependency edges are truly required versus artifacts of historical coupling.
Slack discovered that the dependency edge between their Python backend and TypeScript frontend was costing 35 minutes per build, more than half the total, because it forced full frontend rebuilds on any Python change.

4. Remove custom parallelization and orchestration code from your build scripts when using a build system like Bazel that handles these concerns. Strip your build scripts down to pure business logic that transforms specific inputs into specific outputs, and let the build system handle scheduling, caching, and resource allocation.
This avoids layering violations where your code and the build system compete for resources. It also makes build steps more composable and allows the build system to parallelize across machines, not just local cores.

5. Build a comparison tool to validate correctness when migrating build systems, especially when the original build code lacks tests. Diff the artifacts produced by old and new systems to iteratively find and fix discrepancies, building confidence in the migration.
Slack built a Rust tool for this purpose because the complexity of their original build code made it impossible to define correct behavior from first principles. The iterative comparison approach served as an effective substitute for unit tests.

6. Rewrite build orchestration in a constrained language like Starlark rather than in your application language. The deliberate limitations of such languages enforce separation between build logic and application code, preventing the re-entanglement of concerns that caused the original problems.
Slack's Python build scripts had deep dependencies on backend application code. Rewriting in Starlark and standard-library-only Python scripts enforced a clean boundary between build and application concerns.
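The first insight, modeling the build as a directed acyclic graph, can be sketched with Python's standard `graphlib` module. The target names below are hypothetical, and the wave-by-wave scheduling is a simplification of what a real build system does; the sketch only shows how declared dependencies let a scheduler discover independent work on its own.

```python
from graphlib import TopologicalSorter

# Hypothetical targets; each maps to the set of targets it depends on.
# Note there is no edge between the backend and frontend subgraphs.
graph = {
    "protobuf": set(),
    "cython_ext": set(),
    "python_backend": {"protobuf", "cython_ext"},
    "editor_ts": set(),
    "editor_css": set(),
    "editor_bundle": {"editor_ts", "editor_css"},
}

def parallel_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group targets into waves; everything in a wave can run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = parallel_batches(graph)
for wave in batches:
    print(wave)
# ['cython_ext', 'editor_css', 'editor_ts', 'protobuf']
# ['editor_bundle', 'python_backend']
```

Nothing in the build steps themselves mentions parallelism; it falls out of the declared edges.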

Common Pitfalls

1. Creating coarse-grained cache keys that invalidate too frequently. When a build target takes in all source files as inputs, changing any single file invalidates the entire cache entry. This is like caching a function that takes a list of 100 items: adding one item means recalculating everything from scratch.

Break build targets into smaller units with focused input sets. Cache at the most granular level possible (individual bundle builds, not all bundles at once) to maximize cache hit rate.

2. Implementing custom parallelization inside build scripts that run within a build system like Bazel. This creates competing orchestrators fighting over the same CPU resources, and the build script may parallelize work that the build system already knows it doesn't need to do.

Delegate parallelization to the build system and keep build scripts focused on single-concern business logic. This also enables the build system to scale parallelization across machines rather than being limited to local cores.

3. Coupling frontend and backend build graphs through transitive dependencies when they could be independent. Slack's Python backend was a transitive input to every frontend bundle, meaning any Python change triggered a full frontend rebuild even though the frontend didn't actually depend on the backend output.

Audit your build graph for unnecessary dependency edges between major subsystems. Just because code historically evolved together doesn't mean the builds need to be coupled.

4. Assuming that simply adopting Bazel will automatically speed up your build. Without well-defined dependency graphs, hermetic build steps, and granular cache keys, Bazel's cache can return few or no hits, and its parallelization adds little over existing ad-hoc approaches.

Invest in engineering work to prepare your codebase first: define clear inputs and outputs, ensure hermeticity and idempotency, and separate concerns before expecting build system magic.

5. Mixing build orchestration code with application code, making it impossible to reason about build dependencies or refactor the build independently. Slack's build code used Python multiprocessing, async routines from their core codebase, and direct dependencies on backend modules.

Use constrained build-specific languages (like Starlark) and ensure build scripts depend only on standard libraries, enforcing a clean separation between build and application concerns.

Related Concepts

Directed Acyclic Graphs

Build System Optimization

Cache Invalidation Strategies

Separation of Concerns

Hermetic Builds

Idempotent Build Steps

Remote Build Execution

Bazel Remote Caching

Monorepo Build Strategies

Developer Experience

CI/CD Pipeline Optimization

Build Graph Analysis

Starlark Build Rules

Software Layering Architecture

Composable Build Units