Tangle: An open-source ML experimentation platform built at Shopify scale

Tangle saves months of compute time, makes every experiment automatically reproducible, and allows teammates to share computation without coordination.

Shopify Engineering

Overview

Shopify open-sources Tangle, an ML experimentation platform built to solve six common failure modes in machine learning development. The platform features visual pipeline editing, content-based caching, and a language-agnostic architecture. It has been battle-tested at the scale of Shopify's Search & Discovery, processing millions of queries across billions of products while saving over a year of compute time.

What You'll Learn

1
How to build ML pipelines visually using a drag-and-drop DAG interface instead of writing notebook code
2
Why content-based caching outperforms lineage-based caching for ML experimentation workflows
3
How to wrap existing CLI programs in any language as reusable pipeline components using YAML specifications
4
How to deploy Tangle on HuggingFace Spaces or locally with Docker for ML experimentation
5
Why platform-agnostic, language-neutral architecture eliminates dependency conflicts in ML pipelines

Prerequisites & Requirements

  • Basic understanding of ML workflows including data preparation, model training, and evaluation
  • Familiarity with directed acyclic graphs (DAGs) and pipeline concepts (optional)
  • Docker or Podman for local installation
  • uv package manager for local installation
  • HuggingFace Pro subscription ($9/month) for cloud execution (optional)

Key Questions Answered

What is Tangle and how does it solve ML experimentation problems?
Tangle is an open-source, platform-agnostic ML experimentation platform built by Shopify that addresses six common failure modes: untracked queries, unstructured notebooks, repeated data preparation, irreproducible results, slow deployment, and lack of sharing. It uses visual pipelines, content-based caching, and language-neutral components to let teams build, execute, and share ML workflows through a drag-and-drop interface.
How does content-based caching differ from lineage-based caching in ML pipelines?
Lineage-based caching invalidates all downstream components when upstream components change. Content-based caching, used by Tangle, checks actual output content hashes instead. Downstream components reuse cached results when outputs remain identical regardless of upstream changes. This means a 10-hour pipeline can complete in 20 minutes when only one component changes, and the cache operates globally across all team members.
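The difference between the two strategies can be illustrated with a minimal sketch. This is not Tangle's implementation: the function names, the `name@version` strings, and the key layout are all assumptions made for illustration. It only shows why a key built from output content survives an upstream change that a key built from lineage does not.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def lineage_key(component_version: str, upstream_keys: list[str]) -> str:
    """Hypothetical key built from the identity of everything upstream."""
    return sha256_hex("|".join([component_version, *upstream_keys]).encode())


def content_key(component_version: str, input_hashes: list[str]) -> str:
    """Hypothetical key built from hashes of the actual input data."""
    return sha256_hex("|".join([component_version, *input_hashes]).encode())


# Two versions of an upstream cleaning step that emit identical bytes.
output_v1 = b"id,price\n1,9.99\n"
output_v2 = b"id,price\n1,9.99\n"  # refactored code, same output

# Lineage-based: the upstream version is part of the key, so the key changes.
lineage_a = lineage_key("train@1", [lineage_key("clean@1", [])])
lineage_b = lineage_key("train@1", [lineage_key("clean@2", [])])

# Content-based: only the output bytes matter, so the key is unchanged.
content_a = content_key("train@1", [sha256_hex(output_v1)])
content_b = content_key("train@1", [sha256_hex(output_v2)])

print("lineage cache hit:", lineage_a == lineage_b)  # False
print("content cache hit:", content_a == content_b)  # True
```

In this sketch the downstream training step is re-run under lineage-based keys even though its input bytes are unchanged, while the content-based key lets it reuse the earlier result.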
How do Tangle components work with any programming language?
Tangle components wrap arbitrary containerized CLI programs that read and write files, without requiring framework-specific code modifications. Components are defined as YAML specifications describing metadata, inputs/outputs, and a templated command-line. This supports Python, Shell, JavaScript, C#, C++, Rust, Java, Go, R, or any language capable of CLI execution, enabling multi-language pipelines without compatibility issues.
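As a rough illustration of a specification with metadata, inputs/outputs, and a templated command line, the sketch below models a component as a Python dictionary rather than YAML. The field names (`name`, `inputs`, `outputs`, `image`, `command`), the placeholder shapes (`input_path`, `output_path`), and the file paths are hypothetical and are not taken from Tangle's schema.

```python
# Hypothetical spec layout; field names are illustrative, not Tangle's schema.
component_spec = {
    "name": "Train model",
    "inputs": ["training_data"],
    "outputs": ["model"],
    "image": "python:3.12",
    "command": [
        "python", "train.py",
        "--data", {"input_path": "training_data"},
        "--model-out", {"output_path": "model"},
    ],
}


def render_command(
    spec: dict,
    input_paths: dict[str, str],
    output_paths: dict[str, str],
) -> list[str]:
    """Replace path placeholders with concrete local file locations."""
    rendered = []
    for token in spec["command"]:
        if isinstance(token, str):
            rendered.append(token)
        elif "input_path" in token:
            rendered.append(input_paths[token["input_path"]])
        elif "output_path" in token:
            rendered.append(output_paths[token["output_path"]])
        else:
            raise ValueError(f"unknown placeholder: {token}")
    return rendered


argv = render_command(
    component_spec,
    input_paths={"training_data": "/tmp/inputs/training_data/data"},
    output_paths={"model": "/tmp/outputs/model/data"},
)
print(argv)
```

Because the wrapped program only sees ordinary command-line arguments, it could be written in any language that can read and write files.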
How do you get started with Tangle on HuggingFace?
Visit the Tangle quick-start page on HuggingFace Spaces to start building pipelines immediately without registration. Creating pipelines is free, but running them requires a HuggingFace Pro subscription ($9/month). You can use the sample XGBoost training pipeline or build from scratch by dragging components, connecting outputs to inputs, configuring arguments, and submitting for execution.
How does Tangle handle data flow between pipeline components?
Tangle components communicate through file paths rather than in-memory objects. A producer writes to a local path, the system uploads artifacts to cloud storage (GCS, S3, etc.), and consumers read from local paths with the system transparently retrieving data. Placeholders in component specifications are replaced with actual file locations at runtime, keeping storage abstracted from component logic.
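The producer/consumer hand-off described above can be sketched with a local directory standing in for cloud storage. The class `LocalArtifactStore` and the functions `producer` and `consumer` are hypothetical names for illustration; a real deployment would upload to GCS, S3, or similar instead of copying files locally.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class LocalArtifactStore:
    """Stand-in for cloud storage: a directory keyed by content hash."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def upload(self, local_path: Path) -> str:
        digest = hashlib.sha256(local_path.read_bytes()).hexdigest()
        shutil.copyfile(local_path, self.root / digest)
        return digest

    def download(self, digest: str, local_path: Path) -> None:
        local_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(self.root / digest, local_path)


def producer(output_path: Path) -> None:
    # The component only knows about a local path it was told to write to.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text("id,price\n1,9.99\n")


def consumer(input_path: Path) -> int:
    # The component only knows about a local path it was told to read from.
    return len(input_path.read_text().splitlines()) - 1  # rows minus header


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    store = LocalArtifactStore(work / "store")

    produced = work / "producer" / "out" / "data"
    producer(produced)
    artifact_id = store.upload(produced)  # system step, invisible to producer

    consumed = work / "consumer" / "in" / "data"
    store.download(artifact_id, consumed)  # system step, invisible to consumer
    row_count = consumer(consumed)

print(row_count)  # 1
```

Neither component function references the store, which is the point of the design: storage can change without touching component logic.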
What is Tangle's execution flow when a pipeline is submitted?
When submitted, Tangle queues tasks, checks upstream dependencies, calculates execution cache keys to find reusable executions (including still-running ones), then either reuses cached results or launches containers in a cloud cluster. The orchestrator monitors container status, captures logs, updates execution state, stores output artifact metadata including size and content hash, and signals downstream tasks automatically.
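A simplified, in-memory sketch of the cache lookup in this flow follows, including reuse of an execution that is still running. The `Orchestrator` and `Execution` classes, the status strings, and the task names are assumptions for illustration; container launching, log capture, and dependency checks are left out.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Execution:
    cache_key: str
    status: str = "RUNNING"  # becomes "SUCCEEDED" on completion
    output_hash: Optional[str] = None
    waiters: list[str] = field(default_factory=list)


class Orchestrator:
    def __init__(self) -> None:
        self.executions: dict[str, Execution] = {}
        self.launched = 0  # number of containers that would be started

    @staticmethod
    def cache_key(component_digest: str, input_hashes: list[str]) -> str:
        payload = "|".join([component_digest, *input_hashes])
        return hashlib.sha256(payload.encode()).hexdigest()

    def submit(
        self, task_id: str, component_digest: str, input_hashes: list[str]
    ) -> Execution:
        key = self.cache_key(component_digest, input_hashes)
        execution = self.executions.get(key)
        if execution is None:
            # No reusable execution: this is where a container would launch.
            execution = Execution(cache_key=key)
            self.executions[key] = execution
            self.launched += 1
        execution.waiters.append(task_id)
        return execution

    def complete(self, execution: Execution, output: bytes) -> list[str]:
        execution.status = "SUCCEEDED"
        execution.output_hash = hashlib.sha256(output).hexdigest()
        return list(execution.waiters)  # tasks whose downstream is signalled


orch = Orchestrator()
raw = hashlib.sha256(b"raw data").hexdigest()
first = orch.submit("alice/preprocess", "preprocess@abc", [raw])
second = orch.submit("bob/preprocess", "preprocess@abc", [raw])  # still running
notified = orch.complete(first, b"clean data")
third = orch.submit("carol/preprocess", "preprocess@abc", [raw])  # finished

print(orch.launched, first is second, third.status, notified)
```

Three submissions with the same component and inputs result in a single launch here; the second attaches to the running execution and the third reuses the finished one.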
How does Tangle ensure ML experiment reproducibility?
Every pipeline run is recorded with complete lineage including graph structure, execution logs, artifact metadata, and metrics. Intermediate data is immutable and never overwritten. Team members can clone any colleague's pipeline run, investigate issues, modify parameters, and resubmit. Components are versioned independently by content hash, allowing exact version references and eliminating dependency conflicts.
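One way to picture versioning by content hash is to hash a canonical serialization of the component specification. The sketch below assumes canonical JSON as the serialization and uses made-up spec fields; how Tangle actually derives component hashes is not stated here.

```python
import hashlib
import json


def component_digest(spec: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the digest."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical component specs, for illustration only.
spec_v1 = {
    "name": "Train model",
    "image": "python:3.12",
    "command": ["python", "train.py", "--depth", "6"],
}
spec_v1_reordered = {
    "command": ["python", "train.py", "--depth", "6"],
    "image": "python:3.12",
    "name": "Train model",
}
spec_v2 = dict(spec_v1, command=["python", "train.py", "--depth", "8"])

digest_v1 = component_digest(spec_v1)
digest_v1_reordered = component_digest(spec_v1_reordered)
digest_v2 = component_digest(spec_v2)

print(digest_v1 == digest_v1_reordered)  # True: same content, same version
print(digest_v1 == digest_v2)            # False: changed content, new version
```

Under this scheme two pipelines can pin `digest_v1` and `digest_v2` side by side, since each digest identifies exactly one spec.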
How do you deploy a private Tangle instance on HuggingFace?
Duplicate the Tangle HuggingFace Space to your account and provide an HF token to create a single-tenant instance. The database is stored in your own HF Space persistent storage for complete data isolation. Cloning to an organization creates a single-tenant multi-user deployment where team members can see each other's pipeline runs and share org-wide cache.

Key Statistics & Figures

Time spent on data engineering vs. algorithms in ML development
80%
Referenced from arXiv research on the 80/20 ML rule
Total compute time savings since adopting Tangle
More than 1 year
Accumulated across Shopify teams using Tangle
Pipeline completion time with content-based caching when one component changes
20 minutes
Down from 10 hours
HuggingFace Pro subscription cost for running Tangle pipelines
$9/month
Required for executing pipelines on HuggingFace

Technologies & Tools


Configuration
YAML
Component specification format for defining pipeline building blocks
Containerization
Docker
Container runtime for local Tangle installation and component execution
Containerization
Podman
Alternative container runtime for local Tangle installation
Programming Language
Python
Supported component language with Lightweight Python Component Generator feature
Programming Language
JavaScript
Supported component language for inline script components
ML Framework
XGBoost
Sample training pipeline provided for getting started
Cloud Platform
HuggingFace Spaces
Multi-tenant hosting platform for Tangle deployment
Compute
HuggingFace Jobs
Execution environment for user pipeline runs
Database
SQLite
Per-tenant database storage in HuggingFace deployment
Cloud Storage
GCS
Artifact storage option for pipeline outputs
Cloud Storage
S3
Artifact storage option for pipeline outputs
Cloud Platform
GCP
Supported cloud provider (already supported, needs deployment documentation)
ML Framework
TensorFlow
Referenced as example of reusable component knowledge (training loop components)
Package Manager
uv
Required for local Tangle installation

Key Actionable Insights

1
Adopt content-based caching over lineage-based caching for ML pipelines to dramatically reduce redundant compute. When upstream components change but produce identical outputs, downstream tasks automatically reuse cached results, turning 10-hour pipeline reruns into 20-minute executions.
This is especially impactful for teams where multiple data scientists run experiments sharing common preprocessing steps, as the global cache eliminates thousands of redundant compute hours monthly.
2
Design ML pipeline components as pure functions with deterministic behavior: identical inputs should always produce identical outputs with no side effects. This enables effective caching, artifact reuse, and safe sharing across teams without unexpected state mutations.
Tangle enforces this by defining components as YAML specifications that wrap containerized CLI programs reading and writing files, ensuring complete isolation between executions.
3
Use language-agnostic component architecture to eliminate dependency conflicts in ML workflows. By wrapping existing CLI programs in container specifications rather than requiring framework-specific code, teams can mix Python, Java, JavaScript, Rust, and other languages in a single pipeline without compatibility issues.
This approach also means existing codebases can be integrated without modification, reducing the barrier to adoption and allowing gradual migration of existing workflows.
4
Implement visual DAG-based pipeline editors to make ML experimentation accessible to non-engineers. Product managers and analysts can create and run pipelines without writing code, enabling them to run experiments and track metrics independently while engineers focus on component development.
Tangle's drag-and-drop interface renders complete data flow as a directed acyclic graph, providing immediate visibility into pipeline structure without parsing notebook code.
5
Version ML components independently using content hashes rather than relying on package management systems. This allows teams to reference exact versions, mix different component versions in the same pipeline for comparison, and share specific component versions without dependency hell.
Unlike Python packages installed globally, YAML-based component specifications can be organized into libraries, indexed, searched, and safely loaded from any source including GitHub, web, or cloud storage.
6
Start with HuggingFace deployment for quick evaluation before investing in local or cloud infrastructure. The hosted multi-tenant service provides immediate access to Tangle's capabilities with HuggingFace handling storage, compute, and authentication at $9/month.
For teams needing data isolation, duplicating the Space to an organization account creates a private instance with shared org-wide cache while maintaining complete control over data.

Common Pitfalls

1
Using lineage-based caching instead of content-based caching in ML pipelines. Lineage-based approaches force all downstream components to re-execute whenever any upstream component changes, even when the actual output data hasn't changed. This leads to massive redundant compute waste.
Tangle's content-based caching checks actual output content hashes, so downstream components only re-execute when their inputs truly differ, reducing 10-hour pipelines to 20-minute runs.
2
Tracking experiments manually through notebook versioning and custom query logs. Engineers forget which notebook version, data source, or parameters they used, making reproduction impossible and wasting hours on duplicate runs.
Tangle automatically records complete lineage for every pipeline run including graph structure, execution logs, artifact metadata, and metrics, with immutable intermediate data that is never overwritten.
3
Requiring all pipeline components to be written in the same language or framework. This creates dependency conflicts and forces teams to rewrite existing tools, increasing adoption friction and limiting the ability to use the best tool for each task.
Tangle's language-agnostic approach wraps arbitrary CLI programs in container specifications, allowing Python, Java, Shell, Rust, C++, and JavaScript components to coexist in the same pipeline.
4
Building ML pipelines that only handle data processing or only handle training, requiring separate tools for end-to-end workflows. This fragmentation leads to integration complexity and lost context between pipeline stages.
Tangle supports data processing, ML training, model deployment, human evaluation, and any unconventional processing in a single pipeline, acting as glue between mismatched tools and languages.
5
Not sharing pipeline runs and components across team members. When each data scientist works in isolation with their own notebooks and scripts, teams duplicate effort and miss opportunities for collaboration and cache reuse.
Tangle's global cache operates across all users: when multiple scientists share preprocessing steps, execution happens once and all pipelines share the artifact, even for still-running executions.

Related Concepts

Directed Acyclic Graphs (DAGs)
ML Pipeline Orchestration
Content-addressable Storage
Container Orchestration
MLOps
Experiment Tracking And Reproducibility
Feature Engineering
Semantic Search
Product Ranking Models
Recommendation Systems
Data Pipeline Caching
Multi-tenant Architecture
Component-based Architecture
Artifact Management
Infrastructure As Code