Building Pinterest Canvas, a text-to-image foundation model

Pinterest Engineering

•

Pinterest Engineering

•10 min read•intermediate•

--

•View Original

CLIPDiffusion ModelsEmbeddingFine-tuningReinforcement LearningTransformer

Overview

The article discusses the development of Pinterest Canvas, a text-to-image foundation model aimed at enhancing existing images and products on the Pinterest platform. It details the training process, the model's architecture, and the fine-tuning techniques used to generate photorealistic backgrounds and perform inpainting tasks.

What You'll Learn

1

How to train a latent diffusion model for image generation

2

Why conditioning on existing images enhances model performance

3

How to implement inpainting techniques using product masks

4

When to apply reinforcement learning to improve image generation diversity

Prerequisites & Requirements

Understanding of latent diffusion models and image generation techniques
Familiarity with machine learning frameworks for model training(optional)

Key Questions Answered

What is Pinterest Canvas and how does it work?

Pinterest Canvas is a text-to-image foundation model designed to enhance existing images and products by generating new backgrounds. It operates using a latent diffusion model trained on over 1.5 billion high-quality text-image pairs, allowing it to create cohesive and appealing visuals based on product masks and conditioning images.

How does the fine-tuning process for background generation work?

The fine-tuning process involves training the base Pinterest Canvas model to fill in missing parts of images, known as inpainting. This is achieved by providing a target image with missing portions and a binary mask, guiding the model to generate a complete image that matches the target while preserving product integrity.

What improvements are planned for Pinterest Canvas in the future?

Future improvements include upgrading the diffusion backbone to a modern Transformer architecture for better resolution and fidelity, rethinking the binary-masking approach to allow for more dynamic background generation, and incorporating enhanced visual embeddings for improved conditioning.

Why is filtering training data important for Pinterest Canvas?

Filtering training data is crucial to ensure high-quality images that adhere to trust and safety standards. Pinterest Canvas employs rigorous data filtering to maintain a diverse dataset of over 1.5 billion text-image pairs, which helps the model generate visually appealing and inspiring outputs.

Key Statistics & Figures

Number of high-quality text-image pairs used for training

1.5 billion

This extensive dataset ensures that Pinterest Canvas can generate appealing and cohesive images.

Technologies & Tools

Machine Learning

Latent Diffusion Model

Used as the backbone for image generation in Pinterest Canvas.

Machine Learning

Variational Autoencoder (vae)

Operates in latent space for efficient image generation and decoding.

Machine Learning

Clip

Used for encoding text captions to condition the image generation process.

Machine Learning

Unified Visual Embedding (uve)

Enhances the conditioning process for generating images by providing a richer visual understanding.

Key Actionable Insights

1
Implementing a two-stage training process for your image generation models can significantly enhance performance.
By separating the initial training from the fine-tuning phase, you allow the model to adapt gradually to complex tasks like inpainting, which can lead to better results in generating realistic images.

2
Using high-quality and diverse training data is essential for developing robust machine learning models.
The article highlights that Pinterest Canvas was trained on over 1.5 billion high-quality text-image pairs, emphasizing the importance of data quality in achieving superior model performance.

3
Incorporating user feedback into model training can improve the relevance and quality of generated images.
The article mentions plans to gather customer input through Pinterest Ad Labs, which can help refine the model's outputs based on real user preferences.

Common Pitfalls

1

Failing to adequately filter training data can lead to poor model performance.

Without rigorous data filtering, models may generate low-quality or inappropriate images, which can undermine user trust and satisfaction.

2

Neglecting the importance of conditioning inputs can limit the model's effectiveness.

The article emphasizes that conditioning on existing images and using detailed captions are crucial for guiding the model to produce high-quality outputs.

Related Concepts

Image Generation Techniques

Latent Diffusion Models

Inpainting Methods

Machine Learning Model Training