Detecting Image Similarity in (Near) Real-time Using Apache Flink

Pinterest Engineering

Overview

This article describes how Pinterest built a near-real-time image similarity detection system on Apache Flink. It covers the challenges of processing a massive volume of images and the design choices that achieve sub-second latency while maintaining accuracy.

What You'll Learn

1

How to implement a near-real-time image similarity detection pipeline using Apache Flink

2

Why using a streaming pipeline reduces latency for image processing tasks

3

How to leverage Locality-Sensitive Hashing (LSH) for efficient image similarity searches

4

When to apply machine learning models for evaluating image similarity

Prerequisites & Requirements

Understanding of image processing and machine learning concepts
Familiarity with Apache Flink and TensorFlow (optional)

Key Questions Answered

How does Pinterest detect image similarity in near-real-time?

Pinterest uses a streaming pipeline built on Apache Flink to detect image similarity in near-real-time. The system processes images as they are created, leveraging Locality-Sensitive Hashing (LSH) and a TensorFlow-based classifier to identify duplicates and similar images efficiently.
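
As a rough illustration of how an image embedding can be turned into LSH terms, here is a minimal sketch using random-hyperplane (sign) projections. This is an assumption made for illustration: the summary does not say which LSH family Pinterest uses, and the names `make_hyperplanes`, `lsh_terms`, and `bits_per_term` are hypothetical.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplanes (hypothetical helper, fixed seed for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_terms(embedding, hyperplanes, bits_per_term=4):
    """Hash an embedding to a list of string terms.

    Each bit records which side of a hyperplane the embedding falls on.
    Bits are grouped into bands, and each band becomes one searchable term.
    """
    bits = [
        "1" if sum(w * x for w, x in zip(plane, embedding)) >= 0 else "0"
        for plane in hyperplanes
    ]
    terms = []
    for start in range(0, len(bits), bits_per_term):
        band = "".join(bits[start:start + bits_per_term])
        # Prefix with the band index so identical bit patterns in
        # different bands do not collide.
        terms.append(f"{start // bits_per_term}:{band}")
    return terms
```

Embeddings that point in similar directions tend to fall on the same side of most hyperplanes, so they share many terms. Only the direction matters here: scaling an embedding by a positive factor leaves its terms unchanged.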

What are the main challenges in detecting duplicate images at scale?

The main challenges are scale and speed: the system must cover roughly 300 billion Pins while new images arrive at about 100 per second (200 at peak). It must also keep latency low without sacrificing accuracy or robustness.
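
To put those figures in perspective, here is a back-of-the-envelope calculation. The rates and the Pin count come from the text above; the brute-force comparison is a hypothetical worst case added for illustration, not something the article describes.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def images_per_day(rate_per_second):
    """Images created in a day at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY


average_daily = images_per_day(100)  # 8,640,000 images/day at the average rate
peak_ceiling = images_per_day(200)   # 17,280,000 images/day if the peak rate were sustained

# Hypothetical worst case: compare every new image against every Pin.
PINS = 300_000_000_000
brute_force_per_second = 100 * PINS  # 3e13 comparisons per second
```

Numbers of this size are why a cheap candidate-generation step is needed before any expensive comparison.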

What is the role of the Manas search engine in the image similarity detection process?

Manas is Pinterest's custom search engine that identifies potential candidates for image similarity by matching Locality-Sensitive Hashing (LSH) terms. It is optimized for correctness over latency, ensuring that the most relevant results are returned based on overlapping terms.
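
Candidate lookup by overlapping terms can be pictured as an inverted index. The sketch below is a toy in-memory stand-in, not Manas itself; the class `LshTermIndex` and the parameters `min_overlap` and `limit` are hypothetical.

```python
from collections import Counter, defaultdict


class LshTermIndex:
    """Toy inverted index: LSH term -> set of image ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, image_id, terms):
        for term in terms:
            self._postings[term].add(image_id)

    def candidates(self, terms, min_overlap=2, limit=10):
        """Return (image_id, shared_term_count) pairs, most overlap first."""
        overlap = Counter()
        for term in set(terms):
            for image_id in self._postings.get(term, ()):
                overlap[image_id] += 1
        # Sort by overlap descending, then by id for a stable order.
        ranked = sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0]))
        return [(i, n) for i, n in ranked if n >= min_overlap][:limit]
```

A short usage example:

```python
index = LshTermIndex()
index.add("img1", ["t1", "t2", "t3"])
index.add("img2", ["t1", "t9"])
index.add("img3", ["t2", "t3", "t4"])
print(index.candidates(["t1", "t2", "t3"]))  # [('img1', 3), ('img3', 2)]
```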

How does Pinterest ensure the operability of its image similarity pipeline?

Pinterest has built tools for debugging, monitoring, and handling failures within the image similarity pipeline. These include special debugging data propagation, custom metrics for monitoring pipeline health, and rollback tools that revert to a stable state after a failure.
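
One way a custom health metric might look is a rolling rate over recent events, for example the share of new images for which no candidates were found. The metric, the class `RollingRate`, and the thresholds below are hypothetical; the article does not list the metrics Pinterest actually tracks.

```python
from collections import deque


class RollingRate:
    """Fraction of flagged events over the most recent `window` events."""

    def __init__(self, window=1000):
        self._events = deque(maxlen=window)

    def __len__(self):
        return len(self._events)

    def record(self, flagged):
        self._events.append(bool(flagged))

    def rate(self):
        return sum(self._events) / len(self._events) if self._events else 0.0


def is_unhealthy(metric, max_rate=0.5, min_samples=100):
    """Alert only once enough samples exist and the rate exceeds the limit."""
    return len(metric) >= min_samples and metric.rate() > max_rate
```

Requiring a minimum sample count avoids alerting on the first few events after a restart, when the rate is still noisy.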

Key Statistics & Figures

Number of Pins saved across Pinterest

300B

This figure illustrates the vast scale of data that the image similarity detection system must handle.

Rate of image creation per second

~100 (200 at peak)

Average number of cluster members

6, but as high as 1.1M for a handful of clusters

This variance in cluster size poses challenges for maintaining efficient image similarity detection.

Technologies & Tools

Backend

Apache Flink

Used to build the streaming pipeline for real-time image similarity detection.

Machine Learning

TensorFlow

Employed for evaluating the set of candidate images to determine similarity.

Search Engine

Manas

Pinterest's custom search engine used for finding potential candidates through LSH term matching.

Messaging

Kafka

Used for notifications and data propagation within the pipeline.

Graph Storage

Zen

Pinterest's graph storage system used for maintaining complex relationships between images.

Key Actionable Insights

1
Implementing a streaming pipeline can significantly reduce latency in data processing tasks.
This approach is particularly beneficial for applications requiring real-time data analysis, such as image similarity detection, where delays can lead to negative user experiences.

2
Utilizing Locality-Sensitive Hashing (LSH) can enhance the efficiency of similarity searches in large datasets.
By applying LSH, you can quickly narrow down potential matches before applying more computationally expensive evaluations, which is crucial in high-volume environments.

3
Incorporating robust monitoring and alerting systems is essential for maintaining the health of complex data pipelines.
This ensures that any issues can be detected and addressed promptly, minimizing downtime and maintaining service quality.
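
The second insight above, narrowing candidates cheaply before running an expensive evaluation, could be sketched as follows. Cosine similarity stands in for the TensorFlow-based classifier mentioned in the article, and the linear scan in stage 1 stands in for a real term index; `find_similar`, `min_overlap`, and `threshold` are hypothetical names and values.

```python
import math


def cosine(a, b):
    """Cosine similarity; a stand-in for a learned similarity model."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(query_embedding, query_terms, corpus, min_overlap=2, threshold=0.9):
    """corpus maps image_id -> (lsh_terms, embedding).

    Returns (best_image_id_or_None, shortlist_size).
    """
    query_terms = set(query_terms)
    # Stage 1: cheap filter on the number of shared LSH terms.
    shortlist = [
        (image_id, embedding)
        for image_id, (terms, embedding) in corpus.items()
        if len(query_terms & set(terms)) >= min_overlap
    ]
    # Stage 2: expensive scoring, run only on the shortlist.
    best_id, best_score = None, threshold
    for image_id, embedding in shortlist:
        score = cosine(query_embedding, embedding)
        if score >= best_score:
            best_id, best_score = image_id, score
    return best_id, len(shortlist)
```

The shortlist size is returned alongside the match so that the cost of the expensive stage can be observed directly.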

Common Pitfalls

1

Failing to account for the scale of data can lead to performance bottlenecks.

When designing systems for large datasets, it's crucial to ensure that the architecture can handle peak loads without degradation in performance.

2

Neglecting the importance of monitoring can result in undetected failures.

Without proper monitoring, issues may go unnoticed until they significantly impact user experience, making it essential to implement robust alerting mechanisms.

Related Concepts

Real-time Data Processing

Machine Learning for Image Recognition

Locality-Sensitive Hashing (LSH)

Scalable Search Engine Design