MemQ: An efficient, scalable cloud native PubSub system

Pinterest Engineering

12 min read • intermediate

Apache • Apache Kafka • Apache Pulsar • AWS • AWS S3

Overview

MemQ is an efficient, scalable, cloud-native PubSub system developed by Pinterest. It is designed for near-real-time data transportation and is up to 90% more cost-effective than Apache Kafka. The article covers its architecture, components, and advantages over traditional systems, emphasizing how decoupling storage from serving improves scalability.

What You'll Learn

1. How to implement a scalable PubSub system using MemQ
2. Why separating storage and serving components enhances scalability
3. How to achieve cost savings in data transportation with MemQ

Prerequisites & Requirements

Understanding of PubSub systems and cloud architecture
Familiarity with AWS services, particularly S3 (optional)

Key Questions Answered

What are the main advantages of using MemQ over Apache Kafka?

MemQ is up to 90% more cost-effective than Kafka, handles GB/s traffic, and allows for independent scaling of reads and writes without requiring expensive rebalancing. This makes it suitable for Pinterest's high-volume data transportation needs.

How does MemQ ensure data consistency and availability?

MemQ relies on Amazon S3 for storage, which guarantees that every write is replicated across at least three Availability Zones. This ensures high availability and fault tolerance, making MemQ a reliable choice for data transport.

What is the architecture of MemQ and its key components?

MemQ features a decoupled architecture with components including Clients, Brokers, a Cluster Governor, TopicProcessors, and a pluggable storage layer. This design allows for efficient data handling and scalability according to traffic demands.

How does MemQ handle data production and consumption?

MemQ uses an async dispatch model for data production, allowing non-blocking sends. For consumption, it provides a poll-based interface that retrieves data batches from the storage layer, ensuring efficient data access.
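
The produce and consume paths described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not MemQ's actual client API: the names `InMemoryStore`, `SketchProducer`, and `SketchConsumer` are hypothetical, a dictionary stands in for the storage layer, and a `queue.Queue` stands in for the notification queue that carries pointers to stored batches.

```python
import queue
from concurrent.futures import Future, ThreadPoolExecutor


class InMemoryStore:
    """Hypothetical stand-in for the pluggable storage layer."""

    def __init__(self):
        self._objects = {}

    def put(self, key, batch):
        self._objects[key] = list(batch)

    def get(self, key):
        return self._objects[key]


class SketchProducer:
    """Async dispatch: send() returns a Future immediately instead of blocking."""

    def __init__(self, store, notifications):
        self._store = store
        self._notifications = notifications
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._next_id = 0

    def send(self, batch) -> Future:
        key = f"batch-{self._next_id}"
        self._next_id += 1
        # The write happens on a worker thread; the caller is not blocked.
        return self._pool.submit(self._write, key, list(batch))

    def _write(self, key, batch):
        self._store.put(key, batch)
        # Publish a pointer (the key) so consumers can locate the batch.
        self._notifications.put(key)
        return key

    def close(self):
        self._pool.shutdown(wait=True)


class SketchConsumer:
    """Poll-based interface: each poll() returns one batch, or [] if none is ready."""

    def __init__(self, store, notifications):
        self._store = store
        self._notifications = notifications

    def poll(self, timeout=0.1):
        try:
            key = self._notifications.get(timeout=timeout)
        except queue.Empty:
            return []
        return self._store.get(key)
```

A caller would hold on to the returned `Future` and check it later (for example with `result()`) to confirm the write was acknowledged, while the consumer loops on `poll()`.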

Key Statistics & Figures

Cost efficiency

up to 90% cheaper

MemQ has proven to be significantly more cost-effective than an equivalent Kafka deployment.

End-to-End latency

p99 E2E latency of 30s

This latency is achieved with AWS S3 storage, and efforts are ongoing to reduce it further.

Traffic handling capability

Handles GB/s traffic

This capability allows MemQ to efficiently support high-volume data transportation.

Technologies & Tools

Storage

Amazon S3

Used as the primary storage layer for MemQ, providing cost-effective and fault-tolerant storage.

Notification Queue

Kafka

Currently used as the notification queue that delivers pointers telling consumers where their data is stored.

Key Actionable Insights

1. Utilize MemQ's decoupled architecture to enhance your cloud-native applications.
By separating storage and serving components, you can independently scale your application based on traffic needs, which is crucial for handling varying workloads efficiently.

2. Leverage the cost savings of MemQ to optimize your data transportation strategies.
With MemQ being up to 90% cheaper than Kafka, organizations can allocate resources more effectively, allowing for reinvestment in other critical areas of the business.

3. Implement micro-batching to reduce the IOPS required of the storage layer and cut costs.
MemQ's use of micro-batching allows for lower IOPS on the storage layer, which is essential for cost-effective cloud storage solutions like Amazon S3.
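
As a rough illustration of micro-batching, the sketch below buffers messages and issues one storage write per batch rather than one per message. The class name `MicroBatcher`, the `write_fn` callback, and the size-only flush threshold are assumptions made for this example; the source does not specify MemQ's actual batching parameters.

```python
class MicroBatcher:
    """Buffers messages and writes them out as one object per batch."""

    def __init__(self, write_fn, max_batch_bytes):
        self._write_fn = write_fn      # called once per flushed batch
        self._max = max_batch_bytes    # hypothetical size threshold
        self._buffer = []
        self._size = 0

    def append(self, message: bytes):
        # Flush first if adding this message would exceed the threshold.
        if self._buffer and self._size + len(message) > self._max:
            self.flush()
        self._buffer.append(message)
        self._size += len(message)

    def flush(self):
        if not self._buffer:
            return
        self._write_fn(b"".join(self._buffer))
        self._buffer = []
        self._size = 0


if __name__ == "__main__":
    writes = []
    batcher = MicroBatcher(writes.append, max_batch_bytes=400)
    for _ in range(10):
        batcher.append(b"x" * 100)
    batcher.flush()
    # Ten 100-byte messages become three storage writes instead of ten.
    print(len(writes), [len(w) for w in writes])
```

A production batcher would typically also flush on a time limit so that low-traffic topics do not wait indefinitely for a batch to fill; that is omitted here for brevity.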

Common Pitfalls

1. Overlooking the importance of decoupling storage and serving components can lead to scalability issues.

Many systems that tightly couple these components struggle under heavy loads. By adopting a decoupled architecture like MemQ, teams can ensure that their systems remain responsive and scalable.

2. Failing to consider the cost implications of IOPS can lead to budget overruns.

With cloud storage, high IOPS can significantly increase costs. MemQ's design minimizes IOPS requirements, making it a more budget-friendly option.
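
To see why write size dominates the request rate, a back-of-the-envelope calculation helps. The figures below (1 GiB/s of traffic, 1 KiB messages, 8 MiB batches) are illustrative assumptions, not numbers from the article, and `writes_per_second` is a hypothetical helper.

```python
def writes_per_second(throughput_bytes_per_s: int, bytes_per_write: int) -> int:
    """Storage writes per second needed to sustain a given throughput."""
    if bytes_per_write <= 0:
        raise ValueError("bytes_per_write must be positive")
    # Integer ceiling division: a partial write still costs one request.
    return (throughput_bytes_per_s + bytes_per_write - 1) // bytes_per_write


if __name__ == "__main__":
    one_gib = 2**30
    # One write per 1 KiB message: 1,048,576 writes per second.
    print(writes_per_second(one_gib, 2**10))
    # One write per 8 MiB batch: 128 writes per second.
    print(writes_per_second(one_gib, 2**23))
```

Since object stores commonly bill per request, cutting the request rate by several orders of magnitude at the same throughput is where the savings come from.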

Related Concepts

PubSub Systems

Cloud-native Architecture

Data Transportation Strategies

Micro-batching Techniques