Designing Schemaless, Uber Engineering’s Scalable Datastore Using MySQL

Jakob Holdgaard Thomsen
14 min read · intermediate

Overview

This article discusses the design and implementation of Schemaless, Uber Engineering's scalable datastore built on MySQL. It outlines the motivations behind creating a custom solution, the key requirements for the datastore, and the architectural decisions made to support Uber's growing data needs.

What You'll Learn

1. How to design a scalable datastore that meets specific business needs
2. Why operational trust is critical when selecting a database solution
3. How to implement triggers for data change notifications in a datastore

Prerequisites & Requirements

  • Understanding of database design principles
  • Experience with MySQL or similar databases (optional)

Key Questions Answered

What were the key requirements for Uber's new datastore?
The key requirements for Uber's new datastore included the ability to linearly add capacity, ensure write availability, notify downstream dependencies, support secondary indexes, and maintain operational trust. These requirements were essential to accommodate the rapid growth of trip data and ensure reliable operations.
How does Schemaless handle data change notifications?
Schemaless uses a feature called triggers to notify downstream systems of data changes. Applications can then respond to changes in real time, which decouples data creation from data processing, much as an event bus does.
What is the data model used in Schemaless?
The Schemaless data model is an append-only sparse three-dimensional persistent hash map, allowing for JSON blobs to be stored without strict schema validation. Each data entity, called a cell, is immutable and versioned, enabling efficient data management and retrieval.
What indexing strategies does Schemaless use?
Schemaless lets users define indexes over fields in the JSON blobs. Each index is sharded by a designated shard field, so a query that supplies that field's value can be served from a single shard instead of fanning out to all of them.
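
The data model described above can be illustrated with a small in-memory sketch. It is not Schemaless code: the class `CellStore`, its method names, and the choice of a row key, a column name, and an integer version (`ref_key`) as the three dimensions are assumptions made for illustration, and a Python dict stands in for MySQL.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical key layout: (row_key, column_name, ref_key).
CellKey = Tuple[str, str, int]


class CellStore:
    """In-memory stand-in for an append-only, sparse, three-dimensional map."""

    def __init__(self) -> None:
        self._cells: Dict[CellKey, str] = {}

    def put_cell(self, row_key: str, column_name: str, ref_key: int, body: dict) -> None:
        key = (row_key, column_name, ref_key)
        if key in self._cells:
            # Cells are immutable: an existing version is never overwritten.
            raise ValueError(f"cell {key} already exists")
        # The body is stored as an opaque JSON blob; no schema is enforced.
        self._cells[key] = json.dumps(body)

    def get_cell(self, row_key: str, column_name: str, ref_key: int) -> Optional[dict]:
        blob = self._cells.get((row_key, column_name, ref_key))
        return None if blob is None else json.loads(blob)

    def get_cell_latest(self, row_key: str, column_name: str) -> Optional[dict]:
        # The newest version is the one with the highest ref_key.
        versions = [k[2] for k in self._cells if k[0] == row_key and k[1] == column_name]
        if not versions:
            return None
        return self.get_cell(row_key, column_name, max(versions))


store = CellStore()
store.put_cell("trip-1", "BASE", 1, {"status": "requested"})
store.put_cell("trip-1", "BASE", 2, {"status": "completed"})  # an update is a new version
latest = store.get_cell_latest("trip-1", "BASE")
```

Updating a cell means appending a new version rather than modifying the old one, so earlier versions remain readable, and the map is sparse because a row only holds the columns that were actually written.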

Key Actionable Insights

1. When designing a scalable datastore, prioritize operational trust by choosing technologies your team is familiar with.
   Operational trust is crucial for mission-critical applications, as it ensures that your team can quickly resolve issues that arise in production environments.
2. Implement triggers in your datastore to facilitate real-time data processing and notifications.
   Triggers can decouple data creation from processing, allowing for more flexible and responsive application architectures.
3. Consider denormalizing data in your indexes to enhance query performance.
   Denormalization can significantly speed up data retrieval times, especially in systems where read performance is critical.
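
A denormalized index sharded by a designated field can be sketched as follows. Everything here is hypothetical: the names `ShardedIndex`, `shard_for`, and `NUM_SHARDS`, the example fields `driver_uuid`, `city`, and `fare`, and the use of CRC32 to pick a shard are illustrative assumptions, not the Schemaless implementation.

```python
import zlib
from typing import Dict, List

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(value: str) -> int:
    # Stable hash, so the same shard-field value always maps to the same shard.
    return zlib.crc32(value.encode("utf-8")) % NUM_SHARDS


class ShardedIndex:
    """Index entries copy selected fields out of the JSON body (denormalization)."""

    def __init__(self, shard_field: str, copied_fields: List[str]) -> None:
        self.shard_field = shard_field
        self.copied_fields = copied_fields
        self.shards: List[List[Dict]] = [[] for _ in range(NUM_SHARDS)]

    def add(self, row_key: str, body: Dict) -> None:
        shard_value = str(body[self.shard_field])
        entry = {"row_key": row_key, self.shard_field: shard_value}
        for field in self.copied_fields:
            entry[field] = body.get(field)  # copied so reads need no second lookup
        self.shards[shard_for(shard_value)].append(entry)

    def query(self, shard_value: str) -> List[Dict]:
        # Only the one shard that owns this value is scanned.
        shard = self.shards[shard_for(shard_value)]
        return [e for e in shard if e[self.shard_field] == shard_value]


index = ShardedIndex("driver_uuid", ["city", "fare"])
index.add("trip-1", {"driver_uuid": "driver-a", "city": "SF", "fare": 12.5})
index.add("trip-2", {"driver_uuid": "driver-b", "city": "NYC", "fare": 30.0})
index.add("trip-3", {"driver_uuid": "driver-a", "city": "SF", "fare": 8.0})
results = index.query("driver-a")
```

Because `city` and `fare` are copied into each index entry, the query returns them directly without fetching the underlying record; the trade-off is extra storage and the need to keep the copies up to date.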

Common Pitfalls

1. Failing to account for the need for write availability can lead to data loss during high-traffic periods.
   It's essential to design your datastore with mechanisms that prioritize write availability, especially in environments with fluctuating loads.
2. Neglecting to implement effective notification systems can result in delayed processing of critical data changes.
   Without a robust notification mechanism, downstream systems may not react promptly to important data changes, impacting overall application performance.
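
A notification mechanism of the kind discussed above can be sketched as a small trigger registry. This is a simplified, synchronous, in-process illustration: `TriggerRegistry`, its `trigger` decorator, and `notify` are hypothetical names, and a production system would typically deliver notifications asynchronously and may deliver them more than once, so handlers should be idempotent.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Tuple

# A handler receives the row key and the JSON body of the written cell.
TriggerFn = Callable[[str, Dict], None]


class TriggerRegistry:
    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[TriggerFn]] = defaultdict(list)

    def trigger(self, column_name: str) -> Callable[[TriggerFn], TriggerFn]:
        """Decorator that registers a handler for writes to one column."""
        def register(fn: TriggerFn) -> TriggerFn:
            self._handlers[column_name].append(fn)
            return fn
        return register

    def notify(self, row_key: str, column_name: str, body: Dict) -> None:
        # Called by the write path after a cell has been stored.
        for fn in self._handlers.get(column_name, []):
            fn(row_key, body)


registry = TriggerRegistry()
billed: List[Tuple[str, float]] = []


@registry.trigger("BASE")
def bill_rider(row_key: str, body: Dict) -> None:
    # Downstream work happens here, decoupled from the code that wrote the cell.
    billed.append((row_key, body["fare"]))


registry.notify("trip-1", "BASE", {"fare": 12.5})
registry.notify("trip-1", "STATUS", {"status": "paid"})  # no handler registered
```

The writer only calls `notify`; it does not need to know which downstream handlers exist, which is the decoupling the trigger approach is meant to provide.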

Related Concepts

Database Scalability
Operational Trust In Systems
Event-driven Architecture