Matt Mathew, Alexander Gulko, Lei Sun, KK Sriramadhesikan, Alan Cao, Omkar Kakade • 20 min read • advanced
Overview
This article discusses Uber's migration of its Apache Hadoop-based data lake to Google Cloud Storage (GCS) and the security measures implemented during this transition. It highlights the challenges faced, particularly in integrating existing IAM controls with GCP IAM, and the solutions developed to enhance security and scalability.
What You'll Learn
1. How to implement a layered access model for cloud data security
2. Why multi-level caching improves performance in cloud applications
3. How to manage authentication and authorization in a hybrid cloud environment
Prerequisites & Requirements
- Understanding of cloud architecture and security models
- Familiarity with Apache Hadoop and GCP services (optional)
Key Questions Answered
How did Uber migrate its data lake to Google Cloud Storage?
Uber migrated its data lake by replacing HDFS with Google Cloud Storage while maintaining the existing tech stack. This involved creating an intermediary system called Storage Access Service to handle authentication and authorization, ensuring seamless integration with GCP IAM.
What challenges did Uber face during the migration?
Uber faced challenges in adapting its existing IAM controls to work with GCP IAM, particularly in ensuring secure access to data stored in GCS while maintaining the Hadoop security model. This required bridging the differences between HDFS and GCS security models.
What is the purpose of the Storage Access Service (SAS)?
The Storage Access Service (SAS) was developed to facilitate the exchange of Kerberos and Delegation Tokens for GCP Access Tokens, enabling secure access to data in GCS while abstracting the complexities of the underlying security models from users.
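As a rough illustration of the exchange described in this answer, the sketch below validates a Hadoop-style credential and returns a short-lived access token. It is not Uber's implementation: the `AccessToken` type, the `verify_hadoop_credential` stand-in, the principal-to-service-account mapping, and the one-hour lifetime are all assumptions made for the example, and the token itself is a random string rather than a real GCP token.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessToken:
    value: str
    service_account: str
    expires_at: float  # Unix timestamp


# Hypothetical mapping from Hadoop principals to GCP service accounts.
PRINCIPAL_TO_SERVICE_ACCOUNT = {
    "etl-pipeline@EXAMPLE.COM": "etl-pipeline@example-project.iam.gserviceaccount.com",
}


def verify_hadoop_credential(credential: dict) -> str:
    """Stand-in for Kerberos ticket or Delegation Token validation.

    Returns the authenticated principal, or raises PermissionError.
    """
    if credential.get("kind") not in ("kerberos", "delegation_token"):
        raise PermissionError("unsupported credential type")
    if not credential.get("valid", False):
        raise PermissionError("credential failed validation")
    return credential["principal"]


def exchange_for_access_token(
    credential: dict, now: Optional[float] = None, ttl_seconds: int = 3600
) -> AccessToken:
    """Authenticate the caller, then hand back a short-lived token."""
    principal = verify_hadoop_credential(credential)
    service_account = PRINCIPAL_TO_SERVICE_ACCOUNT.get(principal)
    if service_account is None:
        raise PermissionError(f"no service account mapped for {principal}")
    issued_at = time.time() if now is None else now
    # A real service would request a token from GCP here; this sketch
    # fabricates an opaque random string instead.
    return AccessToken(secrets.token_urlsafe(16), service_account, issued_at + ttl_seconds)
```

The caller only ever presents the credential it already has, which is the sense in which the differences between the two security models stay hidden from users.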
How does Uber ensure performance during heavy workloads on GCP?
Uber implemented a multi-level caching strategy to improve performance and scalability, allowing the system to handle over 500,000 requests per second while keeping the request volume seen by SAS instances well below 10,000 RPS.
Key Statistics & Figures
Percentage of analytical workloads running on GCP
19%
This reflects the extent of Uber's cloud migration efforts.
Data migrated to GCS
160+ PB
This showcases the scale of Uber's data lake migration.
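Peak requests per second (RPS) served through the Storage Access Service's caching layers
500,000+ RPS
This indicates the peak load the access layer can absorb; because of multi-level caching, the SAS instances themselves see well below 10,000 RPS.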
Technologies & Tools
Backend
Apache Hadoop
Used for the existing data lake infrastructure before migration.
Storage
Google Cloud Storage
Replaced HDFS as the storage layer for Uber's data lake.
Caching
Redis
Used for caching access tokens to improve performance.
Key Actionable Insights
1. Implementing a layered access model can significantly enhance cloud security by limiting human access to sensitive data. This approach reduces the risk of unauthorized access and simplifies compliance with security policies, making it essential for organizations migrating to the cloud.
2. Utilizing multi-level caching strategies can drastically improve application performance in cloud environments. By proactively generating and caching access tokens, organizations can reduce latency and improve user experience, especially during peak usage times.
3. Integrating existing IAM controls with cloud IAM requires careful planning and execution to avoid security gaps. Organizations should assess their current IAM policies and adapt them to fit the cloud model, ensuring that security measures are not compromised during migration.
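The proactive token generation mentioned in the second insight can be sketched as a refresh-ahead pass over a token cache: anything that will expire within a safety margin is re-issued before a caller ever sees a miss. The function names, the cache layout, and the `mint_token` callback below are assumptions for illustration only; the source does not describe how refreshes are scheduled.

```python
from typing import Callable, Dict, List, Tuple

Entry = Tuple[str, float]  # (token, expiry as a Unix timestamp)


def due_for_refresh(cache: Dict[str, Entry], now: float, margin_seconds: float) -> List[str]:
    """Return identities whose tokens expire within the margin, sorted by name."""
    return sorted(
        identity
        for identity, (_, expires_at) in cache.items()
        if expires_at - now <= margin_seconds
    )


def refresh_ahead(
    cache: Dict[str, Entry],
    mint_token: Callable[[str, float], Entry],
    now: float,
    margin_seconds: float,
) -> List[str]:
    """Re-issue every token that is close to expiry; return what was refreshed."""
    refreshed = due_for_refresh(cache, now, margin_seconds)
    for identity in refreshed:
        cache[identity] = mint_token(identity, now)
    return refreshed
```

Running such a pass on a timer trades a small amount of extra token issuance for the absence of cache misses on the request path, which matters most during peak load.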
Common Pitfalls
1. Failing to properly synchronize IAM controls between on-premise and cloud environments can lead to security vulnerabilities.
This often happens when organizations do not fully assess their existing IAM policies and how they translate to cloud IAM, resulting in gaps that can be exploited.
2. Over-reliance on human identities for access control in cloud environments increases the risk of unauthorized access.
Organizations should minimize human access to sensitive data and utilize automated systems for managing permissions to enhance security.
Related Concepts
Cloud Migration Strategies
IAM Integration Techniques
Performance Optimization in Cloud Applications