How Uber Uses Apache Spark

94 engineering articles about Apache Spark from Uber's engineering team

Other Uber Technologies

Apache (195), Java (112), MySQL (78), SQL (77), JSON (74), Apache Kafka (68)

Articles

Uber

Intermediate

How Uber Scaled Data Replication to Move Petabytes Every Day

This article details how Uber optimized Apache Hadoop's DistCp (Distributed Copy) tool to scale its data replication infrastructure from 250 TB to petabytes of daily data movement.

Apache, Apache Spark

Abhay Yadav, Radhika Patwari, Sanjay Sundaresan

15 min read

Has Summary

Uber

Advanced

Apache Hudi™ at Uber: Engineering for Trillion-Record-Scale Data Lake Operations

This article details how Uber built and scaled Apache Hudi to power one of the world's largest data lakes, managing 19,500 datasets with trillions of records across a multi-hundred-petabyte reposit...

Apache, Apache Spark, AWS, Azure, Google Cloud, Google Cloud Storage

Prashant Wason, Balajee Nagasubramaniam, Surya Prasanna Kumar Yalla, Meenal Binwade, Xinli Shang, Jack Song

19 min read

Has Summary

Uber

Advanced

Powering Billion-Scale Vector Search with OpenSearch

The article discusses Uber's transition from traditional keyword-based search using Apache Lucene to implementing semantic vector search with Amazon OpenSearch.

Apache, Apache Spark, CSS, Embedding

Hao Sun, Jiasen Xu, Smit Patel, Anand Kotriwal, Xu Zhang

11 min read

Has Summary

Uber

Advanced

How Uber Indexes Streaming Data with Pull-Based Ingestion in OpenSearch™

This article discusses how Uber utilizes a pull-based ingestion model in OpenSearch™ to effectively index streaming data.

Apache, Apache Kafka, Apache Spark, AWS, gRPC

Yupeng Fu, Varun Bharadwaj, Shuyi Zhang, Xu Xiong, Michael Froh

14 min read

Has Summary

Uber

Advanced

From Batch to Streaming: Accelerating Data Freshness in Uber’s Data Lake

This article discusses Uber's transition from batch to streaming data ingestion using Apache Flink, which significantly enhances data freshness and operational efficiency.

Apache, Apache Kafka, Apache Spark, Machine Learning

Xinli Shang, Peter Huang, Jing Li, Jing Zhao, Jack Song

6 min read

Has Summary

Uber

Advanced

I/O Observability for Uber’s Massive Petabyte-Scale Data Lake

The article discusses Uber's implementation of I/O observability for its massive petabyte-scale data lake, focusing on the challenges and solutions in monitoring data access patterns across its hyb...

Apache, Apache Spark, Google Cloud, Google Cloud Storage, Grafana, Java, MySQL, Oracle, SQL

Arnav Balyan, Kartik Bommepally, Amruth Sampath, Jing Zhao, Akshayaprakash Sharma

10 min read

Has Summary

Uber

Advanced

Uber’s Strategy to Upgrading 2M+ Spark Jobs

Uber's migration from Spark 2.4 to Spark 3.3 involved upgrading over 2 million Spark applications, utilizing innovative automation tools like Iron Dome.

Apache, Apache Spark, Java, Kubernetes, MySQL, Oracle, PySpark, Python, Scala, SQL

Amruth Sampath, Arnav Balyan, Nimesh Khandelwal, Sumit Singh, Parth Halani, Suprit Acharya

8 min read

Has Summary

Uber

Advanced

Building Uber’s Data Lake: Batch Data Replication Using HiveSync

This article discusses the architecture and implementation of Uber's HiveSync, a critical service for data replication across its massive data lake.

Apache, Apache Spark, Google Cloud, Java, MySQL, Oracle

Radhika Patwari, Trivedhi Talakola, Rajan Jaiswal, Chayanika Bhandary, Mukesh Verma, Sanjay Sundaresan

14 min read

Has Summary

Uber

Advanced

Forecasting Models to Improve Driver Availability at Airports

This article discusses the development and implementation of forecasting models aimed at improving driver availability at airports, which are critical to Uber's ridesharing ecosystem.

Apache, Apache Spark, Cassandra, Kong, Transformer, Transformers

Bob Zheng, Dhruv Ghulati, Manoj Panikkar, Michael (Yichuan) Cai

15 min read

Has Summary

Uber

Advanced

The Evolution of Uber’s Search Platform

The article discusses the evolution of Uber's Search Platform, highlighting its transition from Elasticsearch to an in-house solution called Sia, and ultimately to the adoption of OpenSearch.

Apache, Apache Kafka, Apache Spark, AWS, Elasticsearch, Google Cloud, Google Cloud Storage, gRPC, JSON, SQL

Yupeng Fu, Shubham Gupta, Shanshan Song, Mingmin Chen

15 min read

Has Summary

Uber

Intermediate

How Uber Migrated from Hive to Spark SQL for ETL Workloads

This article details Uber's migration from Apache Hive to Apache Spark SQL for ETL workloads, highlighting the motivations behind the transition, the architecture involved, and the challenges faced...

Apache, Apache Spark, Java, JSON, MySQL, Oracle, Serverless, SQL

Kumudini Kakwani, Akshayaprakash Sharma, Nimesh Khandelwal, Aayush Chaturvedi, Chintan Betrabet, Suprit Acharya

14 min read

Has Summary

Uber

Intermediate

From Archival to Access: Config-Driven Data Pipelines

The article discusses Uber's implementation of a configuration-driven archival and retrieval framework designed to manage vast amounts of regulatory data efficiently.

Apache, Apache Spark, AWS, MySQL, Oracle, YAML

Abhishek Dobliyal, Aakash Bhardwaj

12 min read

Has Summary

Uber

Intermediate

Migrating Large-Scale Interactive Compute Workloads to Kubernetes Without Disruption

The article discusses Uber's migration of large-scale interactive compute workloads from Peloton to Kubernetes, focusing on minimizing disruption while enhancing resource management and cloud readi...

Apache, Apache Spark, Cassandra, Docker, Google Cloud, Kubernetes, PySpark

Sayan Pal, Rishabh Mishra

12 min read

Has Summary

Uber

Intermediate

Uber’s Journey to Ray on Kubernetes: Resource Management

This article discusses Uber's implementation of elastic resource management on Kubernetes, focusing on enhancements made to support Ray-based job management.

Apache, Apache Spark, Grafana, Kubernetes

Bharat Joshi, Anant Vyas, Ben Wang, Axansh Sheth, Abhinav Dixit

10 min read

Has Summary

Uber

Intermediate

Uber’s Journey to Ray on Kubernetes: Ray Setup

Uber's blog post discusses their migration of machine learning workloads to Kubernetes using Ray, detailing the challenges faced with their previous setup and the improvements achieved with the new...

Apache, Apache Spark, Deep Learning, Grafana, Kubernetes

Bharat Joshi, Anant Vyas, Ben Wang, Min Cai, Axansh Sheth, Abhinav Dixit

18 min read

Has Summary

Uber

Advanced

How Uber Uses Ray® to Optimize the Rides Business

The article discusses how Uber utilizes Ray®, a general compute engine for Python®, to enhance the efficiency of its rides business through improved machine learning model performance and optimizat...

Apache, Apache Spark, AWS, Docker, Kubernetes, Pandas, PySpark, XGBoost

Kaichen Wei, Matt Walker, Peng Zhang

15 min read

Has Summary

Uber

Advanced

Streamlining Financial Precision: Uber’s Advanced Settlement Accounting System

The article discusses Uber's advanced settlement accounting system, which is crucial for managing financial transactions involving payment service providers (PSPs).

Apache, Apache Kafka, Apache Spark, Cassandra

Onkar Singh, Sai Sameera Grandhi, Nagesh Kumar Mankala, Abhinav Agarwal

12 min read

Has Summary

Uber

Advanced

Open Source and In-House: How Uber Optimizes LLM Training

The article discusses how Uber optimizes the training of Large Language Models (LLMs) using both open-source and in-house models.

Apache, Apache Kafka, Apache Spark, Comet, Docker, Google Cloud, GPT, GPT-4, Hugging Face, Kubernetes, Mistral, PyTorch, SQL, Transformers

Bo Ling, Jiapei Huang, Baojun Liu, Chongxiao Cao, Anant Vyas, Peng Zhang

11 min read

Has Summary

Uber

Intermediate

Genie: Uber’s Gen AI On-Call Copilot

The article discusses Genie, Uber's generative AI on-call copilot designed to enhance communication and efficiency in on-call operations.

Apache, Apache Spark, Copilot, Embedding, Fine-tuning, PySpark

Paarth Chothani, Eduards Sidorovics, Xiyuan Feng, Nicholas Marcott, Jonathan Li, Chun Zhu, Kailiang Fu, Meghana Somasundara

11 min read

Has Summary

Uber

Intermediate

DataMesh: How Uber laid the foundations for the data lake cloud migration

The article discusses Uber's migration of its batch data platform to the cloud, focusing on the implementation of DataMesh principles.

Apache, Apache Spark, Google Cloud, Google Cloud Storage, Grafana, Java, MySQL, Oracle

Arun Mahadeva Iyer, Abhi Khune, Sahana Bhat

11 min read

Has Summary

Uber

Advanced

Lucene: Uber’s Search Platform Version Upgrade

The article discusses Uber's upgrade of its search platform from Lucene version 7.5.0 to 9.4.

Apache, Apache Spark, Git, Java, Scala

Anand Kotriwal, Aparajita Pandey, Charu Jain, Yupeng Fu

12 min read

Has Summary

Uber

Intermediate

Pinot for Low-Latency Offline Table Analytics

The article discusses how Uber utilizes Apache Pinot for low-latency offline table analytics, highlighting its capabilities in handling various use cases, including real-time and offline data inges...

Apache, Apache Kafka, Apache Spark, gRPC, Java, MySQL, Oracle, PySpark, Scala

Ankit Sultana, Caner Balci

15 min read

Has Summary

Uber

Intermediate

Sparkle: Standardizing Modular ETL at Uber

The article discusses the Sparkle framework developed by Uber to standardize modular ETL processes, enhancing developer productivity and data quality.

Apache, Apache Kafka, Apache Spark, Cassandra, Java, MySQL, Oracle, Scala, Spring, Spring Boot, SQL, YAML

Dinesh Jagannathan, Sharath Bhat, Suman Voleti, Praveen Raj

8 min read

Has Summary

Uber

Advanced

Enabling Security for Hadoop Data Lake on Google Cloud Storage

This article discusses Uber's migration of its Apache Hadoop-based data lake to Google Cloud Storage (GCS) and the security measures implemented during this transition.

Apache, Apache Spark, Caching, CQRS, Google Cloud, Google Cloud Storage, gRPC, MVP, OAuth, Redis

Matt Mathew, Alexander Gulko, Lei Sun, KK Sriramadhesikan, Alan Cao, Omkar Kakade

20 min read

Includes Code

Has Summary

Uber

Advanced

Modernizing Logging at Uber with CLP (Part II)

This article discusses the modernization of Uber's logging infrastructure using CLP, focusing on the development of an end-to-end system for managing unstructured logs.

Apache, Apache Spark, Elasticsearch, Java, JSON

Gao Xin, Jack Luo, Kirk Rodrigues

16 min read

Has Summary

Uber

Advanced

Modernizing Uber’s Batch Data Infrastructure with Google Cloud Platform

Uber is modernizing its batch data infrastructure by migrating to Google Cloud Platform (GCP) to enhance data analytics and machine learning capabilities.

Apache, Apache Spark, Google Cloud, Google Cloud Storage, SQL

Abhi Khune, Arun Mahadeva Iyer, Sahana Bhat, Matt Mathew

7 min read

Has Summary

Uber

Advanced

How Uber Accomplishes Job Counting At Scale

This article discusses how Uber counts job participation at scale, detailing the integration of Apache Pinot™ to address challenges in data processing and analysis.

Apache, Apache Kafka, Apache Spark

Ryan Woo, Sameer Kapoor

11 min read

Has Summary

Uber

Advanced

DataK9: Auto-categorizing an exabyte of data at field level through AI/ML

The article discusses Uber's DataK9 platform, which automates the categorization of vast amounts of data at the field level using AI/ML techniques.

Apache, Apache Spark, YAML

Lei Sun, Mohammad Islam

23 min read

Has Summary

Uber

Advanced

From Predictive to Generative – How Michelangelo Accelerates Uber’s AI Journey

The article discusses Uber's evolution in machine learning (ML) through its centralized platform, Michelangelo, highlighting its transition from predictive to generative AI.

Apache, Apache Spark, AutoML, Deep Learning, Docker, Generative AI, Hugging Face, Keras, Kubernetes, PaLM, Prompt Engineering, PyTorch, TensorFlow, XGBoost

Kai Wang, Min Cai, Joseph Wang, Eric Chen

28 min read

Has Summary

Uber

Advanced

Migrating a Trillion Entries of Uber’s Ledger Data from DynamoDB to LedgerStore

This article details Uber's migration of over a trillion entries of ledger data from DynamoDB to LedgerStore, focusing on the challenges, strategies, and outcomes of the process.

Apache, Apache Spark, AWS, AWS S3, DynamoDB, Java, Scala

Raghav Gautam, Erik Seaberg, Abhishek Kanhar

12 min read

Has Summary

Uber

Advanced

Scaling AI/ML Infrastructure at Uber

The article discusses Uber's journey in scaling its AI/ML infrastructure, highlighting the transition from on-premise to cloud solutions, the implementation of new technologies, and the optimizatio...

Apache, Apache Kafka, Apache Spark, Generative AI, Kubernetes, LLaMA, Machine Learning

Nav Kankani, Rush Tehrani, Anant Vyas

10 min read

Has Summary

Uber

Intermediate

DataCentral: Uber’s Big Data Observability and Chargeback Platform

DataCentral is Uber's proprietary platform designed for Big Data observability, chargeback, and governance.

Apache, Apache Kafka, Apache Spark, AWS, AWS S3, Google Cloud, Google Cloud Storage, MySQL

Arnav Balyan, Atul Mantri, Krishna Karri, Amruth Sampath

10 min read

Has Summary

Uber

Advanced

uVitals – An Anomaly Detection & Alerting System

uVitals is an anomaly detection and alerting system developed by Uber to enhance the reliability of its services by quickly identifying and addressing issues in multi-dimensional time series data.

Apache, Apache Kafka, Apache Spark, Pandas, Statsmodels

Venki Appiah, Komal Raulkar

14 min read

Has Summary

Uber

Intermediate

Real-Time Analytics for Mobile App Crashes using Apache Pinot

The article discusses how Uber utilizes Apache Pinot for real-time analytics of mobile app crashes, enhancing their ability to detect and resolve issues quickly.

Apache, Apache Kafka, Apache Spark, AWS, Azure, Elasticsearch, Google Cloud, Google Cloud Storage, JSON

Kriti Dangi, Anil Purohit, Parijat Bansal, Rohit Yadav

17 min read

Has Summary

Uber

Advanced

Unified Session for Analytical Events

The article discusses the implementation of a Unified Session for analytical events at Uber, aimed at enhancing data consistency and analytics across various applications.

Apache, Apache Kafka, Apache Spark, Redis

Harsh Desai, Gaurav Yadav, Sahil Jindal, Satyam Shubham, Mahip Jain, Anshal Shukla, Ashok Varma

13 min read

Has Summary

Uber

Intermediate

Selective Column Reduction for DataLake Storage Cost Efficiency

The article discusses the challenges Uber faces with increasing data storage costs and presents a solution through selective column reduction in Apache Parquet™ files.

Apache, Apache Spark

Xinli Shang, Kai Jiang, Ryan Chen, Jing Zhao, Mingmin Chen, Mohammad Islam, Karthik Natarajan, Ajit Panda

7 min read

Has Summary

Uber

Advanced

CheckEnv: Fast Detection of RPC Calls Between Environments Powered by Graphs

The article discusses CheckEnv, a tool developed by Uber for fast detection of remote procedure calls (RPCs) between different environments using graph technology.

Apache, Apache Spark, SQL

Minglei Wang, Kamyar Arbabifard

11 min read

Has Summary

Uber

Advanced

Dynamic Executor Core Resizing in Spark

The article discusses the implementation of dynamic executor core resizing in Apache Spark to address out-of-memory (OOM) exceptions.

Apache, Apache Spark, Java

Kalyan Sivakumar

12 min read

Has Summary

Uber

Advanced

Innovative Recommendation Applications Using Two Tower Embeddings at Uber

The article discusses the implementation of Two-Tower Embeddings (TTE) at Uber, highlighting its role in enhancing the efficiency and scalability of recommendation systems.

Apache, Apache Spark, Artificial Intelligence, Deep Learning, Embedding, Machine Learning, PySpark

Bo Ling, Melissa Barr, Dhruva Dixith Kurra, Chun Zhu, Nicholas Marcott

18 min read

Has Summary

Uber

Intermediate

Spark Analysers: Catching Anti-Patterns In Spark Apps

The article discusses Spark Analysers, a system developed by Uber to identify anti-patterns in Spark applications.

Apache, Apache Spark, SQL

Vijayant Soni, Sashidhar Thallam, Sakshi Pande, Atul Mantri

10 min read

Has Summary

Uber

Advanced

Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi

The article discusses how Uber implemented an incremental ETL process using Apache Hudi to manage its transactional data lake.

Apache, Apache Spark, Grafana, Java, Scala, SQL, YAML

Vinoth Govindarajan, Saketh Chintapalli, Yogesh Saswade, Aayush Bareja

16 min read

Has Summary

Uber

Intermediate

Scaling Adoption of Kerberos at Uber

The article discusses Uber's journey in scaling the adoption of Kerberos authentication across its extensive data analytics platform.

Apache, Apache Kafka, Apache Spark, Docker, Git

Alexander Gulko, Matt Mathew

13 min read

Includes Code

Has Summary

Uber

Advanced

ML Education at Uber: Frameworks Inspired by Engineering Principles

The article discusses Uber's Machine Learning Education Program, which leverages engineering principles to scale ML education for its employees.

Apache, Apache Kafka, Apache Spark, Machine Learning

Brooke Carter, Melissa Barr, Michael Mui

12 min read

Has Summary

Uber

Intermediate

Uber’s Highly Scalable and Distributed Shuffle as a Service

The article discusses Uber's implementation of a highly scalable, distributed Remote Shuffle Service (RSS) designed to make data processing in Apache Spark more efficient.

Apache, Apache Spark, Java, Kubernetes, MySQL, Oracle, Redis

Mayank Bansal, Bo Yang, Mayur Bhosale, Kai Jiang

20 min read

Has Summary

Uber

Advanced

Enabling Offline Inferences at Uber Scale

The article discusses Uber's approach to automating offline inferences using machine learning and natural language processing on support interaction data.

Apache, Apache Spark, Docker, PySpark, Streamlit, XGBoost

Neeraj Dhake, Aravind Ranganathan

12 min read

Has Summary

Uber

Advanced

One Stone, Three Birds: Finer-Grained Encryption @ Apache Parquet™

This article discusses the implementation of finer-grained encryption in Apache Parquet™, focusing on how it addresses data access restrictions, retention, and encryption at rest.

Apache, Apache Spark, Azure, Java, SQL

Xinli Shang, Mohammad Islam, Pavi Subenderan, Jianchun Xu

19 min read

Has Summary

Uber

Advanced

DeepETA: How Uber Predicts Arrival Times Using Deep Learning

The article discusses DeepETA, Uber's advanced model for predicting arrival times using deep learning techniques.

Apache, Apache Spark, Computer Vision, Deep Learning, Machine Learning, Self-Attention, TensorFlow, Transformer, Transformers, XGBoost

Xinyu Hu, Olcay Cirit, Tanmay Binaykiya, Ramit Hora

15 min read

Has Summary

Uber

Advanced

Project RADAR: Intelligent Early Fraud Detection System with Humans in the Loop

The article discusses Project RADAR, an intelligent fraud detection system developed by Uber that integrates machine learning and human expertise to identify and mitigate fraudulent activities in r...

Apache, Apache Spark, PySpark, Scala, SQL

Sergey Zelvenskiy, Garvit Harisinghani, Tiffany Yu, Edwin Ng, Robin Wei

14 min read

Has Summary

Uber

Advanced

How Uber Migrated Financial Data from DynamoDB to Docstore

This article details Uber's migration of financial data from DynamoDB to Docstore, highlighting the challenges faced and the architectural decisions made to ensure data integrity and operational ef...

Apache, Apache Kafka, Apache Spark, AWS, DynamoDB

Piyush Patel, Jaydeepkumar Chovatia, Kaushik Devarajaiah

15 min read

Has Summary

Uber

Advanced

Cost-Efficient Open Source Big Data Platform at Uber

The article discusses Uber's initiatives to enhance the efficiency of its Big Data platform, focusing on cost reduction through optimizations in file formats, HDFS erasure coding, YARN scheduling i...

Apache, Apache Kafka, Apache Spark, JSON, MySQL, SQL

Zheng Shao, Mohammad Islam

18 min read

Has Summary