Uber logo

How Uber Uses Apache Spark

94 engineering articles about Apache Spark from Uber's engineering team

Articles

Uber
Intermediate
This article details how Uber optimized Apache Hadoop's DistCp (Distributed Copy) tool to scale its data replication infrastructure from handling 250 TB to petabytes of daily data movement.
Abhay Yadav, Radhika Patwari, Sanjay Sundaresan
15 min read
Has Summary
--
Uber
Advanced
This article details how Uber built and scaled Apache Hudi to power one of the world's largest data lakes, managing 19,500 datasets with trillions of records across a multi-hundred-petabyte reposit...
Prashant Wason, Balajee Nagasubramaniam, Surya Prasanna Kumar Yalla, Meenal Binwade, Xinli Shang, Jack Song
19 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's transition from traditional keyword-based search using Apache Lucene to implementing semantic vector search with Amazon OpenSearch.
Hao Sun, Jiasen Xu, Smit Patel, Anand Kotriwal, Xu Zhang
11 min read
Has Summary
--
Uber
Advanced
This article discusses how Uber utilizes a pull-based ingestion model in OpenSearch™ to effectively index streaming data.
Yupeng Fu, Varun Bharadwaj, Shuyi Zhang, Xu Xiong, Michael Froh
14 min read
Has Summary
--
Uber
Advanced
This article discusses Uber's transition from batch to streaming data ingestion using Apache Flink, which significantly enhances data freshness and operational efficiency.
Xinli Shang, Peter Huang, Jing Li, Jing Zhao, Jack Song
6 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's implementation of I/O observability for its massive petabyte-scale data lake, focusing on the challenges and solutions in monitoring data access patterns across its hyb...
Arnav Balyan, Kartik Bommepally, Amruth Sampath, Jing Zhao, Akshayaprakash Sharma
10 min read
Has Summary
--
Uber
Advanced
Uber's migration from Spark 2.4 to Spark 3.3 involved upgrading over 2 million Spark applications, utilizing innovative automation tools like Iron Dome.
Amruth Sampath, Arnav Balyan, Nimesh Khandelwal, Sumit Singh, Parth Halani, Suprit Acharya
8 min read
Has Summary
--
Uber
Advanced
This article discusses the architecture and implementation of Uber's HiveSync, a critical service for data replication across its massive data lake.
Radhika Patwari, Trivedhi Talakola, Rajan Jaiswal, Chayanika Bhandary, Mukesh Verma, Sanjay Sundaresan
14 min read
Has Summary
--
Uber
Advanced
This article discusses the development and implementation of forecasting models aimed at improving driver availability at airports, which are critical to Uber's ridesharing ecosystem.
Bob Zheng, Dhruv Ghulati, Manoj Panikkar, Michael (Yichuan) Cai
15 min read
Has Summary
--
Uber
Advanced
The article discusses the evolution of Uber's Search Platform, highlighting its transition from Elasticsearch to an in-house solution called Sia, and ultimately to the adoption of OpenSearch.
Yupeng Fu, Shubham Gupta, Shanshan Song, Mingmin Chen
15 min read
Has Summary
--
Uber
Intermediate
This article details Uber's migration from Apache Hive to Apache Spark SQL for ETL workloads, highlighting the motivations behind the transition, the architecture involved, and the challenges faced...
Kumudini Kakwani, Akshayaprakash Sharma, Nimesh Khandelwal, Aayush Chaturvedi, Chintan Betrabet, Suprit Acharya
14 min read
Has Summary
--
Uber
Intermediate
The article discusses Uber's implementation of a configuration-driven archival and retrieval framework designed to manage vast amounts of regulatory data efficiently.
Abhishek Dobliyal, Aakash Bhardwaj
12 min read
Has Summary
--
Uber
Intermediate
The article discusses Uber's migration of large-scale interactive compute workloads from Peloton to Kubernetes, focusing on minimizing disruption while enhancing resource management and cloud readi...
Sayan Pal, Rishabh Mishra
12 min read
Has Summary
--
Uber
Intermediate
This article discusses Uber's implementation of elastic resource management on Kubernetes, focusing on enhancements made to support Ray-based job management.
Bharat Joshi, Anant Vyas, Ben Wang, Axansh Sheth, Abhinav Dixit
10 min read
Has Summary
--
Uber
Intermediate
Uber's blog post discusses their migration of machine learning workloads to Kubernetes using Ray, detailing the challenges faced with their previous setup and the improvements achieved with the new...
Bharat Joshi, Anant Vyas, Ben Wang, Min Cai, Axansh Sheth, Abhinav Dixit
18 min read
Has Summary
--
Uber
Advanced
The article discusses how Uber utilizes Ray®, a general compute engine for Python®, to enhance the efficiency of its rides business through improved machine learning model performance and optimizat...
Kaichen Wei, Matt Walker, Peng Zhang
15 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's advanced settlement accounting system, which is crucial for managing financial transactions involving payment service providers (PSPs).
Onkar Singh, Sai Sameera Grandhi, Nagesh Kumar Mankala, Abhinav Agarwal
12 min read
Has Summary
--
Uber
Advanced
The article discusses how Uber optimizes the training of Large Language Models (LLMs) using both open-source and in-house models.
Bo Ling, Jiapei Huang, Baojun Liu, Chongxiao Cao, Anant Vyas, Peng Zhang
11 min read
Has Summary
--
Uber
Intermediate
The article discusses Genie, Uber's generative AI on-call copilot designed to enhance communication and efficiency in on-call operations.
Paarth Chothani, Eduards Sidorovics, Xiyuan Feng, Nicholas Marcott, Jonathan Li, Chun Zhu, Kailiang Fu, Meghana Somasundara
11 min read
Has Summary
--
Uber
Intermediate
The article discusses Uber's migration of its batch data platform to the cloud, focusing on the implementation of DataMesh principles.
Arun Mahadeva Iyer, Abhi Khune, Sahana Bhat
11 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's upgrade of its search platform from Lucene version 7.5.0 to 9.4.
Anand Kotriwal, Aparajita Pandey, Charu Jain, Yupeng Fu
12 min read
Has Summary
--
Uber
Intermediate
The article discusses how Uber utilizes Apache Pinot for low-latency offline table analytics, highlighting its capabilities in handling various use cases, including real-time and offline data inges...
Ankit Sultana, Caner Balci
15 min read
Has Summary
--
Uber
Intermediate
The article discusses the Sparkle framework developed by Uber to standardize modular ETL processes, enhancing developer productivity and data quality.
Dinesh Jagannathan, Sharath Bhat, Suman Voleti, Praveen Raj
8 min read
Has Summary
--
Uber
Advanced
This article discusses Uber's migration of its Apache Hadoop-based data lake to Google Cloud Storage (GCS) and the security measures implemented during this transition.
Matt Mathew, Alexander Gulko, Lei Sun, KK Sriramadhesikan, Alan Cao, Omkar Kakade
20 min read
Includes Code
Has Summary
--
Uber
Advanced
This article discusses the modernization of Uber's logging infrastructure using CLP, focusing on the development of an end-to-end system for managing unstructured logs.
Gao Xin, Jack Luo, Kirk Rodrigues
16 min read
Has Summary
--
Uber
Advanced
Uber is modernizing its batch data infrastructure by migrating to Google Cloud Platform (GCP) to enhance data analytics and machine learning capabilities.
Abhi Khune, Arun Mahadeva Iyer, Sahana Bhat, Matt Mathew
7 min read
Has Summary
--
Uber
Advanced
This article discusses how Uber counts job participation at scale, detailing the integration of Apache Pinot™ to address challenges in data processing and analysis.
Ryan Woo, Sameer Kapoor
11 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's DataK9 platform, which automates the categorization of vast amounts of data at the field level using AI/ML techniques.
Lei Sun, Mohammad Islam
23 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's evolution in machine learning (ML) through its centralized platform, Michelangelo, highlighting its transition from predictive to generative AI.
--
Uber
Advanced
This article details Uber's migration of over a trillion entries of ledger data from DynamoDB to LedgerStore, focusing on the challenges, strategies, and outcomes of the process.
Raghav Gautam, Erik Seaberg, Abhishek Kanhar
12 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's journey in scaling its AI/ML infrastructure, highlighting the transition from on-premise to cloud solutions, the implementation of new technologies, and the optimizatio...
Nav Kankani, Rush Tehrani, Anant Vyas
10 min read
Has Summary
--
Uber
Intermediate
DataCentral is Uber's proprietary platform designed for Big Data observability, chargeback, and governance.
Arnav Balyan, Atul Mantri, Krishna Karri, Amruth Sampath
10 min read
Has Summary
--
Uber
Advanced
uVitals is an anomaly detection and alerting system developed by Uber to enhance the reliability of its services by quickly identifying and addressing issues in multi-dimensional time series data.
Venki Appiah, Komal Raulkar
14 min read
Has Summary
--
Uber
Intermediate
The article discusses how Uber utilizes Apache Pinot for real-time analytics of mobile app crashes, enhancing their ability to detect and resolve issues quickly.
Kriti Dangi, Anil Purohit, Parijat Bansal, Rohit Yadav
17 min read
Has Summary
--
Uber
Advanced
The article discusses the implementation of a Unified Session for analytical events at Uber, aimed at enhancing data consistency and analytics across various applications.
Harsh Desai, Gaurav Yadav, Sahil Jindal, Satyam Shubham, Mahip Jain, Anshal Shukla, Ashok Varma
13 min read
Has Summary
--
Uber
Intermediate
The article discusses the challenges Uber faces with increasing data storage costs and presents a solution through selective column reduction in Apache Parquet™ files.
Xinli Shang, Kai Jiang, Ryan Chen, Jing Zhao, Mingmin Chen, Mohammad Islam, Karthik Natarajan, Ajit Panda
7 min read
Has Summary
--
Uber
Advanced
The article discusses CheckEnv, a tool developed by Uber for fast detection of remote procedure calls (RPCs) between different environments using graph technology.
Minglei Wang, Kamyar Arbabifard
11 min read
Has Summary
--
Uber
Advanced
The article discusses the implementation of dynamic executor core resizing in Apache Spark to address out-of-memory (OOM) exceptions.
Kalyan Sivakumar
12 min read
Has Summary
--
Uber
Advanced
The article discusses the implementation of Two-Tower Embeddings (TTE) at Uber, highlighting its role in enhancing the efficiency and scalability of recommendation systems.
Bo Ling, Melissa Barr, Dhruva Dixith Kurra, Chun Zhu, Nicholas Marcott
18 min read
Has Summary
--
Uber
Intermediate
The article discusses Spark Analysers, a system developed by Uber to identify anti-patterns in Spark applications.
Vijayant Soni, Sashidhar Thallam, Sakshi Pande, Atul Mantri
10 min read
Has Summary
--
Uber
Advanced
The article discusses how Uber implemented an incremental ETL process using Apache Hudi to manage its transactional data lake.
Vinoth Govindarajan, Saketh Chintapalli, Yogesh Saswade, Aayush Bareja
16 min read
Has Summary
--
Uber
Intermediate
The article discusses Uber's journey in scaling the adoption of Kerberos authentication across its extensive data analytics platform.
Alexander Gulko, Matt Mathew
13 min read
Includes Code
Has Summary
--
Uber
Advanced
The article discusses Uber's Machine Learning Education Program, which leverages engineering principles to scale ML education for its employees.
Brooke Carter, Melissa Barr, Michael Mui
12 min read
Has Summary
--
Uber
Intermediate
Uber's article discusses the implementation of a highly scalable and distributed Remote Shuffle Service (RSS) designed to enhance the efficiency of data processing in Apache Spark.
Mayank Bansal, Bo Yang, Mayur Bhosale, Kai Jiang
20 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's approach to automating offline inferences using machine learning and natural language processing on support interaction data.
Neeraj Dhake, Aravind Ranganathan
12 min read
Has Summary
--
Uber
Advanced
This article discusses the implementation of finer-grained encryption in Apache Parquet™, focusing on how it addresses data access restrictions, retention, and encryption at rest.
Xinli Shang, Mohammad Islam, Pavi Subenderan, Jianchun Xu
19 min read
Has Summary
--
Uber
Advanced
The article discusses DeepETA, Uber's advanced model for predicting arrival times using deep learning techniques.
--
Uber
Advanced
The article discusses Project RADAR, an intelligent fraud detection system developed by Uber that integrates machine learning and human expertise to identify and mitigate fraudulent activities in r...
Sergey Zelvenskiy, Garvit Harisinghani, Tiffany Yu, Edwin Ng, Robin Wei
14 min read
Has Summary
--
Uber
Advanced
This article details Uber's migration of financial data from DynamoDB to Docstore, highlighting the challenges faced and the architectural decisions made to ensure data integrity and operational ef...
Piyush Patel, Jaydeepkumar Chovatia, Kaushik Devarajaiah
15 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's initiatives to enhance the efficiency of its Big Data platform, focusing on cost reduction through optimizations in file formats, HDFS erasure coding, YARN scheduling i...
Zheng Shao, Mohammad Islam
18 min read
Has Summary
--