# Apache Spark Programming Tutorials & Engineering Articles

246 Apache Spark tutorials, guides, and engineering insights from Uber, NVIDIA, LinkedIn, and more

Apache Spark Articles & Tutorials

Notion
Intermediate
The article discusses Notion's journey in scaling its vector search infrastructure, achieving a 10x increase in scale while reducing costs by 90% over two years.
Preeti Gondi, Mickey Liu, Nathan Louie, Calder Lund, Jacob Sager
10 min read
Has Summary
--
Pinterest
Intermediate
The article discusses Pinterest's transition to a next-generation database ingestion framework designed to address the limitations of legacy systems.
Pinterest Engineering
10 min read
Includes Code
Has Summary
--
Uber
Intermediate
This article details how Uber optimized Apache Hadoop's DistCp (Distributed Copy) tool to scale its data replication infrastructure from handling 250 TB to petabytes of daily data movement.
Abhay Yadav, Radhika Patwari, Sanjay Sundaresan
15 min read
Has Summary
--
Uber
Advanced
This article details how Uber built and scaled Apache Hudi to power one of the world's largest data lakes, managing 19,500 datasets with trillions of records across a multi-hundred-petabyte reposit...
Prashant Wason, Balajee Nagasubramaniam, Surya Prasanna Kumar Yalla, Meenal Binwade, Xinli Shang, Jack Song
19 min read
Has Summary
--
Pinterest
Advanced
PinLanding is a multimodal AI pipeline developed by Pinterest to generate shopping collections from billions of products.
Pinterest Engineering
8 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's transition from traditional keyword-based search using Apache Lucene to implementing semantic vector search with Amazon OpenSearch.
Hao Sun, Jiasen Xu, Smit Patel, Anand Kotriwal, Xu Zhang
11 min read
Has Summary
--
NVIDIA
Advanced
The article discusses Project Aether, a tool developed by NVIDIA to facilitate the migration of CPU-based Apache Spark workloads to GPU-accelerated environments on Amazon EMR.
Navin Kumar
6 min read
Includes Code
Has Summary
--
Uber
Advanced
This article discusses how Uber utilizes a pull-based ingestion model in OpenSearch™ to effectively index streaming data.
Yupeng Fu, Varun Bharadwaj, Shuyi Zhang, Xu Xiong, Michael Froh
14 min read
Has Summary
--
Uber
Advanced
This article discusses Uber's transition from batch to streaming data ingestion using Apache Flink, which significantly enhances data freshness and operational efficiency.
Xinli Shang, Peter Huang, Jing Li, Jing Zhao, Jack Song
6 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's implementation of I/O observability for its massive petabyte-scale data lake, focusing on the challenges and solutions in monitoring data access patterns across its hyb...
Arnav Balyan, Kartik Bommepally, Amruth Sampath, Jing Zhao, Akshayaprakash Sharma
10 min read
Has Summary
--
NVIDIA
Intermediate
The article discusses the collaboration between IBM and NVIDIA to enhance large-scale data analytics through GPU-native Velox and NVIDIA cuDF, highlighting significant performance improvements over...
Gregory Kimball
7 min read
Has Summary
--
Uber
Advanced
Uber's migration from Spark 2.4 to Spark 3.3 involved upgrading over 2 million Spark applications, utilizing innovative automation tools like Iron Dome.
Amruth Sampath, Arnav Balyan, Nimesh Khandelwal, Sumit Singh, Parth Halani, Suprit Acharya
8 min read
Has Summary
--
Netflix
Advanced
The article discusses how Netflix scales its Muse application to provide data-driven creative insights at a massive scale, focusing on the architectural evolution and optimizations made to handle t...
Netflix Technology Blog
10 min read
Includes Code
Has Summary
--
Stripe
Intermediate
The article discusses the development of a real-time streaming analytics system for Stripe Billing, enabling customers to access subscription metrics with minimal latency.
Reed Trevelyan
8 min read
Has Summary
--
Uber
Advanced
This article discusses the architecture and implementation of Uber's HiveSync, a critical service for data replication across its massive data lake.
Radhika Patwari, Trivedhi Talakola, Rajan Jaiswal, Chayanika Bhandary, Mukesh Verma, Sanjay Sundaresan
14 min read
Has Summary
--
Stripe
Advanced
The article discusses the critical importance of maintaining consistent data across multiple systems as organizations grow.
James Beswick
11 min read
Has Summary
--
Uber
Advanced
This article discusses the development and implementation of forecasting models aimed at improving driver availability at airports, which are critical to Uber's ridesharing ecosystem.
Bob Zheng, Dhruv Ghulati, Manoj Panikkar, Michael (Yichuan) Cai
15 min read
Has Summary
--
NVIDIA
Advanced
The article discusses the deployment of a serverless, distributed data processing architecture using Apache Spark and NVIDIA AI on Azure.
--
Uber
Advanced
The article discusses the evolution of Uber's Search Platform, highlighting its transition from Elasticsearch to an in-house solution called Sia, and ultimately to the adoption of OpenSearch.
Yupeng Fu, Shubham Gupta, Shanshan Song, Mingmin Chen
15 min read
Has Summary
--
Uber
Intermediate
This article details Uber's migration from Apache Hive to Apache Spark SQL for ETL workloads, highlighting the motivations behind the transition, the architecture involved, and the challenges faced...
Kumudini Kakwani, Akshayaprakash Sharma, Nimesh Khandelwal, Aayush Chaturvedi, Chintan Betrabet, Suprit Acharya
14 min read
Has Summary
--
Uber
Intermediate
The article discusses Uber's implementation of a configuration-driven archival and retrieval framework designed to manage vast amounts of regulatory data efficiently.
Abhishek Dobliyal, Aakash Bhardwaj
12 min read
Has Summary
--
NVIDIA
Intermediate
The article discusses the application of Graph Neural Networks (GNNs) in enhancing fraud detection within financial services.
--
NVIDIA
Advanced
The article discusses how Atgenomix SeqsLab leverages NVIDIA technologies to enhance health omics analysis for precision medicine.
Yu-Ting Lin
9 min read
Has Summary
--
Pinterest
Advanced
The article discusses how Pinterest enhances its machine learning feature iterations through an effective backfill process.
Pinterest Engineering
14 min read
Has Summary
--
NVIDIA
Advanced
The article discusses the use of GPU acceleration to enhance performance in Apache Spark applications, highlighting the challenges of migrating workloads from CPUs to GPUs.
Matt Ahrens
9 min read
Includes Code
Has Summary
--
NVIDIA
Advanced
The article discusses how to accelerate Deep Learning (DL) and Large Language Model (LLM) inference using Apache Spark in cloud environments.
--
Uber
Intermediate
The article discusses Uber's migration of large-scale interactive compute workloads from Peloton to Kubernetes, focusing on minimizing disruption while enhancing resource management and cloud readi...
Sayan Pal, Rishabh Mishra
12 min read
Has Summary
--
Uber
Intermediate
This article discusses Uber's implementation of elastic resource management on Kubernetes, focusing on enhancements made to support Ray-based job management.
Bharat Joshi, Anant Vyas, Ben Wang, Axansh Sheth, Abhinav Dixit
10 min read
Has Summary
--
NVIDIA
Advanced
The article discusses how to accelerate Apache Parquet scans on Apache Spark using GPUs, specifically through the RAPIDS Accelerator for Apache Spark.
Matt Ahrens
7 min read
Includes Code
Has Summary
--
Uber
Intermediate
Uber's blog post discusses their migration of machine learning workloads to Kubernetes using Ray, detailing the challenges faced with their previous setup and the improvements achieved with the new...
Bharat Joshi, Anant Vyas, Ben Wang, Min Cai, Axansh Sheth, Abhinav Dixit
18 min read
Has Summary
--
NVIDIA
Advanced
This article discusses strategies for preventing GPU fragmentation in the Volcano Scheduler, focusing on an enhanced scheduling approach that integrates bin-packing with gang scheduling.
Ameya Parab
6 min read
Includes Code
Has Summary
--
NVIDIA
Intermediate
The article discusses the performance and energy efficiency of the NVIDIA Grace CPU Superchip for ETL workloads, comparing it with AMD and Intel CPUs.
Gregory Kimball
6 min read
Includes Code
Has Summary
--
NVIDIA
Intermediate
The article discusses how the NVIDIA RAPIDS Accelerator for Apache Spark enables GPU-accelerated data processing with zero code changes, enhancing the performance of Apache Spark ML applications.
Erik Ordentlich
5 min read
Includes Code
Has Summary
--
NVIDIA
Intermediate
The article discusses how to read JSON Lines data using NVIDIA's cuDF library, achieving speeds up to 100 times faster than traditional pandas methods.
Karthikeyan Natarajan
10 min read
Includes Code
Has Summary
--
NVIDIA
Intermediate
The article discusses the optimization of JSON processing on Apache Spark using GPU acceleration, highlighting significant performance improvements achieved by a Fortune 100 retail company.
Matt Ahrens
8 min read
Includes Code
Has Summary
--
Uber
Advanced
The article discusses how Uber utilizes Ray®, a general compute engine for Python®, to enhance the efficiency of its rides business through improved machine learning model performance and optimizat...
Kaichen Wei, Matt Walker, Peng Zhang
15 min read
Has Summary
--
Pinterest
Advanced
The article discusses the transition from Apache Hadoop YARN to Apache YuniKorn for resource management in Pinterest's batch processing platform, Monarch, now rebranded as Moka.
Pinterest Engineering
10 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's advanced settlement accounting system, which is crucial for managing financial transactions involving payment service providers (PSPs).
Onkar Singh, Sai Sameera Grandhi, Nagesh Kumar Mankala, Abhinav Agarwal
12 min read
Has Summary
--
Uber
Advanced
The article discusses how Uber optimizes the training of Large Language Models (LLMs) using both open-source and in-house models.
Bo Ling, Jiapei Huang, Baojun Liu, Chongxiao Cao, Anant Vyas, Peng Zhang
11 min read
Has Summary
--
Pinterest
Advanced
This article discusses the implementation of Ray Batch Inference at Pinterest, highlighting its advantages over previous solutions like Apache Spark and Torch Dataloader.
Pinterest Engineering
11 min read
Includes Code
Has Summary
--
Uber
Intermediate
The article discusses Genie, Uber's generative AI on-call copilot designed to enhance communication and efficiency in on-call operations.
Paarth Chothani, Eduards Sidorovics, Xiyuan Feng, Nicholas Marcott, Jonathan Li, Chun Zhu, Kailiang Fu, Meghana Somasundara
11 min read
Has Summary
--
NVIDIA
Intermediate
NVIDIA has announced that its CUDA-X platform now accelerates the Polars Data Processing Library, enhancing its performance for data analytics.
Nick Becker
3 min read
Has Summary
--
Uber
Intermediate
The article discusses Uber's migration of its batch data platform to the cloud, focusing on the implementation of DataMesh principles.
Arun Mahadeva Iyer, Abhi Khune, Sahana Bhat
11 min read
Has Summary
--
Uber
Advanced
The article discusses Uber's upgrade of its search platform from Lucene version 7.5.0 to 9.4.
Anand Kotriwal, Aparajita Pandey, Charu Jain, Yupeng Fu
12 min read
Has Summary
--
Uber
Intermediate
The article discusses how Uber utilizes Apache Pinot for low-latency offline table analytics, highlighting its capabilities in handling various use cases, including real-time and offline data inges...
Ankit Sultana, Caner Balci
15 min read
Has Summary
--
NVIDIA
Advanced
The article discusses the NVIDIA GH200 Grace Hopper Superchip, highlighting its significant advancements in energy efficiency and node consolidation for Apache Spark workloads.
--
Uber
Intermediate
The article discusses the Sparkle framework developed by Uber to standardize modular ETL processes, enhancing developer productivity and data quality.
Dinesh Jagannathan, Sharath Bhat, Suman Voleti, Praveen Raj
8 min read
Has Summary
--
Airbnb
Advanced
The article discusses the migration of Airbnb's streaming processing architecture from Hadoop YARN to Kubernetes using Apache Flink.
Ran Zhang
11 min read
Has Summary
--
Uber
Advanced
This article discusses Uber's migration of its Apache Hadoop-based data lake to Google Cloud Storage (GCS) and the security measures implemented during this transition.
Matt Mathew, Alexander Gulko, Lei Sun, KK Sriramadhesikan, Alan Cao, Omkar Kakade
20 min read
Includes Code
Has Summary
--
Uber
Advanced
This article discusses the modernization of Uber's logging infrastructure using CLP, focusing on the development of an end-to-end system for managing unstructured logs.
Gao Xin, Jack Luo, Kirk Rodrigues
16 min read
Has Summary
--