Apache Spark Programming Tutorials & Engineering Articles

246 Apache Spark tutorials, guides, and engineering insights from Uber, NVIDIA, LinkedIn, and more

Companies Using This

Uber (94)

Apache Spark Articles & Tutorials

Notion

Intermediate

Two years of vector search at Notion: 10x scale, 1/10th cost

The article discusses Notion's journey in scaling its vector search infrastructure, achieving a 10x increase in scale while reducing costs by 90% over two years.

Apache, Apache Spark, AWS, DynamoDB, Hugging Face, Python

Preeti Gondi, Mickey Liu, Nathan Louie, Calder Lund, Jacob Sager

10 min read

Has Summary

Intermediate

Next Generation DB Ingestion at Pinterest

The article discusses Pinterest's transition to a next-generation database ingestion framework designed to address the limitations of legacy systems.

Apache, Apache Kafka, Apache Spark, AWS, MySQL, Oracle

Pinterest Engineering

10 min read

Includes Code

Has Summary

Uber

Intermediate

How Uber Scaled Data Replication to Move Petabytes Every Day

This article details how Uber optimized Apache Hadoop's DistCp (Distributed Copy) tool to scale its data replication infrastructure from 250 TB to petabytes of daily data movement.

Apache, Apache Spark

Abhay Yadav, Radhika Patwari, Sanjay Sundaresan

15 min read

Has Summary

Uber

Advanced

Apache Hudi™ at Uber: Engineering for Trillion-Record-Scale Data Lake Operations

This article details how Uber built and scaled Apache Hudi to power one of the world's largest data lakes, managing 19,500 datasets with trillions of records across a multi-hundred-petabyte reposit...

Apache, Apache Spark, AWS, Azure, Google Cloud, Google Cloud Storage

Prashant Wason, Balajee Nagasubramaniam, Surya Prasanna Kumar Yalla, Meenal Binwade, Xinli Shang, Jack Song

19 min read

Has Summary

Advanced

PinLanding: Turn Billions of Products into Instant Shopping Collections with Multimodal AI

PinLanding is a multimodal AI pipeline developed by Pinterest to generate shopping collections from billions of products.

Apache, Apache Spark, CLIP, Fine-tuning, GPT, Machine Learning, ModalV

Pinterest Engineering

8 min read

Has Summary

Uber

Advanced

Powering Billion-Scale Vector Search with OpenSearch

The article discusses Uber's transition from traditional keyword-based search using Apache Lucene to implementing semantic vector search with Amazon OpenSearch.

Apache, Apache Spark, CSS, Embedding

Hao Sun, Jiasen Xu, Smit Patel, Anand Kotriwal, Xu Zhang

11 min read

Has Summary

NVIDIA

Advanced

Migrate Apache Spark Workloads to GPUs at Scale on Amazon EMR with Project Aether

The article discusses Project Aether, a tool developed by NVIDIA to facilitate the migration of CPU-based Apache Spark workloads to GPU-accelerated environments on Amazon EMR.

Apache, Apache Spark, AWS, XGBoost

Navin Kumar

6 min read

Includes Code

Has Summary

Uber

Advanced

How Uber Indexes Streaming Data with Pull-Based Ingestion in OpenSearch™

This article discusses how Uber utilizes a pull-based ingestion model in OpenSearch™ to effectively index streaming data.

Apache, Apache Kafka, Apache Spark, AWS, gRPC

Yupeng Fu, Varun Bharadwaj, Shuyi Zhang, Xu Xiong, Michael Froh

14 min read

Has Summary

Uber

Advanced

From Batch to Streaming: Accelerating Data Freshness in Uber’s Data Lake

This article discusses Uber's transition from batch to streaming data ingestion using Apache Flink, which significantly enhances data freshness and operational efficiency.

Apache, Apache Kafka, Apache Spark, Machine Learning

Xinli Shang, Peter Huang, Jing Li, Jing Zhao, Jack Song

6 min read

Has Summary

Uber

Advanced

I/O Observability for Uber’s Massive Petabyte-Scale Data Lake

The article discusses Uber's implementation of I/O observability for its massive petabyte-scale data lake, focusing on the challenges and solutions in monitoring data access patterns across its hyb...

Apache, Apache Spark, Google Cloud, Google Cloud Storage, Grafana, Java, MySQL, Oracle, SQL

Arnav Balyan, Kartik Bommepally, Amruth Sampath, Jing Zhao, Akshayaprakash Sharma

10 min read

Has Summary

NVIDIA

Intermediate

Accelerating Large-Scale Data Analytics with GPU-Native Velox and NVIDIA cuDF

The article discusses the collaboration between IBM and NVIDIA to enhance large-scale data analytics through GPU-native Velox and NVIDIA cuDF, highlighting significant performance improvements over...

Apache, Apache Spark, SQL

Gregory Kimball

7 min read

Has Summary

Uber

Advanced

Uber’s Strategy to Upgrading 2M+ Spark Jobs

Uber's migration from Spark 2.4 to Spark 3.3 involved upgrading over 2 million Spark applications, utilizing innovative automation tools like Iron Dome.

Apache, Apache Spark, Java, Kubernetes, MySQL, Oracle, PySpark, Python, Scala, SQL

Amruth Sampath, Arnav Balyan, Nimesh Khandelwal, Sumit Singh, Parth Halani, Suprit Acharya

8 min read

Has Summary

Netflix

Advanced

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale

The article discusses how Netflix scales its Muse application to provide data-driven creative insights at a massive scale, focusing on the architectural evolution and optimizations made to handle t...

Apache, Apache Spark, GraphQL, Java, React, Spring, Spring Boot

Netflix Technology Blog

10 min read

Includes Code

Has Summary

Stripe

Intermediate

How we built it: Real-time analytics for Stripe Billing

The article discusses the development of a real-time streaming analytics system for Stripe Billing, enabling customers to access subscription metrics with minimal latency.

Apache, Apache Spark

Reed Trevelyan

8 min read

Has Summary

Uber

Advanced

Building Uber’s Data Lake: Batch Data Replication Using HiveSync

This article discusses the architecture and implementation of Uber's HiveSync, a critical service for data replication across its massive data lake.

Apache, Apache Spark, Google Cloud, Java, MySQL, Oracle

Radhika Patwari, Trivedhi Talakola, Rajan Jaiswal, Chayanika Bhandary, Mukesh Verma, Sanjay Sundaresan

14 min read

Has Summary

Stripe

Advanced

Real-time vs batch reconciliation: Practical patterns for keeping data in sync

The article discusses the critical importance of maintaining consistent data across multiple systems as organizations grow.

Apache, Apache Kafka, Apache Spark, AWS, JSON

James Beswick

11 min read

Has Summary

Uber

Advanced

Forecasting Models to Improve Driver Availability at Airports

This article discusses the development and implementation of forecasting models aimed at improving driver availability at airports, which are critical to Uber's ridesharing ecosystem.

Apache, Apache Spark, Cassandra, Kong, Transformer, Transformers

Bob Zheng, Dhruv Ghulati, Manoj Panikkar, Michael (Yichuan) Cai

15 min read

Has Summary

NVIDIA

Advanced

Serverless Distributed Data Processing with Apache Spark and NVIDIA AI on Azure

The article discusses the deployment of a serverless, distributed data processing architecture using Apache Spark and NVIDIA AI on Azure.

Apache, Apache Spark, Azure, Docker, Embedding, HTTPS, Hugging Face, Python, REST API, Serverless, SQL, SQL Server

Alexander Spiridonov

9 min read

Includes Code

Has Summary

Uber

Advanced

The Evolution of Uber’s Search Platform

The article discusses the evolution of Uber's Search Platform, highlighting its transition from Elasticsearch to an in-house solution called Sia, and ultimately to the adoption of OpenSearch.

Apache, Apache Kafka, Apache Spark, AWS, Elasticsearch, Google Cloud, Google Cloud Storage, gRPC, JSON, SQL

Yupeng Fu, Shubham Gupta, Shanshan Song, Mingmin Chen

15 min read

Has Summary

Uber

Intermediate

How Uber Migrated from Hive to Spark SQL for ETL Workloads

This article details Uber's migration from Apache Hive to Apache Spark SQL for ETL workloads, highlighting the motivations behind the transition, the architecture involved, and the challenges faced...

Apache, Apache Spark, Java, JSON, MySQL, Oracle, Serverless, SQL

Kumudini Kakwani, Akshayaprakash Sharma, Nimesh Khandelwal, Aayush Chaturvedi, Chintan Betrabet, Suprit Acharya

14 min read

Has Summary

Uber

Intermediate

From Archival to Access: Config-Driven Data Pipelines

The article discusses Uber's implementation of a configuration-driven archival and retrieval framework designed to manage vast amounts of regulatory data efficiently.

Apache, Apache Spark, AWS, MySQL, Oracle, YAML

Abhishek Dobliyal, Aakash Bhardwaj

12 min read

Has Summary

NVIDIA

Intermediate

Supercharging Fraud Detection in Financial Services with Graph Neural Networks (Updated)

The article discusses the application of Graph Neural Networks (GNNs) in enhancing fraud detection within financial services.

Apache, Apache Spark, Docker, Graph Neural Networks, JSON, Kubernetes, Neural Networks, XGBoost

Naim

10 min read

Includes Code

Has Summary

NVIDIA

Advanced

Spotlight: Atgenomix SeqsLab Scales Health Omics Analysis for Precision Medicine

The article discusses how Atgenomix SeqsLab leverages NVIDIA technologies to enhance health omics analysis for precision medicine.

Apache, Apache Spark, Azure, SQL, XGBoost

Yu-Ting Lin

9 min read

Has Summary

Advanced

How Pinterest Accelerates ML Feature Iterations via Effective Backfill

The article discusses how Pinterest enhances its machine learning feature iterations through an effective backfill process.

Apache, Apache Spark

Pinterest Engineering

14 min read

Has Summary

NVIDIA

Advanced

Predicting Performance on Apache Spark with GPUs

The article discusses the use of GPU acceleration to enhance performance in Apache Spark applications, highlighting the challenges of migrating workloads from CPUs to GPUs.

Apache, Apache Spark, AWS, Azure, JSON, Machine Learning, Optuna, SHAP, SQL, XGBoost

Matt Ahrens

9 min read

Includes Code

Has Summary

NVIDIA

Advanced

Accelerate Deep Learning and LLM Inference with Apache Spark in the Cloud

The article discusses how to accelerate Deep Learning (DL) and Large Language Model (LLM) inference using Apache Spark in cloud environments.

Apache, Apache Spark, AWS, Azure, Deep Learning, Docker, JSON, NumPy, Python, PyTorch, Semantic Search, TensorFlow, Transformers

Rishi Chandra

9 min read

Includes Code

Has Summary

Uber

Intermediate

Migrating Large-Scale Interactive Compute Workloads to Kubernetes Without Disruption

The article discusses Uber's migration of large-scale interactive compute workloads from Peloton to Kubernetes, focusing on minimizing disruption while enhancing resource management and cloud readi...

Apache, Apache Spark, Cassandra, Docker, Google Cloud, Kubernetes, PySpark

Sayan Pal, Rishabh Mishra

12 min read

Has Summary

Uber

Intermediate

Uber’s Journey to Ray on Kubernetes: Resource Management

This article discusses Uber's implementation of elastic resource management on Kubernetes, focusing on enhancements made to support Ray-based job management.

Apache, Apache Spark, Grafana, Kubernetes

Bharat Joshi, Anant Vyas, Ben Wang, Axansh Sheth, Abhinav Dixit

10 min read

Has Summary

NVIDIA

Advanced

Accelerating Apache Parquet Scans on Apache Spark with GPUs

The article discusses how to accelerate Apache Parquet scans on Apache Spark using GPUs, specifically through the RAPIDS Accelerator for Apache Spark.

Apache, Apache Spark, SQL

Matt Ahrens

7 min read

Includes Code

Has Summary

Uber

Intermediate

Uber’s Journey to Ray on Kubernetes: Ray Setup

Uber's blog post discusses their migration of machine learning workloads to Kubernetes using Ray, detailing the challenges faced with their previous setup and the improvements achieved with the new...

Apache, Apache Spark, Deep Learning, Grafana, Kubernetes

Bharat Joshi, Anant Vyas, Ben Wang, Min Cai, Axansh Sheth, Abhinav Dixit

18 min read

Has Summary

NVIDIA

Advanced

Practical Tips for Preventing GPU Fragmentation for Volcano Scheduler

This article discusses strategies for preventing GPU fragmentation in the Volcano Scheduler, focusing on an enhanced scheduling approach that integrates bin-packing with gang scheduling.

Apache, Apache Spark, Kubernetes

Ameya Parab

6 min read

Includes Code

Has Summary

NVIDIA

Intermediate

Efficient ETL with Polars and Apache Spark on NVIDIA Grace CPU

The article discusses the performance and energy efficiency of the NVIDIA Grace CPU Superchip for ETL workloads, comparing it with AMD and Intel CPUs.

Apache, Apache Spark, Polars, Python, Rapids

Gregory Kimball

6 min read

Includes Code

Has Summary

NVIDIA

Intermediate

Accelerate Apache Spark ML on NVIDIA GPUs with Zero Code Change

The article discusses how the NVIDIA RAPIDS Accelerator for Apache Spark enables zero code change for GPU-accelerated data processing, enhancing the performance of Apache Spark ML applications.

Apache, Apache Spark, AWS, Pandas, PySpark, Python, SQL

Erik Ordentlich

5 min read

Includes Code

Has Summary

NVIDIA

Intermediate

JSON Lines Reading with pandas 100x Faster Using NVIDIA cuDF

The article discusses how to read JSON Lines data using NVIDIA's cuDF library, achieving performance improvements of up to 100 times faster than traditional pandas methods.

Apache, Apache Arrow, Apache Spark, Docker, JSON, Python

Karthikeyan Natarajan

10 min read

Includes Code

Has Summary

NVIDIA

Intermediate

Accelerating JSON Processing on Apache Spark with GPUs

The article discusses the optimization of JSON processing on Apache Spark using GPU acceleration, highlighting significant performance improvements achieved by a Fortune 100 retail company.

Apache, Apache Spark, JSON, SQL

Matt Ahrens

8 min read

Includes Code

Has Summary

Uber

Advanced

How Uber Uses Ray® to Optimize the Rides Business

The article discusses how Uber utilizes Ray®, a general compute engine for Python®, to enhance the efficiency of its rides business through improved machine learning model performance and optimizat...

Apache, Apache Spark, AWS, Docker, Kubernetes, Pandas, PySpark, XGBoost

Kaichen Wei, Matt Walker, Peng Zhang

15 min read

Has Summary

Advanced

Resource Management with Apache YuniKorn™ for Apache Spark™ on AWS EKS at Pinterest

The article discusses the transition from Apache Hadoop YARN to Apache YuniKorn for resource management in Pinterest's batch processing platform, Monarch, now rebranded as Moka.

Apache, Apache Spark, AWS, Kubernetes

Pinterest Engineering

10 min read

Has Summary

Uber

Advanced

Streamlining Financial Precision: Uber’s Advanced Settlement Accounting System

The article discusses Uber's advanced settlement accounting system, which is crucial for managing financial transactions involving payment service providers (PSPs).

Apache, Apache Kafka, Apache Spark, Cassandra

Onkar Singh, Sai Sameera Grandhi, Nagesh Kumar Mankala, Abhinav Agarwal

12 min read

Has Summary

Uber

Advanced

Open Source and In-House: How Uber Optimizes LLM Training

The article discusses how Uber optimizes the training of Large Language Models (LLMs) using both open-source and in-house models.

Apache, Apache Kafka, Apache Spark, Comet, Docker, Google Cloud, GPT, GPT-4, Hugging Face, Kubernetes, Mistral, PyTorch, SQL, Transformers

Bo Ling, Jiapei Huang, Baojun Liu, Chongxiao Cao, Anant Vyas, Peng Zhang

11 min read

Has Summary

Advanced

Ray Batch Inference at Pinterest (Part 3)

This article discusses the implementation of Ray Batch Inference at Pinterest, highlighting its advantages over previous solutions like Apache Spark and Torch Dataloader.

Apache, Apache Spark, AWS, Hugging Face, Large Language Models, LLaMA, PyTorch, Ray Tune, TensorFlow

Pinterest Engineering

11 min read

Includes Code

Has Summary

Uber

Intermediate

Genie: Uber’s Gen AI On-Call Copilot

The article discusses Genie, Uber's generative AI on-call copilot designed to enhance communication and efficiency in on-call operations.

Apache, Apache Spark, Copilot, Embedding, Fine-tuning, PySpark

Paarth Chothani, Eduards Sidorovics, Xiyuan Feng, Nicholas Marcott, Jonathan Li, Chun Zhu, Kailiang Fu, Meghana Somasundara

11 min read

Has Summary

NVIDIA

Intermediate

NVIDIA CUDA-X Now Accelerates the Polars Data Processing Library

NVIDIA has announced that its CUDA-X platform now accelerates the Polars Data Processing Library, enhancing its performance for data analytics.

Apache, Apache Spark, Polars

Nick Becker

3 min read

Has Summary

Uber

Intermediate

DataMesh: How Uber laid the foundations for the data lake cloud migration

The article discusses Uber's migration of its batch data platform to the cloud, focusing on the implementation of DataMesh principles.

Apache, Apache Spark, Google Cloud, Google Cloud Storage, Grafana, Java, MySQL, Oracle

Arun Mahadeva Iyer, Abhi Khune, Sahana Bhat

11 min read

Has Summary

Uber

Advanced

Lucene: Uber’s Search Platform Version Upgrade

The article discusses Uber's upgrade of its search platform from Lucene version 7.5.0 to 9.4.

Apache, Apache Spark, Git, Java, Scala

Anand Kotriwal, Aparajita Pandey, Charu Jain, Yupeng Fu

12 min read

Has Summary

Uber

Intermediate

Pinot for Low-Latency Offline Table Analytics

The article discusses how Uber utilizes Apache Pinot for low-latency offline table analytics, highlighting its capabilities in handling various use cases, including real-time and offline data inges...

Apache, Apache Kafka, Apache Spark, gRPC, Java, MySQL, Oracle, PySpark, Scala

Ankit Sultana, Caner Balci

15 min read

Has Summary

NVIDIA

Advanced

NVIDIA GH200 Superchip Delivers Breakthrough Energy Efficiency and Node Consolidation for Apache Spark

The article discusses the NVIDIA GH200 Grace Hopper Superchip, highlighting its significant advancements in energy efficiency and node consolidation for Apache Spark workloads.

Apache, Apache Spark, Deep Learning, Machine Learning, SQL, Vultr

Amr Elmeleegy

7 min read

Has Summary

Uber

Intermediate

Sparkle: Standardizing Modular ETL at Uber

The article discusses the Sparkle framework developed by Uber to standardize modular ETL processes, enhancing developer productivity and data quality.

Apache, Apache Kafka, Apache Spark, Cassandra, Java, MySQL, Oracle, Scala, Spring, Spring Boot, SQL, YAML

Dinesh Jagannathan, Sharath Bhat, Suman Voleti, Praveen Raj

8 min read

Has Summary

Airbnb

Advanced

Apache Flink® on Kubernetes

The article discusses the migration of Airbnb's stream processing architecture from Hadoop YARN to Kubernetes using Apache Flink.

Apache, Apache Spark, AWS, Kubernetes

Ran Zhang

11 min read

Has Summary

Uber

Advanced

Enabling Security for Hadoop Data Lake on Google Cloud Storage

This article discusses Uber's migration of its Apache Hadoop-based data lake to Google Cloud Storage (GCS) and the security measures implemented during this transition.

Apache, Apache Spark, Caching, CQRS, Google Cloud, Google Cloud Storage, gRPC, MVP, OAuth, Redis

Matt Mathew, Alexander Gulko, Lei Sun, KK Sriramadhesikan, Alan Cao, Omkar Kakade

20 min read

Includes Code

Has Summary

Uber

Advanced

Modernizing Logging at Uber with CLP (Part II)

This article discusses the modernization of Uber's logging infrastructure using CLP, focusing on the development of an end-to-end system for managing unstructured logs.

Apache, Apache Spark, Elasticsearch, Java, JSON

Gao Xin, Jack Luo, Kirk Rodrigues

16 min read

Has Summary