PySpark Programming Tutorials & Engineering Articles

Uber’s Strategy to Upgrading 2M+ Spark Jobs

Advanced

Uber's migration from Spark 2.4 to Spark 3.3 involved upgrading over 2 million Spark applications, utilizing innovative automation tools like Iron Dome.

Apache, Apache Spark, Java, Kubernetes, MySQL, Oracle, PySpark, Python, Scala, SQL

Amruth Sampath, Arnav Balyan, Nimesh Khandelwal, Sumit Singh, Parth Halani, Suprit Acharya

8 min read

Has Summary

Advanced

Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 2 of 2)

The article discusses Pinterest's transition to Moka, a next-generation data processing platform built on AWS Elastic Kubernetes Service (EKS).

AWS, Helm, Java, Kubernetes, Load Balancer, Prometheus, PySpark, Python, PyTorch, React, Terraform

Pinterest Engineering

16 min read

Has Summary

Advanced

Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 1 of 2)

This article discusses Pinterest's transition from a Hadoop-based platform to a Kubernetes-based data processing solution named Moka.

Apache, AWS, Envoy, Helm, Java, Kubernetes, Prometheus, Puppet, PySpark, SQL, Terraform

Pinterest Engineering

19 min read

Includes Code

Has Summary

Migrating Large-Scale Interactive Compute Workloads to Kubernetes Without Disruption

Intermediate

The article discusses Uber's migration of large-scale interactive compute workloads from Peloton to Kubernetes, focusing on minimizing disruption while enhancing resource management and cloud readi...

Apache, Apache Spark, Cassandra, Docker, Google Cloud, Kubernetes, PySpark

Sayan Pal, Rishabh Mishra

12 min read

Has Summary

Accelerate Apache Spark ML on NVIDIA GPUs with Zero Code Change

Intermediate

The article discusses how the NVIDIA RAPIDS Accelerator for Apache Spark enables zero code change for GPU-accelerated data processing, enhancing the performance of Apache Spark ML applications.

Apache, Apache Spark, AWS, Pandas, PySpark, Python, SQL

Erik Ordentlich

5 min read

Includes Code

Has Summary

Meta

Beginner

Aligning Velox and Apache Arrow: Towards composable data management

The article discusses the collaboration between Meta, Voltron Data, and the Apache Arrow community to align Apache Arrow with Velox, Meta's open-source execution engine.

Apache, Apache Arrow, Apache Spark, Java, Polars, PySpark

Pedro Pedreira

10 min read

Has Summary

Reduce Apache Spark ML Compute Costs with New Algorithms in Spark RAPIDS ML Library

Advanced

The article discusses the Spark RAPIDS ML library, an open-source Python package that accelerates Apache Spark ML applications using NVIDIA GPU technology.

Apache, Apache Spark, AWS, PySpark, Python, scikit-learn

Erik Ordentlich

8 min read

Includes Code

Has Summary

Intermediate

Last Mile Data Processing with Ray

The article discusses how Pinterest improved its machine learning (ML) dataset iteration speed using Ray, an open-source framework for scaling AI and ML workloads.

Apache, Apache Spark, Machine Learning, PySpark, Python, PyTorch

Pinterest Engineering

9 min read

Has Summary

Intermediate

Securely Scaling Big Data Access Controls At Pinterest

This article discusses Pinterest's implementation of a finer-grained access control (FGAC) framework to manage data access securely and efficiently within their data engineering platform.

Apache, AWS, AWS EC2, Kubernetes, OAuth, PySpark, Scala

Pinterest Engineering

18 min read

Has Summary

Innovative Recommendation Applications Using Two Tower Embeddings at Uber

Advanced

The article discusses the implementation of Two-Tower Embeddings (TTE) at Uber, highlighting its role in enhancing the efficiency and scalability of recommendation systems.

Apache, Apache Spark, Artificial Intelligence, Deep Learning, Embedding, Machine Learning, PySpark

Bo Ling, Melissa Barr, Dhruva Dixith Kurra, Chun Zhu, Nicholas Marcott

18 min read

Has Summary

Distributed Deep Learning Made Easy with Spark 3.4

Advanced

The article discusses the integration of distributed deep learning with Apache Spark 3.4, highlighting new built-in APIs for both distributed model training and inference.

Apache, Apache Arrow, Apache Spark, Deep Learning, Hugging Face, NumPy, Pandas, PySpark, Python, PyTorch, TensorFlow

Lee Yang

6 min read

Includes Code

Has Summary

New GPU Library Lowers Compute Costs for Apache Spark ML

Advanced

The article discusses the introduction of Spark RAPIDS ML, a new GPU-accelerated library for Apache Spark ML that enhances the performance and cost-effectiveness of machine learning applications.

Apache, Apache Spark, AWS, PySpark, Python, SQL

Erik Ordentlich

5 min read

Includes Code

Has Summary

Intermediate

Inside Look: Measuring Developer Productivity and Happiness at LinkedIn

The article discusses LinkedIn's approach to measuring developer productivity and happiness through the development of the Developer Insights Hub (iHub).

MVP, MySQL, PySpark, Python

LinkedIn Engineering Team

13 min read

Includes Code

Has Summary

Smarter Retail Data Analytics with GPU Accelerated Apache Spark Workloads on Google Cloud Dataproc

Intermediate

The article discusses how retailers can enhance their data analytics capabilities using GPU-accelerated Apache Spark workloads on Google Cloud Dataproc.

Apache, Apache Spark, Google Cloud, Google Cloud Storage, JSON, PySpark, Python, Shell, SQL

Saurav Agarwal

12 min read

Includes Code

Has Summary

The Complex Data Models Behind Shopify's Tax Insights Feature

Intermediate

The article discusses the complexities of tax compliance for U.S. merchants and details the development of Shopify's Tax Insights feature.

Google Cloud, Google Cloud Storage, PySpark, SQL

Siraj Ali

12 min read

Has Summary

Netflix

Advanced

Ready-to-go sample data pipelines with Dataflow

The article discusses the implementation of sample data pipelines using Dataflow at Netflix, focusing on bootstrapping, standardization, and automation of batch data pipelines.

Apache, Apache Spark, Machine Learning, PySpark, Scala, SQL, YAML

Netflix Technology Blog

17 min read

Includes Code

Has Summary

Palantir

Beginner

How Anyone Can Integrate SAP Data in Hours

The article discusses how to quickly integrate SAP data using Palantir's HyperAuto, emphasizing the automation of complex data integration tasks that typically take months.

AWS, Azure, PySpark

Palantir

12 min read

Has Summary

Enabling Offline Inferences at Uber Scale

Advanced

The article discusses Uber's approach to automating offline inferences using machine learning and natural language processing on support interaction data.

Apache, Apache Spark, Docker, PySpark, Streamlit, XGBoost

Neeraj Dhake, Aravind Ranganathan

12 min read

Has Summary

Advanced

FastTreeSHAP: Accelerating SHAP value computation for trees

The article introduces FastTreeSHAP, an open-source Python package designed to accelerate SHAP value computations for tree-based models.

Azure, CatBoost, LightGBM, LIME, Machine Learning, PySpark, Python, scikit-learn, SHAP, XGBoost

Jilei Yang

19 min read

Has Summary

Project RADAR: Intelligent Early Fraud Detection System with Humans in the Loop

Advanced

The article discusses Project RADAR, an intelligent fraud detection system developed by Uber that integrates machine learning and human expertise to identify and mitigate fraudulent activities in r...

Apache, Apache Spark, PySpark, Scala, SQL

Sergey Zelvenskiy, Garvit Harisinghani, Tiffany Yu, Edwin Ng, Robin Wei

14 min read

Has Summary

Advanced

Efficient Resource Management at Pinterest’s Batch Processing Platform

This article discusses Pinterest's Batch Processing Platform, Monarch, focusing on efficient resource management to ensure quality of service (QoS) while maintaining cost efficiency.

AWS, AWS EC2, PySpark, SQL

Pinterest Engineering

16 min read

Has Summary

Airbnb

Intermediate

How Airbnb Built “Wall” to prevent data bugs

The article discusses how Airbnb developed the Wall framework to enhance data quality and prevent data bugs across its data engineering workflows.

Apache, PySpark, Scala, SQL, YAML

Subrata (Subu) Biswas

10 min read

Includes Code

Has Summary

Accelerating Volkswagen Connected Car Data Pipelines 100x Faster with NVIDIA RAPIDS

Intermediate

The article discusses how Volkswagen is leveraging NVIDIA RAPIDS to accelerate connected car data pipelines by 100x, addressing challenges such as geospatial indexing and K-Nearest Neighbors.

Apache, PySpark, Python

Chaitanya Kumar Dondapati

16 min read

Includes Code

Has Summary

The Evolution of Data Science Workbench

Intermediate

The article discusses the evolution of the Data Science Workbench (DSW) at Uber, highlighting its growth, challenges, and innovations over the past three years.

Apache, Apache Spark, MySQL, PySpark

Peng Du, Taikun Liu, Sophie Wang, Hong Wang, Hongdi Li, Jin Sun

15 min read

Has Summary

Intermediate

How Pinterest Fights Spam Using Machine Learning

The article discusses how Pinterest employs machine learning to combat spam and harmful content on its platform.

Machine Learning, PySpark, SQL

Pinterest Engineering

5 min read

Has Summary

Double Entry Transition Tables: How We Track State Changes At Shopify

Beginner

The article discusses the implementation of double entry transition tables at Shopify to effectively track state changes for merchants using Shopify Balance.

PySpark, SQL

Justin Pauley

9 min read

Includes Code

Has Summary

Intermediate

How Pinterest fights misinformation, hate speech, and self-harm content with machine learning

The article discusses how Pinterest employs machine learning to combat misinformation, hate speech, and self-harm content on its platform.

Java, Machine Learning, Pandas, PySpark, SQL, TensorFlow

Pinterest Engineering

7 min read

Has Summary

No Code Workflow Orchestrator for Building Batch & Streaming Pipelines at Scale

Intermediate

The article discusses Uber's development of uWorc, a no-code workflow orchestrator designed to simplify the creation of batch and streaming data pipelines.

Apache, AWS, Azure, Cassandra, JSON, MySQL, PySpark, SQL

Sandeep Karmakar, Sriharsha Chintalapani

11 min read

Has Summary

Horovod v0.21: Optimizing Network Utilization with Local Gradient Aggregation and Grouped Allreduce

Intermediate

Horovod v0.21 introduces significant enhancements aimed at optimizing network utilization for distributed deep learning training.

Apache, Apache Spark, AWS, Azure, Deep Learning, Keras, Machine Learning, PySpark, PyTorch, TensorFlow

Kerri Brown

8 min read

Has Summary

How to Build a Production Grade Workflow with SQL Modelling

Beginner

This article discusses the development of Seamster, a production-grade SQL modeling workflow created by Shopify to improve data reporting efficiency.

Apache, Golang, Pandas, PySpark, SQL, YAML

Michelle Ark

12 min read

Includes Code

Has Summary

Palantir

Beginner

A PySpark Style Guide for Real-world Data Scientists

The article discusses the importance of establishing coding conventions for PySpark to enhance maintainability and readability in data engineering.

Julia, MATLAB, PySpark, SQL

Palantir

8 min read

Includes Code

Has Summary

How to Build an Experiment Pipeline from Scratch

Intermediate

This article outlines the process of building an email experimentation pipeline from scratch, addressing the challenges faced by Shopify's data teams in conducting A/B tests for external channels.

PySpark, SQL

Mojan Benham

10 min read

Has Summary

Making Python Data Science Enterprise-Ready with Dask

Intermediate

The article discusses how Dask, an open-source library, enhances Python's capabilities for data science and machine learning, making it suitable for enterprise-level applications.

Apache, Apache Spark, Dask, Fortran, JSON, NumPy, PySpark, Python, scikit-learn, SciPy, SQL, XGBoost

Jacob Schmitt

10 min read

Has Summary

How to Track State with Type 2 Dimensional Models

Intermediate

The article discusses how to track historical state using Type 2 dimensional models in application databases, contrasting it with the traditional Type 1 dimension approach.

MySQL, PySpark, Rails, Ruby, SQL

Ian Whitestone

13 min read

Includes Code

Has Summary

Advanced

Empowering Pinterest data scientists and machine learning engineers with PySpark

The article discusses how Pinterest empowered its data scientists and machine learning engineers by building a PySpark infrastructure that addresses challenges faced with existing tools like Hive a...

Apache, Jenkins, Kubernetes, Machine Learning, MVP, PySpark, Python, REST API, Scala, SQL, TensorFlow

Pinterest Engineering

7 min read

Has Summary

Monitoring Data Quality at Scale with Statistical Modeling

Intermediate

The article discusses Uber's approach to monitoring data quality at scale using statistical modeling.

PySpark

Ye Henry Li, Ritesh Agrawal, Santhosh Shanmugam, Andrea Pasqua

14 min read

Has Summary

Categorizing Products at Scale

Advanced

The article discusses the challenges and methodologies involved in categorizing products at scale on the Shopify platform, which has over 1 million business owners and billions of products.

Artificial Intelligence, Computer Vision, Golang, GPT, HTML, PySpark

Jeet Mehta

13 min read

Includes Code

Has Summary

Merging Telemetry and Logs from Microservices at Scale with Apache Spark

Intermediate

The article discusses the challenges and solutions involved in merging telemetry and logs from microservices at scale using Apache Spark.

Apache, Apache Kafka, Apache Spark, AWS, Microservices, PySpark, SQL

Niranjan Nataraja

11 min read

Includes Code

Has Summary

Uber Open Source in 2019: Community Engagement and Contributions

Intermediate

The article discusses Uber's open source initiatives in 2019, highlighting the company's contributions to the open source community, the establishment of the Open Source Program Office (OSPO), and ...

Apache, Apache Spark, Deep Learning, Docker, Git, Keras, Kubernetes, Node.js, PySpark, PyTorch, SQL, TensorFlow

Uber

7 min read

Has Summary

Evolving Michelangelo Model Representation for Flexibility at Scale

Advanced

The article discusses the evolution of the Michelangelo model representation at Uber to enhance flexibility and scalability in machine learning model serving.

Apache, Apache Spark, Docker, Java, Machine Learning, PySpark, SQL, TensorFlow, Transformer, Transformers

Anne Holler, Michael Mui

15 min read

Has Summary

Palantir

Advanced

Rethinking the Foundry job orchestration back end: From CRUD to event-sourcing

This article discusses the transition of Palantir Foundry's job orchestration system from a CRUD-based approach to an event-sourced architecture.

Cassandra, CQRS, Git, Java, Microservices, MySQL, PySpark, SQL

Robert Fink

14 min read

Includes Code

Has Summary