# PySpark Programming Tutorials & Engineering Articles
53 PySpark tutorials, guides, and engineering insights from Uber, Pinterest, NVIDIA, and more
PySpark Articles & Tutorials
The article reflects on a decade of AI platform development at Pinterest, detailing the evolution from fragmented machine learning stacks to a unified AI platform that supports various models.
AutoML, Docker, Embedding, Generative AI, Java, Kubernetes, LightGBM, PySpark, Python, PyTorch, Seed, SQL, TensorFlow, Thrift, Transformer
Pinterest Engineering
22 min read
Has Summary
--
Uber's migration from Spark 2.4 to Spark 3.3 involved upgrading over 2 million Spark applications, utilizing innovative automation tools like Iron Dome.
Amruth Sampath, Arnav Balyan, Nimesh Khandelwal, Sumit Singh, Parth Halani, Suprit Acharya
8 min read
Has Summary
--
The article discusses Pinterest's transition to Moka, a next-generation data processing platform built on AWS Elastic Kubernetes Service (EKS).
Pinterest Engineering
16 min read
Has Summary
--
This article discusses Pinterest's transition from a Hadoop-based platform to a Kubernetes-based data processing solution named Moka.
The article discusses Uber's migration of large-scale interactive compute workloads from Peloton to Kubernetes, focusing on minimizing disruption while enhancing resource management and cloud readi...
Sayan Pal, Rishabh Mishra
12 min read
Has Summary
--
The article discusses how the NVIDIA RAPIDS Accelerator for Apache Spark enables zero code change for GPU-accelerated data processing, enhancing the performance of Apache Spark ML applications.
The article discusses Meta's development of data logs, a tool designed to enhance user access to their data across platforms.
The article discusses how Uber utilizes Ray®, a general compute engine for Python®, to enhance the efficiency of its rides business through improved machine learning model performance and optimizat...
Kaichen Wei, Matt Walker, Peng Zhang
15 min read
Has Summary
--
The article discusses Genie, Uber's generative AI on-call copilot designed to enhance communication and efficiency in on-call operations.
Paarth Chothani, Eduards Sidorovics, Xiyuan Feng, Nicholas Marcott, Jonathan Li, Chun Zhu, Kailiang Fu, Meghana Somasundara
11 min read
Has Summary
--
The article discusses how Uber utilizes Apache Pinot for low-latency offline table analytics, highlighting its capabilities in handling various use cases, including real-time and offline data inges...
Ankit Sultana, Caner Balci
15 min read
Has Summary
--
This article details Slack's migration from AWS EMR 5 with Spark 2 to EMR 6 with Spark 3, highlighting the challenges faced and the performance improvements achieved.
The article discusses how Notion built and scaled its data lake to manage a tenfold increase in data over three years, driven by user and content growth.
The article discusses the integration of Ray infrastructure at Pinterest, detailing the journey, challenges, and solutions implemented to optimize machine learning workflows.
Pinterest Engineering
16 min read
Includes Code
Has Summary
--
The article discusses the collaboration between Meta, Voltron Data, and the Apache Arrow community to align Apache Arrow with Velox, Meta's open-source execution engine.
Pedro Pedreira
10 min read
Has Summary
--
The article discusses the Spark RAPIDS ML library, an open-source Python package that accelerates Apache Spark ML applications using NVIDIA GPU technology.
Erik Ordentlich
8 min read
Includes Code
Has Summary
--
The article discusses how Pinterest improved its machine learning (ML) dataset iteration speed using Ray, an open-source framework for scaling AI and ML workloads.
Pinterest Engineering
9 min read
Has Summary
--
This article discusses Pinterest's implementation of a finer-grained access control (FGAC) framework to manage data access securely and efficiently within their data engineering platform.
The article discusses the implementation of Two-Tower Embeddings (TTE) at Uber, highlighting its role in enhancing the efficiency and scalability of recommendation systems.
Bo Ling, Melissa Barr, Dhruva Dixith Kurra, Chun Zhu, Nicholas Marcott
18 min read
Has Summary
--
The article discusses the integration of distributed deep learning with Apache Spark 3.4, highlighting new built-in APIs for both distributed model training and inference.
Lee Yang
6 min read
Includes Code
Has Summary
--
The article discusses the introduction of Spark RAPIDS ML, a new GPU-accelerated library for Apache Spark ML that enhances the performance and cost-effectiveness of machine learning applications.
The article discusses LinkedIn's approach to measuring developer productivity and happiness through the development of the Developer Insights Hub (iHub).
The article discusses how retailers can enhance their data analytics capabilities using GPU-accelerated Apache Spark workloads on Google Cloud Dataproc.
Saurav Agarwal
12 min read
Includes Code
Has Summary
--
The article discusses the complexities of tax compliance for U.S. merchants and details the development of Shopify's Tax Insights feature.
Siraj Ali
12 min read
Has Summary
--
The article discusses the implementation of sample data pipelines using Dataflow at Netflix, focusing on bootstrapping, standardization, and automation of batch data pipelines.
Netflix Technology Blog
17 min read
Includes Code
Has Summary
--
The article discusses how to quickly integrate SAP data using Palantir's HyperAuto, emphasizing the automation of complex data integration tasks that typically take months.
The article discusses Uber's approach to automating offline inferences using machine learning and natural language processing on support interaction data.
The article introduces FastTreeSHAP, an open-source Python package designed to accelerate SHAP value computations for tree-based models.
Jilei Yang
19 min read
Has Summary
--
The article discusses Project RADAR, an intelligent fraud detection system developed by Uber that integrates machine learning and human expertise to identify and mitigate fraudulent activities in r...
Sergey Zelvenskiy, Garvit Harisinghani, Tiffany Yu, Edwin Ng, Robin Wei
14 min read
Has Summary
--
This article discusses Pinterest's Batch Processing Platform, Monarch, focusing on efficient resource management to ensure quality of service (QoS) while maintaining cost efficiency.
The article discusses how Airbnb developed the Wall framework to enhance data quality and prevent data bugs across its data engineering workflows.
The article discusses how Volkswagen is leveraging NVIDIA RAPIDS to accelerate connected car data pipelines by 100x, addressing challenges such as geospatial indexing and K-Nearest Neighbors.
The article discusses the evolution of the Data Science Workbench (DSW) at Uber, highlighting its growth, challenges, and innovations over the past three years.
Peng Du, Taikun Liu, Sophie Wang, Hong Wang, Hongdi Li, Jin Sun
15 min read
Has Summary
--
The article discusses how Pinterest employs machine learning to combat spam and harmful content on its platform.
Pinterest Engineering
5 min read
Has Summary
--
The article discusses the implementation of double entry transition tables at Shopify to effectively track state changes for merchants using Shopify Balance.
The article discusses how Pinterest employs machine learning to combat misinformation, hate speech, and self-harm content on its platform.
Pinterest Engineering
7 min read
Has Summary
--
The article discusses Uber's development of uWorc, a no-code workflow orchestrator designed to simplify the creation of batch and streaming data pipelines.
Horovod v0.21 introduces significant enhancements aimed at optimizing network utilization for distributed deep learning training.
Kerri Brown
8 min read
Has Summary
--
This article discusses the development of Seamster, a production-grade SQL modeling workflow created by Shopify to improve data reporting efficiency.
The article discusses the importance of establishing coding conventions for PySpark to enhance maintainability and readability in data engineering.
This article outlines the process of building an email experimentation pipeline from scratch, addressing the challenges faced by Shopify's data teams in conducting A/B tests for external channels.
The article discusses how Dask, an open-source library, enhances Python's capabilities for data science and machine learning, making it suitable for enterprise-level applications.
The article discusses how to track historical state using Type 2 dimensional models in application databases, contrasting it with the traditional Type 1 dimension approach.
The article discusses how Pinterest empowered its data scientists and machine learning engineers by building a PySpark infrastructure that addresses challenges faced with existing tools like Hive a...
Pinterest Engineering
7 min read
Has Summary
--
The article discusses Uber's approach to monitoring data quality at scale using statistical modeling.
Ye Henry Li, Ritesh Agrawal, Santhosh Shanmugam, Andrea Pasqua
14 min read
Has Summary
--
The article discusses the challenges and methodologies involved in categorizing products at scale on the Shopify platform, which has over 1 million business owners and billions of products.
Jeet Mehta
13 min read
Includes Code
Has Summary
--
The article discusses the challenges and solutions involved in merging telemetry and logs from microservices at scale using Apache Spark.
Niranjan Nataraja
11 min read
Includes Code
Has Summary
--
The article discusses Uber's open source initiatives in 2019, highlighting the company's contributions to the open source community, the establishment of the Open Source Program Office (OSPO), and ...
Uber
7 min read
Has Summary
--
The article discusses the evolution of the Michelangelo model representation at Uber to enhance flexibility and scalability in machine learning model serving.
Anne Holler, Michael Mui
15 min read
Has Summary
--
This article discusses the transition of Palantir Foundry's job orchestration system from a CRUD-based approach to an event-sourced architecture.
The article discusses the latest updates to Horovod, a distributed deep learning framework, which now includes support for PySpark and Apache MXNet, along with features aimed at enhancing training ...
Carsten Jacobsen
7 min read
Has Summary
--