Scala Programming Tutorials & Engineering Articles

Uber’s Strategy to Upgrading 2M+ Spark Jobs

Advanced

Uber's migration from Spark 2.4 to Spark 3.3 involved upgrading over 2 million Spark applications, utilizing innovative automation tools like Iron Dome.

Apache, Apache Spark, Java, Kubernetes, MySQL, Oracle, PySpark, Python, Scala, SQL

Amruth Sampath, Arnav Balyan, Nimesh Khandelwal, Sumit Singh, Parth Halani, Suprit Acharya

8 min read

Has Summary

Migrating Airbnb’s JVM Monorepo to Bazel

Advanced

The article discusses Airbnb's migration of its JVM monorepo from Gradle to Bazel, detailing the motivations, process, and outcomes of this significant transition.

AWS, GraphQL, Java, Kotlin, Scala, Thrift

Thomas Bao

14 min read

Has Summary

Advanced

Optimizing LinkedIn Sales Navigator’s search pipeline with Spark

The article discusses the optimization of LinkedIn Sales Navigator’s search pipeline using Apache Spark, highlighting the transition from MapReduce to Spark and the resulting performance improvemen...

Avro, Scala, SQL

Chunxu Tang

14 min read

Includes Code

Has Summary

Lucene: Uber’s Search Platform Version Upgrade

Advanced

The article discusses Uber's upgrade of its search platform from Lucene version 7.5.0 to 9.4.

Apache, Apache Spark, Git, Java, Scala

Anand Kotriwal, Aparajita Pandey, Charu Jain, Yupeng Fu

12 min read

Has Summary

Pinot for Low-Latency Offline Table Analytics

Intermediate

The article discusses how Uber utilizes Apache Pinot for low-latency offline table analytics, highlighting its capabilities in handling various use cases, including real-time and offline data inges...

Apache, Apache Kafka, Apache Spark, gRPC, Java, MySQL, Oracle, PySpark, Scala

Ankit Sultana, Caner Balci

15 min read

Has Summary

Sparkle: Standardizing Modular ETL at Uber

Intermediate

The article discusses the Sparkle framework developed by Uber to standardize modular ETL processes, enhancing developer productivity and data quality.

Apache, Apache Kafka, Apache Spark, Cassandra, Java, MySQL, Oracle, Scala, Spring, Spring Boot, SQL, YAML

Dinesh Jagannathan, Sharath Bhat, Suman Voleti, Praveen Raj

8 min read

Has Summary

Slack

Intermediate

Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack

This article details Slack's migration from AWS EMR 5 with Spark 2 to EMR 6 with Spark 3, highlighting the challenges faced and the performance improvements achieved.

Apache, AWS, Chef, JSON, Machine Learning, PySpark, Python, Scala, SQL

Nilanjana Mukherjee

12 min read

Includes Code

Has Summary

Notion

Advanced

Building and scaling Notion’s data lake

The article discusses how Notion built and scaled its data lake to manage a tenfold increase in data over three years, driven by user and content growth.

Apache, AWS, AWS RDS, Embedding, PySpark, Scala, SQL, Vector Database

XZ Tie, Nathan Louie, Thomas Chow, Darin Im, Abhishek Modi, Wendy Jiao

13 min read

Includes Code

Has Summary

Spotify

Advanced

Data Platform Explained Part II

This article continues the exploration of Spotify's data platform, detailing its building blocks, scalability, and the community-driven approach to managing a complex data ecosystem.

Scala

Anastasia Khlebnikova (Senior Engineer) and Carol Cunha (Product Manager)

6 min read

Has Summary

Migrating a Trillion Entries of Uber’s Ledger Data from DynamoDB to LedgerStore

Advanced

This article details Uber's migration of over a trillion entries of ledger data from DynamoDB to LedgerStore, focusing on the challenges, strategies, and outcomes of the process.

Apache, Apache Spark, AWS, AWS S3, DynamoDB, Java, Scala

Raghav Gautam, Erik Seaberg, Abhishek Kanhar

12 min read

Has Summary

Chronon, Airbnb’s ML Feature Platform, Is Now Open Source

Intermediate

Chronon, Airbnb's ML Feature Platform, is now open source, providing tools for observability and management that simplify the complexity of data engineering for machine learning practitioners.

Java, Ruby, Scala

Varant Zanoyan

11 min read

Includes Code

Has Summary

Intermediate

Unlocking AI Assisted Development Safely: From Idea to GA

The article discusses Pinterest's journey in implementing AI-assisted development, focusing on the balance between innovation and safety.

Copilot, Large Language Models, Scala

Pinterest Engineering

7 min read

Has Summary

Migrating Our iOS Build System from Buck to Bazel

Intermediate

The article discusses Airbnb's migration of its iOS build system from Buck to Bazel, detailing the approach taken to ensure a smooth transition with minimal disruption to developer workflows.

Golang, Java, JavaScript, Kotlin, Scala, YAML

Qing Yang

9 min read

Includes Code

Has Summary

Psyberg: Automated end to end catch up

Beginner

The article discusses Psyberg, a tool developed by Netflix to automate the end-to-end catchup of data pipelines, particularly focusing on how it manages late-arriving data and enhances workflow eff...

Scala

Netflix Technology Blog

7 min read

Has Summary

Spotify

Advanced

Introducing Voyager: Spotify’s New Nearest-Neighbor Search Library

Spotify has introduced Voyager, a new nearest-neighbor search library that significantly improves upon its predecessor, Annoy, by offering increased speed and accuracy.

Google Cloud, Java, Kubernetes, NumPy, PostgreSQL, Scala

Peter Sobot

4 min read

Includes Code

Has Summary

Intermediate

Career stories: The math-music connection in data science

The article explores Javier's transition from a music career to data science, highlighting the intersection of math and music in his journey.

Java, Machine Learning, Python, Scala

LinkedIn Engineering Team

5 min read

Has Summary

Intermediate

Securely Scaling Big Data Access Controls At Pinterest

This article discusses Pinterest's implementation of a finer-grained access control (FGAC) framework to manage data access securely and efficiently within their data engineering platform.

Apache, AWS, AWS EC2, Kubernetes, OAuth, PySpark, Scala

Pinterest Engineering

18 min read

Has Summary

Intermediate

LinkedIn Integrates Protocol Buffers With Rest.li for Improved Microservices Performance

LinkedIn has integrated Google Protocol Buffers (Protobuf) with Rest.li to enhance microservices performance, achieving significant reductions in latency and improvements in resource utilization.

CBOR, gRPC, Java, JavaScript, JSON, Kotlin, MessagePack, Microservices, Protocol Buffers, Python, Scala, Swift

Karthik Ramgopal

7 min read

Has Summary

Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi

Advanced

The article discusses how Uber implemented an incremental ETL process using Apache Hudi to manage its transactional data lake.

Apache, Apache Spark, Grafana, Java, Scala, SQL, YAML

Vinoth Govindarajan, Saketh Chintapalli, Yogesh Saswade, Aayush Bareja

16 min read

Has Summary

Advanced

Career stories: Discovering Dublin

The article discusses Vasundhara's journey as an AI engineer at LinkedIn, highlighting her transition from Seoul to Dublin during the pandemic and her work on machine learning algorithms for Linked...

Artificial Intelligence, Java, Python, Scala

LinkedIn Engineering Team

6 min read

Has Summary

Scaling Media Machine Learning at Netflix

Intermediate

The article discusses Netflix's efforts to scale its media machine learning infrastructure, focusing on the challenges faced by media ML practitioners and the solutions developed to optimize and st...

Cassandra, Java, Machine Learning, Scala, Spring, Spring Boot

Netflix Technology Blog

12 min read

Includes Code

Has Summary

Shopify

Intermediate

3 (More) Tips for Optimizing Apache Flink Applications

This article presents three additional tips for optimizing Apache Flink applications, focusing on enhancing performance through proper parallelism, avoiding sink bottlenecks, and utilizing HybridSo...

Apache, Apache Kafka, Scala

Kevin Lam

8 min read

Includes Code

Has Summary

Ready-to-go sample data pipelines with Dataflow

Advanced

The article discusses the implementation of sample data pipelines using Dataflow at Netflix, focusing on bootstrapping, standardization, and automation of batch data pipelines.

Apache, Apache Spark, Machine Learning, PySpark, Scala, SQL, YAML

Netflix Technology Blog

17 min read

Includes Code

Has Summary

How Airbnb safeguards changes in production

Advanced

The article discusses Airbnb's Safe Deploy system, focusing on its architecture and engineering choices for implementing near real-time experiments.

Apache, Java, MySQL, Scala

Zack Loebel-Begelman

9 min read

Has Summary

Supercharging A/B Testing at Uber

Intermediate

The article discusses Uber's journey in rebuilding its A/B testing platform, Morpheus, to address scalability and reliability challenges.

Java, Scala, Swift

Sergey Gitlin, Krishna Puttaswamy, Luke Duncan, Deepak Bobbarjung, Arun Babu A S P

26 min read

Has Summary

Intermediate

Career stories: Next plays, jungle gyms, and Python

The article shares the career journey of Deepti, a biomedical engineer turned data scientist at LinkedIn, highlighting her transitions between industries and roles.

Apache, Python, Scala

LinkedIn Engineering Team

7 min read

Has Summary

Stripe

Advanced

Fast builds, secure builds. Choose two

The article discusses the challenges and solutions Stripe engineers face in maintaining a continuous integration (CI) system that balances speed and security.

GraphQL, gRPC, Java, JavaScript, Python, Ruby, Scala, Terraform, TypeScript

Sushain Cherivirala

11 min read

Includes Code

Has Summary

Meta

Intermediate

SQL Notebooks: Combining the power of Jupyter and SQL editors for data analytics

The article discusses SQL Notebooks, a tool developed at Meta that combines the functionalities of SQL IDEs and Jupyter Notebooks to enhance data analytics.

MySQL, Oracle, Pandas, Plotly, Scala, SQL

Guilherme Kunigami

8 min read

Includes Code

Has Summary

Shopify

Advanced

7 Tips For Optimizing Apache Flink Applications

This article provides seven actionable tips for optimizing Apache Flink applications, focusing on performance and resiliency.

Apache, Google Cloud, Google Cloud Storage, Java, Kubernetes, Scala

Yaroslav Tkachenko

16 min read

Includes Code

Has Summary

Project RADAR: Intelligent Early Fraud Detection System with Humans in the Loop

Advanced

The article discusses Project RADAR, an intelligent fraud detection system developed by Uber that integrates machine learning and human expertise to identify and mitigate fraudulent activities in r...

Apache, Apache Spark, PySpark, Scala, SQL

Sergey Zelvenskiy, Garvit Harisinghani, Tiffany Yu, Edwin Ng, Robin Wei

14 min read

Has Summary

Intermediate

DARWIN: Data Science and Artificial Intelligence Workbench at LinkedIn

The article discusses DARWIN, LinkedIn's unified Data Science and Artificial Intelligence Workbench, designed to streamline the workflows of data scientists and AI engineers by centralizing various...

Apache, Artificial Intelligence, Docker, Git, Kubernetes, MySQL, Python, React, Scala, SQL, TensorFlow, XGBoost

Varun S.

20 min read

Has Summary

Spotify

Intermediate

Changing the Wheels on a Moving Bus — Spotify’s Event Delivery Migration

This article discusses Spotify's migration of its Event Delivery Infrastructure (EDI) to Google Cloud Platform (GCP), detailing the challenges faced, solutions implemented, and the resulting improv...

Apache, Google Cloud, Kubernetes, Scala

Flavio Santos (Data Infrastructure Engineer) and Robert Stephenson (Senior Product Manager)

14 min read

Has Summary

Data pipeline asset management with Dataflow

Advanced

The article discusses the management of data pipeline assets at Netflix using a tool called Dataflow.

Jenkins, Scala, SQL, YAML

Netflix Technology Blog

15 min read

Includes Code

Has Summary

How Airbnb Built “Wall” to prevent data bugs

Intermediate

The article discusses how Airbnb developed the Wall framework to enhance data quality and prevent data bugs across its data engineering workflows.

Apache, PySpark, Scala, SQL, YAML

Subrata (Subu) Biswas

10 min read

Includes Code

Has Summary

Containerizing Apache Hadoop Infrastructure at Uber

Intermediate

This article discusses Uber's journey in containerizing their Apache Hadoop infrastructure, detailing the challenges faced and the solutions implemented over two years.

Apache, Apache Kafka, Docker, Git, Golang, Puppet, Scala, Vault

Matt Mathew, Qifan Shi, Shuyi Zhang, Jackie Murchison

20 min read

Has Summary

Intermediate

From daily dashboards to enterprise grade data pipelines

This article discusses the evolution of LinkedIn's Daily Executive Dashboard (DED) from a simple dashboard to a robust enterprise-grade data pipeline.

Avro, Azure, Java, Oracle, Python, Scala, SQL

Jennifer Zheng

16 min read

Has Summary

Intermediate

Improving data processing efficiency using partial deserialization of Thrift

The article discusses how Pinterest improved data processing efficiency by implementing partial deserialization of Thrift encoded data.

Apache, FlatBuffers, Java, Scala, SQL, Thrift

Pinterest Engineering

7 min read

Has Summary

Himeji: A Scalable Centralized System for Authorization at Airbnb

Intermediate

The article discusses Himeji, a scalable centralized system for authorization developed at Airbnb, which addresses challenges faced during the transition from a monolithic Ruby on Rails architectur...

Apache, Apache Kafka, Apache Spark, AWS, Java, Oracle, Rails, Ruby, Scala, YAML

Alan Yao

10 min read

Includes Code

Has Summary