How LinkedIn Uses Apache Spark

26 engineering articles about Apache Spark from LinkedIn's engineering team

Articles

Intermediate

LakeChime: A Data Trigger Service for Modern Data Lakes

LakeChime is a data trigger service designed to enhance the efficiency of data processing in modern data lakes.

Apache, Apache Spark, Avro, JSON, MySQL

Walaa Eldin Moustafa

16 min read

Includes Code

Has Summary

Intermediate

Open Sourcing OpenHouse: A Control Plane for Managing Tables in a Data Lakehouse

The article discusses the open sourcing of OpenHouse, a control plane designed for managing tables in a data lakehouse.

Apache, Apache Spark, Docker, JPA, Kubernetes, Large Language Models, MySQL, Slim, Spring, SQL

Sumedh Sakdeo

9 min read

Includes Code

Has Summary

Advanced

Revolutionizing Real-Time Streaming Processing: 4 Trillion Events Daily at LinkedIn

The article discusses LinkedIn's innovative use of Apache Beam for real-time streaming processing, handling over 4 trillion events daily across more than 3,000 pipelines.

Apache, Apache Kafka, Apache Spark, gRPC, Java, Python

Bingfeng Xia

16 min read

Has Summary

Intermediate

Taking Charge of Tables: Introducing OpenHouse for Big Data Management

The article introduces OpenHouse, a control plane developed at LinkedIn for managing tables in open source data lakehouse deployments.

Apache, Apache Spark, Envoy, Kubernetes, Machine Learning, MySQL, SQL, Terraform

Sumedh Sakdeo

11 min read

Includes Code

Has Summary

Intermediate

From the Economic Graph to Economic Insights: Building the Infrastructure for Delivering Labor Market Insights from LinkedIn Data

The article discusses LinkedIn's Economic Graph Research and Insights (EGRI) team's efforts to build a robust data infrastructure for delivering labor market insights using LinkedIn data.

Apache, Apache Spark

LinkedIn Engineering Team

10 min read

Has Summary

Advanced

Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam

This article discusses LinkedIn's implementation of unified streaming and batch pipelines using Apache Beam, which reduced processing time by 94%.

Apache, Apache Spark

LinkedIn Engineering Team

11 min read

Includes Code

Has Summary

Advanced

Reducing Apache Spark Application Dependencies Upload by 99%

The article discusses how LinkedIn reduced the upload of Apache Spark application dependencies by 99% through the implementation of a user-level caching mechanism.

Apache, Apache Spark, Caching

LinkedIn Engineering Team

10 min read

Has Summary

Beginner

Near real-time features for near real-time personalization

The article discusses the implementation of near real-time personalization features at LinkedIn, focusing on how member actions can be leveraged to enhance recommendation systems without significan...

Apache, Apache Kafka, Apache Spark, SQL, V

Rupesh Gupta

17 min read

Has Summary

Intermediate

Project Magnet, providing push-based shuffle, now available in Apache Spark 3.2

Project Magnet introduces push-based shuffle in Apache Spark 3.2, enhancing shuffle scalability and reliability.

Apache, Apache Spark, Azure, Kubernetes

Venkata krishnan Sowrirajan

7 min read

Has Summary

Intermediate

Our approach to building transparent and explainable AI systems

The article discusses LinkedIn's approach to building transparent and explainable AI systems, emphasizing the importance of trust, fairness, and user understanding in AI applications.

Apache, Apache Spark, LIME, SHAP

Kinjal Basu

8 min read

Has Summary

Advanced

Distributed tier merge: How LinkedIn tackles stragglers in search index build

The article discusses LinkedIn's approach to mitigating stragglers during the search index build process through a technique called Distributed Tier Merge (DTM).

Apache, Apache Spark

Andy Li

11 min read

Has Summary

Advanced

Text analytics on LinkedIn Talent Insights using Apache Pinot

The article discusses the implementation of keyword search functionality in LinkedIn Talent Insights (LTI) using Apache Pinot.

Apache, Apache Spark, CSS, Git, HTML, Java, JavaScript, MySQL, Objective-C, PHP, Rails, Ruby, SQL

Siddharth Teotia

17 min read

Has Summary

Intermediate

Magnet: A scalable and performant shuffle architecture for Apache Spark

The article introduces Magnet, a scalable and performant shuffle architecture designed for Apache Spark, addressing the challenges faced in shuffle operations at LinkedIn.

Apache, Apache Spark, Azure, SQL

Min Shen

16 min read

Has Summary

Advanced

Addressing bias in large-scale AI applications: The LinkedIn Fairness Toolkit

The article discusses the LinkedIn Fairness Toolkit (LiFT), an open-source library designed to address bias in AI applications at scale.

Apache, Apache Spark, Avro, Machine Learning, Scala

Sriram Vasudevan

11 min read

Has Summary

Intermediate

Bringing Project Every Member to life: Open sourcing our Spark inequality A/B testing library

The article discusses the open-sourcing of LinkedIn's Spark inequality A/B testing library, named spark-inequality-impact, aimed at measuring and reducing inequality in product design.

Apache, Apache Spark, Machine Learning, Python

Guillaume Saint-Jacques

9 min read

Has Summary

Beginner

Spark-TFRecord: Toward full support of TFRecord in Spark

The article discusses Spark-TFRecord, a new data source for Apache Spark that aims to provide full support for the TFRecord data format used in TensorFlow.

Apache, Apache Spark, Avro, JSON, Machine Learning, SQL, TensorFlow

Jun Shi

5 min read

Has Summary

Intermediate

Advanced schema management for Spark applications at scale

The article discusses advanced schema management techniques for Apache Spark applications at LinkedIn, focusing on the integration of Avro schemas with the Hive Metastore to enhance type safety and...

Apache, Apache Spark, Avro, Java, Scala, SQL

Walaa Eldin Moustafa

14 min read

Has Summary

Intermediate

Data Sentinel: Automating data validation

The article discusses Data Sentinel, a platform developed at LinkedIn to automate data validation and improve data quality in production environments.

Apache, Apache Spark, SQL

Arun Swami

9 min read

Has Summary

Advanced

The Present and Future of Apache Hadoop: A Community Meetup at LinkedIn

The article discusses a community meetup held at LinkedIn focused on Apache Hadoop, highlighting contributions from various organizations and key presentations on topics like TensorFlow on YARN, Ha...

Apache, Apache Spark, Azure, Java, Oracle, PyTorch, TensorFlow

Erik Krogen

10 min read

Has Summary

Advanced

Bridging Offline and Nearline Computations with Apache Calcite

The article discusses how Apache Calcite can bridge the gap between offline and nearline computations in big data processing.

Apache, Apache Kafka, Apache Spark, Avro, Java, Python, SQL

Khai Tran

12 min read

Has Summary

Advanced

Open Sourcing TonY: Native Support of TensorFlow on Hadoop

The article discusses the open sourcing of TonY, a framework designed to enable native support for TensorFlow on Hadoop.

Apache, Apache Spark, Machine Learning, Python, TensorBoard, TensorFlow

Jonathan Hung

8 min read

Has Summary

Advanced

Dynamometer: Scale Testing HDFS on Minimal Hardware with Maximum Fidelity

The article discusses Dynamometer, a framework developed by LinkedIn to scale test HDFS on minimal hardware while maintaining maximum fidelity.

Apache, Apache Spark, Java

Erik Krogen

18 min read

Has Summary

Advanced

Getting to Know Qingbo Hu

The article introduces Qingbo Hu, a Senior Business Analytics Associate at LinkedIn, highlighting his background, projects, and interests.

Apache, Apache Spark, Chi, Natural Language Processing

Chi-Yi Kuan

5 min read

Has Summary

Intermediate

Spark Summit 2017: Research, Open Source, and Community

The article discusses the Spark Summit 2017, highlighting the contributions of LinkedIn engineers and data scientists to the Apache Spark community.

Apache, Apache Spark

Carl Steinbach

4 min read

Has Summary

Intermediate

Asynchronous Processing and Multithreading in Apache Samza, Part I: Design and Architecture

This article discusses the design and architecture of asynchronous processing and multithreading in Apache Samza, highlighting its unique capabilities compared to other open-source stream processor...

Apache, Apache Spark, Node.js

Xinyu Liu

10 min read

Has Summary

Intermediate

Open Sourcing Photon ML

The article discusses the open sourcing of Photon ML, a machine learning library developed by LinkedIn that integrates with Apache Spark.

Apache, Apache Spark, Machine Learning

LinkedIn Engineering Team

7 min read

Has Summary
