How LinkedIn Uses Apache Spark
26 engineering articles about Apache Spark from LinkedIn's engineering team
Articles
LakeChime is a data trigger service designed to enhance the efficiency of data processing in modern data lakes.
Walaa Eldin Moustafa
16 min read
Includes Code
Has Summary
--
The article discusses the open sourcing of OpenHouse, a control plane designed for managing tables in a data lakehouse.
Sumedh Sakdeo
9 min read
Includes Code
Has Summary
--
The article discusses LinkedIn's innovative use of Apache Beam for real-time streaming processing, handling over 4 trillion events daily across more than 3,000 pipelines.
Bingfeng Xia
16 min read
Has Summary
--
The article introduces OpenHouse, a control plane developed at LinkedIn for managing tables in open source data lakehouse deployments.
Sumedh Sakdeo
11 min read
Includes Code
Has Summary
--
The article discusses LinkedIn's Economic Graph Research and Insights (EGRI) team's efforts to build a robust data infrastructure for delivering labor market insights using LinkedIn data.
LinkedIn Engineering Team
10 min read
Has Summary
--
This article discusses LinkedIn's implementation of unified streaming and batch pipelines using Apache Beam, which reduced processing time by 94%.
LinkedIn Engineering Team
11 min read
Includes Code
Has Summary
--
The article discusses how LinkedIn reduced uploads of Apache Spark application dependencies by 99% by implementing a user-level caching mechanism.
LinkedIn Engineering Team
10 min read
Has Summary
--
The article discusses the implementation of near real-time personalization features at LinkedIn, focusing on how member actions can be leveraged to enhance recommendation systems without significan...
Rupesh Gupta
17 min read
Has Summary
--
Project Magnet introduces push-based shuffle in Apache Spark 3.2, enhancing shuffle scalability and reliability.
Venkata Krishnan Sowrirajan
7 min read
Has Summary
--
The article discusses LinkedIn's approach to building transparent and explainable AI systems, emphasizing the importance of trust, fairness, and user understanding in AI applications.
Kinjal Basu
8 min read
Has Summary
--
The article discusses LinkedIn's approach to mitigating stragglers during the search index build process through a technique called Distributed Tier Merge (DTM).
Andy Li
11 min read
Has Summary
--
The article discusses the implementation of keyword search functionality in LinkedIn Talent Insights (LTI) using Apache Pinot.
Siddharth Teotia
17 min read
Has Summary
--
The article introduces Magnet, a scalable and performant shuffle architecture designed for Apache Spark, addressing the challenges faced in shuffle operations at LinkedIn.
Min Shen
16 min read
Has Summary
--
The article discusses the LinkedIn Fairness Toolkit (LiFT), an open-source library designed to address bias in AI applications at scale.
Sriram Vasudevan
11 min read
Has Summary
--
The article discusses the open-sourcing of LinkedIn's Spark inequality A/B testing library, named spark-inequality-impact, aimed at measuring and reducing inequality in product design.
Guillaume Saint-Jacques
9 min read
Has Summary
--
The article discusses Spark-TFRecord, a new data source for Apache Spark that aims to provide full support for the TFRecord data format used in TensorFlow.
Jun Shi
5 min read
Has Summary
--
The article discusses advanced schema management techniques for Apache Spark applications at LinkedIn, focusing on the integration of Avro schemas with the Hive Metastore to enhance type safety and...
--
The article discusses Data Sentinel, a platform developed at LinkedIn to automate data validation and improve data quality in production environments.
Arun Swami
9 min read
Has Summary
--
The article discusses a community meetup held at LinkedIn focused on Apache Hadoop, highlighting contributions from various organizations and key presentations on topics like TensorFlow on YARN, Ha...
Erik Krogen
10 min read
Has Summary
--
The article discusses how Apache Calcite can bridge the gap between offline and nearline computations in big data processing.
Khai Tran
12 min read
Has Summary
--
The article discusses the open sourcing of TonY, a framework designed to enable native support for TensorFlow on Hadoop.
Jonathan Hung
8 min read
Has Summary
--
The article discusses Dynamometer, a framework developed by LinkedIn to scale-test HDFS on minimal hardware while maintaining maximum fidelity.
Erik Krogen
18 min read
Has Summary
--
The article introduces Qingbo Hu, a Senior Business Analytics Associate at LinkedIn, highlighting his background, projects, and interests.
Chi-Yi Kuan
5 min read
Has Summary
--
The article discusses Spark Summit 2017, highlighting the contributions of LinkedIn engineers and data scientists to the Apache Spark community.
Carl Steinbach
4 min read
Has Summary
--
This article discusses the design and architecture of asynchronous processing and multithreading in Apache Samza, highlighting its unique capabilities compared to other open-source stream processor...
Xinyu Liu
10 min read
Has Summary
--
The article discusses the open sourcing of Photon ML, a machine learning library developed by LinkedIn that integrates with Apache Spark.
LinkedIn Engineering Team
7 min read
Has Summary