How LinkedIn Uses Apache

218 engineering articles about Apache from LinkedIn's engineering team

Articles

Intermediate

Scaling maintenance: Rethinking HDFS block placement for exabyte-scale clusters

The article discusses the challenges and solutions related to HDFS block placement in the context of maintaining exabyte-scale clusters at LinkedIn.

Apache

Ponmani Palanisamy

12 min read

Has Summary

Advanced

The evolution of the Venice ingestion pipeline

The article discusses the evolution of the Venice ingestion pipeline at LinkedIn, highlighting its architectural advancements and optimizations that enable the platform to handle over 230 million r...

Apache, Avro

Gaojie Liu

14 min read

Has Summary

Intermediate

Modernizing the LDAP and Kerberos infrastructure that secures Hadoop at LinkedIn

The article discusses the modernization of the LDAP and Kerberos infrastructure that secures Hadoop at LinkedIn, detailing the transition from a legacy setup to a highly available, automated system...

Apache, Azure, HAProxy

Aswin M Prabhu

15 min read

Has Summary

Intermediate

Powering Apache Pinot ingestion with Hoptimator

The article discusses how LinkedIn utilizes Hoptimator to enhance the ingestion process for Apache Pinot, a real-time distributed OLAP datastore.

Apache, Apache Kafka, SQL

Ryanne Dolan

9 min read

Has Summary

Advanced

Revenue Attribution Report: how we used homomorphic encryption to enhance privacy and cut network congestion by 99%

The article discusses how LinkedIn enhanced its Revenue Attribution Report (RAR) using additive symmetric homomorphic encryption (ASHE) to improve privacy and significantly reduce network congestio...

Apache

Saikrishna Badrinarayanan

9 min read

Includes Code

Has Summary

Intermediate

Group’s anatomy: Analyzing your LinkedIn Groups with real-time insights

The article discusses the development of an analytics dashboard for LinkedIn Group admins, providing real-time insights into group growth and engagement metrics.

Apache, Azure, Render

Nishtha Sharma

12 min read

Has Summary

Intermediate

LakeChime: A Data Trigger Service for Modern Data Lakes

LakeChime is a data trigger service designed to enhance the efficiency of data processing in modern data lakes.

Apache, Apache Spark, Avro, JSON, MySQL

Walaa Eldin Moustafa

16 min read

Includes Code

Has Summary

Intermediate

Scalable Automated Config-Driven Data Validation with ValiData

The article discusses ValiData, a scalable automated config-driven data validation tool used at LinkedIn to ensure the accuracy and consistency of large datasets.

Apache, Azure, JSON, SQL

Bharadwaj Jayaraman

15 min read

Has Summary

Intermediate

Open Sourcing OpenHouse: A Control Plane for Managing Tables in a Data Lakehouse

The article discusses the open sourcing of OpenHouse, a control plane designed for managing tables in a data lakehouse.

Apache, Apache Spark, Docker, JPA, Kubernetes, Large Language Models, MySQL, Slim, Spring, SQL

Sumedh Sakdeo

9 min read

Includes Code

Has Summary

Advanced

Improving Recruiting Efficiency with a Hybrid Bulk Data Processing Framework

The article discusses a hybrid bulk data processing framework developed to improve recruiting efficiency during data ownership transfers, particularly in the context of company mergers and recruite...

Apache, Apache Kafka, Kubernetes

Aditya Hegde

12 min read

Includes Code

Has Summary

Advanced

Privacy Preserving Single Post Analytics

The article discusses the implementation of privacy-preserving analytics for individual posts on LinkedIn, focusing on how to provide useful insights to post authors while safeguarding viewer ident...

Apache, Less, MVP, SQL

Ryan Rogers

24 min read

Has Summary

Advanced

Revolutionizing Real-Time Streaming Processing: 4 Trillion Events Daily at LinkedIn

The article discusses LinkedIn's innovative use of Apache Beam for real-time streaming processing, handling over 4 trillion events daily across more than 3,000 pipelines.

Apache, Apache Kafka, Apache Spark, gRPC, Java, Python

Bingfeng Xia

16 min read

Has Summary

Intermediate

Taking Charge of Tables: Introducing OpenHouse for Big Data Management

The article introduces OpenHouse, a control plane developed at LinkedIn for managing tables in open source data lakehouse deployments.

Apache, Apache Spark, Envoy, Kubernetes, Machine Learning, MySQL, SQL, Terraform

Sumedh Sakdeo

11 min read

Includes Code

Has Summary

Intermediate

Declarative Data Pipelines with Hoptimator

The article discusses the development of Hoptimator, a declarative data pipeline orchestrator designed to streamline the creation of end-to-end data pipelines at LinkedIn.

Apache, Kubernetes, MySQL, SQL, YAML

Ryanne Dolan

10 min read

Includes Code

Has Summary

Advanced

Open Sourcing AvroTensorDataset: A Performant TensorFlow Dataset For Processing Avro Data

The article discusses the open-sourcing of AvroTensorDataset, a TensorFlow dataset designed for efficiently processing Avro data.

Apache, Avro, Deep Learning, Python, TensorFlow

Jonathan Hung

16 min read

Includes Code

Has Summary

Intermediate

From the Economic Graph to Economic Insights: Building the Infrastructure for Delivering Labor Market Insights from LinkedIn Data

The article discusses LinkedIn's Economic Graph Research and Insights (EGRI) team's efforts to build a robust data infrastructure for delivering labor market insights using LinkedIn data.

Apache, Apache Spark

LinkedIn Engineering Team

10 min read

Has Summary

Intermediate

Scaling Salt for Remote Execution to support LinkedIn Infra growth

The article discusses how LinkedIn scaled its Salt infrastructure to support its growing needs for remote execution jobs, achieving a tenfold increase in job capacity and improved reliability.

Apache, Apache Kafka, Azure, Iris, MySQL, Python, ZeroMQ

LinkedIn Engineering Team

11 min read

Has Summary

Advanced

Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam

This article discusses LinkedIn's implementation of unified streaming and batch pipelines using Apache Beam, which reduced processing time by 94%.

Apache, Apache Spark

LinkedIn Engineering Team

11 min read

Includes Code

Has Summary

Advanced

Reducing Apache Spark Application Dependencies Upload by 99%

The article discusses how LinkedIn reduced the upload of Apache Spark application dependencies by 99% through the implementation of a user-level caching mechanism.

Apache, Apache Spark, Caching

LinkedIn Engineering Team

10 min read

Has Summary

Intermediate

Hosted Search: LinkedIn Search as a managed service

The article discusses LinkedIn's Hosted Search, a fully managed cloud-based search solution designed to simplify the integration of search functionalities for application teams.

Apache, Apache Kafka

LinkedIn Engineering Team

12 min read

Has Summary

Beginner

TopicGC: How LinkedIn cleans up unused metadata for its Kafka clusters

The article discusses TopicGC, a service developed by LinkedIn to clean up unused metadata in Kafka clusters.

Apache, Apache Kafka

LinkedIn Engineering Team

10 min read

Has Summary

Intermediate

Career stories: Four engineering careers. One LinkedIn.

The article highlights the career journey of Shalini, an engineering senior director at LinkedIn, showcasing her experiences in various engineering roles and the supportive culture at LinkedIn that...

Apache

LinkedIn Engineering Team

7 min read

Has Summary

Intermediate

Super Tables: The road to building reliable and discoverable data products

The article discusses the concept of Super Tables at LinkedIn, which are designed to address the challenges of data discoverability, reliability, and change management in a rapidly growing data eco...

Apache, Avro

LinkedIn Engineering Team

15 min read

Has Summary

Advanced

Open Sourcing Venice – LinkedIn’s Derived Data Platform

The article discusses the open sourcing of Venice, LinkedIn's derived data platform, which supports over 1800 datasets and 300 applications.

Apache, MySQL

Félix GV

24 min read

Has Summary

Intermediate

Real-time analytics on network flow data with Apache Pinot

The article discusses how LinkedIn utilizes Apache Pinot for real-time analytics on network flow data, emphasizing the importance of observability in their infrastructure.

Apache, Apache Kafka, SQL

LinkedIn Engineering Team

10 min read

Has Summary

Intermediate

Feathr joins LF AI & Data Foundation

Feathr, a feature store developed by LinkedIn, has joined the LF AI & Data Foundation, which supports open-source innovation in AI and data.

Apache, Azure

Hangfei Lin

4 min read

Has Summary

Intermediate

Career stories: Next plays, jungle gyms, and Python

The article shares the career journey of Deepti, a biomedical engineer turned data scientist at LinkedIn, highlighting her transitions between industries and roles.

Apache, Python, Scala

LinkedIn Engineering Team

7 min read

Has Summary

Intermediate

LinkedIn’s journey to Java 11

The article details LinkedIn's migration journey from Java 8 to Java 11, emphasizing the performance improvements and challenges faced during the transition.

Apache, Java, Oracle

Jesse Jie

12 min read

Has Summary

Intermediate

Towards data quality management at LinkedIn

The article discusses the importance of data quality management at LinkedIn, focusing on the challenges posed by the scale of their data operations.

Apache

LinkedIn Engineering Team

11 min read

Has Summary

Advanced

Supporting large fanout use cases at scale in Venice

The article discusses the evolution of the Venice platform to support large fanout use cases at scale, particularly focusing on optimizing performance and scalability for handling high-throughput r...

Apache, Avro, HTTP/2, Java

Gaojie Liu

18 min read

Has Summary

Beginner

Shifting left on governance: DataHub and schema annotations

The article discusses the importance of data governance in large organizations like LinkedIn, emphasizing the need for effective schema annotations and automation in managing vast datasets.

Apache, Avro, MySQL, SQL, Thrift

Joshua Shinavier

8 min read

Has Summary

Intermediate

Load-balanced Brooklin Mirror Maker: Replicating large-scale Kafka clusters at LinkedIn

The article discusses the implementation of a load-balanced Brooklin Mirror Maker at LinkedIn, which efficiently replicates large-scale Kafka clusters.

Apache, Apache Kafka, Avro, SQL

Vaibhav Maheshwari

14 min read

Has Summary

Advanced

Opal: Building a mutable dataset in data lake

The article discusses Opal, a system developed at LinkedIn to manage mutable datasets within a data lake.

Apache, Avro, MySQL, Oracle, SQL

Bhupendra Jain

16 min read

Has Summary

Beginner

Near real-time features for near real-time personalization

The article discusses the implementation of near real-time personalization features at LinkedIn, focusing on how member actions can be leveraged to enhance recommendation systems without significan...

Apache, Apache Kafka, Apache Spark, SQL, V

Rupesh Gupta

17 min read

Has Summary

Advanced

Accelerating the LinkedIn Experience with Azure Front Door

The article discusses LinkedIn's migration to Azure Front Door (AFD) and the significant performance improvements achieved through this transition.

Apache, Azure, CDN, HAProxy, HTTP/3, SQL, Vault

Samir Jafferali

14 min read

Has Summary

Intermediate

DARWIN: Data Science and Artificial Intelligence Workbench at LinkedIn

The article discusses DARWIN, LinkedIn's unified Data Science and Artificial Intelligence Workbench, designed to streamline the workflows of data scientists and AI engineers by centralizing various...

Apache, Artificial Intelligence, Docker, Git, Kubernetes, MySQL, Python, React, Scala, SQL, TensorFlow, XGBoost

Varun S.

20 min read

Has Summary

Advanced

2021 in review: New frontiers in innovation and scale for our Engineering teams

The article reflects on the significant technological advancements and innovations achieved by LinkedIn's Engineering teams in 2021, highlighting milestones in data storage and infrastructure scali...

Apache, Generative AI, Python

Raghu Hiremagalur

8 min read

Has Summary

Beginner

Career stories: From Argentina to Ireland to Spain

The article chronicles Juan's career journey from Argentina to Ireland and Spain, highlighting his role in building LinkedIn's EMEA engineering team and his focus on AI and data engineering.

Apache

LinkedIn Engineering Team

6 min read

Has Summary

Advanced

Career stories: A cross-country, family move

The article discusses Charles's journey of relocating his family to Silicon Valley for a role at LinkedIn, highlighting the support he received during the transition and his experiences in the tech...

Apache, Apache Kafka

LinkedIn Engineering Team

7 min read

Has Summary

Intermediate

Project Magnet, providing push-based shuffle, now available in Apache Spark 3.2

Project Magnet introduces push-based shuffle in Apache Spark 3.2, enhancing shuffle scalability and reliability.

Apache, Apache Spark, Azure, Kubernetes

Venkata krishnan Sowrirajan

7 min read

Has Summary

Intermediate

Our approach to building transparent and explainable AI systems

The article discusses LinkedIn's approach to building transparent and explainable AI systems, emphasizing the importance of trust, fairness, and user understanding in AI applications.

Apache, Apache Spark, LIME, SHAP

Kinjal Basu

8 min read

Has Summary

Advanced

Distributed tier merge: How LinkedIn tackles stragglers in search index build

The article discusses LinkedIn's approach to mitigating stragglers during the search index build process through a technique called Distributed Tier Merge (DTM).

Apache, Apache Spark

Andy Li

11 min read

Has Summary

Advanced

Scaling LinkedIn's Hadoop YARN cluster beyond 10,000 nodes

This article discusses the challenges and solutions LinkedIn faced while scaling its Hadoop YARN cluster beyond 10,000 nodes.

Apache, Azure, Kubernetes, MySQL, REST API

Keqiu H.

21 min read

Has Summary

Beginner

TonY joins LF AI & Data Foundation

The article discusses TonY's integration into the LF AI & Data Foundation, highlighting its role in facilitating distributed deep learning on Hadoop.

Apache, Google Cloud, Kubernetes, PyTorch, TensorFlow

Keqiu H.

4 min read

Has Summary

Advanced

Text analytics on LinkedIn Talent Insights using Apache Pinot

The article discusses the implementation of keyword search functionality in LinkedIn Talent Insights (LTI) using Apache Pinot.

Apache, Apache Spark, CSS, Git, HTML, Java, JavaScript, MySQL, Objective-C, PHP, Rails, Ruby, SQL

Siddharth Teotia

17 min read

Has Summary

Intermediate

The exabyte club: LinkedIn’s journey of scaling the Hadoop Distributed File System

The article discusses LinkedIn's significant advancements in scaling the Hadoop Distributed File System (HDFS), achieving the milestone of storing 1 exabyte of data and optimizing performance throu...

Apache, Azure, Azure Blob Storage, HTTPS, Java, Jenkins, V

Konstantin V. Shvachko

28 min read

Has Summary

Intermediate

How to build an effective professional network on LinkedIn: Some data-driven insights

The article provides data-driven insights on building an effective professional network on LinkedIn, emphasizing the importance of diverse connections for career mobility.

Apache

YinYin Yu, PhD

8 min read

Has Summary

Advanced

Solving for the cardinality of set intersection at scale with Pinot and Theta Sketches

This article discusses the challenges and solutions for estimating the cardinality of set intersections at scale using Apache Pinot and Theta Sketches.

Apache, Java, SQL

Vincent Wang

13 min read

Has Summary

Intermediate

Solving the data integration variety problem at scale, with Gobblin

The article discusses the challenges of data integration at scale within LinkedIn's big data ecosystem and presents Gobblin's Data Integration Library (DIL) as a solution to streamline and standard...

Apache, Artificial Intelligence, Java, JSON, YAML

Chris L.

11 min read

Has Summary

Intermediate

Budget-split testing: A trustworthy and powerful approach to marketplace A/B testing

The article discusses budget-split testing as an innovative method to improve A/B testing in marketplace environments, specifically addressing issues of cannibalization bias and insufficient statis...

Apache

LinkedIn Engineering Team

6 min read

Has Summary
