How LinkedIn Uses Apache
218 engineering articles about Apache from LinkedIn's engineering team
Articles
The article discusses the challenges and solutions related to HDFS block placement in the context of maintaining exabyte-scale clusters at LinkedIn.
Ponmani Palanisamy
12 min read
Has Summary
--
The article discusses the evolution of the Venice ingestion pipeline at LinkedIn, highlighting its architectural advancements and optimizations that enable the platform to handle over 230 million r...
The article discusses the modernization of the LDAP and Kerberos infrastructure that secures Hadoop at LinkedIn, detailing the transition from a legacy setup to a highly available, automated system...
The article discusses how LinkedIn utilizes Hoptimator to enhance the ingestion process for Apache Pinot, a real-time distributed OLAP datastore.
Ryanne Dolan
9 min read
Has Summary
--
The article discusses how LinkedIn enhanced its Revenue Attribution Report (RAR) using additive symmetric homomorphic encryption (ASHE) to improve privacy and significantly reduce network congestio...
Saikrishna Badrinarayanan
9 min read
Includes Code
Has Summary
--
The article discusses the development of an analytics dashboard for LinkedIn Group admins, providing real-time insights into group growth and engagement metrics.
LakeChime is a data trigger service designed to enhance the efficiency of data processing in modern data lakes.
Walaa Eldin Moustafa
16 min read
Includes Code
Has Summary
--
The article discusses ValiData, a scalable automated config-driven data validation tool used at LinkedIn to ensure the accuracy and consistency of large datasets.
The article discusses the open sourcing of OpenHouse, a control plane designed for managing tables in a data lakehouse.
Sumedh Sakdeo
9 min read
Includes Code
Has Summary
--
The article discusses a hybrid bulk data processing framework developed to improve recruiting efficiency during data ownership transfers, particularly in the context of company mergers and recruite...
Aditya Hegde
12 min read
Includes Code
Has Summary
--
The article discusses the implementation of privacy-preserving analytics for individual posts on LinkedIn, focusing on how to provide useful insights to post authors while safeguarding viewer ident...
The article discusses LinkedIn's innovative use of Apache Beam for real-time streaming processing, handling over 4 trillion events daily across more than 3,000 pipelines.
Bingfeng Xia
16 min read
Has Summary
--
The article introduces OpenHouse, a control plane developed at LinkedIn for managing tables in open source data lakehouse deployments.
Sumedh Sakdeo
11 min read
Includes Code
Has Summary
--
The article discusses the development of Hoptimator, a declarative data pipeline orchestrator designed to streamline the creation of end-to-end data pipelines at LinkedIn.
Ryanne Dolan
10 min read
Includes Code
Has Summary
--
The article discusses the open-sourcing of AvroTensorDataset, a TensorFlow dataset designed for efficiently processing Avro data.
Jonathan Hung
16 min read
Includes Code
Has Summary
--
The article discusses LinkedIn's Economic Graph Research and Insights (EGRI) team's efforts to build a robust data infrastructure for delivering labor market insights using LinkedIn data.
LinkedIn Engineering Team
10 min read
Has Summary
--
The article discusses how LinkedIn scaled its Salt infrastructure to support its growing needs for remote execution jobs, achieving a tenfold increase in job capacity and improved reliability.
This article discusses LinkedIn's implementation of unified streaming and batch pipelines using Apache Beam, which reduced processing time by 94%.
LinkedIn Engineering Team
11 min read
Includes Code
Has Summary
--
The article discusses how LinkedIn reduced the upload of Apache Spark application dependencies by 99% through the implementation of a user-level caching mechanism.
LinkedIn Engineering Team
10 min read
Has Summary
--
The article discusses LinkedIn's Hosted Search, a fully managed cloud-based search solution designed to simplify the integration of search functionalities for application teams.
LinkedIn Engineering Team
12 min read
Has Summary
--
The article discusses TopicGC, a service developed by LinkedIn to clean up unused metadata in Kafka clusters.
LinkedIn Engineering Team
10 min read
Has Summary
--
The article highlights the career journey of Shalini, an engineering senior director at LinkedIn, showcasing her experiences in various engineering roles and the supportive culture at LinkedIn that...
LinkedIn Engineering Team
7 min read
Has Summary
--
The article discusses the concept of Super Tables at LinkedIn, which are designed to address the challenges of data discoverability, reliability, and change management in a rapidly growing data eco...
The article discusses the open sourcing of Venice, LinkedIn's derived data platform, which supports over 1800 datasets and 300 applications.
The article discusses how LinkedIn utilizes Apache Pinot for real-time analytics on network flow data, emphasizing the importance of observability in their infrastructure.
LinkedIn Engineering Team
10 min read
Has Summary
--
Feathr, a feature store developed by LinkedIn, has joined the LF AI & Data Foundation, which supports open-source innovation in AI and data.
The article shares the career journey of Deepti, a biomedical engineer turned data scientist at LinkedIn, highlighting her transitions between industries and roles.
The article details LinkedIn's migration journey from Java 8 to Java 11, emphasizing the performance improvements and challenges faced during the transition.
The article discusses the importance of data quality management at LinkedIn, focusing on the challenges posed by the scale of their data operations.
LinkedIn Engineering Team
11 min read
Has Summary
--
The article discusses the evolution of the Venice platform to support large fanout use cases at scale, particularly focusing on optimizing performance and scalability for handling high-throughput r...
The article discusses the importance of data governance in large organizations like LinkedIn, emphasizing the need for effective schema annotations and automation in managing vast datasets.
The article discusses the implementation of a load-balanced Brooklin Mirror Maker at LinkedIn, which efficiently replicates large-scale Kafka clusters.
Vaibhav Maheshwari
14 min read
Has Summary
--
The article discusses Opal, a system developed at LinkedIn to manage mutable datasets within a data lake.
The article discusses the implementation of near real-time personalization features at LinkedIn, focusing on how member actions can be leveraged to enhance recommendation systems without significan...
Rupesh Gupta
17 min read
Has Summary
--
The article discusses LinkedIn's migration to Azure Front Door (AFD) and the significant performance improvements achieved through this transition.
The article discusses DARWIN, LinkedIn's unified Data Science and Artificial Intelligence Workbench, designed to streamline the workflows of data scientists and AI engineers by centralizing various...
Varun S.
20 min read
Has Summary
--
The article reflects on the significant technological advancements and innovations achieved by LinkedIn's Engineering teams in 2021, highlighting milestones in data storage and infrastructure scali...
Raghu Hiremagalur
8 min read
Has Summary
--
The article chronicles Juan's career journey from Argentina to Ireland and Spain, highlighting his role in building LinkedIn's EMEA engineering team and his focus on AI and data engineering.
LinkedIn Engineering Team
6 min read
Has Summary
--
The article discusses Charles's journey of relocating his family to Silicon Valley for a role at LinkedIn, highlighting the support he received during the transition and his experiences in the tech...
LinkedIn Engineering Team
7 min read
Has Summary
--
Project Magnet introduces push-based shuffle in Apache Spark 3.2, enhancing shuffle scalability and reliability.
Venkata krishnan Sowrirajan
7 min read
Has Summary
--
The article discusses LinkedIn's approach to building transparent and explainable AI systems, emphasizing the importance of trust, fairness, and user understanding in AI applications.
Kinjal Basu
8 min read
Has Summary
--
The article discusses LinkedIn's approach to mitigating stragglers during the search index build process through a technique called Distributed Tier Merge (DTM).
Andy Li
11 min read
Has Summary
--
This article discusses the challenges and solutions LinkedIn faced while scaling its Hadoop YARN cluster beyond 10,000 nodes.
Keqiu H.
21 min read
Has Summary
--
The article discusses TonY's integration into the LF AI & Data Foundation, highlighting its role in facilitating distributed deep learning on Hadoop.
Keqiu H.
4 min read
Has Summary
--
The article discusses the implementation of keyword search functionality in LinkedIn Talent Insights (LTI) using Apache Pinot.
Siddharth Teotia
17 min read
Has Summary
--
The article discusses LinkedIn's significant advancements in scaling the Hadoop Distributed File System (HDFS), achieving the milestone of storing 1 exabyte of data and optimizing performance throu...
The article provides data-driven insights on building an effective professional network on LinkedIn, emphasizing the importance of diverse connections for career mobility.
YinYin Yu, PhD
8 min read
Has Summary
--
This article discusses the challenges and solutions for estimating the cardinality of set intersections at scale using Apache Pinot and Theta Sketches.
The article discusses the challenges of data integration at scale within LinkedIn's big data ecosystem and presents Gobblin's Data Integration Library (DIL) as a solution to streamline and standard...
Chris L.
11 min read
Has Summary
--
The article discusses budget-split testing as an innovative method to improve A/B testing in marketplace environments, specifically addressing issues of cannibalization bias and insufficient statis...
LinkedIn Engineering Team
6 min read
Has Summary
--