How LinkedIn Uses Avro

43 engineering articles about Avro from LinkedIn's engineering team

Articles

Advanced

The evolution of the Venice ingestion pipeline

The article discusses the evolution of the Venice ingestion pipeline at LinkedIn, highlighting its architectural advancements and optimizations that enable the platform to handle over 230 million r...

Apache, Avro

Gaojie Liu

14 min read

Has Summary

Advanced

Optimizing LinkedIn Sales Navigator’s search pipeline with Spark

The article discusses the optimization of LinkedIn Sales Navigator’s search pipeline using Apache Spark, highlighting the transition from MapReduce to Spark and the resulting performance improvemen...

Avro, Scala, SQL

Chunxu Tang

14 min read

Includes Code

Has Summary

Intermediate

Java heap memory and garbage collection: tuning for high-performance services

This article discusses the challenges and solutions related to Java heap memory and garbage collection, specifically in the context of LinkedIn's FollowFeed service.

Avro, Java

Nisheedh Raveendran

10 min read

Has Summary

Intermediate

LakeChime: A Data Trigger Service for Modern Data Lakes

LakeChime is a data trigger service designed to enhance the efficiency of data processing in modern data lakes.

Apache, Apache Spark, Avro, JSON, MySQL

Walaa Eldin Moustafa

16 min read

Includes Code

Has Summary

Advanced

Open Sourcing AvroTensorDataset: A Performant TensorFlow Dataset For Processing Avro Data

The article discusses the open-sourcing of AvroTensorDataset, a TensorFlow dataset designed for efficiently processing Avro data.

Apache, Avro, Deep Learning, Python, TensorFlow

Jonathan Hung

16 min read

Includes Code

Has Summary

Advanced

Upscaling LinkedIn's Profile Datastore While Reducing Costs

The article discusses LinkedIn's strategy to upscale its profile datastore while reducing operational costs.

Avro, Less, Oracle

LinkedIn Engineering Team

18 min read

Has Summary

Intermediate

Super Tables: The road to building reliable and discoverable data products

The article discusses the concept of Super Tables at LinkedIn, which are designed to address the challenges of data discoverability, reliability, and change management in a rapidly growing data eco...

Apache, Avro

LinkedIn Engineering Team

15 min read

Has Summary

Advanced

Supporting large fanout use cases at scale in Venice

The article discusses the evolution of the Venice platform to support large fanout use cases at scale, particularly focusing on optimizing performance and scalability for handling high-throughput r...

Apache, Avro, HTTP/2, Java

Gaojie Liu

18 min read

Has Summary

Beginner

Shifting left on governance: DataHub and schema annotations

The article discusses the importance of data governance in large organizations like LinkedIn, emphasizing the need for effective schema annotations and automation in managing vast datasets.

Apache, Avro, MySQL, SQL, Thrift

Joshua Shinavier

8 min read

Has Summary

Intermediate

Load-balanced Brooklin Mirror Maker: Replicating large-scale Kafka clusters at LinkedIn

The article discusses the implementation of a load-balanced Brooklin Mirror Maker at LinkedIn, which efficiently replicates large-scale Kafka clusters.

Apache, Apache Kafka, Avro, SQL

Vaibhav Maheshwari

14 min read

Has Summary

Advanced

Opal: Building a mutable dataset in data lake

The article discusses Opal, a system developed at LinkedIn to manage mutable datasets within a data lake.

Apache, Avro, MySQL, Oracle, SQL

Bhupendra Jain

16 min read

Has Summary

Intermediate

Evolving LinkedIn’s analytics tech stack

The article discusses LinkedIn's transition from a proprietary analytics tech stack to an open-source big data technology stack, detailing the challenges faced and the improvements made during the ...

Avro, Azure, MySQL

LinkedIn Engineering Team

10 min read

Has Summary

Intermediate

From daily dashboards to enterprise grade data pipelines

This article discusses the evolution of LinkedIn's Daily Executive Dashboard (DED) from a simple dashboard to a robust enterprise-grade data pipeline.

Avro, Azure, Java, Oracle, Python, Scala, SQL

Jennifer Zheng

16 min read

Has Summary

Intermediate

FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format

The article introduces FastIngest, a new evolution of Apache Gobblin designed to enable low-latency data ingestion from Kafka to HDFS using the ORC file format and Apache Iceberg for metadata manag...

Apache, Avro, Oracle

Zihan Li

15 min read

Has Summary

Advanced

Coral: A SQL translation, analysis, and rewrite engine for modern data lakehouses

The article discusses Coral, an open-sourced SQL translation, analysis, and rewrite engine developed at LinkedIn for modern data lakehouses.

Apache, Avro, Git, SQL, V

Walaa Eldin Moustafa

20 min read

Has Summary

Advanced

Pegasus Data Language: Evolving schema definitions for data modeling

The article discusses the introduction of Pegasus Data Language (PDL), a new schema definition language designed to replace the existing Pegasus Data Schema (PDSC) for data modeling in Rest.li.

Avro, Java, JSON, XML

Yingjie (Nicki) B.

6 min read

Has Summary

Advanced

Dagli: Faster and easier machine learning on the JVM, without the tech debt

Dagli is an open-source machine learning library designed for Java and other JVM languages, aimed at simplifying the creation of model pipelines while minimizing technical debt.

.NET, Avro, Java, Neural Networks, PyTorch, scikit-learn, TensorFlow, Transformers, XGBoost

Jeff Pasternack

14 min read

Has Summary

Advanced

Addressing bias in large-scale AI applications: The LinkedIn Fairness Toolkit

The article discusses the LinkedIn Fairness Toolkit (LiFT), an open-source library designed to address bias in AI applications at scale.

Apache, Apache Spark, Avro, Machine Learning, Scala

Sriram Vasudevan

11 min read

Has Summary

Beginner

Spark-TFRecord: Toward full support of TFRecord in Spark

The article discusses Spark-TFRecord, a new data source for Apache Spark that aims to provide full support for the TFRecord data format used in TensorFlow.

Apache, Apache Spark, Avro, JSON, Machine Learning, SQL, TensorFlow

Jun Shi

5 min read

Has Summary

Advanced

Introducing Apache Pinot 0.3.0

Apache Pinot 0.3.0 is an open-source, distributed OLAP data store developed at LinkedIn, designed for near-real-time analytics.

Apache, Avro, Azure, Docker, Google Cloud, Google Cloud Storage, Helm, Kubernetes, SQL, Thrift

Mayank S.

9 min read

Has Summary

Intermediate

Rapid experimentation through standardization: Typed AI features for LinkedIn’s feed

The article discusses the implementation of typed AI features in LinkedIn's feed, emphasizing the importance of standardization for rapid experimentation and continuous improvement.

Avro, Machine Learning

Ian Ackerman

10 min read

Has Summary

Intermediate

Advanced schema management for Spark applications at scale

The article discusses advanced schema management techniques for Apache Spark applications at LinkedIn, focusing on the integration of Avro schemas with the Hive Metastore to enhance type safety and...

Apache, Apache Spark, Avro, Java, Scala, SQL

Walaa Eldin Moustafa

14 min read

Has Summary

Intermediate

An inside look at LinkedIn’s data pipeline monitoring system

This article provides an in-depth look at LinkedIn's data pipeline monitoring system, focusing on the challenges faced with traditional monitoring methods and how they have evolved to improve visib...

Apache, Avro, Flask, MySQL, Oracle, SQL, YAML

Krishnan Raman

16 min read

Has Summary

Intermediate

How LinkedIn customizes Apache Kafka for 7 trillion messages per day

The article discusses how LinkedIn customizes Apache Kafka to handle an impressive scale of 7 trillion messages per day.

Apache, Apache Kafka, Avro, Java

Jon Lee

10 min read

Has Summary

Advanced

DataHub: A generalized metadata search & discovery tool

DataHub is a generalized metadata search and discovery tool developed by LinkedIn to enhance the productivity of data teams.

Apache, Artificial Intelligence, Avro, ESLint, GraphQL, Prettier, TypeScript

Mars Lan

17 min read

Has Summary

Intermediate

Avro2TF: An open source feature transformation engine for TensorFlow

Avro2TF is an open-source feature transformation engine designed to facilitate the conversion of data into a format compatible with TensorFlow.

Avro, Machine Learning, TensorFlow

Xuhong Zhang

5 min read

Has Summary

Advanced

Bridging Offline and Nearline Computations with Apache Calcite

The article discusses how Apache Calcite can bridge the gap between offline and nearline computations in big data processing.

Apache, Apache Kafka, Apache Spark, Avro, Java, Python, SQL

Khai Tran

12 min read

Has Summary

Advanced

Streaming Data Pipelines with Brooklin

The article discusses Brooklin, a data ingestion service developed by LinkedIn to facilitate streaming data from various sources to multiple destinations.

Apache, Avro, AWS, Azure, JSON, Kubernetes, MySQL, Oracle, Thrift

Samarth Shetty

11 min read

Has Summary

Intermediate

JARVIS: Helping LinkedIn Navigate its Source Code

The article discusses JARVIS, a search system developed by LinkedIn to enhance the navigation of its source code.

Avro, Java, JavaScript, Python, Ruby, Scala, Spring

Rajeev Kumar

16 min read

Has Summary

Intermediate

Migrating to Espresso

The article discusses the migration of LinkedIn's internal service, Babylonia, from Oracle to Espresso, a distributed NoSQL database.

Avro, Oracle, SQL

David Max

11 min read

Has Summary

Beginner

Engineering Infrastructure at Scale: Test Tracking

The article discusses the engineering infrastructure at LinkedIn that supports test tracking across various platforms, including iOS, Android, and web.

Avro, Java, JavaScript, JSON, Objective-C, Swift

Ning Zhang

9 min read

Has Summary

Advanced

Open Sourcing Kafka Monitor

The article discusses the open sourcing of Kafka Monitor, a framework designed to monitor and test Kafka deployments.

Apache, Apache Kafka, Avro, Java

Dong Lin

10 min read

Has Summary

Advanced

Kafka Ecosystem at LinkedIn

The article discusses the Kafka ecosystem at LinkedIn, detailing its critical role as a messaging system and the various solutions developed to enhance its functionality.

Apache, Apache Kafka, Avro, Java, MySQL

Joel Koshy

8 min read

Has Summary

Advanced

FollowFeed: LinkedIn's Feed Made Faster and Smarter

The article discusses FollowFeed, LinkedIn's new feed infrastructure designed to enhance performance and relevance for its users.

Avro, Caching, Java, SQL

Ankit Gupta

25 min read

Has Summary

Intermediate

Running Kafka At Scale

The article 'Running Kafka At Scale' discusses how LinkedIn utilizes Apache Kafka as a crucial messaging system for handling vast amounts of data.

Apache, Apache Kafka, Avro

Todd Palino

10 min read

Includes Code

Has Summary

Advanced

Introducing Espresso - LinkedIn's hot new distributed document store

Espresso is LinkedIn's distributed, fault-tolerant NoSQL database that supports various applications, including Member Profile and InMail.

Apache, Avro, JSON, MySQL, Oracle, Request-Response

LinkedIn Engineering Team

17 min read

Has Summary

Advanced

Apache Helix: A framework for Distributed System Development

Apache Helix is a framework designed for developing distributed systems, addressing challenges such as scalability, fault tolerance, and partition management.

Apache, Avro, Elasticsearch, Oracle

Kishore Gopalakrishna

10 min read

Has Summary

Intermediate

Real time insights into LinkedIn's performance using Apache Samza

The article discusses how LinkedIn utilizes Apache Samza to gain real-time insights into its performance by processing data from numerous services and machines.

Apache, Apache Kafka, Assembly, Avro

LinkedIn Engineering Team

11 min read

Includes Code

Has Summary

Advanced

Announcing the Voldemort 1.6.0 Open Source Release

The article announces the release of Voldemort 1.6.0, a distributed key-value storage system developed at LinkedIn.

Avro, Java, Oracle, Shell

LinkedIn Engineering Team

10 min read

Includes Code

Has Summary

Advanced

The Log: What every software engineer should know about real-time data's unifying abstraction

The article discusses the significance of the log as a fundamental abstraction in real-time data systems, emphasizing its role in distributed systems, data integration, and stream processing.

Avro, AWS, Clojure, DynamoDB, Event Sourcing, Java, MySQL, Oracle, PostgreSQL, Protocol Buffers, Redis, Scala, SQL, Thrift, XML

Jay Kreps

63 min read

Has Summary

Intermediate

DataFu's Hourglass: Incremental Data Processing in Hadoop

The article discusses DataFu's Hourglass framework, which simplifies incremental data processing in Hadoop.

Apache, Avro

Matthew Hayes

15 min read

Includes Code

Has Summary

Advanced

Announcing the Voldemort 1.3 Open Source Release

The article announces the release of Voldemort 1.3.0, detailing significant performance improvements, new features, and enhanced operability.

Apache, Avro, Java, Oracle

Vinoth Chandar

9 min read

Includes Code

Has Summary

Beginner

Autometrics: Self-service metrics collection

The article discusses Autometrics, a self-service metrics collection system developed at LinkedIn to streamline the process of metrics collection and visualization.

Apache, Avro, Event Bus, Python, V

Grier Johnson

6 min read

Has Summary
