How LinkedIn Uses SQL

62 engineering articles about SQL from LinkedIn's engineering team

Articles

Advanced

Optimizing LinkedIn Sales Navigator’s search pipeline with Spark

The article discusses the optimization of LinkedIn Sales Navigator’s search pipeline using Apache Spark, highlighting the transition from MapReduce to Spark and the resulting performance improvemen...

Avro, Scala, SQL

Chunxu Tang

14 min read

Includes Code

Has Summary

Intermediate

Powering Apache Pinot ingestion with Hoptimator

The article discusses how LinkedIn utilizes Hoptimator to enhance the ingestion process for Apache Pinot, a real-time distributed OLAP datastore.

Apache, Apache Kafka, SQL

Ryanne Dolan

9 min read

Has Summary

Intermediate

Scalable Automated Config-Driven Data Validation with ValiData

The article discusses ValiData, a scalable automated config-driven data validation tool used at LinkedIn to ensure the accuracy and consistency of large datasets.

Apache, Azure, JSON, SQL

Bharadwaj Jayaraman

15 min read

Has Summary

Intermediate

Open Sourcing OpenHouse: A Control Plane for Managing Tables in a Data Lakehouse

The article discusses the open sourcing of OpenHouse, a control plane designed for managing tables in a data lakehouse.

Apache, Apache Spark, Docker, JPA, Kubernetes, Large Language Models, MySQL, Slim, Spring, SQL

Sumedh Sakdeo

9 min read

Includes Code

Has Summary

Advanced

Why Representation Matters in Shaping The Future of Engineering

The article emphasizes the importance of representation and diversity in engineering, particularly in the context of developing AI technologies.

Generative AI, Java, Python, SQL

Sabry Tozin

6 min read

Has Summary

Advanced

The Evolution of Enforcing our Professional Community Policies at Scale

The article discusses LinkedIn's journey in evolving its professional community policies enforcement at scale, focusing on the development of its anti-abuse platform and account restriction systems.

Java, Machine Learning, Oracle, Redis, SQL

Amit M.

17 min read

Has Summary

Advanced

Privacy Preserving Single Post Analytics

The article discusses the implementation of privacy-preserving analytics for individual posts on LinkedIn, focusing on how to provide useful insights to post authors while safeguarding viewer ident...

Apache, Less, MVP, SQL

Ryan Rogers

24 min read

Has Summary

Advanced

Costwiz: Saving cost for LinkedIn enterprise on Azure

The article discusses Costwiz, a tool developed by LinkedIn to optimize cloud costs on Azure by monitoring resource utilization and providing actionable insights.

Azure, Azure Cosmos DB, Azure Functions, Azure Resource Manager, Python, SQL, SQL Server

LinkedIn Engineering Team

17 min read

Has Summary

Intermediate

Taking Charge of Tables: Introducing OpenHouse for Big Data Management

The article introduces OpenHouse, a control plane developed at LinkedIn for managing tables in open source data lakehouse deployments.

Apache, Apache Spark, Envoy, Kubernetes, Machine Learning, MySQL, SQL, Terraform

Sumedh Sakdeo

11 min read

Includes Code

Has Summary

Intermediate

Declarative Data Pipelines with Hoptimator

The article discusses the development of Hoptimator, a declarative data pipeline orchestrator designed to streamline the creation of end-to-end data pipelines at LinkedIn.

Apache, Kubernetes, MySQL, SQL, YAML

Ryanne Dolan

10 min read

Includes Code

Has Summary

Intermediate

Career stories: Military commander turned Trust & Safety manager

The article discusses Avery's transition from a military commander to a Trust & Safety manager at LinkedIn, highlighting his journey through military IT and the support he received from his team du...

SQL

LinkedIn Engineering Team

7 min read

Has Summary

Advanced

LinkedIn’s GraphQL journey for integrations and partnerships: How we accelerated development by 90%

The article discusses LinkedIn's adoption of GraphQL to enhance API development for integrations and partnerships, achieving a 90% reduction in development time.

Git, GitLab, GraphQL, REST API, SQL

LinkedIn Engineering Team

10 min read

Has Summary

Intermediate

Real-time analytics on network flow data with Apache Pinot

The article discusses how LinkedIn utilizes Apache Pinot for real-time analytics on network flow data, emphasizing the importance of observability in their infrastructure.

Apache, Apache Kafka, SQL

LinkedIn Engineering Team

10 min read

Has Summary

Beginner

Shifting left on governance: DataHub and schema annotations

The article discusses the importance of data governance in large organizations like LinkedIn, emphasizing the need for effective schema annotations and automation in managing vast datasets.

Apache, Avro, MySQL, SQL, Thrift

Joshua Shinavier

8 min read

Has Summary

Intermediate

Load-balanced Brooklin Mirror Maker: Replicating large-scale Kafka clusters at LinkedIn

The article discusses the implementation of a load-balanced Brooklin Mirror Maker at LinkedIn, which efficiently replicates large-scale Kafka clusters.

Apache, Apache Kafka, Avro, SQL

Vaibhav Maheshwari

14 min read

Has Summary

Advanced

Opal: Building a mutable dataset in data lake

The article discusses Opal, a system developed at LinkedIn to manage mutable datasets within a data lake.

Apache, Avro, MySQL, Oracle, SQL

Bhupendra Jain

16 min read

Has Summary

Beginner

Near real-time features for near real-time personalization

The article discusses the implementation of near real-time personalization features at LinkedIn, focusing on how member actions can be leveraged to enhance recommendation systems without significan...

Apache, Apache Kafka, Apache Spark, SQL, V

Rupesh Gupta

17 min read

Has Summary

Advanced

Accelerating the LinkedIn Experience with Azure Front Door

The article discusses LinkedIn's migration to Azure Front Door (AFD) and the significant performance improvements achieved through this transition.

Apache, Azure, CDN, HAProxy, HTTP/3, SQL, Vault

Samir Jafferali

14 min read

Has Summary

Intermediate

DARWIN: Data Science and Artificial Intelligence Workbench at LinkedIn

The article discusses DARWIN, LinkedIn's unified Data Science and Artificial Intelligence Workbench, designed to streamline the workflows of data scientists and AI engineers by centralizing various...

Apache, Artificial Intelligence, Docker, Git, Kubernetes, MySQL, Python, React, Scala, SQL, TensorFlow, XGBoost

Varun S.

20 min read

Has Summary

Intermediate

From daily dashboards to enterprise grade data pipelines

This article discusses the evolution of LinkedIn's Daily Executive Dashboard (DED) from a simple dashboard to a robust enterprise-grade data pipeline.

Avro, Azure, Java, Oracle, Python, Scala, SQL

Jennifer Zheng

16 min read

Has Summary

Advanced

Text analytics on LinkedIn Talent Insights using Apache Pinot

The article discusses the implementation of keyword search functionality in LinkedIn Talent Insights (LTI) using Apache Pinot.

Apache, Apache Spark, CSS, Git, HTML, Java, JavaScript, MySQL, Objective-C, PHP, Rails, Ruby, SQL

Siddharth Teotia

17 min read

Has Summary

Advanced

Solving for the cardinality of set intersection at scale with Pinot and Theta Sketches

This article discusses the challenges and solutions for estimating the cardinality of set intersections at scale using Apache Pinot and Theta Sketches.

Apache, Java, SQL

Vincent Wang

13 min read

Has Summary

Advanced

Coral: A SQL translation, analysis, and rewrite engine for modern data lakehouses

The article discusses Coral, an open-sourced SQL translation, analysis, and rewrite engine developed at LinkedIn for modern data lakehouses.

Apache, Avro, Git, SQL, V

Walaa Eldin Moustafa

20 min read

Has Summary

Intermediate

Magnet: A scalable and performant shuffle architecture for Apache Spark

The article introduces Magnet, a scalable and performant shuffle architecture designed for Apache Spark, addressing the challenges faced in shuffle operations at LinkedIn.

Apache, Apache Spark, Azure, SQL

Min Shen

16 min read

Has Summary

Advanced

LIquid: The soul of a new graph database, Part 2

The article discusses LIquid, a new graph database, focusing on its design and implementation.

Java, Oracle, SQL, SQL Server

Scott Meyer

15 min read

Has Summary

Advanced

Jhubbub on Helix: Stateless and elastic made easy

The article discusses the architectural improvements made to Jhubbub, LinkedIn's internal backend service for processing RSS feeds, by leveraging Apache Helix to create a stateless and elastic syst...

Apache, Azure, SQL

Hunter Lee

11 min read

Has Summary

Intermediate

LIquid: The soul of a new graph database, Part 1

The article introduces LIquid, a new graph database developed by LinkedIn, designed to facilitate real-time querying of the economic graph.

SQL

Scott Meyer

14 min read

Has Summary

Advanced

Building LinkedIn Talent Insights to democratize data-driven decision making

The article discusses the development of LinkedIn Talent Insights, a tool designed to democratize data-driven decision-making in talent management.

Apache, SQL

Tim Santos

12 min read

Has Summary

Beginner

Spark-TFRecord: Toward full support of TFRecord in Spark

The article discusses Spark-TFRecord, a new data source for Apache Spark that aims to provide full support for the TFRecord data format used in TensorFlow.

Apache, Apache Spark, Avro, JSON, Machine Learning, SQL, TensorFlow

Jun Shi

5 min read

Has Summary

Advanced

Introducing Apache Pinot 0.3.0

Apache Pinot 0.3.0 is an open-source, distributed OLAP data store developed at LinkedIn, designed for near-real-time analytics.

Apache, Avro, Azure, Docker, Google Cloud, Google Cloud Storage, Helm, Kubernetes, SQL, Thrift

Mayank S.

9 min read

Has Summary

Intermediate

Advanced schema management for Spark applications at scale

The article discusses advanced schema management techniques for Apache Spark applications at LinkedIn, focusing on the integration of Avro schemas with the Hive Metastore to enhance type safety and...

Apache, Apache Spark, Avro, Java, Scala, SQL

Walaa Eldin Moustafa

14 min read

Has Summary

Intermediate

Data Sentinel: Automating data validation

The article discusses Data Sentinel, a platform developed at LinkedIn to automate data validation and improve data quality in production environments.

Apache, Apache Spark, SQL

Arun Swami

9 min read

Has Summary

Intermediate

How we mapped the “skills genome” of emerging jobs

The article discusses the 'skills genome methodology' developed by LinkedIn to identify unique skills associated with emerging jobs, which are rapidly growing but may lack a large workforce.

Ansible, Apache, Artificial Intelligence, Docker, Keras, Kubernetes, Python, SQL, TensorFlow, Terraform

Zhichun Jenny Ying

8 min read

Has Summary

Intermediate

An inside look at LinkedIn’s data pipeline monitoring system

This article provides an in-depth look at LinkedIn's data pipeline monitoring system, focusing on the challenges faced with traditional monitoring methods and how they have evolved to improve visib...

Apache, Avro, Flask, MySQL, Oracle, SQL, YAML

Krishnan Raman

16 min read

Has Summary

Intermediate

Open sourcing Brooklin: Near real-time data streaming at scale

The article discusses the open-sourcing of Brooklin, a distributed service for near real-time data streaming at scale, which has been in production at LinkedIn since 2016.

AWS, Azure, JSON, MySQL, Oracle, SQL

Celia K.

10 min read

Has Summary

Advanced

Bridging Offline and Nearline Computations with Apache Calcite

The article discusses how Apache Calcite can bridge the gap between offline and nearline computations in big data processing.

Apache, Apache Kafka, Apache Spark, Avro, Java, Python, SQL

Khai Tran

12 min read

Has Summary

Advanced

Managing Distributed Tasks with Helix Task Framework

The article discusses the Helix Task Framework, a component of Apache Helix designed for managing distributed tasks in large-scale data processing systems.

Apache, Java, Kong, SQL

Junkai Xue

15 min read

Has Summary

Advanced

Samza 1.0: Stream Processing at Massive Scale

The article announces the release of Samza 1.0, a distributed stream processing framework developed at LinkedIn, highlighting its significant features and improvements.

Apache, Apache Kafka, Azure, Caching, Java, Kubernetes, Python, Scala, SQL

Jagadish Venkatraman

10 min read

Has Summary

Intermediate

Behind "Big Data" and "AI": Elements of Modern Data Science

The article discusses the essential elements of modern data science, particularly in the context of Big Data and AI.

Apache, Machine Learning, Python, SQL

Michael Li

8 min read

Has Summary

Advanced

Incremental Data Capture for Oracle Databases at LinkedIn: Then and Now

The article discusses the evolution of incremental data capture for Oracle databases at LinkedIn, highlighting the transition from a batch processing model to a near-real-time system.

Apache, Apache Kafka, Iris, Java, MySQL, Oracle, Perl, SQL

Saurabh Goyal

9 min read

Has Summary

Intermediate

Query Analyzer: A Tool for Analyzing MySQL Queries Without Overhead

The article discusses the Query Analyzer, a tool developed by LinkedIn for analyzing MySQL queries with minimal overhead.

Google Cloud, MySQL, SQL

Karthik Appigatla

10 min read

Has Summary

Intermediate

Migrating to Espresso

The article discusses the migration of LinkedIn's internal service, Babylonia, from Oracle to Espresso, a distributed NoSQL database.

Avro, Oracle, SQL

David Max

11 min read

Has Summary

Intermediate

New Analytics for Sharing on LinkedIn: See Who’s Viewed Your Post

The article discusses new analytics features on LinkedIn that allow users to see who has viewed their posts, enhancing their ability to understand audience engagement.

JSON, SQL

Andranik Kurghinyan

7 min read

Has Summary

Beginner

Rethinking Endorsements Infrastructure, Part 2: The New Endorsements Infrastructure

This article discusses the evolution of LinkedIn's Endorsements infrastructure, focusing on the integration of GraphDB to enhance the relevance of suggested endorsements.

Machine Learning, SQL

Victor Kabdebon

6 min read

Has Summary

Intermediate

Rethinking Endorsements Infrastructure, Part 1

The article discusses the evolution of LinkedIn's Endorsements infrastructure, highlighting the need for a more effective system to provide valuable skill endorsements.

Java, Machine Learning, SQL

Victor Kabdebon

7 min read

Has Summary

Advanced

Stream Processing Hard Problems Part II: Data Access

This article discusses the challenges of data access in high-scale stream processing, particularly focusing on the read/write and read-only data access patterns.

Apache, Apache Kafka, AWS, Azure, Cassandra, Oracle, SQL

LinkedIn Engineering Team

21 min read

Has Summary

Advanced

Stream Processing Hard Problems – Part 1: Killing Lambda

The article discusses the challenges faced in stream processing, particularly focusing on the limitations of the Lambda architecture.

Apache, AWS, Azure, Java, Python, SQL

LinkedIn Engineering Team

14 min read

Has Summary

Advanced