Modernizing the LDAP and Kerberos infrastructure that secures Hadoop at LinkedIn

Aswin M Prabhu
15 min read, intermediate

Overview

The article discusses the modernization of the LDAP and Kerberos infrastructure that secures Hadoop at LinkedIn, detailing the transition from a legacy setup to a highly available, automated system on Azure Linux. It emphasizes the importance of eliminating single points of failure and reducing operational toil while ensuring minimal downtime during migration.

What You'll Learn

1. How to implement a multi-primary setup for LDAP to eliminate single points of failure
2. Why automating LDAP deployments reduces operational toil
3. How to monitor replication lag between LDAP instances during migration

Prerequisites & Requirements

  • Understanding of LDAP and Kerberos concepts
  • Familiarity with Azure Linux and HAProxy (optional)

Key Questions Answered

What were the main issues with the legacy LDAP setup at LinkedIn?
The legacy setup had several issues including single points of failure, manual operations leading to high maintenance toil, and a lack of a test environment for experimentation. These challenges necessitated a modernization effort to improve reliability and operational efficiency.
How does the new multi-primary setup improve LDAP availability?
The new multi-primary setup features four primary instances in a star replication topology, allowing any primary to handle write traffic. This configuration prevents single points of failure, enabling maintenance without downtime and facilitating disaster recovery across data centers.
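
Because any primary can take writes, a client or proxy only needs one healthy primary to keep write traffic flowing. The sketch below illustrates that idea in Python. The hostnames, the `pick_write_target` helper, and the health-check callback are all hypothetical, since the article does not describe how LinkedIn actually routes writes.

```python
from typing import Callable, Optional, Sequence

# Hypothetical hostnames; the article does not name the real instances.
PRIMARIES = [
    "ldap-primary-1.example.com",
    "ldap-primary-2.example.com",
    "ldap-primary-3.example.com",
    "ldap-primary-4.example.com",
]


def pick_write_target(
    primaries: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first primary that passes the health check, or None if all fail."""
    for host in primaries:
        if is_healthy(host):
            return host
    return None


# Simulate taking the first primary down for maintenance.
down = {"ldap-primary-1.example.com"}
target = pick_write_target(PRIMARIES, lambda host: host not in down)
print(target)
```

With a single primary, the same function would return `None` as soon as that one host went down, which is the failure mode the multi-primary design removes.
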
What steps were taken to ensure minimal downtime during the migration?
To ensure minimal downtime, LinkedIn reduced the DNS A record TTL to 1 minute, stopped replication between the old and new primary, and created a CNAME record pointing to the new cluster. This approach allowed for a quick cutover while enabling a rollback if issues arose.
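
Lowering the TTL ahead of time matters because resolvers that cached the record under the old TTL can keep serving it until that older cache entry expires. The sketch below estimates the worst-case staleness at cutover, assuming resolvers honor TTLs. The one-hour original TTL is a made-up value for illustration; the article only states the lowered TTL of 1 minute.

```python
def worst_case_stale_seconds(
    old_ttl: int,
    new_ttl: int,
    elapsed_since_lowering: int,
) -> int:
    """Upper bound on how long a resolver may still serve the old record after cutover.

    A resolver that cached the record just before the TTL was lowered keeps it
    for the remainder of old_ttl; one that cached it afterwards keeps it for at
    most new_ttl.
    """
    remaining_old = max(old_ttl - elapsed_since_lowering, 0)
    return max(new_ttl, remaining_old)


OLD_TTL = 3600  # hypothetical original TTL (1 hour)
NEW_TTL = 60    # lowered TTL from the article (1 minute)

# Cutting over immediately after lowering the TTL gains nothing.
print(worst_case_stale_seconds(OLD_TTL, NEW_TTL, 0))
# Waiting out the old TTL first brings the bound down to the new TTL.
print(worst_case_stale_seconds(OLD_TTL, NEW_TTL, 3600))
```

The same bound applies to a rollback: once the short TTL is in effect everywhere, pointing the name back at the old cluster propagates within roughly the lowered TTL.
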
What role does replication play in the new LDAP infrastructure?
Replication is crucial in the new infrastructure as it ensures that all LDAP instances remain synchronized. It allows for seamless transitions during migrations and helps maintain data consistency across the system, especially during read traffic migrations.
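
One way to quantify how far a replica trails its provider is to compare replication timestamps read from each instance. The sketch below assumes an OpenLDAP-style `contextCSN` value, whose leading field is a UTC timestamp; the article does not say which LDAP server or lag metric LinkedIn uses, so treat both the format and the helper names as assumptions. Fetching the values from the servers is left out.

```python
from datetime import datetime, timezone


def csn_timestamp(csn: str) -> datetime:
    """Extract the UTC timestamp from a CSN such as
    '20240101120005.250000Z#000000#001#000000' (assumed OpenLDAP format)."""
    stamp = csn.split("#", 1)[0]
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def replication_lag_seconds(provider_csn: str, consumer_csn: str) -> float:
    """Seconds by which the consumer trails the provider, floored at zero."""
    delta = csn_timestamp(provider_csn) - csn_timestamp(consumer_csn)
    return max(delta.total_seconds(), 0.0)


# Example values, not taken from a real deployment.
provider = "20240101120005.250000Z#000000#001#000000"
consumer = "20240101120003.000000Z#000000#001#000000"
lag = replication_lag_seconds(provider, consumer)
print(lag)
```

A check like this only reflects lag while writes are arriving; on an idle directory both sides report the same value regardless of replication health.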

Key Statistics & Figures

Data stored in Hadoop clusters: ~5 exabytes
This massive amount of data necessitates a highly secure and reliable infrastructure.
Queries per minute handled by Hadoop: approximately 1 million
The LDAP and Kerberos infrastructure must efficiently support this high volume of queries.
Number of nodes in Hadoop installation: 80k+
The scale of the installation highlights the need for a robust and scalable security infrastructure.

Technologies & Tools

Backend: LDAP
Used for securing Hadoop clusters and managing user authentication.
Security: Kerberos
Provides network authentication for Hadoop clusters.
Operating System: Azure Linux
The new operating system environment for the LDAP and Kerberos infrastructure.
Load Balancer: HAProxy
Used for load balancing read traffic across LDAP worker instances.

Key Actionable Insights

1. Implement a multi-primary LDAP setup to enhance system reliability and eliminate single points of failure.
   This approach not only improves availability but also allows for easier maintenance and disaster recovery, which is crucial for high-demand environments like LinkedIn's Hadoop infrastructure.
2. Automate LDAP deployments using a deployment agent to reduce operational toil and streamline maintenance tasks.
   By automating tasks such as TLS certificate refreshes and service configurations, teams can focus on more strategic initiatives rather than repetitive manual processes.
3. Establish a robust testing environment for LDAP changes to catch issues early in the deployment process.
   A dedicated test cluster allows for canary testing of changes, reducing the risk of introducing errors into the production environment.

Common Pitfalls

1. Relying on a single primary instance for write operations can lead to significant downtime during maintenance.
   With a single primary, all write traffic halts whenever that instance fails or is taken offline for maintenance. A multi-primary setup mitigates this risk because any primary can accept writes.
2. Manual operations in LDAP deployments can lead to inconsistencies and increased operational toil.
   Without automation, tasks such as TLS certificate refreshes and service configurations become error-prone and time-consuming.
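
As an illustration of the kind of check an automated TLS refresh could be built around, the sketch below decides whether a certificate is close enough to expiry to be renewed. It uses Python's standard `ssl.cert_time_to_seconds` to parse the `notAfter` string in the format returned by `SSLSocket.getpeercert()`. The 30-day threshold and the `needs_refresh` helper are assumptions, not details from the article.

```python
import ssl

SECONDS_PER_DAY = 86400


def needs_refresh(not_after: str, now_epoch: float, threshold_days: int = 30) -> bool:
    """Return True if the certificate expires within threshold_days of now_epoch.

    not_after uses the certificate time format, e.g. 'Jan  5 09:34:43 2018 GMT'.
    The 30-day default is a hypothetical policy.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return expires_at - now_epoch <= threshold_days * SECONDS_PER_DAY


# Example expiry string; "now" is supplied explicitly to keep the check deterministic.
NOT_AFTER = "Jan  5 09:34:43 2018 GMT"
expiry = ssl.cert_time_to_seconds(NOT_AFTER)
print(needs_refresh(NOT_AFTER, expiry - 10 * SECONDS_PER_DAY))
print(needs_refresh(NOT_AFTER, expiry - 60 * SECONDS_PER_DAY))
```

Passing the current time as a parameter rather than reading the clock inside the function makes the policy easy to exercise in a test cluster before it reaches production.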

Related Concepts

LDAP Security Practices
Kerberos Authentication Mechanisms
High Availability in Distributed Systems
Automated Deployment Strategies