How a Jenkins Job Broke our Jenkins UI

Artwork courtesy of the Jenkins project.

At Slack we manage a sophisticated Jenkins infrastructure to continuously build and test our mobile apps before release. We have hundreds of jobs running in a variety of environments. One day something very odd happened: our Jenkins UI stopped working, although the jobs continued to run. This…

Maria Sabastian
8 min read · advanced

Overview

This article walks through a critical incident at Slack in which a Jenkins job broke the Jenkins UI while jobs continued to run. It covers how the team troubleshot the failure, why maintaining a staging environment matters, and the lessons learned from the incident.

What You'll Learn

1. How to troubleshoot Jenkins UI issues effectively
2. Why maintaining separate Jenkins environments is crucial for CI/CD
3. How to implement safer integrations with the Jenkins API

Prerequisites & Requirements

  • Understanding of Jenkins and CI/CD concepts
  • Familiarity with Jenkins plugins and Groovy scripting (optional)

Key Questions Answered

What caused the Jenkins UI to break at Slack?
The Jenkins UI broke after an upgrade of Jenkins and its plugins. After the upgrade, a security restriction in the Groovy sandbox rejected unsandboxed property access in the OfflineMessage class. The restriction was linked to a recent CVE that tightened security measures.
How did Slack resolve the Jenkins UI issue?
Slack resolved the issue by replacing the problematic OfflineMessage class with a simpler implementation using the Jenkins API's OfflineCause.ByCLI, which avoids the complications introduced by the Groovy sandbox restrictions.
What are the best practices for Jenkins troubleshooting?
Best practices include maintaining separate Jenkins environments that mirror production, keeping Jenkins API integrations lean, monitoring Groovy sandbox updates, and having a well-maintained runbook for upgrades and troubleshooting.
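
As an illustration of a lean integration, the sketch below sets a plain-text offline message on a node through Jenkins's HTTP interface. This is not the fix described above, which used OfflineCause.ByCLI from Groovy; it swaps in the `toggleOffline` endpoint to show the same idea of passing a simple string instead of a custom class. The host name, node name, and the helper `build_toggle_offline_request` are hypothetical, and authentication (an API token or crumb) is deliberately left out. The request is only built, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_toggle_offline_request(base_url: str, node_name: str, message: str) -> Request:
    """Build (but do not send) a POST that toggles a node's offline state.

    Assumption: the controller serves /computer/<node>/toggleOffline and
    accepts a plain-text offlineMessage parameter. The endpoint toggles,
    so calling it on a node that is already offline brings it back online.
    """
    node_path = quote(node_name, safe="")          # escape the node name for the path
    query = urlencode({"offlineMessage": message})  # plain string, no custom class
    url = f"{base_url.rstrip('/')}/computer/{node_path}/toggleOffline?{query}"
    return Request(url, data=b"", method="POST")


# Hypothetical controller URL and node name, for illustration only.
req = build_toggle_offline_request(
    "https://jenkins.example.com/", "ios-builder-1", "Disk cleanup in progress"
)
print(req.get_method(), req.full_url)
```

Because the message is an ordinary string, nothing on the controller has to deserialize or render a custom type, which is the property the simpler implementation relies on.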

Technologies & Tools

Jenkins (CI/CD tool): Used for continuous integration and deployment of mobile applications at Slack.
Groovy (scripting language): Used for scripting Jenkins jobs and integrations.

Key Actionable Insights

1. Maintain separate Jenkins environments that mirror production for testing upgrades and changes.
   This practice allows teams to identify potential issues before they affect production, reducing downtime and improving deployment confidence.
2. Keep Jenkins API integrations simple and lean to avoid complications during updates.
   Complex integrations can lead to unexpected issues when Jenkins or its plugins are updated, so minimizing dependencies can help maintain stability.
3. Regularly review and update runbooks for Jenkins processes.
   A well-documented runbook can save time and effort during troubleshooting by providing clear steps and historical context for resolving issues.
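
One way to check that a staging environment really mirrors production is to compare installed plugin versions before and after an upgrade. The sketch below assumes the two plugin lists have already been collected into dictionaries (for example from each controller's plugin manager); the function name `plugin_drift` and the version numbers are illustrative, not taken from the incident.

```python
from typing import Dict, Optional, Tuple

Versions = Dict[str, str]  # plugin short name -> installed version string


def plugin_drift(
    staging: Versions, production: Versions
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return plugins whose versions differ, as name -> (staging, production).

    A value of None means the plugin is not installed in that environment.
    """
    drift = {}
    for name in sorted(set(staging) | set(production)):
        staging_version = staging.get(name)
        production_version = production.get(name)
        if staging_version != production_version:
            drift[name] = (staging_version, production_version)
    return drift


# Illustrative values only.
staging = {"script-security": "2.0", "workflow-cps": "3.1", "git": "4.0"}
production = {"script-security": "1.9", "workflow-cps": "3.1"}

for name, (s, p) in plugin_drift(staging, production).items():
    print(f"{name}: staging={s} production={p}")
```

An empty result suggests the two environments match on plugins; any entry is worth reviewing before trusting a staging test as a predictor of production behavior.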

Common Pitfalls

1. Assuming that Jenkins UI issues are unrelated to job executions.
   This can lead to prolonged downtime as teams may not investigate the root cause effectively. Understanding the interdependencies between UI and job executions is crucial for timely resolutions.

Related Concepts

CI/CD Best Practices
Jenkins Plugin Management
Groovy Scripting In Jenkins