Finding leaked passwords with AI: How we built Copilot secret scanning

Passwords are notoriously difficult to detect with conventional programming approaches. AI can find them more reliably because it can take the surrounding context into account. This blog post explores the technical challenges we faced in building the feature and the novel and creative ways we solved them.

Ashwin Mohan
9 min read · intermediate

Overview

The article discusses the development and implementation of Copilot secret scanning, a feature that uses AI to detect leaked passwords in codebases. It covers the challenges faced during its creation, the methodologies employed for testing and iteration, and the improvements made to enhance detection accuracy.

What You'll Learn

1. How to leverage AI for detecting leaked passwords in codebases
2. Why precision in secret detection is crucial for security teams
3. When to apply different prompting strategies for LLMs

Prerequisites & Requirements

  • Understanding of AI and machine learning concepts
  • Familiarity with GitHub and version control systems (optional)

Key Questions Answered

How does Copilot secret scanning detect generic passwords?
Copilot secret scanning uses AI to analyze the context of potential secrets in codebases, focusing on the usage and location of these secrets to reduce noise and deliver relevant alerts. This approach improves upon traditional methods that relied on regular expressions, which often generated excessive false positives.
What challenges were faced during the development of Copilot secret scanning?
The development faced challenges such as the model's difficulty in interpreting unconventional file types and structures, which are not typically represented in the training data for large language models. This necessitated a reevaluation of the detection approach and prompting strategies.
What improvements were made to enhance the precision of secret detection?
Improvements included enhancing the offline evaluation framework, incorporating diverse test cases, experimenting with different models and prompting strategies, and implementing a workload-aware request management system to optimize resource usage.
What was the impact of mirror testing on detection quality?
Mirror testing revealed a significant drop in false positives, with a reported 94% reduction across organizations, indicating that the iterative changes made during development effectively increased precision without sacrificing recall.
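
To make the precision and recall terms above concrete, here is a minimal sketch of how an offline evaluation might score a detector against hand-labeled findings. The function name, the data layout, and every value in the sample sets are illustrative assumptions; none of them come from the article.

```python
def precision_recall(predicted: set[str], actual: set[str]) -> tuple[float, float]:
    """Score a detector's alerts against hand-labeled secrets.

    predicted: identifiers of findings the detector raised (hypothetical).
    actual: identifiers of findings labeled as real passwords (hypothetical).
    """
    true_positives = len(predicted & actual)
    # Precision: what fraction of raised alerts were real secrets.
    precision = true_positives / len(predicted) if predicted else 1.0
    # Recall: what fraction of real secrets were found.
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


# Made-up labels for illustration only.
real_secrets = {"a", "b", "c", "d"}
before = {"a", "b", "c", "x1", "x2", "x3", "x4", "x5", "x6"}  # six false positives
after = {"a", "b", "c", "x1"}  # one false positive, same true positives

p_before, r_before = precision_recall(before, real_secrets)
p_after, r_after = precision_recall(after, real_secrets)
```

In this toy data the second run raises fewer false alerts while finding the same real secrets, so precision rises and recall is unchanged. That is the pattern the mirror-testing result describes.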

Key Statistics & Figures

Reduction in false positives
94%
Measured through mirror testing after iterative improvements in detection methods.
Share of GitHub Secret Protection repositories with password detection
35%
Indicates the current reach of Copilot secret scanning within GitHub Secret Protection repositories.

Technologies & Tools

AI Tool
GitHub Copilot
Used for detecting generic passwords in codebases.
AI Model
GPT-3.5-Turbo
Initially employed for password detection through few-shot prompting.
AI Model
GPT-4
Used as a confirming scanner to validate candidates found by GPT-3.5-Turbo.
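
The two-model arrangement listed here (a first scanner that proposes candidates, and a confirming scanner that validates them) can be sketched as a simple pipeline. This is only a sketch under stated assumptions: the article does not publish its code, so `two_stage_scan` and both stub functions are hypothetical stand-ins, and no real model API is called.

```python
from typing import Callable

# A scanner takes file text and returns candidate secret strings.
Scanner = Callable[[str], list[str]]
# A confirmer takes file text plus one candidate and returns True to keep it.
Confirmer = Callable[[str, str], bool]


def two_stage_scan(text: str, first_pass: Scanner, confirm: Confirmer) -> list[str]:
    """Run a first pass, then keep only candidates the second pass confirms."""
    candidates = first_pass(text)
    return [c for c in candidates if confirm(text, c)]


# Stand-in stubs so the sketch runs without any model access.
def stub_first_pass(text: str) -> list[str]:
    # Pretend the first model flags every quoted value on an assignment line.
    found = []
    for line in text.splitlines():
        if "=" in line and '"' in line:
            found.append(line.split('"')[1])
    return found


def stub_confirm(text: str, candidate: str) -> bool:
    # Pretend the confirming model rejects obvious placeholders.
    return candidate.lower() not in {"changeme", "your-password-here"}


sample = 'db_password = "hunter2-prod"\nexample_password = "changeme"\n'
alerts = two_stage_scan(sample, stub_first_pass, stub_confirm)
```

Because the confirming step only sees what the first step proposed, it can be the more expensive model without being run on every file. In a real system each stub would be replaced by a model call.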

Key Actionable Insights

1. Implement a robust evaluation framework for AI models to ensure detection accuracy.
   This framework should include diverse test cases and feedback from users to continuously refine detection capabilities and reduce false positives.
2. Utilize AI to analyze context when detecting sensitive information in code.
   By focusing on the usage and location of potential secrets, teams can minimize noise and improve the relevance of alerts, enhancing overall security.
3. Adopt a workload-aware request management system to optimize resource usage.
   This approach allows for equitable sharing of resources across different scanning workloads, enhancing performance without overwhelming the system.
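
One way to picture equitable sharing across scanning workloads is a weighted fair-share split of a fixed request budget, where capacity a lightly loaded workload does not need is re-offered to the others. The article does not describe its algorithm, so the sketch below is an assumption throughout: the function `fair_share`, the workload names, and the numbers are all hypothetical.

```python
def fair_share(capacity: int, demand: dict[str, int], weight: dict[str, float]) -> dict[str, int]:
    """Split a request budget across workloads by weight, never exceeding demand.

    Capacity left unused by a lightly loaded workload is re-offered to the others.
    """
    allocation = {name: 0 for name in demand}
    active = {name for name, d in demand.items() if d > 0}
    remaining = capacity
    while remaining > 0 and active:
        total_weight = sum(weight[n] for n in active)
        granted = 0
        for name in sorted(active):
            # Each active workload is offered its weighted slice of what is left.
            share = int(remaining * weight[name] / total_weight)
            grant = min(share, demand[name] - allocation[name])
            allocation[name] += grant
            granted += grant
        # Workloads whose demand is fully met drop out of the next round.
        active = {n for n in active if allocation[n] < demand[n]}
        if granted == 0:
            break  # shares rounded down to zero; stop rather than loop forever
        remaining -= granted
    return allocation


# Hypothetical workloads: a small latency-sensitive one and a large backlog.
budget = fair_share(
    capacity=100,
    demand={"push_scan": 30, "backfill_scan": 500},
    weight={"push_scan": 1.0, "backfill_scan": 1.0},
)
```

Here the small workload gets everything it asks for and the backlog absorbs the rest, so neither starves the other. Integer rounding can leave a few units of capacity unassigned, which a production limiter would need to handle.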

Common Pitfalls

1. Relying solely on regular expressions for secret detection can lead to excessive false positives.
   This approach often fails to account for the varied structures of generic passwords, necessitating a more nuanced AI-based detection strategy.
2. Neglecting the importance of diverse test cases in evaluating AI models.
   Without a wide range of examples, the model may struggle to generalize effectively, leading to poor performance in real-world scenarios.
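
The first pitfall is easy to reproduce. The sketch below uses a deliberately naive pattern, written for this illustration and not taken from any real scanner, to show how a regular expression cannot tell a real password from a placeholder, a test fixture, or a UI string. All sample lines are made up.

```python
import re

# A deliberately naive pattern: any quoted value assigned to a name containing "password".
NAIVE_PASSWORD = re.compile(
    r"""\b\w*password\w*\s*[:=]\s*["']([^"']+)["']""",
    re.IGNORECASE,
)

snippets = [
    'password = "S3cure!x9Qv"',           # plausibly a real secret
    'password = "<your-password-here>"',  # documentation placeholder
    'test_password = "password"',         # test fixture
    'password_label = "Enter password"',  # UI string, not a secret
]

# Collect the captured value from every line the pattern flags.
matches = [m.group(1) for s in snippets if (m := NAIVE_PASSWORD.search(s))]
```

The pattern flags every line, although only the first is plausibly a secret. Telling the rest apart depends on how and where the value is used, which is the context an AI-based detector is meant to weigh.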

Related Concepts

AI/ML In Security
GitHub Secret Protection
Large Language Models In Software Development