Overview
This article details SafetyCulture's comprehensive approach to secure string input validation in microservices, covering the four essential steps: decode, normalize/canonicalize, sanitize, and validate. It introduces a developer-friendly, secure-by-default validation framework built on Protocol Buffers and gRPC that auto-generates strict validation code, including a novel security sanitizer that replaces dangerous characters with visually similar Unicode alternatives to prevent injection attacks.
What You'll Learn
How to implement a four-step string validation pipeline: decode, normalize, sanitize, and validate
How to replace dangerous characters with visually similar Unicode alternatives to prevent SQL injection, XSS, and command injection
How to auto-generate strict input validators from Protocol Buffer definitions without writing regex manually
Why Unicode normalization (NFC/NFD) and encoding validation are critical before applying validation rules
How to analyze real user data to craft default allowlist patterns that accept over 99.9% of legitimate input
Prerequisites & Requirements
- Basic understanding of string encoding formats (UTF-8, URL encoding, base64)
- Familiarity with common injection vulnerabilities (SQL injection, XSS, command injection)
- Understanding of microservices architecture and API contracts
- Protocol Buffers (protobuf) and gRPC familiarity for implementation examples (optional)
- Experience with the Go programming language for reading code examples (optional)
Key Questions Answered
What are the four essential steps for secure string input validation?
How can you prevent SQL injection and XSS without rejecting common characters like quotes and angle brackets?
Why is checking only string length not sufficient for input validation?
How does double encoding bypass input validation in microservices?
How do you determine which Unicode characters to allow in a default validation pattern?
What is the difference between secure and unsafe string validators?
How can you auto-generate input validation code from API definitions?
Why is Unicode normalization important before validating strings?
Key Actionable Insights
1. Implement a four-step validation pipeline (decode, normalize, sanitize, validate) for all string inputs rather than just checking length or applying regex. Each step addresses a distinct class of security issues, and skipping any step leaves gaps that attackers can exploit through encoding manipulation, Unicode tricks, or dangerous character injection. This is especially critical in microservices architectures where data passes through multiple services, each potentially decoding or transforming the data differently.
2. Replace dangerous characters with visually similar Unicode alternatives as a defense-in-depth sanitization layer. Characters like single quotes, asterisks, angle brackets, and pipe characters can be swapped with Unicode lookalikes that are not interpreted as operators by SQL engines, HTML parsers, or command shells. This technique is most effective for free-text fields like titles and descriptions where the visual appearance matters more than the exact character. For fields requiring literal characters (emails, URLs), use an explicit unsafe validator instead.
3. Analyze your actual production data to determine which Unicode characters and categories your users need before defining default validation patterns. SafetyCulture extracted and counted individual characters across product features, then classified approximately 300 Unicode categories into accept, reject, optional, and unsafe groups to achieve over 99.9% acceptance of legitimate input. Starting with data analysis prevents creating overly restrictive patterns that frustrate users or overly permissive ones that miss threats. The key insight is that even categories representing only 0.05% of total characters can affect thousands of users in absolute terms.
4. Define input validation rules in your API contracts (like Protocol Buffer definitions) and auto-generate validator code rather than requiring developers to write regex patterns or validation logic manually. This approach ensures consistent validation across all endpoints and languages while making it easy for engineers to add validation without security expertise. SafetyCulture's approach lets developers write simple annotations like 'len: 1:50' and 'replace_unsafe: true' which auto-generate complete validation code including normalization, sanitization, and character allowlisting. The open-source plugin is available on GitHub.
5. Each system in a microservices architecture should implement its own input validation at trust boundaries rather than relying on upstream validation. Attackers exploit gaps between services by double or triple encoding payloads that pass validation in one service but decode to dangerous characters in another. Define trust boundaries explicitly and apply validation after any decoding step. Even if a gateway validates input, downstream services should re-validate after any transformation or decoding occurs.
6. Use custom SAST rules (like Semgrep) to enforce input validation adoption across your codebase by checking that meaningful validation patterns are defined on all string fields in API definitions. This creates a scalable enforcement mechanism that catches missing validation during code review rather than in production. SafetyCulture integrated custom Semgrep rules into their GitHub CI pipeline to check protobuf definitions on every pull request, tracking adoption trends over time. This approach scales across hundreds of microservices.
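
The sanitize and validate steps from insights 1 and 2 can be sketched in Go roughly as follows. This is a minimal illustration, not the code SafetyCulture's plugin generates: the function name SanitizeAndValidate, the replacement table, and the allowlist are all assumptions made for this example. Unicode normalization is deliberately left out because it needs the golang.org/x/text/unicode/norm package, which is outside the standard library; in a real pipeline it would run after the encoding check and before sanitization. The main function also shows the double-encoding gap described in insight 5.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
	"unicode"
	"unicode/utf8"
)

// unsafeReplacer swaps characters that SQL engines, HTML parsers, and shells
// treat as syntax for visually similar code points. The table is an assumed,
// illustrative mapping, not the plugin's actual one.
var unsafeReplacer = strings.NewReplacer(
	"'", "\u2019", // RIGHT SINGLE QUOTATION MARK
	"<", "\uFF1C", // FULLWIDTH LESS-THAN SIGN
	">", "\uFF1E", // FULLWIDTH GREATER-THAN SIGN
	"|", "\uFF5C", // FULLWIDTH VERTICAL LINE
	"*", "\uFF0A", // FULLWIDTH ASTERISK
)

// allowed is an assumed allowlist: letters, combining marks, numbers,
// punctuation, symbols, and the plain space. Control characters are rejected.
func allowed(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsMark(r) || unicode.IsNumber(r) ||
		unicode.IsPunct(r) || unicode.IsSymbol(r) || r == ' '
}

// SanitizeAndValidate is a hypothetical helper covering three of the four
// steps: encoding check, sanitize, validate. Normalization is omitted (see
// the note above the code).
func SanitizeAndValidate(s string, minLen, maxLen int) (string, error) {
	// Encoding check: reject byte sequences that are not valid UTF-8.
	if !utf8.ValidString(s) {
		return "", errors.New("invalid UTF-8")
	}
	// Sanitize: replace dangerous characters with lookalikes.
	s = unsafeReplacer.Replace(s)
	// Validate length in characters (runes), not bytes.
	n := utf8.RuneCountInString(s)
	if n < minLen || n > maxLen {
		return "", fmt.Errorf("length %d outside %d:%d", n, minLen, maxLen)
	}
	// Validate every character against the allowlist.
	for _, r := range s {
		if !allowed(r) {
			return "", fmt.Errorf("character %U not allowed", r)
		}
	}
	return s, nil
}

func main() {
	out, err := SanitizeAndValidate("Tom's <b>title</b> | draft*", 1, 50)
	fmt.Println(out, err)

	// Double encoding: after one decode the payload still holds no angle
	// brackets, so a check at this point passes; a second decode in a
	// downstream service produces them.
	once, _ := url.PathUnescape("%253Cb%253E")
	twice, _ := url.PathUnescape(once)
	fmt.Printf("%q then %q\n", once, twice)
}
```

One design caveat with this particular table: the fullwidth replacements have compatibility decompositions, so a later NFKC or NFKD normalization pass would turn them back into the ASCII originals. Canonical normalization (NFC/NFD) does not have that effect.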