Secure-by-default input validation

Overview

This article details SafetyCulture's comprehensive approach to secure string input validation in microservices, covering the four essential steps: decode, normalize/canonicalize, sanitize, and validate. It introduces a developer-friendly, secure-by-default validation framework built on Protocol Buffers and gRPC that auto-generates strict validation code, including a novel security sanitizer that replaces dangerous characters with visually similar Unicode alternatives to prevent injection attacks.

What You'll Learn

1. How to implement a four-step string validation pipeline: decode, normalize, sanitize, and validate
2. How to replace dangerous characters with visually similar Unicode alternatives to prevent SQL injection, XSS, and command injection
3. How to auto-generate strict input validators from Protocol Buffer definitions without writing regex manually
4. Why Unicode normalization (NFC/NFD) and encoding validation are critical before applying validation rules
5. How to analyze real user data to craft default allowlist patterns that accept over 99.9% of legitimate input

Prerequisites & Requirements

  • Basic understanding of string encoding formats (UTF-8, URL encoding, base64)
  • Familiarity with common injection vulnerabilities (SQL injection, XSS, command injection)
  • Understanding of microservices architecture and API contracts
  • Protocol Buffers (protobuf) and gRPC familiarity for implementation examples (optional)
  • Experience with the Go programming language for reading code examples (optional)

Key Questions Answered

What are the four essential steps for secure string input validation?
The four steps are: (1) Decode from URL-encoding or non-UTF-8 character sets, (2) Normalize and canonicalize using Unicode normalization (NFC/NFD), (3) Sanitize by removing or replacing dangerous characters like trailing whitespace and Private Use Area characters, and (4) Validate against a strict allowlist of valid characters along with length checks. Each step must be performed in order before the data can be considered validated.
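
The four steps can be sketched in Go as a single function. This is a minimal illustration, not SafetyCulture's generated code: the function name validateTitle, the allowlist pattern, and the 1 to 50 character limit are assumptions chosen for the example. The normalize argument is a placeholder for a real normalizer; in practice it could be norm.NFC.String from the golang.org/x/text/unicode/norm package, which is left out here so the sketch needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"regexp"
	"strings"
	"unicode"
	"unicode/utf8"
)

// allowed is a hypothetical allowlist: letters, combining marks, digits,
// spaces, and a few punctuation characters.
var allowed = regexp.MustCompile(`^[\p{L}\p{M}\p{N} .,!?-]*$`)

// validateTitle runs the four steps in order. The normalize argument stands
// in for a real Unicode normalizer (an assumption of this sketch).
func validateTitle(raw string, normalize func(string) string) (string, error) {
	// 1. Decode: undo URL query encoding, then confirm the bytes are UTF-8.
	s, err := url.QueryUnescape(raw)
	if err != nil {
		return "", fmt.Errorf("decode: %w", err)
	}
	if !utf8.ValidString(s) {
		return "", errors.New("decode: input is not valid UTF-8")
	}

	// 2. Normalize: bring the string into one consistent Unicode form.
	s = normalize(s)

	// 3. Sanitize: trim surrounding whitespace and drop Private Use Area runes.
	s = strings.TrimSpace(s)
	s = strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Co, r) {
			return -1 // a negative rune tells strings.Map to drop the character
		}
		return r
	}, s)

	// 4. Validate: length check in characters, then the strict allowlist.
	if n := utf8.RuneCountInString(s); n < 1 || n > 50 {
		return "", errors.New("validate: length must be 1 to 50 characters")
	}
	if !allowed.MatchString(s) {
		return "", errors.New("validate: disallowed character")
	}
	return s, nil
}

func main() {
	identity := func(s string) string { return s } // placeholder normalizer
	fmt.Println(validateTitle("Hello%20World%21", identity))
	fmt.Println(validateTitle("%3Cscript%3E", identity))
}
```

The order matters: the allowlist in step 4 only sees the string after decoding and sanitizing, so an encoded angle bracket is rejected rather than slipping through as harmless-looking percent signs and digits.
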
How can you prevent SQL injection and XSS without rejecting common characters like quotes and angle brackets?
SafetyCulture's security sanitizer replaces potentially dangerous characters with visually similar Unicode alternatives. For example, the asterisk (*) is replaced with the asterisk operator (∗, U+2217), and single quotes (') are replaced with right single quotation marks (’, U+2019). These replacement characters look nearly identical in most fonts but are not interpreted as operators by SQL engines or HTML parsers, which neutralizes injection payloads while preserving the visual appearance of user input.
Why is checking only string length not sufficient for input validation?
A string is an array of bytes that can contain any character including null bytes and other dangerous characters. Length-only checks miss encoded payloads where characters are obfuscated, Unicode normalization differences where the same letter has multiple representations with different byte lengths, and potentially dangerous characters that could enable injection attacks. The string 'Validation rules!💥' is 18 characters but 21 bytes in UTF-8, demonstrating how length varies by encoding.
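
The character versus byte difference is easy to confirm in Go, where len counts bytes and utf8.RuneCountInString counts code points. The helper name lengths is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lengths reports the size of s in bytes and in Unicode code points.
func lengths(s string) (byteLen, charLen int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, c := lengths("Validation rules!💥")
	fmt.Printf("%d bytes, %d characters\n", b, c) // prints "21 bytes, 18 characters"
}
```

The 17 ASCII characters take one byte each and the emoji takes four, so a limit expressed in bytes and a limit expressed in characters accept different inputs.
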
How does double encoding bypass input validation in microservices?
Attackers double or triple encode data to trick the validating service into accepting input because the encoded characters appear valid. A subsequent service in the request lifecycle may then decode the data, revealing dangerous characters that were not present during initial validation. This is why each system should have its own input validation for untrusted data rather than relying on upstream validation, and why defining trust boundaries is essential.
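
The bypass can be reproduced in a few lines of Go. In this sketch, hasMarkup is a deliberately naive check and decodeOnce models one service decoding its input a single time; both names are invented for the illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hasMarkup is a deliberately naive check used only to show the bypass.
func hasMarkup(s string) bool {
	return strings.ContainsAny(s, "<>")
}

// decodeOnce models a single service URL-decoding its input one time.
func decodeOnce(s string) string {
	out, err := url.QueryUnescape(s)
	if err != nil {
		return s // leave undecodable input untouched in this sketch
	}
	return out
}

func main() {
	payload := "%253Cscript%253E" // "<script>" URL-encoded twice

	atGateway := decodeOnce(payload)    // still encoded once
	downstream := decodeOnce(atGateway) // raw angle brackets appear

	fmt.Printf("%q markup=%t\n", atGateway, hasMarkup(atGateway))
	fmt.Printf("%q markup=%t\n", downstream, hasMarkup(downstream))
}
```

The first service sees only percent signs, digits, and letters, so its check passes. The second decode is what exposes the angle brackets, which is why validation has to run again after every decoding step.
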
How do you determine which Unicode characters to allow in a default validation pattern?
SafetyCulture analyzed real user data by extracting and counting individual characters from popular product features, mapping results to Unicode categories. Characters were classified into four groups: accept by default (letters, numbers, basic punctuation), reject by default (control characters, null bytes), accept optionally (symbols, newlines), and potentially unsafe (characters used in injection attacks). This data-driven approach achieved over 99.9% acceptance of legitimate input.
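
A character census of this kind can be approximated with Go's unicode package. This is a rough sketch under stated assumptions: the bucket names and the coarse grouping below are illustrative and far simpler than the classification described above, and the sample strings are made up.

```go
package main

import (
	"fmt"
	"unicode"
)

// category maps a rune to a coarse, hypothetical bucket. Note that newlines
// and null bytes both land in "control" here; a finer analysis would
// separate them.
func category(r rune) string {
	switch {
	case unicode.IsControl(r):
		return "control"
	case unicode.Is(unicode.Co, r):
		return "private-use"
	case unicode.IsLetter(r):
		return "letter"
	case unicode.IsNumber(r):
		return "number"
	case unicode.IsPunct(r):
		return "punctuation"
	case unicode.IsSymbol(r):
		return "symbol"
	case unicode.IsSpace(r):
		return "space"
	default:
		return "other"
	}
}

// countCategories tallies every character in the samples by bucket.
func countCategories(samples []string) map[string]int {
	counts := make(map[string]int)
	for _, s := range samples {
		for _, r := range s {
			counts[category(r)]++
		}
	}
	return counts
}

func main() {
	fmt.Println(countCategories([]string{"Hi 5!", "💥\x00"}))
}
```

Running a tally like this over real field values shows which buckets are large enough that they cannot be rejected by default, and which are rare enough to exclude.
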
What is the difference between secure and unsafe string validators?
The secure string validator replaces dangerous characters with visually similar Unicode alternatives, producing strings with secure properties against injection vulnerabilities. It's suitable for fields like titles and descriptions where character replacement is acceptable. The unsafe string validator preserves literal input without character replacement, needed for email addresses and URLs where modification would corrupt the data. It still applies strict validation but the name signals that strings may contain dangerous characters requiring extra care.
How can you auto-generate input validation code from API definitions?
Using Protocol Buffer definitions with custom validator annotations, a protoc plugin automatically generates complete validation code including Unicode normalization, encoding checks, security sanitization, length validation, and regex pattern matching. Engineers define simple rules like 'len: 1:50' and 'replace_unsafe: true' in proto files, and the plugin generates Go code with all necessary validation logic. This approach eliminates the need for developers to write regex patterns or implement validation steps manually.
Why is Unicode normalization important before validating strings?
The same letters can be represented in different ways in Unicode. For example, 'ñ' can be stored as one character (U+00F1 in NFC) or as two characters 'n' plus 'combining tilde' (U+006E + U+0303 in NFD). Without normalization, validation patterns may behave inconsistently, length checks produce different results, and string comparisons may fail. Normalizing to a consistent form (NFC or NFD) across your environment ensures reliable validation.
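
The two representations of 'ñ' can be compared directly in Go without any normalization library. The sketch below only demonstrates the mismatch; the constant and function names are made up, and converting between forms would need a normalizer such as the golang.org/x/text/unicode/norm package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	composed   = "\u00f1"  // ñ as a single code point (NFC form)
	decomposed = "n\u0303" // n followed by a combining tilde (NFD form)
)

// describe reports the byte length and code point count of s.
func describe(s string) (byteLen, runeLen int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	cb, cr := describe(composed)
	db, dr := describe(decomposed)
	fmt.Printf("NFC: %d bytes, %d code points\n", cb, cr)
	fmt.Printf("NFD: %d bytes, %d code points\n", db, dr)
	fmt.Println("equal:", composed == decomposed)
}
```

Both strings render as the same letter, yet they differ in length and fail an equality check, so a length limit or lookup applied before normalization can treat identical-looking input differently.
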

Key Statistics & Figures

  • Legitimate input acceptance rate with strict default validation: >99.9%. Reached after adding a Private Use Area codepoint sanitizer to the initial 99% acceptance rate pattern.
  • Initial validation pattern acceptance rate: ~98%. First iteration of the allowlist pattern based on data analysis, later improved to over 99%.
  • Unicode categories analyzed for classification: ~300. Individual Unicode categories and characters evaluated for the default validation pattern.
  • Example string byte length difference: 18 characters vs 21 bytes. The string 'Validation rules!💥' in UTF-8, demonstrating the character vs byte length discrepancy.
  • Median vs maximum string length found in one domain: 31 bytes median, 10,000 bytes maximum. A title field where lack of validation allowed strings far exceeding the expected length.
  • Symbol category percentage of total characters: 0.05%. Despite the low percentage, this represented over 14,000 absolute occurrences that could not be rejected.

Key Actionable Insights

1. Implement a four-step validation pipeline (decode, normalize, sanitize, validate) for all string inputs rather than just checking length or applying regex. Each step addresses a distinct class of security issues, and skipping any step leaves gaps that attackers can exploit through encoding manipulation, Unicode tricks, or dangerous character injection.
   This is especially critical in microservices architectures where data passes through multiple services, each potentially decoding or transforming the data differently.
2. Replace dangerous characters with visually similar Unicode alternatives as a defense-in-depth sanitization layer. Characters like single quotes, asterisks, angle brackets, and pipe characters can be swapped with Unicode lookalikes that are not interpreted as operators by SQL engines, HTML parsers, or command shells.
   This technique is most effective for free-text fields like titles and descriptions where the visual appearance matters more than the exact character. For fields requiring literal characters (emails, URLs), use an explicit unsafe validator instead.
3. Analyze your actual production data to determine which Unicode characters and categories your users need before defining default validation patterns. SafetyCulture extracted and counted individual characters across product features, then classified approximately 300 Unicode categories into accept, reject, optional, and unsafe groups to achieve over 99.9% acceptance of legitimate input.
   Starting with data analysis prevents creating overly restrictive patterns that frustrate users or overly permissive ones that miss threats. The key insight is that even categories representing only 0.05% of total characters can affect thousands of users in absolute terms.
4. Define input validation rules in your API contracts (like Protocol Buffer definitions) and auto-generate validator code rather than requiring developers to write regex patterns or validation logic manually. This approach ensures consistent validation across all endpoints and languages while making it easy for engineers to add validation without security expertise.
   SafetyCulture's approach lets developers write simple annotations like 'len: 1:50' and 'replace_unsafe: true' which auto-generate complete validation code including normalization, sanitization, and character allowlisting. The open-source plugin is available on GitHub.
5. Each system in a microservices architecture should implement its own input validation at trust boundaries rather than relying on upstream validation. Attackers exploit gaps between services by double or triple encoding payloads that pass validation in one service but decode to dangerous characters in another.
   Define trust boundaries explicitly and apply validation after any decoding step. Even if a gateway validates input, downstream services should re-validate after any transformation or decoding occurs.
6. Use custom SAST rules (like Semgrep) to enforce input validation adoption across your codebase by checking that meaningful validation patterns are defined on all string fields in API definitions. This creates a scalable enforcement mechanism that catches missing validation during code review rather than in production.
   SafetyCulture integrated custom Semgrep rules into their GitHub CI pipeline to check protobuf definitions on every pull request, tracking adoption trends over time. This approach scales across hundreds of microservices.

Common Pitfalls

1. Only checking string length as input validation and ignoring character content. A string can contain null bytes, control characters, and injection payloads that all pass a simple length check. Without character-level validation via allowlists, applications remain vulnerable to injection attacks even when length limits are enforced.
   Effective validation requires checking both length and character content against a strict allowlist after decoding, normalizing, and sanitizing the input.
2. Validating encoded data before decoding it, which allows attackers to use double or triple encoding to bypass validation. The encoded representation obfuscates dangerous characters, and when a downstream service decodes the data, the raw dangerous characters are consumed without any validation.
   Always decode data to its canonical form (typically UTF-8) before applying any validation rules. Each service at a trust boundary should decode and re-validate independently.
3. Failing to normalize Unicode before validation, leading to inconsistent behavior. The letter 'ñ' can be represented as one character (NFC) or two characters (NFD), causing validation patterns and length checks to produce different results depending on the input form. This inconsistency can be exploited to bypass validation.
   Choose either NFC or NFD normalization and apply it consistently across your entire environment before any validation logic runs.
4. Requiring developers to write custom regex patterns for input validation, which leads to poor adoption, ReDoS vulnerabilities, and inconsistent patterns across services. SafetyCulture found that their initial regex-based approach had poor adoption because engineers avoided writing complex patterns.
   Provide developer-friendly abstractions like declarative validation rules in API contracts that auto-generate the necessary regex and validation logic, removing the need for security expertise at the individual developer level.
5. Relying on a single upstream service for input validation instead of implementing validation at each trust boundary. In microservices architectures, data flows through multiple services that may decode or transform it differently, creating gaps that bypass initial validation.
   Define trust boundaries explicitly and ensure each service validates untrusted data independently, especially after any decoding or transformation steps.
6. Using a blocklist approach (rejecting known-bad characters) instead of an allowlist approach (accepting only known-good characters). Blocklists inevitably miss dangerous characters or encodings, while allowlists provide a strict security boundary that rejects anything not explicitly permitted.
   SafetyCulture's data-driven approach classified characters into accept, reject, optional, and unsafe categories, building allowlist patterns that accepted over 99.9% of legitimate input while blocking everything else.
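
The difference between the two approaches in the last pitfall can be shown with a short Go comparison. Both checks are hypothetical: the blocklist characters and the allowlist pattern were picked for the example and are not the patterns SafetyCulture derived from its data.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// blocklistOK rejects a handful of known-bad characters (a hypothetical list).
func blocklistOK(s string) bool {
	return !strings.ContainsAny(s, "<>'\"")
}

// allowPattern accepts only letters, digits, and spaces (a hypothetical,
// deliberately narrow allowlist).
var allowPattern = regexp.MustCompile(`^[\p{L}\p{N} ]+$`)

// allowlistOK accepts s only if every character is explicitly permitted.
func allowlistOK(s string) bool {
	return allowPattern.MatchString(s)
}

func main() {
	for _, in := range []string{"Site audit 7", "name\x00", "a`id`"} {
		fmt.Printf("%q blocklist=%t allowlist=%t\n", in, blocklistOK(in), allowlistOK(in))
	}
}
```

The null byte and the backticks were never added to the blocklist, so it lets them through. The allowlist rejects them without needing to know they exist. Go's regexp package also guarantees matching time linear in the input size, which sidesteps the ReDoS risk mentioned in pitfall 4 for patterns written this way.
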

Related Concepts

SQL Injection Prevention
Cross-Site Scripting (XSS) Prevention
Command Injection Prevention
Unicode Normalization (NFC/NFD)
Defense in Depth
Trust Boundaries
Google SafeHTML Types
Regular Expression Denial of Service (ReDoS)
gRPC Middleware Interceptors
Protocol Buffer Code Generation
SAST (Static Application Security Testing)
Semgrep Custom Rules
Web Application Firewall (WAF) Bypass Techniques
Character Encoding (UTF-8, UTF-7, URL Encoding, Base64)
Unicode Private Use Areas
Allowlist vs Blocklist Validation
Microservices Input Validation Patterns