Implementing URL Validation and Normalization with uriparser
URL handling is a common need in networking, web applications, and tools that process links. Correctly validating and normalizing URLs reduces security risks, improves caching and comparison, and ensures reliable downstream processing. uriparser is a small, robust C library for parsing and manipulating URIs (Uniform Resource Identifiers). This article explains how to use uriparser to validate and normalize URLs, covers common pitfalls, and provides practical examples and patterns for production use.
What uriparser provides
uriparser focuses on parsing URIs according to RFC 3986 and related standards. Key features:
- Strict, standards-based parsing of URI components (scheme, authority, path, query, fragment).
- API functions to parse a URI string into a struct with individual components.
- Functions to manipulate, recompose, and encode/decode components.
- Support for error detection (invalid characters, malformed sequences).
- Lightweight and permissive licensing for embedding in applications.
Why validation and normalization matter
- Security: Invalid or maliciously crafted URLs can be used in injection attacks, request smuggling, or to bypass filters.
- Consistency: Normalized URLs allow cache keys, deduplication, and comparison to function reliably.
- Interoperability: Different clients/servers may represent the same resource with different but equivalent URLs (case differences, percent-encoding, default ports, trailing slashes).
- User experience: Cleaning and validating user-supplied URLs reduces errors and improves link handling.
Common normalization steps:
- Lowercasing scheme and host.
- Removing default ports (e.g., :80 for http).
- Percent-encoding normalization (decode safe characters; uppercase hex digits).
- Path segment normalization (resolve “.” and “..”).
- Removing duplicate slashes in path when appropriate.
- Sorting or canonicalizing query parameters (if your app depends on it).
Basic uriparser usage: parsing a URL
Below is an outline of typical uriparser workflow (names match common uriparser API concepts). The code examples are conceptual and use typical patterns; refer to uriparser headers for exact function names and types in the version you use.
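    #include <uriparser/Uri.h>

    /* Parse a URL string into a UriUriA structure */
    UriParserStateA state;
    UriUriA uri;
    state.uri = &uri;
    if (uriParseUriA(&state, url_string) != URI_SUCCESS) {
        /* state.errorCode and state.errorPos describe the failure */
        uriFreeUriMembersA(&uri);
        /* handle parse error */
    }

    /* Use uri fields: scheme, hostText, portText, pathHead, query, fragment */

    /* Release the parsed structure when done */
    uriFreeUriMembersA(&uri);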
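After parsing, you can inspect uri.scheme, uri.hostText, uri.portText, uri.query, and the path, which is represented as a linked list of segments (pathHead/pathTail). Each text component is a UriTextRangeA: a pair of pointers, first and afterLast, into the original input string. The ranges are not NUL-terminated, and an absent component has first set to NULL.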
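Because uriparser reports each component as a first/afterLast pointer pair rather than a C string, two small helpers tend to be useful: one that copies a range into a terminated string, and one that compares a range against a literal. The sketch below is illustrative only. The names range_dup and range_equals_nocase are hypothetical, and both take raw pointer pairs (the same shape as UriTextRangeA) so the code compiles without the library.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Copy the half-open range [first, afterLast) into a new NUL-terminated
   string. Returns NULL for an absent component or on allocation failure.
   The caller frees the result. */
char *range_dup(const char *first, const char *afterLast) {
    if (first == NULL || afterLast == NULL || afterLast < first) return NULL;
    size_t len = (size_t)(afterLast - first);
    char *copy = malloc(len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, first, len);
    copy[len] = '\0';
    return copy;
}

/* Compare a range against a C string, ignoring ASCII case.
   Returns 1 on a match, 0 otherwise. */
int range_equals_nocase(const char *first, const char *afterLast,
                        const char *expected) {
    if (first == NULL || afterLast == NULL || afterLast < first) return 0;
    size_t len = (size_t)(afterLast - first);
    if (strlen(expected) != len) return 0;
    for (size_t i = 0; i < len; i++) {
        if (tolower((unsigned char)first[i]) !=
            tolower((unsigned char)expected[i])) {
            return 0;
        }
    }
    return 1;
}
```

With a parsed UriUriA, a call would look like range_dup(uri.scheme.first, uri.scheme.afterLast).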
Validation checklist with uriparser
- Syntactic validity: Let uriparser detect malformed URIs on parse.
  - Check the uriParseUriA result; on failure, state.errorCode and state.errorPos identify the error and where it occurred.
- Required components: Ensure presence of required parts for your use case (e.g., scheme + host for network requests).
  - Reject URLs missing scheme or host if you need absolute references.
- Allowed schemes: Whitelist schemes (http, https, ftp, mailto, etc.) or reject disallowed schemes (javascript:, data:, file:) to avoid XSS or local-file access.
  - Compare uri.scheme case-insensitively.
- Host validation:
  - For domain names: validate labels, optionally check punycode/IDNA if needed.
  - For IPv4/IPv6: validate numeric formats; uriparser reports IP literals separately through uri.hostData (ip4 and ip6).
  - Reject embedded credentials (user:pass@host, exposed as uri.userInfo) unless explicitly allowed.
- Port handling:
  - If present, ensure the port is numeric and within 1–65535.
  - Remove default ports during normalization.
- Path and query safety:
  - Control max lengths to avoid buffer/resource exhaustion.
  - Check for suspicious dot-segment sequences after normalization.
  - Optionally re-encode characters not allowed in particular contexts.
- Character set and percent-encoding:
  - Ensure percent-escapes are valid (a percent sign followed by two hex digits).
  - Normalize percent-encoding to uppercase hex digits or decode safe characters when canonicalizing.
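A few of the checks above can be sketched as small standalone functions. This is a minimal illustration, not a complete validator: the function names are hypothetical, the http/https allowlist is only an example policy, and each function takes a first/afterLast pointer pair (the shape of a uriparser text range) so it can be tried without the library. The percent-escape check may duplicate what the parser already rejects, but it is useful for components handled outside uriparser.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if [first, afterLast) is one of the allowed schemes,
   ignoring ASCII case. The allowlist is an example policy. */
int scheme_is_allowed(const char *first, const char *afterLast) {
    static const char *const allowed[] = { "http", "https" };
    if (first == NULL || afterLast == NULL || afterLast <= first) return 0;
    size_t len = (size_t)(afterLast - first);
    for (size_t i = 0; i < sizeof allowed / sizeof allowed[0]; i++) {
        if (strlen(allowed[i]) != len) continue;
        size_t j = 0;
        while (j < len && tolower((unsigned char)first[j]) == allowed[i][j]) j++;
        if (j == len) return 1;
    }
    return 0;
}

/* Parse a port range into *port. Returns 1 for values 1..65535 and
   0 otherwise (empty, non-digit characters, zero, or too large). */
int parse_port(const char *first, const char *afterLast, unsigned *port) {
    if (first == NULL || afterLast == NULL || afterLast <= first) return 0;
    unsigned long value = 0;
    for (const char *p = first; p < afterLast; p++) {
        if (!isdigit((unsigned char)*p)) return 0;
        value = value * 10 + (unsigned long)(*p - '0');
        if (value > 65535) return 0; /* also prevents overflow on long input */
    }
    if (value == 0) return 0;
    *port = (unsigned)value;
    return 1;
}

/* Return 1 if every '%' in the range is followed by two hex digits. */
int percent_escapes_valid(const char *first, const char *afterLast) {
    if (first == NULL || afterLast == NULL) return 1; /* absent component */
    for (const char *p = first; p < afterLast; p++) {
        if (*p != '%') continue;
        if (afterLast - p < 3) return 0;
        if (!isxdigit((unsigned char)p[1]) || !isxdigit((unsigned char)p[2])) {
            return 0;
        }
        p += 2;
    }
    return 1;
}
```

Keeping each rule in its own function makes it easier to report which check failed and to adjust policy (for example, the allowlist) without touching the parsing code.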
Normalization steps with uriparser (detailed)
Below are concrete normalization steps, with notes on when to use them.
- Lowercase scheme and host
  - RFC 3986: scheme and host are case-insensitive.
  - Convert uri.scheme and uri.hostText to lowercase. The parsed ranges point into the original input, so lowercase a copy rather than the input itself.
- Remove default ports
  - If scheme is http and port is 80, or https and 443, drop the portText.
  - Be careful: nonstandard ports must be preserved.
- Percent-encoding normalization
  - Decode percent-encoded octets that are unreserved characters (A–Z, a–z, 0–9, "-", ".", "_", "~").
  - Re-encode characters that must be percent-encoded in their component.
  - Normalize hex digits to uppercase (e.g., %2f -> %2F).
  - uriparser's uriNormalizeSyntaxA (or uriNormalizeSyntaxExA with a mask) performs case, percent-encoding, and dot-segment normalization; you only need helper code to iterate and adjust percent-escapes if you want behavior it does not offer.
- Path segment normalization (remove “.” and “..”)
  - Implement the dot-segment removal algorithm from RFC 3986 Section 5.2.4.
  - uriparser represents path segments as a linked list; walk it, building a new list while resolving “.” and “..”.
- Remove duplicate slashes (optional)
  - Some servers treat // differently; decide based on application semantics.
- Trailing slash normalization
  - Normalize presence/absence of trailing slash depending on whether you treat directories and resources differently.
- Sort and canonicalize query parameters (application-specific)
  - If canonical representation is needed for caching or signing, split query on & and =, percent-decode names/values when appropriate, sort by name then value, and re-encode.
  - Beware: reordering parameters can change semantics for some endpoints. Only do this when safe.
- Remove fragment (if you are normalizing for network requests)
  - Fragment is client-side only; drop it for resource identity/requests.
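To make the dot-segment step concrete, here is a minimal string-based sketch of the RFC 3986 Section 5.2.4 procedure. It is an assumption-laden illustration: the function name remove_dot_segments is hypothetical, it works in place on a mutable NUL-terminated path rather than on uriparser's segment list, and it is intended for absolute paths (the RFC procedure can turn a relative path such as "a/.." into "/").

```c
#include <string.h>

/* Remove "." and ".." segments from a NUL-terminated path in place,
   following the steps of RFC 3986 section 5.2.4. */
void remove_dot_segments(char *path) {
    char *in = path;   /* start of the unread input */
    char *out = path;  /* end of the output written so far (out <= in) */

    while (*in != '\0') {
        if (strncmp(in, "../", 3) == 0) {              /* step A */
            in += 3;
        } else if (strncmp(in, "./", 2) == 0) {        /* step A */
            in += 2;
        } else if (strncmp(in, "/./", 3) == 0) {       /* step B */
            in += 2;
        } else if (strcmp(in, "/.") == 0) {            /* step B */
            in[1] = '\0';                              /* input becomes "/" */
        } else if (strncmp(in, "/../", 4) == 0 || strcmp(in, "/..") == 0) {
            /* step C: input becomes "/" plus the rest, and the last
               output segment (with its leading '/') is removed */
            if (in[3] == '/') {
                in += 3;
            } else {
                in[1] = '\0';
            }
            while (out > path) {
                out--;
                if (*out == '/') break;
            }
        } else if (strcmp(in, ".") == 0 || strcmp(in, "..") == 0) {
            in += strlen(in);                          /* step D */
        } else {
            /* step E: move one segment, including any leading '/' */
            if (*in == '/') *out++ = *in++;
            while (*in != '\0' && *in != '/') *out++ = *in++;
        }
    }
    *out = '\0';
}
```

Applied to a copy of "/a/b/c/./../../g", the example path from the RFC, this yields "/a/g".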
Example: normalize_url() sketch (conceptual C)
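    /* Sketch: adapt to your uriparser version and helpers. */
    #include <stdlib.h>
    #include <uriparser/Uri.h>

    char *normalize_url(const char *input) {
        UriParserStateA state;
        UriUriA uri;
        char *result = NULL;
        int chars_required = 0;

        state.uri = &uri;
        if (uriParseUriA(&state, input) != URI_SUCCESS) {
            uriFreeUriMembersA(&uri);
            return NULL;
        }

        /* 1. Remove default port. Components are first/afterLast ranges
              into input, so clear the range rather than assigning NULL
              to the struct. is_default_port is a helper you supply. */
        if (uri.portText.first != NULL
                && is_default_port(uri.scheme.first, uri.scheme.afterLast,
                                   uri.portText.first, uri.portText.afterLast)) {
            uri.portText.first = NULL;
            uri.portText.afterLast = NULL;
        }

        /* 2. Lowercase scheme and host, normalize percent-encoding, and
              remove dot-segments. Changed components are copied, so
              input itself is left untouched. */
        /* 3. Recompose URI to string */
        if (uriNormalizeSyntaxA(&uri) == URI_SUCCESS
                && uriToStringCharsRequiredA(&uri, &chars_required) == URI_SUCCESS) {
            chars_required++; /* room for the terminating NUL */
            result = malloc((size_t)chars_required);
            if (result != NULL
                    && uriToStringA(result, &uri, chars_required, NULL) != URI_SUCCESS) {
                free(result);
                result = NULL;
            }
        }

        uriFreeUriMembersA(&uri);
        return result;
    }

Notes: is_default_port is a helper you supply; because it runs before normalization, it should compare the scheme case-insensitively. Here uriNormalizeSyntaxA takes care of lowercasing, percent-encoding normalization (decoding unreserved characters, uppercasing percent hex), and dot-segment removal; if you need finer control, write helpers such as to_lowercase and normalize_segment that operate on copies of the components, with proper memory management. To recompose a string, size the buffer with uriToStringCharsRequiredA and fill it with uriToStringA, and always release the parsed structure with uriFreeUriMembersA.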
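For illustration, two of the helpers named above could be written by hand along the following lines. This is a sketch under stated assumptions, not uriparser API: is_default_port takes raw first/afterLast pointer pairs (the shape of a uriparser text range) and only knows about http and https, and normalize_segment works in place on a mutable, NUL-terminated copy of a segment, leaving malformed escapes untouched.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if [first, after) equals the lowercase word, ignoring ASCII case. */
static int range_is(const char *first, const char *after, const char *word) {
    size_t len = (size_t)(after - first);
    if (strlen(word) != len) return 0;
    for (size_t i = 0; i < len; i++) {
        if (tolower((unsigned char)first[i]) != word[i]) return 0;
    }
    return 1;
}

/* Return 1 if the port is the default for the scheme (http:80, https:443). */
int is_default_port(const char *scheme_first, const char *scheme_after,
                    const char *port_first, const char *port_after) {
    if (!scheme_first || !scheme_after || !port_first || !port_after) return 0;
    if (range_is(scheme_first, scheme_after, "http")) {
        return range_is(port_first, port_after, "80");
    }
    if (range_is(scheme_first, scheme_after, "https")) {
        return range_is(port_first, port_after, "443");
    }
    return 0;
}

static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~" */
static int is_unreserved(int c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

/* Decode escapes of unreserved characters and uppercase the hex digits
   of all other escapes, in place. The result is never longer. */
void normalize_segment(char *s) {
    static const char digits[] = "0123456789ABCDEF";
    const char *in = s;
    char *out = s;
    while (*in != '\0') {
        int hi = -1, lo = -1;
        if (in[0] == '%' && (hi = hex_value(in[1])) >= 0
                         && (lo = hex_value(in[2])) >= 0) {
            int octet = hi * 16 + lo;
            if (is_unreserved(octet)) {
                *out++ = (char)octet;
            } else {
                *out++ = '%';
                *out++ = digits[hi];
                *out++ = digits[lo];
            }
            in += 3;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```

Reserved characters such as "/" stay encoded on purpose: decoding %2F inside a segment would change how the path splits.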
Handling edge cases and internationalized domains
- IDNA (Unicode domain names): uriparser will expose hostText; convert to/from Punycode (ACE) using an IDNA library (libidn2 or similar) if you need to normalize internationalized domain names.
- IPv6 zone identifiers (e.g., fe80::1%eth0, which RFC 6874 writes inside a URI as [fe80::1%25eth0]): validate and preserve or remove zone depending on local vs global addressing needs.
- Relative references: uriparser supports relative URIs; normalization for resolution requires a base URI and applying RFC 3986 resolution.
Performance and memory considerations
- uriparser is lightweight but be mindful of allocation patterns: reuse parser data structures where possible and free uri members after use.
- Limit maximum allowed input length and component sizes to avoid DoS from huge strings.
- For bulk processing, consider streaming or batch parsing with worker threads and pooled buffers.
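The length limit can be enforced before the parser ever sees the input. The sketch below is one possible pre-check; the name trim_and_limit and the 2048-byte cap are assumptions chosen for illustration, not values taken from uriparser or any standard.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_URL_LENGTH 2048 /* assumed policy value; tune per application */

/* Return a newly allocated copy of input with surrounding whitespace
   removed, or NULL if input is NULL, empty after trimming, longer than
   MAX_URL_LENGTH, or if allocation fails. The caller frees the result. */
char *trim_and_limit(const char *input) {
    if (input == NULL) return NULL;

    const char *first = input;
    while (isspace((unsigned char)*first)) first++;

    const char *afterLast = first + strlen(first);
    while (afterLast > first && isspace((unsigned char)afterLast[-1])) {
        afterLast--;
    }

    size_t len = (size_t)(afterLast - first);
    if (len == 0 || len > MAX_URL_LENGTH) return NULL;

    char *copy = malloc(len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, first, len);
    copy[len] = '\0';
    return copy;
}
```

Returning a fresh copy also gives later steps a buffer that stays valid for as long as the parsed ranges point into it.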
Sample flow for a web application
- Accept user-supplied URL string.
- Trim whitespace; reject overly long inputs.
- Parse with uriparser; if parse fails, return a validation error.
- Check scheme whitelist and host presence.
- Normalize scheme, host, port, path, and query per policy.
- Optionally canonicalize query params if safe.
- Use normalized URL for storage, comparisons, or requests; drop fragments for network calls.
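The optional query canonicalization step in this flow could be sketched as follows. This is a deliberately simplified illustration: the name sort_query is hypothetical, parameters are compared as raw bytes without percent-decoding, and empty parameters (from "&&") are dropped. Whether any of that is acceptable depends on the endpoint, so treat it as a starting point only.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: order "name=value" strings by name, then by value. */
static int compare_params(const void *a, const void *b) {
    const char *pa = *(const char *const *)a;
    const char *pb = *(const char *const *)b;
    size_t na = strcspn(pa, "=");
    size_t nb = strcspn(pb, "=");
    size_t n = na < nb ? na : nb;
    int c = memcmp(pa, pb, n);
    if (c != 0) return c;
    if (na != nb) return na < nb ? -1 : 1;
    return strcmp(pa + na, pb + nb);
}

/* Return a newly allocated copy of query (without the leading '?') with
   its '&'-separated parameters sorted. Returns NULL on allocation
   failure. The caller frees the result. */
char *sort_query(const char *query) {
    size_t len = strlen(query);
    char *work = malloc(len + 1);
    char *result = malloc(len + 1);
    char **params = malloc((len + 1) * sizeof *params);
    size_t count = 0;

    if (work == NULL || result == NULL || params == NULL) {
        free(work);
        free(result);
        free(params);
        return NULL;
    }
    memcpy(work, query, len + 1);

    /* Split the working copy in place on '&', skipping empty pieces. */
    for (char *p = work;;) {
        char *amp = strchr(p, '&');
        if (amp != NULL) *amp = '\0';
        if (*p != '\0') params[count++] = p;
        if (amp == NULL) break;
        p = amp + 1;
    }

    qsort(params, count, sizeof *params, compare_params);

    /* Join the sorted parameters back together. */
    char *out = result;
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(params[i]);
        if (i > 0) *out++ = '&';
        memcpy(out, params[i], n);
        out += n;
    }
    *out = '\0';

    free(params);
    free(work);
    return result;
}
```

For example, "b=2&a=1&a=0" becomes "a=0&a=1&b=2". If names or values may be encoded inconsistently, normalize their percent-encoding first so that equivalent parameters sort identically.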
Testing and validation
- Unit tests: include many cases — uppercase/lowercase mix, percent-encoded unreserved chars, dot-segments, default vs explicit ports, IPv6, IDN, userinfo, empty path, query-only URIs.
- Fuzzing: run fuzz tests against your parsing/normalization code to find edge cases and crashes.
- Interoperability tests: compare results against browsers or other canonicalizers for a set of real-world URLs to ensure compatibility.
Summary
uriparser provides a standards-focused foundation for parsing URIs in C. Use it to detect malformed inputs, extract components, and implement normalization steps: lowercase scheme/host, remove default ports, percent-encoding normalization, dot-segment resolution, and application-specific query canonicalization. Combine uriparser’s parsing with careful validation rules (scheme whitelists, host checks, length limits), IDNA handling when needed, and thorough testing to build reliable, secure URL processing in your application.