🔡 Unicode Normalization Tool – NFC, NFD, NFKC & NFKD Explained
Unicode normalization converts text to a canonical or compatibility equivalent form so that strings which look identical are also byte-identical. Without normalization, the accented letter é can exist as two different byte sequences — the precomposed U+00E9 or the two-code-point sequence U+0065 U+0301 (base e + combining acute accent) — causing silent failures in string comparison, database lookups, and search.
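The mismatch is easy to reproduce. A minimal Python sketch (any language with normalization support behaves the same way):

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e + combining acute accent (U+0065 U+0301)

# The two render identically but are not equal as raw strings...
assert precomposed != decomposed

# ...until both are brought to the same normalization form.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```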
Why Unicode Has Multiple Representations
Unicode was designed to be backward-compatible with hundreds of legacy character sets. This means many characters that look the same have multiple valid encodings. A French word like naïve can be typed with precomposed ï (U+00EF) or decomposed i + ̈ (U+0069 + U+0308). Both render identically on screen, but they are different byte sequences — which breaks equality tests, password hashing, and URL routing.
The Four Normalization Forms
The Unicode Standard defines four normalization forms, each suited to different use cases:
| Form | Process | What It Does | Best For |
|---|---|---|---|
| NFC | Canonical Decomposition + Canonical Composition | Decomposes then recomposes into shortest precomposed form | Web storage, APIs, databases |
| NFD | Canonical Decomposition | Splits precomposed characters into base + combining marks | Accent stripping, linguistic analysis, sorting |
| NFKC | Compatibility Decomposition + Canonical Composition | Replaces compatibility variants (ligatures, fullwidth, fractions) then composes | Search indexes, case-folding, slug generation |
| NFKD | Compatibility Decomposition | Fully decomposes both canonical and compatibility characters — most verbose form | Text indexing, keyword extraction |
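The canonical/compatibility split in the table can be seen directly. A short Python sketch using a string that begins with the fi ligature (U+FB01):

```python
import unicodedata

sample = "\ufb01ancée"  # starts with the fi ligature U+FB01

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, sample)
    print(form, [f"U+{ord(c):04X}" for c in result])

# Canonical forms (NFC/NFD) preserve the ligature; only the
# compatibility forms (NFKC/NFKD) expand it to plain "fi".
assert unicodedata.normalize("NFC", sample).startswith("\ufb01")
assert unicodedata.normalize("NFKC", sample).startswith("fi")
```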
Practical Examples
Ligature resolution (NFKC/NFKD): The typographic fi ligature (U+FB01) becomes fi (U+0066 + U+0069) under NFKC/NFKD. This is essential for search engines and password validators that must treat them as equivalent.
Accent decomposition (NFD): Ångström → A + ̊ + n + g + s + t + r + o + ̈ + m. After removing the combining marks (U+0300–U+036F) you get a plain ASCII string — a common technique for generating URL slugs.
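The accent-stripping technique can be sketched in Python. This version filters by the Unicode category `Mn` (nonspacing mark), which is slightly more robust than the fixed U+0300–U+036F range because it also catches combining marks outside that block:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_accents("Ångström"))  # Angstrom
```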
Precomposition (NFC): Any NFD-decomposed text is recomposed back to precomposed code points when normalized to NFC, giving a compact, portable representation for storage.
Normalization in Programming
Modern languages provide built-in normalization support:
```js
// JavaScript / TypeScript
const nfc = text.normalize("NFC");
const nfkd = text.normalize("NFKD");

// Accent stripping (slug generation)
const slug = text
  .normalize("NFD")
  .replace(/[\u0300-\u036f]/g, "")
  .toLowerCase();
```

```python
# Python 3
import unicodedata
nfc = unicodedata.normalize("NFC", text)
```

```java
// Java
import java.text.Normalizer;
String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
```
Code Point Explorer
Enable the Code Points toggle to see every character in your normalized output listed with its U+XXXX hex code point, decimal value, and UTF-8 byte sequence. Characters that were modified during normalization are highlighted with a Changed badge, making it easy to spot exactly which code points were composed, decomposed, or substituted.
Compare All Forms Side by Side
The Compare All Forms mode displays NFC, NFD, NFKC, and NFKD outputs simultaneously in a four-column grid. Each card shows the normalized text, character count, and UTF-8 byte size, along with a colour badge indicating whether the form differs from the original input. This view is especially useful when diagnosing why two strings that look the same fail an equality check.
UTF-8 Byte Size Impact
Normalization can change the byte size of a string. NFD typically increases byte size (splitting accented characters into base letters plus combining marks), while NFC and NFKC typically decrease it relative to decomposed input (merging combining marks into single precomposed code points). The stats panel shows exact byte counts before and after so you can measure the impact for bandwidth-sensitive applications.
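The effect is easy to measure. A quick Python check on a short accented string:

```python
import unicodedata

text = "déjà vu"

for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, text)
    encoded = normalized.encode("utf-8")
    print(form, len(normalized), "code points,", len(encoded), "bytes")

# NFC: é and à are one 2-byte code point each.
# NFD: each becomes a 1-byte base letter plus a 2-byte combining mark.
assert len(unicodedata.normalize("NFD", text).encode("utf-8")) > \
       len(unicodedata.normalize("NFC", text).encode("utf-8"))
```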
When NOT to Normalize
Avoid normalizing arbitrary binary data or strings that you intend to reproduce exactly (such as cryptographic keys, tokens, or file paths on case-sensitive file systems). Also be cautious with NFKC when working with mathematical notation — it replaces visually distinct symbols (e.g., superscript digits, bold letters) with their plain equivalents, which may alter meaning.
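The NFKC caveat is concrete: compatibility mapping discards typographic distinctions that may carry meaning. For example:

```python
import unicodedata

# NFKC replaces compatibility characters with plain equivalents,
# which can silently change meaning in mathematical text.
print(unicodedata.normalize("NFKC", "x²"))  # x2 (the superscript is lost)
print(unicodedata.normalize("NFKC", "½"))   # 1⁄2 (digits + fraction slash U+2044)

assert unicodedata.normalize("NFKC", "x²") == "x2"
```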