Unicode Normalization

About This Tool

🔡 Unicode Normalization Tool – NFC, NFD, NFKC & NFKD Explained

Unicode normalization converts text to a canonical or compatibility equivalent form so that strings which look identical are also byte-identical. Without normalization, the accented letter é can exist as two different byte sequences — the precomposed U+00E9 or the two-code-point sequence U+0065 U+0301 (base e + combining acute accent) — causing silent failures in string comparison, database lookups, and search.
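
A quick sketch of that failure mode in JavaScript (the same normalize() API this tool uses):

// Two visually identical strings with different byte sequences
const precomposed = "\u00E9";  // "é" as a single code point
const decomposed = "e\u0301"; // "e" plus a combining acute accent
console.log(precomposed === decomposed); // false
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true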

Why Unicode Has Multiple Representations

Unicode was designed to be backward-compatible with hundreds of legacy character sets. This means many characters that look the same have multiple valid encodings. A French word like naïve can be typed with precomposed ï (U+00EF) or decomposed i + ̈ (U+0069 + U+0308). Both render identically on screen, but they are different byte sequences — which breaks equality tests, password hashing, and URL routing.

The Four Normalization Forms

The Unicode Standard defines four normalization forms, each suited to different use cases:

NFC (Canonical Decomposition + Canonical Composition): decomposes, then recomposes into the shortest precomposed form. Best for: web storage, APIs, databases.

NFD (Canonical Decomposition): splits precomposed characters into base + combining marks. Best for: accent stripping, linguistic analysis, sorting.

NFKC (Compatibility Decomposition + Canonical Composition): replaces compatibility variants (ligatures, fullwidth forms, fractions), then composes. Best for: search indexes, case folding, slug generation.

NFKD (Compatibility Decomposition): fully decomposes both canonical and compatibility characters; the most verbose form. Best for: text indexing, keyword extraction.
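
As a rough illustration, here is how the four forms treat the same input (a decomposed é followed by the ﬁ ligature) in JavaScript:

// Sample: decomposed "é" (e + U+0301) followed by the "ﬁ" ligature (U+FB01)
const sample = "e\u0301\uFB01";
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  const codePoints = [...sample.normalize(form)]
    .map((c) => "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"))
    .join(" ");
  console.log(form, codePoints);
}
// NFC  U+00E9 U+FB01               (é recomposed, ligature untouched)
// NFD  U+0065 U+0301 U+FB01        (canonical pieces only)
// NFKC U+00E9 U+0066 U+0069        (ligature replaced by "fi", é recomposed)
// NFKD U+0065 U+0301 U+0066 U+0069 (everything fully decomposed)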

Practical Examples

Ligature resolution (NFKC/NFKD): The typographic ligature ﬁ (U+FB01) becomes fi (U+0066 + U+0069) under NFKC/NFKD. This is essential for search engines and password validators that must treat the two spellings as equivalent.

Accent decomposition (NFD): Ångström decomposes to A + ̊ (U+030A) + n + g + s + t + r + o + ̈ (U+0308) + m. After removing the combining marks (U+0300–U+036F) you get the plain ASCII string Angstrom, a common technique for generating URL slugs.

Precomposition (NFC): Any NFD-decomposed text is recomposed back to precomposed code points when normalized to NFC, giving a compact, portable representation for storage.

Normalization in Programming

Modern languages provide built-in normalization support:

// JavaScript / TypeScript
const nfc = text.normalize("NFC");
const nfkd = text.normalize("NFKD");

// Accent stripping (slug generation)
const slug = text
  .normalize("NFD")
  .replace(/[\u0300-\u036f]/g, "")
  .toLowerCase();

# Python 3
import unicodedata
nfc = unicodedata.normalize("NFC", text)

// Java
import java.text.Normalizer;
String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);

Code Point Explorer

Enable the Code Points toggle to see every character in your normalized output listed with its U+XXXX hex code point, decimal value, and UTF-8 byte sequence. Characters that were modified during normalization are highlighted with a Changed badge, making it easy to spot exactly which code points were composed, decomposed, or substituted.
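
The listing boils down to standard APIs; here is a minimal sketch using codePointAt and TextEncoder (the tool's actual implementation may differ):

const encoder = new TextEncoder();
for (const ch of "\u00C5\u00E9") { // "Åé"
  const cp = ch.codePointAt(0);
  const hex = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
  const bytes = [...encoder.encode(ch)]
    .map((b) => b.toString(16).toUpperCase().padStart(2, "0"))
    .join(" ");
  console.log(`${hex}  dec ${cp}  UTF-8: ${bytes}`);
}
// U+00C5  dec 197  UTF-8: C3 85
// U+00E9  dec 233  UTF-8: C3 A9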

Compare All Forms Side by Side

The Compare All Forms mode displays NFC, NFD, NFKC, and NFKD outputs simultaneously in a four-column grid. Each card shows the normalized text, character count, and UTF-8 byte size, along with a colour badge indicating whether the form differs from the original input. This view is especially useful when diagnosing why two strings that look the same fail an equality check.
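
A minimal sketch of that comparison (the input here is assumed to be a fully decomposed Ångström):

const input = "A\u030Angstro\u0308m"; // "Ångström" with every accent decomposed
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  const out = input.normalize(form);
  console.log(form, [...out].length + " chars", "differs: " + (out !== input));
}
// NFC  8 chars   differs: true  (recomposed)
// NFD  10 chars  differs: false (the input was already NFD)
// NFKC 8 chars   differs: true
// NFKD 10 chars  differs: false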

UTF-8 Byte Size Impact

Normalization can change the byte size of a string. NFD typically increases byte size (splitting accented characters), while NFC and NFKC typically decrease it (merging combining marks into single precomposed code points). The stats panel shows exact byte counts before and after so you can measure the impact for bandwidth-sensitive applications.
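
You can reproduce the byte counts yourself with TextEncoder:

// Byte sizes: NFD grows the string, NFC shrinks it back
const enc = new TextEncoder();
const word = "caf\u00E9"; // "café" in precomposed (NFC) form
console.log(enc.encode(word).length);                  // 5 bytes
console.log(enc.encode(word.normalize("NFD")).length); // 6 bytes (é → e + U+0301)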

When NOT to Normalize

Avoid normalizing arbitrary binary data or strings that you intend to reproduce exactly (such as cryptographic keys, tokens, or file paths on case-sensitive file systems). Also be cautious with NFKC when working with mathematical notation — it replaces visually distinct symbols (e.g., superscript digits, bold letters) with their plain equivalents, which may alter meaning.
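
For example, in JavaScript:

// NFKC flattens visually distinct symbols into plain equivalents
console.log("x\u00B2".normalize("NFKC"));   // "x2" (superscript two becomes a plain 2)
console.log("\u{1D400}".normalize("NFKC")); // "A"  (mathematical bold capital A)
// "x²" and "x2" mean different things, so apply NFKC only when that loss is acceptable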

Frequently Asked Questions

Is the Unicode Normalization tool free?

Yes, the Unicode Normalization tool is completely free :)

Can I use the Unicode Normalization tool offline?

Yes, you can install the web app as a PWA and use it offline.

Is it safe to use the Unicode Normalization tool?

Yes. Any data related to the Unicode Normalization tool is stored only in your browser, and only if storage is required. You can simply clear your browser cache to remove all stored data. We do not store any data on a server.

What is Unicode normalization and why does it matter?

Unicode normalization is the process of converting text to a standard representation so that equivalent sequences produce identical byte patterns. Without it, the same visible character (e.g., 'é') can be stored in two ways — as a single precomposed code point (U+00E9) or as a base letter plus a combining accent (U+0065 U+0301) — causing string comparisons and searches to fail silently.

How does this Unicode Normalization Tool work?

Paste or type any Unicode text in the input area, then choose a normalization form (NFC, NFD, NFKC, or NFKD). The tool instantly applies JavaScript's built-in String.prototype.normalize() to produce the normalized output, highlights which characters changed, and shows character counts and UTF-8 byte sizes before and after. All processing happens locally in your browser — no data is sent to a server.

What is the difference between NFC, NFD, NFKC, and NFKD?

NFC (Canonical Decomposition + Canonical Composition) produces the shortest precomposed form and is preferred for storage and display. NFD (Canonical Decomposition) fully decomposes characters into base + combining marks, which is useful for linguistic analysis. NFKC additionally replaces compatibility characters (e.g., the ﬁ ligature → fi, fullwidth digits → ASCII digits). NFKD applies full compatibility decomposition without recomposition, producing the most decomposed form.

Which normalization form should I use for my application?

Use NFC for web storage, JSON, and most databases — it gives the most compact precomposed form widely expected by web APIs. Use NFD when you need to inspect or strip accent marks. Use NFKC for search indexes, case-folding, or slug generation because it collapses visual variants. Use NFKD when you need the most decomposed form for full text analysis.

What does 'Is Already Normalized' mean?

If the input text produces an identical byte sequence after normalization, the tool shows a green 'Already Normalized' badge. This is useful for validating that incoming data from an API or user input is already in your expected form, saving a normalization step at runtime.
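
Under the hood this amounts to a simple equality test; isAlreadyNormalized below is a hypothetical helper name, not the tool's actual code:

// Check whether text is already in a given normalization form
const isAlreadyNormalized = (text, form = "NFC") => text === text.normalize(form);
console.log(isAlreadyNormalized("\u00E9"));  // true  (precomposed é is already NFC)
console.log(isAlreadyNormalized("e\u0301")); // false (decomposed é is not NFC)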

Does the tool support emoji and surrogate pairs?

Yes. The tool uses the JavaScript spread operator ([...text]) to correctly iterate over Unicode scalar values, which handles surrogate pairs (characters above U+FFFF, such as most emoji) without splitting them. Code point display always shows the full U+XXXXX value for astral characters.
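
A small demonstration of the difference:

// .length counts UTF-16 code units; spreading counts code points
const emoji = "\u{1F44D}"; // 👍 is stored as a surrogate pair
console.log(emoji.length);      // 2 (two UTF-16 code units)
console.log([...emoji].length); // 1 (one code point)
console.log("U+" + emoji.codePointAt(0).toString(16).toUpperCase()); // "U+1F44D"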