🔍 Text Encoding Detector – Identify Charset from Bytes
Character encoding is the mapping between raw bytes and the human-readable characters they represent. When the wrong encoding is assumed, text turns into garbled symbols — a phenomenon known as mojibake. The Text Encoding Detector analyzes raw bytes from pasted text, uploaded files, or hex sequences and returns a ranked list of the most likely encodings with per-candidate confidence scores.
📥 Three Input Modes
The tool supports three ways to provide input data, each suited to a different workflow:
- Text — paste any string directly. The tool re-encodes it to UTF-8 bytes and runs full analysis. Ideal for confirming that a copied snippet is valid UTF-8 or pure ASCII.
- File Upload — upload a .txt, .csv, .html, .xml, .json, or any other text-based file up to 10 MB. Bytes are read locally via the FileReader API — nothing leaves your browser.
- Hex Bytes — enter a raw hex byte string (e.g., EF BB BF 48 65 6C 6C 6F). Spaces, colons, and dashes are stripped automatically. Useful for debugging binary data or inspecting BOM prefixes.
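The hex-input normalization described above is straightforward to sketch. This is an illustrative Python version (the tool itself runs in the browser); the function name `parse_hex_bytes` is an assumption, not part of the tool's API:

```python
import re

def parse_hex_bytes(s: str) -> bytes:
    """Strip whitespace, colons, and dashes, then decode hex pairs to bytes."""
    cleaned = re.sub(r"[\s:\-]", "", s)
    if len(cleaned) % 2 != 0:
        raise ValueError("hex string must contain an even number of digits")
    return bytes.fromhex(cleaned)

# All three separator styles yield the same bytes:
parse_hex_bytes("EF:BB:BF")      # b'\xef\xbb\xbf'
parse_hex_bytes("EF-BB-BF")      # b'\xef\xbb\xbf'
parse_hex_bytes("EF BB BF 48")   # b'\xef\xbb\xbfH'
```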
🔬 How Detection Works
The algorithm runs three phases in sequence:
Phase 1 — BOM Sniffing
The first 2–4 bytes are compared against known Byte Order Marks. A BOM match yields 100% confidence with no further analysis needed.
Phase 2 — Multi-byte Sequence Validation
For encodings without a BOM (plain UTF-8, CJK charsets), the tool validates multi-byte sequences according to each encoding's rules and counts valid vs. invalid sequences to compute a confidence score.
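For UTF-8, "validating multi-byte sequences" means checking that each lead byte is followed by the right number of continuation bytes in the 0x80–0xBF range. A simplified sketch of the valid-vs-invalid tally for UTF-8 only (the tool's actual scoring, and its handlers for the CJK charsets, are not shown in the source):

```python
def utf8_confidence(data: bytes) -> float:
    """Walk the byte stream, counting valid vs. invalid UTF-8 sequences."""
    valid = invalid = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # single-byte ASCII
            n = 0
        elif 0xC2 <= b <= 0xDF:      # lead byte of a 2-byte sequence
            n = 1
        elif 0xE0 <= b <= 0xEF:      # lead byte of a 3-byte sequence
            n = 2
        elif 0xF0 <= b <= 0xF4:      # lead byte of a 4-byte sequence
            n = 3
        else:                        # stray continuation or illegal byte
            invalid += 1
            i += 1
            continue
        tail = data[i + 1:i + 1 + n]
        if len(tail) == n and all(0x80 <= t <= 0xBF for t in tail):
            valid += 1
            i += 1 + n               # consume the whole sequence
        else:
            invalid += 1
            i += 1                   # resynchronize on the next byte
    total = valid + invalid
    return valid / total if total else 1.0
```

A confidence of 1.0 means every sequence validated; bytes like 0xFF or a lead byte without its continuations drag the score down.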
Phase 3 — Byte-Frequency Analysis
For single-byte legacy charsets (ISO-8859, Windows-125x, KOI8), byte-frequency histograms are compared against known encoding fingerprints. Higher ratios of bytes in the characteristic range of a charset raise its confidence score.
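The "characteristic range" idea can be sketched as a ratio: of the high bytes (≥ 0x80) present, how many fall inside the band where a given charset places its letters? The helper below is an illustration only; the tool's real fingerprints and ranges are assumptions here (0xC0–0xFF is used as an example band, where KOI8-R and Windows-1251 place most Cyrillic letters):

```python
def range_ratio(data: bytes, lo: int, hi: int) -> float:
    """Fraction of high bytes (>= 0x80) that fall within [lo, hi]."""
    high = [b for b in data if b >= 0x80]
    if not high:
        return 0.0
    return sum(lo <= b <= hi for b in high) / len(high)

# Example: high bytes clustered in 0xC0-0xFF are consistent with
# Cyrillic letters in KOI8-R or Windows-1251 text.
cyrillic_like = bytes([0x41, 0xC1, 0xD2, 0xE5, 0xF0])
score = range_ratio(cyrillic_like, 0xC0, 0xFF)  # 1.0: all high bytes in band
```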
📊 Supported Encoding Families
| Category | Encodings |
|---|---|
| Unicode | UTF-8, UTF-8 BOM, UTF-16 LE/BE, UTF-32 LE/BE, UTF-7 |
| ASCII | US-ASCII (7-bit pure) |
| Legacy Western | ISO-8859-1, Windows-1252, ISO-8859-15, MacRoman |
| Cyrillic | Windows-1251, ISO-8859-5, KOI8-R, KOI8-U |
| Other Legacy | ISO-8859-2/7/8/9, Windows-1250/1253/1254/1255/1256 |
| CJK | Shift-JIS, EUC-JP, ISO-2022-JP, GBK, GB2312, Big5, EUC-KR |
🎯 Reading the Results
After clicking Detect Encoding, the results panel shows:
- Primary Encoding — the highest-confidence candidate displayed in a prominent card with encoding name, confidence badge, BOM status, and recommended action.
- Ranked Candidates — all plausible encodings sorted by confidence. Each row has a colour-coded confidence bar (green ≥ 90%, yellow 60–89%, red < 60%).
- Byte Statistics — total byte count, unique byte values, null bytes, high bytes (0x80–0xFF), and ASCII bytes for a quick fingerprint of the data.
- Multi-Encoding Preview — the input decoded using the top 4 candidate encodings so you can visually confirm which rendering looks correct.
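The Byte Statistics panel boils down to a handful of counts over the raw bytes. A minimal sketch of those five figures (the function name and dict keys are assumptions, not the tool's API):

```python
def byte_stats(data: bytes) -> dict:
    """Compute the quick-fingerprint statistics shown in the results panel."""
    return {
        "total": len(data),                      # total byte count
        "unique": len(set(data)),                # distinct byte values
        "null": data.count(0),                   # null bytes (0x00)
        "high": sum(b >= 0x80 for b in data),    # high bytes (0x80-0xFF)
        "ascii": sum(b < 0x80 for b in data),    # 7-bit ASCII bytes
    }
```

Many null bytes suggest UTF-16/UTF-32; zero high bytes suggest pure ASCII; a large high-byte share points toward a legacy single-byte or CJK charset.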
💡 Common Use Cases
- Diagnosing mojibake in exported CSV or database dumps where the wrong encoding was assumed during import.
- Verifying that a file is genuine UTF-8 before loading it into a UTF-8–only parser or database.
- Inspecting BOM prefixes in Windows-generated text files that cause issues in Unix environments (EF BB BF at the start of a file is often invisible but causes parse errors).
- Helping data engineers migrate legacy CJK content from Shift-JIS or GBK to UTF-8 by confirming the source encoding before conversion.