About This Tool

🔍 Text Encoding Detector – Identify Charset from Bytes

Character encoding is the mapping between raw bytes and the human-readable characters they represent. When the wrong encoding is assumed, text turns into garbled symbols — a phenomenon known as mojibake. The Text Encoding Detector analyzes raw bytes from pasted text, uploaded files, or hex sequences and returns a ranked list of the most likely encodings with per-candidate confidence scores.

📥 Three Input Modes

The tool supports three ways to provide input data, each suited to a different workflow:

  • Text — paste any string directly. The tool re-encodes it to UTF-8 bytes and runs full analysis. Ideal for confirming that a copied snippet is valid UTF-8 or pure ASCII.
  • File Upload — upload a .txt, .csv, .html, .xml, .json, or any other text-based file up to 10 MB. Bytes are read locally via the FileReader API — nothing leaves your browser.
  • Hex Bytes — enter a raw hex byte string (e.g., EF BB BF 48 65 6C 6C 6F). Spaces, colons, and dashes are stripped automatically. Useful for debugging binary data or inspecting BOM prefixes.
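The separator stripping described for hex-byte input can be sketched in a few lines. The function name and error handling below are illustrative, not the tool's actual source:

```javascript
// Sketch of hex-input normalization: strip common separators,
// validate the remainder, and convert pairs of hex digits to bytes.
function parseHexBytes(input) {
  const cleaned = input.replace(/[\s:\-]/g, ""); // drop spaces, colons, dashes
  if (cleaned.length % 2 !== 0 || /[^0-9a-fA-F]/.test(cleaned)) {
    throw new Error("Invalid hex byte string");
  }
  const bytes = new Uint8Array(cleaned.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(cleaned.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}
```

So `"EF:BB-BF 48"` and `"EFBBBF48"` normalize to the same four bytes.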

🔬 How Detection Works

The algorithm runs three phases in sequence:

Phase 1 — BOM Sniffing

The first 2–4 bytes are compared against known Byte Order Marks. A BOM match yields 100% confidence with no further analysis needed.
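BOM sniffing amounts to a prefix comparison against a small table. The sketch below is a simplified illustration (the tool's internal table may differ); note that the longer UTF-32 marks must be checked before UTF-16, because `FF FE` is a prefix of the UTF-32 LE BOM:

```javascript
// Known Byte Order Marks, longest first so UTF-32 wins over UTF-16.
const BOMS = [
  { name: "UTF-32 BE", bom: [0x00, 0x00, 0xfe, 0xff] },
  { name: "UTF-32 LE", bom: [0xff, 0xfe, 0x00, 0x00] },
  { name: "UTF-8 BOM", bom: [0xef, 0xbb, 0xbf] },
  { name: "UTF-16 BE", bom: [0xfe, 0xff] },
  { name: "UTF-16 LE", bom: [0xff, 0xfe] },
];

function sniffBom(bytes) {
  for (const { name, bom } of BOMS) {
    if (bytes.length >= bom.length && bom.every((b, i) => bytes[i] === b)) {
      return { encoding: name, confidence: 100 };
    }
  }
  return null; // no BOM: fall through to the heuristic phases
}
```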

Phase 2 — Multi-byte Sequence Validation

For encodings without a BOM (plain UTF-8, CJK charsets), the tool validates multi-byte sequences according to each encoding's rules and counts valid vs. invalid sequences to compute a confidence score.
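For UTF-8 specifically, "valid vs. invalid sequences" can be counted by walking the lead-byte bit patterns. This is a simplified sketch of the idea (it skips overlong-sequence and surrogate checks that a production validator would also need):

```javascript
// Count well-formed vs. malformed UTF-8 sequences and turn the
// ratio into a 0-100 confidence score.
function utf8Confidence(bytes) {
  let valid = 0, invalid = 0, i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    let len;
    if (b < 0x80) len = 1;                  // 0xxxxxxx: ASCII
    else if ((b & 0xe0) === 0xc0) len = 2;  // 110xxxxx
    else if ((b & 0xf0) === 0xe0) len = 3;  // 1110xxxx
    else if ((b & 0xf8) === 0xf0) len = 4;  // 11110xxx
    else { invalid++; i++; continue; }      // stray continuation or invalid lead
    let ok = i + len <= bytes.length;
    for (let j = 1; ok && j < len; j++) {
      ok = (bytes[i + j] & 0xc0) === 0x80;  // continuations must be 10xxxxxx
    }
    if (ok) { valid++; i += len; } else { invalid++; i++; }
  }
  const total = valid + invalid;
  return total === 0 ? 0 : Math.round((valid / total) * 100);
}
```

Valid UTF-8 text scores 100, while a truncated or random byte stream scores much lower.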

Phase 3 — Byte-Frequency Analysis

For single-byte legacy charsets (ISO-8859, Windows-125x, KOI8), byte-frequency histograms are compared against known encoding fingerprints. Higher ratios of bytes in the characteristic range of a charset raise its confidence score.
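As a toy illustration of the fingerprinting idea (the range and scoring here are simplified assumptions, not the tool's actual tables): KOI8-R packs its Cyrillic letters into 0xC0–0xFF, so a high ratio of high bytes in that range raises its score.

```javascript
// Score a single-byte charset candidate by the fraction of high
// bytes (>= 0x80) that fall inside its characteristic range.
function rangeScore(bytes, lo, hi) {
  let high = 0, inRange = 0;
  for (const b of bytes) {
    if (b >= 0x80) {
      high++;
      if (b >= lo && b <= hi) inRange++;
    }
  }
  return high === 0 ? 0 : Math.round((inRange / high) * 100);
}

// e.g. rangeScore(bytes, 0xc0, 0xff) as a crude KOI8-R letter check
```

A real detector compares full 256-bin histograms against per-encoding fingerprints; this sketch only shows why characteristic ranges carry signal.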

📊 Supported Encoding Families

  • Unicode — UTF-8, UTF-8 BOM, UTF-16 LE/BE, UTF-32 LE/BE, UTF-7
  • ASCII — US-ASCII (7-bit pure)
  • Legacy Western — ISO-8859-1, Windows-1252, ISO-8859-15, MacRoman
  • Cyrillic — Windows-1251, ISO-8859-5, KOI8-R, KOI8-U
  • Other Legacy — ISO-8859-2/7/8/9, Windows-1250/1253/1254/1255/1256
  • CJK — Shift-JIS, EUC-JP, ISO-2022-JP, GBK, GB2312, Big5, EUC-KR

🎯 Reading the Results

After clicking Detect Encoding, the results panel shows:

  • Primary Encoding — the highest-confidence candidate displayed in a prominent card with encoding name, confidence badge, BOM status, and recommended action.
  • Ranked Candidates — all plausible encodings sorted by confidence. Each row has a color-coded confidence bar (green ≥ 90%, yellow 60–89%, red < 60%).
  • Byte Statistics — total byte count, unique byte values, null bytes, high bytes (0x80–0xFF), and ASCII bytes for a quick fingerprint of the data.
  • Multi-Encoding Preview — the input decoded using the top 4 candidate encodings so you can visually confirm which rendering looks correct.
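The byte-statistics panel can be computed in a single pass over the input. A minimal sketch with illustrative field names (not the tool's actual code); note that many null bytes are themselves a hint toward UTF-16/UTF-32:

```javascript
// One-pass byte statistics: totals, unique values, and the
// null / high / ASCII counts shown in the results panel.
function byteStats(bytes) {
  const stats = {
    total: bytes.length,
    unique: new Set(bytes).size,
    nullBytes: 0,
    highBytes: 0,  // 0x80-0xFF
    asciiBytes: 0, // 0x00-0x7F
  };
  for (const b of bytes) {
    if (b === 0x00) stats.nullBytes++;
    if (b >= 0x80) stats.highBytes++;
    else stats.asciiBytes++;
  }
  return stats;
}
```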
Limitation with short inputs
Encoding detection accuracy increases with sample size. Short strings (fewer than ~50 bytes) may produce uncertain results because there are not enough byte patterns to distinguish similar encodings. When in doubt, upload the full file rather than a short excerpt.

💡 Common Use Cases

  • Diagnosing mojibake in exported CSV or database dumps where the wrong encoding was assumed during import.
  • Verifying that a file is genuine UTF-8 before loading it into a UTF-8–only parser or database.
  • Inspecting BOM prefixes in Windows-generated text files that cause issues in Unix environments (EF BB BF at the start of a file is often invisible but causes parse errors).
  • Helping data engineers migrate legacy CJK content from Shift-JIS or GBK to UTF-8 by confirming the source encoding before conversion.
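For the invisible-BOM case above, one common fix (an illustration, not the only approach) is to strip the EF BB BF prefix before handing the bytes to a strict parser:

```javascript
// Drop a leading UTF-8 BOM if present; return the bytes unchanged otherwise.
function stripUtf8Bom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return bytes.subarray(3);
  }
  return bytes;
}
```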

Frequently Asked Questions

Is the Text Encoding Detector free?

Yes, the Text Encoding Detector is completely free to use.

Can I use the Text Encoding Detector offline?

Yes, you can install the web app as a PWA (Progressive Web App) and use it offline.

Is it safe to use Text Encoding Detector?

Yes. Any data related to the Text Encoding Detector is stored only in your browser (when storage is needed at all). You can clear your browser cache to remove all stored data; we do not store any data on a server.

How does the Text Encoding Detector work?

The tool analyzes the raw bytes of your input using three strategies: BOM (Byte Order Mark) sniffing for definitive Unicode identification, multi-byte sequence validation for UTF-8 and CJK encodings, and byte-frequency histogram analysis for legacy single-byte encodings. Each candidate receives a confidence score between 0 and 100%.

What is a BOM and why does it matter?

A Byte Order Mark (BOM) is a fixed byte sequence at the very start of a file that signals its Unicode encoding. For example, UTF-8 BOM starts with EF BB BF, UTF-16 LE with FF FE, and UTF-16 BE with FE FF. When a BOM is present, encoding detection is 100% certain — no heuristics are needed.

Can I detect encoding from pasted text?

Yes. When you paste text, the tool re-encodes it to UTF-8 bytes and then runs the detection algorithm on those bytes. This is useful for confirming that a string contains valid UTF-8 sequences and for checking if it could safely be interpreted as ASCII. For files with unknown legacy encodings, use the File Upload mode.

What encodings can the tool detect?

The tool covers the most common charset families: Unicode (UTF-8, UTF-16 LE/BE, UTF-32 LE/BE, UTF-8 BOM), Western European (ISO-8859-1, Windows-1252), Cyrillic (Windows-1251, ISO-8859-5, KOI8-R), Greek, Turkish, Hebrew, Arabic, CJK Japanese (Shift-JIS, EUC-JP), Simplified Chinese (GBK, GB2312), Traditional Chinese (Big5), and Korean (EUC-KR).

Why do I see multiple candidate encodings with similar confidence?

Many single-byte legacy encodings (e.g., ISO-8859-1 vs Windows-1252, or KOI8-R vs Windows-1251) share large overlapping byte ranges. For short inputs or files with mostly ASCII content, the algorithm cannot always disambiguate between them. Use a language hint or provide a larger sample for more accurate results.

Is my data uploaded to any server?

No. All encoding detection runs entirely in your browser using JavaScript. No bytes from your text or files are ever sent to any external server. Files are read locally via the FileReader API.