🔡 Unicode Character Lookup – Explore Every Code Point
The Unicode Character Lookup tool lets you explore any of the 150,000+ characters in the Unicode standard. Paste a character, enter a code point, or analyse an entire string to retrieve its official Unicode metadata — name, category, block, plane, and a full suite of encoding representations — all computed instantly in your browser.
Three Lookup Modes
Character Mode
Paste or type any single character into the input field. The tool extracts its code point and displays everything you need to know: the official Unicode name (e.g., EURO SIGN), the general category (e.g., Sc – Currency Symbol), the Unicode block (e.g., Currency Symbols), the plane, and eight different encoding formats from UTF-8 bytes through CSS and JavaScript escape sequences.
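As a rough illustration of this kind of lookup (not the tool's own code), Python's standard `unicodedata` module exposes the official name and general category of a character. The helper name `describe_char` is hypothetical.

```python
import unicodedata


def describe_char(ch: str) -> dict:
    """Return basic Unicode metadata for a single code point."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        # Some code points (e.g. control characters) have no name,
        # so fall back to a placeholder instead of raising.
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
    }


print(describe_char("€"))
# {'code_point': 'U+20AC', 'name': 'EURO SIGN', 'category': 'Sc'}
```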
Code Point Mode
Enter a code point in any of these formats and the tool decodes it into the corresponding character and full metadata:
- U+20AC: standard U+ notation
- 0x20AC: C-style hex prefix
- 8364: decimal integer
- 20AC: plain 4–6 digit hex
Valid code points range from U+0000 to U+10FFFF. Surrogate code points (U+D800–U+DFFF) are detected and flagged as invalid for standalone use.
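The four input formats and the range check can be sketched as follows. This is a minimal sketch, not the tool's actual parser: the function names are hypothetical, and it assumes that input made up only of decimal digits is read as decimal (as with `8364`), while other 4–6 digit hex strings are read as hex.

```python
import re

MAX_CODE_POINT = 0x10FFFF


def parse_code_point(text: str) -> int:
    """Parse 'U+20AC', '0x20AC', '8364', or '20AC' into an integer code point."""
    s = text.strip()
    prefixed = re.fullmatch(r"(?:U\+|0x)([0-9A-Fa-f]{1,6})", s, re.IGNORECASE)
    if prefixed:
        cp = int(prefixed.group(1), 16)
    elif re.fullmatch(r"[0-9]+", s):
        cp = int(s)  # assumption: all-digit input is decimal
    elif re.fullmatch(r"[0-9A-Fa-f]{4,6}", s):
        cp = int(s, 16)  # plain hex, 4 to 6 digits
    else:
        raise ValueError(f"unrecognised code point format: {text!r}")
    if cp > MAX_CODE_POINT:
        raise ValueError(f"U+{cp:X} is beyond U+10FFFF")
    return cp


def is_surrogate(cp: int) -> bool:
    """True for U+D800 to U+DFFF, which are invalid on their own."""
    return 0xD800 <= cp <= 0xDFFF


for raw in ("U+20AC", "0x20AC", "8364", "20AC"):
    print(raw, "->", f"U+{parse_code_point(raw):04X}")
```

All four inputs above resolve to U+20AC.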
String Analysis Mode
Enter a multi-character string (up to 200 code points) and the tool generates a per-character breakdown table. Each row shows the glyph, code point badge, Unicode name, category, and UTF-8 bytes. A summary footer reports the total number of code points, the UTF-8 byte length, whether the string contains surrogate pairs, and whether it includes right-to-left characters from Arabic, Hebrew, or Syriac scripts.
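A per-character breakdown with a summary can be approximated in a few lines of Python. This is only a sketch: `analyse_string` is a hypothetical name, and right-to-left detection here relies on the bidirectional classes `R` and `AL`, which cover Hebrew, Arabic, and Syriac but also a few other scripts, so it may be broader than the tool's own check.

```python
import unicodedata


def analyse_string(text: str, limit: int = 200) -> dict:
    """Build a per-code-point table plus summary statistics."""
    if len(text) > limit:
        raise ValueError(f"at most {limit} code points are supported")
    rows = []
    for ch in text:
        rows.append({
            "glyph": ch,
            "code_point": f"U+{ord(ch):04X}",
            "name": unicodedata.name(ch, "<unnamed>"),
            "category": unicodedata.category(ch),
            # surrogatepass keeps lone surrogates from raising here
            "utf8": ch.encode("utf-8", "surrogatepass").hex(" ").upper(),
        })
    summary = {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8", "surrogatepass")),
        # Code points above U+FFFF need a surrogate pair in UTF-16.
        "needs_surrogate_pairs": any(ord(ch) > 0xFFFF for ch in text),
        "has_rtl": any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text),
    }
    return {"rows": rows, "summary": summary}


result = analyse_string("a€😀")
for row in result["rows"]:
    print(row["code_point"], row["name"], row["utf8"])
print(result["summary"])
```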
Encoding Formats Explained
| Format | Example (€ = U+20AC) | Use case |
|---|---|---|
| UTF-8 Bytes | E2 82 AC | File storage, network transmission, databases |
| UTF-16 Code Unit(s) | 0x20AC | JavaScript, Java, Windows APIs |
| UTF-32 | 0x000020AC | Internal processing, simple indexing |
| HTML Entity | `&#x20AC;` / `&euro;` | HTML source code, XML documents |
| URL Encoded | %E2%82%AC | Query strings, URI components |
| CSS Escape | \20AC | CSS content property, selector escaping |
| JS / Python Escape | \u20AC | String literals in JavaScript, Python, Java |
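
The representations in the table can be reproduced with the Python standard library. The sketch below is illustrative only; `encoding_formats` is a hypothetical helper, and the escape row is emitted per UTF-16 code unit (the JavaScript and Java form), whereas Python would write a supplementary character as `\U0001F600`.

```python
from urllib.parse import quote


def encoding_formats(ch: str) -> dict:
    """Compute common encoded forms of a single code point."""
    cp = ord(ch)
    utf16 = ch.encode("utf-16-be")
    # Split the big-endian UTF-16 bytes into 16-bit code units.
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    return {
        "utf8": ch.encode("utf-8").hex(" ").upper(),
        "utf16": " ".join(f"0x{u:04X}" for u in units),
        "utf32": f"0x{cp:08X}",
        "html": f"&#x{cp:X};",
        "url": quote(ch),  # percent-encodes the UTF-8 bytes
        "css": f"\\{cp:X}",
        "js": "".join(f"\\u{u:04X}" for u in units),
    }


for label, value in encoding_formats("€").items():
    print(f"{label:6} {value}")
```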
Understanding Unicode Structure
Unicode organises its 1,114,112 code points into 17 planes of 65,536 code points each. Plane 0 (the Basic Multilingual Plane or BMP, U+0000–U+FFFF) covers almost every character in modern use: Latin alphabets, the most common CJK ideographs, Greek, Cyrillic, Arabic, Hebrew, and the bulk of symbols and punctuation. Plane 1 (the Supplementary Multilingual Plane) holds most emoji, historic scripts, and musical notation. Planes 2–3 extend CJK coverage with rare ideographs.
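Because each plane spans exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A small sketch (the function name `plane_of` is illustrative):

```python
def plane_of(cp: int) -> int:
    """Return the plane number (0 to 16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    return cp >> 16  # same as cp // 0x10000


# 17 planes of 65,536 code points give the full code space.
assert 17 * 0x10000 == 1_114_112

print(plane_of(0x20AC))   # 0 (BMP)
print(plane_of(0x1F600))  # 1 (Supplementary Multilingual Plane)
print(plane_of(0x20000))  # 2
```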
Within each plane, characters are grouped into named blocks (e.g., Currency Symbols, Emoticons, Hiragana). The tool identifies the block for every code point so you can immediately understand the script or symbol category.
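Block lookup amounts to a search over sorted, non-overlapping ranges. The sketch below uses a deliberately tiny excerpt of the block list (four entries, chosen for illustration); a real implementation would load the full range table from the Unicode Character Database. `BLOCKS` and `block_of` are hypothetical names.

```python
from bisect import bisect_right

# Partial excerpt only, sorted by start; the real table has hundreds of blocks.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x20A0, 0x20CF, "Currency Symbols"),
    (0x3040, 0x309F, "Hiragana"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
_STARTS = [start for start, _, _ in BLOCKS]


def block_of(cp: int):
    """Return the block name for cp, or None if it is not in this excerpt."""
    i = bisect_right(_STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0:
        start, end, name = BLOCKS[i]
        if start <= cp <= end:
            return name
    return None


print(block_of(0x20AC))    # Currency Symbols
print(block_of(ord("あ")))  # Hiragana
print(block_of(0x1F600))   # Emoticons
```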
Each character also carries a general category property: a two-letter code such as Lu (Uppercase Letter), Nd (Decimal Digit), Sc (Currency Symbol), or So (Other Symbol). These categories drive text-processing algorithms such as identifier parsing and regular expression character classes (for example, \p{Lu}).
UTF-8 and Surrogate Pairs
UTF-8 encodes BMP characters (U+0000–U+FFFF) in 1–3 bytes and supplementary characters (U+10000–U+10FFFF) in 4 bytes, using a variable-length scheme designed for ASCII compatibility. UTF-16, used internally by JavaScript and Java, represents supplementary characters as surrogate pairs — two consecutive 16-bit code units in the ranges U+D800–U+DBFF (high surrogate) and U+DC00–U+DFFF (low surrogate). The tool detects when a supplementary character requires a surrogate pair and shows both code units.
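The byte-length thresholds and the surrogate-pair arithmetic can be written out directly. This is a sketch of the standard UTF-8 and UTF-16 rules rather than the tool's source; the function names are illustrative.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 uses for a code point."""
    if cp < 0x80:
        return 1  # ASCII
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3  # rest of the BMP
    return 4      # supplementary planes


def surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    v = cp - 0x10000              # 20-bit value
    high = 0xD800 + (v >> 10)     # top 10 bits
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F600)
print(f"U+1F600 -> 0x{high:04X} 0x{low:04X}")  # U+1F600 -> 0xD83D 0xDE00
print(utf8_length(0x20AC), utf8_length(0x1F600))  # 3 4
```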
Common Use Cases
- Debugging encoding issues: find why a character displays as a replacement symbol or garbled text by checking its UTF-8 byte sequence.
- Web development: copy the HTML entity or CSS escape ready to paste into source code without worrying about character encoding in the file.
- Internationalisation (i18n): verify that a string contains the expected code points and does not accidentally include lookalike characters from a different script.
- Security research: identify homoglyph characters (e.g., Cyrillic а U+0430 vs. Latin a U+0061) that could be used in phishing domain names or IDN homograph attacks.
- Learning Unicode: explore how emoji, mathematical symbols, or rare scripts are encoded and what their official Unicode names are.
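
For the lookalike and homoglyph checks above, one rough heuristic is to take the first word of each letter's Unicode name as a script hint and flag strings that mix hints. This is only a sketch: Python's standard library does not expose the Unicode Script property, so the name prefix is an approximation, and `script_hints` is a hypothetical helper.

```python
import unicodedata


def script_hints(text: str) -> set:
    """Collect the first word of each letter's Unicode name (e.g. 'LATIN')."""
    hints = set()
    for ch in text:
        # Only letters are considered; digits and punctuation are shared
        # between scripts and would add noise.
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                hints.add(name.split()[0])
    return hints


suspicious = "p\u0430ypal"  # second letter is Cyrillic U+0430, not Latin 'a'
print(sorted(script_hints(suspicious)))  # ['CYRILLIC', 'LATIN']
print(sorted(script_hints("paypal")))    # ['LATIN']
```

A string that yields more than one hint is worth inspecting character by character.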