Unicode
- It's a much more comprehensive character encoding standard that aims to represent every character from every written system in use today.
- Character is not the right terminology. It's better to use the word grapheme, which means a single unit of a human writing system.
- It assigns a unique number, called a code point, to "things" regardless of platform, program, or language.
- "Things" because "ã" is two code points. The letter and the tilda.
- ==grapheme != code point != a byte==
- Making this mistake have a price.
- Unicode uses a variety of encoding schemes to represent these code points, one of which is UTF-8.
- Some examples of grapheme not available on ASCII.
- Chinese: 你好世界
- Arabic: مرحبا بالعالم
- Mongolian: Сайн уу Дэлхий (Cyrillic)
- Hindi: हैलो वर्ल्ड (Devanagari)
- Emojis: 🤓
- Functions that don't understand unicode are called unicode unaware string functions.
len(🤓)
is 4 but the answer we want is probably 1.- If unicode is important in the language is best to use functions that can handle them.
- 👍🏽 is a good test, it uses a regular thumbs up plus a skin modifier. It's one graphene, two code points, and 4 bytes.
Encoding
The process of turning the code point to binary.