Unicode

Created on: September 24, 2023Modified on: September 30, 2023

It's a much more comprehensive character encoding standard that aims to represent every character from every written system in use today.
Character is not the right terminology. It's better to use the word grapheme, which means a single unit of a human writing system.
It assigns a unique number, called a code point, to "things" regardless of platform, program, or language.
- "Things" because "ã" is two code points. The letter and the tilda.
- ==grapheme != code point != a byte==
  - Making this mistake have a price.
Unicode uses a variety of encoding schemes to represent these code points, one of which is UTF-8.
Some examples of grapheme not available on ASCII.
- Chinese: 你好世界
- Arabic: مرحبا بالعالم
- Mongolian: Сайн уу Дэлхий (Cyrillic)
- Hindi: हैलो वर्ल्ड (Devanagari)
- Emojis: 🤓
Functions that don't understand unicode are called unicode unaware string functions.
- len(🤓) is 4 but the answer we want is probably 1.
- If unicode is important in the language is best to use functions that can handle them.
- 👍🏽 is a good test, it uses a regular thumbs up plus a skin modifier. It's one graphene, two code points, and 4 bytes.

Encoding

The process of turning the code point to binary.