# Unicode and You!

## An Introduction to **Unicode** in **Python**

Eddie Antonio Santos
Special thanks to Jessica Malik!
# What even *is* Unicode?

Isn't Unicode all those çɦåяäcṱǝrŝ that aren't in ASCII?

No! It's all those characters including ASCII.

Isn't Unicode a character encoding?

No! It supports many character encodings!

Isn't Unicode that emoji 💩 company?

Sort of! They are a consortium that standardizes text, characters, and emoji!

Unicode is frustrating

Sometimes, but knowing is half the battle!

What is Unicode?

  • Standard for representing text in computers
  • Maps a code point to every character. Ever.
  • Database of properties for each character
# How do I use Unicode in Python?

How to use Unicode in Python 3

```python
"Hello, World!"
"Dzień dobry!"
"ᑖᓂᓯ"
"Hello, 🌎!"
```

How to use Unicode in Python 2*

```python
u"Hello, World!"
u"Dzień dobry!"
u"ᑖᓂᓯ"
u"Hello, 🌎!"
```
*you must specify the source file's coding at the top of the file
## In Python 2: `"Dzień dobry!" != u"Dzień dobry!"`
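
A rough Python 3 analogue, as a sketch: Python 2's plain `"..."` literal held bytes, which corresponds to `bytes` in Python 3. The variable names are illustrative only.

```python
# Python 3 sketch: text (str) and its encoded form (bytes) are different things.
text = "Dzień dobry!"          # str: a sequence of code points
raw = text.encode("utf-8")     # bytes: roughly what Python 2's "..." literal held

texts_differ = text != raw                  # str and bytes never compare equal
round_trips = raw.decode("utf-8") == text   # decoding recovers the text

print(texts_differ, round_trips)
```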

Recommendation

Use Python 3

Unicode characters

What is a character?

letter, digit, punctuation, symbol

space, formatting, control character

How are characters represented in Unicode?

A number called a Code point

  • A = U+0041
  • Ω = U+03A9
  • 語 = U+8A9E
  • 𐎄 = U+10384

1,114,112 total; 137,374 (12.33%) used
(As of version 11.0)
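
The arithmetic behind those figures can be checked in Python. This is a small sketch; the constant names are made up for illustration, and the assigned count is the Unicode 11.0 figure quoted above.

```python
import sys

# Code points run from U+0000 through U+10FFFF.
TOTAL_CODE_POINTS = sys.maxunicode + 1
ASSIGNED_IN_11_0 = 137_374   # figure quoted above for Unicode 11.0

percent_used = round(100 * ASSIGNED_IN_11_0 / TOTAL_CODE_POINTS, 2)
print(TOTAL_CODE_POINTS, percent_used)
```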

How do I get code points in Python?

ord()
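
A minimal sketch of `ord()` and its inverse `chr()`, using code points from the list above. The `U+XXXX` formatting is just a display convention applied here by hand.

```python
# ord() maps a one-character string to its code point (an int).
samples = ["A", "Ω", "語", "𐎄"]
code_points = [f"U+{ord(ch):04X}" for ch in samples]
print(code_points)

# chr() goes the other way: from code point back to character.
back_again = [chr(ord(ch)) for ch in samples]
```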

What other properties do characters have?

name: unicodedata.name()

general category: unicodedata.category()
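
A short sketch of looking these properties up with the standard `unicodedata` module. The choice of sample characters is arbitrary.

```python
import unicodedata

# Collect (character, name, general category) for a few characters.
rows = [(ch, unicodedata.name(ch), unicodedata.category(ch)) for ch in "AΩ 💩"]

for ch, name, category in rows:
    print(f"U+{ord(ch):04X}", name, category)
```

Note that `unicodedata.name()` raises `ValueError` for characters without a name (such as control characters) unless a default is passed as the second argument.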

Typing Unicode characters in Python

Directly!

By hex code: "\uXXXX" (4 hex digits) or "\UXXXXXXXX" (8 hex digits, zero-padded)

By name: "\N{NAME}"
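
All three spellings produce the same string. A small sketch, with illustrative variable names:

```python
omega_direct = "Ω"
omega_hex = "\u03A9"                          # 4 hex digits
omega_name = "\N{GREEK CAPITAL LETTER OMEGA}"

# Code points above U+FFFF need the 8-digit form.
ugaritic = "\U00010384"

all_equal = omega_direct == omega_hex == omega_name
print(all_equal, ugaritic)
```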

Unicode outside of Python

Outside world

Character Encoding

code points ⬌ bytes

character != byte
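
One way to see that characters and bytes are different things is to compare lengths. This sketch uses one of the greetings from earlier; the encodings chosen are just examples.

```python
text = "Hello, 🌎!"

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")   # little-endian, no byte order mark

# Same text, different byte counts depending on the encoding.
lengths = (len(text), len(as_utf8), len(as_utf16))
print(lengths)
```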

Character encodings

Explosion of character encodings!

US-ASCII

Latin-1 == ISO 8859-1 ⊆ Windows Code Page 1252

ISO 8859-2

Windows Code Page 1251

Macintosh Western == MacRoman

Shift-JIS (several)

EBCDIC (several)

...

GB 2312

Big5
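
Why this explosion hurts: the same byte means different characters in different encodings, and guessing wrong produces garbage. A sketch using codec names from Python's standard library; the sample byte is arbitrary.

```python
raw = b"\xe9"   # one byte, no encoding attached

# Decode the same byte under several legacy encodings.
decoded = {name: raw.decode(name)
           for name in ("latin-1", "cp1252", "cp1251", "mac_roman")}
print(decoded)

# Decoding UTF-8 bytes with the wrong encoding yields mojibake.
mojibake = "é".encode("utf-8").decode("latin-1")
print(mojibake)
```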

A letter written by Madame Marie Curie

Marie Curie

Recommendation

Use UTF-8 character encoding

UTF-8 Supports ALL Unicode characters

Backwards compatible with ASCII
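
Both claims can be checked directly. A minimal sketch, with sample characters taken from the code point list earlier:

```python
# ASCII text encodes to exactly the same bytes in UTF-8.
ascii_text = "Hello, World!"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# UTF-8 is variable-width: 1 to 4 bytes per code point.
samples = ["A", "Ω", "語", "𐎄"]
widths = [len(ch.encode("utf-8")) for ch in samples]

# Even the highest code point, U+10FFFF, round-trips.
highest = chr(0x10FFFF).encode("utf-8")

print(same_bytes, widths, len(highest))
```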

Recommendation

Always explicitly specify the character encoding

```python
# -*- coding: UTF-8 -*-
open("filename", "w", encoding="UTF-8")
# sock is assumed to be a connected socket.socket
sock.sendall("¿Qué haremos mañana?".encode("UTF-8"))
```

```html
<meta charset="UTF-8">
```

```http
Content-Type: application/json; charset=utf-8
```
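
A round-trip sketch of the file case: write text with an explicit encoding, then read it back the same way. The temporary directory and the file name `message.txt` are placeholders for illustration.

```python
import os
import tempfile

message = "¿Qué haremos mañana?"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "message.txt")   # hypothetical file name

    # Write with an explicit encoding instead of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(message)

    # The file on disk contains bytes, not characters.
    with open(path, "rb") as f:
        raw = f.read()

    # Read back with the same explicit encoding.
    with open(path, encoding="utf-8") as f:
        restored = f.read()

print(len(message), len(raw), restored == message)
```

Without `encoding=`, `open()` falls back to a locale-dependent default, so the same script can behave differently from one machine to the next.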

Recap

  • Unicode characters are code points (numbers)
  • Character encodings convert between code points and bytes
  • Recommendation: Use Python 3
  • Recommendation: Use UTF-8
  • Recommendation: Explicitly specify character encoding

ASK ME QUESTIONS

About Unicode, Python, and the intersection thereof!
# Extra links!

[Unicode! And how ES6 can help!](http://www.eddieantonio.ca/unicode-es6/)