(Thanks, Kim!)
What does String.length actually measure?
Let's consult the standard!
The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”)
The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.
The length of a String is the number of elements (i.e., 16-bit values) within it.
(ECMAScript 6.0 Standard, §6.1.4)
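To see what that means in practice, here's a quick sketch (the emoji is just an arbitrary example of a character outside the 16-bit range):

'😀'.length;    // 2: U+1F600 is stored as two 16-bit code units (a surrogate pair)
'A😀'.length;   // 3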
A code point is a unique number given to each character
They look like this:
U+hhhh
or U+hhhhhh
Range from U+0000
to U+10FFFF
1,114,112 code points available in total
(See Unicode Chapter 3)
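For illustration, a small sketch of how the U+ notation maps onto what the engine reports (the characters are arbitrary choices):

'A'.codePointAt(0).toString(16);    // "41"     -> U+0041
'😀'.codePointAt(0).toString(16);   // "1f600"  -> U+1F600
String.fromCodePoint(0x10FFFF);     // the highest valid code point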
Code points are divided into 17 planes
U+hhhh: U+0000 to U+FFFF (Plane 0, the Basic Multilingual Plane)
U+hhhhhh: U+010000 to U+10FFFF (Planes 1–16, the supplementary planes)
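Roughly, this is how the two ranges look from UTF-16's point of view (example characters are arbitrary):

'λ'.length;                        // 1: U+03BB lives in the BMP, one code unit
'😀'.length;                       // 2: U+1F600 needs a surrogate pair
'😀'.charCodeAt(0).toString(16);   // "d83d" (high surrogate)
'😀'.charCodeAt(1).toString(16);   // "de00" (low surrogate)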
A code unit is the smallest unit of storage an encoding scheme uses to store or transmit text (16 bits in UTF-16); a single character may require more than one code unit
We want to count code points, not code units
String.prototype[@@iterator]
(String.prototype.codePointAt()
exists too)
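(A tiny illustration of codePointAt(), again with an arbitrary emoji; note that it still indexes by code unit:)

'😀'.codePointAt(0).toString(16);   // "1f600"
'😀'.codePointAt(1).toString(16);   // "de00" (index 1 lands on the low surrogate)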
When the @@iterator method is called it returns an Iterator object (25.1.1.2) that iterates over the code points of a String value, returning each code point as a String value.
// Collect every code point: for..of drives the string's @@iterator
let a = [];
for (let c of s) {
  a.push(c);
}
// Collect every code unit: indexing and .length both work in 16-bit units
var i, a = [];
for (i = 0; i < s.length; i++) {
  a.push(s[i]);
}
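For comparison, a rough check of what the two loops produce for a string containing a supplementary-plane character (the sample string is arbitrary):

let s = 'a😀';

let byCodePoint = [];
for (let c of s) byCodePoint.push(c);
// byCodePoint is ['a', '😀']: 2 entries

let byCodeUnit = [];
for (let i = 0; i < s.length; i++) byCodeUnit.push(s[i]);
// byCodeUnit is ['a', '\ud83d', '\ude00']: 3 entries, the emoji is split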
Trick: use Array.from()
(it just does this:)
function (s) {
  let a = [];
  for (let c of s) {
    a.push(c);
  }
  return a;
}
(Thanks, Kim!)
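A possible usage sketch (the string is just an example):

Array.from('a😀');          // ['a', '😀']
Array.from('a😀').length;   // 2 (code points)
'a😀'.length;               // 3 (code units)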
There are multiple ways of representing the same abstract character sequence
Normalization is useful for comparing different representations of the same abstract character sequence
(See UAX #15)
String.prototype.normalize()
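For example, a small sketch with 'é', which can be written either as the single precomposed code point U+00E9 or as 'e' followed by the combining acute accent U+0301:

let composed = '\u00E9';     // 'é' as one code point
let decomposed = 'e\u0301';  // 'é' as 'e' + combining accent

composed === decomposed;                          // false
composed.normalize() === decomposed.normalize();  // true (both default to NFC)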
When in doubt, use it!
Check the compatibility table!