Unicode!

(And how ES6 can help)

By Eddie Antonio Santos / @_eddieantonio

(Hypothetical Scenario)

(Thanks, Kim!)

(demo)

What does a String's length actually measure?

Let's consult the standard!

The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”)

The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.

The length of a String is the number of elements (i.e., 16-bit values) within it.

ECMAScript 6.0 Standard §6.1.4

What the 👿 is a UTF-16 code unit value‽

Unicode!

What is Unicode?

  • A mapping of numbers (code points) to every character. Ever.
  • Database of properties for each character (e.g., name, general category).

Code Points

Unique number given to each character

They look like this:

U+hhhh or U+hhhhhh

Range from U+0000 to U+10FFFF

1,114,112 code points available in total

(See Unicode Chapter 3)
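A minimal sketch of the notation above. `formatCodePoint` is a hypothetical helper name (not a built-in); it pads to at least four hex digits, so astral code points simply come out longer.

```javascript
// Format a code point number in U+hhhh notation.
// `formatCodePoint` is a hypothetical helper, not part of any standard API.
function formatCodePoint(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  let hex = codePoint.toString(16).toUpperCase();
  // Pad to at least four hex digits.
  while (hex.length < 4) {
    hex = '0' + hex;
  }
  return 'U+' + hex;
}

formatCodePoint('A'.codePointAt(0));  // 'U+0041'
formatCodePoint('👿'.codePointAt(0)); // 'U+1F47F'

// U+0000 through U+10FFFF inclusive:
const totalCodePoints = 0x10FFFF + 1;  // 1114112
```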

A tour of the Unicode character space!

Code points are divided into 17 planes

The Basic Multilingual Plane

Map of the Basic Multilingual Plane

The Basic Multilingual Plane

(Plane 0)

  • Characters from practically all widely-used modern-day scripts
  • Code points are notated as U+hhhh
  • Code points range from U+0000 to U+FFFF

All Unicode Code Points

Diagram of all 17 Unicode planes

The Astral Planes

  • Everything else (Planes 1-16)
  • Characters from ancient scripts, alternative scripts, pictograms, and rare and archaic CJK(V) ideograms (Chinese-style characters). Also, (most) Emoji.
  • Two entire planes devoted to private use characters
  • Code points are notated as U+hhhhhh
  • Code points range from U+010000 to U+10FFFF
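Since each plane holds 0x10000 code points, the plane number falls out of a bit shift. A rough sketch; `planeOf` and `isAstral` are hypothetical names chosen for illustration.

```javascript
// Each plane spans 0x10000 (65,536) code points, so the plane number
// is the code point with its low 16 bits shifted away.
function planeOf(codePoint) {
  return codePoint >> 16;
}

// Anything outside Plane 0 (the BMP) is "astral".
function isAstral(codePoint) {
  return codePoint > 0xFFFF;
}

planeOf(0x0041);    // 0  (BMP)
planeOf(0x1F47F);   // 1
planeOf(0xF0000);   // 15 (private use)
planeOf(0x10FFFF);  // 16 (private use)
```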

What is not Unicode?

  • a character encoding (it's several character encodings!)
  • Code points ≠ Bytes

Code Unit

The smallest unit of storage an encoding scheme uses to store or transmit text. A single character may need more than one code unit.

Ways of transmitting code points

  • UTF-8
  • UTF-16
  • UTF-32/UCS-4

UTF-16 needs two code units to represent one astral code point
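The two code units are called a surrogate pair. Here is a sketch of the arithmetic behind them; `toSurrogatePair` is a hypothetical helper written for illustration, and in real code the string methods do this for you.

```javascript
// Split an astral code point into its UTF-16 surrogate pair.
// `toSurrogatePair` is a hypothetical helper, not a built-in.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Not an astral code point');
  }
  const offset = codePoint - 0x10000;      // 20 bits remain
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const imp = '👿';          // U+1F47F
imp.length;                // 2 code units
imp.charCodeAt(0);         // 0xD83D (high surrogate)
imp.charCodeAt(1);         // 0xDC7F (low surrogate)
toSurrogatePair(0x1F47F);  // [0xD83D, 0xDC7F]
```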

Back to our problem...

We want to count code points and not code units

Enter:

String.prototype[@@iterator]

(String.prototype.codePointAt() exists too)

When the @@iterator method is called it returns an Iterator object (25.1.1.2) that iterates over the code points of a String value, returning each code point as a String value.
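To see the difference, here is a small sketch that walks the same string both ways. The sample string and variable names are made up for illustration.

```javascript
const text = 'I 👿 JS';

// Indexing walks UTF-16 code units: the emoji shows up as two surrogates.
const byCodeUnit = [];
for (let i = 0; i < text.length; i++) {
  byCodeUnit.push(text.charCodeAt(i).toString(16));
}
// ['49', '20', 'd83d', 'dc7f', '20', '4a', '53']

// for...of uses String.prototype[@@iterator]: one step per code point.
const byCodePoint = [];
for (const ch of text) {
  byCodePoint.push(ch.codePointAt(0).toString(16));
}
// ['49', '20', '1f47f', '20', '4a', '53']
```

Note that codePointAt() still takes a code unit index, so text.codePointAt(3) lands in the middle of the pair and returns the lone low surrogate.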

Compare


let a = [];
for (let c of s) {
  a.push(c);
}
            

vs.


var i, a = [];
for (i = 0; i < s.length; i++) {
  a.push(s[i]);
}
            

Let's fix our code!

Trick: Use Array.from

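(for a string, it does roughly this:)

// The name arrayFromString is hypothetical; it only illustrates Array.from(s).
function arrayFromString(s) {
  let a = [];
  for (let c of s) {
    a.push(c);
  }
  return a;
}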
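With that, counting code points becomes a one-liner. A minimal sketch; `codePointLength` is a hypothetical name, not the app's actual function.

```javascript
// Count code points rather than UTF-16 code units.
// `codePointLength` is a hypothetical helper name.
function codePointLength(str) {
  return Array.from(str).length;
}

const stew = 'a🍲b';
stew.length;            // 4 code units
codePointLength(stew);  // 3 code points
Array.from(stew);       // ['a', '🍲', 'b']
```

The spread form, [...str].length, goes through the same iterator and should give the same count.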
            

Change of plans

(Thanks, Kim!)

(demo)

Three different ways of writing 🍲

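  • phở = ph + o + ◌̛ + ◌̉ (three code points for the vowel)
  • phở = ph + ơ + ◌̉ (two code points for the vowel)
  • phở = ph + ở (one precomposed code point, U+1EDF)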

There are multiple ways of representing the same abstract character sequence
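Written with escapes so the difference is visible, the three spellings look like this. A sketch only; the variable names are made up.

```javascript
// Three spellings that render identically but are different strings.
const fullyDecomposed = 'pho\u031B\u0309'; // o + combining horn + combining hook above
const partlyComposed = 'ph\u01A1\u0309';   // ơ + combining hook above
const precomposed = 'ph\u1EDF';            // ở as a single code point

fullyDecomposed === precomposed;  // false
partlyComposed === precomposed;   // false

// Counting code points does not help: they really do differ in length.
Array.from(fullyDecomposed).length;  // 5
Array.from(partlyComposed).length;   // 4
Array.from(precomposed).length;      // 3
```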

Normalization forms!

Useful for comparing different representations of the same abstract character sequence

  • NFD: Canonical Decomposition
  • NFC: Canonical Decomposition, followed by Canonical Composition
  • NFKD: Compatibility Decomposition
  • NFKC: Compatibility Decomposition, followed by Canonical Composition

(See UAX #15)
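The compatibility forms go further than the canonical ones: they also fold characters that are only "compatibility equivalent", such as ligatures and circled digits. A small sketch with example characters chosen for illustration:

```javascript
const ligature = '\uFB01';      // ﬁ (LATIN SMALL LIGATURE FI)
const circledOne = '\u2460';    // ① (CIRCLED DIGIT ONE)

// Canonical forms leave compatibility characters alone...
ligature.normalize('NFC') === ligature;  // true
ligature.normalize('NFD') === ligature;  // true

// ...while compatibility forms replace them with plain equivalents.
ligature.normalize('NFKC');     // 'fi'
ligature.normalize('NFKD');     // 'fi'
circledOne.normalize('NFKC');   // '1'
```

Because this throws away distinctions a user may have typed on purpose, the compatibility forms tend to suit searching and matching better than storage.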

Canonical (De)composition

  • NFD: ở ⇒ o + ◌̛ + ◌̉
  • NFC: ở ⇒ ở (a single precomposed code point, U+1EDF)
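In code, using String.prototype.normalize(), the two canonical forms round-trip like this. A sketch; `sameText` is a hypothetical helper for comparing strings after normalization.

```javascript
const composed = '\u1EDF';          // ở as one code point
const decomposed = 'o\u031B\u0309'; // o + combining horn + combining hook above

composed.normalize('NFD') === decomposed;  // true
decomposed.normalize('NFC') === composed;  // true

// Mixed input ends up in the same place too.
'\u01A1\u0309'.normalize('NFC') === composed;  // true (ơ + hook above)

// Hypothetical helper: compare two strings by their NFC forms.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

sameText(composed, decomposed);  // true
```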

Enter:

String.prototype.normalize()

When in doubt, use

NFC

Let's fix our app!
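One possible shape for the fix, combining both tricks: normalize to NFC first, then count code points. This is a sketch under assumptions; `countCharacters` is a hypothetical name and not the demo app's actual code.

```javascript
// Normalize to NFC, then count code points instead of code units.
// `countCharacters` is a hypothetical helper name.
function countCharacters(input) {
  return Array.from(input.normalize('NFC')).length;
}

const spellings = [
  'pho\u031B\u0309', // fully decomposed
  'ph\u01A1\u0309',  // partly composed
  'ph\u1EDF',        // precomposed
];

spellings.map(function (s) { return s.length; });  // [5, 4, 3]
spellings.map(countCharacters);                    // [3, 3, 3]
countCharacters('🍲');                             // 1
```

This still counts code points, not what a user perceives as one character: a base letter plus a combining mark with no precomposed form will still count as two.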

Compatibility

Check the compatibility table!

ASK ME QUESTIONS

Resources

Standards

Unicode

Other