Unicode composition and decomposition
Unicode has a class of characters known as “canonical compositions”1
which use a single code point to represent what would otherwise bed
a regular character and a combining character.
For example, “é” (an e with acute accent)
may either be represented as
U+0065 LATIN SMALL LETTER E followed by U+00B4 ACUTE ACCENT;
or as U+00E9 LATIN SMALL LETTER E WITH ACUTE
Normalisation
This is a potential source of issues for string comparison, since while the two examples listed above are (canonically) the same, they will be detected as different. As a result, to compare strings, it is advisable to normalise them beforehand, either in the direction of decomposition or of composition
JavaScript
In JavaScript, the built-in string API provides String.prototype.normalize() for this very purpose.
Importantly, it is possible to canonically decompose strings, or alternatively to decompose and then compose. Both of these normalise the string.
- Use the form argument
"NFD"to decompose. - Use the form argument
"NFC"to decompose and then compose (this is the default behaviour if no argument is provided).
import { fromCharCodes} from "https://deno.land/x/iter@v2.5.0/lib/generators.ts";
const dec = "é".normalize("NFD");
const com = "é".normalize("NFC");
const codes = [...fromCharCodes("ā́".normalize("NFD"))]
.map(n => n.toString(16).padStart(4, '0'))
.map(s => `U+${s}`)
.join(", ")
console.log(codes); // "U+0061, U+0304, U+0301"For other options, see the MDN documentation listed above.