Comparing strings - What can go wrong #unicode
Published 10/19/2021
Experiment
For this experiment, have a Mac ready and open this sandbox: https://codesandbox.io/s/string-comparison-unicode-bl9q7.
Create a file with the same name as the variable NAME_FILE_LIKE_THIS (Jalapeño.txt) and upload it to the sandbox. The onChange event gets triggered, and the uploaded file name is logged to the console and compared with the variable.
You would assume they match: the console clearly logs "Jalapeño.txt". And on Windows, they do match. But on Macs, they don't...
Why? To understand what's happening, spread out the variable "name" like this in the onChange event: console.log(...name).
The result is: J a l a p e n ̃ o . t x t. The ñ got split into two characters!
You can observe similar behavior with Japanese words, or with any word that contains diacritics.
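If the spread output is hard to read (the combining mark tends to glue itself to whatever sits next to it), printing code points makes the split obvious. This is only a minimal sketch: codePoints is a hypothetical helper of mine, not part of the sandbox, and the two strings are written with escape sequences so that your editor can't silently change them.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(str) {
  return [...str].map(
    (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Assumed inputs: the same word, spelled two different ways.
const typed = 'Jalape\u00F1o';     // ñ as a single code point
const uploaded = 'Jalapen\u0303o'; // n followed by a combining tilde

console.log(codePoints(typed).join(' '));
console.log(codePoints(uploaded).join(' '));
```

The second line of output contains one more entry than the first: U+006E (n) followed by U+0303 (the combining tilde) where the first has U+00F1.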
What's happening?
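There are two ways to represent a character like ñ in Unicode: precomposed (ñ as a single code point, U+00F1), which is usually what you get when you type, and decomposed (n followed by a combining tilde, U+0303). Both render identically, but they are different strings. When you upload a file on a Mac, the filename arrives in its decomposed form.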
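You don't need a Mac to see the mismatch. Here is a small sketch that builds both representations by hand with escape sequences; the variable names are mine, not the sandbox's.

```javascript
// Same visible text, two different sequences of code points.
const precomposedName = 'Jalape\u00F1o.txt'; // ñ as one code point
const decomposedName = 'Jalapen\u0303o.txt'; // n + combining tilde

console.log(precomposedName === decomposedName); // false
console.log(precomposedName.length, decomposedName.length); // 12 13
```

A strict comparison works on the underlying code units, so two strings that look the same on screen can still be unequal.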
What's the fix?
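You can convert a string to either representation using String.prototype.normalize: normalize('NFD') returns the decomposed form, and normalize() (which defaults to 'NFC') returns the precomposed form. Normalize both strings to the same form before comparing them.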
const decomposed = [...'Jalapeño'.normalize('NFD')]
// (9) ['J', 'a', 'l', 'a', 'p', 'e', 'n', '̃', 'o']
const precomposedAgain = [...decomposed.join('').normalize()]
// (8) ['J', 'a', 'l', 'a', 'p', 'e', 'ñ', 'o']
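Applied to the experiment, the comparison could look roughly like this. Treat it as a sketch under assumptions: sameName is a hypothetical helper, uploadedName stands in for whatever name the onChange handler receives on a Mac, and NFC is simply the form I picked (any form works as long as both sides use the same one).

```javascript
// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameName(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const NAME_FILE_LIKE_THIS = 'Jalape\u00F1o.txt'; // precomposed, as typed
const uploadedName = 'Jalapen\u0303o.txt';       // assumed decomposed name from the upload

console.log(uploadedName === NAME_FILE_LIKE_THIS);         // false
console.log(sameName(uploadedName, NAME_FILE_LIKE_THIS));  // true
```

Normalizing both sides, rather than only the uploaded name, means the check keeps working even if the constant itself was saved in decomposed form.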