EthiCS: Unicode

Character representation

A brief history of coding systems, considering two sets of values in tension:
- Representation and equality
- Efficiency and unambiguity

Coding systems

A coding system represents human language in a form other than speech
- Standardization!
Any written language is a coding system
- Alphabetic (a mark represents a part of a sound; Latin, Cyrillic, Coptic, Korean)
- Syllabic (a mark represents a sound combination; some Japanese scripts)
- Logographic (a mark represents a word; Chinese Hanzi)
Representations can be translated into new forms!

Chappe semaphore, 1790s

Letters → arrangements of mechanical arms visible from far away

Braille alphabet, 1820s

Letters and letter combinations → patterns of dots perceptible to touch

Japanese Braille, 1880s

Braille’s symbols repurposed to a completely different syllabary
Issues of ambiguity and misinterpretation
- Meaning of Braille symbols depends on context
- How to unambiguously interpret a document combining French and Japanese Braille symbols?

Efficiency vs. representation

Efficiency and parsimony are essential goals for computer systems
- Represent data in the smallest space
- Less expensive, more capacity
- Take less time to transmit
Efficiency and parsimony are also essential for humans
- Human limitations in distinguishing marks
- Braille formerly had more symbols, distinguished by the use of dashes in some positions as well as dots (a ternary-like system), but “by the second edition in 1837 [Louis Braille] had discarded the dashes because they were too difficult to read.” ref
Efficiency and parsimony can conflict with representation!

Telegraphy, 1930s

Telegraphy uses a binary coding system (dot + dash)
Each character is five binary units: 2⁵ = 32 distinct patterns
Complex “shift” system multiplexes some patterns
- 11111 01110 11011 01110 = “C:”
- So the 01110 pattern means different things depending on context
- Vulnerable to error: a misinterpreted symbol can change meanings of all future symbols
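
The shift mechanism can be sketched as a tiny stateful decoder. This is a minimal illustration, not a full telegraph alphabet: the table holds only the 01110 pattern and the two shift patterns from the example above, and the names `decode`, `TABLE`, and the mode labels are hypothetical.

```python
LETTERS_SHIFT = "11111"  # switch to letters mode
FIGURES_SHIFT = "11011"  # switch to figures mode

# Partial table: only the one pattern discussed above.
TABLE = {"01110": {"letters": "C", "figures": ":"}}


def decode(patterns, mode="letters"):
    """Decode 5-unit patterns, tracking the current shift mode."""
    out = []
    for pattern in patterns:
        if pattern == LETTERS_SHIFT:
            mode = "letters"
        elif pattern == FIGURES_SHIFT:
            mode = "figures"
        else:
            out.append(TABLE[pattern][mode])
    return "".join(out)


message = "11111 01110 11011 01110".split()
print(decode(message))

# A single flipped unit turns the figures shift into a letters shift,
# which changes the meaning of every pattern that follows it.
corrupted = list(message)
corrupted[2] = "11111"
print(decode(corrupted))
```

The decoder has to carry state (`mode`) from one pattern to the next, which is exactly why one damaged shift pattern is so costly.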

BCDIC, 1930s

Derived from punched-card codes originally developed for the US Census in the late 1800s

ASCII, 1960s

American Standard Code for Information Interchange
Foundation for many national standards
Controversial at the time

MORE THAN 64 CHARACTERS!

America and the world

ISO (International Organization for Standardization) adopted ASCII as ISO/IEC 646, with a caveat
The characters [ \ ] { | } (and, to a lesser extent, ^ ~ # $ @ `) were “reserved for national use”
Different governments used those slots to represent critical characters in their languages
- Different nations speaking the same language made different choices!

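| | à | â | ç | É | é | ê | è | î | ô | ù | û | £ | ° | § | ¨ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| French (ISO-IR-025) | `@` | N/A | `\` | N/A | `{` | N/A | `}` | N/A | N/A | `\|` | N/A | `#` | `[` | `]` | `~` |
| Canadian French #1 (ISO-IR-121) | `@` | `[` | `\` | N/A | `{` | `]` | `}` | `^` | `` ` `` | `\|` | `~` | N/A | N/A | N/A | N/A |
| Canadian French #2 (ISO-IR-122) | `@` | `[` | `\` | `^` | `{` | `]` | `}` | N/A | `` ` `` | `\|` | `~` | N/A | N/A | N/A | N/A |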

The meaning of an encoded text depends on national context!
- Does ^le mean that, or Éle (Canadian #2), or île (Canadian #1)?
- Humans can sort of adapt, but it’s painful
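
The ambiguity can be made concrete with a short sketch that reads the same 7-bit text under each national variant. The mappings are copied from the table above; the names `FRENCH`, `CANADIAN_1`, `CANADIAN_2`, and `read_as` are hypothetical, and each dictionary covers only the reassigned slots.

```python
# Reassigned ASCII slots for each national variant (from the table above).
FRENCH = {"@": "à", "\\": "ç", "{": "é", "}": "è", "|": "ù",
          "#": "£", "[": "°", "]": "§", "~": "¨"}
CANADIAN_1 = {"@": "à", "[": "â", "\\": "ç", "{": "é", "]": "ê",
              "}": "è", "^": "î", "`": "ô", "|": "ù", "~": "û"}
CANADIAN_2 = dict(CANADIAN_1)
CANADIAN_2["^"] = "É"  # the one slot where the two Canadian sets differ


def read_as(text, variant):
    """Interpret 7-bit text according to one national variant."""
    return text.translate(str.maketrans(variant))


for name, variant in [("French", FRENCH),
                      ("Canadian #1", CANADIAN_1),
                      ("Canadian #2", CANADIAN_2)]:
    print(name, read_as("^le", variant))
```

Nothing in the stored text says which table to use, so the reader has to know the national context in advance.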

C

{ a[i] = '\n'; }

C?

ä aÄiÅ = 'Ön'; å

The Swedish national character set uses the {[]\} code points for the Swedish letters äÄÅÖå
A Swedish programmer would choose how to configure their computer, and either write C in a crazy style using letters, or write Swedish in a crazy style using punctuation

…C?

??< a??(i??) = '??/n'; ??>

The C standards body introduced a workaround that almost everyone hated
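
The workaround (C "trigraphs") is a fixed substitution of three-character sequences for the nine characters that national character sets might lack. A minimal sketch of that substitution, assuming a single left-to-right pass; the function name `replace_trigraphs` is hypothetical, and a real C preprocessor does this as part of a larger translation pipeline.

```python
import re

# The nine C trigraphs: "??" followed by one of these characters.
TRIGRAPHS = {
    "=": "#", "(": "[", "/": "\\", ")": "]", "'": "^",
    "<": "{", "!": "|", ">": "}", "-": "~",
}

_PATTERN = re.compile(r"\?\?([=(/)'<!>-])")


def replace_trigraphs(source):
    """Replace each trigraph with the single character it stands for."""
    return _PATTERN.sub(lambda m: TRIGRAPHS[m.group(1)], source)


print(replace_trigraphs("??< a??(i??) = '??/n'; ??>"))
```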

What about that bit 7?

ASCII used 7 bits; there’s another bit
ASCII designers suggested it be used for error detection
Parity bit (checksum)
- Bit 7 = Bit 6 ^ Bit 5 ^ Bit 4 ^ Bit 3 ^ Bit 2 ^ Bit 1 ^ Bit 0
- Can detect any single-bit-flip error (“hit”)
Telephone equipment gets better, errors less frequent, storage cheaper—parity bit is less important
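
The parity rule above can be sketched in a few lines. This is an illustration of even parity as described on this slide; the function names `add_parity` and `parity_ok` are hypothetical.

```python
def add_parity(code):
    """Set bit 7 to the XOR of bits 0-6 of a 7-bit ASCII code."""
    assert 0 <= code < 128
    parity = bin(code).count("1") & 1  # XOR of all seven data bits
    return code | (parity << 7)


def parity_ok(byte):
    """A received byte is consistent if it has an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0


sent = add_parity(ord("C"))        # 'C' is 1000011: three 1 bits, so bit 7 is set
print(sent, parity_ok(sent))

one_hit = sent ^ 0b0000100         # flip one bit in transit
print(parity_ok(one_hit))

two_hits = sent ^ 0b0000110        # flip two bits in transit
print(parity_ok(two_hits))
```

A single flipped bit is always caught, but two flipped bits cancel out and go unnoticed, and in neither case does the receiver learn which bit was wrong.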

MORE THAN 128 CHARACTERS!

Use slots 128-255 for accented characters and additional symbols

ISO 8859

Can represent most texts for many Western languages
The 7-bit subset agrees with ASCII
- Everyone can write C as intended!
But not all Western languages are supported
- Different versions re-encode the upper 128 characters for other languages or scripts (Greek, Cyrillic)
- ISO 8859-1 becomes the most common and the default, but the choice of languages it represents seems odd to us

Ísland sigurinn

Icelandic: supported by ISO 8859-1
- Ý ý Þ þ Ð ð
- ~360,000 speakers
Turkish: not supported by ISO 8859-1
- ı I (undotted I), i İ (dotted İ), ş Ş ğ Ğ
- ~88,000,000 speakers!
ISO 8859-9 replaces the Icelandic characters with the Turkish ones
- Still have ambiguity!
- Still unclear how to represent a text with both Icelandic and Turkish
  - Metadata gives a different encoding for each byte range?
  - Additional shift characters, as in telegraphy, change from language to language?
  - No good choices
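
The ambiguity is easy to reproduce: the same bytes decode to Icelandic letters under ISO 8859-1 and to Turkish letters under ISO 8859-9. A minimal sketch using Python's built-in `iso8859_1` and `iso8859_9` codecs; the byte values chosen are just the six slots where the two standards differ.

```python
# The six byte values that ISO 8859-9 reassigns from Icelandic to Turkish.
data = bytes([0xD0, 0xDD, 0xDE, 0xF0, 0xFD, 0xFE])

as_latin1 = data.decode("iso8859_1")  # Western European (includes Icelandic)
as_latin5 = data.decode("iso8859_9")  # Turkish variant

print(as_latin1)
print(as_latin5)

# Every other byte value means the same thing in both encodings.
shared = bytes(b for b in range(0xA0, 0x100) if b not in data)
print(shared.decode("iso8859_1") == shared.decode("iso8859_9"))
```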

The panda in the room

Hanzi
- Chinese logographs
More than 50,000 in use!
Hard to cram that into 128 slots

Unicode: the dream

Represent all the world’s languages in a single system
Every character has one unambiguous encoding
- No ambiguity in interpretation
- Any text can contain fragments of any language without external metadata or internal shift patterns

Unicode 1

65,536 slots!
Fixed-width two-byte encoding
Enabled by “Han unification”
- Mapping of characters in different Hanzi national character sets into a set without duplicates
- Based on work by librarians and others: Taiwan’s Chinese Character Code for Information Interchange (CCCII); the Research Libraries Information Network’s East Asian Character Code (EACC); etc.
- Continued by a Unicode-convened Joint Research Group, working with experts from China, Japan, Korea (and now Vietnam, Taiwan) (reference)
Representation problems
- Some scripts not included (Khmer, Mongolian, Cherokee)
- Some scripts excluded (historic scripts)
- Han unification successfully encoded most characters commonly used by people overall, but not some characters commonly used by particular people—such as characters for writing surnames!

Unicode 2

1,114,112 slots!
- More room for unencoded characters and languages
But fixed-width encoding seems really unwise
- Too big for 2-byte encoding
- 4-byte encoding supremely wasteful (though some programming languages do use it internally…)
- 3-byte encoding? Not a power of two??
Non-fixed-width encoding
- Mapping and interpretation difficulties
- Waste

UTF-16

Unicode Transformation Format, 16 bits
Goal: Represent every character as a fixed-width unit!
Problem: Characters aren’t fixed-width any more
Problem: Representational ambiguity (byte order)
Problem: Economic harm!
- Much of the text in the world can use a shorter encoding
- 16-bit encoding expands English texts by 2x
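
All three problems show up in a short experiment. This sketch relies on Python's built-in `utf-16-le` and `utf-16-be` codecs; the helper `surrogate_pair` is a hypothetical name for the standard arithmetic that splits a code point above U+FFFF into two 16-bit units.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into two 16-bit units."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# Economic harm: English text doubles in size.
english = "hello"
ascii_bytes = english.encode("ascii")
utf16_bytes = english.encode("utf-16-le")
print(len(ascii_bytes), len(utf16_bytes))

# Byte order: the same character has two possible byte sequences.
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")
print(little, big)

# Not fixed-width: U+1F48F KISS needs two 16-bit units.
high, low = surrogate_pair(0x1F48F)
print(hex(high), hex(low))
```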

A more perfect encoding

There can be no perfect encoding for all situations!
But there may be an encoding that offers less stark tradeoffs
- Better interoperability and equality of representation than ASCII
- Better efficiency and unambiguity than UTF-16
- Better overall?

UTF-8

Variable-width, byte-based encoding for characters
- Contrast UTF-16: variable-width, 16-bit-based encoding
Every ASCII character represents itself, unambiguously
- Every occurrence of the byte value 65 represents the character A
Non-ASCII characters are represented by short byte sequences of values ≥128
- Bytes from 0xC2…0xF4 start a multi-byte sequence; bytes from 0x80…0xBF continue the sequence
- Example: 0xC3 0xA5 means å
Sketched on a placemat in 1992 by Ken Thompson, one of the Unix inventors (citation)
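
The byte layout can be sketched as a hand-written encoder. This is an illustration only: the function name `utf8_encode` is hypothetical, it does not reject surrogate code points, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes."""
    cp = code_point
    if cp < 0x80:          # ASCII: one byte, value unchanged
        return bytes([cp])
    if cp < 0x800:         # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(ord("A")).hex())
print(utf8_encode(ord("å")).hex())
print(utf8_encode(0x1F469).hex())
```

The lead byte alone says how long the sequence is, and continuation bytes can never be mistaken for lead bytes or for ASCII.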

Advantages of UTF-8

Compatible with existing software and libraries
- Every ASCII file is also a UTF-8 file with the same meaning
- UTF-8 uses the 0 byte only to encode U+0000 (NUL) itself, so existing C library functions continue to work on UTF-8 strings! (UTF-16 texts contain tons of zero bytes)
Resistant to errors
- No synchronization issues: an error in one byte affects at most one character
- Contrast the telegraphy system, where an error in a “shift byte” can affect the interpretation of all future characters
Relatively efficient
- Unicode code points U+0080–U+07FF can be represented in two bytes
- So texts in European Latin script, African Latin script, Greek script, Cyrillic script, and most Arabic-script languages take no more space than in UTF-16
- But other scripts may take more space than UTF-16: Brahmic (Indic), Han, Japanese, Korean
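
These claims can be checked directly with Python's built-in codecs. The sample strings and the helper name `sizes` are arbitrary choices for illustration; byte counts exclude any byte order mark.

```python
samples = {
    "English": "hello",
    "Greek": "αβγ",
    "Russian": "привет",
    "Chinese": "你好",
    "Japanese": "こんにちは",
}


def sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    print(name, sizes(text))

# Compatibility: no zero bytes in UTF-8 text, plenty in UTF-16.
mixed = "C: île 你好"
has_zero_utf8 = 0 in mixed.encode("utf-8")
has_zero_utf16 = 0 in mixed.encode("utf-16-le")
print(has_zero_utf8, has_zero_utf16)

# Error resistance: damage one byte, and the characters after it survive.
damaged = bytearray("añb".encode("utf-8"))
damaged[1] = 0x31
recovered = bytes(damaged).decode("utf-8", errors="replace")
print(recovered)
```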

How can we tell UTF-8 is good?

Now over 95% of web pages!

Emoji

If you’re interested in digging deeper into encoding issues, consider emoji
Intersection between technical and social issues
Many of the same high-level issues encountered in encoding written language recur
- Similar solutions recur too

Emoji history

People have built tiny pictures from punctuation for years, including artists and poets but also scientists and engineers
- September 1982: A message thread at CMU spins out of control. Tone is hard to communicate electronically!
- “Maybe we should adopt a convention of putting a star (*) in the subject field of any notice which is to be taken as a joke.”
- “I propose the following character sequence for joke markers: :-) Read it sideways.” (citation)
Modern emoji were invented for cell phones in Japan by 1997
- Carriers competed on their emoji sets
- Initially not interoperable!

Emoji and Unicode

Emoji became so popular that Unicode encoding was inevitable
Enormous technical challenges
- First characters where color is important—font standards require changes
Social challenges: Which emoji deserve representation?
- Unicode code points are a finite resource
- The powerful start lobbying

Will a fixed emoji set suffice?

Issues of unequal representation
Emoji sets from Japan used a light skin tone for many characters

How to represent varieties of skin tone? Some designers went for non-representational tones (gray, bright yellow), but that didn’t suffice
Make 5x as many emoji? Why stop at 5x?

Emoji and culture

Is character encoding normative?
- Images that seem inoffensive in one culture and time aren’t appropriate in others
- People interpret the default emoji set as a statement by society, or by the computer industry, of what is right or most normal
- Consider some characters added to Unicode 6.0 in 2010, adopted from Japanese mobile phone emoji sets:
  
  - U+1F48F KISS
  - U+1F46F WOMAN WITH BUNNY EARS

Reducing undesirable signification

Emoji representations lose specificity

U+1F48F in 2010 U+1F48F in 2020
Emoji representations change appearance more fundamentally

U+1F46F in 2010 U+1F46F in 2020
- The standards now say the emoji is “most popularly depicted as two women dancing”; some redefine it to be gender neutral, as “people with bunny ears” or “party”

Representing more cultures

Erasing cultural differences is not the best way to achieve equality
- People want to be represented!
Impossible or inefficient to represent all important representations with individual code points
Try variable-length encoding?

Example: Kiss

U+1F48F KISS 💏 now represents gender-nonspecific people
To represent more specific people kissing, use a Unicode combiner, U+200D ZERO WIDTH JOINER, invented to represent scripts such as Arabic and Indic where sometimes characters are visually connected
“Kiss: Woman, Man” 👩‍❤️‍💋‍👨 is represented as:
- U+1F469 WOMAN 👩
- U+200D ZERO WIDTH JOINER
- RED HEART ❤️, which is represented as…
  - U+2764 HEAVY BLACK HEART ❤︎
  - U+FE0F EMOJI VARIATION SELECTOR (formally VARIATION SELECTOR-16)
- U+200D ZERO WIDTH JOINER
- U+1F48B KISS MARK 💋
- U+200D ZERO WIDTH JOINER
- U+1F468 MAN 👨
- That might seem crazy, but look at the HTML source using hexdump -C to verify. It takes 27 bytes: f0 9f 91 a9 (WOMAN) e2 80 8d (ZWJ) e2 9d a4 (HEAVY BLACK HEART) ef b8 8f (EMOJI VARIATION SELECTOR) e2 80 8d (ZWJ) f0 9f 92 8b (KISS MARK) e2 80 8d (ZWJ) f0 9f 91 a8 (MAN).
But this representation naturally generalizes to “Kiss: Man, Man” and “Kiss: Woman, Woman”! (Or “Kiss: Male Astronaut, Female Zombie”)
Other sequences can represent skin tone, hair color, other features
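
The sequence can be assembled and measured in a few lines. This is a minimal sketch; the function name `kiss` and the constant names are hypothetical, and whether a given combination renders as a single glyph depends on the font and platform.

```python
ZWJ = "\u200D"                 # ZERO WIDTH JOINER
RED_HEART = "\u2764\uFE0F"     # HEAVY BLACK HEART + emoji variation selector
KISS_MARK = "\U0001F48B"
WOMAN = "\U0001F469"
MAN = "\U0001F468"


def kiss(first, second):
    """Join two person emoji into a kiss sequence with zero width joiners."""
    return ZWJ.join([first, RED_HEART, KISS_MARK, second])


woman_man = kiss(WOMAN, MAN)
print(woman_man)
print(len(woman_man))                    # number of code points
print(len(woman_man.encode("utf-8")))    # number of UTF-8 bytes
print(woman_man.encode("utf-8").hex(" "))

# The same construction generalizes to other pairs.
print(kiss(MAN, MAN), kiss(WOMAN, WOMAN))
```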

Presented without comment

From The Unicode Emoji technical standard:
