Unicode

Character representation

A brief history of coding systems
Representation
Efficiency
Unicode
A theme: Good systems design can achieve excellent tradeoffs between efficiency and representation!

Coding systems

A coding system represents human language in a form other than speech
- Standardization!
Any written language is a coding system
- Alphabetic (a mark represents a part of a sound; Latin, Cyrillic, Coptic, Korean)
- Syllabic (a mark represents a sound combination; some Japanese scripts)
- Logographic (a mark represents a word; Chinese Hanzi)
Representations can be translated into new forms!

Chappe semaphore, 1790s

Letters → arrangements of mechanical arms visible from far away

Braille alphabet, 1820s

Letters and letter combinations → patterns of dots perceptible to touch

Japanese Braille, 1880s

Braille’s symbols repurposed to a completely different syllabary
Issues of ambiguity and misinterpretation
- Meaning of Braille symbols depends on context
- How to unambiguously interpret a document combining French and Japanese Braille symbols?

Efficiency vs. representation

Efficiency and parsimony are essential goals for computer systems
- Represent data in the smallest space
- Less expensive, more capacity
- Take less time to transmit
Efficiency and parsimony are also essential for humans
- Human limitations in distinguishing marks
- Braille formerly had more symbols, distinguished by the use of dashes in some positions as well as dots (a ternary-like system), but “by the second edition in 1837 [Louis Braille] had discarded the dashes because they were too difficult to read.” ref
Efficiency and parsimony can conflict with representation!

Telegraphy, 1930s

Telegraphy uses a binary coding system (dot + dash)
2⁵ = 32 distinct patterns
Complex “shift” system multiplexes some patterns
- 11111 01110 11011 01110 = “C:”
- So the 01110 pattern means different things depending on context
- Vulnerable to error: a misinterpreted symbol can change meanings of all future symbols
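
A minimal sketch of why shift states are fragile. The table holds only the one pattern from the example above (it is not a full telegraph code), and the names `TABLE` and `decode` are made up for illustration.

```python
# Shift patterns and the single data pattern used in the slide's example.
LETTERS_SHIFT = "11111"
FIGURES_SHIFT = "11011"

# pattern -> (meaning in letters mode, meaning in figures mode)
TABLE = {
    "01110": ("C", ":"),
}

def decode(patterns):
    """Decode 5-bit pattern strings while tracking the current shift state."""
    mode = 0  # 0 = letters, 1 = figures; assume letters mode at the start
    out = []
    for p in patterns:
        if p == LETTERS_SHIFT:
            mode = 0
        elif p == FIGURES_SHIFT:
            mode = 1
        else:
            out.append(TABLE[p][mode])
    return "".join(out)

message = "11111 01110 11011 01110".split()
print(decode(message))   # C:

# Lose the figures-shift symbol in transit and every later symbol changes.
damaged = [p for p in message if p != FIGURES_SHIFT]
print(decode(damaged))   # CC
```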

BCDIC, 1930s

Derived from punched-card codes originally developed for the US Census in the late 1800s

ASCII, 1960s

American Standard Code for Information Interchange
Foundation for many national standards
Controversial at the time

MORE THAN 64 CHARACTERS!

America and the world

ISO (International Organization for Standardization) adopted ASCII as ISO/IEC 646, with a caveat
The characters [ \ ] { | } (and, to a lesser extent, ^ ~ # $ @ ') were “reserved for national use”
Different governments used those slots to represent critical characters in their languages
- Different nations speaking the same language made different choices!

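| Encoding | à | â | ç | É | é | ê | è | î | ô | ù | û | £ | ° | § | ¨ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| French (ISO-IR-025) | `@` | N/A | `\` | N/A | `{` | N/A | `}` | N/A | N/A | `\|` | N/A | `#` | `[` | `]` | `~` |
| Canadian French #1 (ISO-IR-121) | `@` | `[` | `\` | N/A | `{` | `]` | `}` | `^` | `` ` `` | `\|` | `~` | N/A | N/A | N/A | N/A |
| Canadian French #2 (ISO-IR-122) | `@` | `[` | `\` | `^` | `{` | `]` | `}` | N/A | `` ` `` | `\|` | `~` | N/A | N/A | N/A | N/A |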

The meaning of an encoded text depends on national context!
- Does ^le mean that, or Éle (Canadian #2), or île (Canadian #1)?
- Humans can sort of adapt, but it’s painful
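
The ambiguity can be reproduced directly. This sketch copies the three national mappings from the table above into dictionaries keyed by the ASCII character that occupies each slot. The dictionary names and the `render` helper are illustrative, not part of any standard library.

```python
# ASCII slot -> character displayed under each national variant (from the table).
FRENCH = {"@": "à", "\\": "ç", "{": "é", "}": "è", "|": "ù",
          "#": "£", "[": "°", "]": "§", "~": "¨"}
CANADIAN_1 = {"@": "à", "[": "â", "\\": "ç", "{": "é", "]": "ê",
              "}": "è", "^": "î", "`": "ô", "|": "ù", "~": "û"}
# Canadian #2 differs from #1 only in what the ^ slot means.
CANADIAN_2 = {**CANADIAN_1, "^": "É"}

def render(text, variant):
    """Show how the same 7-bit text displays under a given national variant."""
    return "".join(variant.get(ch, ch) for ch in text)

print(render("^le", FRENCH))      # ^le
print(render("^le", CANADIAN_1))  # île
print(render("^le", CANADIAN_2))  # Éle
```

The stored text is identical in all three cases. Only the reader's assumption about national context changes.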

C

{ a[i] = '\n'; }

C?

ä aÄiÜ = 'Ön'; ü

…C?

??< a??(i??) = '??/n'; ??>
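
The three renderings above are the same program under different substitutions. In this sketch the `GERMAN` table is inferred only from the two slides above (it lists just the five slots that appear there), and `TRIGRAPHS` lists just the five trigraphs shown; `substitute` is an illustrative helper.

```python
# National-use slots as displayed on a German-variant terminal (inferred from the slide).
GERMAN = {"{": "ä", "[": "Ä", "]": "Ü", "\\": "Ö", "}": "ü"}

# C trigraph spellings for the same five characters.
TRIGRAPHS = {"{": "??<", "[": "??(", "]": "??)", "\\": "??/", "}": "??>"}

def substitute(text, table):
    """Replace each character that has an entry in table; keep the rest."""
    return "".join(table.get(ch, ch) for ch in text)

source = "{ a[i] = '\\n'; }"   # the C statement, with a literal backslash

print(substitute(source, GERMAN))     # ä aÄiÜ = 'Ön'; ü
print(substitute(source, TRIGRAPHS))  # ??< a??(i??) = '??/n'; ??>
```

Trigraphs avoid the national-use slots entirely, at an obvious cost to readability.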

What about that 7th bit?

Error detection!
Parity bit (checksum)
- Bit 7 = Bit 6 ^ Bit 5 ^ Bit 4 ^ Bit 3 ^ Bit 2 ^ Bit 1 ^ Bit 0
- Can detect any single-bit-flip error (“hit”)
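
A short sketch of the parity rule above, assuming even parity (bit 7 is set so that the total number of 1 bits is even). The function names are illustrative.

```python
def add_parity(code):
    """Return the 8-bit value whose bit 7 is the XOR of bits 0-6."""
    if not 0 <= code < 128:
        raise ValueError("expected a 7-bit code")
    parity = bin(code).count("1") & 1   # XOR of all seven data bits
    return code | (parity << 7)

def parity_ok(byte):
    """True if the byte has an even number of 1 bits, i.e. no error detected."""
    return bin(byte).count("1") % 2 == 0

sent = add_parity(ord("C"))             # 'C' is 0x43, three 1 bits
print(hex(sent))                        # 0xc3
print(parity_ok(sent))                  # True

# Any single flipped bit is detected...
print(all(not parity_ok(sent ^ (1 << b)) for b in range(8)))   # True
# ...but two flipped bits cancel out and go unnoticed.
print(parity_ok(sent ^ 0b11))           # True
```

Parity only reports that something went wrong. It cannot say which bit flipped, so the receiver has to ask for retransmission.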

MORE THAN 128 CHARACTERS!

Telephone equipment gets better, errors less frequent, storage cheaper
Use slots 128-255 for accented characters and additional symbols

ISO 8859

Can represent most texts for many Western languages
The 7-bit subset agrees with ASCII
- Everyone can write C as intended!
But not all Western languages are supported
- Different versions re-encode the upper 128 characters for other languages or scripts (Greek, Cyrillic)
- ISO 8859-1 becomes the most common and the default, but the choice of languages it represents seems odd to us

Ísland sigurinn

Icelandic: supported by ISO 8859-1
- Ý ý Þ þ Ð ð
- ~360,000 speakers
Turkish: not supported by ISO 8859-1
- ı I (undotted I), i İ (dotted İ), ş Ş ğ Ğ
- ~88,000,000 speakers!
ISO 8859-9 replaces the Icelandic characters with the Turkish ones
- Still have ambiguity!
- Still unclear how to represent a text with both Icelandic and Turkish
  - Metadata gives a different encoding for each byte range?
  - Additional shift characters, as in telegraphy, change from language to language?
  - No good choices
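
The conflict is easy to see with Python's built-in codecs. The six byte values below are the positions where ISO 8859-1 and ISO 8859-9 disagree; `can_encode` is an illustrative helper.

```python
# The six byte values that mean different things in the two encodings.
data = bytes([0xD0, 0xDD, 0xDE, 0xF0, 0xFD, 0xFE])

print(data.decode("iso8859_1"))   # ÐÝÞðýþ  (Icelandic letters)
print(data.decode("iso8859_9"))   # ĞİŞğış  (Turkish letters)

def can_encode(text, encoding):
    """True if every character of text has a slot in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

mixed = "Þ ş"   # one Icelandic letter, one Turkish letter
print(can_encode(mixed, "iso8859_1"))   # False
print(can_encode(mixed, "iso8859_9"))   # False
```

Neither encoding can hold the mixed text, which is exactly the problem the bullets above describe.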

The panda in the room

Hanzi
- Chinese logographs
More than 50,000 in use!
Hard to cram that into 128 bit patterns
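
A back-of-the-envelope check of the size problem, using the figure of 50,000 characters from above. The helper name is illustrative.

```python
def bits_needed(n_symbols):
    """Smallest fixed width, in bits, that gives every symbol its own pattern."""
    return max(1, (n_symbols - 1).bit_length())

print(bits_needed(128))      # 7
print(bits_needed(256))      # 8
print(bits_needed(50_000))   # 16
print(2 ** 16)               # 65536
```

So a fixed-width code covering 50,000 logographs needs at least 16 bits per character, twice the width of an 8-bit code.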

Unicode: the dream

Represent all the world’s languages in a single system
Every character has one unambiguous encoding
- No ambiguity in interpretation
- Any text can contain fragments of any language without external metadata or internal shift patterns
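
In a language whose strings are sequences of Unicode code points, this is directly visible. The sketch below uses Python's `ord` to list the code point of each character in a string mixing Icelandic and Turkish letters; `code_points` is an illustrative helper.

```python
def code_points(text):
    """List each character's Unicode code point in the usual U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Icelandic Þ þ and Turkish ş ı side by side, with no metadata and no shifts.
print(code_points("Þþşı"))   # ['U+00DE', 'U+00FE', 'U+015F', 'U+0131']
```

Each letter has its own number, so no reader needs to know which language a fragment came from.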

Unicode 1

65,536 slots!
Fixed-width two-byte encoding
Enabled by “Han unification”
- Mapping of characters in different Hanzi national character sets into a set without duplicates
- Based on work by librarians and others: Taiwan’s Chinese Character Code for Information Interchange (CCCII); the Research Libraries Information Network’s East Asian Character Code (EACC); etc.
- Continued by a Unicode-convened Joint Research Group, working with experts from China, Japan, Korea (and now Vietnam, Taiwan) (reference)
Representation problems
- Some living scripts not yet included (Khmer, Mongolian, Cherokee)
- Some scripts deliberately excluded (historic scripts)
- Han unification successfully encoded most characters commonly used by people overall, but not some characters commonly used by particular people—such as characters for writing surnames!

Unicode 2

1,114,112 slots!
Non-fixed-width encoding
- Mapping and interpretation difficulties
- Waste
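
Where the number 1,114,112 comes from: the code space is 17 "planes" of 65,536 code points each, running from 0 to 0x10FFFF. The plane structure is not stated on the slide; it is added here as background. `is_code_point` is an illustrative helper.

```python
import sys

PLANES = 17            # background fact, not from the slide
PLANE_SIZE = 2 ** 16   # the 65,536 slots of Unicode 1

def is_code_point(n):
    """True if n lies inside the Unicode code space 0..0x10FFFF."""
    return 0 <= n < PLANES * PLANE_SIZE

print(PLANES * PLANE_SIZE)                          # 1114112
print(hex(PLANES * PLANE_SIZE - 1))                 # 0x10ffff
print(sys.maxunicode == PLANES * PLANE_SIZE - 1)    # True
print(is_code_point(0x10FFFF), is_code_point(0x110000))   # True False
```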

UTF-16

Unicode Transformation Format, 16 bits
Goal: Represent every character as a fixed width unit!
Problem: Characters aren’t fixed width any more
Problem: Economic harm!
- A large fraction of the world’s text can use a shorter encoding
- 16-bit encoding expands English texts by 2x compared with an 8-bit encoding such as ASCII
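
Both problems can be seen in a few lines. This sketch computes UTF-16 code units by hand: characters below 0x10000 take one 16-bit unit, and everything above is split into a pair of "surrogate" units. The function name is illustrative, and the result is checked against Python's built-in codec.

```python
def utf16_units(ch):
    """Return the list of 16-bit code units that UTF-16 uses for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                      # fits in a single unit
    cp -= 0x10000                        # 20 bits remain
    high = 0xD800 + (cp >> 10)           # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)          # bottom 10 bits
    return [high, low]

print([hex(u) for u in utf16_units("A")])            # ['0x41']
print([hex(u) for u in utf16_units("\U0001F600")])   # ['0xd83d', '0xde00']

# Not fixed width: one character, four bytes.
print("\U0001F600".encode("utf-16-be").hex())        # d83dde00

# Economic harm: English text doubles in size.
print(len("hello".encode("ascii")), len("hello".encode("utf-16-le")))   # 5 10
```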

UTF-8

Variable-width encoding for characters
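
A minimal hand-written UTF-8 encoder, to show how the width varies from one to four bytes. It is a sketch for illustration (it does not reject surrogate code points, for instance); in practice `str.encode("utf-8")` does this job.

```python
def utf8_encode(cp):
    """Encode one code point (0..0x10FFFF) as UTF-8 bytes."""
    if cp < 0x80:                        # 7 bits: same byte as ASCII
        return bytes([cp])
    if cp < 0x800:                       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 21 bits: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in ["A", "é", "漢", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    # Compare against Python's own encoder and show the byte count.
    print(f"U+{ord(ch):04X}", encoded.hex(), len(encoded), encoded == ch.encode("utf-8"))
```

English text stays at one byte per character, so the 2x expansion of a 16-bit encoding disappears, while every code point remains representable.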
