This is not the current version of the class.

# Unicode

## Character representation

• A brief history of coding systems
• Representation
• Efficiency
• Unicode

• A theme: Good systems design can achieve excellent tradeoffs between efficiency and representation!

## Coding systems

• A coding system represents human language in form other than speech
• Standardization!
• Any written language is a coding system
• Alphabetic (a mark represents a part of a sound; Latin, Cyrillic, Coptic, Korean)
• Syllabic (a mark represents a sound combination; some Japanese scripts)
• Logographic (a mark represents a word; Chinese Hanzi)
• Representations can be translated into new forms!

## Chappe semaphore, 1790s

Reference

• Letters → arrangements of mechanical arms visible from far away

## Braille alphabet, 1820s

• Letters and letter combinations → patterns of dots perceptible to touch

## Japanese Braille, 1880s

• Braille’s symbols repurposed to a completely different syllabary
• Issues of ambiguity and misinterpretation
• Meaning of Braille symbols depends on context
• How to unambiguously interpret a document combining French and Japanese Braille symbols?

## Efficiency vs. representation

• Efficiency and parsimony are essential goals for computer systems
• Represent data in the smallest space
• Less expensive, more capacity
• Take less time to transmit
• Efficiency and parsimony are also essential for humans
• Human limitations in distinguishing marks
• Braille formerly had more symbols, distinguished by the use of dashes in some positions as well as dots (a ternary-like system), but “by the second edition in 1837 [Louis Braille] had discarded the dashes because they were too difficult to read.” ref
• Efficiency and parsimony can conflict with representation!

## Telegraphy, 1930s

Reference

• Telegraphy uses a binary coding system (dot + dash)
• 25 = 32 distinct patterns
• Complex “shift” system multiplexes some patterns
• 11111 01110 11011 01110 = “C:”
• So the 01110 pattern means different things depending on context
• Vulnerable to error: a misinterpreted symbol can change meanings of all future symbols

## BCDIC, 1930s

Reference

• Derived from punched-card codes originally developed for the US Census in the late 1800s

## ASCII, 1960s

Reference

• American Standard Code for Information Interchange
• Foundation for many national standards
• Controversial at the time

Reference

## America and the world

• ISO (International Standards Organization) adopted ASCII as ISO/IEC 646, with a caveat
• The characters [ \ ] { | } (and, to a lesser extent, ^ ~ # \$ @ ') were “reserved for national use”
• Different governments used those slots to represent critical characters in their languages
• Different nations speaking the same language made different choices!
à â ç É é ê è î ô ù û £ ° § ¨
French (ISO-IR-025) @ N/A \ N/A { N/A } N/A N/A | N/A # [ ] ~
Canadian French #1 (ISO-IR-121) @ [ \ N/A { ] } ^  | ~ N/A N/A N/A N/A
Canadian French #2 (ISO-IR-122) @ [ \ ^ { ] } N/A  | ~ N/A N/A N/A N/A
• The meaning of an encoded text depends on national context!
• Does ^le mean that, or Éle (Canadian #2), or île (Canadian #1)?
• Humans can sort of adapt, but it’s painful

## C

{ a[i] = '\n'; }

## C?

ä aÄiÜ = 'Ön'; ü

## …C?

??< a??(i??) = '??/n'; ??>

## What about that 7th bit?

• Error correction!
• Parity bit (checksum)
• Bit 7 = Bit 6 ^ Bit 5 ^ Bit 4 ^ Bit 3 ^ Bit 2 ^ Bit 1 ^ Bit 0
• Can detect any single-bit-flip error (“hit”)

## MORE THAN 128 CHARACTERS!

• Telephone equipment gets better, errors less frequent, storage cheaper
• Use slots 128-255 for accented characters and additional symbols

Reference

## ISO 8859

• Can represent most texts for many Western languages
• The 7-bit subset agrees with ASCII
• Everyone can write C as intended!
• But not all Western languages are supported
• Different versions re-encode the upper 128 characters for other languages or scripts (Greek, Cyrillic)
• ISO 8859-1 becomes the most common and the default, but the choice of languages it represents seems odd to us

## Ísland sigurinn

• Icelandic: supported by ISO 8859-1
• Ỳ ỳ Þ þ Ð ð
• ~360,000 speakers
• Turkish: not supported by ISO 8859-1
• ı I (undotted I), i İ (dotted İ), ş Ş ğ Ğ
• ~88,000,000 speakers!
• ISO 8859-9 replaces the Icelandic characters with the Turkish ones
• Still have ambiguity!
• Still unclear how to represent a text with both Icelandic and Turkish
• Metadata gives a different encoding for each byte range?
• Additional shift characters, as in telegraphy, change from language to language?
• No good choices

## The panda in the room

• Hanzi
• Chinese logographs
• More than 50,000 in use!
• Hard to cram that into 128 bit patterns

## Unicode: the dream

• Represent all the world’s languages in a single system
• Every character has one unambiguous encoding
• No ambiguity in interpretation
• Any text can contain fragments of any language without external metadata or internal shift patterns

## Unicode 1

• 65,536 slots!
• Fixed-width two-byte encoding
• Enabled by “Han unification
• Mapping of characters in different Hanzi national character sets into a set without duplicates
• Based on work by librarians and others: Taiwan’s Chinese Character Code for Information Interchange (CCCII); the Research Libraries Information Network’s East Asian Character Code (EACC); etc.
• Continued by a Unicode-convened Joint Research Group, working with experts from China, Japan, Korea (and now Vietnam, Taiwan) (reference)
• Representation problems
• Some scripts not included (Khmer, Mongolian, Cherokee)
• Some scripts excluded (historic scripts)
• Han unification successfully encoded most characters commonly used by people overall, but not some characters commonly used by particular people—such as characters for writing surnames!

## Intense technical and social arguments

1

Have these people no shame?

This is what happens when a computing tradition that has never been able to move off ground-zero in associating 1 character to 1 glyph keeps grinding through the endless lists of variants, mistakes, rare, obsolete, nonce, idiosyncratic, and novel ideographs available through the millenia in East Asia.

2

I did not attend the meetings in which ISO 10646 was slowly turned into a de facto American industrial standard. I have read that the first person to broach the subject of "unifying" Chinese characters was a Canadian with links to the Unicode project. I have also read that the people looking out for Japan's interests are from a software house that produces word processors, Justsystem Corp. Most shockingly, I have read that the unification of Chinese characters is being conducted on the basis of the Chinese characters used in China, and that the organization pushing this project forward is a private company, not representatives of the Chinese government. … However, basic logic dictates that China should not be setting character standards for Japan, nor should Japan be setting character standards for China. Each country and/or region should have the right to set its own standard, and that standard should be drawn up by a non-commercial entity.

## Unicode 2

• 1,114,112 slots!
• Non-fixed-width encoding
• Mapping and interpretation difficulties
• Waste

## UTF-16

• Unicode Transformation Format, 16 bits
• Goal: Represent every character as a fixed width unit!
• Problem: Characters aren’t fixed width any more
• Problem: Economic harm!
• A large fraction of the world’s text can use a shorter encoding
• 16-bit encoding expands English texts by 2x

## UTF-8

• Variable-width encoding for characters