Character representation
- A brief history of coding systems, considering two sets of values in tension:
- Representation and equality
- Efficiency and unambiguity
Coding systems
- A coding system represents human language in a form other than speech
- Standardization!
- Any written language is a coding system
- Alphabetic (a mark represents a part of a sound; Latin, Cyrillic, Coptic, Korean)
- Syllabic (a mark represents a sound combination; some Japanese scripts)
- Logographic (a mark represents a word; Chinese Hanzi)
- Representations can be translated into new forms!
Chappe semaphore, 1790s
- Letters → arrangements of mechanical arms visible from far away
Braille alphabet, 1820s
- Letters and letter combinations → patterns of dots perceptible to touch
Japanese Braille, 1880s
- Braille’s symbols repurposed to a completely different syllabary
- Issues of ambiguity and misinterpretation
- Meaning of Braille symbols depends on context
- How to unambiguously interpret a document combining French and Japanese Braille symbols?
Efficiency vs. representation
- Efficiency and parsimony are essential goals for computer systems
- Represent data in the smallest space
- Less expensive, more capacity
- Take less time to transmit
- Efficiency and parsimony are also essential for humans
- Human limitations in distinguishing marks
- Braille formerly had more symbols, distinguished by the use of dashes in some positions as well as dots (a ternary-like system), but “by the second edition in 1837 [Louis Braille] had discarded the dashes because they were too difficult to read.” ref
- Efficiency and parsimony can conflict with representation!
Telegraphy, 1930s
- Telegraphy uses a binary coding system (dot + dash)
- 2^5 = 32 distinct patterns
- Complex “shift” system multiplexes some patterns
- 11111 01110 11011 01110 = “C:”
- So the 01110 pattern means different things depending on context
- Vulnerable to error: a misinterpreted symbol can change meanings of all future symbols
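The shift mechanism and its failure mode can be sketched in a few lines. This is a minimal illustration, not a full telegraph code table: it contains only the three patterns quoted above, the names `LTRS`, `FIGS`, and `decode` are made up for the example, and starting in letters mode is an assumption.

```python
# Hypothetical three-entry subset of a 5-bit shifted code; only the
# patterns quoted in the text are included.
LTRS, FIGS = "11111", "11011"      # shift to letters / shift to figures
LETTERS = {"01110": "C"}
FIGURES = {"01110": ":"}

def decode(patterns):
    """Decode a list of 5-bit pattern strings, tracking the shift state."""
    table = LETTERS                # assumption: decoding starts in letters mode
    out = []
    for p in patterns:
        if p == LTRS:
            table = LETTERS
        elif p == FIGS:
            table = FIGURES
        else:
            out.append(table.get(p, "?"))
    return "".join(out)

message = ["11111", "01110", "11011", "01110"]
print(decode(message))             # C:

# One flipped bit turns the FIGS shift (11011) into LTRS (11111), so every
# later symbol is read from the wrong table.
intact = ["11111", "01110", "11011", "01110", "01110"]
damaged = ["11111", "01110", "11111", "01110", "01110"]
print(decode(intact))              # C::
print(decode(damaged))             # CCC
```

The single damaged shift changes two later characters here; in a longer message it would change everything up to the next shift.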
BCDIC, 1930s
- Derived from punched-card codes originally developed for the US Census in the late 1800s
ASCII, 1960s
- American Standard Code for Information Interchange
- Foundation for many national standards
- Controversial at the time
MORE THAN 64 CHARACTERS!
America and the world
- ISO (the International Organization for Standardization) adopted ASCII as ISO/IEC 646, with a caveat
- The characters [ \ ] { | } (and, to a lesser extent, ^ ~ # $ @ `) were “reserved for national use”
- Different governments used those slots to represent critical characters in their languages
- Different nations speaking the same language made different choices!
Character set | à | â | ç | É | é | ê | è | î | ô | ù | û | £ | ° | § | ¨ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
French (ISO-IR-025) | @ | N/A | \ | N/A | { | N/A | } | N/A | N/A | \| | N/A | # | [ | ] | ~ |
Canadian French #1 (ISO-IR-121) | @ | [ | \ | N/A | { | ] | } | ^ | ` | \| | ~ | N/A | N/A | N/A | N/A |
Canadian French #2 (ISO-IR-122) | @ | [ | \ | ^ | { | ] | } | N/A | ` | \| | ~ | N/A | N/A | N/A | N/A |
- The meaning of an encoded text depends on national context!
- Does ^le mean that, or Éle (Canadian #2), or île (Canadian #1)?
- Humans can sort of adapt, but it’s painful
C
{ a[i] = '\n'; }
C?
ä aÄiÅ = 'Ön'; å
- The Swedish national character set uses the {[]\} code points for the Swedish letters äÄÅÖå
- A Swedish programmer would choose how to configure their computer, and either write C in a crazy style using letters, or write Swedish in a crazy style using punctuation
…C?
??< a??(i??) = '??/n'; ??>
- The C standards body introduced a workaround, trigraphs, that almost everyone hated
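Both readings of the same code points can be reproduced with a short sketch. It assumes the Swedish national variant assigns [ \ ] { | } to Ä Ö Å ä ö å, and it uses the C trigraph spellings for the five characters in the example; the names `SWEDISH`, `TRIGRAPHS`, `as_swedish`, and `with_trigraphs` are illustrative only.

```python
# Assumed national-variant mapping: ASCII punctuation -> the Swedish letter
# that occupies the same code point.
SWEDISH = str.maketrans("[\\]{|}", "ÄÖÅäöå")

# Trigraph spellings for the characters used in the example.
TRIGRAPHS = {"{": "??<", "}": "??>", "[": "??(", "]": "??)", "\\": "??/"}

def as_swedish(text):
    """Show how a Swedish-configured terminal would display ASCII text."""
    return text.translate(SWEDISH)

def with_trigraphs(text):
    """Rewrite C source so it avoids the national-use code points."""
    return "".join(TRIGRAPHS.get(ch, ch) for ch in text)

c_source = "{ a[i] = '\\n'; }"      # the C text { a[i] = '\n'; }
print(as_swedish(c_source))         # ä aÄiÅ = 'Ön'; å
print(with_trigraphs(c_source))     # ??< a??(i??) = '??/n'; ??>
```

The bytes never change in the first case; only the table used to display them does.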
What about that bit 7?
- ASCII used 7 bits; there’s another bit
- ASCII designers suggested it be used for error correction
- Parity bit (checksum)
- Bit 7 = Bit 6 ^ Bit 5 ^ Bit 4 ^ Bit 3 ^ Bit 2 ^ Bit 1 ^ Bit 0
- Can detect any single-bit-flip error (“hit”)
- Telephone equipment gets better, errors less frequent, storage cheaper—parity bit is less important
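The parity rule above is small enough to try directly. The following is a sketch of even parity over a 7-bit code, with function names (`add_even_parity`, `parity_ok`) chosen for illustration.

```python
def add_even_parity(code7):
    """Return an 8-bit value whose bit 7 is the XOR of bits 0-6 of code7."""
    if not 0 <= code7 < 128:
        raise ValueError("expected a 7-bit code")
    parity = bin(code7).count("1") & 1      # XOR of all seven bits
    return code7 | (parity << 7)

def parity_ok(byte):
    """True if the eight bits contain an even number of 1s."""
    return bin(byte).count("1") % 2 == 0

sent = add_even_parity(ord("A"))            # 'A' = 0b1000001, two 1 bits
print(f"{sent:08b}")                        # 01000001
hit = sent ^ 0b00000100                     # a single-bit "hit" in transit
print(parity_ok(sent), parity_ok(hit))      # True False
```

A second flipped bit would restore even parity, so this scheme detects single-bit errors but cannot correct them or catch every double error.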
MORE THAN 128 CHARACTERS!
- Use slots 128-255 for accented characters and additional symbols
ISO 8859
- Can represent most texts for many Western languages
- The 7-bit subset agrees with ASCII
- Everyone can write C as intended!
- But not all Western languages are supported
- Different versions re-encode the upper 128 characters for other languages or scripts (Greek, Cyrillic)
- ISO 8859-1 becomes the most common and the default, but the choice of languages it represents seems odd to us
Ísland sigurinn
- Icelandic: supported by ISO 8859-1
Ý ý Þ þ Ð ð
- ~360,000 speakers
- Turkish: not supported by ISO 8859-1
ı I (undotted I), i İ (dotted İ), ş Ş ğ Ğ
- ~88,000,000 speakers!
- ISO 8859-9 replaces the Icelandic characters with the Turkish ones
- Still have ambiguity!
- Still unclear how to represent a text with both Icelandic and Turkish
- Metadata gives a different encoding for each byte range?
- Additional shift characters, as in telegraphy, change from language to language?
- No good choices
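The ambiguity is easy to reproduce with Python's built-in codecs. In this sketch the sample phrase and the helper name `fits` are made up for illustration; the codec names `iso8859_1` and `iso8859_9` are Python's standard names for the two character sets.

```python
# The same six bytes, read under two members of the ISO 8859 family.
raw = bytes([0xDE, 0xFE, 0xD0, 0xF0, 0xDD, 0xFD])
print(raw.decode("iso8859_1"))      # ÞþÐðÝý  (Icelandic letters)
print(raw.decode("iso8859_9"))      # ŞşĞğİı  (Turkish letters)

def fits(text, encoding):
    """True if every character of text has a slot in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

mixed = "þing ve şehir"             # one Icelandic word, two Turkish words
print(fits(mixed, "iso8859_1"))     # False
print(fits(mixed, "iso8859_9"))     # False
```

Nothing in the bytes themselves says which table applies, and neither table can hold the mixed text.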
The panda in the room
- Hanzi
- Chinese logographs
- More than 50,000 in use!
- Hard to cram that into 128 bit patterns
Unicode: the dream
- Represent all the world’s languages in a single system
- Every character has one unambiguous encoding
- No ambiguity in interpretation
- Any text can contain fragments of any language without external metadata or internal shift patterns
Unicode 1
- 65,536 slots!
- Fixed-width two-byte encoding
- Enabled by “Han unification”
- Mapping of characters in different Hanzi national character sets into a set without duplicates
- Based on work by librarians and others: Taiwan’s Chinese Character Code for Information Interchange (CCCII); the Research Libraries Information Network’s East Asian Character Code (EACC); etc.
- Continued by a Unicode-convened Joint Research Group, working with experts from China, Japan, Korea (and now Vietnam, Taiwan) (reference)
- Representation problems
- Some scripts not included (Khmer, Mongolian, Cherokee)
- Some scripts excluded (historic scripts)
- Han unification successfully encoded most characters commonly used by people overall, but not some characters commonly used by particular people—such as characters for writing surnames!
Intense technical and social arguments
Have these people no shame?
This is what happens when a computing tradition that has never been able to move off ground-zero in associating 1 character to 1 glyph keeps grinding through the endless lists of variants, mistakes, rare, obsolete, nonce, idiosyncratic, and novel ideographs available through the millenia in East Asia.
I did not attend the meetings in which ISO 10646 was slowly turned into a de facto American industrial standard. I have read that the first person to broach the subject of "unifying" Chinese characters was a Canadian with links to the Unicode project. I have also read that the people looking out for Japan's interests are from a software house that produces word processors, Justsystem Corp. Most shockingly, I have read that the unification of Chinese characters is being conducted on the basis of the Chinese characters used in China, and that the organization pushing this project forward is a private company, not representatives of the Chinese government. … However, basic logic dictates that China should not be setting character standards for Japan, nor should Japan be setting character standards for China. Each country and/or region should have the right to set its own standard, and that standard should be drawn up by a non-commercial entity.
Unicode 2
- 1,114,112 slots!
- More room for unencoded characters and languages
- But fixed-width encoding seems really unwise
- Too big for 2-byte encoding
- 4-byte encoding supremely wasteful (though some programming languages do use it internally…)
- 3-byte encoding? Not a power of two??
- Non-fixed-width encoding
- Mapping and interpretation difficulties
- Waste
UTF-16
- Unicode Transformation Format, 16 bits
- Goal: Represent every character as a fixed width unit!
- Problem: Characters aren’t fixed width any more
- Problem: Representational ambiguity (byte order)
- Problem: Economic harm!
- Much of the text in the world can use a shorter encoding
- 16-bit encoding expands English texts by 2x
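Each of the three problems can be observed with Python's UTF-16 codecs. This is a small sketch; the `surrogates` helper is written out here only to show the arithmetic for code points above U+FFFF, and the panda emoji is just a convenient example of such a code point.

```python
# Economic cost: ASCII text doubles in size.
text = "hello"
print(len(text.encode("ascii")), len(text.encode("utf-16-le")))        # 5 10

# Byte-order ambiguity: the same character has two serializations.
print("A".encode("utf-16-be").hex(), "A".encode("utf-16-le").hex())    # 0041 4100

def surrogates(cp):
    """Split a code point above U+FFFF into a high/low surrogate pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

# Not fixed width: U+1F43C PANDA FACE needs two 16-bit units.
panda = "\U0001F43C"
high, low = surrogates(ord(panda))
print(f"{high:04x} {low:04x}")                                         # d83d dc3c
print(panda.encode("utf-16-be").hex())                                 # d83ddc3c
```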
A more perfect encoding
- There can be no perfect encoding for all situations!
- But there may be an encoding that offers less stark tradeoffs
- Better interoperability and equality of representation than ASCII
- Better efficiency and unambiguity than UTF-16
- Better overall?
UTF-8
- Variable-width, byte-based encoding for characters
- Contrast UTF-16: variable-width, 16-bit-based encoding
- Every ASCII character represents itself, unambiguously
- Every occurrence of the byte value 65 represents the character A
- Non-ASCII characters are represented by short byte sequences of values ≥128
- Bytes from 0xC2…0xF4 start a multi-byte sequence; bytes from 0x80…0xBF continue the sequence
- Example: 0xC3 0xA5 means å
- Sketched on a placemat in 1992 by Ken Thompson, one of the Unix inventors (citation)
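The byte layout described above can be written out by hand. The following sketch encodes a single code point; `utf8_encode` is an illustrative name, and the function deliberately skips validation (for example, it does not reject surrogate code points), which a real encoder would need.

```python
def utf8_encode(cp):
    """Encode one Unicode code point (an int) as UTF-8 bytes."""
    if cp < 0x80:                       # ASCII: one byte, unchanged
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(ord("A")).hex(" "))   # 41
print(utf8_encode(ord("å")).hex(" "))   # c3 a5
```

Lead bytes and continuation bytes come from disjoint ranges, so a decoder can tell from any single byte whether it starts a character.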
Advantages of UTF-8
- Compatible with existing software and libraries
- Every ASCII file is also a UTF-8 file with the same meaning
- UTF-8 uses the 0 byte only for U+0000 (NUL), so existing C library functions continue to work on UTF-8 strings! (UTF-16 texts contain tons of zero bytes)
- Resistant to errors
- No synchronization issues: an error in one byte affects at most one character
- Contrast the telegraphy system, where an error in a “shift byte” can affect the interpretation of all future characters
- Relatively efficient
- Unicode code points U+0080–U+07FF can be represented in two bytes
- So texts in European Latin script, African Latin script, Greek script, Cyrillic script, and most Arabic-script languages take no more space than in UTF-16
- But other scripts may take more space than UTF-16: Brahmic (Indic), Han, Japanese, Korean
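The compatibility and size claims can be spot-checked with a few sample strings. The samples and the helper name `sizes` are chosen for illustration, and single words are not representative of whole texts, so treat the numbers as a sanity check rather than a measurement.

```python
def sizes(s):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))

samples = {
    "English": "hello",
    "Russian": "привет",
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",   # six code points
    "Chinese": "你好",
}
for name, s in samples.items():
    print(name, *sizes(s))

# An ASCII file is already valid UTF-8 with the same bytes.
print("hello".encode("ascii") == "hello".encode("utf-8"))   # True

# UTF-16 fills ASCII text with zero bytes; UTF-8 does not.
print(0 in "hello".encode("utf-8"), 0 in "hello".encode("utf-16-le"))   # False True
```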
How can we tell UTF-8 is good?
- Now over 95% of web pages!
Emoji
- If you’re interested in digging deeper into encoding issues, consider emoji
- Intersection between technical and social issues
- Many of the same high-level issues encountered in encoding written language recur
- Similar solutions recur too
Emoji history
- People have built tiny pictures from punctuation for years, including artists and poets but also scientists and engineers
- September 1982: A message thread at CMU spins out of control. Tone is hard to communicate electronically!
- “Maybe we should adopt a convention of putting a star (*) in the subject field of any notice which is to be taken as a joke.”
- “I propose the following character sequence for joke markers: :-) Read it sideways.” (citation)
- Modern emoji were invented for cell phones in Japan by 1997
- Carriers competed on their emoji sets
- Initially not interoperable!
Emoji and Unicode
- Emoji became so popular that Unicode encoding was inevitable
- Enormous technical challenges
- First characters where color is important—font standards require changes
Social challenges: Which emoji deserve representation?
- Unicode code points are a finite resource
- The powerful start lobbying
Will a fixed emoji set suffice?
- Issues of unequal representation
- Emoji sets from Japan used a light skin tone for many characters
SoftBank is a Japanese cell phone carrier, the original inventors of emoji. This is from their 1999 set, the first set that had color (and animations). Image from Emojipedia
- How to represent varieties of skin tone? Some designers went for non-representational tones (gray, bright yellow), but that didn’t suffice
- Make 5x as many emoji? Why stop at 5x?
Emoji and culture
Is character encoding normative?
- Images that seem inoffensive in one culture and time aren’t appropriate in others
- People interpret the default emoji set as a statement by society, or by the computer industry, of what is right or most normal
Consider some characters added to Unicode 6.0 in 2010, adopted from Japanese mobile phone emoji sets:
- U+1F48F KISS
- U+1F46F WOMAN WITH BUNNY EARS
Reducing undesirable signification
Emoji representations lose specificity
[Images: U+1F48F in 2010; U+1F48F in 2020]
Emoji representations change appearance more fundamentally
[Images: U+1F46F in 2010; U+1F46F in 2020]
- The standards now say the emoji is “most popularly depicted as two women dancing”; some redefine it to be gender neutral, as “people with bunny ears” or “party”
Representing more cultures
- Erasing cultural differences is not the best way to achieve equality
- People want to be represented!
- Impossible or inefficient to represent all important representations with individual code points
- Try variable-length encoding?
Example: Kiss
- U+1F48F KISS 💏 now represents gender-nonspecific people
- To represent more specific people kissing, use a Unicode combiner, U+200D ZERO WIDTH JOINER, invented to represent scripts such as Arabic and Indic where sometimes characters are visually connected
- “Kiss: Woman, Man” 👩❤️💋👨 is represented as:
- U+1F469 WOMAN 👩
- U+200D ZERO WIDTH JOINER
- RED HEART ❤️, which is represented as…
- U+2764 HEAVY BLACK HEART ❤︎
- U+FE0F VARIATION SELECTOR-16 (the emoji variation selector)
- U+200D ZERO WIDTH JOINER
- U+1F48B KISS MARK 💋
- U+200D ZERO WIDTH JOINER
- U+1F468 MAN 👨
- That might seem crazy, but look at the HTML source using hexdump -C to verify. It takes 27 bytes:
- f0 9f 91 a9 (WOMAN)
- e2 80 8d (ZWJ)
- e2 9d a4 (HEAVY BLACK HEART)
- ef b8 8f (EMOJI VARIATION SELECTOR)
- e2 80 8d (ZWJ)
- f0 9f 92 8b (KISS MARK)
- e2 80 8d (ZWJ)
- f0 9f 91 a8 (MAN)
- But this representation naturally generalizes to “Kiss: Man, Man” and “Kiss: Woman, Woman”! (Or “Kiss: Male Astronaut, Female Zombie”)
- Other sequences can represent skin tone, hair color, other features
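The sequence can also be assembled from its code points rather than read from a hex dump. In this sketch the helper `zwj_sequence` and the constant names are illustrative; whether the result is drawn as one glyph depends on the font and platform.

```python
ZWJ = "\u200d"      # ZERO WIDTH JOINER
VS16 = "\ufe0f"     # variation selector requesting emoji presentation

WOMAN, MAN = "\U0001F469", "\U0001F468"
RED_HEART = "\u2764" + VS16         # HEAVY BLACK HEART + variation selector
KISS_MARK = "\U0001F48B"

def zwj_sequence(*parts):
    """Join emoji components with ZERO WIDTH JOINER."""
    return ZWJ.join(parts)

kiss_woman_man = zwj_sequence(WOMAN, RED_HEART, KISS_MARK, MAN)
data = kiss_woman_man.encode("utf-8")
print(len(kiss_woman_man), len(data))   # 8 27
print(data.hex(" "))

# The same pattern generalizes by swapping components.
kiss_man_man = zwj_sequence(MAN, RED_HEART, KISS_MARK, MAN)
print(len(kiss_man_man.encode("utf-8")))    # 27
```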