EthiCS: UTF-8

In this lecture, we discuss the ways harms and technical systems can interact, using language coding systems as an example.

This material is by Eddie Kohler, Eliza Wells, and William Cochran.

Background reading

Harm

Harm: damage, injury
A central concept in many ethical systems, including “classical” liberalism
- “People should be free to act however they wish unless their actions cause harm to somebody else”
But what counts as harm?
- And who count as people?
Useful to name harms apart from a specific value system
- Naming a harm does not imply intent
- Evaluating and weighing harms and benefits comes later

Kinds of harm

Physical
Emotional
Political
Allocative
Representational

Reasonable accommodation and undue burden

New functionality often opens new opportunity for harm
- Focusing only on benefit is irresponsible
- Focusing only on harm can cause paralysis
Need a framework for weighing harms
Reasonable accommodation
- Reduce harm when doing so is “reasonable”
- Meaning it does not cause “undue burden” on provider

Examples from real life according to the US legal system

Reasonable accommodations
- Wheelchair-accessible sidewalks and buildings
- Extended exam times
- Prayer breaks during the workday
Undue burdens
- Paying for an ASL interpreter (Searls v. Johns Hopkins Hospital, 2016)
- Changing employee duties (Treadwell v. Alexander, 1983)
- Paying for an audio transcription service (Dobard v. San Francisco Bay Area Rapid Transit Authority, 1993)

Ethical questions for a design decision

Who will likely benefit from this decision?
Who could it harm?
In what ways, specifically, could it benefit or harm them?
What will it take to avoid or mitigate such harm(s)?
Does the work to avoid the harm constitute a reasonable accommodation or an undue burden? For instance,
- Who has the resources to shoulder this burden?
- Who has the responsibility to shoulder it?

History of character encoding

Written language centers on the concept of character
- “The smallest component of written language that has semantic value” (ref)
Once language is encoded in characters, characters themselves can be encoded in other forms
- Semaphore
- Braille
- Numbers!

Encoding requirements

Interchange: Messages can be exchanged without losing information
Efficiency/parsimony: Representation of language uses few bytes
Unambiguity: A message has one fixed meaning
These goals conflict!

Encoding history

Early encodings represented only unaccented capital Roman letters
- IBM’s BCDIC based on 1880s technology developed for the US Census
To represent new characters, some encodings used shift codes that change the mode for future codes
- CCITT-2: 0b01010 means R in “letter shift” mode and 4 in “figure shift” mode
- Shift code 0b11011 changes to figure shift mode, 0b11111 to letter shift mode
Storage becomes more plentiful → more bits per character
- ASCII: 7-bit code, upper and lower case
- ISO 8859-1: 8-bit code, adds accented letters
- Many other ISO encodings support other languages
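
A minimal sketch of how a shift-based decoder behaves, assuming the decoder starts in letter shift mode. Only the three codes named above are included (a real CCITT-2 table has 32 entries), and the names `PARTIAL_TABLE` and `decode_shifted` are illustrative, not part of any standard.

```python
FIGURE_SHIFT = 0b11011
LETTER_SHIFT = 0b11111
# code -> (meaning in letter shift mode, meaning in figure shift mode)
PARTIAL_TABLE = {0b01010: ("R", "4")}

def decode_shifted(codes):
    """Decode 5-bit codes while tracking the current shift mode."""
    mode = 0  # assumption: 0 = letter shift at start, 1 = figure shift
    out = []
    for code in codes:
        if code == FIGURE_SHIFT:
            mode = 1
        elif code == LETTER_SHIFT:
            mode = 0
        else:
            out.append(PARTIAL_TABLE[code][mode])
    return "".join(out)

message = [0b01010, FIGURE_SHIFT, 0b01010, 0b01010, LETTER_SHIFT, 0b01010]
print(decode_shifted(message))   # R44R

# Dropping the single LETTER_SHIFT code changes every later character.
damaged = [c for c in message if c != LETTER_SHIFT]
print(decode_shifted(damaged))   # R444
```

The same 5-bit code decodes differently depending on earlier codes, which is why one lost shift code can corrupt the rest of a message.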

Harms and one-byte encodings

Say that the only available text encodings were ISO 8859 variants
Who would be harmed? In what way?
Can you suggest ways to alleviate these harms? Are they reasonable accommodations or undue burdens?

Enter Unicode

A single encoding to support all human languages with one code point per character
At first, designers thought 2¹⁶ = 65,536 code points would suffice
Now 1,112,064 ≈ 17×2¹⁶ code points
- 149,186 used as of Unicode 15.0
- Covering 161 modern and historic scripts
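
The arithmetic behind the 1,112,064 figure, as a quick check. It assumes the usual accounting: 17 planes of 2¹⁶ code points each, minus the 2,048 surrogate code points (U+D800 through U+DFFF) that are reserved for UTF-16 and never stand for characters on their own.

```python
PLANES = 17
SURROGATES = 0xDFFF - 0xD800 + 1      # 2,048 reserved code points

all_code_points = PLANES * 2**16      # U+0000 through U+10FFFF
usable = all_code_points - SURROGATES

print(all_code_points)   # 1114112
print(usable)            # 1112064
```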

Harms and Unicode

Does Unicode alleviate any of the harms caused by one-byte encodings?
Have any new harms been created? On whom?

Problems with 16- and 32-bit encodings for Unicode

Expensive relative to 7-bit ASCII or 8-bit ISO 8859
- 2x or 4x!
Ambiguity
- Does the byte sequence 0x65 0x00 represent e (little-endian U+0065) or 攀 (big-endian U+6500)?
- U+FEFF BYTE ORDER MARK
Incompatibility with existing programming languages
- In the C programming language, a null character 0x00 ends a string
- In 16- or 32-bit-encoded Unicode, null characters abound
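
A short sketch of both problems using Python's built-in codecs. The helper `c_visible_length` is a hypothetical stand-in for what a C function such as `strlen` would report: the number of bytes before the first 0x00.

```python
ambiguous = bytes([0x65, 0x00])
print(ambiguous.decode("utf-16-le"))   # e   (U+0065)
print(ambiguous.decode("utf-16-be"))   # 攀  (U+6500)

# The byte order mark U+FEFF resolves the ambiguity: its bytes reveal the order.
print("\ufeff".encode("utf-16-le").hex(" "))   # ff fe
print("\ufeff".encode("utf-16-be").hex(" "))   # fe ff

def c_visible_length(data):
    """Bytes before the first 0x00, as a C string function would see them."""
    end = data.find(b"\x00")
    return len(data) if end == -1 else end

text = "Hi!"
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(encoding)
    print(encoding, len(data), data.count(b"\x00"), c_visible_length(data))
# utf-8 3 0 3
# utf-16-le 6 3 1
# utf-32-le 12 9 1
```

In the 16- and 32-bit encodings, a C string function would stop after the first byte of "Hi!".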

Desiderata

Byte-based encoding
Compatible with ASCII
- Any byte corresponding to ASCII should indicate that character
- Regardless of where it appears
Resistant to errors
- No shift-based encodings
- Self-synchronizing: only one way to decode a sequence of bytes
Relatively efficient

Straw-man solution

Characters U+0000–U+00FD represented by bytes 0x00–0xFD
Character C > U+00FD represented by 0xFE, followed by (C − 0xFE) 0xFF bytes

Text	Code points	Encoding
Hi!	U+0048 U+0069 U+0021	`0x48 0x69 0x21`
«Allô !»	U+00AB U+0041 U+006C U+006C U+00F4 U+0020 U+0021 U+00BB	`0xAB 0x41 0x6C 0x6C 0xF4 0x20 0x21 0xBB`
你好	U+4F60 U+597D	0xFE 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF … 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFE 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF … (43,044 bytes omitted)
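
A minimal sketch of this straw-man encoding, following the rule as stated: code points up to U+00FD map to one byte, and anything larger becomes 0xFE followed by (C − 0xFE) bytes of 0xFF. The function names are illustrative.

```python
def strawman_encode(text):
    """Encode text with the unary straw-man scheme."""
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0xFD:
            out.append(c)
        else:
            out.append(0xFE)
            out.extend(b"\xff" * (c - 0xFE))
    return bytes(out)

def strawman_decode(data):
    """Invert strawman_encode by counting 0xFF bytes after each 0xFE."""
    chars = []
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        if byte == 0xFF:
            raise ValueError("0xFF without a preceding 0xFE")
        if byte != 0xFE:
            chars.append(chr(byte))
            continue
        run = 0
        while i < len(data) and data[i] == 0xFF:
            run += 1
            i += 1
        chars.append(chr(0xFE + run))
    return "".join(chars)

print(strawman_encode("Hi!").hex(" "))               # 48 69 21
print(len(strawman_encode("\u00abAll\u00f4 !\u00bb")))  # 8
print(len(strawman_encode("\u4f60\u597d")))          # tens of thousands of bytes
```

The encoding is byte-based, ASCII-compatible, and free of shift codes, yet two Han characters cost tens of thousands of bytes while Western European text costs one byte per character.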

Harm, undue burden, reasonable accommodation

Say that our straw-man was the only available text encoding
Who would be harmed? In what way?
What harms would be alleviated? And for whom?
Does this encoding represent a reasonable accommodation?

Second straw man (surrogate pairs)

Bytes 0x00–0x7F represent characters U+0000–U+007F
If C > U+007F, divide C into upper and lower bits
- Let $C = 64 \times C_1 + C_0$, where $0 \leq C_1, C_0 < 64$
- Represent U+C as [0xC0 + $C_1$] [0x80 + $C_0$]
- So bytes 0xC0–0xFF can only represent the first byte in a two-byte representation
- Bytes 0x80–0xBF can only represent the second byte
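
A sketch of the two-byte scheme above, under the stated constraint that both $C_1$ and $C_0$ are below 64. The name `encode_two_byte` is illustrative. The constraint caps the encodable range at U+0FFF (64 × 64 = 4,096 code points), so the sketch raises an error beyond that.

```python
def encode_two_byte(text):
    """Encode with the second straw man: ASCII as-is, otherwise lead + trail byte."""
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0x7F:
            out.append(c)
        elif c < 64 * 64:
            c1, c0 = divmod(c, 64)          # c == 64 * c1 + c0
            out += bytes([0xC0 + c1, 0x80 + c0])
        else:
            raise ValueError(f"U+{c:04X} does not fit in two bytes")
    return bytes(out)

print(encode_two_byte("All\u00f4").hex(" "))   # 41 6c 6c c3 b4

try:
    encode_two_byte("\u4f60")
except ValueError as err:
    print(err)                                 # U+4F60 does not fit in two bytes
```

Lead bytes and trail bytes come from disjoint ranges, so a decoder can always tell where a character starts. The cost is that most of Unicode cannot be represented.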

Third straw man

Bytes 0x00–0x7F represent characters U+0000–U+007F
If C > U+007F, divide C into blocks of 6 bits (base-64 digits)
- $C = \sum_{i=0}^{N-1} C_i \times 64^i$ where $0 \leq C_i < 64$
- Represent U+C as [0xC0 + $C_{N-1}$] [0x80 + $C_{N-2}$] ... [0x80 + $C_0$]
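
A sketch of this variable-length scheme, splitting each code point into base-64 digits, most significant first. The names `encode_base64_digits` and `decode_base64_digits` are illustrative. Note that the lead byte does not say how many bytes follow, so the decoder only learns that a character has ended when it sees the next non-continuation byte.

```python
def encode_base64_digits(text):
    """Third straw man: lead byte 0xC0 + top digit, then 0x80 + each lower digit."""
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0x7F:
            out.append(c)
            continue
        digits = []
        while c:
            c, d = divmod(c, 64)
            digits.append(d)
        digits.reverse()                      # most significant digit first
        out.append(0xC0 + digits[0])
        out.extend(0x80 + d for d in digits[1:])
    return bytes(out)

def decode_base64_digits(data):
    """Invert encode_base64_digits by accumulating continuation bytes."""
    chars = []
    value = None                              # code point being assembled
    for b in data:
        if 0x80 <= b <= 0xBF:
            if value is None:
                raise ValueError("continuation byte without a lead byte")
            value = value * 64 + (b - 0x80)
            continue
        if value is not None:                 # previous character is complete
            chars.append(chr(value))
            value = None
        if b <= 0x7F:
            chars.append(chr(b))
        else:
            value = b - 0xC0
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)

print(encode_base64_digits("\u4f60").hex(" "))   # c4 bd a0
```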

UTF-8

Bytes 0x00–0x7F represent characters U+0000–U+007F
For U+0080 ≤ C ≤ U+07FF, use two bytes: [0xC0–0xDF] [0x80–0xBF]
- Bitwise, 0b1000'0000 ≤ C ≤ 0b111'1111'1111
- Find $C_1, C_0 < 64$ so that $C = \sum_{i=0}^{1} C_i \times 64^i$
- $0 \leq C_1 < 32$
- Encoding: [0xC0 + $C_1$] [0x80 + $C_0$]
For U+0800 ≤ C ≤ U+FFFF, use three bytes: [0xE0–0xEF] [0x80–0xBF] [0x80–0xBF]
- Find $C_2, C_1, C_0 < 64$ so that $C = \sum_{i=0}^{2} C_i \times 64^i$
- $0 \leq C_2 < 16$
- Encoding: [0xE0 + $C_2$] [0x80 + $C_1$] [0x80 + $C_0$]
For U+10000 ≤ C ≤ U+10FFFF, use four bytes: [0xF0–0xF7] [0x80–0xBF] [0x80–0xBF] [0x80–0xBF]
- Find $C_3, C_2, C_1, C_0 < 64$ so that $C = \sum_{i=0}^{3} C_i \times 64^i$
- $0 \leq C_3 < 8$
- Encoding: [0xF0 + $C_3$] [0x80 + $C_2$] [0x80 + $C_1$] [0x80 + $C_0$]
Sketched on a placemat in 1992 by Ken Thompson, one of the Unix inventors (citation)
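
The rules above can be sketched directly and checked against Python's built-in UTF-8 codec. This is an illustration of the arithmetic, not a production encoder: the function name is hypothetical, and the sketch does not reject surrogate code points the way a strict encoder would.

```python
def utf8_encode_code_point(c):
    """Encode one code point by splitting it into base-64 digits."""
    if c <= 0x7F:
        return bytes([c])
    if c <= 0x7FF:
        c1, c0 = divmod(c, 64)
        return bytes([0xC0 + c1, 0x80 + c0])
    if c <= 0xFFFF:
        rest, c0 = divmod(c, 64)
        c2, c1 = divmod(rest, 64)
        return bytes([0xE0 + c2, 0x80 + c1, 0x80 + c0])
    if c <= 0x10FFFF:
        rest, c0 = divmod(c, 64)
        rest, c1 = divmod(rest, 64)
        c3, c2 = divmod(rest, 64)
        return bytes([0xF0 + c3, 0x80 + c2, 0x80 + c1, 0x80 + c0])
    raise ValueError("code point out of range")

for ch in ["e", "\u00f4", "\u4f60", "\U0001F48F"]:
    mine = utf8_encode_code_point(ord(ch))
    assert mine == ch.encode("utf-8")         # agrees with the built-in codec
    print(f"U+{ord(ch):04X}", mine.hex(" "))
# U+0065 65
# U+00F4 c3 b4
# U+4F60 e4 bd a0
# U+1F48F f0 9f 92 8f
```

Unlike the third straw man, the lead byte alone tells the decoder how many continuation bytes to expect.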

Advantages of UTF-8

Compatible with existing software and libraries
- Every ASCII file is also a UTF-8 file with the same meaning
- UTF-8 uses the 0 byte only to encode U+0000 itself, so existing C library functions continue to work on UTF-8 strings! (UTF-16 texts contain tons of zero bytes)
Resistant to errors
- No synchronization issues: an error in one byte affects at most one character
- Contrast the telegraphy system, where an error in a “shift byte” can affect the interpretation of all future characters
Relatively efficient
- Unicode code points U+0080–U+07FF can be represented in two bytes
- So texts in European Latin script, African Latin script, Greek script, Cyrillic script, and most Arabic-script languages take no more space than in UTF-16
- But other scripts may take more space than UTF-16: Brahmic (Indic), Han, Japanese, Korean
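
A small sketch of the efficiency and error-resistance claims, using Python's built-in codecs. The sample strings are arbitrary short greetings chosen for illustration; real texts mix in ASCII spaces, digits, and markup, which shifts the comparison toward UTF-8.

```python
samples = {
    "English": "hello",
    "Russian": "привет",
    "Chinese": "你好",
}
for name, text in samples.items():
    print(name, len(text.encode("utf-8")), len(text.encode("utf-16-le")))
# English 5 10
# Russian 12 12
# Chinese 6 4

# Self-synchronization: lose the first byte of a three-byte character and
# the decoder recovers at the next lead byte. Only the damaged character is lost.
damaged = "你好".encode("utf-8")[1:]
recovered = damaged.decode("utf-8", errors="replace")
print(recovered.endswith("好"))   # True
```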

UTF-8 deployment

Now over 97% of web pages!
Increasingly the default in programming languages and operating systems


Harms in UTF-8

Say that Unicode UTF-8 were the only available text encoding
Who would be harmed? In what way?
What harms would be alleviated? And for whom?
Would requiring this encoding represent a reasonable accommodation?

Case studies

Emoji

If you’re interested in digging deeper into encoding issues, consider emoji
Intersection between technical and social issues
Many of the same high-level issues encountered in encoding written language recur
- Similar solutions recur too

Emoji history

People have built tiny pictures from punctuation for years, including artists and poets but also scientists and engineers
- September 1982: A message thread at CMU spins out of control. Tone is hard to communicate electronically!
- “Maybe we should adopt a convention of putting a star (*) in the subject field of any notice which is to be taken as a joke.”
- “I propose the following character sequence for joke markers: :-) Read it sideways.” (citation)
Modern emoji were invented for cell phones in Japan by 1997
- Carriers competed on their emoji sets
- Initially not interoperable!

Emoji and Unicode

Emoji became so popular that Unicode encoding was inevitable
Enormous technical challenges
- First characters where color is important—font standards require changes
Social challenges: Which emoji deserve representation?
- Unicode code points are a finite resource
- The powerful start lobbying

Will a fixed emoji set suffice?

Issues of unequal representation
Emoji sets from Japan used a light skin tone for many characters

SoftBank is a Japanese cell phone carrier, the original inventors of emoji. This is from their 1999 set, the first set that had color (and animations). Image from Emojipedia
How to represent varieties of skin tone? Some designers went for non-representational tones (gray, bright yellow), but that didn’t suffice
Make 5x as many emoji? Why stop at 5x?

Emoji and culture

Is character encoding normative?
- Images that seem inoffensive in one culture and time aren’t appropriate in others
- People interpret the default emoji set as a statement by society, or by the computer industry, of what is right or most normal
- Consider some characters added to Unicode 6.0 in 2010, adopted from Japanese mobile phone emoji sets:
  
  U+1F48F KISS U+1F46F WOMAN WITH BUNNY EARS

Reducing undesirable signification

Emoji representations lose specificity

U+1F48F in 2010 U+1F48F in 2020
Emoji representations change appearance more fundamentally

U+1F46F in 2010 U+1F46F in 2020
- The standards now say the emoji is “most popularly depicted as two women dancing”; some redefine it to be gender neutral, as “people with bunny ears” or “party”

Representing more cultures

Erasing cultural differences is not the best way to achieve equality
- People want to be represented!
Impossible or inefficient to represent all important representations with individual code points
Try variable-length encoding?

Example: Kiss

U+1F48F KISS 💏 now represents gender-nonspecific people
To represent more specific people kissing, use a Unicode combiner, U+200D ZERO WIDTH JOINER, invented to represent scripts such as Arabic and Indic where sometimes characters are visually connected
“Kiss: Woman, Man” 👩‍❤️‍💋‍👨 is represented as:
- U+1F469 WOMAN 👩
- U+200D ZERO WIDTH JOINER
- RED HEART ❤️, which is represented as…
  - U+2764 HEAVY BLACK HEART ❤︎
  - U+FE0F VARIATION SELECTOR-16 (the emoji variation selector)
- U+200D ZERO WIDTH JOINER
- U+1F48B KISS MARK 💋
- U+200D ZERO WIDTH JOINER
- U+1F468 MAN 👨
- That might seem crazy, but look at the HTML source using hexdump -C to verify. It takes 27 bytes: f0 9f 91 a9 (WOMAN) e2 80 8d (ZWJ) e2 9d a4 (HEAVY BLACK HEART) ef b8 8f (VARIATION SELECTOR-16, U+FE0F) e2 80 8d (ZWJ) f0 9f 92 8b (KISS MARK) e2 80 8d (ZWJ) f0 9f 91 a8 (MAN).
But this representation naturally generalizes to “Kiss: Man, Man” and “Kiss: Woman, Woman”! (Or “Kiss: Male Astronaut, Female Zombie”)
Other sequences can represent skin tone, hair color, other features
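
A sketch that builds the “Kiss: Woman, Man” sequence from its code points and prints each one with its UTF-8 bytes, using only the Python standard library. The variable names are illustrative. Whether the result renders as a single glyph depends on the font and platform.

```python
import unicodedata

WOMAN, MAN = "\U0001F469", "\U0001F468"
ZWJ = "\u200D"                      # ZERO WIDTH JOINER
RED_HEART = "\u2764\uFE0F"          # HEAVY BLACK HEART + VARIATION SELECTOR-16
KISS_MARK = "\U0001F48B"

def kiss(first, second):
    """Join two people with heart and kiss mark using zero width joiners."""
    return ZWJ.join([first, RED_HEART, KISS_MARK, second])

sequence = kiss(WOMAN, MAN)
for ch in sequence:
    name = unicodedata.name(ch, "(unnamed)")
    print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" "), name)

print(len(sequence), len(sequence.encode("utf-8")))   # 8 27
```

Swapping the arguments, as in `kiss(MAN, MAN)` or `kiss(WOMAN, WOMAN)`, yields the other combinations without any new code points.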

Presented without comment

From The Unicode Emoji technical standard:

Gas pump