EthiCS: UTF-8

In this lecture, we discuss the ways harms and technical systems can interact, using language coding systems as an example.

This material is by Eddie Kohler, Eliza Wells, and William Cochran.

Background reading

Harm

Harm: damage, injury
A central concept in many ethical systems, including “classical” liberalism
- “People should be free to act however they wish unless their actions cause harm to somebody else”
But what counts as harm?
- And who count as people?
Useful to name harms apart from a specific value system
- Naming a harm does not imply intent
- Evaluating and weighing harms and benefits comes later

Kinds of harm

Physical
Emotional
Political
Economic
Allocative
- When resources or opportunities are withheld from an individual or group
Representational
- When an identity is denied, denigrated, or subordinated

Reasonable accommodation and undue burden

New functionality often opens new opportunities for harm
- Focusing only on benefit is irresponsible
- Focusing only on harm can cause paralysis
Need a framework for weighing harms
Reasonable accommodation
- Reduce harm when doing so is “reasonable”
- Meaning it does not cause “undue burden” on provider
Technical advances can reduce burdens and make accommodations more reasonable!

Examples involving the US legal system

Reasonable accommodations
- Wheelchair-accessible sidewalks and buildings
- Extended exam times
- Prayer breaks during the workday
Undue burdens
- Paying for an ASL interpreter (Searls v. Johns Hopkins Hospital, 2016)
- Changing employee duties (Treadwell v. Alexander, 1983)
- Paying for an audio transcription service (Dobard v. San Francisco Bay Area Rapid Transit Authority, 1993)

Ethical questions for a design decision

Who will likely benefit from this decision?
Who could it harm?
In what ways, specifically, could it benefit or harm them?
What will it take to avoid or mitigate such harm(s)?
Does the work to avoid the harm constitute a reasonable accommodation or an undue burden? For instance,
- Who has the resources to shoulder this burden?
- Who has the responsibility to shoulder it?

An example at Harvard

National Association for the Deaf v. Harvard University, 2015
“The NAD claimed that Harvard and MIT denied people who are deaf and hard of hearing equal access to the universities’ free and online programming. Notably, these lawsuits were the first of their kind to address the accuracy and quality of the captions provided.” (link)
- “The lawsuit was prompted by the recognition that, notwithstanding the description of Harvard’s online resources as available to ‘learners throughout the world,’ many of its videos and audio recordings lacked captions or used inaccurate captions. … [In response to motions to dismiss], the court ruled that federal laws prohibiting disability discrimination covered Harvard’s online content.” (link)
“Websites are public spaces. … [O]nline presence may be interpreted as a public space and would need to be accessible … [A]uto-captions (such as YouTube’s auto-transcription feature) would not be considered equitable access under the ADA due to a high error rate.” (link)

Accessible videos

“In order to ensure captions meet the accuracy standards of Harvard’s Policy, any captions produced with AI will need to be edited before the video is posted online.” (link)
- “Zoom offers automatically generated live transcripts for every meeting, and meeting hosts are encouraged to turn on live captions in Zoom as an inclusive practice. Note: If live transcripts/captions have been requested as an accommodation for an event, or if Harvard's Digital Accessibility Policy requires your event to be live captioned, auto-generated captions are not sufficient. A professional vendor such as Vitac must be used to provide live captions in such cases.”
Example prices: $120–$150/hour for live transcription, $100–$120/hour for 24–48 hour turnaround
Is this cost affordable?

Ref
This solves the allocative harm on a group, but imposes a broader harm

Data representation: Human language

History of character encoding

Written language centers on the concept of character
- “The smallest component of written language that has semantic value” (ref)
Once language is encoded in characters, characters themselves can be encoded in other forms
- Semaphore
- Braille
- Bytes!

Encoding requirements

Interchange: Messages can be exchanged without losing information
Efficiency/parsimony: Representation of language uses few bytes
Unambiguity: A message has one fixed meaning
Robustness: A message preserves meaning despite transmission errors
These goals conflict!

Encoding history

Early encodings represented only unaccented capital Roman letters
- IBM’s BCDIC based on 1880s technology developed for the US Census
To represent new characters, some encodings used shift codes that change the mode for future codes
- CCITT-2: 0b01010 means R in “letter shift” mode and 4 in “figure shift” mode
- Shift code 0b11011 changes to figure shift mode, 0b11111 to letter shift mode
Storage becomes more plentiful → more bits per character
- ASCII: 7-bit code, upper and lower case
- ISO 8859-1: 8-bit code, adds accented letters
- Many other ISO encodings support other languages
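
The shift-code mechanism can be sketched in a few lines of Python. This is only an illustration: the table is deliberately partial (it holds just the one CCITT-2 code quoted above plus the two shift codes), and the function name `decode_shifted`, the starting mode, and the `?` placeholder for unknown codes are assumptions made for the example.

```python
# Partial CCITT-2 table: only the codes quoted in the notes above.
LETTER_SHIFT = 0b11111
FIGURE_SHIFT = 0b11011
LETTERS = {0b01010: "R"}
FIGURES = {0b01010: "4"}


def decode_shifted(codes):
    """Decode 5-bit codes while tracking the current shift mode."""
    table = LETTERS  # assumption: the receiver starts in letter-shift mode
    out = []
    for code in codes:
        if code == LETTER_SHIFT:
            table = LETTERS
        elif code == FIGURE_SHIFT:
            table = FIGURES
        else:
            out.append(table.get(code, "?"))  # "?" marks codes missing from this partial table
    return "".join(out)


message = [0b01010, FIGURE_SHIFT, 0b01010, 0b01010, LETTER_SHIFT, 0b01010]
print(decode_shifted(message))  # R44R

# Losing the letter-shift code changes every character after it.
damaged = [0b01010, FIGURE_SHIFT, 0b01010, 0b01010, 0b01010]
print(decode_shifted(damaged))  # R444
```

The same code means different things depending on what came before it, which is why a single lost shift code damages the rest of the message.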

Harms and one-byte encodings

Say that the only available text encodings were ISO 8859 variants
Who would be harmed? In what way?
Can you suggest ways to alleviate these harms? Are they reasonable accommodations or undue burdens?

Enter Unicode

A single encoding to support all human languages with one code point per character
First thought 2¹⁶ = 65536 code points would suffice
Now 1,112,064 = 17×2¹⁶ − 2,048 code points (the 2,048 surrogate code points cannot encode characters)
- U+0000 – U+10FFFF; requires up to 21 bits to represent
- 149,186 code points used as of Unicode 15.0
- Covering 161 modern and historic scripts

Harms and Unicode

Does Unicode alleviate any of the harms caused by one-byte encodings?
Have any new harms been created? On whom?

Problems with 16- and 32-bit encodings for Unicode

Expensive relative to 7-bit ASCII or 8-bit ISO 8859
- 2x or 4x!
Ambiguity
- Does the byte sequence 0x65 0x00 represent e (little-endian U+0065) or 攀 (big-endian U+6500)?
- Solution: U+FEFF BYTE ORDER MARK
Incompatibility with existing programming languages
- In the C programming language, the zero byte 0x00 ends a string
- In 16- or 32-bit encodings of Unicode, zero bytes abound
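
Both problems are easy to observe with Python's built-in codecs. The snippet below is a small sketch, not part of the original lecture; it uses only the standard `utf-16`, `utf-16-le`, and `utf-16-be` codec names.

```python
data = bytes([0x65, 0x00])
print(data.decode("utf-16-le"))  # e
print(data.decode("utf-16-be"))  # 攀

# With a byte order mark (U+FEFF) in front, the generic "utf-16" codec
# can work out the byte order on its own.
print((b"\xff\xfe" + data).decode("utf-16"))  # e  (mark says little-endian)
print((b"\xfe\xff" + data).decode("utf-16"))  # 攀 (mark says big-endian)

# ASCII text fills up with zero bytes, which C string functions treat as terminators.
encoded = "Hi!".encode("utf-16-le")
print(encoded.hex(" "))  # 48 00 69 00 21 00
print(encoded.count(0))  # 3
```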

Desiderata for an 8-bit encoding for Unicode

Byte-based
Compatible with ASCII
- Any byte corresponding to ASCII should indicate that character
- Regardless of where it appears
Resistant to errors
- Not shift-based
- Self-synchronizing: only one way to decode a sequence of bytes
Relatively efficient

Idea #1: Unary

Bytes 0x00–0xFD represent characters U+0000–U+00FD
Character $C > \texttt{0xFD}$ represented by 0xFE, followed by $(C - \texttt{0xFE})$ 0xFF bytes
Examples

Text	Code points	Encoding
Hi!	U+0048 U+0069 U+0021	`0x48 0x69 0x21`
ATĀ	U+0041 U+0054 U+0100	`0x41 0x54 0xFE 0xFF 0xFF`

Desiderata

Byte-based?
Compatible with ASCII?
Not shift-based?
Self-synchronizing?
Efficient?

Text	Code points	Encoding
«Allô !»	U+00AB U+0041 U+006C U+006C U+00F4 U+0020 U+0021 U+00BB	`0xAB 0x41 0x6C 0x6C 0xF4 0x20 0x21 0xBB`
你好	U+4F60 U+597D	0xFE 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFE 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF … (42,723 bytes total)
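
A minimal sketch of Idea #1 in Python makes the cost concrete. The function name `unary_encode` is made up for this example; the rule it implements is the one stated above.

```python
def unary_encode(text):
    """Idea #1: one byte for U+0000-U+00FD, else 0xFE plus (C - 0xFE) 0xFF bytes."""
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0xFD:
            out.append(c)
        else:
            out.append(0xFE)
            out.extend(b"\xff" * (c - 0xFE))
    return bytes(out)


print(unary_encode("Hi!").hex(" "))  # 48 69 21
print(unary_encode("ATĀ").hex(" "))  # 41 54 fe ff ff
print(len(unary_encode("你好")))      # 42723
```

Two characters of Chinese text cost about 42 KB, while the same number of Western European characters cost two bytes.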

Harm, undue burden, reasonable accommodation

Idea #1:
- Bytes 0x00–0xFD represent U+0000–U+00FD
- Character $C > \texttt{0xFD}$ represented by 0xFE, followed by $(C - \texttt{0xFE})$ 0xFF bytes
Say that Idea #1 was the only available text encoding
Who would be harmed? In what way?
What harms would be alleviated? And from whom?
Does this encoding represent a reasonable accommodation?

Idea #2: Byte pairs

Bytes 0x00–0x7F represent characters U+0000–U+007F
Divide $C > \texttt{0x7F}$ into two 7-bit blocks
- $C_0$ has bits 0–6 of $C$, and $C_1$ has bits 7–13
- Alternately, $C_0 = C \bmod 128$ and $C_1 = \lfloor C / 128 \rfloor$
- $C = C_0 + 128 \times C_1$, where $0 \leq C_0, C_1 \leq \texttt{0x7F}$
- Use bytes ⟨$\texttt{0x80} + C_1$⟩ ⟨$\texttt{0x80} + C_0$⟩

Examples

Text	Code points	Encoding
Hi!	U+0048 U+0069 U+0021	`0x48 0x69 0x21`
ATĀ	U+0041 U+0054 U+0100	`0x41 0x54 0x82 0x80`
«Allô !»	U+00AB U+0041 U+006C U+006C U+00F4 U+0020 U+0021 U+00BB	`0x81 0xAB 0x41 0x6C 0x6C 0x81 0xF4 0x20 0x21 0x81 0xBB`
你好	U+4F60 U+597D	…?
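
Idea #2 can be sketched the same way. The name `pair_encode` and the choice to raise `ValueError` for code points that need more than 14 bits are assumptions of this example; the notes above leave that case open.

```python
def pair_encode(text):
    """Idea #2: ASCII as one byte, anything else as two bytes of 7 bits each."""
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0x7F:
            out.append(c)
        elif c < 1 << 14:
            c1, c0 = divmod(c, 128)  # c1 = bits 7-13, c0 = bits 0-6
            out += bytes([0x80 + c1, 0x80 + c0])
        else:
            raise ValueError(f"U+{c:04X} does not fit in 14 bits")
    return bytes(out)


print(pair_encode("ATĀ").hex(" "))       # 41 54 82 80
print(pair_encode("«Allô !»").hex(" "))  # 81 ab 41 6c 6c 81 f4 20 21 81 bb

try:
    pair_encode("你好")
except ValueError as err:
    print(err)  # U+4F60 does not fit in 14 bits
```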

Say that Idea #2 was the only available text encoding
Who would be harmed? In what way?
What harms would be alleviated? And from whom?
Does this encoding represent a reasonable accommodation?

Idea #3: Byte triples

Bytes 0x00–0x7F represent characters U+0000–U+007F
Divide $C > \texttt{0x7F}$ into three 7-bit blocks
- $C_0$ has bits 0–6 of $C$; $C_1$ has bits 7–13; $C_2$ has bits 14–20
- Use bytes ⟨$\texttt{0x80} + C_2$⟩ ⟨$\texttt{0x80} + C_1$⟩ ⟨$\texttt{0x80} + C_0$⟩

Examples

Text	Code points	Encoding
Hi!	U+0048 U+0069 U+0021	`0x48 0x69 0x21`
ATĀ	U+0041 U+0054 U+0100	`0x41 0x54 0x80 0x82 0x80`
«Allô !»	U+00AB U+0041 U+006C U+006C U+00F4 U+0020 U+0021 U+00BB	`0x80 0x81 0xAB 0x41 0x6C 0x6C 0x80 0x81 0xF4 0x20 0x21 0x80 0x81 0xBB`
你好	U+4F60 U+597D	`0x81 0x9E 0xE0 0x81 0xB2 0xFD`

Desiderata
- Byte-based?
- Compatible with ASCII?
- Not shift-based?
- Self-synchronizing? No
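
To see why Idea #3 fails the self-synchronization test, here is a sketch of an encoder and a decoder (the names `triple_encode` and `triple_decode` are made up for this example). A reader that starts one byte late has no way to notice, because every byte of a triple looks the same: all are 0x80 or above.

```python
def triple_encode(text):
    """Idea #3: ASCII as one byte, anything else as three bytes of 7 bits each."""
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0x7F:
            out.append(c)
        else:
            out += bytes([0x80 + (c >> 14), 0x80 + ((c >> 7) & 0x7F), 0x80 + (c & 0x7F)])
    return bytes(out)


def triple_decode(data):
    """Decode to code points, trusting that data[0] starts a character
    and that every triple is complete."""
    code_points = []
    i = 0
    while i < len(data):
        if data[i] <= 0x7F:
            code_points.append(data[i])
            i += 1
        else:
            b2, b1, b0 = (b - 0x80 for b in data[i:i + 3])
            code_points.append((b2 << 14) | (b1 << 7) | b0)
            i += 3
    return code_points


encoded = triple_encode("你好")
print(encoded.hex(" "))                               # 81 9e e0 81 b2 fd
print([hex(c) for c in triple_decode(encoded)])       # ['0x4f60', '0x597d']

# Start one byte late: the bytes still decode, silently, to a different code point.
print([hex(c) for c in triple_decode(encoded[1:4])])  # ['0x7b001']
```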

Idea #4: Self-synchronizing byte quadruplets

Bytes 0x00–0x7F represent characters U+0000–U+007F
Divide $C > \texttt{0x7F}$ into four 6-bit blocks
- $C_0$ has bits 0–5 of $C$; $C_1$ has bits 6–11; $C_2$ has bits 12–17; $C_3$ has bits 18–20
- Use bytes ⟨$\texttt{0xC0} + C_3$⟩ ⟨$\texttt{0x80} + C_2$⟩ ⟨$\texttt{0x80} + C_1$⟩ ⟨$\texttt{0x80} + C_0$⟩
- Can always tell if a byte starts a character (self-synchronizing)
  - 0x00–0x7F is a one-byte character, 0xC0–0xC4 starts a four-byte character, 0x80–0xBF continues a four-byte character

Examples

Text	Code points	Encoding
Hi!	U+0048 U+0069 U+0021	`0x48 0x69 0x21`
ATĀ	U+0041 U+0054 U+0100	`0x41 0x54 0xC0 0x80 0x84 0x80`
«Allô !»	U+00AB U+0041 U+006C U+006C U+00F4 U+0020 U+0021 U+00BB	`0xC0 0x80 0x82 0xAB 0x41 0x6C 0x6C 0xC0 0x80 0x83 0xB4 0x20 0x21 0xC0 0x80 0x82 0xBB`
你好	U+4F60 U+597D	`0xC0 0x84 0xBD 0xA0 0xC0 0x85 0xA5 0xBD`
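
A sketch of Idea #4 shows what self-synchronization buys. The helper names `quad_encode` and `next_boundary` are assumptions of this example. Because continuation bytes (0x80–0xBF) never overlap with start bytes, a reader dropped at an arbitrary offset can skip forward to the next character boundary.

```python
def quad_encode(text):
    """Idea #4: ASCII as one byte, anything else as a start byte plus three continuation bytes."""
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0x7F:
            out.append(c)
        else:
            out += bytes([
                0xC0 + (c >> 18),
                0x80 + ((c >> 12) & 0x3F),
                0x80 + ((c >> 6) & 0x3F),
                0x80 + (c & 0x3F),
            ])
    return bytes(out)


def next_boundary(data, i):
    """Return the first index >= i at which a character starts."""
    while i < len(data) and 0x80 <= data[i] <= 0xBF:
        i += 1
    return i


encoded = quad_encode("你好")
print(encoded.hex(" "))           # c0 84 bd a0 c0 85 a5 bd
print(next_boundary(encoded, 0))  # 0: already at a start byte
print(next_boundary(encoded, 1))  # 4: skipped three continuation bytes
```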

Say that Idea #4 was the only available text encoding
Who would be harmed? In what way?
What harms would be alleviated? And from whom?
Does this encoding represent a reasonable accommodation?

UTF-8

Bytes 0x00–0x7F represent characters U+0000–U+007F
Divide $\texttt{0x80} \leq C \leq \texttt{0x7FF}$ into two blocks and encode in two bytes
- $C_0$ has bits 0–5 of $C$; $C_1$ has bits 6–10
- Use bytes ⟨$\texttt{0xC0} + C_1$⟩ ⟨$\texttt{0x80} + C_0$⟩
- First byte has value 0xC2–0xDF
Divide $\texttt{0x800} \leq C \leq \texttt{0xFFFF}$ into three blocks and encode in three bytes
- $C_0$ has bits 0–5 of $C$; $C_1$ has bits 6–11; $C_2$ has bits 12–15
- Use bytes ⟨$\texttt{0xE0} + C_2$⟩ ⟨$\texttt{0x80} + C_1$⟩ ⟨$\texttt{0x80} + C_0$⟩
- First byte has value 0xE0–0xEF
Divide $\texttt{0x10000} \leq C \leq \texttt{0x10FFFF}$ into four blocks and encode in four bytes
- $C_0$ has bits 0–5 of $C$; $C_1$ has bits 6–11; $C_2$ has bits 12–17; $C_3$ has bits 18–20
- Use bytes ⟨$\texttt{0xF0} + C_3$⟩ ⟨$\texttt{0x80} + C_2$⟩ ⟨$\texttt{0x80} + C_1$⟩ ⟨$\texttt{0x80} + C_0$⟩
- First byte has value 0xF0–0xF4
Sketched on a placemat in 1992 by Ken Thompson, one of the Unix inventors (citation)
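
The rules above translate almost line for line into code. The following is a teaching sketch, not a production encoder: the function names are made up, and it does not reject surrogate code points the way a strict UTF-8 encoder must. It checks itself against Python's built-in `str.encode("utf-8")`.

```python
def utf8_encode_code_point(c):
    """Encode one code point following the four cases listed above."""
    if c <= 0x7F:
        return bytes([c])
    if c <= 0x7FF:
        return bytes([0xC0 + (c >> 6), 0x80 + (c & 0x3F)])
    if c <= 0xFFFF:
        return bytes([0xE0 + (c >> 12), 0x80 + ((c >> 6) & 0x3F), 0x80 + (c & 0x3F)])
    if c <= 0x10FFFF:
        return bytes([
            0xF0 + (c >> 18),
            0x80 + ((c >> 12) & 0x3F),
            0x80 + ((c >> 6) & 0x3F),
            0x80 + (c & 0x3F),
        ])
    raise ValueError("code point beyond U+10FFFF")


def utf8_encode(text):
    return b"".join(utf8_encode_code_point(ord(ch)) for ch in text)


for sample in ["Hi!", "ATĀ", "«Allô !»", "你好", "\U0001F48F"]:
    # Compare against the built-in encoder for each sample.
    assert utf8_encode(sample) == sample.encode("utf-8")
    print(sample, utf8_encode(sample).hex(" "))
```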

Advantages of UTF-8

Compatible with existing software and libraries
- Every ASCII file is also a UTF-8 file with the same meaning
- UTF-8 uses the zero byte only to encode U+0000, so existing C library functions continue to work on UTF-8 strings! (UTF-16 texts contain tons of zero bytes)
Resistant to errors
- No synchronization issues: an error in one byte affects at most one character
- Contrast the telegraphy system, where an error in a shift code can affect the interpretation of all future characters
Relatively efficient
- Unicode code points U+0080–U+07FF can be represented in two bytes
- So texts in European Latin script, African Latin script, Greek script, Cyrillic script, and most Arabic-script languages take no more space than in UTF-16
- But other scripts may take more space than UTF-16: Brahmic (Indic), Han, Japanese, Korean
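
The efficiency and error-resistance claims can be checked with the built-in codecs. This is a small sketch; the sample words and the helper name `sizes` are choices made for this example, not measurements from the lecture.

```python
def sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for `text`."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {"English": "Hello", "Russian": "Привет", "Hindi": "नमस्ते", "Chinese": "你好"}
for name, text in samples.items():
    points, utf8, utf16 = sizes(text)
    print(f"{name:8} code points: {points:2}  UTF-8: {utf8:2}  UTF-16: {utf16:2}")

# Error resistance: overwrite one byte of the middle character, then decode leniently.
damaged = bytearray("你好吗".encode("utf-8"))
damaged[3] = 0x41  # the first byte of the second character becomes ASCII "A"
recovered = damaged.decode("utf-8", errors="replace")
print(recovered)  # the first and last characters survive intact
```

The Latin and Cyrillic samples are no larger in UTF-8 than in UTF-16, while the Devanagari and Han samples grow from two bytes per code point to three.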

UTF-8 deployment

Now over 97% of web pages!
Increasingly the default in programming languages and operating systems

Ref, ref

Harms in UTF-8

Say that Unicode UTF-8 were the only available text encoding
Who would be harmed? In what way?
What harms would be alleviated? And from whom?
Would requiring this encoding represent a reasonable accommodation?

Case studies

Emoji

If you’re interested in digging deeper into encoding issues, consider emoji
Intersection between technical and social issues
Many of the same high-level issues encountered in encoding written language recur
- Similar solutions recur too

Emoji history

People have built tiny pictures from punctuation for years, including artists and poets but also scientists and engineers
- September 1982: A message thread at CMU spins out of control. Tone is hard to communicate electronically!
- “Maybe we should adopt a convention of putting a star (*) in the subject field of any notice which is to be taken as a joke.”
- “I propose the following character sequence for joke markers: :-) Read it sideways.” (citation)
Modern emoji were invented for cell phones in Japan by 1997
- Carriers competed on their emoji sets
- Initially not interoperable!

Emoji and Unicode

Emoji became so popular that Unicode encoding was inevitable
Enormous technical challenges
- First characters where color is important—font standards require changes
Social challenges: Which emoji deserve representation?
- Unicode code points are a finite resource
- The powerful start lobbying

Will a fixed emoji set suffice?

Issues of unequal representation
Emoji sets from Japan used a light skin tone for many characters

SoftBank is a Japanese cell phone carrier, the original inventors of emoji. This is from their 1999 set, the first set that had color (and animations). Image from Emojipedia
How to represent varieties of skin tone? Some designers went for non-representational tones (gray, bright yellow), but that didn’t suffice
Make 5x as many emoji? Why stop at 5x?

Emoji and culture

Is character encoding normative?
- Images that seem inoffensive in one culture and time aren’t appropriate in others
- People interpret the default emoji set as a statement by society, or by the computer industry, of what is right or most normal
- Consider some characters added to Unicode 6.0 in 2010, adopted from Japanese mobile phone emoji sets:
  
  U+1F48F KISS U+1F46F WOMAN WITH BUNNY EARS

Reducing undesirable signification

Emoji representations lose specificity

U+1F48F in 2010 U+1F48F in 2020
Emoji representations change appearance more fundamentally

U+1F46F in 2010 U+1F46F in 2020
- The standards now say the emoji is “most popularly depicted as two women dancing”; some redefine it to be gender neutral, as “people with bunny ears” or “party”

Representing more cultures

Erasing cultural differences is not the best way to achieve equality
- People want to be represented!
Impossible or inefficient to represent all important representations with individual code points
Try variable-length encoding?

Example: Kiss

U+1F48F KISS 💏 now represents gender-nonspecific people
To represent more specific people kissing, use a Unicode combiner, U+200D ZERO WIDTH JOINER, invented to represent scripts such as Arabic and Indic where sometimes characters are visually connected
“Kiss: Woman, Man” 👩‍❤️‍💋‍👨 is represented as:
- U+1F469 WOMAN 👩
- U+200D ZERO WIDTH JOINER
- RED HEART ❤️, which is represented as…
  - U+2764 HEAVY BLACK HEART ❤︎
  - U+FE0F EMOJI VARIATION SELECTOR (formally VARIATION SELECTOR-16)
- U+200D ZERO WIDTH JOINER
- U+1F48B KISS MARK 💋
- U+200D ZERO WIDTH JOINER
- U+1F468 MAN 👨
- That might seem crazy, but look at the HTML source using hexdump -C to verify. It takes 27 bytes: f0 9f 91 a9 (WOMAN) e2 80 8d (ZWJ) e2 9d a4 (HEAVY BLACK HEART) ef b8 8f (EMOJI VARIATION SELECTOR) e2 80 8d (ZWJ) f0 9f 92 8b (KISS MARK) e2 80 8d (ZWJ) f0 9f 91 a8 (MAN).
But this representation naturally generalizes to “Kiss: Man, Man” and “Kiss: Woman, Woman”! (Or “Kiss: Male Astronaut, Female Zombie”)
Other sequences can represent skin tone, hair color, other features
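
The joiner sequence can be built and measured directly. This is a sketch with made-up names (`kiss`, `WOMAN`, `MAN`); whether the result renders as a single glyph depends on the font and platform, which the code cannot check.

```python
ZWJ = "\u200d"         # ZERO WIDTH JOINER
VS16 = "\ufe0f"        # variation selector requesting emoji presentation
HEART = "\u2764"       # HEAVY BLACK HEART
KISS_MARK = "\U0001F48B"
WOMAN = "\U0001F469"
MAN = "\U0001F468"


def kiss(first, second):
    """Join two people into a kiss sequence: person, heart, kiss mark, person."""
    return ZWJ.join([first, HEART + VS16, KISS_MARK, second])


woman_man = kiss(WOMAN, MAN)
encoded = woman_man.encode("utf-8")
print(len(woman_man))    # 8 code points
print(len(encoded))      # 27 bytes
print(encoded.hex(" "))

# The same recipe generalizes without any new code points.
print(kiss(MAN, MAN), kiss(WOMAN, WOMAN))
```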

Presented without comment

From The Unicode Emoji technical standard:

Gas pump