In this lecture, we discuss the ways harms and technical systems can interact, using language coding systems as an example.
This material is by Eddie Kohler, Eliza Wells, and William Cochran.
Harm
- Harm: damage, injury
- A central concept in many ethical systems, including “classical” liberalism
- “People should be free to act however they wish unless their actions cause harm to somebody else”
- But what counts as harm?
- And who counts as a person?
- Useful to name harms apart from a specific value system
- Naming a harm does not imply intent
- Evaluating and weighing harms and benefits comes later
Kinds of harm
- Physical
- Emotional
- Political
- Allocative
- Representational
Reasonable accommodation and undue burden
- New functionality often opens new opportunity for harm
- Focusing only on benefit is irresponsible
- Focusing only on harm can cause paralysis
- Need a framework for weighing harms
- Reasonable accommodation
- Reduce harm when doing so is “reasonable”
- Meaning it does not cause “undue burden” on provider
Examples from real life according to the US legal system
- Reasonable accommodations
- Wheelchair-accessible sidewalks and buildings
- Extended exam times
- Prayer breaks during the workday
- Undue burdens
- Paying for an ASL interpreter (Searls v. Johns Hopkins Hospital, 2016)
- Changing employee duties (Treadwell v. Alexander, 1983)
- Paying for an audio transcription service (Dobard v. San Francisco Bay Area Rapid Transit Authority, 1993)
Ethical questions for a design decision
- Who will likely benefit from this decision?
- Who could it harm?
- In what ways, specifically, could it benefit or harm them?
- What will it take to avoid or mitigate such harm(s)?
- Does the work to avoid the harm constitute a reasonable accommodation or an undue burden? For instance:
  - Who has the resources to shoulder this burden?
  - Who has the responsibility to shoulder it?
History of character encoding
- Written language centers on the concept of character
- “The smallest component of written language that has semantic value” (ref)
- Once language is encoded in characters, characters themselves can be encoded in other forms
Encoding requirements
- Interchange: Messages can be exchanged without losing information
- Efficiency/parsimony: Representation of language uses few bytes
- Unambiguity: A message has one fixed meaning
- These goals conflict!
Encoding history
- Early encodings represented only unaccented capital Roman letters
- IBM’s BCDIC based on 1880s technology developed for the US Census
- To represent new characters, some encodings used shift codes that change the mode for future codes
  - CCITT-2: 0b01010 means R in “letter shift” mode and 4 in “figure shift” mode
  - Shift code 0b11011 changes to figure shift mode, 0b11111 to letter shift mode
- Storage becomes more plentiful → more bits per character
- ASCII: 7-bit code, upper and lower case
- ISO 8859-1: 8-bit code, adds accented letters
- Many other ISO encodings support other languages
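The shift-code idea can be sketched in a few lines of Python. This is only an illustration: the table holds just the single CCITT-2 entry quoted above, the function name `decode_ccitt2` is hypothetical, and starting in letter mode is an assumption.

```python
# Shift codes quoted in the notes above.
LETTER_SHIFT = 0b11111
FIGURE_SHIFT = 0b11011

# Deliberately partial table: only the one code mentioned in the notes.
# Each entry is (meaning in letter mode, meaning in figure mode).
TABLE = {0b01010: ("R", "4")}


def decode_ccitt2(codes):
    """Decode a sequence of 5-bit codes, tracking the current shift mode."""
    mode = 0  # assumption: decoding starts in letter mode
    out = []
    for code in codes:
        if code == LETTER_SHIFT:
            mode = 0
        elif code == FIGURE_SHIFT:
            mode = 1
        else:
            out.append(TABLE[code][mode])
    return "".join(out)


R = 0b01010
print(decode_ccitt2([LETTER_SHIFT, R, FIGURE_SHIFT, R, R]))  # R44
print(decode_ccitt2([LETTER_SHIFT, R, R, R]))                # RRR (shift code lost)
```

The second call shows why shift codes are fragile: losing one shift code changes the meaning of every code that follows it.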
Harms and one-byte encodings
- Say that the only available text encodings were ISO 8859 variants
- Who would be harmed? In what way?
- Can you suggest ways to alleviate these harms? Are they reasonable accommodations or undue burdens?
Enter Unicode
- A single encoding to support all human languages with one code point per character
- First thought 2^16 = 65,536 code points would suffice
- Now 1,112,064 ≈ 17×2^16 code points
- 149,186 used as of Unicode 15.0
- Covering 161 modern and historic scripts
Harms and Unicode
- Does Unicode alleviate any of the harms caused by one-byte encodings?
- Have any new harms been created? On whom?
Problems with 16- and 32-bit encodings for Unicode
- Expensive relative to 7-bit ASCII or 8-bit ISO 8859
- 2x or 4x!
- Ambiguity
  - Does the byte sequence 0x65 0x00 represent e (little-endian U+0065) or 攀 (big-endian U+6500)?
  - U+FEFF BYTE ORDER MARK at the start of a text signals the byte order
- Incompatibility with existing programming languages
  - In the C programming language, a null character 0x00 ends a string
  - In 16- or 32-bit-encoded Unicode, null characters abound
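Both problems are easy to reproduce with Python's built-in codecs. The following is a minimal sketch; the `split` call only imitates how a C string function would stop at the first zero byte.

```python
import codecs

raw = bytes([0x65, 0x00])

# The same two bytes decode to different characters depending on byte order.
little = raw.decode("utf-16-le")   # U+0065
big = raw.decode("utf-16-be")      # U+6500
print(f"U+{ord(little):04X} vs U+{ord(big):04X}")

# A byte order mark in front removes the ambiguity: the generic "utf-16"
# codec reads the BOM, picks the byte order, and drops the BOM.
with_le_bom = (codecs.BOM_UTF16_LE + raw).decode("utf-16")
with_be_bom = (codecs.BOM_UTF16_BE + raw).decode("utf-16")

# ASCII text encoded as UTF-16 is full of zero bytes.
encoded = "Hi!".encode("utf-16-le")
print(encoded)                      # b'H\x00i\x00!\x00'

# Imitation of C string handling: everything after the first 0x00 is lost.
as_c_string = encoded.split(b"\x00")[0]
print(as_c_string)                  # b'H'
```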
Desiderata
- Byte-based encoding
- Compatible with ASCII
- Any byte corresponding to ASCII should indicate that character
- Regardless of where it appears
- Resistant to errors
- No shift-based encodings
- Self-synchronizing: only one way to decode a sequence of bytes
- Relatively efficient
Straw-man solution
- Characters U+0000–U+00FD represented by bytes 0x00–0xFD
- Character C > U+00FD represented by 0xFE, followed by C − 0xFE copies of the byte 0xFF
Text | Code points | Encoding |
---|---|---|
Hi! | U+0048 U+0069 U+0021 | 0x48 0x69 0x21 |
«Allô !» | U+00AB U+0041 U+006C U+006C U+00F4 U+0020 U+0021 U+00BB | 0xAB 0x41 0x6C 0x6C 0xF4 0x20 0x21 0xBB |
你好 | U+4F60 U+597D | 0xFE 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF … 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFE 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF … (43,044 bytes omitted) |
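To make the cost concrete, here is a minimal Python sketch of the straw-man rule exactly as stated above (a 0xFE byte followed by C − 0xFE bytes of 0xFF). The function names `strawman_encode` and `strawman_decode` are hypothetical.

```python
def strawman_encode(text: str) -> bytes:
    """Encode text with the straw-man rule: unary-coded code points."""
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0xFD:
            out.append(c)                      # one byte, same value
        else:
            out.append(0xFE)                   # marker byte
            out.extend(b"\xff" * (c - 0xFE))   # C - 0xFE copies of 0xFF
    return bytes(out)


def strawman_decode(data: bytes) -> str:
    """Invert strawman_encode by counting the 0xFF bytes after each 0xFE."""
    chars = []
    i = 0
    while i < len(data):
        if data[i] == 0xFF:
            raise ValueError("0xFF without a preceding 0xFE")
        if data[i] != 0xFE:
            chars.append(chr(data[i]))
            i += 1
            continue
        i += 1
        run = 0
        while i < len(data) and data[i] == 0xFF:
            run += 1
            i += 1
        chars.append(chr(0xFE + run))
    return "".join(chars)


print(strawman_encode("Hi!").hex(" "))   # 48 69 21
print(len(strawman_encode("你")))         # 20067 bytes for one character
```

Under this rule the length of an encoded character grows linearly with its code point, so text in scripts assigned to higher code points pays a cost thousands of times larger than ASCII text.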
Harm, undue burden, reasonable accommodation
- Say that our straw man were the only available text encoding
- Who would be harmed? In what way?
- What harms would be alleviated? And from whom?
- Does this encoding represent a reasonable accommodation?
Second straw man (surrogate pairs)
- Bytes 0x00–0x7F represent characters U+0000–U+007F
- If C > U+007F, divide C into upper and lower bits
  - Let $C = 64 \times C_1 + C_0$, where $0 \leq C_1, C_0 < 64$
  - Represent U+C as [0xC0 + $C_1$] [0x80 + $C_0$]
  - So bytes 0xC0–0xFF can only represent the first byte in a two-byte representation
  - Bytes 0x80–0xBF can only represent the second byte
Third straw man
- Bytes 0x00–0x7F represent characters U+0000–U+007F
- If C > U+007F, divide C into blocks of 6 bits (base-64 digits)
  - $C = \sum_{i=0}^{N-1} C_i \times 64^i$, where $0 \leq C_i < 64$
  - Represent U+C as [0xC0 + $C_{N-1}$] [0x80 + $C_{N-2}$] ... [0x80 + $C_0$]
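The third straw man amounts to writing C in base 64 and tagging the leading digit differently from the rest. A minimal Python sketch follows; the name `straw3_encode` is hypothetical.

```python
def straw3_encode(c: int) -> bytes:
    """Encode one code point with the third straw-man scheme."""
    if c <= 0x7F:
        return bytes([c])          # ASCII range: one byte, same value
    digits = []
    while c > 0:                   # base-64 digits, least significant first
        digits.append(c % 64)
        c //= 64
    digits.reverse()               # now most significant first
    lead = 0xC0 + digits[0]        # first byte lies in 0xC0-0xFF
    tail = [0x80 + d for d in digits[1:]]   # later bytes lie in 0x80-0xBF
    return bytes([lead] + tail)


print(straw3_encode(0xE9).hex(" "))     # c3 a9
print(straw3_encode(0x4F60).hex(" "))   # c4 bd a0
```

Any code point above U+007F needs at least two base-64 digits, so a lead byte is always followed by at least one continuation byte. Unlike the two-byte second straw man, this version has no upper limit on C.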
UTF-8
- Bytes 0x00–0x7F represent characters U+0000–U+007F
- For U+0080 ≤ C ≤ U+07FF, use two bytes: [0xC0–0xDF] [0x80–0xBF]
  - Bitwise, 0b1000'0000 ≤ C ≤ 0b111'1111'1111
  - Find $C_1, C_0 < 64$ so that $C = \sum_{i=0}^{1} C_i \times 64^i$
  - $0 \leq C_1 < 32$
  - Encoding: [0xC0 + $C_1$] [0x80 + $C_0$]
- For U+0800 ≤ C ≤ U+FFFF, use three bytes: [0xE0–0xEF] [0x80–0xBF] [0x80–0xBF]
  - Find $C_2, C_1, C_0 < 64$ so that $C = \sum_{i=0}^{2} C_i \times 64^i$
  - $0 \leq C_2 < 16$
  - Encoding: [0xE0 + $C_2$] [0x80 + $C_1$] [0x80 + $C_0$]
- For U+10000 ≤ C ≤ U+10FFFF, use four bytes: [0xF0–0xF7] [0x80–0xBF] [0x80–0xBF] [0x80–0xBF]
  - Find $C_3, C_2, C_1, C_0 < 64$ so that $C = \sum_{i=0}^{3} C_i \times 64^i$
  - $0 \leq C_3 < 8$
  - Encoding: [0xF0 + $C_3$] [0x80 + $C_2$] [0x80 + $C_1$] [0x80 + $C_0$]
- Sketched on a placemat in 1992 by Ken Thompson, one of the Unix inventors (citation)
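The rules above translate almost directly into code. The following Python sketch uses a hypothetical function name, `utf8_encode`, and checks itself against Python's built-in encoder. It is not a full implementation: for example, it does not reject the surrogate range U+D800–U+DFFF as a strict encoder would.

```python
def utf8_encode(c: int) -> bytes:
    """Encode one code point following the range rules listed above."""
    if c < 0x80:
        return bytes([c])
    if c < 0x800:
        lead, n_tail = 0xC0, 1
    elif c < 0x10000:
        lead, n_tail = 0xE0, 2
    elif c <= 0x10FFFF:
        lead, n_tail = 0xF0, 3
    else:
        raise ValueError("code point out of range")
    tail = []
    for _ in range(n_tail):
        tail.append(0x80 + c % 64)   # low 6 bits become a continuation byte
        c //= 64
    tail.reverse()
    return bytes([lead + c] + tail)  # remaining high bits go in the lead byte


for text in ["Hi!", "«Allô !»", "你好", "\U0001F48F"]:
    mine = b"".join(utf8_encode(ord(ch)) for ch in text)
    assert mine == text.encode("utf-8")

print(utf8_encode(0x4F60).hex(" "))   # e4 bd a0
```

Unlike the third straw man, the lead byte here also tells the decoder how many continuation bytes follow.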
Advantages of UTF-8
- Compatible with existing software and libraries
- Every ASCII file is also a UTF-8 file with the same meaning
- UTF-8 uses the 0 byte only to encode U+0000 itself, so existing C library functions continue to work on UTF-8 strings! (UTF-16 texts contain tons of zero bytes)
- Resistant to errors
- No synchronization issues: an error in one byte affects at most one character
- Contrast the telegraphy system, where an error in a “shift byte” can affect the interpretation of all future characters
- Relatively efficient
- Unicode code points U+0080–U+07FF can be represented in two bytes
- So texts in European Latin script, African Latin script, Greek script, Cyrillic script, and most Arabic-script languages take no more space than in UTF-16
- But other scripts may take more space than UTF-16: Brahmic (Indic), Han, Japanese, Korean
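Two of these claims can be checked with a short Python sketch. The sample strings and the helper names `sizes` and `resync` are illustrative choices, not part of any standard.

```python
SAMPLES = {
    "English (Latin)": "Hello",
    "Greek": "Γεια",
    "Russian (Cyrillic)": "Привет",
    "Chinese (Han)": "你好",
}


def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def resync(data: bytes, i: int) -> int:
    """Starting at byte offset i, skip continuation bytes (0x80-0xBF)
    until the next byte that can begin a character."""
    while i < len(data) and 0x80 <= data[i] <= 0xBF:
        i += 1
    return i


for name, text in SAMPLES.items():
    utf8, utf16 = sizes(text)
    print(f"{name}: {utf8} bytes in UTF-8, {utf16} bytes in UTF-16")

# Landing in the middle of a character loses only that character.
data = "你好".encode("utf-8")
start = resync(data, 1)
print(data[start:].decode("utf-8"))   # 好
```

The size comparison depends on the script, which is exactly the trade-off in the last two bullets: Latin, Greek, and Cyrillic text is no larger than in UTF-16, while Han text takes three bytes per character instead of two.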
UTF-8 deployment
- Now over 97% of web pages!
- Increasingly the default in programming languages and operating systems
Harms in UTF-8
- Say that Unicode UTF-8 were the only available text encoding
- Who would be harmed? In what way?
- What harms would be alleviated? And from whom?
- Would requiring this encoding represent a reasonable accommodation?
Case studies
Emoji
- If you’re interested in digging deeper into encoding issues, consider emoji
- Intersection between technical and social issues
- Many of the same high-level issues encountered in encoding written language recur
- Similar solutions recur too
Emoji history
- People have built tiny pictures from punctuation for years, including artists and poets but also scientists and engineers
- September 1982: A message thread at CMU spins out of control. Tone is hard to communicate electronically!
- “Maybe we should adopt a convention of putting a star (*) in the subject field of any notice which is to be taken as a joke.”
- “I propose the following character sequence for joke markers: :-) Read it sideways.” (citation)
- Modern emoji were invented for cell phones in Japan by 1997
- Carriers competed on their emoji sets
- Initially not interoperable!
Emoji and Unicode
- Emoji became so popular that Unicode encoding was inevitable
- Enormous technical challenges
- First characters where color is important—font standards require changes
- Social challenges: Which emoji deserve representation?
- Unicode code points are a finite resource
- The powerful start lobbying
- Will a fixed emoji set suffice?
- Issues of unequal representation
- Emoji sets from Japan used a light skin tone for many characters
  - Figure: SoftBank is a Japanese cell phone carrier, the original inventors of emoji. This is from their 1999 set, the first set that had color (and animations). Image from Emojipedia
- How to represent varieties of skin tone? Some designers went for non-representational tones (gray, bright yellow), but that didn’t suffice
- Make 5x as many emoji? Why stop at 5x?
Emoji and culture
- Is character encoding normative?
- Images that seem inoffensive in one culture and time aren’t appropriate in others
- People interpret the default emoji set as a statement by society, or by the computer industry, of what is right or most normal
- Consider some characters added to Unicode 6.0 in 2010, adopted from Japanese mobile phone emoji sets:
  - U+1F48F KISS
  - U+1F46F WOMAN WITH BUNNY EARS
Reducing undesirable signification
- Emoji representations lose specificity
  - Figure: U+1F48F in 2010 and in 2020
- Emoji representations change appearance more fundamentally
  - Figure: U+1F46F in 2010 and in 2020
  - The standards now say the emoji is “most popularly depicted as two women dancing”; some redefine it to be gender neutral, as “people with bunny ears” or “party”
Representing more cultures
- Erasing cultural differences is not the best way to achieve equality
- People want to be represented!
- Impossible or inefficient to represent all important representations with individual code points
- Try variable-length encoding?
Example: Kiss
- U+1F48F KISS 💏 now represents gender-nonspecific people
- To represent more specific people kissing, use a Unicode combiner, U+200D ZERO WIDTH JOINER, invented to represent scripts such as Arabic and Indic where sometimes characters are visually connected
- “Kiss: Woman, Man” 👩❤️💋👨 is represented as:
- U+1F469 WOMAN 👩
- U+200D ZERO WIDTH JOINER
- RED HEART ❤️, which is represented as…
  - U+2764 HEAVY BLACK HEART ❤︎
  - U+FE0F VARIATION SELECTOR-16 (the emoji variation selector)
- U+200D ZERO WIDTH JOINER
- U+1F48B KISS MARK 💋
- U+200D ZERO WIDTH JOINER
- U+1F468 MAN 👨
- That might seem crazy, but look at the HTML source using hexdump -C to verify. It takes 27 bytes: f0 9f 91 a9 (WOMAN), e2 80 8d (ZWJ), e2 9d a4 (HEAVY BLACK HEART), ef b8 8f (EMOJI VARIATION SELECTOR), e2 80 8d (ZWJ), f0 9f 92 8b (KISS MARK), e2 80 8d (ZWJ), f0 9f 91 a8 (MAN).
- But this representation naturally generalizes to “Kiss: Man, Man” and “Kiss: Woman, Woman”! (Or “Kiss: Male Astronaut, Female Zombie”)
- Other sequences can represent skin tone, hair color, other features
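The joiner sequence can also be built and inspected in Python instead of with hexdump. This is a minimal sketch; the helper name `kiss` is hypothetical, and whether the result renders as a single glyph depends on the font and platform.

```python
ZWJ = 0x200D                 # ZERO WIDTH JOINER
HEART = [0x2764, 0xFE0F]     # HEAVY BLACK HEART + variation selector (ef b8 8f)
KISS_MARK = 0x1F48B
WOMAN, MAN = 0x1F469, 0x1F468


def kiss(left: int, right: int) -> str:
    """Build a "Kiss: left, right" sequence from two person code points."""
    points = [left, ZWJ, *HEART, ZWJ, KISS_MARK, ZWJ, right]
    return "".join(chr(p) for p in points)


woman_man = kiss(WOMAN, MAN)
data = woman_man.encode("utf-8")
print(len(woman_man), "code points,", len(data), "bytes")   # 8 code points, 27 bytes
print(data.hex(" "))

# Swapping the endpoints generalizes the sequence without new code points.
man_man = kiss(MAN, MAN)
woman_woman = kiss(WOMAN, WOMAN)
```

The design choice mirrors UTF-8 itself: a variable-length sequence of existing units covers combinations that would be impractical to assign individually.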