The input language for sh61
command lines is described in terms of a BNF
grammar, where BNF stands for Backus–Naur Form, or Backus Normal Form.
BNF is a declarative notation for describing a language, which in this
context just means a set of strings. BNF notation consists of three pieces:
- Terminals, such as
"x"
, are strings of characters that must exactly match characters in the input. - Nonterminals (or symbols for short), such as
lettera
, represent sets of strings. One of the nonterminals is called the root or start symbol of the grammar. By convention, the root is the first nonterminal mentioned in the grammar. - Rules, such as
lettera ::= "a"
orword ::= letter word
, define how nonterminals and strings relate.
A string is in the language if it can be obtained by following the rules of the grammar starting from a single instance of the root symbol. Any context-free language can be described this way.
Here’s an example grammar that can recognize exactly two sentences, namely
"cs61 is awesome"
and "cs61 is terrible"
:
cs61 ::= "cs61 is " emotionword
emotionword ::= "awesome" | "terrible"
It’s useful to be able to read BNF grammars because they’re very common in computer science and programming. Many specification documents use BNF to define languages, including programming languages, configuration files, and even network packet formats.
These exercises show features of BNF grammars using simple examples. Peek at some answers and then try to answer the questions on your own.
Exercise: Define a grammar for the empty language, which is a language containing no strings. (There is no valid sentence in the empty language.)
shutup ::=
This grammar has one nonterminal, the root symbol
shutup
. The::=
notation has nothing on the right, indicating that no string will matchshutup
.
Exercise: Define a grammar for the children language (because children should be “not heard”). The empty string is the only valid sentence in this language.
children ::= ""
Note the difference between
children
andshutup
. Theshutup
language contains no sentences: no string is valid in this language, not even the empty string. Thechildren
language contains one sentence: the empty string""
is valid.
Exercise: Define a grammar for the letter language. A letter is a
lower-case Latin letter between a
and z
.
letter ::= "a" | "b" | "c" | "d" | ... | "z"
The
|
symbol indicates alternate choices for the rule’s definition. It’s actually shorthand for multiple rules for the same nonterminal:letter ::= "a" letter ::= "b" letter ::= "c" letter ::= "d" ... letter ::= "z"
Exercise: Define a grammar for the word language. A word is a sequence
of one or more letters. You may refer to the letter
nonterminal defined
above.
word ::= letter | word letter
This grammar features a recursive rule! Why is
"butt"
a valid sentence in this language? Well:
"b"
is aword
(a sentence in the word language) because it is aletter
. (This uses the first alternative,word ::= letter
.)"bu"
is aword
because"b"
is a word (see above) and"u"
is a letter. (This uses this second alternative,word ::= word letter
.)"but"
is aword
because"bu"
is a word and"t"
is a letter."butt"
is aword
because"but"
is a word and"t"
is a letter.
Exercise: Define two different grammars for the word language.
word1 ::= letter | word1 letter word2 ::= letter | letter word2 word3 ::= letter | word3 word3
Each of these grammars recognizes the same language: every string that matches the
word1
nonterminal matchesword2
andword3
, and every string that doesn’t matchword1
doesn’t matchword2
orword3
either.It is possible to prove that these simple grammars recognize the same language, but in general, testing whether two context-free grammars recognize the same language is undecidable! There is no algorithm that can reliably report whether two grammars are effectively the same. If this blows your mind (and it should), take CS121.
Exercise: Define a grammar for the parenthesis language, where all sentences in the parenthesis language consist of balanced pairs of left and right parentheses. Some examples:
Valid | Invalid |
---|---|
() |
())( |
(()) |
( |
()() |
)( |
(((()())()())) |
))))))))))( |
An important question is, is the empty string a valid sentence in this language? Let’s say yes. Then this works:
parens ::= "(" parens ")" | parens parens | ""
Note that the following subset grammar accepts no finite strings:
parens2 ::= "(" parens2 ")"
This is because this recursion has no base case. (You could argue that the infinite string consisting of an infinite number of
(
s, followed by an infinite number of)
s, would match the grammar, but also you could argue that it would not, because infinity is weird.)The balanced parenthesis language is context-free, but not regular. In practical terms, this means that no regular expression can recognize the language. Regular expressions are too limited to remember a nesting depth.
The shell grammar
This is the BNF grammar for sh61
, as found in problem set 5. (Note that this
grammar doesn’t define word
s, and doesn’t define where whitespace is
required. That kind of looseness is typical in systems grammars; in your
case helpers.cc
already implements the necessary rules.)
commandline ::= list | list ";" | list "&" list ::= conditional | list ";" conditional | list "&" conditional conditional ::= pipeline | conditional "&&" pipeline | conditional "||" pipeline pipeline ::= command | pipeline "|" command command ::= word | redirection | command word | command redirection redirection ::= redirectionop filename redirectionop ::= "<" | ">" | "2>"
The recursive definitions allow for lists or trees of terms to be chained
together. Taking list
as an example:
list ::= conditional | list ";" conditional | list "&" conditional
This reads “a list
is a conditional
, or a list
followed by a semicolon and
then a conditional
, or a list
followed by an ampersand
and then a
conditional
.” But this means that the list
is just a bunch of
conditional
s, linked by semicolons or ampersands! Notice that the other
recursive definitions in sh61
’s grammar also follow this pattern. In other
words:
- A
list
is a series ofconditionals
, concatenated by;
or&
. - A
conditional
is a series ofpipelines
, concatenated by&&
or||
. - A
pipeline
is a series ofcommand
s, concatenated by|
. - A
redirection
is one of<
,>
, or2>
, followed by afilename
.
We’ve written out a full definition of command
, but BNF-style
grammars seen in the wild often use shorthand, such as:
command ::= [word or redirection]...
This shorthand, like the full definition, says that a command is a series of words representing the program name and its arguments, interspersed with redirection commands.