CS 201: Strings and Languages

Strings and languages I.

Summary.

Numbers

We have been swimming in a sea of numbers since childhood.

We will now turn our attention to another data abstraction: strings. (We note that strings are in fact sequences of characters, which, under the hood, are themselves numbers, e.g., ASCII or Unicode codes.)

Strings

The machine instructions executed by a computer give us a basis for understanding (1) what a program in a higher-level language (like Java) is translated to in order to be executed by a computer and (2) where our measures of the "time" and "memory" used by a program come from. For example, the wall-clock time used by a program running on a particular input is related to the number of machine language instructions that are executed in the course of running that program on that input. We now turn to item (1) -- how does a compiler translate a higher-level language program into an assembly-language (or machine-language) program?

As an example, we consider the following portion of a "mini-Java" program.


    sum = 0;
    n = 1;
    while (n < 10)
    {  sum = sum + n;
       n = n + 1;
    }
    System.out.println(sum);

This is not a terribly interesting program: it computes the sum of the numbers from 1 to 9 and prints the result. We can translate this program to the following TC-201 assembly-language program:

       TC-201 assembly language    mini-Java program

            load constant-0        (sum = 0;)
            store sum
            load constant-1        (n = 1;)
            store n
while-test  load constant-10       (while (n < 10))
            sub  n
            skippos
            jump done
body        load sum               ({sum = sum + n;)
            add n
            store sum
            load n                 (n = n + 1;})  
            add constant-1
            store n
            jump while-test        
done        load sum               (System.out.println(sum);)
            output 0
            halt 0                 (we assume the program is finished.)
sum         data 0                 for the variable sum
n           data 0                 for the variable n
constant-0  data 0                 for the constant 0
constant-1  data 1                 for the constant 1
constant-10 data 10                for the constant 10

Note that an assignment like "n = 1;" is translated to a load and a store. An expression like "sum + n" is translated to a load and an add. The while statement includes code to test the while condition (n < 10) and code corresponding to the while body to execute when the condition is true. (Note that I've replaced the skipzero test that we used in lecture with the slightly more robust skippos test above, which also better reflects the meaning of the test (n < 10).) The task of a compiler is to translate programs from a higher-level language into corresponding assembly-language programs (or all the way to machine-language programs). The most common representation of a program is as a (longish) string of characters in a file; the next few lectures will talk about notations and other technologies for dealing with strings and sets of strings.
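To make the translation concrete, here is a minimal sketch (in Python) of an interpreter for a simplified accumulator machine running the program above. The instruction semantics (load, store, add, sub, skippos, jump, output, halt) are assumptions based on how the lecture uses them, not the full TC-201 specification.

```python
# A sketch of a simplified accumulator machine, modeled on the
# TC-201 program above. Instruction semantics are assumptions
# based on the lecture's usage, not the official TC-201 spec.

def run(program, memory):
    acc, pc, out = 0, 0, []
    while True:
        op, arg = program[pc]
        pc += 1
        if op == "load":      acc = memory[arg]
        elif op == "store":   memory[arg] = acc
        elif op == "add":     acc += memory[arg]
        elif op == "sub":     acc -= memory[arg]
        elif op == "skippos": pc += (acc > 0)   # skip next if acc positive
        elif op == "jump":    pc = arg
        elif op == "output":  out.append(acc)
        elif op == "halt":    return out

# The summation program, with labels resolved to instruction indices
# (while-test is instruction 4, body is 8, done is 15).
prog = [
    ("load", "c0"), ("store", "sum"),                  # sum = 0;
    ("load", "c1"), ("store", "n"),                    # n = 1;
    ("load", "c10"), ("sub", "n"),                     # while-test: 10 - n
    ("skippos", None), ("jump", 15),                   # exit loop unless n < 10
    ("load", "sum"), ("add", "n"), ("store", "sum"),   # sum = sum + n;
    ("load", "n"), ("add", "c1"), ("store", "n"),      # n = n + 1;
    ("jump", 4),                                       # back to while-test
    ("load", "sum"), ("output", None),                 # println(sum)
    ("halt", None),
]
mem = {"sum": 0, "n": 0, "c0": 0, "c1": 1, "c10": 10}
print(run(prog, mem))   # the sum 1 + 2 + ... + 9
```

Note how each mini-Java statement becomes the short load/store/add sequence described above, and how the while loop becomes a test, a conditional skip, and two jumps.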

grep vs egrep

egrep is "extended grep", which adds the metacharacters +, ?, and |.

egrep is equivalent to "grep -E".

Searching: another use of strings.

Suppose I have a score file (like the one I email to you with your homework scores) named grade3.txt, and I want to see the lines with an occurrence of the string WRONG. I can use the Linux utility egrep as follows.

> egrep "WRONG" grade3.txt
Your output is WRONG: (conf 'q4 '() 'b '(x y))

The search found one line of the file containing the string WRONG as a substring, and printed it out. Now suppose I want to see the line(s) with the string Total, to see the total score.

> egrep "Total" grade3.txt
Total test cases score: 90 / 91
+ 9/9 for TM descriptions: Total: 99/100

This search found two lines in the file containing the string Total and printed them out. If I want to see the scores for each problem, I can search for the lines containing the string Problem as follows.


> egrep "Problem" grade3.txt
========= Problem 0 =========
Problem 0 test cases score: 1 / 1
========= Problem 1 =========
Problem 1 test cases score: 12 / 12
========= Problem 2 =========
Problem 2 test cases score: 10 / 10
========= Problem 3 =========
Problem 3 test cases score: 9 / 9
========= Problem 4 =========
Problem 4 test cases score: 10 / 10
========= Problem 5 =========
Problem 5 test cases score: 9 / 10
========= Problem 6 =========
Problem 6 test cases score: 15 / 15
========= Problem 7 =========
Problem 7 test cases score: 12 / 12
========= Problem 8 =========
Problem 8 test cases score: 12 / 12
>

This is all the lines containing the string Problem, but the ones with all the ===='s do not convey any information. I can specify a pattern that will match just the lines above with scores. A sufficient pattern would be the word Problem followed by a blank followed by any single character followed by a blank followed by the string test. I can specify that as follows.


> egrep "Problem . test" grade3.txt
Problem 0 test cases score: 1 / 1
Problem 1 test cases score: 12 / 12
Problem 2 test cases score: 10 / 10
Problem 3 test cases score: 9 / 9
Problem 4 test cases score: 10 / 10
Problem 5 test cases score: 9 / 10
Problem 6 test cases score: 15 / 15
Problem 7 test cases score: 12 / 12
Problem 8 test cases score: 12 / 12
>

The "." in the pattern does not literally match a period, but will match any single character. This pattern matched all the lines with scores, and did not match the lines with the ===='s. Note that if there were ten or more problems, this pattern would fail to match those with numbers 10 or above. If I want a list of the problem scores and also the total scores, I can specify a somewhat more complex pattern as follows.
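The same matching behavior can be reproduced with Python's re module, whose syntax for ".", "+", and character classes matches egrep's; the two-digit line below is hypothetical, added just to show the failure mode:

```python
import re

lines = [
    "========= Problem 5 =========",
    "Problem 5 test cases score: 9 / 10",
    "Problem 12 test cases score: 7 / 10",   # hypothetical two-digit problem
]

# "." matches exactly one character, so this pattern misses "Problem 12".
pat = re.compile(r"Problem . test")
assert [l for l in lines if pat.search(l)] == [lines[1]]

# "[0-9]+" (one or more digits) also matches two-digit problem numbers.
pat2 = re.compile(r"Problem [0-9]+ test")
assert [l for l in lines if pat2.search(l)] == [lines[1], lines[2]]
```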

> egrep "(Problem . test)|Total" grade3.txt 
Problem 0 test cases score: 1 / 1
Problem 1 test cases score: 12 / 12
Problem 2 test cases score: 10 / 10
Problem 3 test cases score: 9 / 9
Problem 4 test cases score: 10 / 10
Problem 5 test cases score: 9 / 10
Problem 6 test cases score: 15 / 15
Problem 7 test cases score: 12 / 12
Problem 8 test cases score: 12 / 12
Total test cases score: 90 / 91
+ 9/9 for TM descriptions: Total: 99/100

This pattern matches any line that contains either the string Total OR a string consisting of Problem, blank, any single character, blank, and then test.

The Linux utility egrep is just one tool that uses regular expressions -- they feature prominently in scripting languages like Perl, because one important use of scripting languages is to transform data in one format into another format. Regular expressions are used in Perl to match and extract portions of an input file, so that they may be rearranged and reformatted for an output file. Unfortunately many tools use slightly different conventions for regular expressions, so it pays to be alert for possible differences. For our practical examples, we'll refer to egrep.

Many other fields make heavy use of strings, languages and operations on strings, for example, linguistics, in which strings of letters or phonemes are a common representation for words and sentences, and biology, in which strings of nucleotides (represented by the four letters A, C, G, and T) are used to encode DNA sequences, for example, GATTACA, and strings of amino acids (which in nature form an alphabet of 20 elements) to encode proteins. The machinery of the cell translates DNA sequences to RNA sequences (which have a U instead of T), and RNA sequences to amino acid sequences. The latter process is carried out by ribosomes, which can be thought of as finite-state "transducers." (In effect, every cell in a human body teems with finite-state machines.) In biology, one important operation is to find "approximate matches" of DNA strings in huge databases of very long DNA strings encoding the genetic information about various organisms.

Basics of strings and languages.

Alphabet

An alphabet is a finite set of symbols, for example, $\{a, c, d, r\}$. A string is a finite sequence of symbols, for example, $daccr$. The length of a string is the number of occurrences of symbols in it, for example, length(daccr) = 5. There is a unique string of length 0, the empty string, which we will denote by a "variant epsilon" ϵ (which looks a bit like a backwards 3).

Concatenation

We can concatenate two strings, an operation indicated by a centered dot. This constructs the string that consists of the symbols of the first string followed by the symbols of the second string. For example, the concatenation of dacc and rda is daccrda. (Note the resemblance to appending two lists.) The concatenation operation is not commutative, for example, the concatenation of $rda$ and $dacc$ is $rdadacc$, which is not equal to daccrda. The operation of concatenation is associative, which allows us to leave out parentheses, since (($x$ concatenated with $y$) concatenated with $z$) is equal to ($x$ concatenated with ($y$ concatenated with $z$)). The empty string is an identity for concatenation, because the empty string concatenated with any string is that string itself, and similarly for that string concatenated with the empty string. This means that the set of all strings over an alphabet with the operation of concatenation is a monoid. (Increment cool word score here.)
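These algebraic laws are easy to check concretely; in Python, for instance, string concatenation is written with + and obeys the same monoid laws:

```python
x, y, z = "dacc", "rda", "car"

assert x + y == "daccrda"            # concatenation
assert y + x != x + y                # not commutative
assert (x + y) + z == x + (y + z)    # associative
assert "" + x == x == x + ""         # the empty string is the identity
```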

Languages

A language is a set (empty, finite, or infinite) of strings. For example, the set $\{car, cdr\}$ is a language containing just two strings. The empty set (of no strings) is a (rather trivial) language. The set containing just the empty string is also pretty trivial, but because it has one element, it is not equal to the empty set. The language

{car, cdr, caar, cadr, cdar, cddr, caaar, caadr, cadar, caddr, ...}

contains an infinite number of strings, namely, every string that starts with a $c$, ends with an $r$ and has a string of 1 or more $a$'s or $d$'s in between. Next time: regular expressions give a concise notation for this language.
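Even without the regular-expression notation, a membership test for this language is easy to state directly; a sketch in Python:

```python
def in_language(s):
    """True iff s starts with c, ends with r, and has one or
    more a's or d's in between."""
    return (len(s) >= 3 and s[0] == "c" and s[-1] == "r"
            and all(ch in "ad" for ch in s[1:-1]))

assert in_language("car") and in_language("cadar")
assert not in_language("cr")      # needs at least one a or d in the middle
assert not in_language("carr")    # r is not allowed in the middle
```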

Strings and languages II.

Summary.

See also: Regular expressions.

Strings and languages.

Recall that an alphabet is a finite set of symbols and a string is a finite sequence of symbols from an alphabet. The length of a string is the number of occurrences of symbols in it. The empty string, denoted by variant epsilon, ϵ, is the unique string of length 0. The concatenation of two strings $x$ and $y$ is the string obtained by taking the symbols of $x$ followed by the symbols of $y$. Concatenation is associative, but not commutative. The concatenation of the empty string and $s$ (in either order) is just $s$ itself. A language is just a (finite, infinite, or empty) set of strings. So the empty language, containing no strings, is a language.

Recall from last lecture the following language over the alphabet $\{a, c, d, r\}$.

{car, cdr, caar, cadr, cdar, cddr, caaar, caadr, cadar, caddr, ...}

It contains an infinite number of strings, namely, every string that starts with a $c$, ends with an $r$ and has a string of 1 or more $a$'s or $d$'s in between. We can write the following regular expression for this language.
    c(a|d)(a|d)*r

This expression contains all three of the operations that we use to define basic regular expressions, namely concatenation (indicated by juxtaposition here), alternation (indicated by the vertical bar), and Kleene star (indicated by the asterisk).

Please see Regular expressions.

We proceeded to construct regular expressions to denote two different languages over the alphabet $\{a, b\}$. The first language is the set of all strings of even length. Because 0 is an even number, the language contains the empty string and such strings as $aa, ab, ba, bb, aaaa, aaab, aaba, aabb, abaa, abab, abba, abbb, baaa$, and so on. A basic regular expression denoting this language is the following.

    ((a|b)(a|b))*

If we iterate the $*$-loop zero times, we get the empty string. If we iterate it once, then we can choose $a$ or $b$ from the first $(a|b)$ expression, and (independently) $a$ or $b$ from the second $(a|b)$ expression, so we can get $aa, ab, ba$, or $bb$. If we iterate it twice, we can get all strings of length 4, and so on.
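This reasoning can be spot-checked with Python's re module, whose | and * behave like the basic operations (fullmatch requires the whole string to match):

```python
import re

even_length = re.compile(r"((a|b)(a|b))*")

assert even_length.fullmatch("")          # zero iterations of the star
assert even_length.fullmatch("ab") and even_length.fullmatch("baab")
assert not even_length.fullmatch("aba")   # odd length: no match
```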

The second language over $\{a,b\}$ that we considered is the set of all strings with an even number of $a$'s and any number of $b$'s. Because 0 is even, some examples of strings in this language are the empty string, $b$, $aa$, $bb$, $aab$, $aba$, $baa$, $bbb$, $aaaa$, $aabb$, and so on. We considered several candidates before settling on one that works. The first candidate was $(aa|b)*$. Examples of strings in its language are: the empty string, $b, aa, bb, aab, baa, bbb$, and so on.

All of the strings in this language have an even number of a's, but there are some strings that have an even number of a's that are not in its language, for example, aba. We then considered $(aa|b|aba)*$. The language of this expression contains only strings that have an even number of a's, and it contains strings like aba and abaaa, but it does not contain the string abba. The next candidate was $(ab*a)*$, whose language also contains only strings with an even number of $a$'s, including strings like $aba$, $abba$, $abaabbba$. However, every non-empty string in this language must start and end with $a$, so a string like $aabb$ is not in this language.

Our final candidate DOESN'T QUITE work: $b*(ab*a)*b*$. This expression allows 0 or more $b$'s to start, zero or more repetitions of a pair of $a$'s separated by an arbitrary number of $b$'s, and 0 or more $b$'s to end. UNFORTUNATELY, it cannot generate a string like $abababa$, although it can generate $abaaba$: the occurrence of $b$ between the 2nd and 3rd $a$ is not accounted for by this regular expression. To CORRECT THIS, we need to allow for $b$'s after the second $a$ in this expression: $b*(ab*ab*)*b*$. This can then be simplified by noting that $b$'s after the last $a$ can be generated by the last $b*$ inside the parentheses, so an equivalent expression is $b*(ab*ab*)*$.
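We can gain confidence in the corrected expression by brute force: compare it against the defining property ("even number of a's") on every short string over {a, b}, again using Python's re as a stand-in for the basic operations:

```python
import re
from itertools import product

even_as = re.compile(r"b*(ab*ab*)*")

# Check agreement with the defining property on all strings of length <= 6.
for n in range(7):
    for letters in product("ab", repeat=n):
        s = "".join(letters)
        assert bool(even_as.fullmatch(s)) == (s.count("a") % 2 == 0)
```

An exhaustive check on short strings is not a proof, but it catches exactly the kinds of missed cases (like $abababa$) that tripped up the earlier candidates.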

single . matches any single symbol (including spaces)
====================================================
a.c == a(a|b|...x|y|z|...8|9...)c

+ is 1 or more
===============
x+ == xx*
c(a|d)+r == c(a|d)(a|d)*r

? is 0 or 1
===========
yog(h)?urt == (yogurt|yoghurt)


[] denotes a set
================
[aeiou] == (a|e|i|o|u)

[^] denotes a set negation
==========================
[^aeiou] == NOT (a|e|i|o|u)


Abbreviations:
==============
\d == any digit
\D == any non-digit
\w == any word character (alphanumeric plus underscore)
\W == any non-word character
\s == any white space character (space, tab, newline, formfeed, CR)
\S == any non-white space character
\b == word boundary
\B == not a word boundary
^ == beginning of a string (often line)
$ == end of a string (often line)

{m,n} for a repetition range
============================
[a-z]{1,3} == strings of 1 to 3 lowercase letters
\-?\d{1,3} == numbers with 1 to 3 digits (with an optional minus sign)
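Most of these abbreviations carry over unchanged to Python's re module, which makes them easy to experiment with:

```python
import re

assert re.fullmatch(r"x+", "xxx")                  # + is 1 or more
assert not re.fullmatch(r"x+", "")
assert re.fullmatch(r"yog(h)?urt", "yogurt")       # ? is 0 or 1
assert re.fullmatch(r"yog(h)?urt", "yoghurt")
assert re.fullmatch(r"[aeiou]+", "eau")            # [] is a set
assert not re.fullmatch(r"[^aeiou]+", "eau")       # [^] negates the set
assert re.fullmatch(r"-?\d{1,3}", "-42")           # {m,n} repetition range
assert not re.fullmatch(r"-?\d{1,3}", "2024")      # four digits: too many
```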

Regular expression practice:

Strings and languages III.

Summary.

Some egrep extensions.

In Linux, a synonym for egrep is grep -E. Basic regular expressions can be rather verbose -- to make regular expressions more useful, practical implementations introduce some extensions. We'll discuss a few of them, and show that they can be expressed in basic regular expressions, albeit less conveniently. One handy feature is that . matches any single symbol. If we explicitly list all the alphabet symbols (quite a few in ASCII), separated by |, then we get an expression we could use instead. Another useful feature is Kleene +, which is like Kleene $*$, but requires at least 1 string from the enclosed expression. Thus, the expression $c(a|d)+r$ denotes the set of all strings that start with $c$, have one or more $a$'s or $d$'s, and then end in $r$. Thus, $L(c(a|d)+r) = L(c(a|d)(a|d)*r)$, and we can use this same trick to eliminate the Kleene plus from $(E)+$ by using $E(E)*$. Another useful feature is the ? operation, which indicates that we can take 0 or 1 string from the enclosed expression. Thus, $L(yog(h)?urt) = \{yogurt, yoghurt\}$. To eliminate the ? from $(E)?$ we can just take $([emptystring]|E)$, where $[emptystring]$ stands for the variant epsilon (ϵ) that we chose to name the empty string. Another useful feature is a set, for example $[aeiou]$, which matches any single symbol between the square brackets. To eliminate this, we could write $(a|e|i|o|u)$. The final feature we describe is ranges, like $a-z$, $A-Z$, or $0-9$, which can be used in sets and match any single symbol in the range. Thus, $[a-z]$ matches any single lower case letter. Using these features, we could write an expression

    [A-Za-z]([A-Za-z0-9])*

which matches any string that starts with an upper or lower case letter and is followed by 0 or more upper case letters, lower case letters, or decimal digits. This might be the specification of an identifier in some computer language, and its language contains such strings as $Start$, $x32$ and $t3mp$.
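Checking the identifier expression against the examples in the text, again using Python's re:

```python
import re

identifier = re.compile(r"[A-Za-z]([A-Za-z0-9])*")

for s in ["Start", "x32", "t3mp"]:
    assert identifier.fullmatch(s)
assert not identifier.fullmatch("3x")    # may not start with a digit
assert not identifier.fullmatch("")      # needs at least one letter
```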

Are there languages (i.e., sets of strings) that are not regular?

One candidate might be a natural language like English. Of course, we have to agree on a formal version of English, so that it is a well-defined set of sentences. One approach would be to consider the alphabet to be upper and lower case letters, digits, and punctuation characters, including, say, blanks, periods, commas, colons, semicolons, question marks, exclamation points, parentheses and quotation marks. We imagine there is some ideal speaker of English who could examine any finite sequence of such symbols and declare it to be a correct English sentence or not. A variant of that approach would be to agree on a lexicon (word list), and take the alphabet to be the set of all the words in the lexicon, so that a sentence would be interpreted as a sequence of words rather than a sequence of characters; we'd still need the "ideal speaker" to determine whether a sequence of words is a correct English sentence. There is an issue of whether the lexicon (word list) is a finite set. Compared to some other languages (such as German), English has few automatic rules to produce new words, so it might be reasonable to think of the lexicon as a fixed finite set.

With these stipulations, is formal English a non-regular language? If we believe that there are only finitely many correct sentences in this language, we could simply list them all, separated by the regular-expression alternation symbol (|), and have a regular expression for the language. In general, any finite set of strings is a regular language. However, it seems that English contains infinitely many different correct sentences, for example the following.

    I had a bad day.
    I had a very bad day.
    I had a very very bad day.
    I had a very very very bad day.
            ...

This, by itself, doesn't mean that the language is non-regular. In fact, a regular expression for this set of sentences is the following.

    I had a (very)* bad day.

Nevertheless, linguists generally agree that a reasonable formal definition of English is not a regular language.

A digression on infinities of different sizes.

There is a notion of size for sets, even infinite ones, that results in different sizes of infinities. The notion of size, or cardinality, is that two sets have the same cardinality if there exists a one-to-one correspondence between them. A one-to-one correspondence is a function that assigns to every element of the first set an element of the second set in such a way that (1) no two elements of the first set are assigned the same element of the second set, and (2) every element of the second set is assigned to some element of the first set. That is, the elements of the first set are paired up with the elements of the second set such that there are no overlaps and no elements left out. This makes sense for finite sets -- if I have two different sets of three elements, I can form three separate pairs consisting of an element of the first set and an element of the second set, using all the elements of both sets.

This concept allows an infinite set, for example, the positive integers, to have the same size (cardinality) as a proper subset of itself, for example, the even positive integers. In this case, a one-to-one correspondence between the positive integers and the even positive integers can be the function that takes n to 2n. This matches up the two sets as follows.

  1 --> 2
  2 --> 4
  3 --> 6
  4 --> 8
   ...

It's clear that two different positive integers are matched with two different even positive integers, and that every even positive integer is matched to some positive integer. A set that is either finite or has the same cardinality as the positive integers is said to be countable.

However, countable infinities are not the only kind of infinity. To see this, we looked at the sketch of the proof (by diagonalization) of the theorem that the set of real numbers between 0 and 1 is uncountable. That is, this set of real numbers is infinite, but of strictly larger cardinality than the set of positive integers. This proof starts by associating each real number between 0 and 1 with an infinite decimal. For example, the number (pi minus 3) starts out 0.1415926..., and the number 1/3 is 0.3333333... where the 3's continue indefinitely.

This system has a kink in it -- some numbers have two different representations. For example, we have the following, where the sequence of 9's on the left is infinite, and the sequence of 0's on the right is also infinite.

    0.499999999....  = 0.50000000...

Note that this says these two numbers are equal, not just very close to each other. To be sure of this, we need to think about the definition of what real number is specified by an infinite decimal like this. The actual definition takes the number to be the limit of the sequence of finite decimals, where we extend the fraction by one digit at a time. So the number 0.499999... with an infinite number of 9's is just the limit of the sequence of finite decimals 0.4, 0.49, 0.499, 0.4999, and so on. It is clear that this sequence of finite decimals gets arbitrarily close to 0.5 without ever exceeding it, so the limit must be 0.5.

Now that we understand this representation somewhat better, we can proceed to the proof that this set of real numbers is not countable. The proof proceeds by contradiction -- we assume that the set is countable, and then proceed by diagonalization to produce an element of the set that we haven't properly accounted for. Assuming the set is countable, we make a list of all its elements, numbered 1, 2, 3, and so on. Imagine we put these infinite decimals in a table:


  index  real number with that index
  -----  ---------------------------
    1    0.1415926.....
    2    0.3333333.....
    3    0.2340000.....
    4    0.5454545.....
    .        .
    .        .
    .        .

Our assumption is that every real number between 0 and 1 eventually gets listed as a row in this table. Now we use diagonalization to define a real number z between 0 and 1 that is different from every number in this table. To do this, we specify the decimal digits of $z$ one by one using the following rule: if the i-th digit of the number with index i is not 4, then the i-th digit of $z$ is 4, and if the i-th digit of the number with index i is 4, then the i-th digit of $z$ is 5. Thus, the number $z$ will have only 4's and 5's in its decimal expansion. For the beginning of the table above, we would define $z$'s first decimal digit to be 4 (because the first digit of the first row is 1, not 4), $z$'s second digit to be 4 (because the second digit of the second row is 3, not 4), $z$'s third digit to be 5 (because the third digit of the third row is 4), $z$'s fourth digit to be 5 (because the fourth digit of the fourth row is 4), and so on. Thus, our number $z$ would start out as follows.

    z = 0.4455.....

Note that $z$ is not one of those numbers that has two names, and it cannot appear anywhere in the table, because it has a different digit in the i-th position from the number in the i-th row, for every i = 1,2,3,... Thus, we have a contradiction -- we have defined a real number $z$ between 0 and 1 that is not on the list that we assumed contained all such numbers. The assumption, that the real numbers between 0 and 1 are countable, is therefore false. This concludes the proof that the set of real numbers between 0 and 1 is uncountable. (Note that uncountable is not a specific size or cardinality -- it just means a set that is not finite or countable.)
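The digit-by-digit construction can be sketched in code. Here the "table" is just the four illustrative rows above, truncated to finite prefixes (a real table would have infinitely many infinite rows, so this is only an illustration of the rule, not the proof itself):

```python
# Finite prefixes of the four illustrative rows of the table.
rows = ["1415926", "3333333", "2340000", "5454545"]

def diagonal_digit(d):
    """The rule from the proof: output 4 unless the digit is 4, else 5."""
    return "4" if d != "4" else "5"

# z's i-th digit is chosen to differ from the i-th digit of row i.
z = "".join(diagonal_digit(rows[i][i]) for i in range(len(rows)))

assert z == "4455"
for i, row in enumerate(rows):
    assert z[i] != row[i]   # z differs from row i in the i-th digit
```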

End of digression on different infinite cardinalities.

Deterministic finite state acceptors.

We consider another method of specifying a language, namely, a deterministic finite state acceptor (or DFA). This is a kind of computing machine that resembles a Turing machine with no tape. A DFA has an alphabet (i.e., a finite set of symbols), a finite set of states, one of which is the initial state, a set of accepting states, and a transition function. The transition function takes two inputs, a state and a symbol, and produces one output, a state. As an example, we can consider a DFA M1 with two states, defined as follows.

    alphabet = {a, b}
    states = {q1, q2}
    start state = q1
    accepting states = {q1}
    transition function:
              symbol
      state|   a    b
      ------------------
        q1 |  q2   q1
        q2 |  q1   q2

The machine M1 accepts strings with an even number of a's over the alphabet {a,b}.
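A DFA is straightforward to simulate; a minimal sketch in Python, with M1's transition function stored as a dictionary:

```python
def accepts(delta, start, accepting, s):
    """Run a DFA on string s; accept iff it ends in an accepting state."""
    state = start
    for symbol in s:
        state = delta[(state, symbol)]
    return state in accepting

# M1: two states, accepting exactly the strings with an even number of a's.
delta1 = {("q1", "a"): "q2", ("q1", "b"): "q1",
          ("q2", "a"): "q1", ("q2", "b"): "q2"}

assert accepts(delta1, "q1", {"q1"}, "")        # empty string accepted
assert accepts(delta1, "q1", {"q1"}, "abba")    # two a's
assert not accepts(delta1, "q1", {"q1"}, "ab")  # one a
```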

A DFA can also be represented by a diagram, in which the states are represented by circles (labeled with the state name), a transition is represented by an arrow from one state to another (or to itself) labeled by a symbol, the start state is represented by a "sourceless" arrow into one state, and the set of accepting states is represented by putting an extra circle around each accepting state. A diagram of M1 is available: DFA, even a's.

Create DFA diagrams with fsm_designer

To determine whether a string is accepted or rejected by a finite state machine, we start out in the start state of the machine and follow the transitions indicated by the successive symbols of the string (processed from left to right) until we have processed all the symbols of the string. If the state we end in is accepting, then the string is accepted; otherwise, it is rejected. As examples, M1 accepts the empty string (because we start in q1, follow no transitions, and end in q1, which is an accepting state), and the strings b, aa, bb, aab, aba, baa, bbb, aaaa, aabb, and so on, and rejects the strings a, ab, ba, aaa, abb, bab, bba, and so on. The language recognized by a DFA M is the set of all strings accepted by M; we denote the language recognized by M by L(M). For M1, L(M1) is the set of all strings of a's and b's that contain an even number of a's (note that 0 is an even number) and any number of b's. This is the same language that is denoted by the CORRECTED regular expression from the previous lecture:

$$b*(ab*ab*)*b*$$

or more concisely $$b*(ab*ab*)*$$

Strings and languages IV.

Summary.

Please see the notes: Deterministic finite acceptors.

Last lecture we saw the DFA M1, which recognizes the set of all strings of a's and b's with an even number of a's and any number of b's, corresponding to the CORRECTED regular expression

$$b*(ab*ab*)*b*$$

or more concisely

$$b*(ab*ab*)*$$

What about the regular expression $c(a|d)+r$ (or, equivalently, the regular expression $c(a|d)(a|d)*r$) -- is there a DFA to recognize this language? We constructed one with five states, whose description is as follows.

     alphabet = {a, c, d, r}
     states = {q1, q2, q3, q4, q5}
     start state = q1
     accepting states = {q4}
     transition function
              symbols
     states |  a   c   d   r
    ------------------------
       q1   | q5  q2  q5  q5
       q2   | q3  q5  q3  q5
       q3   | q3  q5  q3  q4
       q4   | q5  q5  q5  q5
       q5   | q5  q5  q5  q5

Note that this DFA, which we'll call M2, accepts the string cadr because starting in state q1, we visit the following states:

       c   a   d   r
    q1  q2  q3  q3  q4

and, since we end in q4, an accepting state, the string is accepted. However, for the string card, we get the following states:

       c   a   r   d
    q1  q2  q3  q4  q5

and, since we end in q5, a non-accepting state, the string is rejected. A diagram of M2 is available: DFA, c(a|d)(a|d)*r.

In this diagram, we omit state q5, the dead state. Thus, this is an incomplete DFA.

In fact, this is a general phenomenon. There is a theorem (which we won't prove) to the effect that (1) every language that can be denoted by a regular expression can also be recognized by a DFA, and vice versa, that is, (2) every language that can be recognized by a DFA can also be denoted by a regular expression. The proof of this theorem consists of two algorithms: (1) an algorithm that takes a regular expression as input and produces as output a DFA to recognize the same language, and (2) an algorithm that takes a DFA as input and produces as output a regular expression that denotes the same language. Thus, these two different methods of specifying sets of strings have the same expressive power: they can specify all and only the regular languages.

However, this doesn't necessarily mean that the specifications (as a regular expression or as a DFA) are equally concise. As an example of this phenomenon, we consider the following regular expression.

$$(a|b)*b(a|b)$$

This regular expression denotes the set of all strings of $a$'s and $b$'s in which the next-to-last symbol is a $b$. For example, $ba$, $bb$, $aba$, $abb$, $bba$, $bbb$, $aaba$, $aabb$, and so on, are strings in $L((a|b)*b(a|b))$, while the empty string, $a$, $b$, $aa$, $ab$, $aaa$, $aab$, $baa$, $bab$, and so on, are strings that are not in the language of this expression. Given the theorem described above, we know that there is a DFA that recognizes this language, so we set out to find one. See DFA for string with next-to-last character b.

As a first attempt, we considered a three-state acceptor defined as follows.

    alphabet = {a,b}
    states = {q1, q2, q3}
    start state = q1
    accepting states = {q3}
    transitions
     (q1,a) -> q1
     (q1,b) -> q1
     (q1,b) -> q2
     (q2,a) -> q3
     (q2,b) -> q3

However, this is an NFA and NOT a DFA. NFA stands for Nondeterministic Finite State Acceptor, as opposed to DFA, which stands for Deterministic Finite State Acceptor. In the acceptor above, there is a choice of what transition to follow from state $q1$ on symbol $b$ -- we can either go to state $q1$ OR go to state $q2$. In a DFA, we have no choice -- given a state and a symbol there is always exactly one transition defined. (Or at most one transition, if we consider incomplete DFAs.) For an NFA, the definition of whether a string is accepted is whether there exists a way to choose the transitions so as to end in an accepting state. (The notion of nondeterminism is related to the famous P = NP? question, which we'll see later in the course.) Indeed the language of this NFA is the same as the language of the expression $(a|b)*b(a|b)$, but we still need to keep looking for a DFA that recognizes this language.
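Nondeterminism can be simulated by tracking the set of all states the machine could be in after each symbol; a sketch of this for the three-state acceptor above:

```python
def nfa_accepts(trans, start, accepting, s):
    """Accept iff some choice of transitions ends in an accepting state."""
    states = {start}
    for symbol in s:
        states = {q2 for q in states for q2 in trans.get((q, symbol), set())}
    return bool(states & accepting)

# The nondeterministic acceptor for (a|b)*b(a|b): from q1 on b we may
# stay in q1 or guess that this b is the next-to-last symbol.
trans = {("q1", "a"): {"q1"}, ("q1", "b"): {"q1", "q2"},
         ("q2", "a"): {"q3"}, ("q2", "b"): {"q3"}}

assert nfa_accepts(trans, "q1", {"q3"}, "aba")     # next-to-last is b
assert not nfa_accepts(trans, "q1", {"q3"}, "aab") # next-to-last is a
```

This set-of-states idea is in fact the standard way an NFA is converted to a DFA, which is one half of the equivalence theorem mentioned earlier.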

The next observation was that if only we could read in the input string backwards we would have an easier time of it. That is, we can consider the language consisting of the reverses of all the strings in $L((a|b)*b(a|b))$, which consists of all strings of $a$'s or $b$'s that have $b$ as the second symbol of the string, that is, $L((a|b)b(a|b)*)$. This reverse language has a DFA with four states, as follows.

    alphabet = {a,b}
    states = {q1, q2, q3, q4}
    start state = q1
    accepting states = {q3}
    transitions
    (q1, a) -> q2
    (q1, b) -> q2
    (q2, a) -> q4
    (q2, b) -> q3
    (q3, a) -> q3
    (q3, b) -> q3
    (q4, a) -> q4
    (q4, b) -> q4

Note that q4 is a dead state, so we could omit $q4$ and all transitions involving it and get an incomplete DFA for the same language. But, since we might not have the freedom to read the strings in reverse, we must still look for a DFA to recognize the unreversed language, $L((a|b)*b(a|b))$.

We finally came to the following four-state DFA for this language.

    alphabet = {a,b}
    states = {q1, q2, q3, q4}
    start state = q1
    accepting states = {q3, q4}
    transitions
    (q1, a) -> q1
    (q1, b) -> q2
    (q2, a) -> q3
    (q2, b) -> q4
    (q3, a) -> q1
    (q3, b) -> q2
    (q4, a) -> q3
    (q4, b) -> q4

Note that this DFA does not accept the empty string (because $q1$ is not an accepting state), $a$, $aa$, $b$, $ab$, $aaa$, $aab$, $baa$, $bab$, but does accept $ba$, $bb$, $aba$, $abb$, $bba$, $bbb$, so this machine seems quite promising. Once the machine has read at least 2 symbols of the input, the function of each of the states is as follows:

    q1 -- previous two input symbols were aa
    q2 -- previous two input symbols were ab
    q3 -- previous two input symbols were ba
    q4 -- previous two input symbols were bb

This is enough information to decide whether the state should be accepting or not.
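This DFA is small enough to write down and run directly; the following sketch (ours, not part of the course materials) encodes the transition table and decides membership:

```python
# The four-state DFA above, as a transition table.  Running it on a
# string is exactly the "membership problem": is the string accepted?

DFA = {
    ('q1', 'a'): 'q1', ('q1', 'b'): 'q2',
    ('q2', 'a'): 'q3', ('q2', 'b'): 'q4',
    ('q3', 'a'): 'q1', ('q3', 'b'): 'q2',
    ('q4', 'a'): 'q3', ('q4', 'b'): 'q4',
}
ACCEPTING = {'q3', 'q4'}

def dfa_accepts(s):
    state = 'q1'                      # start state
    for symbol in s:
        state = DFA[(state, symbol)]  # exactly one transition: no choice
    return state in ACCEPTING

# Accepts exactly the strings whose next-to-last symbol is b.
print([w for w in ['ba', 'bb', 'ab', 'bab', 'aabb'] if dfa_accepts(w)])
# → ['ba', 'bb', 'aabb']
```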

A more verbose, but also correct, DFA for this language is as follows.

    alphabet = {a,b}
    states = {q1, q2, q3, q4, q5, q6, q7}
    start state = q1
    accepting states = {q6, q7}
    transitions:
    (q1, a) -> q2
    (q1, b) -> q3
    (q2, a) -> q4
    (q2, b) -> q5
    (q3, a) -> q6
    (q3, b) -> q7
    (q4, a) -> q4
    (q4, b) -> q5
    (q5, a) -> q6
    (q5, b) -> q7
    (q6, a) -> q4
    (q6, b) -> q5
    (q7, a) -> q6
    (q7, b) -> q7

In this DFA, if we are in state $q1$, then no symbols have been read from the input. If we are in state $q2$, then just a single symbol $a$ has been read from the input; similarly, in state $q3$, just a single symbol $b$ has been read from the input. We are in state $q4$ if at least two symbols have been read, and the last two of them were $aa$. Similarly, state $q5$ "remembers" that the last two symbols were $ab$, $q6$ that the last two symbols were $ba$, and $q7$ that the last two symbols were $bb$. This DFA also accepts the language $L((a|b)*b(a|b))$.

Suppose we generalize the language to test a position further from the end of the string, as follows.

    L2 = L((a|b)*b(a|b)(a|b))
    L3 = L((a|b)*b(a|b)(a|b)(a|b))
    L4 = L((a|b)*b(a|b)(a|b)(a|b)(a|b))
       ...
The regular expression just gets longer by one more $(a|b)$ at each step, but the corresponding DFA would have to have states to remember the last 3 symbols, or the last 4 symbols, or the last 5 symbols, and so on. Because there are $2^n$ possibilities for the last $n$ symbols, these DFAs will have an exponentially increasing number of states: $2^3$, $2^4$, $2^5$, and so on. This shows that regular expressions can be exponentially more concise than DFAs for some families of regular languages. Other examples (which we won't be covering) show that DFAs can be exponentially more concise than regular expressions for some other families of regular languages.
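The "remember the last $n$ symbols" idea can be written out as a construction. The following hypothetical helper (ours, for illustration only) builds a DFA for "the $k$-th symbol from the end is $b$" whose states are the possible strings of the last $k$ symbols, and confirms the $2^k$ state count:

```python
from itertools import product

def build_lastk_dfa(k):
    """DFA accepting strings whose k-th symbol from the end is b.
    Each state is the string of the last k symbols seen, left-padded
    with a's at the start (a sketch; this is our construction)."""
    states = {''.join(p) for p in product('ab', repeat=k)}
    # Reading symbol c shifts the window of remembered symbols left.
    trans = {(q, c): q[1:] + c for q in states for c in 'ab'}
    accepting = {q for q in states if q[0] == 'b'}
    return states, trans, accepting, 'a' * k   # start: all-a padding

for k in [2, 3, 4, 5]:
    print(k, len(build_lastk_dfa(k)[0]))   # 2 4 / 3 8 / 4 16 / 5 32
```

For $k = 2$ this construction reproduces the four-state DFA above (with states renamed $aa$, $ab$, $ba$, $bb$).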

Strings and languages V.

Summary.

There are some useful algorithms related to regular languages, which we will not be covering in this course. There are (not necessarily efficient) algorithms to convert a DFA to a regular expression and to convert a regular expression to a DFA. There are efficient algorithms for the following two problems. Given a DFA and a string, determine whether the DFA accepts the string. This is called the membership problem (is the string a member of the language of the DFA?) and is part of the homework for this topic. Given two DFAs, say M1 and M2, determine whether they accept the same language. This is called the equivalence problem (do M1 and M2 have the same behavior, even though as machines they may look quite different?) It is not part of the homework for this topic.
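One standard way to decide the equivalence problem (a sketch of ours, not necessarily the algorithm the course has in mind) is to explore the product of the two machines, looking for a reachable pair of states on which they disagree. Applied to the four-state and seven-state DFAs from this section:

```python
from collections import deque

def equivalent(d1, d2, alphabet):
    """Each DFA is a triple (start_state, accepting_states, transitions).
    Breadth-first search over pairs of states reachable on the same input."""
    (s1, f1, t1), (s2, f2, t2) = d1, d2
    seen, queue = {(s1, s2)}, deque([(s1, s2)])
    while queue:
        q1, q2 = queue.popleft()
        if (q1 in f1) != (q2 in f2):
            return False            # some string reaches a disagreement
        for c in alphabet:
            pair = (t1[(q1, c)], t2[(q2, c)])
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True                     # no reachable disagreement exists

t4 = {('q1','a'): 'q1', ('q1','b'): 'q2', ('q2','a'): 'q3', ('q2','b'): 'q4',
      ('q3','a'): 'q1', ('q3','b'): 'q2', ('q4','a'): 'q3', ('q4','b'): 'q4'}
t7 = {('q1','a'): 'q2', ('q1','b'): 'q3', ('q2','a'): 'q4', ('q2','b'): 'q5',
      ('q3','a'): 'q6', ('q3','b'): 'q7', ('q4','a'): 'q4', ('q4','b'): 'q5',
      ('q5','a'): 'q6', ('q5','b'): 'q7', ('q6','a'): 'q4', ('q6','b'): 'q5',
      ('q7','a'): 'q6', ('q7','b'): 'q7'}
print(equivalent(('q1', {'q3','q4'}, t4), ('q1', {'q6','q7'}, t7), 'ab'))
# → True: the two machines accept the same language
```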

Are there non-regular languages? YES.

Reason (1): if we fix a (finite) alphabet A of symbols, then the set of all regular expressions over the alphabet A is countable. To see this, note that we can list all the strings over the alphabet A together with the (finite) set of symbols used to express the empty string, concatenation, union, Kleene star, and parentheses in an infinite sequence by first listing the strings of length 1, then the strings of length 2, then the strings of length 3, and so on. Many of these strings are not correct regular expressions, but we can imagine just crossing all of them off, and the resulting subsequence of correct regular expressions will still be enumerable by the positive integers, and will include EVERY regular expression over the alphabet A.

Thus, the set of all regular languages over the alphabet A is also countable. However, the set of ALL languages over the alphabet A is uncountable. To see this, we first list all of the strings over A (in order of increasing length, as above), say s1, s2, s3, s4, ... . Then we can represent any language L over the alphabet A by an infinite sequence of 0's and 1's, say b1, b2, b3, b4, ..., where bi = 1 if si is in L, and bi = 0 if si is not in L. Then a diagonalization argument similar to the one we saw before shows that the set of ALL languages is not countable. Hence there must be (uncountably many, but at least one) language over the alphabet A that is not regular.

Reason (2) (not covered in lecture, not material for exams) is that for every regular set L there exists a procedure P_L that takes as input a string s, decides whether s is in L, and outputs #t if so and #f if not. P_L is the membership procedure for L. This follows from the fact that there is a procedure to decide, for a DFA M and a string s, whether or not M accepts s. However, we can define a language L_H (that is, a set of strings) for which there is no membership procedure. L_H consists of the set of all strings s encoding Turing machines such that the machine encoded by s, run on input s, halts. Because a membership procedure for L_H would allow us to solve the Halting Problem for Turing machines, we know that no membership procedure can exist for L_H. Thus L_H is a non-regular language. This is a specific non-regular language (as opposed to the non-specific existence proof based on countability and uncountability), but still pretty abstract!

Reason (3): there are specific, concrete languages we can show to be non-regular.

One example, covered in lecture, is the language of all strings of a's that have a length that is a Fibonacci number. In symbols:

    L_F = {a^n : n is a Fibonacci number}
Here we've used the "exponential" notation for repeated concatenation, so that a^n represents a string of n a's. The Fibonacci numbers are 1, 1, 2, 3, 5, 8, 13, 21, ..., defined by the recurrence:
    F(n) = F(n-1) + F(n-2)  for n > 1
    F(1) = F(0) = 1

Each Fibonacci number in the sequence (except for the first two, which are stipulated to be 1) is the sum of its two predecessors.
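A quick computation (a sketch of ours) confirms the key fact the non-regularity argument that follows relies on: the gaps between consecutive Fibonacci numbers themselves grow without bound, since F(n) - F(n-1) = F(n-2).

```python
def fib_upto(limit):
    """List the Fibonacci numbers 1, 1, 2, 3, ... up to limit."""
    fibs, a, b = [], 1, 1
    while a <= limit:
        fibs.append(a)
        a, b = b, a + b
    return fibs

fibs = fib_upto(1000)
gaps = [y - x for x, y in zip(fibs, fibs[1:])]
print(fibs)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]
print(gaps)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]
```

The gaps are themselves (shifted) Fibonacci numbers, so they eventually exceed any fixed loop length.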

To see why the language L_F is non-regular, we further explore the properties of regular languages. If L is a regular language over the finite alphabet A = {a}, then there is a DFA M that recognizes L. M has a start state, and a transition on symbol "a" to another (possibly the same) state, and a transition on symbol "a" to another (possibly the same) state, and so on. Because M has only finitely many states, there must eventually be a transition back to a previously visited state, at which point the rest of the behavior of M is completely determined. In pictures, assuming q1 is the start state of M:

           a      a      a         a      a        a       a
       q1 --> q2 --> q3 -->  .... --> qr --> qr+1 --> ... --> qs
                                      ^                        |
                                      |           a            |
                                      --------------------------

Here we've renumbered the states so that qi is the state reached after (i-1) a's. From the start state, a string of a's visits new states up through (s-1) a's, and then on the next a, goes back to a previously-visited state, namely qr. Further a's then just go around the loop of states from qr back to qr. If there is no accepting state in the loop, then once a string of a's gets longer than (r-1), it will not be accepted, so in this case, only finitely many strings could be in the language. However, if there is an accepting state in the loop, say qm, then every string of a's of length ((m-1) + k(s-r+1)) for k = 0, 1, 2, 3, ... will also be accepted, because the length of the loop is (s-r+1). Neither of these cases is compatible with recognizing L_F, the Fibonacci language, because L_F is infinite (so a DFA that recognizes only a finite language is not suitable) and because the distances between successive Fibonacci numbers increase without bound, so no matter how large (s-r+1) is, for sufficiently large Fibonacci numbers it will be smaller than the distance between consecutive Fibonacci numbers. Hence, L_F cannot be regular.

Similar types of proofs can be given for other specific concrete languages, for example

    L = {a^nb^n : n >= 0}
that is, the language of all strings of n a's followed by n b's, for every nonnegative integer n. Another example of a nonregular language is the set of all strings of a's and b's that are palindromes, that is, such that the string is equal to its reverse. Please see the notes: Deterministic finite acceptors for a sketch of a proof that the language of palindromes is not regular.

Strings and languages VI.

Summary.

The Chomsky hierarchy (named after the linguist Noam Chomsky) consists of four families of languages, each family a proper subset of the next. The smallest family is the regular languages, next is the context free languages, third is the context sensitive languages, and the largest is the type-0 or "recursively enumerable" languages. This lecture will consider the second family, the context free languages. Please see the notes: Context free languages, context free grammars, and BNF.

We'll start with an example of a context free grammar, and then explain the general definition.

    S -> NP VP
    NP -> Det N | PN
    Det -> the | a
    N -> cat | dog | mouse
    PN -> it
    VP -> VI | VT NP | V3 that S
    VI -> slept | swam
    VT -> chased | evaded
    V3 -> believed | dreamed

A context free grammar has two disjoint finite alphabets, the nonterminal symbols (above these are: S, NP, Det, N, PN, VP, VI, VT, V3) and the terminal symbols (above these are: the, a, cat, dog, mouse, it, that, slept, swam, chased, evaded, believed, dreamed). One of the nonterminal symbols is distinguished as the start nonterminal (above: S, traditionally abbreviating "sentence" in linguistic applications of context free grammars.) These symbols appear in rules, each of which has a left hand side that is a nonterminal, and a right hand side that is a concatenation of terminals and nonterminals. The first rule above is "S -> NP VP" which has lefthand side S and right hand side "NP VP", which is a concatenation of the nonterminals NP and VP. The rule "PN -> it" has left hand side PN, and righthand side just the terminal symbol "it". Several rules with the same left hand side may be abbreviated by listing the left hand side once, the arrow ->, then the several right hand sides separated by the symbol | (for "or", as in regular expressions.) Thus, the line "NP -> Det N | PN" represents two rules, "NP -> Det N" and "NP -> PN". Similarly, the line "VP -> VI | VT NP | V3 that S" represents three rules: "VP -> VI", "VP -> VT NP", and "VP -> V3 that S".

What can we do with a context free grammar? We can "derive" strings of terminal symbols from the start symbol. Starting with the start symbol, we can repeatedly apply the following procedure: find a nonterminal symbol in the current string, and a rule with that nonterminal symbol as its left hand side. Replace one occurrence of the nonterminal symbol with the right hand side of the rule to yield a new string. This stops when the string consists entirely of terminal symbols -- this is the string derived by the sequence of steps we took. So, for the example grammar, starting with the start symbol S, we have just the string:

    S

We choose a nonterminal symbol (not much choice: just S) and a rule with S on the left hand side (again, not much choice, just S -> NP VP) and substitute the right hand side for one occurrence of the nonterminal to get:

    NP VP
Because this string has at least one nonterminal symbol, we can keep going. If we choose nonterminal symbol NP to rewrite, we have a choice between the rules "NP -> Det N" and "NP -> PN". If we choose the former and replace NP with the right hand side, we get:
    Det N VP
If we now choose the nonterminal VP, we have a choice of three possible rules. If we choose the rule "VP -> V3 that S", we get:
    Det N V3 that S

Note that the terminal symbol "that" will stay put -- the rules do not allow us to rewrite terminal symbols. We could choose the nonterminal Det and the rule "Det -> the" to get:

    the N V3 that S
Now we show several steps of this process:

    the N dreamed that S             (using V3 -> dreamed)
    the mouse dreamed that S         (using N -> mouse)
    the mouse dreamed that NP VP     (using S -> NP VP)
    the mouse dreamed that PN VP     (using NP -> PN)
    the mouse dreamed that it VP     (using PN -> it)
    the mouse dreamed that it VI     (using VP -> VI)
    the mouse dreamed that it slept  (using VI -> slept)
At this point, the process stops because the string consists entirely of terminal symbols. We have derived the string "the mouse dreamed that it slept" from the start symbol S.

The language of a context free grammar is the set of all strings of terminal symbols that can be derived in this way from the start nonterminal. So "the mouse dreamed that it slept" is a string in the language of this grammar. There are infinitely many others, for example, "the cat believed that a mouse dreamed that the cat evaded a dog" and "it swam". A language (ie, set of strings) is context free if and only if there exists a context free grammar of which it is the language. It's not too hard to show (though beyond the scope of this course) that every regular language is context free. Shortly we'll see a context free grammar for the language {a^nb^n : n > 0}, that is, the set of strings of n a's followed by n b's, for all positive integers n. This language is NOT regular, and is therefore evidence that not all context free languages are regular.
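The derivation procedure described above is easy to mechanize. The following sketch (the representation and the encoding of rule choices are ours) replays the example derivation by always rewriting the leftmost nonterminal:

```python
# Each nonterminal maps to its list of right hand sides; a right hand
# side is a list of symbols.  Symbols not in GRAMMAR are terminals.
GRAMMAR = {
    'S':   [['NP', 'VP']],
    'NP':  [['Det', 'N'], ['PN']],
    'Det': [['the'], ['a']],
    'N':   [['cat'], ['dog'], ['mouse']],
    'PN':  [['it']],
    'VP':  [['VI'], ['VT', 'NP'], ['V3', 'that', 'S']],
    'VI':  [['slept'], ['swam']],
    'VT':  [['chased'], ['evaded']],
    'V3':  [['believed'], ['dreamed']],
}

def derive(choices):
    """Apply the given rule choices (indices into the right-hand-side
    lists) to the leftmost nonterminal, until only terminals remain."""
    string = ['S']
    for choice in choices:
        i = next(k for k, sym in enumerate(string) if sym in GRAMMAR)
        string[i:i+1] = GRAMMAR[string[i]][choice]
    return ' '.join(string)

# Choices reproducing the derivation worked out in the text:
print(derive([0, 0, 0, 2, 2, 1, 0, 1, 0, 0, 0]))
# prints: the mouse dreamed that it slept
```

Note that the text's derivation rewrote nonterminals in a slightly different order; as discussed below for parse trees, the order of rewriting does not change the derived string.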

An alternative way to indicate that a terminal string can be derived from the start nonterminal is to give a parse tree for the string with respect to the grammar. For example, a parse tree for the string "the mouse dreamed that it slept" with respect to the example grammar is the following.

            S
           ----
          /    \
        NP      VP
       ----     --------
      /   \    /   \    \
    Det   N   V3  that   S
    ---   --  --         ----
     |    |   |         /    \
    the mouse dreamed  NP     VP
                       --     --
                       |      |
                       PN     VI
                       --     --
                       |      |
                       it     slept

The root node is labeled by the start nonterminal, S. Each node is either an internal node (and has children) or a leaf node (and has no children). Each leaf node is labeled by a terminal symbol. Each internal node is labeled by a nonterminal symbol. Taking the nonterminal symbol on an internal node as the left hand side, and the sequence of symbols (in order) of its (direct) children as the right hand side, the result is a rule in the grammar. Thus, at the root node, the label is S, the children are labeled NP and VP, and there is a rule "S -> NP VP" and so on for all the internal nodes of the parse tree. Note that this does demonstrate that there is a derivation of the string "the mouse dreamed that it slept" from S, though it does not completely specify the order in which the grammar rules were used to rewrite the nonterminals. So, in a derivation, we could expand NP before VP or vice versa, but in the parse tree it does not matter which comes first.

In fact, the language of this particular example grammar happens to be regular. To verify this, we can give a deterministic finite acceptor (DFA) to recognize the language: Diagram of DFA for mouse, cat, dog context free language.

To see that there are context free languages that are not regular, we can consider the example of n a's followed by n b's, for all positive integers n, or the language of balanced parentheses, neither of which is regular. A context free grammar for {a^nb^n : n > 0} is as follows, where there is one nonterminal symbol S, which is also the start symbol, two terminal symbols {a, b}, and the following two rules.

    S -> ab | aSb

For example, a parse tree for aaabbb with respect to this grammar is the following.

            S
           ---
          / | \
         a  S  b
           ---
          / | \
         a  S  b
           ---
           / \
          a   b
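Derivations in this grammar, and a direct membership test for {a^nb^n : n > 0}, can be sketched as follows (function names are ours):

```python
def generate_anbn(n):
    """Derive a^n b^n: apply S -> aSb (n-1) times, then S -> ab."""
    s = 'S'
    for _ in range(n - 1):
        s = s.replace('S', 'aSb')
    return s.replace('S', 'ab')

def is_anbn(s):
    """Direct membership test; note this requires counting, which is
    exactly what a DFA with finitely many states cannot do."""
    n = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s == 'a' * n + 'b' * n

print(generate_anbn(3))                       # aaabbb
print(is_anbn('aaabbb'), is_anbn('aabbb'))    # True False
```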

A grammar of balanced parentheses can be given as follows.

    S -> () | (S) | SS

Here there is one nonterminal, S, which is also the start symbol, and two terminal symbols "(" and ")", and three rules. To see that the string (())() is in the language of this grammar, we consider the following parse tree.

            S
           ---
          /   \
         /     \
        S       S
       ---     ---
      / | \    / \
     (  S  )  (   )
       ---
       / \
      (   )
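Membership in the balanced-parentheses language can also be decided directly with a counter that tracks the nesting depth -- unbounded counting that no DFA can do. A sketch (ours):

```python
def balanced(s):
    """True iff s is a nonempty balanced string of ( and ):
    the depth never goes negative and ends at zero."""
    depth = 0
    for c in s:
        depth += 1 if c == '(' else -1
        if depth < 0:
            return False        # a ')' with no matching '('
    return depth == 0 and len(s) > 0   # grammar derives no empty string

print(balanced('(())()'), balanced('())('))   # True False
```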

A slightly more complex grammar could generate strings of balanced parentheses and square brackets, whose language would include strings like ([]). Question: What about the non-regular language {a^n : n is a Fibonacci number} -- is this language context free or not?

Compilers.

Summary.

A compiler takes as input a program in a higher-level language like Java or C, and produces as output an equivalent program in assembly language (or machine language) for a particular kind of computer, which can then be assembled and loaded into the memory of that kind of computer, and run. A program in a higher-level language is typically represented by a (longish) sequence of characters, that is, a string. (See characters comprising MiniJava Statement for a character-level view of the MiniJava Statement.)

The work of a compiler can be understood in terms of phases. First, during the lexical analysis phase, the input string is divided into separate "tokens" including identifiers, keywords, numbers, strings, operator symbols (eg, +, &&), delimiters (eg, parentheses, braces, commas, semicolons) and the like; this process is often done with the aid of a DFA. In the example given, the lexical analysis would ignore the first 4 spaces, and produce a token "{", then ignore the newline character and the next 5 spaces, then collect the characters "s", "u", "m", followed by space, to produce the IDENTIFIER "sum", ignore the trailing space, produce the token "=", ignore the next space, collect the character "0" followed by ";" to produce the INTEGER_LITERAL "0", then produce the token ";", and so on through the file. This process transmutes the string of characters into a string of tokens -- the BNF grammar is written in terms of a terminal alphabet of tokens rather than characters.
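A toy version of this lexical-analysis phase might look as follows (a simplified sketch, not the course's actual lexer; the token categories are loosely modeled on the ones just described):

```python
def tokenize(text):
    """Split a string of characters into a list of (kind, text) tokens,
    skipping whitespace -- a hand-written stand-in for a DFA-based lexer."""
    KEYWORDS = {'while'}
    tokens, i = [], 0
    while i < len(text):
        c = text[i]
        if c.isspace():
            i += 1                                  # whitespace is ignored
        elif c.isalpha():
            j = i                                   # identifier or keyword;
            while j < len(text) and (text[j].isalnum() or text[j] == '.'):
                j += 1                              # '.' covers System.out.println
            word = text[i:j]
            tokens.append(('KEYWORD' if word in KEYWORDS else 'IDENTIFIER', word))
            i = j
        elif c.isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(('INTEGER_LITERAL', text[i:j]))
            i = j
        else:
            tokens.append(('SYMBOL', c))            # {, }, =, ;, <, +, (, )
            i += 1
    return tokens

print(tokenize('{ sum = 0;'))
# → [('SYMBOL', '{'), ('IDENTIFIER', 'sum'), ('SYMBOL', '='),
#    ('INTEGER_LITERAL', '0'), ('SYMBOL', ';')]
```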

Next is the parsing phase; the sequence of tokens is parsed using a context free grammar of the higher-level language, producing a parse tree for the program. If the program does not parse correctly according to the grammar (ie, is not in the language of all correct programs in that higher-level language), the compiler attempts to return an informative error message allowing the programmer to isolate the error. There is a phase of semantic checking, which includes matching the types of arguments with the types of functions; an error message is returned if the types do not match properly. There is a phase of code generation; assembly language instructions (or machine instructions) are generated from the parse tree to construct an equivalent program in assembly language (or machine language.) There is a phase of "optimization" in which the compiler attempts various improvements of the resulting program, eliminating redundant or unreachable instructions, moving unnecessarily repeated instructions out of loops, and many others. A production compiler usually offers the programmer options for more or less aggressive optimization of his or her program. Unfortunately, even the most aggressive optimization cannot compensate for a poor choice of algorithm, so the results may be far from "optimal". In modern compilers these and other phases may be repeated and intermingled.

We have not described how to parse a string according to a context-free grammar to get a parse tree. Instead, we will assume we already have a parse tree for a MiniJava Statement, and sketch the process of generating an equivalent TC-201 assembly language program. This sketch is intended to give some insight into the code-generation phase of a compiler. We consider the MiniJava Statement:

    {
     sum = 0;
     n = 1;
     while (n < 10)
        {sum = sum + n;
        n = n + 1;}
     System.out.println(sum);
    }
This is a correct MiniJava Statement, which has a parse tree as follows (with abbreviations St, Exp, ID, and IL for Statement, Expression, IDENTIFIER, and INTEGER_LITERAL.) Note that since Identifier can only be replaced by IDENTIFIER, they have been identified below.
  St
  ---------------------------------------------
 / |         \           \                 \   \
{  |          \           \                 \   }
   St         St          St                St
  ------      ------      -------------     ------------------------
 / |  \ \    / | \  \     |    \ \  \  \    |                 \ \  \ \
ID = Exp ;  ID = Exp ;   while ( Exp )  \  System.out.println ( Exp ) ;
 |   ---     |   ---             ---     \                      ---
sum   |      n    |             / | \     \                      |
     IL          IL           Exp < Exp    \                    ID
      |           |           ---   ---     \                    |
      0           1            |     |       \                  sum
                              ID    IL        \  
                               |     |        St
                               n    10      ---------------------
                                            /    /      \         \
                                           {    /        \         }
                                              St          St
                                            ------        -------
                                           / |  \ \      / |  \  \
                                          ID = Exp ;    ID =  Exp ;  
                                           |   ---       |    ---
                                         sum  / | \      n   / | \
                                             /  +  \        /  +  \
                                           Exp    Exp     Exp    Exp
                                           ---    ---     ---    ---
                                            |      |       |      |
                                           ID     ID      ID     IL
                                            |      |       |      |
                                           sum     n       n      1
Note that we have interpreted productions of the form
    Statement ::= "{" (Statement)* "}"
as allowing a parse tree in which a node labeled by the nonterminal symbol Statement may have a leftmost child labeled by {, a rightmost child labeled by }, and zero or more children (in between) labeled by Statement. We also assume that the lexical items ID and IL are annotated by the actual identifier or integer literal they correspond to. (For another example of a MiniJava Statement parse tree, see Another MiniJava parse tree.)

Now the goal is to sketch the process of generating code for the TC-201 computer for this parse tree to suggest how it could be automated. We need to reserve one memory location in the TC-201 for each identifier and integer literal in the program, so we can first gather them up and construct data statements for them:

  sum: data 0
  n:   data 0
  c0:  data 0
  c1:  data 1
  c10: data 10
We've assigned simple symbolic names to the integer literals 0, 1, 10. According to our usual convention, these data statements will be placed after the instructions for the program. Then we may proceed recursively, processing the parse tree from the root (top node) down.

At the root we have a Statement that consists of {, four Statements, and }. Clearly the { and }, though necessary for the syntax of the program, do not further contribute to the meaning of the program. A Statement that consists of a sequence of Statements can be compiled by (recursively) generating code for each of the constituent Statements, and then placing that code in sequence in memory, because the meaning of a sequence of Statements in MiniJava is: execute the first Statement, then execute the second Statement, then execute the third Statement, and so on, for each of the Statements in the sequence.

The first of the four Statements is an assignment, which is parsed as

   ID = Exp ;
(Recall that we've replaced Identifier by IDENTIFIER, abbreviated by ID.) To compile an assignment, we (recursively) generate code to calculate the value of the Expression on the right hand side, leaving that value in the accumulator, and follow that code with a store instruction into the memory location for the Identifier, in this case, sum. Schematically, we have
    (code to get value of Expression in accumulator)
    store sum
In this case, the Expression is just an INTEGER_LITERAL, so the instructions to get its value into the accumulator consist of just a load instruction from the memory location that holds the integer literal, in this case, c0. Thus, the code for the first of the four Statements is:
    load c0
    store sum
This correctly represents the intended meaning of the MiniJava Statement "sum = 0;".

The generation of instructions for the next MiniJava Statement, "n = 1;", is similar, and results in the instructions:

    load c1
    store n

The third Statement of the four is more complex: it is a while Statement that is parsed as

    "while" "(" Expression ")" Statement
The Expression gives the condition for the while body to be executed, and the Statement is the while body. The instructions generated form a loop that tests the while condition -- if it is false, execution jumps to the next Statement in sequence, and if it is true, the instructions for the while body are executed, and at the end there is a jump back to the instructions to test the while condition again. Schematically, we have:

loop: (instructions to put value of Expression in accumulator)
      skippos
      jump next
      (instructions for the Statement (while body))
      jump loop
next: ...
Here we have assumed that the while body will be executed as long as the value of the Expression is positive. If we make the convention that positive numbers represent true and non-positive numbers represent false, then an Expression with a positive value will cause the while body to be executed again, and an Expression with a zero or negative value will cause the while Statement to terminate.

The Expressions we have in this program are just IDENTIFIERs, INTEGER_LITERALs, sum Expressions of the form

    Exp + Exp
and less than Expressions, of the form
    Exp < Exp
Sum expressions can be evaluated by recursively evaluating the first Expression and saving its value in a temporary location, recursively evaluating the second Expression, and using an add instruction to add the value in the temporary location to it. Schematically, we have
    (instructions to put value of first Expression in accumulator)
    store temp1
    (instructions to put value of second Expression in accumulator)
    add temp1

Of course, we have to be careful with the allocation of temporary locations -- we don't want the instructions calculating the value of the second Expression (which itself could involve numerous recursive calculations) to overwrite the value of temp1. For this program, the sum Expressions are simple. For Expression "sum + n" we get

    load sum
    store temp1
    load n
    add temp1

It is clear that strictly following our scheme, we will produce redundant instructions -- above, there is no need to use temp1, since we could just use "load sum" and "add n". This is one kind of local optimization that a compiler might make. For Expression "n + 1" we get

    load n
    store temp1
    load c1
    add temp1
which is similarly redundant.

For a less than Expression, we can (recursively) calculate the value of the first Expression and save it in a temporary location, (recursively) calculate the value of the second Expression, and subtract the value in the temporary location from it. This will have the effect of putting the value of the second Expression minus the value of the first Expression in the accumulator. Barring arithmetic overflow (which we here ignore) this will be positive if and only if the first Expression is less than the second Expression. Schematically:

    (instructions to calculate the value of the first Expression)
    store temp1
    (instructions to calculate the value of the second Expression)
    sub temp1

For the Expression "n < 10" we get the instructions:

    load n
    store temp1
    load c10
    sub temp1

Once again, rather redundant -- we could just use "load c10" followed by "sub n" in this case. This will leave a positive number in the accumulator if n < 10, and a zero or negative number if n >= 10.
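The recursive scheme for sum and less-than Expressions can be sketched as code (a sketch with a representation of our choosing; the simple depth-based temporary allocation is one way to avoid the overwriting problem cautioned about above):

```python
def compile_exp(exp, depth=0):
    """Emit TC-201 instructions that leave the value of exp in the
    accumulator.  An expression is either a string naming a memory
    location (identifier or integer-literal location), or a tuple
    ('+', e1, e2) / ('<', e1, e2)."""
    if isinstance(exp, str):
        return [f'load {exp}']                 # value is already in memory
    op, e1, e2 = exp
    temp = f'temp{depth + 1}'                  # one temp per nesting depth
    code = compile_exp(e1, depth + 1)          # first Expression -> accumulator
    code.append(f'store {temp}')               # save it out of the way
    code += compile_exp(e2, depth + 1)         # second Expression -> accumulator
    code.append(('add ' if op == '+' else 'sub ') + temp)
    return code

print(compile_exp(('+', 'sum', 'n')))
# → ['load sum', 'store temp1', 'load n', 'add temp1']
print(compile_exp(('<', 'n', 'c10')))
# → ['load n', 'store temp1', 'load c10', 'sub temp1']
```

These match the (unoptimized) instruction sequences derived by hand above; deeper subexpressions get temp2, temp3, and so on, so a nested calculation never clobbers a saved value.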

Because the body of the while Statement is itself a sequence of two Statements that are assignments, and we have considered all the types of Expressions that occur, we see that the instructions generated for the while Statement (without optimization) are:

loop: load n          while (n < 10)
      store temp1
      load c10
      sub temp1
      skippos
      jump next
      load sum        {sum = sum + n;
      store temp1
      load n
      add temp1
      store sum
      load n           n = n + 1;}
      store temp1
      load c1
      add temp1
      store n
      jump loop
next: ...
where next: is the label of the following statement. The fourth Statement at the top level is a print statement, parsed as:
    "System.out.println" "(" Expression ")" ";"

For this we have the schematic:

    (code to put value of Expression in accumulator)
    output
This will print out the integer value of the Expression. Putting it all together (with an added halt instruction to stop the program after it has executed), we get the following TC-201 assembly-language program:

      load c0        {sum = 0;
      store sum
      load c1
      store n         n = 1;
loop: load n          while (n < 10)
      store temp1
      load c10
      sub temp1
      skippos 0
      jump next
      load sum        {sum = sum + n;
      store temp1
      load n
      add temp1
      store sum
      load n           n = n + 1;
      store temp1
      load c1
      add temp1
      store n
      jump loop        }
next: load sum         System.out.println(sum);
      output 0
      halt 0          } halt added to stop program
sum:  data 0
n:    data 0
c0:   data 0
c1:   data 1
c10:  data 10
temp1: data 0        additional temporary location needed