CPSC 467a -- Fall 2008
Letter Frequency Files for Problem Set 1
----------------------------------------

I have provided four data files -- two for testing purposes and two
realistic files on which to run your experiments.

TEST FILES

half_a_freq.dat: 

    In this file, the letter 'A' occurs with probability 1/2; the
    other 25 letters each have probability 1/50.  On every 1-letter
    message, your cryptanalysis program should guess the message to be
    'A'.  It will be correct 1/2 of the time.

    On 2-letter messages, when it sees ciphertext 'yz', it should
    guess that y is the encryption of 'A', so the key must be y.  It
    should guess D_k(yz) as the message.  If 'y' really is the
    encryption of 'A', it will be correct, otherwise not, so again it
    will be correct 1/2 of the time.

    On 3-letter messages, it should be correct more than 1/2 of the
    time.  This is because whenever a ciphertext has two letters the
    same, those two letters will almost always correspond to 'A', so
    in those cases the best guess is the key which decrypts those
    letters to 'A'.

    To analyze this case, let's first see what the algorithm will
    guess in each case and then try to find the probability that each
    of those good cases occur.  The sum of these probabilities is the
    overall correctness probability.  In the following, x, y, z
    represent distinct letters.

    If the algorithm sees xyz, it may as well guess that x='A'.  If it
    sees a repeated letter, it guesses that the repeated letter is
    'A', so for xxy, xxx, yxx, etc., it will guess x='A'.

    Now let's count the cases where it is correct.  Whenever the
    plaintext has at least two occurrences of 'A', the algorithm will
    guess correctly.  The probability of exactly 2 occurrences of 'A'
    is 3/8, and the probability that all three occurrences are 'A' is
    1/8, for a total of 1/2.  In addition, the algorithm guesses
    correctly in the case that the first letter is A and the other two
    letters are distinct and not 'A'.  This happens with probability
    (1/2)*(25/50)*(24/50) = 3/25.  Adding these cases up, the
    probability that the guess is correct is 1/2 + 3/25 = 0.62.

uniform.dat:

    In this file, all letters are equally probable, so also all
    messages are equally likely.  Given a ciphertext c, the 26
    possible messages are equally likely, so whatever the program
    guesses is correct with probability 1/26, no matter how long the
    message.


"REAL" FREQUENCY FILES

Run your experiments on these two files.

merrywivesofwindsor_freq.dat:

    The frequencies in this file are the actual letter counts from an
    on-line version of Shakespeare's play, "The Merry Wives of
    Windsor".

ulysses_freq.dat:

    The frequencies in this file are the actual letter counts from an
    on-line version of James Joyce's novel, "Ulysses".  This novel is
    know for its huge vocabulary and rich use of the English language.
    Note for example that while Ulysses has only about 10% more
    occurrences of the letter 'E', it has about 29% more uses of the
    letter 'D'.  This shows that the notion of "English language
    letter frequency" is a great oversimplification of reality.