CPSC 467b -- Spring 2010
Letter Frequency Files for Problem Set 1
----------------------------------------
I have provided four data files -- two for testing purposes and two
realistic files on which to run your experiments.
TEST FILES
half_a_freq.dat:
In this file, the letter 'A' occurs with probability 1/2; the
other 25 letters each have probability 1/50. On every 1-letter
message, your cryptanalysis program should guess the message to be
'A'. It will be correct 1/2 of the time.
On 2-letter messages, when it sees ciphertext 'yz', it should
guess that y is the encryption of 'A', so the key must be y. It
should guess D_k(yz) as the message. If 'y' really is the
encryption of 'A', it will be correct, otherwise not, so again it
will be correct 1/2 of the time.
On 3-letter messages, it should be correct more than 1/2 of the
time. This is because whenever a ciphertext has two letters the
same, those two letters will almost always correspond to 'A', so
in those cases the best guess is the key which decrypts those
letters to 'A'.
To analyze this case, let's first see what the algorithm will
guess in each case and then try to find the probability that each
of those good cases occur. The sum of these probabilities is the
overall correctness probability. In the following, x, y, z
represent distinct letters.
If the algorithm sees xyz, it may as well guess that x='A'. If it
sees a repeated letter, it guesses that the repeated letter is
'A', so for xxy, xxx, yxx, etc., it will guess x='A'.
Now let's count the cases where it is correct. Whenever the
plaintext has at least two occurrences of 'A', the algorithm will
guess correctly. The probability of exactly 2 occurrences of 'A'
is 3/8, and the probability that all three occurrences are 'A' is
1/8, for a total of 1/2. In addition, the algorithm guesses
correctly in the case that the first letter is A and the other two
letters are distinct and not 'A'. This happens with probability
(1/2)*(25/50)*(24/50) = 3/25. Adding these cases up, the
probability that the guess is correct is 1/2 + 3/25 = 0.62.
uniform.dat:
In this file, all letters are equally probable, so also all
messages are equally likely. Given a ciphertext c, the 26
possible messages are equally likely, so whatever the program
guesses is correct with probability 1/26, no matter how long the
message.
"REAL" FREQUENCY FILES
Run your experiments on these two files.
merrywives_freq.dat:
The frequencies in this file are the actual letter counts from an
on-line version of Shakespeare's play, "The Merry Wives of
Windsor".
ulysses_freq.dat:
The frequencies in this file are the actual letter counts from an
on-line version of James Joyce's novel, "Ulysses". This novel is
know for its huge vocabulary and rich use of the English language.
Note for example that while Ulysses has only about 10% more
occurrences of the letter 'E', it has about 29% more uses of the
letter 'D'. This shows that the notion of "English language
letter frequency" is a great oversimplification of reality.