CPSC 467a -- Fall 2008 Letter Frequency Files for Problem Set 1 ---------------------------------------- I have provided four data files -- two for testing purposes and two realistic files on which to run your experiments. TEST FILES half_a_freq.dat: In this file, the letter 'A' occurs with probability 1/2; the other 25 letters each have probability 1/50. On every 1-letter message, your cryptanalysis program should guess the message to be 'A'. It will be correct 1/2 of the time. On 2-letter messages, when it sees ciphertext 'yz', it should guess that y is the encryption of 'A', so the key must be y. It should guess D_k(yz) as the message. If 'y' really is the encryption of 'A', it will be correct, otherwise not, so again it will be correct 1/2 of the time. On 3-letter messages, it should be correct more than 1/2 of the time. This is because whenever a ciphertext has two letters the same, those two letters will almost always correspond to 'A', so in those cases the best guess is the key which decrypts those letters to 'A'. To analyze this case, let's first see what the algorithm will guess in each case and then try to find the probability that each of those good cases occur. The sum of these probabilities is the overall correctness probability. In the following, x, y, z represent distinct letters. If the algorithm sees xyz, it may as well guess that x='A'. If it sees a repeated letter, it guesses that the repeated letter is 'A', so for xxy, xxx, yxx, etc., it will guess x='A'. Now let's count the cases where it is correct. Whenever the plaintext has at least two occurrences of 'A', the algorithm will guess correctly. The probability of exactly 2 occurrences of 'A' is 3/8, and the probability that all three occurrences are 'A' is 1/8, for a total of 1/2. In addition, the algorithm guesses correctly in the case that the first letter is A and the other two letters are distinct and not 'A'. This happens with probability (1/2)*(25/50)*(24/50) = 3/25. Adding these cases up, the probability that the guess is correct is 1/2 + 3/25 = 0.62. uniform.dat: In this file, all letters are equally probable, so also all messages are equally likely. Given a ciphertext c, the 26 possible messages are equally likely, so whatever the program guesses is correct with probability 1/26, no matter how long the message. "REAL" FREQUENCY FILES Run your experiments on these two files. merrywivesofwindsor_freq.dat: The frequencies in this file are the actual letter counts from an on-line version of Shakespeare's play, "The Merry Wives of Windsor". ulysses_freq.dat: The frequencies in this file are the actual letter counts from an on-line version of James Joyce's novel, "Ulysses". This novel is know for its huge vocabulary and rich use of the English language. Note for example that while Ulysses has only about 10% more occurrences of the letter 'E', it has about 29% more uses of the letter 'D'. This shows that the notion of "English language letter frequency" is a great oversimplification of reality.