Assignment 5: Language Detection by Digram Frequency

Introduction

Analysis of microblogs (for example, Twitter) has found applications ranging from finance to tracking disease activity. An important step in analyzing a microblog is determining what language it is written in. Examining the frequencies of letter pairs (digrams) is a simple yet effective way of determining the language (for languages that are written using alphabets) when dictionaries are not available or not useful.

Assignment

Write a program called Classify that classifies sample texts in unknown languages by comparing their digram frequencies to those of texts in known languages. The names of UTF-8 encoded files containing the known and unknown texts will be given on the command line. The known texts are those named at the beginning of the command line, up to the option -classify (or through the last argument, if no argument equals -classify). The unknown texts are those named after -classify; see the sketch below.
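
One convenient way to locate that split is a small helper that finds the first occurrence of -classify; this is only a sketch, and the name find_split and the surrounding structure are illustrative, not required:

#include <string.h>

/* Return the index of the first "-classify" argument, or argc if
   none.  Known texts are then argv[1 .. split-1] and unknown texts
   are argv[split+1 .. argc-1]. */
static int find_split(int argc, char **argv) {
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-classify") == 0) {
            return i;
        }
    }
    return argc;
}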

For each unknown text, compute its similarity score to each of the known texts as follows. For a given text, let $f(d)$ be the relative frequency of digram $d$ in that text, that is, the number of occurrences of $d$ divided by the total number of digrams in the text. The similarity score between two texts with digram frequencies $f_1$ and $f_2$ is the cosine similarity of their frequency vectors: $$ \frac{\sum_d f_1(d) \cdot f_2(d)}{\sqrt{\sum_d f_1(d)^2} \cdot \sqrt{\sum_d f_2(d)^2}} $$ where the sums range over all digrams $d$ that appear in either text. The one exception: if either text contains no digrams, then the similarity score is 0.0.
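
As a concrete illustration of the formula above, here is a minimal sketch of the score computation over plain arrays; your program will instead pull counts out of maps, and the function name and types here are just for illustration:

#include <math.h>

/* Cosine similarity of two frequency (or count) vectors of length n.
   Returns 0.0 when either vector is all zeros, i.e., when a text
   contains no digrams. */
static double similarity(const double *f1, const double *f2, int n) {
    double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
    for (int i = 0; i < n; i++) {
        dot += f1[i] * f2[i];
        norm1 += f1[i] * f1[i];
        norm2 += f2[i] * f2[i];
    }
    if (norm1 == 0.0 || norm2 == 0.0) {
        return 0.0;
    }
    return dot / (sqrt(norm1) * sqrt(norm2));
}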

Consider any consecutive alphabetic characters in the input files to be a digram. Which characters are alphabetic is determined by the iswalpha function from the standard wctype library for the en_US.UTF-8 locale. The calculations should be case-insensitive, and digrams should be considered equal if their lower-case versions as determined by towlower are equal, so that "UP" is considered the same digram as "up" or "Up" or "uP". Some languages allow multiple written forms for the same letter in the same case; we will consider those different for the purposes of counting occurrences of digrams. The utf8 library given in /c/cs223/hw5/Optional contains a function utf8_process_digrams that will read a UTF-8 encoded file and pass the digrams to a function passed as an argument. The digrams are passed as strings of length up to 8 (the UTF-8 encoding uses multiple chars for some characters), and the mapping from strings to digrams is one-to-one. Your program will have to call setlocale(LC_ALL, "") once in order to use anything from utf8; you may also have to set your locale to en_US.UTF-8 on the command line with export LC_ALL="en_US.UTF-8".
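
The exact prototype of utf8_process_digrams is in the provided header; the sketch below only assumes a plausible shape (an open file, a per-digram callback, and a caller-supplied context pointer) to show where setlocale fits in, so check the real header before copying anything:

#include <locale.h>
#include <stdio.h>
#include "utf8.h"   /* assumed header name for the provided library */

/* Assumed callback shape: each digram arrives as a UTF-8 string of
   length up to 8; arg is the context pointer handed to
   utf8_process_digrams (unused here). */
static void print_digram(const char *digram, void *arg) {
    (void)arg;
    printf("%s\n", digram);
}

int main(void) {
    setlocale(LC_ALL, "");   /* required once before using utf8 */
    FILE *in = fopen("known.txt", "r");   /* hypothetical file name */
    if (in != NULL) {
        /* assumed prototype; consult the utf8 header for the real one */
        utf8_process_digrams(in, print_digram, NULL);
        fclose(in);
    }
    return 0;
}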

Also in /c/cs223/hw5/Optional is a library called smap that implements a map with strings as keys and pointers to integers as values. You may use that implementation for this assignment, but for the next assignment you will write a more efficient implementation. Note that having the values be pointers to integers allows the values to be single dynamically allocated integers or arrays of integers.
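
Since the values are pointers to integers, a digram counter can store one dynamically allocated int per unique digram. The sketch below assumes smap_get returns NULL for a missing key and that smap_put associates the given pointer with the key; the actual prototypes are in the provided smap header:

#include <stdlib.h>
#include "smap.h"   /* assumed header name for the provided library */

/* Record one more occurrence of digram in counts, allocating the
   counter the first time the digram is seen (malloc failure
   handling omitted in this sketch). */
static void increment(smap *counts, const char *digram) {
    int *count = smap_get(counts, digram);
    if (count == NULL) {
        count = malloc(sizeof(int));
        *count = 1;
        smap_put(counts, digram, count);
    } else {
        (*count)++;
    }
}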

The output should be a table of similarity scores for each pair of known text and unknown text in the format shown in the example below. If no known texts or no unknown texts are given as input, then there should be no output.

If there is an error opening any of the files, output the message Classify: error opening file FILE to standard error, where FILE is replaced by the name of the offending file. Output the error only for the first file on the command line that could not be opened. It does not matter what your program writes to standard output in these cases. If there are other errors reading the files, then the behavior is undefined, but your program must not crash or go into an infinite loop.
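
One way to keep the "first failure only" rule simple is to open files in command-line order and stop at the first one that fails; a minimal sketch, assuming the message is terminated by a newline:

#include <stdio.h>

/* Open a file for reading, printing the required message to
   standard error on failure.  The caller should stop opening
   further files once this returns NULL, so only the first
   offending file is reported. */
static FILE *open_or_report(const char *name) {
    FILE *f = fopen(name, "r");
    if (f == NULL) {
        fprintf(stderr, "Classify: error opening file %s\n", name);
    }
    return f;
}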

Your program should use $O(an + b)$ memory, where $a$ is the sum over each known input file of the number of unique digrams in that file, $b$ is the same but for unknown files, and $n$ is the number of known files. Your program should run in expected time $O(anm + bn + c)$, where $a$, $b$, and $n$ are as above, $m$ is the number of unknown files, $c$ is the total length of all the input files, and assuming that the length of the filenames is bounded and that the digrams within each file are distributed uniformly and randomly. If you use the given smap implementation, you can compute your running time assuming that smap is implemented with a hash table with worst-case $O(1)$ time for smap_size; expected $O(1)$ time for smap_contains_key, smap_put, and smap_get; and $O(n)$ time for smap_for_each (the time bounds assume that malloc and free work in $O(1)$ time, and that the usual reasonable assumptions about the distribution of the keys apply).
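
These bounds are achievable by computing each score with a single pass over one text's digram map, looking each digram up in the other text's map. A sketch of the dot-product accumulation, assuming smap_for_each passes each key/value pair plus a caller-supplied context pointer to its callback (check the provided header for the real shape):

#include "smap.h"   /* assumed header name for the provided library */

/* Context threaded through smap_for_each while accumulating the
   dot product of two digram-count maps. */
typedef struct {
    smap *other;   /* the other text's digram counts */
    double dot;    /* running sum of products of counts */
} dot_state;

/* Assumed callback shape: key, value, context pointer. */
static void accumulate(const char *digram, int *count, void *arg) {
    dot_state *state = arg;
    int *other_count = smap_get(state->other, digram);
    if (other_count != NULL) {
        state->dot += (double)*count * (double)*other_count;
    }
}

Because cosine similarity is unchanged by rescaling a vector, raw counts can stand in for relative frequencies: the dot product of the counts divided by the product of the norms of the counts yields the same score. Precomputing each file's norm once, rather than per pair, is what keeps the total work within the stated time bound.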

Your code will be tested with Valgrind, so make sure there are no memory leaks or other errors reported by Valgrind.

You may assume that no file contains the same digram more than 2 billion times. You may assume that there are no more than 100 million unique digrams total across all the input files. But your program must not crash or go into an infinite loop if any of those assumptions or other specifications for the input (file or command-line) are violated.

Example

[jrg94@jaguar ~]$ ./Classify as.txt -classify sheep.txt
                     as.txt
   sheep.txt: *0.8660254038* as.txt
[jrg94@jaguar ~]$ ./Classify english.txt finnish.txt german.txt spanish.txt tagalog.txt -classify tweet_1.txt tweet_2.txt tweet_3.txt wiki_1.txt wiki_2.txt wiki_3.txt wiki_4.txt wiki_5.txt wiki_6.txt
                english.txt   finnish.txt    german.txt   spanish.txt   tagalog.txt
 tweet_1.txt: *0.4975463163* 0.3225260247  0.3427189923  0.3542029114  0.2885507396  english.txt
 tweet_2.txt:  0.3108789788 *0.5275762028* 0.2722052701  0.3614243979  0.2637448606  finnish.txt
 tweet_3.txt:  0.0580723120  0.0252738526  0.0509494305  0.0524433391 *0.0956045458* tagalog.txt
  wiki_1.txt: *0.7768837682* 0.5197372457  0.5281032753  0.5969386450  0.4093087286  english.txt
  wiki_2.txt:  0.3541121144 *0.7000188146* 0.3566081750  0.4730732294  0.3478415488  finnish.txt
  wiki_3.txt:  0.5877254718  0.4514735721 *0.8004235617* 0.5168737580  0.2592914348  german.txt
  wiki_4.txt:  0.3646335562  0.4459622465  0.2346541019  0.3659501028 *0.8936207106* tagalog.txt
  wiki_5.txt:  0.5234652024  0.5125050541  0.5679950968 *0.8975808467* 0.3283109837  spanish.txt
  wiki_6.txt: *0.6402748956* 0.5684112875  0.6186220470  0.6347616602  0.4271444234  english.txt
Note that the columns of scores are 12 characters wide with two spaces between them. The column headings are up to the first 12 characters of the names of the known text files, right-justified, and in the order they were given on the command line. The row headings are up to the first 12 characters of the names of the unknown text files, right-justified in 12 spaces and followed by a colon and two spaces, and in the order they appeared on the command line. Each row ends with two spaces and the complete name of the known text with the highest similarity score, with ties broken in favor of the file that appeared earlier on the command line. The best match is also highlighted by asterisks around the score (replacing some of the spaces between columns). The similarity scores are displayed rounded to 10 digits after the decimal point.
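
These widths map naturally onto printf conversions: "%12.12s" right-justifies at most the first 12 characters of a name, and since cosine scores lie between 0 and 1 inclusive, "%.10f" always produces exactly 12 characters. A sketch of printing one row under those observations (the names and structure here are illustrative, not required):

#include <stdio.h>

/* Print one table row: scores[j] is the score against the j-th
   known file, best is the index of the winning column (ties
   already broken toward earlier files), and best_name is the full
   name of that file.  The asterisks around the best score take the
   place of adjacent column-separating spaces. */
static void print_row(const char *row_name, const double *scores,
                      int n, int best, const char *best_name) {
    printf("%12.12s:", row_name);
    for (int j = 0; j < n; j++) {
        printf("%c%c%.10f",
               (j > 0 && j - 1 == best) ? '*' : ' ',
               (j == best) ? '*' : ' ',
               scores[j]);
    }
    printf("%c %s\n", (best == n - 1) ? '*' : ' ', best_name);
}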

The files in the first example contain aaabaaabaaababaaaabaa and baa and are available in /c/cs223/hw5. There are three distinct digrams across the two files: aa, ab, and ba. The frequencies of those three digrams are $(\frac{10}{20}, \frac{5}{20}, \frac{5}{20}) = (\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$ for the first file and $(\frac{1}{2}, 0, \frac{1}{2})$ for the second. So the similarity score is $$ \frac{\frac{1}{2} \cdot \frac{1}{2} + \frac{1}{4} \cdot 0 + \frac{1}{4} \cdot \frac{1}{2}}{\sqrt{\frac{1}{4} + \frac{1}{16} + \frac{1}{16}} \cdot \sqrt{\frac{1}{4} + 0 + \frac{1}{4}}} = \frac{3/8}{\sqrt{3}/4} = \frac{\sqrt{3}}{2} = 0.8660254037844... $$

The files in the second example are also available in /c/cs223/hw5 and are taken from Project Gutenberg, Wikipedia, and Twitter.

Submissions

Submit a makefile with target Classify, your log file, and the source code necessary to build the executable using your makefile. There is no need to submit the provided utf8 or smap libraries unless you modified them. Your makefile may assume that they have been copied into the current directory, but should not assume that they are still in their original locations.
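
For reference, one possible shape for the makefile, assuming the provided libraries consist of utf8.c/utf8.h and smap.c/smap.h copied into the current directory and that your source is in classify.c (adjust all of these names to match your actual files; recipe lines must begin with a tab):

CC = gcc
CFLAGS = -std=c99 -Wall -pedantic -g

Classify: classify.o utf8.o smap.o
	$(CC) $(CFLAGS) -o Classify classify.o utf8.o smap.o -lm

classify.o: classify.c utf8.h smap.h
	$(CC) $(CFLAGS) -c classify.c

utf8.o: utf8.c utf8.h
	$(CC) $(CFLAGS) -c utf8.c

smap.o: smap.c smap.h
	$(CC) $(CFLAGS) -c smap.c

The -lm flag is there because the score computation uses sqrt from the math library.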