Assignment 5: Language Detection by Digram Frequency
Objectives
- to use maps
- to use function pointers
- to be aware of character encodings
Assignment
Write a program called Classify that classifies sample texts
in unknown languages by comparing their digram frequencies to those in
texts in known languages. The names of UTF-8 encoded
files containing the known and unknown
texts will be given on the command line. The known texts are those
from the beginning of the command line up to the option -classify,
or all of the arguments if none equals -classify.
The unknown texts are those after -classify.
For each unknown text, compute its similarity score to each of the known texts as follows:
- let the set $\{d_1, \ldots, d_k\}$ be the digrams that appear in either text;
- for both the unknown and known text, represent the text as a $k$-dimensional vector where component $i$ is the proportion of digrams in the text equal to $d_i$;
- the similarity score is the cosine of the angle between those vectors: if the known text is $(a_1, \ldots, a_k)$ and the unknown text is $(b_1, \ldots, b_k)$ then the cosine of the angle between them is $$ \frac{(a_1 \cdot b_1 + \cdots + a_k \cdot b_k)} {\sqrt{a_1^2 + \cdots + a_k^2} \cdot \sqrt{b_1^2 + \cdots + b_k^2}}. $$
Consider any pair of consecutive alphabetic characters in the input files
to be a digram. Which characters are alphabetic is determined by
the iswalpha function from the standard wctype library
for the en_US.UTF-8 locale.
The calculations should be case-insensitive: digrams should
be considered equal if their lower-case versions, as determined
by towlower, are equal, so that
"UP" is considered the same digram as "up" or "Up" or "uP". Some
languages allow multiple written forms for the same letter in the
same case; we will consider those different for the purposes of counting
occurrences of digrams.
The utf8 library given in /c/cs223/hw5/Optional
contains a function utf8_process_digrams
that will read a UTF-8 encoded file and pass the digrams to a function
passed as an argument. The digrams are passed as strings of length up
to 8 – the UTF-8 encoding uses multiple chars for some characters – and
the mapping from strings to digrams is one-to-one. Your program
will have to call setlocale(LC_ALL, "") once in order
to use anything from utf8; you may also have to set
your locale to en_US.UTF-8 on the command line with
export LC_ALL="en_US.UTF-8".
Also in /c/cs223/hw5/Optional
is a library called
smap
that implements a map
with strings as keys and pointers to integers as values. You may
use that implementation for this assignment, but for the next
assignment you will write a more efficient implementation.
Note that having the values be pointers to integers allows the values
to be single dynamically allocated integers or arrays of integers.
The output should be a table of similarity scores for each pair of known and unknown texts, in the format shown in the example below. If there are no known texts or no unknown texts given as input, then there should be no output.
If there is an error opening any of the files then output the message
Classify: error opening file FILE
to standard error,
where FILE
is replaced by the name of the offending file.
Output the error only for the first file on the command line that
could not be opened. It does not matter what your program writes
to standard output in these cases. If there are other errors reading
the files then the behavior is undefined but your program must not crash or
go into an infinite loop.
Your program should use $O(an + b)$ memory, where
$a$ is the sum over each known input file of the number of unique digrams
in that file,
$b$ is the same but for unknown files, and
$n$ is the number of known files.
Your program
should run in expected time $O(anm + bn + c)$ where
$a$, $b$, and $n$ are as above,
$m$ is the number of unknown files,
$c$ is the total length of all the input files, and assuming that
the length of the filenames is bounded and that the digrams within
each file are distributed uniformly and randomly.
If you use the given smap implementation, you can compute your
running time assuming that smap is implemented with a hash table with
worst-case $O(1)$ time for smap_size;
expected $O(1)$ time for smap_contains_key, smap_put, and smap_get;
and $O(n)$ time for smap_for_each (the time bounds
assume that malloc and free work in $O(1)$ time,
and that the usual reasonable assumptions
about the distribution of the keys apply).
Your code will be tested with Valgrind, so make sure there are no memory leaks or other errors reported by Valgrind.
You may assume that no file contains the same digram more than 2 billion times. You may assume that there are no more than 100 million unique digrams total across all the input files. But your program must not crash or go into an infinite loop if any of those assumptions or other specifications for the input (file or command-line) are violated.
Example
[jrg94@jaguar ~]$ ./Classify as.txt -classify sheep.txt
                     as.txt
   sheep.txt: *0.8660254038* as.txt
[jrg94@jaguar ~]$ ./Classify english.txt finnish.txt german.txt spanish.txt tagalog.txt -classify tweet_1.txt tweet_2.txt tweet_3.txt wiki_1.txt wiki_2.txt wiki_3.txt wiki_4.txt wiki_5.txt wiki_6.txt
                english.txt   finnish.txt    german.txt   spanish.txt   tagalog.txt
 tweet_1.txt: *0.4975463163* 0.3225260247  0.3427189923  0.3542029114  0.2885507396  english.txt
 tweet_2.txt:  0.3108789788 *0.5275762028* 0.2722052701  0.3614243979  0.2637448606  finnish.txt
 tweet_3.txt:  0.0580723120  0.0252738526  0.0509494305  0.0524433391 *0.0956045458* tagalog.txt
  wiki_1.txt: *0.7768837682* 0.5197372457  0.5281032753  0.5969386450  0.4093087286  english.txt
  wiki_2.txt:  0.3541121144 *0.7000188146* 0.3566081750  0.4730732294  0.3478415488  finnish.txt
  wiki_3.txt:  0.5877254718  0.4514735721 *0.8004235617* 0.5168737580  0.2592914348  german.txt
  wiki_4.txt:  0.3646335562  0.4459622465  0.2346541019  0.3659501028 *0.8936207106* tagalog.txt
  wiki_5.txt:  0.5234652024  0.5125050541  0.5679950968 *0.8975808467* 0.3283109837  spanish.txt
  wiki_6.txt: *0.6402748956* 0.5684112875  0.6186220470  0.6347616602  0.4271444234  english.txt

Note that the columns of scores are 12 characters wide with two spaces between them. The column headings are up to the first 12 characters of the names of the known text files, right-justified, in the order they were given on the command line. The row headings are up to the first 12 characters of the names of the unknown text files, right-justified in 12 spaces and followed by a colon and two spaces, in the order they appeared on the command line. Each row ends with two spaces and the complete name of the known text with the highest similarity score, with ties broken in favor of the file that appeared earlier on the command line. The best match is also highlighted by asterisks around the score (replacing some of the spaces between columns).
The similarity scores are displayed rounded to 10 digits after the decimal point.
The files in the first example contain
aaabaaabaaababaaaabaa and baa
and are available in /c/cs223/hw5.
There are three distinct digrams across the two files: aa, ab, and ba. The
frequencies for those three digrams in the two files are
$(\frac{10}{20}, \frac{5}{20}, \frac{5}{20}) = (\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$ and $(\frac{1}{2}, 0, \frac{1}{2})$.
So the similarity score is
$$
\frac{\frac{1}{2} \cdot \frac{1}{2} + \frac{1}{4} \cdot 0 + \frac{1}{4} \cdot \frac{1}{2}}{\sqrt{\frac{1}{4} + \frac{1}{16} + \frac{1}{16}} \cdot \sqrt{\frac{1}{4} + 0 + \frac{1}{4}}} = 0.8660254037844...
$$
The files in the second example are also available in /c/cs223/hw5 and
are taken from Project Gutenberg, Wikipedia, and Twitter:
- BDL (English)
- Finnish baseball (Finnish)
- Bratwurst (German)
- Manila (Tagalog)
- Horacio Quiroga (Spanish)
- Ikea (Swedish)
- Baltimore Orioles (@Orioles; English)
- Henna Virkkunen (@HennaVirkkunen; English and Finnish)
- Mandarin Voice (@pinyin; English and Mandarin with some pinyin transliterations)
Submissions
Submit a makefile with target Classify,
your log file, and the source code necessary to build the
executable using your makefile.
There is no need to submit the provided utf8
or smap
libraries unless you modified them.
Your makefile can assume that they have been copied
into the current directory, but should not assume they
are still in their original locations.