CS470/CS570
Reading: AIMA Chapter 18. You should also read chapters 19 and 20.
Edit this file and submit it on the zoo.
For this assignment, you need to work with a jupyter notebook. Some of you have already done this. For others, there are many online resources for getting started; I invite the class to offer suggestions on piazza for other helpful resources.
The cells in a jupyter notebook come in two flavors: code and markdown. A code cell usually contains a bit of python (or Julia or R; hence the name jupyter). A markdown cell can contain plain text, html tags, or $\LaTeX{}$. For this assignment, I want you to use $\LaTeX{}$ as much as possible in your exposition and explanations.
If you are new to $\LaTeX{}$, see (https://www.latex-tutorial.com/). There are many other online resources.
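For example, a markdown cell containing the source $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ renders Bayes' rule with proper mathematical typesetting; the same line pasted into a code cell would just be a python syntax error.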
As a computer scientist, you should be fluent in $\LaTeX{}$, just as you know UNIX, Excel, github, and other common utilities or languages, such as, say, jupyter notebooks.
You have some time on your hands now to become fluent in $\LaTeX{}$.
In Davies and on my home computer, I use MobaXterm to create an X-window terminal connection to the zoo. This is a secure shell (ssh) connection, but it also allows X windows programs running on the zoo to display their graphics on my local machine.
The jupyter notebook is one such application. By connecting to the zoo with MobaXterm, I can run jupyter notebooks on the zoo and display the graphics (namely, the Mozilla Firefox browser) locally.
For this assignment, you need to use jupyter notebooks. You are welcome to run jupyter on your own machine. However, you might find it simpler to load all the right modules by running it off the zoo. Here's what you need to do:
1. Connect to the zoo with MobaXterm (an ssh connection with X forwarding).
2. At the zoo shell prompt, run:
jupyter notebook &
The Mac equivalent of MobaXterm is XQuartz. See (https://www.xquartz.org/).
Once you have installed it on your Mac, start it up and run the terminal option. Enter the following command:
ssh -Y netid@frog.cs.yale.edu
where netid is your own netid. You can then proceed from step 2 above.
This assignment is an exercise in supervised learning. You will use a training data set, titanic.csv, of passengers of the Titanic. The target or label is the binary outcome: survived or perished.
There is also a testing data set, titanictest.csv, which is another subset of the passengers. Note: I have updated the original test.csv file to include the target value, which is mostly correct. Your task is to (a) clean up the data, and (b) apply a variety of learning algorithms to predict whether a passenger survived or perished, based on the remaining attributes of each passenger.
This example comes from Kaggle (https://www.kaggle.com/c/titanic). You are welcome to enter the Kaggle competition and to view the other material there, including the youtube video (https://www.youtube.com/watch?v=8yZMXCaFshs&feature=youtu.be), which walks you through one way to scrub the data.
Much of the Kaggle code is based on scikit-learn (sklearn), which is a popular machine learning package. For this assignment, you have to use the aima learning.py library. You are not allowed to import other modules.
The data has missing values, such as the ages of some of the passengers. The youtube video offers various ways to fill in the missing data. You are permitted to look up the actual data. See (https://www.encyclopedia-titanica.org/titanic-passenger-list/). I tried to add the target to all the test cases; I may have missed a couple.
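For instance, one simple repair is to replace each missing age with the median of the known ages. Here is a minimal sketch, assuming the dataset has already been loaded and trimmed as shown below, that Age is column 5, and that missing entries load as the empty string:
# Fill missing ages with the median of the known ages (one simple approach).
ages = sorted(e[5] for e in titanic.examples if e[5] != '')
median_age = ages[len(ages) // 2]
for e in titanic.examples:
    if e[5] == '':
        e[5] = median_age
titanic.update_values()   # rebuild the value lists after editing examples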
I have lightly edited the aima modules in the /c/cs470/hws/hw6 directory. I changed utils.py to load data from the current directory instead of the aima-data directory. I changed learning.py to avoid loading other datasets, like orings and iris.
You should work with copies of these files. Do not make any changes to these modules.
from learning import *
from notebook import *
I can now call the DataSet() constructor from learning.py on the local file, titanic.csv
titanic = DataSet(name = 'titanic')
dir(titanic)
The default attribute names are integers. The first line of the .csv file contains the actual names, so I adjust the data.
titanic.attrnames
titanic.attrnames = titanic.examples[0]
titanic.attrnames
The default target index is the last element, 11. In our case, the Survived label index is 1.
titanic.target
titanic.target = 1
The default input indexes are all the columns except the last. We adjust that as well.
titanic.inputs
titanic.examples[1]
titanic.inputs = [2,4,5,6,7,8,9,10]
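For reference, assuming the standard Kaggle column order (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked), these indexes select Pclass, Sex, Age, SibSp, Parch, Ticket, Fare, and Cabin, leaving out PassengerId, the Survived target, Name, and Embarked.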
The first row of examples contains the headers. We strip that away.
titanic.examples = titanic.examples[1:]
We need to update the values to remove the header strings.
titanic.update_values()
We will use the err_ratio() function to measure the error rate of a given model's predictions.
psource(err_ratio)
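As the source shows, err_ratio() returns the proportion of examples that are not correctly predicted: $\mathrm{err} = 1 - \frac{\#\text{correct}}{N}$, where $N$ is the number of examples. An error ratio of $0$ would mean the model classifies every example correctly.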
Now we try a simple baseline model: the plurality learner, which always predicts the most common class (the mode) of the dataset.
pl = PluralityLearner(titanic)
print("Error ratio for plurality learning: ", err_ratio(pl, titanic))
Next we try the k-nearest neighbor model, with k = 5 and then k = 9.
kNN5 = NearestNeighborLearner(titanic,k=5)
print("Error ratio for k nearest neighbors 5: ", err_ratio(kNN5, titanic))
kNN9 = NearestNeighborLearner(titanic,k=9)
print("Error ratio for k nearest neighbors 9: ", err_ratio(kNN9, titanic))
Here is the decision tree learner. It is horrible.
DTL = DecisionTreeLearner(titanic)
print("Error ratio for decision tree learner: ", err_ratio(DTL, titanic))
Next is the random forest model, with 5 trees. (We have edited RandomForest to eliminate the debugging message for each round.)
RFL = RandomForest(titanic, n=5)
print("Error ratio for random forest learner: ", err_ratio(RFL, titanic))
We now try a naive Bayes model.
dataset = titanic
target_vals = dataset.values[dataset.target]
target_dist = CountingProbDist(target_vals)
# One counting distribution per (class value, input attribute) pair.
attr_dists = {(gv, attr): CountingProbDist(dataset.values[attr])
              for gv in target_vals
              for attr in dataset.inputs}
# Tally the training examples.
for example in dataset.examples:
    targetval = example[dataset.target]
    target_dist.add(targetval)
    for attr in dataset.inputs:
        attr_dists[targetval, attr].add(example[attr])

def predict(example):
    # Return the class that maximizes P(class) * prod of P(attr | class).
    def class_probability(targetval):
        return (target_dist[targetval] *
                product(attr_dists[targetval, attr][example[attr]]
                        for attr in dataset.inputs))
    return argmax(target_vals, key=class_probability)
print("Error ratio for naive Bayes discrete: ", err_ratio(predict, titanic))
titanic.classes_to_numbers()
perceptron = PerceptronLearner(titanic)
There is a bug here: classes_to_numbers() converts only the target class names to numbers, so the input attributes still contain string values, which PerceptronLearner cannot handle. You need to convert the string values to integers.
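One possible repair, sketched below (this is just one reasonable approach, not the official fix): leave numeric values alone and map each remaining string value in the input columns to a small integer code, then rebuild the value lists.
# Sketch: replace string values in the input columns with integer codes.
for i in titanic.inputs:
    strings = sorted({v for v in titanic.values[i] if isinstance(v, str)})
    code = {v: n for n, v in enumerate(strings)}
    for e in titanic.examples:
        if isinstance(e[i], str):
            e[i] = code[e[i]]
titanic.update_values()
perceptron = PerceptronLearner(titanic)
print("Error ratio for perceptron: ", err_ratio(perceptron, titanic))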
Run other algorithms, such as NeuralNetLearner, LinearLearner, and AdaBoost, as well.
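Here is a sketch of those calls, assuming the default hyperparameters in the aima learning.py; note that AdaBoost() takes a weighted learner, so we wrap PerceptronLearner with WeightedLearner:
NNL = NeuralNetLearner(titanic)
print("Error ratio for neural net learner: ", err_ratio(NNL, titanic))
LL = LinearLearner(titanic)
print("Error ratio for linear learner: ", err_ratio(LL, titanic))
ada = AdaBoost(WeightedLearner(PerceptronLearner), 5)(titanic)
print("Error ratio for AdaBoost: ", err_ratio(ada, titanic))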