hw6 - machine learning

CS470/CS570

  • Assigned: Monday, March 23rd
  • Due: Thursday, April 9th, 11:59pm

Reading: AIMA Chapter 18. You should also read chapters 19 and 20.

Edit this file and submit it on the zoo.

  • Name: [enter]
  • Email address: [enter]
  • Hours: [enter]

jupyter notebooks

For this assignment, you need to work with a jupyter notebook. Some of you have already done this. For others, here are some resources for getting started:

I invite the class to offer suggestions on piazza for other helpful resources.

$\LaTeX{}$

The cells in a jupyter notebook come in two flavors: code and markdown. Code is usually a bit of python (or Julia or R; hence the name jupyter). A markdown cell can contain plain text, html tags, or $\LaTeX{}$. For this assignment, I want you to use $\LaTeX{}$ as much as possible in your exposition and explanations.

If you are new to $\LaTeX{}$, see (https://www.latex-tutorial.com/). There are many other online resources.
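For example, a markdown cell can mix prose with inline math like $k = 5$ and displayed equations. As a purely illustrative formula (mine, not part of the assignment), the error ratio used later in this notebook can be written as

$$\textrm{err} \;=\; 1 - \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\hat{y}_i = y_i\right],$$

where $N$ is the number of examples, $y_i$ is the true label of example $i$, and $\hat{y}_i$ is the model's prediction. Typing the source of an equation like this into a markdown cell and running the cell renders it in place.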

As a computer scientist, you should be fluent in $\LaTeX{}$, just as you should know UNIX, Excel, github, and other common utilities and languages, such as, say, jupyter notebooks.

You have some time on your hands now to become fluent in $\LaTeX{}$.

mobaXterm

In Davies and on my home computer, I use mobaXterm to create an X-window terminal connection to the zoo. This is a secure shell (ssh) connection that also allows X windows programs running on the zoo to display their graphics locally on my machine.

The jupyter notebook is one such application. By connecting to the zoo with mobaXterm, I can run jupyter notebooks on the zoo and display the graphics (namely, the Mozilla Firefox browser) locally.

For this assignment, you need to use jupyter notebooks. You are welcome to run jupyter on your own machine. However, you might find it simpler to load all the right modules by running it off the zoo. Here's what you need to do.

  1. Install mobaXterm and connect to your favorite zoo machine.
  2. Create a hw6 directory in your home directory, e.g., ~/hw6
  3. Copy the files in /c/cs470/hws/hw6 to ~/hw6
  4. Run the command:

    jupyter notebook &

  5. The ampersand means that it will run in the background. You can continue to issue commands at the bash prompt.
  6. Edit your copy of this file (hw6.ipynb).
  7. When you are done, exit cleanly from jupyter. Once back at the command prompt, issue a ps command and a kill to be sure the jupyter process is dead. Otherwise, it may hang around and prevent you from running your next jupyter session. jupyter does not let you have simultaneous sessions.
  8. Once you have completed the assignment, submit this file: hw6.ipynb. You should put all of your python code in this file. Do not include other files or edit the aima modules, such as learning.py or utils.py.

XQuartz for macintosh computers

The mac equivalent of mobaXterm is XQuartz. See (https://www.xquartz.org/)

Once you have installed it on your mac, start it up and run the terminal option. Enter the following command:

ssh -Y netid@frog.cs.yale.edu

Where netid is your netid. You can then proceed from step 2 above.

The Titanic Dataset - predicting the survivors

This assignment is an exercise in supervised learning. You will use a training data set of Titanic passengers, titanic.csv. The target or label is the binary outcome: survived or perished.

There is also a testing data set, titanictest.csv, which is another subset of the passengers. Note: I have updated the original test.csv file to include the target value, which is mostly correct. Your task is to (a) clean up the data, and (b) apply a variety of learning algorithms to predict whether a passenger survived or perished, based on the passenger's attributes.

This example comes from Kaggle (https://www.kaggle.com/c/titanic) You are welcome to enter the Kaggle competition and to view the other material there, including the youtube video (https://www.youtube.com/watch?v=8yZMXCaFshs&feature=youtu.be) which walks you through one way to scrub the data.

Much of the Kaggle code is based on scikit-learn (the sklearn package), which is a popular machine learning library. For this assignment, you have to use the aima learning.py library. You are not allowed to import other modules.

The data has missing values; for example, the age is missing for many of the passengers. The youtube video offers various ways to fill in the missing data. You are permitted to look up the actual data. See (https://www.encyclopedia-titanica.org/titanic-passenger-list/). I tried to add the target to all the test cases. I may have missed a couple.

aima modules modified

I have lightly edited the aima modules in the /c/cs470/hws/hw6 directory. I changed utils.py to load data from the current directory instead of the aima-data directory. I changed learning.py to avoid loading other datasets, like orings and iris.

You should work with copies of these files. Do not make any changes to these modules.

In [35]:
from learning import *
from notebook import *

I can now call the DataSet() constructor from learning.py on the local file, titanic.csv.

In [36]:
titanic = DataSet(name = 'titanic')
In [37]:
dir(titanic)
Out[37]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slotnames__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'add_example',
 'attrnames',
 'attrnum',
 'attrs',
 'check_example',
 'check_me',
 'classes_to_numbers',
 'distance',
 'examples',
 'find_means_and_deviations',
 'got_values_flag',
 'inputs',
 'name',
 'remove_examples',
 'sanitize',
 'setproblem',
 'source',
 'split_values_by_classes',
 'target',
 'update_values',
 'values']

The default attribute names are integers. The first line of the .csv file contained the names. I adjust the data.

In [38]:
titanic.attrnames
Out[38]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
In [39]:
titanic.attrnames = titanic.examples[0]
In [40]:
titanic.attrnames
Out[40]:
['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

The default target index is the last element, 11. In our case, the Survived label index is 1.

In [41]:
titanic.target
Out[41]:
11
In [42]:
titanic.target = 1

The default input indexes are all the columns except the last. We adjust that as well.

In [43]:
titanic.inputs
Out[43]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
In [44]:
titanic.examples[1]
Out[44]:
[1,
 0,
 3,
 '"Braund',
 'Mr. Owen Harris"',
 'male',
 22,
 1,
 0,
 'A/5 21171',
 7.25,
 '',
 'S']
In [45]:
titanic.inputs = [2,4,5,6,7,8,9,10]
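Notice from the example above that the raw rows have thirteen fields, not twelve: the comma inside the quoted Name value splits the name into two columns and shifts every later field one slot to the right (assuming each name contains exactly one comma). So in these rows Sex is at index 5, Age at 6, SibSp at 7, Parch at 8, Ticket at 9, and Fare at 10, which is what the input indexes above refer to; index 4 is the second half of the name, which carries the passenger's title.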

The first row of examples contains the headers. We strip that away.

In [46]:
titanic.examples = titanic.examples[1:]

We need to update the values to remove the header strings.

In [47]:
titanic.update_values()
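One cleaning step you will eventually need (see "What you need to do" below) is filling in missing values. As a minimal sketch, and nothing more: assuming Age sits at index 6 as discussed above and that a missing age shows up as the empty string, you could substitute the median of the known ages.

# Sketch only: fill missing Age values with the median of the known ages.
# Assumes Age is column 6 in the shifted rows and that a missing age is ''.
age_idx = 6
known_ages = sorted(ex[age_idx] for ex in titanic.examples
                    if isinstance(ex[age_idx], (int, float)))
median_age = known_ages[len(known_ages) // 2]
for ex in titanic.examples:
    if not isinstance(ex[age_idx], (int, float)):
        ex[age_idx] = median_age
titanic.update_values()   # refresh the cached attribute values

Whether the median is the right fill value, and how to handle the other gappy columns, is part of your write-up.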

We will use the err_ratio() function to measure the accuracy of a given model's predictions.

In [13]:
psource(err_ratio)

def err_ratio(predict, dataset, examples=None, verbose=0):
    """Return the proportion of the examples that are NOT correctly predicted.
    verbose - 0: No output; 1: Output wrong; 2 (or greater): Output correct"""
    examples = examples or dataset.examples
    if len(examples) == 0:
        return 0.0
    right = 0
    for example in examples:
        desired = example[dataset.target]
        output = predict(dataset.sanitize(example))
        if output == desired:
            right += 1
            if verbose >= 2:
                print('   OK: got {} for {}'.format(desired, example))
        elif verbose:
            print('WRONG: got {}, expected {} for {}'.format(
                output, desired, example))
    return 1 - (right/len(examples))

Now we try a simple model: the plurality learner, which always predicts the most common class (the mode) of the training data.

In [48]:
pl = PluralityLearner(titanic)
In [49]:
print("Error ratio for plurality learning: ", err_ratio(pl, titanic))
Error ratio for plurality learning:  0.38383838383838387
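That number is no accident: assuming the training file matches the standard Kaggle split (891 passengers, of whom 342 survived), the plurality learner always predicts the majority outcome, perished, so it is wrong on exactly the survivors: $342/891 \approx 0.3838$.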

Next we try the k-nearest neighbor model, with k = 5 and then k = 9.

In [50]:
kNN5 = NearestNeighborLearner(titanic,k=5)
In [51]:
print("Error ratio for k nearest neighbors 5: ", err_ratio(kNN5, titanic))
Error ratio for k nearest neighbors 5:  0.14253647586980922
In [52]:
kNN9 = NearestNeighborLearner(titanic,k=9)
In [53]:
print("Error ratio for k nearest neighbors 9: ", err_ratio(kNN9, titanic))
Error ratio for k nearest neighbors 9:  0.17059483726150393

Here is the decision tree learner. Its zero training error looks impressive, but it should make you suspicious: the tree has effectively memorized the training examples.

In [54]:
DTL = DecisionTreeLearner(titanic)
In [55]:
print("Error ratio for decision tree learner: ", err_ratio(DTL, titanic))
Error ratio for decision tree learner:  0.0
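Training error tells you little about how a learner generalizes; for that you need the held-out test file. Here is a minimal sketch, assuming titanictest.csv has the same column layout as titanic.csv and sits in the working directory, and assuming you repeat on it whatever cleaning you did to the training set. It uses the kNN model for the illustration; you should score each of your learners the same way.

# Sketch: load the test split and configure it exactly like the training set.
titanic_test = DataSet(name='titanictest')
titanic_test.attrnames = titanic_test.examples[0]
titanic_test.target = 1
titanic_test.inputs = [2, 4, 5, 6, 7, 8, 9, 10]
titanic_test.examples = titanic_test.examples[1:]
titanic_test.update_values()
# ... apply the same cleaning here as for the training data ...
print("kNN (k=5) error on test data: ", err_ratio(kNN5, titanic_test))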

Next is the random forest model, with 5 trees. (We have edited RandomForest to eliminate the debugging message for each round.)

In [56]:
RFL = RandomForest(titanic, n=5)
In [57]:
print("Error ratio for random forest learner: ", err_ratio(RFL, titanic))
Error ratio for random forest learner:  0.07070707070707072

We now try a naive Bayes model.

In [58]:
dataset = titanic

target_vals = dataset.values[dataset.target]
target_dist = CountingProbDist(target_vals)
attr_dists = {(gv, attr): CountingProbDist(dataset.values[attr])
              for gv in target_vals
              for attr in dataset.inputs}
for example in dataset.examples:
    targetval = example[dataset.target]
    target_dist.add(targetval)
    for attr in dataset.inputs:
        attr_dists[targetval, attr].add(example[attr])
In [59]:
def predict(example):
    def class_probability(targetval):
        return (target_dist[targetval] *
                product(attr_dists[targetval, attr][example[attr]]
                        for attr in dataset.inputs))
    return argmax(target_vals, key=class_probability)
In [60]:
print("Error ratio for naive Bayes discrete: ", err_ratio(predict, titanic))
Error ratio for naive Bayes discrete:  0.0718294051627385
In [61]:
titanic.classes_to_numbers()

perceptron = PerceptronLearner(titanic)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-61-189be6a56e31> in <module>
      1 titanic.classes_to_numbers()
      2 
----> 3 perceptron = PerceptronLearner(titanic)

/home/classes/cs470/hws/hw6/learning.py in PerceptronLearner(dataset, learning_rate, epochs)
    811     hidden_layer_sizes = []
    812     raw_net = network(i_units, hidden_layer_sizes, o_units)
--> 813     learned_net = BackPropagationLearner(dataset, raw_net, learning_rate, epochs)
    814 
    815     def predict(example):

/home/classes/cs470/hws/hw6/learning.py in BackPropagationLearner(dataset, net, learning_rate, epochs, activation)
    742                 for node in layer:
    743                     inc = [n.value for n in node.inputs]
--> 744                     in_val = dotproduct(inc, node.weights)
    745                     node.value = node.activation(in_val)
    746 

/home/classes/cs470/hws/hw6/utils.py in dotproduct(X, Y)
    133 def dotproduct(X, Y):
    134     """Return the sum of the element-wise product of vectors X and Y."""
--> 135     return sum(x * y for x, y in zip(X, Y))
    136 
    137 

/home/classes/cs470/hws/hw6/utils.py in <genexpr>(.0)
    133 def dotproduct(X, Y):
    134     """Return the sum of the element-wise product of vectors X and Y."""
--> 135     return sum(x * y for x, y in zip(X, Y))
    136 
    137 

TypeError: can't multiply sequence by non-int of type 'float'

There is a problem here: classes_to_numbers() converts only the target classes, so the input columns still contain strings (the sex, the name fragment, the ticket code), and the network's dot product fails on them. You need to convert the string values to numbers.
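One minimal way to get past the crash, as a sketch only (it blindly maps every string value in an input column to a small integer code; choosing a sensible encoding, and dropping near-unique columns such as the name fragment and the ticket, is part of your actual task):

# Sketch only: replace every string attribute value with an integer code.
# Assumes titanic.values is up to date (call update_values() first if not).
for attr in titanic.inputs:
    codes = {v: i for i, v in enumerate(titanic.values[attr])}
    for ex in titanic.examples:
        if isinstance(ex[attr], str):
            ex[attr] = codes[ex[attr]]
titanic.update_values()

perceptron = PerceptronLearner(titanic)
print("Error ratio for perceptron: ", err_ratio(perceptron, titanic))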

Run other algorithms as well, such as NeuralNetLearner, LinearLearner, and AdaBoost.

What you need to do

  • Clean up the data, both training (titanic.csv) and test (titanictest.csv), as discussed in the youtube video. For example, fill in missing data, combine categories, create age ranges. Help the algorithms learn better.
  • Run the learning algorithms on both the training data and more importantly on the test data.
  • Once you have settled on a good algorithm, run it on different sizes of the training data, e.g., 10%, 25%, 50%, 75%, 100%, and measure the change in error rate. The general rule is: the more data, the better the prediction. See if that holds. (A minimal sketch of this experiment appears after this list.)
  • You should try to do as well as the youtube code.
  • Write a coherent summary of what you did and your results. Try to explain what worked and what did not. Remember to use $\LaTeX{}$.
  • Do all of this inside this jupyter notebook.
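Here is a minimal sketch of the learning-curve experiment from the third bullet. It assumes the titanic DataSet has already been cleaned and made fully numeric as above, takes the first n rows of the training data for simplicity (a random sample would be a fairer experiment), and uses kNN as a stand-in for whatever model you settle on; scoring on titanic_test.examples instead of the training examples would be the more honest measurement.

# Sketch of a learning-curve experiment: train on growing fractions of the
# data and watch the error ratio.  Uses only what learning.py provides.
all_examples = titanic.examples
for fraction in [0.10, 0.25, 0.50, 0.75, 1.00]:
    n = max(1, int(fraction * len(all_examples)))
    titanic.examples = all_examples[:n]      # first n rows; a random sample is better
    titanic.update_values()
    learner = NearestNeighborLearner(titanic, k=5)
    err = err_ratio(learner, titanic, examples=all_examples)
    print("train fraction {:.0%}: error ratio {:.3f}".format(fraction, err))
titanic.examples = all_examples              # restore the full training set
titanic.update_values()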
In [ ]: