{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# hw6 - machine learning\n", "CS470/CS570\n", "\n", "- Assigned Monday March 23rd.\n", "- Due: Thursday April 9th, 11:59pm\n", "\n", "Reading: AIMA Chapter 18. You should also read chapters 19 and 20.\n", "\n", "Edit this file and submit it on the zoo.\n", "\n", "- Name: [enter]\n", "- Email address: [enter]\n", "- Hours: [enter]\n", "\n", "## jupyter notebooks\n", "\n", "For this assignment, you need to work with a jupyter notebook. Some of you have already done this. For others, here are some resources for getting started:\n", "\n", "- (https://jupyter-notebook.readthedocs.io/en/stable/)\n", "- (https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html)\n", "- (https://realpython.com/jupyter-notebook-introduction/)\n", "- (https://www.edureka.co/blog/wp-content/uploads/2018/10/Jupyter_Notebook_CheatSheet_Edureka.pdf)\n", "\n", "I invite the class to offer suggestions on piazza for other helpful resources.\n", "\n", "## $\\LaTeX{}$\n", "\n", "The cells in a jupyter notebook come in two flavors: **code** and **markdown**. Code is usually a bit of python code (or Julia or R, hence, jupyter). The markdown text can\n", "be text, html tags, or $\\LaTeX{}$. For this assignment, I want you to use $\\LaTeX{}$ as much as possible in your exposition and explanations.\n", "\n", "If you are new to $\\LaTeX{}$, see (https://www.latex-tutorial.com/). There are many \n", "other online resources.\n", "\n", "As a computer scientist, you be fluent in $\\LaTeX{}$, just like you know UNIX, Excel, \n", "github, and other common utilities or languages, such as say, jupyter notebooks.\n", "\n", "You have some time on your hands now to become fluent in $\\LaTeX{}$.\n", "\n", "\n", "\n", "## mobaXterm\n", "\n", "In Davies and on my home computer, I use mobaXterm to create an X-window\n", "terminal connection to the zoo. 
This is a secure shell (ssh) connection, but it also allows\n", "the zoo X windows programs to display their graphics locally on my local machine.\n", "\n", "The jupyter notebook is one such application. By connecting to the zoo with mobaXterm,\n", "I can run jupyter notebooks on the zoo and display the graphics (namely the Mozilla\n", "Firefox browser) locally.\n", "\n", "For this assignment, you need to use jupyter notebooks. You are welcome to run\n", "jupyter on your own machine. However, you might find it simpler to load all\n", "the right modules by running it off the zoo. Here's what you need to do.\n", "\n", "1. Install mobaXterm and connect to your favorite zoo machine.\n", "2. Create a hw6 directory in your home directory, e.g., ~/hw6\n", "3. Copy the files in /c/cs470/hws/hw6 to ~/hw6\n", "4. Run the command: \n", "> jupyter notebook &\n", "5. The ampersand means that it will run in the background. You can continue to\n", "issue commands at the bash prompt in the foreground.\n", "6. Edit your copy of this file (hw6.ipynb). That is, use the jupyter notebook interface to modify the code and markdown cells. You may add or delete or rearrange cells as needed. The actual jupyter notebook source file, e.g., hw6.ipynb, is \n", "in json format. You should **NOT** edit that directly. Just use the jupyter interface.\n", "7. When you are done, exit cleanly from jupyter. Once back at the command prompt, \n", "issue a ps command and a kill to be sure the jupyter process and firefox process are dead. Otherwise, they\n", "may hang around and prevent you from running your next jupyter session. jupyter does not let you have simultaneous sessions.\n", "8. Once you have completed the assignment, submit this file: hw6.ipynb. You should put all of your python code in this file. 
Do not include other files or edit \n", "the aima modules, such as learning.py or utils.py.\n", "\n", "## JSON\n", "\n", "The jupyter notebook source code is in JSON (JavaScript Object Notation) format: (https://en.wikipedia.org/wiki/JSON) For example, see (https://zoo.cs.yale.edu/classes/cs470/materials/hws/hw6/hw6.ipynb) \n", "\n", "Related formats include YAML (which is a superset of JSON) and XML, which\n", "is a predecessor.\n", "\n", "Along with UNIX, github, jupyter notebooks and $\\LaTeX{}$, you should add JSON to your toolbox. \n", "\n", "## XQuartz for macintosh computers\n", "\n", "The mac equivalent of mobaXterm is XQuartz. See (https://www.xquartz.org/)\n", "\n", "Once you have installed it on your mac, start it up and run the terminal option. Enter the following command:\n", "\n", "> ssh -Y netid@frog.cs.yale.edu\n", "\n", "Here netid is your netid. You can then proceed from step 2 above.\n", "\n", "## The Titanic Dataset - predicting the survivors\n", "\n", "This assignment is an exercise in supervised learning. You will use a training data set of Titanic passengers, titanic.csv. The target or label is the binary outcome: survived or perished. \n", "\n", "There is also a testing data set, titanictest.csv, which is another subset of the passengers. Note: I have updated the original test.csv file to include the\n", "target value, which is mostly correct. Your task is to (a) clean up the data, and (b) apply a variety of \n", "learning algorithms to predict whether a passenger survived or perished, based on the\n", "passenger's other attributes. 
**If you edit the .csv files, you may submit them as well.**\n", "\n", "This example comes from Kaggle (https://www.kaggle.com/c/titanic). You are welcome to enter the Kaggle competition and to view the other material there, including the youtube video (https://www.youtube.com/watch?v=8yZMXCaFshs&feature=youtu.be), which walks you through one way to scrub the data.\n", "\n", "Much of the Kaggle code is based on scikit-learn (sklearn), which is a popular machine learning package. For this assignment, you have to use the aima learning.py library. You **are** allowed to import other modules, such as numpy and pandas.\n", "\n", "The data has missing values, such as the ages of some of the passengers. The youtube video offers various ways to fill in the missing data. You are permitted to look up the actual data. See (https://www.encyclopedia-titanica.org/titanic-passenger-list/). I tried to add the target to all the test cases. I may have missed a couple." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### aima modules modified\n", "I have lightly edited the aima modules in the /c/cs470/hws/hw6 directory. I changed utils.py to load data from the current directory instead of the aima-data directory. I changed learning.py to avoid loading other datasets, like orings and iris.\n", "\n", "You should work with copies of these files. Do not make any changes to these modules." 
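, "\n", "\n",
"Before calling the DataSet() constructor, it can help to check how many fields in the .csv file are empty. Here is a minimal sketch using only the standard csv module; the inline rows are made up for illustration, so substitute the contents of the real titanic.csv:\n",
"\n",
"```python\n",
"import csv\n",
"\n",
"# Made-up stand-in for titanic.csv (header row plus two passengers)\n",
"lines = ['PassengerId,Age,Cabin', '1,22,', '2,,C85']\n",
"rows = list(csv.DictReader(lines))\n",
"\n",
"# Count empty fields per column\n",
"missing = {name: sum(1 for r in rows if r[name] == '') for name in rows[0]}\n",
"print(missing)  # {'PassengerId': 0, 'Age': 1, 'Cabin': 1}\n",
"```"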
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from learning import *\n", "from notebook import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can now call the DataSet() constructor from learning.py on the local file, titanic.csv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "titanic = DataSet(name = 'titanic')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['__class__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattribute__',\n", " '__gt__',\n", " '__hash__',\n", " '__init__',\n", " '__init_subclass__',\n", " '__le__',\n", " '__lt__',\n", " '__module__',\n", " '__ne__',\n", " '__new__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__setattr__',\n", " '__sizeof__',\n", " '__str__',\n", " '__subclasshook__',\n", " '__weakref__',\n", " 'add_example',\n", " 'attrnames',\n", " 'attrnum',\n", " 'attrs',\n", " 'check_example',\n", " 'check_me',\n", " 'classes_to_numbers',\n", " 'distance',\n", " 'examples',\n", " 'find_means_and_deviations',\n", " 'got_values_flag',\n", " 'inputs',\n", " 'name',\n", " 'remove_examples',\n", " 'sanitize',\n", " 'setproblem',\n", " 'source',\n", " 'split_values_by_classes',\n", " 'target',\n", " 'update_values',\n", " 'values']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(titanic)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default attribute names are integers. The first line of the .csv file contained the names. I adjust the data." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.attrnames" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "titanic.attrnames = titanic.examples[0]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['PassengerId',\n", " 'Survived',\n", " 'Pclass',\n", " 'Name',\n", " 'Sex',\n", " 'Age',\n", " 'SibSp',\n", " 'Parch',\n", " 'Ticket',\n", " 'Fare',\n", " 'Cabin',\n", " 'Embarked']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.attrnames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default target index is the last element, 11. In our case, the Survived label index is 1." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.target" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "titanic.target = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default input indexes are all the columns except the last. We adjust that as well." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.inputs" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1,\n", " 0,\n", " 3,\n", " '\"Braund Mr. 
Owen Harris\"',\n", " 'male',\n", " 22,\n", " 1,\n", " 0,\n", " 'A/5 21171',\n", " 7.25,\n", " '',\n", " 'S']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.examples[1]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "titanic.inputs = [2,4,5,6,7,8,9,10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first row of examples contains the headers. We strip that away." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "titanic.examples = titanic.examples[1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**We need to update the values to remove the header strings.**" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Survived', 0, 1]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.values[1]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "titanic.update_values()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.values[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the err_ratio() function to measure the accuracy of a given model's predictions." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "

\n", "\n", "
def err_ratio(predict, dataset, examples=None, verbose=0):\n",
       "    """Return the proportion of the examples that are NOT correctly predicted.\n",
       "    verbose - 0: No output; 1: Output wrong; 2 (or greater): Output correct"""\n",
       "    examples = examples or dataset.examples\n",
       "    if len(examples) == 0:\n",
       "        return 0.0\n",
       "    right = 0\n",
       "    for example in examples:\n",
       "        desired = example[dataset.target]\n",
       "        output = predict(dataset.sanitize(example))\n",
       "        if output == desired:\n",
       "            right += 1\n",
       "            if verbose >= 2:\n",
       "                print('   OK: got {} for {}'.format(desired, example))\n",
       "        elif verbose:\n",
       "            print('WRONG: got {}, expected {} for {}'.format(\n",
       "                output, desired, example))\n",
       "    return 1 - (right/len(examples))\n",
       "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "psource(err_ratio)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we try a simple model: the plurality learner, which predicts the mode of the dataset." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "pl = PluralityLearner(titanic)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error ratio for plurality learning: 0.38383838383838387\n" ] } ], "source": [ "print(\"Error ratio for plurality learning: \", err_ratio(pl, titanic))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we try the k-nearest neighbor model, with k = 5 and then k = 9." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "kNN5 = NearestNeighborLearner(titanic,k=5)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error ratio for k nearest neighbors 5: 0.1447811447811448\n" ] } ], "source": [ "print(\"Error ratio for k nearest neighbors 5: \", err_ratio(kNN5, titanic))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "kNN9 = NearestNeighborLearner(titanic,k=9)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error ratio for k nearest neighbors 9: 0.17059483726150393\n" ] } ], "source": [ "print(\"Error ratio for k nearest neighbors 9: \", err_ratio(kNN9, titanic))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the decision tree learner. It is nearly perfect." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "DTL = DecisionTreeLearner(titanic)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error ratio for decision tree learner: 0.001122334455667784\n" ] } ], "source": [ "print(\"Error ratio for decision tree learner: \", err_ratio(DTL, titanic))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next is the random forest model, with 5 trees. (We have edited RandomForest to eliminate the debugging message for each round.)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "RFL = RandomForest(titanic, n=5)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error ratio for random forest learner: 0.06285072951739623\n" ] } ], "source": [ "print(\"Error ratio for random forest learner: \", err_ratio(RFL, titanic))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now try a naive Bayes model." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "dataset = titanic\n", "\n", "target_vals = dataset.values[dataset.target]\n", "target_dist = CountingProbDist(target_vals)\n", "attr_dists = {(gv, attr): CountingProbDist(dataset.values[attr])\n", " for gv in target_vals\n", " for attr in dataset.inputs}\n", "for example in dataset.examples:\n", " targetval = example[dataset.target]\n", " target_dist.add(targetval)\n", " for attr in dataset.inputs:\n", " attr_dists[targetval, attr].add(example[attr])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def predict(example):\n", " def class_probability(targetval):\n", " return (target_dist[targetval] *\n", " product(attr_dists[targetval, attr][example[attr]]\n", " for attr in dataset.inputs))\n", " return argmax(target_vals, key=class_probability)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error ratio for naive Bayes discrete: 0.10998877665544338\n" ] } ], "source": [ "print(\"Error ratio for naive Bayes discrete: \", err_ratio(predict, titanic))" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "can't multiply sequence by non-int of type 'float'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mtitanic\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclasses_to_numbers\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mperceptron\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mPerceptronLearner\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtitanic\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/home/classes/cs470/hws/hw6/learning.py\u001b[0m in \u001b[0;36mPerceptronLearner\u001b[0;34m(dataset, learning_rate, epochs)\u001b[0m\n\u001b[1;32m 811\u001b[0m \u001b[0mhidden_layer_sizes\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 812\u001b[0m \u001b[0mraw_net\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnetwork\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mi_units\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhidden_layer_sizes\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mo_units\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 813\u001b[0;31m \u001b[0mlearned_net\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mBackPropagationLearner\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mraw_net\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlearning_rate\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mepochs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 814\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 815\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mexample\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/home/classes/cs470/hws/hw6/learning.py\u001b[0m in \u001b[0;36mBackPropagationLearner\u001b[0;34m(dataset, net, learning_rate, epochs, activation)\u001b[0m\n\u001b[1;32m 742\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mnode\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mlayer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 743\u001b[0m \u001b[0minc\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m[\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mn\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mnode\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 744\u001b[0;31m \u001b[0min_val\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdotproduct\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnode\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mweights\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 745\u001b[0m \u001b[0mnode\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnode\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mactivation\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0min_val\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 746\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/home/classes/cs470/hws/hw6/utils.py\u001b[0m in \u001b[0;36mdotproduct\u001b[0;34m(X, Y)\u001b[0m\n\u001b[1;32m 133\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mdotproduct\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mY\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 134\u001b[0m \u001b[0;34m\"\"\"Return the sum of the element-wise product of vectors X and Y.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 135\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0my\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mzip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mY\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 136\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 137\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/home/classes/cs470/hws/hw6/utils.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 133\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mdotproduct\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mY\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 134\u001b[0m \u001b[0;34m\"\"\"Return the sum of the element-wise product of vectors X and Y.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 135\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0my\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mzip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mY\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 136\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 137\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mTypeError\u001b[0m: can't multiply sequence by non-int of type 'float'" ] } ], "source": [ "titanic.classes_to_numbers()\n", "\n", "perceptron = PerceptronLearner(titanic)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a bug here. You need to convert the string values to integers.\n", "\n", "Run other algorithms, such as NeuralNetLearner, LinearLearner, and adaboost as well." 
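, "\n", "\n",
"A minimal sketch of the kind of conversion needed, with a made-up column of values (the classes_to_numbers() call above handles the target; the string-valued input columns need similar treatment):\n",
"\n",
"```python\n",
"# Made-up string-valued column; substitute the real attribute columns\n",
"values = ['male', 'female', 'male', 'female']\n",
"mapping = {v: i for i, v in enumerate(sorted(set(values)))}\n",
"numeric = [mapping[v] for v in values]\n",
"print(mapping)  # {'female': 0, 'male': 1}\n",
"print(numeric)  # [1, 0, 1, 0]\n",
"```"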
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What you need to do\n", "\n", "- Clean up the data **both training (titanic.csv) and test (titanictest.csv)**, as discussed in the youtube video. For example, fill in missing data, combine categories, create age ranges. Help the algorithms learn better. If you edit\n", "the .csv files, you may submit them as well. \n", "- Run the learning algorithms on both the training data and more importantly on the test data.\n", "- Once you have settled on a good algorithm, run it on different sizes of the training data, e.g., 10%, 25%, 50%, 75%, 100%, and measure the change in error rate. The general rule is, the more data, the better the prediction. See if that holds.\n", "- You should try to do as well as the youtube code.\n", "- Write a coherent summary of what you did and your results. Try to explain what worked and what did not. Remember to use $\\LaTeX{}$. You might find it useful to visualize the data, e.g., with mathplotlib.\n", "- Do all of this inside this jupyter notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }