{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CME193 Assignment 2\n",
    "\n",
    "### Due Sunday 17th Feb 5PM\n",
    "\n",
    "In this assignment you will implement some machine learning algorithms on the [Congressional Voting Records Dataset](https://archive.ics.uci.edu/ml/datasets/congressional+voting+records). The goal of the assignment is to write a python script that reads in the dataset from the internet, process it and build a few models and output some graphs.\n",
    "\n",
    "You can use this notebook to write code and check that it works but once you are sure that everything works you will put all your code in a script, that can be called from the command line. It is always a good habit to convert the code you write in notebooks into clean scripts so that it can be used with relative use later on.\n",
    "\n",
    "Note : Most programming courses always have starter code to help students in completing the assignments, this is done so that students do not waste time coding up boilerplate code and also to help graders by standardising the code they have to read, but unfortunately this leaves many students with only the ability to fill in code while they lack confidence in creating a project from scratch. It is in this interest that only minimal starter code is provided in this assignment and you are required to submit a script.\n",
    "\n",
    "Make sure you refer to the lecture notebooks in case you forgot how to do any of the operations mentioned below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset\n",
    "\n",
    "The dataset we will be working with on this assignment is the [Congressional Voting Records Dataset](https://archive.ics.uci.edu/ml/datasets/congressional+voting+records) for 1984, open the link and read the description of the dataset, make sure you understand what the columns and rows represent.\n",
    "\n",
    "The following code will quickly download the dataset into a pandas dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>party</th>\n",
       "      <th>Vote_0</th>\n",
       "      <th>Vote_1</th>\n",
       "      <th>Vote_2</th>\n",
       "      <th>Vote_3</th>\n",
       "      <th>Vote_4</th>\n",
       "      <th>Vote_5</th>\n",
       "      <th>Vote_6</th>\n",
       "      <th>Vote_7</th>\n",
       "      <th>Vote_8</th>\n",
       "      <th>Vote_9</th>\n",
       "      <th>Vote_10</th>\n",
       "      <th>Vote_11</th>\n",
       "      <th>Vote_12</th>\n",
       "      <th>Vote_13</th>\n",
       "      <th>Vote_14</th>\n",
       "      <th>Vote_15</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>republican</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>?</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>republican</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>democrat</td>\n",
       "      <td>?</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>?</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>democrat</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>?</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>democrat</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>n</td>\n",
       "      <td>y</td>\n",
       "      <td>?</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "      <td>y</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        party Vote_0 Vote_1 Vote_2 Vote_3 Vote_4 Vote_5 Vote_6 Vote_7 Vote_8  \\\n",
       "0  republican      n      y      n      y      y      y      n      n      n   \n",
       "1  republican      n      y      n      y      y      y      n      n      n   \n",
       "2    democrat      ?      y      y      ?      y      y      n      n      n   \n",
       "3    democrat      n      y      y      n      ?      y      n      n      n   \n",
       "4    democrat      y      y      y      n      y      y      n      n      n   \n",
       "\n",
       "  Vote_9 Vote_10 Vote_11 Vote_12 Vote_13 Vote_14 Vote_15  \n",
       "0      y       ?       y       y       y       n       y  \n",
       "1      n       n       y       y       y       n       ?  \n",
       "2      n       y       n       y       y       n       n  \n",
       "3      n       y       n       y       n       n       y  \n",
       "4      n       y       ?       y       y       y       y  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "fname = \"https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data\"\n",
    "df = pd.read_csv(fname, names = [\"party\"]+[\"Vote_%d\"% i for i in range(16)])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build Models\n",
    "\n",
    "Using the `dmatrices` function from the `patsy` package, create a design matrix to predict the political party of the member of congress based on the votes cast by each of the members. Remember to treat the votes as a categorical variable as there are three possibilities for each vote (y,n,?).\n",
    "\n",
    "Next split the dataset into training set and test set, with 30% reserved for the test set.\n",
    "\n",
    "Train two different models on the training set.\n",
    "1. A Logisitic Regresssion Model\n",
    "2. A Support Vector Machine (SVM)\n",
    "    \n",
    "You can use default parameters for both the models.\n",
    "\n",
    "Output the accuracy of each of the model on the test set."
   ]
  },
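  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The split, fit and score steps above can be sketched with scikit-learn as follows. This is only an illustration on synthetic data: the arrays `X` and `y` below are hypothetical stand-ins for your design matrix and party labels.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "# Hypothetical stand-in for the design matrix: 200 rows of 0/1 indicators,\n",
    "# with a label that copies the first column except for about 10% noise.\n",
    "rng = np.random.RandomState(0)\n",
    "X = rng.randint(0, 2, size=(200, 5))\n",
    "y = np.where(rng.rand(200) < 0.1, 1 - X[:, 0], X[:, 0])\n",
    "\n",
    "# Hold out 30% of the rows as the test set.\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.3, random_state=0)\n",
    "\n",
    "# Both models use their default parameters.\n",
    "models = {'Logistic regression': LogisticRegression(), 'SVM': SVC()}\n",
    "accuracies = {}\n",
    "for name, model in models.items():\n",
    "    model.fit(X_train, y_train)\n",
    "    # score() returns the mean accuracy on the given data.\n",
    "    accuracies[name] = model.score(X_test, y_test)\n",
    "    print(name, 'test accuracy:', round(accuracies[name], 3))\n",
    "```\n",
    "\n",
    "Fixing `random_state` is optional; it is set here only so that the split is reproducible between runs."
   ]
  },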
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compare Models\n",
    "\n",
    "Now we will compare the predictions of the models to each other and the true values. The approach we will use is to use a scatter plot of the predicted probabilites.\n",
    "\n",
    "First compute the predicted probabilites, from both the models, for the political party being a specific value (say democrat). Now we can use one model as the X axis and the other as the Y axis of the scatter plot. Also colour the dots based on the true political party, i.e. red dots for republicans and blue dots for democrats.\n",
    "\n",
    "If both the models are accurate and consistent, you should see all the blue dots in one corner and red dots in the other corner, with some sparse points in the middle of both colours.\n",
    "\n",
    "Save the scatter plot in a file called \"scatter.png\"."
   ]
  },
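  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible shape for the comparison plot is sketched below, again on hypothetical synthetic data where label 1 plays the role of democrat. Note that scikit-learn's `SVC` only exposes `predict_proba` when it is constructed with `probability=True`; using that option here is an assumption about how you choose to get probabilities from the SVM.\n",
    "\n",
    "```python\n",
    "import matplotlib\n",
    "matplotlib.use('Agg')  # non-interactive backend, so a script can run without a display\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "# Hypothetical synthetic data; label 1 stands in for democrat.\n",
    "rng = np.random.RandomState(0)\n",
    "X = rng.randint(0, 2, size=(200, 5))\n",
    "y = np.where(rng.rand(200) < 0.1, 1 - X[:, 0], X[:, 0])\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.3, random_state=0)\n",
    "\n",
    "logit = LogisticRegression().fit(X_train, y_train)\n",
    "svm = SVC(probability=True, random_state=0).fit(X_train, y_train)\n",
    "\n",
    "# The column order of predict_proba follows each model's classes_ attribute.\n",
    "p_logit = logit.predict_proba(X_test)[:, list(logit.classes_).index(1)]\n",
    "p_svm = svm.predict_proba(X_test)[:, list(svm.classes_).index(1)]\n",
    "\n",
    "# Colour each point by its true label.\n",
    "colors = ['blue' if label == 1 else 'red' for label in y_test]\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "ax.scatter(p_logit, p_svm, c=colors, alpha=0.5)\n",
    "ax.set_xlabel('Logistic regression: P(class 1)')\n",
    "ax.set_ylabel('SVM: P(class 1)')\n",
    "fig.savefig('scatter.png')\n",
    "```\n",
    "\n",
    "Looking up the column through `classes_` rather than hard-coding an index avoids plotting the probability of the wrong party."
   ]
  },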
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python Script\n",
    "\n",
    "Copy all your code into a python script and make sure you add some comments describing your code.\n",
    "Save the python script as `assign2.py`.\n",
    "\n",
    "Test run your script by typing `python assign2.py` in your terminal. The code should output the accuracy of both models on the test set and save the graph `scatter.png` in the current directory. (Make sure you have activated your environment when you run the script)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submission Requirements\n",
    "\n",
    "Submit the following on canvas\n",
    "1. A python script (assign2.py file) which will load the dataset, fit both the models and save the graph \"scatter.png\"\n",
    "2. The \"scatter.png\" that you produced."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (3.6-cme193_new)",
   "language": "python",
   "name": "cme193_new"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}