Fall 2016 Computer Science 458: Machine Learning


Machine Learning

Supervised Learning

Here's how it works. In a classification task, you are given the following.

  1. A data set of samples, each of which has a vector of features as well as a label, indicating to which class it belongs. Here is an example from the Udacity course based on self-driving cars.

    The training data are points in a 2-D space whose two coordinates are the bumpiness and the grade of the terrain. Each point carries a label indicating whether the car should drive slow (red) or fast (blue).

  2. A classifier algorithm. This algorithm will process the features and the labels, and construct a formula for mapping features to labels. In sklearn, there are lots of possibilities. Here is the code for the Naive Bayes classifier.
    from sklearn.naive_bayes import GaussianNB
    clf = GaussianNB()
    We create an object clf which is an instance of the Naive Bayes classifier. Each sklearn classifier has a fit() method which has parameters for the training features and labels.
    clf.fit(features_train, labels_train)
    The classifier creates a decision boundary which segregates the data into the different classes or categories. In the image above the area with the blue background indicates the conditions the program has learned under which the car can drive fast and the red background indicates the conditions where the car should slow down.
  3. A prediction function. The prediction function simply applies the classifier function created by fit to a new data element, and predicts the label for the new element. In sklearn, classifiers have a predict() method.
  4. A scoring function. Once you have used the classifier to fit the data, you can run it on some test data, for which you know the correct answer, and determine how well your classifier performs. Each sklearn classifier has a score() method which allows you to test the classifier with data different from the training data. This process tests the accuracy of your derived model.
    clf.score(features_test, labels_test)
    You can also create a separate set of predictions for the test data and independently assess the accuracy.
    pred = clf.predict(features_test)
    from sklearn.metrics import accuracy_score
    accuracy_score(labels_test, pred)
    We added some timing code to see how fast the classification and training goes.
    from time import time
    from sklearn.naive_bayes import GaussianNB
    print("\nNaive bayes classifier: \n")
    clf = GaussianNB()
    t0 = time()  # start the training timer
    clf.fit(features_train, labels_train)
    print("training time:", round(time() - t0, 3), "s")
    t1 = time()  # start the scoring timer
    print(clf.score(features_test, labels_test))
    print("scoring time:", round(time() - t1, 3), "s")
    Here is the output:
    Naive bayes classifier: 
    training time: 0.001 s
    scoring time: 0.008 s
    The accuracy is 88.4%. You probably want your self-driving car to do better than that.
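The four steps above can be put together in one short script. This is a minimal sketch, not the course code: it assumes synthetic two-feature data in place of the terrain data set, and the make_terrain_data helper and its labeling rule are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def make_terrain_data(n_points=1000, seed=42):
    """Hypothetical stand-in for the terrain data: two features in [0, 1]
    (bumpiness, grade) and a label, 1 = slow, 0 = fast."""
    rng = np.random.RandomState(seed)
    features = rng.rand(n_points, 2)
    noise = rng.normal(scale=0.1, size=n_points)
    # Assumed rule: terrain that is bumpy or steep means "drive slow".
    labels = (features[:, 0] + features[:, 1] + noise > 1.0).astype(int)
    return features, labels


# Step 1: a labeled data set, split into training and test parts.
features, labels = make_terrain_data()
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)

# Step 2: fit the classifier on the training data.
clf = GaussianNB()
clf.fit(features_train, labels_train)

# Step 3: predict labels for data the classifier has not seen.
pred = clf.predict(features_test)

# Step 4: score the classifier in two equivalent ways.
score = clf.score(features_test, labels_test)
manual_accuracy = accuracy_score(labels_test, pred)

print("accuracy from score():", score)
print("accuracy from accuracy_score():", manual_accuracy)
```

Both accuracy numbers should agree, since score() on a classifier reports the mean accuracy of predict() on the data it is given.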
Naive Bayes
Support Vector Machines (SVM)
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.

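An SVM creates a hyperplane separating the classes of data, chosen to maximize the margin between the decision surface and the nearest training points.

SVC = support vector classifier

If the features are not linearly separable, create a new feature from the existing ones. For example, if the points of one class surround the points of the other class in the (x, y) plane, the new feature z = x^2 + y^2 makes the classes separable by a linear SVM.

Kernel trick: implicitly map the features into a higher-dimensional space where a linear separator exists, without computing the new features by hand.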
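The new-feature idea and the kernel trick can be compared on a small example. This is a sketch on assumed synthetic data (one class inside a circle, the other in a ring around it); the variable names are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Synthetic data (an assumption): class 0 lies inside a circle of radius 1,
# class 1 lies in a ring between radius 2 and radius 3 around it.
n = 200
radius = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
angle = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# A linear SVM on the raw (x, y) features cannot separate a ring from its center.
linear_raw = SVC(kernel="linear").fit(X, labels)

# Add the new feature z = x^2 + y^2; a plane in (x, y, z) can now separate them.
z = (X ** 2).sum(axis=1)
X_with_z = np.column_stack([X, z])
linear_with_z = SVC(kernel="linear").fit(X_with_z, labels)

# The RBF kernel reaches a similar result without building z explicitly.
rbf_raw = SVC(kernel="rbf").fit(X, labels)

acc_raw = linear_raw.score(X, labels)
acc_z = linear_with_z.score(X_with_z, labels)
acc_rbf = rbf_raw.score(X, labels)
print("linear SVM, raw features:     ", acc_raw)
print("linear SVM, with z = x^2+y^2: ", acc_z)
print("RBF SVM, raw features:        ", acc_rbf)
```

The accuracies here are training accuracies, used only to show whether the classes can be separated at all.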

Parameters: The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, you should get misclassified examples, often even if your training data is linearly separable.
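The effect of C on the margin can be checked directly for a linear SVM, where the margin width is 2 / ||w||. This is a rough sketch on assumed synthetic data (two overlapping clusters); the two C values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (synthetic data, an assumption for illustration).
centers = np.array([[0.0, 0.0], [2.0, 2.0]])
X, y = make_blobs(n_samples=200, centers=centers, cluster_std=1.0,
                  random_state=0)

margins = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # For a linear SVM, coef_ holds w and the margin width is 2 / ||w||.
    margins[C] = 2.0 / np.linalg.norm(clf.coef_)
    print("C =", C,
          "| margin width:", round(margins[C], 3),
          "| training accuracy:", clf.score(X, y),
          "| support vectors:", len(clf.support_))
```

The small C should report the wider margin, at the cost of leaving more training points inside it.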

Parameters: Gamma. A standard SVM is a linear classifier based on dot products. However, in 1992, Boser, Guyon, and Vapnik proposed a way to model more complicated relationships by replacing each dot product with a nonlinear kernel function (such as a Gaussian radial basis function or polynomial kernel). Gamma is the free parameter of the Gaussian radial basis function.

A small gamma means a Gaussian with a large variance, so the influence of x_j reaches farther: if x_j is a support vector, a small gamma implies that the class of this support vector influences the class assigned to a vector x_i even if the distance between them is large. If gamma is large, the variance is small, so the support vector does not have wide-spread influence. In bias-variance terms, a large gamma tends to give low-bias, high-variance models that are prone to overfitting, and a small gamma the reverse.

Avoid overfitting
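One way to watch for overfitting is to compare training and test accuracy while varying gamma. The sketch below assumes the make_moons toy data set and three arbitrary gamma values; it is an illustration, not a tuning recipe.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-class toy data (an assumption for illustration).
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

train_acc, test_acc = {}, {}
for gamma in (0.01, 1.0, 1000.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    train_acc[gamma] = clf.score(X_train, y_train)
    test_acc[gamma] = clf.score(X_test, y_test)
    print("gamma =", gamma,
          "| train accuracy:", round(train_acc[gamma], 3),
          "| test accuracy:", round(test_acc[gamma], 3))
```

With a very large gamma each support vector only influences its immediate neighborhood, so the model can fit the training points closely while doing worse on the held-out points.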
Decision Trees
K nearest neighbors
classic, simple, easy to understand
  • digits.py digits dataset - handwritten digits
  • your_algorithm.py naivebayes.png svm.png decisiontree.png knn.png randomforest.png adaboost.png
    Naive bayes classifier: 
    training time: 0.001 s
    scoring time: 0.01 s
    SVM classifier: 
    ('C = ', 10000.0)
    training time: 0.04 s
    scoring time: 0.002 s
    Decision trees classifier: 
    training time: 0.001 s
    scoring time: 0.001 s
    K nearest neighbors classifier: 
    training time: 0.001 s
    scoring time: 0.001 s
    Random forest classifier: 
    training time: 0.038 s
    scoring time: 0.003 s
    Adaboost classifier: 
    training time: 0.127 s
    scoring time: 0.006 s
Random Forest
ensemble method
Adaboost
ensemble method

More data will usually improve accuracy more than a fine-tuned algorithm.

Regression (Ordinary Least Squares, OLS)
Find the best-fit line.
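A best-fit line can be found with sklearn's LinearRegression, which performs ordinary least squares. This is a minimal sketch on assumed synthetic data generated from the line y = 3x + 2 plus noise; the same fit is repeated with NumPy's least-squares solver as a cross-check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): points scattered around y = 3x + 2.
rng = np.random.RandomState(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=200)

# sklearn expects a 2-D feature array, hence the reshape.
reg = LinearRegression().fit(x.reshape(-1, 1), y)
slope, intercept = reg.coef_[0], reg.intercept_

# The same least-squares problem solved directly on the design matrix [x, 1].
A = np.column_stack([x, np.ones_like(x)])
solution, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

print("sklearn slope, intercept:", slope, intercept)
print("lstsq   slope, intercept:", solution[0], solution[1])
```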

Unsupervised Learning

K Means
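As a quick illustration of K Means in sklearn, the sketch below clusters assumed synthetic data drawn around three well-separated centers. No labels are given to the algorithm; the number of clusters (3) is chosen by hand here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption): 100 points around each of three centers.
true_centers = np.array([[-5.0, 5.0], [0.0, 0.0], [5.0, 5.0]])
X, _ = make_blobs(n_samples=300, centers=true_centers, cluster_std=0.5,
                  random_state=0)

# Fit K Means without using any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cluster numbering is arbitrary, so sort the centers by x coordinate
# before comparing them with the centers used to generate the data.
order = np.argsort(kmeans.cluster_centers_[:, 0])
found_centers = kmeans.cluster_centers_[order]
cluster_sizes = np.bincount(kmeans.labels_)

print("found centers:\n", found_centers)
print("cluster sizes:", cluster_sizes)
print("cluster for a new point:", kmeans.predict([[4.8, 5.1]])[0])
```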
Text Processing
  • TF-IDF
Principal Component Analysis (PCA)
  • Systematized way to transform input features into principal components.
  • Use principal components as new features.
  • PCs are directions in the data that maximize variance (minimize information loss) when you project / compress down onto them.
  • The more variance along a PC, the higher that PC is ranked.
  • Most variance / most information => first PC. Second most variance (without overlapping with the first PC) => second PC. The PCs are orthogonal.
  • Maximum number of PCs: min(n_features, n_data_points).
  • Reduces the dimensions of the feature set.
  • General algorithm for feature transformation
  • When to use PCA
  • Facial recognition in pictures is a good application for PCA.
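The PCA properties listed above can be checked with sklearn's PCA class. This is a small sketch on assumed synthetic data (two strongly correlated features, plus a tiny 5-by-10 array to show the limit on the number of components).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic data (an assumption): two features that mostly vary together.
t = rng.normal(size=500)
X = np.column_stack([t + 0.1 * rng.normal(size=500),
                     t + 0.1 * rng.normal(size=500)])

pca = PCA(n_components=2).fit(X)
first_pc, second_pc = pca.components_
ratio = pca.explained_variance_ratio_

print("first PC direction: ", first_pc)
print("second PC direction:", second_pc)
print("explained variance ratio:", ratio)  # ranked from most to least variance
print("dot product of the PCs:", np.dot(first_pc, second_pc))  # close to 0

# Use the first principal component as a single new feature.
X_reduced = PCA(n_components=1).fit_transform(X)
print("reduced shape:", X_reduced.shape)

# With 5 data points and 10 features, at most min(10, 5) = 5 PCs are kept.
X_wide = rng.normal(size=(5, 10))
n_pcs_wide = PCA().fit(X_wide).n_components_
print("number of PCs for a 5 x 10 data set:", n_pcs_wide)
```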
Evaluation Metrics

Neural Networks and TensorFlow