Also, it is often necessary to scrub the data before applying the learning algorithm; this process is also known as data pre-processing. You may need to filter out missing data or outliers. Sometimes, though, the outliers are the interesting part, as in credit card fraud detection.
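A minimal sketch of both steps on made-up numbers (the values and the z-score cutoff are illustrative, not from any dataset used below):

```python
import numpy as np

# Hypothetical raw feature column with a missing value (NaN) and an outlier.
values = np.array([0.2, 0.3, np.nan, 0.25, 9.9, 0.28])

# Filter missing data: drop the NaN entries.
clean = values[~np.isnan(values)]

# Filter outliers: flag points far from the mean in standard-deviation units.
# The 1.5 cutoff is a toy choice for this tiny sample.
z = np.abs(clean - clean.mean()) / clean.std()
inliers = clean[z < 1.5]
outliers = clean[z >= 1.5]
```

In a fraud-detection setting you would keep `outliers` rather than discard them.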
The training data are points in a 2-D space whose coordinates give the bumpiness and grade of the terrain. Each point carries a label indicating whether the car should drive slow (red) or fast (blue).
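The course's terrain-data generator is not reproduced here; a minimal stand-in might look like the following (the labeling rule, that steep plus bumpy terrain means slow, is made up for illustration):

```python
import random

def make_terrain_data(n_points=1000):
    """Toy stand-in for the course's terrain generator (assumed, not the original)."""
    random.seed(42)
    features, labels = [], []
    for _ in range(n_points):
        grade = random.random()   # steepness of the terrain, in [0, 1]
        bumpy = random.random()   # bumpiness of the terrain, in [0, 1]
        # Hypothetical rule: steep and bumpy terrain means drive slow (label 1).
        labels.append(1 if grade + bumpy > 1.0 else 0)
        features.append([grade, bumpy])
    return features, labels

features_train, labels_train = make_terrain_data(750)
```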
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()

We create an object clf which is an instance of the Naive Bayes classifier. Each sklearn classifier has a fit() method which takes the training features and labels as parameters.
clf.fit(features_train, labels_train)

The classifier creates a decision boundary which segregates the data into the different classes or categories. In the image above, the blue background marks the conditions under which the program has learned the car can drive fast, and the red background marks the conditions under which it should slow down.
clf.predict(features_test)
clf.score(features_test, labels_test)

You can also create a separate set of predictions for the test data and assess the accuracy independently.
pred = clf.predict(features_test)
from sklearn.metrics import accuracy_score
accuracy_score(labels_test, pred)

We added some timing code to see how fast the training and classification go.
from time import time
from sklearn.naive_bayes import GaussianNB

print("\nNaive Bayes classifier:\n")
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t1 = time()
print(clf.score(features_test, labels_test))
print("scoring time:", round(time() - t1, 3), "s")

Here is the output:
Naive Bayes classifier:

training time: 0.001 s
0.884
scoring time: 0.008 s

The accuracy is 88.4%. You probably want your self-driving car to do better than that.
Notes:
* plots: use different colors for the different classes
* to inspect the fitted model's attributes:

for x in dir(clf):
    print(x, getattr(clf, x))

* theta_ and sigma_ hold the per-class feature means and variances; compare with the wiki values. What's the deal with sigma_?
* 3-D scatter plots!
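On the sigma_ question: theta_ holds the per-class feature means, and sigma_ the per-class feature variances; in newer scikit-learn versions the variances have moved to the var_ attribute (sigma_ was deprecated and later removed). A small check on made-up data confirms that theta_ matches the means computed directly:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny illustrative dataset: two features, two classes.
X = np.array([[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [0.9, 0.3]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB()
clf.fit(X, y)

# theta_ is the per-class mean of each feature, estimated during fit().
assert np.allclose(clf.theta_[0], X[y == 0].mean(axis=0))
assert np.allclose(clf.theta_[1], X[y == 1].mean(axis=0))
```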
* Vary size of the training set.
* Vary the kernel (linear, rbf (radial basis function), or others).
* Vary gamma or C.

Output of svm_author_id.py:

linear kernel:

with entire data set:
training time: 163.169 s
0.984072810011
scoring time: 17.188 s

with smaller data set:
training time: 0.089 s
0.884527872582
scoring time: 0.961 s

rbf kernel:
training time: 0.101 s
0.616040955631
scoring time: 1.101 s

C=10.0
training time: 0.102 s
0.616040955631
scoring time: 1.101 s

C=100.0
training time: 0.101 s
0.616040955631
scoring time: 1.101 s

C=1000.0
training time: 0.097 s
0.821387940842
scoring time: 1.052 s

C=10000.0
training time: 0.096 s
0.892491467577
scoring time: 0.886 s

('C = ', 10000.0)
training time: 0.099 s
0.892491467577
scoring time: 0.885 s

for x in dir(clf):
    print(x, getattr(clf, x))

complete dataset:
('C = ', 10000.0)
training time: 109.979 s
0.990898748578
scoring time: 11.163 s

sklearn rbf parameters
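The C sweep above can be reproduced with a loop like the following. The email-author features are not included here, so this sketch uses synthetic data as a stand-in; the exact accuracies will differ from the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the email author features (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Sweep the rbf kernel's C parameter, as in the experiments above.
scores = {}
for C in [10.0, 100.0, 1000.0, 10000.0]:
    clf = SVC(kernel="rbf", C=C)
    clf.fit(X_train, y_train)
    scores[C] = clf.score(X_test, y_test)
    print("C =", C, "accuracy:", scores[C])
```

Larger C penalizes misclassified training points more heavily, which is why accuracy climbed as C grew in the runs above (at the risk of overfitting).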
Classifier             training time   accuracy   scoring time
Naive Bayes            0.001 s         0.884      0.01 s
SVM (C = 10000.0)      0.04 s          0.932      0.002 s
Decision tree          0.001 s         0.912      0.001 s
K nearest neighbors    0.001 s         0.936      0.001 s
Random forest          0.038 s         0.912      0.003 s
AdaBoost               0.127 s         0.924      0.006 s
More data will usually improve accuracy more than fine-tuning the algorithm will.
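One way to see this is to hold the algorithm fixed and grow the training set. This sketch uses synthetic data, so the exact numbers are illustrative only; the general trend is that accuracy climbs as the training slice grows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for a real learning problem (assumption).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Train the same model on growing slices of the training set.
accuracies = []
for n in [50, 200, 1000]:
    clf = GaussianNB().fit(X_train[:n], y_train[:n])
    accuracies.append(clf.score(X_test, y_test))
    print(n, "training points, accuracy:", accuracies[-1])
```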
Regression (Ordinary Least Squares, OLS)
Find the best-fit line.
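OLS in sklearn follows the same fit/predict pattern as the classifiers above. A small sketch on made-up points lying near the line y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points scattered around y = 2x + 1 (illustrative data).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Fit the ordinary-least-squares line; coef_ is the slope, intercept_ the offset.
reg = LinearRegression()
reg.fit(X, y)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
```

The fitted slope and intercept come out close to the true 2 and 1 used to generate the points.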
===================================================
Faces recognition example using eigenfaces and SVMs
===================================================

The dataset used in this example is a preprocessed excerpt of the "Labeled Faces in the Wild", aka LFW_:

http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)

.. _LFW: http://vis-www.cs.umass.edu/lfw/

original source: http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html

2016-12-04 16:59:05,287 Loading LFW people faces from /home/accts/sbs5/scikit_learn_data/lfw_home
Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7
Extracting the top 150 eigenfaces from 966 faces
done in 1.002s
Projecting the input data on the eigenfaces orthonormal basis
done in 0.089s
Fitting the classifier to the training set
done in 20.811s
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
Predicting the people names on the testing set
done in 0.054s

                   precision    recall  f1-score   support

     Ariel Sharon       0.55      0.46      0.50        13
     Colin Powell       0.78      0.82      0.80        60
  Donald Rumsfeld       0.67      0.67      0.67        27
    George W Bush       0.87      0.92      0.90       146
Gerhard Schroeder       0.86      0.76      0.81        25
      Hugo Chavez       0.83      0.67      0.74        15
       Tony Blair       0.84      0.75      0.79        36

      avg / total       0.82      0.82      0.82       322

[[  6   3   2   1   0   1   0]
 [  1  49   2   8   0   0   0]
 [  3   1  18   4   0   0   1]
 [  1   3   4 135   0   0   3]
 [  0   3   0   1  19   1   1]
 [  0   2   0   1   2  10   0]
 [  0   2   1   5   1   0  27]]

Checking the accuracy.
N_components   F1 score
10             .1212
15             .0869
25             .5882
50             .7199
100            .6666
150            .4999
250            .5599

More principal components is not always better: the model can overfit. See also: precision, recall, and the F1 score.
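The F1 score is the harmonic mean of precision and recall, F1 = 2pr / (p + r). A quick check on toy labels (made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy binary labels, chosen so precision = recall = 0.75.
y_true = [1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1]

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)

# f1_score agrees with the harmonic-mean formula.
assert abs(f1 - 2 * p * r / (p + r)) < 1e-9
```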
(from above)

[[  6   3   2   1   0   1   0]
 [  1  49   2   8   0   0   0]
 [  3   1  18   4   0   0   1]
 [  1   3   4 135   0   0   3]
 [  0   3   0   1  19   1   1]
 [  0   2   0   1   2  10   0]
 [  0   2   1   5   1   0  27]]
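The per-class precision and recall in the report can be recovered from this confusion matrix, whose rows are true classes and columns are predicted classes:

```python
import numpy as np

# The confusion matrix from the eigenfaces run above.
cm = np.array([
    [6,   3,  2,   1,  0,  0 + 1,  0],
    [1,  49,  2,   8,  0,  0,  0],
    [3,   1, 18,   4,  0,  0,  1],
    [1,   3,  4, 135,  0,  0,  3],
    [0,   3,  0,   1, 19,  1,  1],
    [0,   2,  0,   1,  2, 10,  0],
    [0,   2,  1,   5,  1,  0, 27],
])

# Recall per class: correct predictions over all true examples (row sums).
recall = np.diag(cm) / cm.sum(axis=1)
# Precision per class: correct predictions over all predicted examples (column sums).
precision = np.diag(cm) / cm.sum(axis=0)

# Ariel Sharon (row/column 0): recall 6/13 = 0.46, precision 6/11 = 0.55,
# matching the classification report above.
print(recall[0], precision[0])
```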