Also, it is often necessary to scrub the data before applying the learning algorithm. This process is also known as data pre-processing. You may need to filter out missing data or outliers, though sometimes the outliers are the interesting part, as in credit card fraud detection.
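A minimal sketch of that kind of scrubbing, assuming the data sit in a pandas DataFrame (pandas, the file name, and the column name here are my assumptions, not part of the course code):

import numpy as np
import pandas as pd

df = pd.read_csv("terrain.csv")   # hypothetical input file
df = df.dropna()                  # filter out rows with missing values
# drop outliers: keep rows within 3 standard deviations of the mean grade
z = (df["grade"] - df["grade"].mean()) / df["grade"].std()
df = df[np.abs(z) < 3]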
The training data are sets of points in 2-d space, indicating the bumpiness and grade of the terrain. Associated with each point is a label indicating whether the car should drive slow (red) or fast (blue).
Generated with code: prep_terrain_data.py
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()

We create an object clf which is an instance of the GaussianNB (Gaussian Naive Bayes) classifier. Each sklearn classifier has a fit() method which takes the training features and labels as parameters.
clf.fit(features_train, labels_train)

The classifier creates a decision boundary which segregates the data into the different classes or categories. In the image above, the area with the blue background indicates the conditions under which the program has learned the car can drive fast, and the red background indicates the conditions where the car should slow down.
clf.predict(features_test)
clf.score(features_test, labels_test)

You can also create a separate set of predictions for the test data and independently assess the accuracy.
pred = clf.predict(features_test)
from sklearn.metrics import accuracy_score
accuracy_score(labels_test, pred)

We added some timing code to see how fast the training and scoring go.
from time import time

print("\nNaive bayes classifier: \n")

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t1 = time()
print(clf.score(features_test, labels_test))
print("scoring time:", round(time() - t1, 3), "s")
Here is the output:
Naive bayes classifier:
training time: 0.001 s
0.884
scoring time: 0.008 s

The accuracy is 88.4%. You probably want your self-driving car to do better than that.
Note for the plots: use different colors for the different classes.
# inspect all attributes of the fitted classifier
for x in dir(clf):
    print(x, getattr(clf, x))
theta_ and sigma_ hold the per-class mean and variance of each feature, i.e. the parameters of the Gaussian likelihood -- compare with values computed by hand (or with the Wikipedia worked example).
What's the deal with sigma_? It is the per-class variance of each feature; sklearn adds a small smoothing epsilon, so it can differ slightly from the raw variance, and newer versions rename it var_.
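A quick sketch of that comparison (the variable names are the course's; the check itself is my assumption, and it converts to numpy arrays first since the course stores the data as lists):

import numpy as np

X = np.asarray(features_train)
y = np.asarray(labels_train)
var = getattr(clf, "sigma_", None)
if var is None:                      # newer sklearn renames sigma_ to var_
    var = clf.var_
for i, c in enumerate(clf.classes_):
    rows = X[y == c]
    print(c, "mean:", rows.mean(axis=0), "vs theta_:", clf.theta_[i])
    print(c, "var :", rows.var(axis=0), "vs sigma_:", var[i])
    # sigma_ may differ slightly from the raw variance due to sklearn's smoothing term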
3d scatter plots!!
* Vary size of training set.
* Vary kernel (linear or rbf - radial basis function or others)
* Vary gamma or C
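A minimal sketch of those experiments, using the course's variable names (the particular kernel/C combinations are just examples, not the assignment's required values):

from time import time
from sklearn.svm import SVC

for kernel, C in [("linear", 1.0), ("rbf", 1.0), ("rbf", 1000.0), ("rbf", 10000.0)]:
    clf = SVC(kernel=kernel, C=C)
    t0 = time()
    clf.fit(features_train, labels_train)
    print(kernel, "C =", C, "training time:", round(time() - t0, 3), "s")
    t1 = time()
    print("accuracy:", clf.score(features_test, labels_test),
          "scoring time:", round(time() - t1, 3), "s")

To vary the training set size, slice the training arrays first, e.g. features_train = features_train[:len(features_train)//100].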
Output of svm_author_id.py:
linear kernel:
with entire data set:
training time: 163.169 s
0.984072810011
scoring time: 17.188 s
with smaller data set:
training time: 0.089 s
0.884527872582
scoring time: 0.961 s
rbf kernel:
training time: 0.101 s
0.616040955631
scoring time: 1.101 s
C=10.0
training time: 0.102 s
0.616040955631
scoring time: 1.101 s
C=100.0
training time: 0.101 s
0.616040955631
scoring time: 1.101 s
C=1000.0
training time: 0.097 s
0.821387940842
scoring time: 1.052 s
C=10000.0
training time: 0.096 s
0.892491467577
scoring time: 0.886 s
('C = ', 10000.0)
training time: 0.099 s
0.892491467577
scoring time: 0.885 s
for x in dir(clf):
    print(x, getattr(clf, x))
complete dataset:
('C = ', 10000.0)
training time: 109.979 s
0.990898748578
scoring time: 11.163 s
See the sklearn documentation on RBF SVM parameters (gamma and C).
Classifier               Training time   Accuracy   Scoring time
Naive Bayes              0.001 s         0.884      0.01 s
SVM (C = 10000.0)        0.04 s          0.932      0.002 s
Decision tree            0.001 s         0.912      0.001 s
K nearest neighbors      0.001 s         0.936      0.001 s
Random forest            0.038 s         0.912      0.003 s
AdaBoost                 0.127 s         0.924      0.006 s
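A sketch of a harness that could produce a comparison like this, assuming the features and labels are already loaded; apart from the SVM's C, all hyperparameters are sklearn defaults:

from time import time
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="rbf", C=10000.0),
    "Decision tree": DecisionTreeClassifier(),
    "K nearest neighbors": KNeighborsClassifier(),
    "Random forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}
for name, clf in classifiers.items():
    t0 = time()
    clf.fit(features_train, labels_train)
    train_t = round(time() - t0, 3)
    t1 = time()
    acc = clf.score(features_test, labels_test)
    score_t = round(time() - t1, 3)
    print(name, "training:", train_t, "s  accuracy:", acc, " scoring:", score_t, "s")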
As a rule of thumb, more training data will improve accuracy more than a finely tuned algorithm will.
Regression (Ordinary Least Squares, OLS):
Find the best-fit line, i.e. the line that minimizes the sum of squared errors between the predictions and the observed values.
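A minimal sketch of OLS with sklearn (the toy numbers are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one input feature
y = np.array([2.1, 3.9, 6.2, 8.1])           # target values
reg = LinearRegression()
reg.fit(X, y)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("r^2:", reg.score(X, y))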
===================================================
Faces recognition example using eigenfaces and SVMs
===================================================
The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:
http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
.. _LFW: http://vis-www.cs.umass.edu/lfw/
original source: http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html
2016-12-04 16:59:05,287 Loading LFW people faces from /home/accts/sbs5/scikit_learn_data/lfw_home
Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7
Extracting the top 150 eigenfaces from 966 faces
done in 1.002s
Projecting the input data on the eigenfaces orthonormal basis
done in 0.089s
Fitting the classifier to the training set
done in 20.811s
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Predicting the people names on the testing set
done in 0.054s
precision recall f1-score support
Ariel Sharon 0.55 0.46 0.50 13
Colin Powell 0.78 0.82 0.80 60
Donald Rumsfeld 0.67 0.67 0.67 27
George W Bush 0.87 0.92 0.90 146
Gerhard Schroeder 0.86 0.76 0.81 25
Hugo Chavez 0.83 0.67 0.74 15
Tony Blair 0.84 0.75 0.79 36
avg / total 0.82 0.82 0.82 322
[[ 6 3 2 1 0 1 0]
[ 1 49 2 8 0 0 0]
[ 3 1 18 4 0 0 1]
[ 1 3 4 135 0 0 3]
[ 0 3 0 1 19 1 1]
[ 0 2 0 1 2 10 0]
[ 0 2 1 5 1 0 27]]
Checking the accuracy while varying the number of principal components (n_components):
n_components   F1 score
10             0.1212
15             0.0869
25             0.5882
50             0.7199
100            0.6666
150            0.4999
250            0.5599

More principal components is not always better; too many can lead to overfitting. (The F1 score combines precision and recall: F1 = 2 * precision * recall / (precision + recall).)
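A sketch of how that sweep might look, adapted from the eigenfaces example above (the X_train/y_train names and the weighted f1_score averaging are my assumptions):

from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import f1_score

for n in [10, 15, 25, 50, 100, 150, 250]:
    pca = PCA(n_components=n, whiten=True).fit(X_train)
    clf = SVC(kernel="rbf", C=1000.0, gamma=0.001, class_weight="balanced")
    clf.fit(pca.transform(X_train), y_train)
    pred = clf.predict(pca.transform(X_test))
    print(n, f1_score(y_test, pred, average="weighted"))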