Also, it is often necessary to scrub the data before applying the learning algorithm; this process is also known as data pre-processing. You may need to filter out missing data or outliers. Sometimes, though, the outliers are the interesting part, as in credit card fraud detection.
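A minimal sketch of both steps on made-up numbers (the values and the z-score cutoff are illustrative, not from any dataset used below):

```python
import numpy as np

# Hypothetical raw feature column with a missing value (NaN) and an outlier.
values = np.array([0.2, 0.3, np.nan, 0.25, 9.9, 0.28])

# Filter missing data: drop the NaN entries.
clean = values[~np.isnan(values)]

# Filter outliers: flag points far from the mean in standard-deviation units.
# The 1.5 cutoff is a toy choice for this tiny sample.
z = np.abs(clean - clean.mean()) / clean.std()
inliers = clean[z < 1.5]
outliers = clean[z >= 1.5]
```

In a fraud-detection setting you would keep `outliers` rather than discard them.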
The training data are points in a 2-D space whose coordinates give the bumpiness and grade of the terrain. Each point carries a label indicating whether the car should drive slow (red) or fast (blue).
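The course's terrain-data generator is not reproduced here; a minimal stand-in might look like the following (the labeling rule, that steep plus bumpy terrain means slow, is made up for illustration):

```python
import random

def make_terrain_data(n_points=1000):
    """Toy stand-in for the course's terrain generator (assumed, not the original)."""
    random.seed(42)
    features, labels = [], []
    for _ in range(n_points):
        grade = random.random()   # steepness of the terrain, in [0, 1]
        bumpy = random.random()   # bumpiness of the terrain, in [0, 1]
        # Hypothetical rule: steep and bumpy terrain means drive slow (label 1).
        labels.append(1 if grade + bumpy > 1.0 else 0)
        features.append([grade, bumpy])
    return features, labels

features_train, labels_train = make_terrain_data(750)
```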
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()

We create an object clf which is an instance of the Naive Bayes classifier. Each sklearn classifier has a fit() method which takes the training features and labels as parameters.
clf.fit(features_train, labels_train)

The classifier creates a decision boundary which segregates the data into the different classes or categories. In the image above, the blue background marks the conditions under which the program has learned the car can drive fast, and the red background marks the conditions under which it should slow down.
clf.predict(features_test)
clf.score(features_test, labels_test)

You can also create a separate set of predictions for the test data and assess the accuracy independently.
pred = clf.predict(features_test)
from sklearn.metrics import accuracy_score
accuracy_score(labels_test, pred)

We added some timing code to see how fast the training and classification go.
from time import time
from sklearn.naive_bayes import GaussianNB

print("\nNaive Bayes classifier:\n")
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t1 = time()
print(clf.score(features_test, labels_test))
print("scoring time:", round(time() - t1, 3), "s")

Here is the output:
Naive Bayes classifier:

training time: 0.001 s
0.884
scoring time: 0.008 s

The accuracy is 88.4%. You probably want your self-driving car to do better than that.
Notes:
* plots: use different colors for the different classes
* to inspect the fitted model's attributes:

for x in dir(clf):
    print(x, getattr(clf, x))

* theta_ and sigma_ hold the per-class feature means and variances; compare with the wiki values. What's the deal with sigma_?
* 3-D scatter plots!
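On the sigma_ question: theta_ holds the per-class feature means, and sigma_ the per-class feature variances; in newer scikit-learn versions the variances have moved to the var_ attribute (sigma_ was deprecated and later removed). A small check on made-up data confirms that theta_ matches the means computed directly:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny illustrative dataset: two features, two classes.
X = np.array([[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [0.9, 0.3]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB()
clf.fit(X, y)

# theta_ is the per-class mean of each feature, estimated during fit().
assert np.allclose(clf.theta_[0], X[y == 0].mean(axis=0))
assert np.allclose(clf.theta_[1], X[y == 1].mean(axis=0))
```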
* Vary size of the training set.
* Vary the kernel (linear, rbf (radial basis function), or others).
* Vary gamma or C.

Output of svm_author_id.py:

linear kernel:

with entire data set:
training time: 163.169 s
0.984072810011
scoring time: 17.188 s

with smaller data set:
training time: 0.089 s
0.884527872582
scoring time: 0.961 s

rbf kernel:
training time: 0.101 s
0.616040955631
scoring time: 1.101 s

C=10.0
training time: 0.102 s
0.616040955631
scoring time: 1.101 s

C=100.0
training time: 0.101 s
0.616040955631
scoring time: 1.101 s

C=1000.0
training time: 0.097 s
0.821387940842
scoring time: 1.052 s

C=10000.0
training time: 0.096 s
0.892491467577
scoring time: 0.886 s

('C = ', 10000.0)
training time: 0.099 s
0.892491467577
scoring time: 0.885 s

for x in dir(clf):
    print(x, getattr(clf, x))

complete dataset:
('C = ', 10000.0)
training time: 109.979 s
0.990898748578
scoring time: 11.163 s

sklearn rbf parameters
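The C sweep above can be reproduced with a loop like the following. The email-author features are not included here, so this sketch uses synthetic data as a stand-in; the exact accuracies will differ from the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the email author features (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Sweep the rbf kernel's C parameter, as in the experiments above.
scores = {}
for C in [10.0, 100.0, 1000.0, 10000.0]:
    clf = SVC(kernel="rbf", C=C)
    clf.fit(X_train, y_train)
    scores[C] = clf.score(X_test, y_test)
    print("C =", C, "accuracy:", scores[C])
```

Larger C penalizes misclassified training points more heavily, which is why accuracy climbed as C grew in the runs above (at the risk of overfitting).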
Classifier             training time   accuracy   scoring time
Naive Bayes            0.001 s         0.884      0.01 s
SVM (C = 10000.0)      0.04 s          0.932      0.002 s
Decision tree          0.001 s         0.912      0.001 s
K nearest neighbors    0.001 s         0.936      0.001 s
Random forest          0.038 s         0.912      0.003 s
AdaBoost               0.127 s         0.924      0.006 s
More data will usually improve accuracy more than fine-tuning the algorithm will.
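One way to see this is to hold the algorithm fixed and grow the training set. This sketch uses synthetic data, so the exact numbers are illustrative only; the general trend is that accuracy climbs as the training slice grows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for a real learning problem (assumption).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Train the same model on growing slices of the training set.
accuracies = []
for n in [50, 200, 1000]:
    clf = GaussianNB().fit(X_train[:n], y_train[:n])
    accuracies.append(clf.score(X_test, y_test))
    print(n, "training points, accuracy:", accuracies[-1])
```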
Regression (Ordinary Least Squares, OLS)
Find the best-fit line.
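OLS in sklearn follows the same fit/predict pattern as the classifiers above. A small sketch on made-up points lying near the line y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points scattered around y = 2x + 1 (illustrative data).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Fit the ordinary-least-squares line; coef_ is the slope, intercept_ the offset.
reg = LinearRegression()
reg.fit(X, y)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
```

The fitted slope and intercept come out close to the true 2 and 1 used to generate the points.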
===================================================
Faces recognition example using eigenfaces and SVMs
===================================================

The dataset used in this example is a preprocessed excerpt of the "Labeled Faces in the Wild", aka LFW_:

http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)

.. _LFW: http://vis-www.cs.umass.edu/lfw/

original source: http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html

2016-12-04 16:59:05,287 Loading LFW people faces from /home/accts/sbs5/scikit_learn_data/lfw_home
Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7
Extracting the top 150 eigenfaces from 966 faces
done in 1.002s
Projecting the input data on the eigenfaces orthonormal basis
done in 0.089s
Fitting the classifier to the training set
done in 20.811s
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
Predicting the people names on the testing set
done in 0.054s

                   precision    recall  f1-score   support

     Ariel Sharon       0.55      0.46      0.50        13
     Colin Powell       0.78      0.82      0.80        60
  Donald Rumsfeld       0.67      0.67      0.67        27
    George W Bush       0.87      0.92      0.90       146
Gerhard Schroeder       0.86      0.76      0.81        25
      Hugo Chavez       0.83      0.67      0.74        15
       Tony Blair       0.84      0.75      0.79        36

      avg / total       0.82      0.82      0.82       322

[[  6   3   2   1   0   1   0]
 [  1  49   2   8   0   0   0]
 [  3   1  18   4   0   0   1]
 [  1   3   4 135   0   0   3]
 [  0   3   0   1  19   1   1]
 [  0   2   0   1   2  10   0]
 [  0   2   1   5   1   0  27]]

Checking the accuracy.
N_components   F1 score
10             .1212
15             .0869
25             .5882
50             .7199
100            .6666
150            .4999
250            .5599

More principal components is not always better: the model can overfit. See also: precision, recall, and the F1 score.
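The F1 score is the harmonic mean of precision and recall, F1 = 2pr / (p + r). A quick check on toy labels (made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy binary labels, chosen so precision = recall = 0.75.
y_true = [1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1]

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)

# f1_score agrees with the harmonic-mean formula.
assert abs(f1 - 2 * p * r / (p + r)) < 1e-9
```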
(from above)

[[  6   3   2   1   0   1   0]
 [  1  49   2   8   0   0   0]
 [  3   1  18   4   0   0   1]
 [  1   3   4 135   0   0   3]
 [  0   3   0   1  19   1   1]
 [  0   2   0   1   2  10   0]
 [  0   2   1   5   1   0  27]]
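The per-class precision and recall in the report can be recovered from this confusion matrix, whose rows are true classes and columns are predicted classes:

```python
import numpy as np

# The confusion matrix from the eigenfaces run above.
cm = np.array([
    [6,   3,  2,   1,  0,  0 + 1,  0],
    [1,  49,  2,   8,  0,  0,  0],
    [3,   1, 18,   4,  0,  0,  1],
    [1,   3,  4, 135,  0,  0,  3],
    [0,   3,  0,   1, 19,  1,  1],
    [0,   2,  0,   1,  2, 10,  0],
    [0,   2,  1,   5,  1,  0, 27],
])

# Recall per class: correct predictions over all true examples (row sums).
recall = np.diag(cm) / cm.sum(axis=1)
# Precision per class: correct predictions over all predicted examples (column sums).
precision = np.diag(cm) / cm.sum(axis=0)

# Ariel Sharon (row/column 0): recall 6/13 = 0.46, precision 6/11 = 0.55,
# matching the classification report above.
print(recall[0], precision[0])
```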