Computer Science 463b/563b Lecture Log, Spring 2009


4/13/09 Lecture 33. Markov chains and Hidden Markov Models, continued.

4/10/09 Lecture 32. Markov chains and Hidden Markov Models, continued.

4/8/09 Lecture 31. Learning graph structure: 5 round randomized algorithm concluded; started Markov chains and Hidden Markov Models.

(Reading: pages 216-221 of "Learning a hidden graph using O(log n) queries per edge" by Angluin and Chen, distributed in class.)

4/6/09 Lecture 30. Queries: learning graph structure, 5 round randomized algorithm, continued.

4/3/09 Lecture 29. Queries: learning graph structure, adaptive algorithm and start of 5-round randomized algorithm.

Last time we saw an adaptive algorithm that uses at most 4m(log n) edge-detecting queries to learn a bipartite graph on vertices (V_1,V_2). The next step is to generalize this to learn an arbitrary graph in O(m log n) queries. If we divide the vertices V arbitrarily into two halves V_1 and V_2, we can assume that we have recursively found all the edges E_1 with both endpoints in V_1 and all the edges E_2 with both endpoints in V_2, leaving just the edges with one endpoint in V_1 and one endpoint in V_2 to be found. To reduce this situation to that of a bipartite graph, we color the vertices of V_1 in such a way that no edge of E_1 has both endpoints the same color, and similarly for the vertices of V_2 and the edges E_2. Then if we consider all the vertices in V_1 of one color, there is no edge of E_1 between them, and similarly for a color class in V_2. Thus, to find the edges between V_1 and V_2 it suffices to use the bipartite graph algorithm on each pair (L,R), where L is a color class in V_1 and R is a color class in V_2. To keep the number of such pairs down, we'd like to color the graphs using *few* colors -- but the problem of coloring a graph with the minimum number of colors is NP-hard!

However, for our problem, the following heuristic gives a sufficiently good bound. Given a graph G, while there exist vertices u and v not joined by an edge, we collapse u and v (removing parallel edges), until we arrive at a clique (a graph with an edge between every pair of vertices). Note that if we collapse two vertices u and v of G not joined by an edge to form G' and we find a coloring of G', it can be "lifted" to a coloring of G because u and v may be the same color. It is easy to color the final clique: one color for each vertex. But this clique has at most m edges, and therefore at most 2*sqrt(m)+1 vertices. Thus, the total number of color classes for (V_1,E_1) is O(sqrt(|E_1|)) and the total number of color classes for (V_2,E_2) is O(sqrt(|E_2|)). This allows a straightforward inductive proof that 12m(log n) edge-detecting queries suffice for this adaptive algorithm to learn an arbitrary graph with n vertices and m edges. (See handout for proof.)
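The collapsing heuristic can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name collapse_coloring and the representation of edges as pairs are not from the handout, the graph is assumed to have no self-loops, and no attempt is made to be efficient.

```python
from itertools import combinations

def collapse_coloring(vertices, edges):
    """Color a graph by repeatedly merging non-adjacent (super-)vertices."""
    groups = [{v} for v in vertices]          # each group ends up as one color class
    edge_set = {frozenset(e) for e in edges}

    def adjacent(g1, g2):
        # Two groups are adjacent if some original edge runs between them.
        return any(frozenset((u, v)) in edge_set for u in g1 for v in g2)

    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            if not adjacent(groups[i], groups[j]):
                groups[i] |= groups[j]        # collapse the two super-vertices
                del groups[j]
                merged = True
                break
    # The remaining groups are pairwise adjacent (a clique): one color per group.
    return {v: color for color, g in enumerate(groups) for v in g}
```

Each group stays an independent set because two groups are merged only when no edge joins them, which is exactly why the coloring of the final clique lifts back to the original graph.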

Next we see a randomized algorithm using edge-detecting queries in which the queries can be made in 5 parallel "rounds", where the queries asked in a round may depend on the answers to queries in previous rounds. To warm up, we consider a query of the following kind: for a fixed probability p, we construct a set of vertices S_p by independently choosing to include each vertex with probability p. What can we say about Q(S_p)? Each edge (u,v) of G has both endpoints in S_p with probability p^2, so the expected number of edges with both endpoints in S_p is mp^2. To make our query as informative as possible, we'd like to arrange it so that the probability that Q(S_p) = 1 is 1/2. If we choose p = 1/sqrt(2m), then at least we can guarantee (using the union bound) that the probability that Q(S_p) = 1 is *at most* 1/2.
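Written out, the union bound step is the following (E denotes the edge set of G, with |E| = m; this is just the calculation in the paragraph above in LaTeX form):

```latex
\Pr[Q(S_p) = 1]
  = \Pr\bigl[\exists\, (u,v) \in E : u \in S_p \wedge v \in S_p\bigr]
  \le \sum_{(u,v) \in E} \Pr[u \in S_p \wedge v \in S_p]
  = m p^2,
\qquad
p = \frac{1}{\sqrt{2m}} \;\Longrightarrow\; m p^2 = \frac{1}{2}.
```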

4/1/09 Lecture 28. Queries: learning graph structure, adaptive algorithm.

Introduction: the heavy coin problem revisited. Suppose we have n identical looking coins, of which (n - 1) weigh one ounce and one weighs a little more than one ounce. We want to find the heavy coin, we have a scale that will tell us very accurately the weight of any collection of the coins, and we want to minimize the number of weighings we do. As we saw in the introduction to the course, binary search can be used to find the heavy coin in log n weighings: divide the coins into two equal groups and weigh one of them -- either that group or the other contains the heavy coin -- continue with the group that contains the heavy coin until it is isolated. This is an *adaptive* algorithm -- which coins we weigh next depends on the results of previous weighings.

There is also a *nonadaptive* algorithm using just log n weighings. A nonadaptive algorithm determines all the groups of coins that will be weighed before the first weighing is done. Given the results of weighing all the specified groups, the algorithm determines the heavy coin. This algorithm may be understood as first numbering each coin using a length log n sequence of bits, e.g., for 8 coins the numbers are 000, 001, 010, 011, 100, 101, 110, 111. Then the prescribed groups to weigh are (1) those with first bit = 0, (2) those with second bit = 0, ..., (k) those with kth bit = 0, where k is log_2 n, rounded up to an integer. Using the example of 8 coins, we would specify 3 weighings: {000, 001, 010, 011}, {000, 001, 100, 101} and {000, 010, 100, 110}. Each weighing tells us whether or not the heavy coin is in the group, which determines the bits of the identifier of the heavy coin. For example, if the heavy coin was not in the first or second group, but in the third, we would know that its identifier was 110.
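A small Python sketch of the nonadaptive scheme follows. The names weighing_groups and find_heavy are ours, coins are numbered 0 to n-1, and the scale is modelled (as an assumption) by a function weigh(group) that returns True exactly when the group contains the heavy coin.

```python
def weighing_groups(n):
    """Group j holds the coins whose j-th bit (most significant first) is 0."""
    k = max(1, (n - 1).bit_length())          # ceil(log2 n) bits per coin label
    return [[c for c in range(n) if ((c >> (k - 1 - j)) & 1) == 0]
            for j in range(k)]

def find_heavy(n, weigh):
    """weigh(group) -> True iff the group contains the heavy coin."""
    groups = weighing_groups(n)               # all groups fixed before any weighing
    answers = [weigh(g) for g in groups]
    coin = 0
    for in_group in answers:
        # In the group means the corresponding bit is 0, otherwise it is 1.
        coin = (coin << 1) | (0 if in_group else 1)
    return coin
```

For n = 8 the three groups are the ones listed above, and answers of (no, no, yes) decode to coin 110, that is, coin number 6.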

These problems are studied in "combinatorial group testing". If we define M(n,d) to be the minimum number of weighings to find exactly d "defective" elements from n (in our example, d = 1, the heavy coin), then we have seen that M(n,1) is the ceiling of log_2 n (a lower bound is not difficult to prove), but the exact values of M(n,d) for d > 1 are not known.

A graph learning problem of a similar character is the following. Suppose we know the n vertices of a graph G but do not know the edges. We want to find out the edges of G, and are able to gather information using "edge detecting queries". That is, we can specify a set S of the vertices, and the answer is 0 if no edge of G has both endpoints in S, and the answer is 1 if at least one edge of G has both endpoints in S. (We could also consider "edge counting queries" -- in this case, instead of 1 or 0, the answer returned is the number of edges of G that have both endpoints in S.) Clearly we could discover the edges of G by making a query for every pair of vertices, to learn whether there is an edge between them -- this takes Theta(n^2) queries. This type of query was motivated by problems in DNA sequencing, where the target graph was a path or a matching -- the number of edges is just Theta(n), and it is desirable to use fewer than Theta(n^2) queries to find them. We will next see an adaptive algorithm to learn the edges of an arbitrary graph with m edges using O(m log n) edge-detecting queries.

Suppose G has just one edge -- how would we go about finding it? We can try binary search: divide the set V of vertices into two equal groups V_1 and V_2, and query both of them: Q(V_1) and Q(V_2). If one of them, say V_1, is answered 1, we just continue with the set of vertices V_1. If both of them are answered 0, then the edge has one endpoint in V_1 and the other in V_2. One way to proceed would be to subdivide V_1 and V_2 into two equal groups each, say V_11, V_12 and V_21, V_22, and to test all 4 possible combinations consisting of a union of half of V_1 and half of V_2: Q(V_11 U V_21), Q(V_11 U V_22), Q(V_12 U V_21) and Q(V_12 U V_22). Then we can continue with whichever is answered 1. Thus, with at most 4 queries at each step we can cut the set of possible vertices in half, finding the target edge of G in at most 4(log n) edge-detecting queries.

In fact this same idea works if the target graph is known to be a bipartite graph on the vertex sets V_1 and V_2. (That is, the only edges of G join a vertex in V_1 with a vertex in V_2.) We divide V_1 and V_2 into halves, query all 4 possible combinations consisting of a union of half of V_1 and half of V_2, and continue with *all* the combinations that are answered 1. If we think of this algorithm as constructing a tree where the root is the pair (V_1,V_2) and each node contains a pair (V_1',V_2') such that Q(V_1' U V_2') = 1 and V_1' is a subset of V_1 and V_2' is a subset of V_2, then the depth of the tree is log_2 n and each leaf has a distinct edge of G, and we can charge 4 queries to each nonleaf node, for a total of at most 4m(log n) queries to find all m edges in the bipartite graph G.
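The bipartite procedure can be sketched as follows. The function name learn_bipartite and the convention that query(S) returns True for answer 1 are our own choices for illustration; the sketch assumes, as in the text, that G has no edges inside V_1 or inside V_2.

```python
def learn_bipartite(V1, V2, query):
    """Return all edges between V1 and V2 using edge-detecting queries.

    query(S) must return True iff some edge of G has both endpoints in S.
    """
    def halves(part):
        # A single vertex cannot be split further.
        mid = len(part) // 2
        return [part[:mid], part[mid:]] if len(part) > 1 else [part]

    edges = []
    # Every pair on the stack satisfies Q(A U B) = 1.
    stack = [(list(V1), list(V2))] if query(set(V1) | set(V2)) else []
    while stack:
        A, B = stack.pop()
        if len(A) == 1 and len(B) == 1:
            edges.append((A[0], B[0]))        # a leaf: one distinct edge of G
            continue
        for a in halves(A):
            for b in halves(B):
                if query(set(a) | set(b)):    # at most 4 queries per nonleaf node
                    stack.append((a, b))
    return edges
```

The stack plays the role of the tree in the argument above: each popped pair is a node, and only the combinations answered 1 become its children.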

How can we generalize this to non-bipartite graphs G? One idea is to divide the vertices V into halves V_1 and V_2, and recursively find all the edges of G with both endpoints in V_1, recursively find all the edges of G with both endpoints in V_2, leaving just the edges with one endpoint in V_1 and one endpoint in V_2 to be found -- something like our bipartite graph situation. In fact, if we had edge-counting queries, we'd be able to just subtract the edges we know about, and learn the rest. But with edge-detecting queries, we have to find another way to keep the edges we've already found from interfering with our further queries. (Which we do in the next lecture.)

3/30/09 Lecture 27. Queries: learning finite automata, continued.

Development of a polynomial time algorithm to learn deterministic finite automata using equivalence and membership queries. Definition and example of multiplicity automata.

3/27/09 Lecture 26. Queries: learning finite automata.

(Reading: Angluin, "Learning regular sets from queries and counterexamples" -- see [Papers].)

Background on deterministic finite state acceptors (dfas). Proof that there is no polynomial time algorithm to learn dfas using only membership queries, even given a bound n on the number of states of the machine. (Bad case: "password" automata.) Relation to regular grammars and nondeterministic finite state acceptors -- they define the same sets of strings, but there may be an exponential gap in conciseness of nfas or grammars over dfas. (Bad case: the set of strings of 0's and 1's with a 1 n symbols from the end.) There is no polynomial time algorithm to learn dfas using only equivalence queries (proof not given.)

3/25/09 Lecture 25. Queries: Equivalence queries, PAC learning, the Halving Algorithm, and Mistake Bounds.

How outrageous are EQ's? Imagine a scenario in which a domain expert classifies X-rays as containing or not containing tumors, and the goal is to learn the concept class of "X-rays containing tumors." In this context, MQ's seem reasonable, since the expert just has to classify particular examples. However, an EQ would amount to typing out the C (or Haskell?) program representing the learner's current concept and asking for a counterexample (that is, a misclassified X-ray.) This does not seem at all reasonable!

However, consider the PAC model augmented with MQ's. In this model, we have X, the domain of possible examples, C, the class of possible target concepts, a target concept c from C, and a probability distribution D over X. The learner has as input the usual parameters epsilon and delta, and access to two oracles, the usual EXAMPLES oracle, that draws an example x according to D and returns the pair (x, b), where b = c(x), and a membership query oracle MQ(x) that returns the value of c(x) for an example x of the learner's choosing. The learner is expected to output with probability at least (1 - delta) a concept h such that error(h) is less than epsilon, and to do so in time polynomial in 1/delta, 1/epsilon, n (the length of an example), and the size of the target concept. Considering our X-ray example, we can imagine that the EXAMPLES oracle is supplied by a large collection of labelled examples, while MQ's might be answered by a human domain expert.

The nice thing about a polynomial-time exact learning algorithm using MQ's and EQ's is that it can be transformed into a not-much-less efficient algorithm in this PAC model with MQ's. The idea is that when the exact learning algorithm makes an equivalence query, say EQ(h), we instead draw a large sample from the EXAMPLES oracle, and test to see whether h labels all the examples as in the sample. If not, then we have a counterexample x such that h(x) is not equal to c(x) to return as the answer to the EQ. If so, we output h and halt. The key is to choose the size of the sample used to test h in such a way that with probability at least (1 - delta), the error of h will be at most epsilon.

To do this, we have to budget our confidence delta over a sequence of tests, of h_1, h_2, ... for each of the Equivalence Queries EQ(h_i) our algorithm asks. One way to do this is to require confidence delta/2 for the test of h_1, delta/4 for the test of h_2, and so on, with confidence delta/2^i for the test of h_i, giving a total confidence of delta. Another choice would be a less rapidly converging infinite sum like 1/n^2, whose sum converges to (pi)^2/6, according to Euler: so we could budget confidence 6/(pi)^2 times delta/i^2 for the test of h_i. With this choice, if we use a sample of size O((1/epsilon)(ln i + ln(1/delta))) for the i-th EQ, then the overall probability of returning a hypothesis h with error greater than epsilon is bounded by delta. This means that the EQ and MQ framework can be used to develop efficient algorithms for the PAC and MQ framework.
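To make the sample size concrete: a hypothesis with error greater than epsilon agrees with m independent examples with probability at most (1 - epsilon)^m <= exp(-epsilon m), so it suffices to take m >= (1/epsilon) ln(1/delta_i), where delta_i is the confidence budgeted for the i-th test. The sketch below uses the 6/(pi)^2 times delta/i^2 budget; the function name eq_test_sample_size is ours, and the constants are one reasonable choice, not necessarily those used in lecture.

```python
import math

def eq_test_sample_size(i, epsilon, delta):
    """Number of examples to draw when simulating the i-th EQ (i = 1, 2, ...)."""
    # Confidence budget for the i-th test; these sum to at most delta over all i.
    delta_i = 6.0 * delta / (math.pi ** 2 * i ** 2)
    # Smallest m with exp(-epsilon * m) <= delta_i.
    return math.ceil(math.log(1.0 / delta_i) / epsilon)
```

Since ln(1/delta_i) = 2 ln i + ln(1/delta) + ln((pi)^2/6), this matches the O((1/epsilon)(ln i + ln(1/delta))) bound stated above.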

How reasonable are MQs? In some cases, not that reasonable. In an empirical test of training linear separators for handwritten digit recognition, people were paid to answer the MQs of a (theoretically) good learning algorithm. The problem is that to fine-tune its hypothesis, the algorithm asked MQs about examples it constructed, which tended to be things like a "blend" of a 2 and a 3, which people found difficult to answer (and did not answer consistently.) In this case, there is a natural distribution over the data and people are good at classifying the cases that occur with some reasonable probability, but not necessarily the low probability cases. A field called "active learning" addresses this problem: the setting is that the algorithm has access to a large collection of unlabeled data, and a smaller collection of labeled data (it is typically more expensive to get labeled data), and may request labels of *some* of the unlabeled data to help it refine its hypothesis. The goal then is to request just enough labels to get a good hypothesis. This avoids the problem of synthetic cases if the unlabeled data represents the natural data distribution.

The Halving Algorithm. Lower bounds on EQ's are representation-dependent. If C is finite and we place no restriction at all on the concept h queried in EQ(h) (except that it be some subset of X), then (log |C|) EQ's suffice, via the "Halving Algorithm." Given a finite set C' of concepts, we define the "majority vote" concept for C' as follows: h(x) = 1 if at least half the concepts c' in C' have c'(x) = 1, otherwise, h(x) = 0. Given a labelled sample S, we define VS(S), the version space of S, to be all those concepts c from C that are consistent with S. (See Mitchell's book for an expanded treatment of version spaces.) The Halving Algorithm works as follows: initially S is empty and VS(S) = C. Let h be the majority vote concept for VS(S) and query EQ(h). If the answer is "yes", then output h and halt. Otherwise, add the example x to S with the opposite of its label h(x), and repeat.

Consider a concrete example over X = {x1, x2, x3, x4} with C = {c1, c2, c3, c4, c5} where c1 = {x1, x4}, c2 = {x2, x3, x4}, c3 = {x4}, c4 = {x2, x3}, c5 = {x1, x3, x4}. We can picture this finite concept class as a table, as follows:

        x1    x2    x3    x4
    c1   1     0     0     1
    c2   0     1     1     1
    c3   0     0     0     1
    c4   0     1     1     0
    c5   1     0     1     1
Initially the set S is empty and the version space is all the concepts {c1, c2, c3, c4, c5}. The majority vote concept h_0 is obtained by taking the majority value in each column (preferring 1 if there is a tie):
        x1    x2    x3    x4
    c1   1     0     0     1
    c2   0     1     1     1
    c3   0     0     0     1
    c4   0     1     1     0
    c5   1     0     1     1
    h0   0     0     1     1
Note that h0 is not the same as any of the original concepts. Making an equivalence query with h0, EQ(h0), we receive some counterexample, say x3. Then, because h0(x3) = 1, we know that the correct value of the target concept on x3 is 0, so we can eliminate the concepts with value 1 on x3, namely c2, c4, and c5, reducing the version space to just c1 and c3:
        x1    x2    x3    x4
    c1   1     0     0     1
    c3   0     0     0     1
The new majority vote concept is h1:
        x1    x2    x3    x4
    c1   1     0     0     1
    c3   0     0     0     1
    h1   1     0     0     1
In this case, h1 is equal to c1. We make an equivalence query with h1, EQ(h1) and either receive "yes" (and output h1 and terminate) or a counterexample (which eliminates c1 and leaves c3 as the only remaining possibility.) In either case, after this query we are done.
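The example can be replayed with a short Python sketch of the Halving Algorithm. Concepts are represented as sets of points; the helper make_teacher is a hypothetical stand-in for the EQ oracle that returns the first point of disagreement (a real teacher may choose counterexamples adversarially).

```python
X = ["x1", "x2", "x3", "x4"]
C = {
    "c1": {"x1", "x4"},
    "c2": {"x2", "x3", "x4"},
    "c3": {"x4"},
    "c4": {"x2", "x3"},
    "c5": {"x1", "x3", "x4"},
}

def majority_vote(version_space):
    # h(x) = 1 if at least half of the remaining concepts contain x (ties go to 1).
    return {x for x in X
            if 2 * sum(x in C[c] for c in version_space) >= len(version_space)}

def halving(equivalence_query):
    """equivalence_query(h) returns None for "yes", otherwise a counterexample x."""
    version_space = set(C)
    num_queries = 0
    while True:
        h = majority_vote(version_space)
        num_queries += 1
        x = equivalence_query(h)
        if x is None:
            return h, num_queries
        correct_label = x not in h            # the target disagrees with h on x
        version_space = {c for c in version_space
                         if (x in C[c]) == correct_label}

def make_teacher(target):
    # Hypothetical EQ oracle: answers with the first point where h and the target differ.
    def equivalence_query(h):
        for x in X:
            if (x in h) != (x in C[target]):
                return x
        return None
    return equivalence_query
```

With target c3, this teacher returns x3 and then x1, and the third query is answered "yes", in line with the bound of (log |C|) counterexamples.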

It is not difficult to see that because each hypothesis h agrees with a majority of the remaining concepts on every point x, each counterexample x must cut the size of the version space at least in half. Thus the correct target concept must be tested after at most (log |C|) counterexamples. This is quite efficient in terms of the number of EQ's, but is not in general optimal (see Littlestone's description of the "Standard Optimal Algorithm"). What is wrong with the Halving Algorithm in a practical sense? The majority vote concept may be "too big" (no polynomial-size representation) or "too hard to find" (computationally). Even attempting to approximate it by random sampling may run into computational obstacles.

Relation of EQs to Littlestone's model of on-line prediction with worst-case mistake bounds. In the online prediction setting, we have a set X of possible examples, a concept class C, and a target concept c from C. The learner repeatedly receives an example x from X, makes a prediction (0 or 1) of the label of x, and receives the correct classification of x (that is, c(x)). The quantity of interest is the total number of "mistakes" made by the learner, where each time the learner predicts a value different from c(x) is a mistake. This is a worst case bound, over all possible sequences of examples x, and all possible concepts c from C.

A learning algorithm A that learns a class of concepts C using (only) EQs can be transformed into a mistake-bounded prediction algorithm for the class C, where the number of mistakes is bounded by the number of EQs. Similarly, a mistake-bounded algorithm A to learn the class C can be transformed into an EQ algorithm (using a different hypothesis class H) that makes at most one more EQ than the number of mistakes made by A. As an example, if C is finite, the Halving Algorithm can be implemented for prediction by keeping track of the current version space V: to predict the label of the next element x_i, compute the majority vote of the concepts in V on x_i and make that the prediction. When the correct label c(x_i) arrives, remove from V every concept that disagrees with it.

3/23/09 Lecture 24. Queries: learning monotone DNF formulas.

We now consider a different model of learning, in which there is a target concept c and a learner who can ask queries of a teacher about the concept c. The teacher answers the queries truthfully, and the learner's goal is to identify exactly the target concept c. The two main types of queries we consider are Equivalence Queries (EQs) and Membership Queries (MQs).

Given a set X of possible examples, a class C of concepts, a class H of hypothesis concepts, and a target concept c from C, we define two types of queries as follows. A membership query has input x and returns the label of x according to c, that is, MQ(x) = c(x). An equivalence query has input h from H and returns "yes" if h and c are equivalent, or "no" if they are not equivalent, together with an (adversarially chosen) counterexample, that is, an element x of X such that c(x) is not equal to h(x).

To get a sense of these queries, we look at an algorithm to learn monotone DNF formulas using EQ's and MQ's, using the example (x1x2 + x3). (This is essentially the algorithm in Valiant's paper "A theory of the learnable.")

Note that if we consider the lattice of all assignments of 0 and 1 to the variables, with 111 at the top and 000 at the bottom, and the ordering relation of being less than or equal to in every component, then the set of assignments that satisfy a monotone DNF formula is an upward closed set in the lattice, and the minimum points in this lattice correspond to the terms of the reduced monotone DNF formula. Initially we hypothesize the empty formula (everywhere false), and receive a counterexample, say 111 (which represents the assignment x1 = 1, x2 = 1, and x3 = 1). Using MQ's, we search for a minimum positive point: MQ(011) = 1, MQ(001) = 1, MQ(000) = 0. Since 001 is a minimum positive point, we add the term x3 to the hypothesis and query EQ(x3). We receive the counterexample 110, and query MQ(010) = 0 and MQ(100) = 0 and determine that 110 is a minimum positive point. We add the term x1x2 to the hypothesis and query EQ(x3 + x1x2), which is answered "yes."

The algorithm to learn monotone DNF formulas using EQs and MQs can be formalized as follows. Initially h = the empty disjunction, which is 0 on every x. Query EQ(h) -- if the answer is "yes", output h and halt. Otherwise let t be the term corresponding to the point Minimize(x), where x is the counterexample returned for h, add the term t to h, and repeat. One implementation of Minimize(x): for each 1 in x, let x' be the result of setting it to 0, and query MQ(x'). If the answer is 1, then return Minimize(x'), otherwise, go on to the next 1 in x. If all the 1's in x are tested without finding a positive point, then return x. Assuming the example x is positive, this procedure is guaranteed to find a minimum positive point of the target concept c below x.

This procedure makes at most as many EQ's as there are terms in the canonical form of the target concept, and for each EQ, it makes O(n^2) MQ's. Thus time and queries are clearly polynomial in n and the number of terms of the target concept. (We can reduce the number of MQ's per EQ to O(n) by observing that once we have tested a 1 and not found a positive point below it, we need not test it again during this call to Minimize.) Thus: monotone DNF formulas are exactly learnable in polynomial time with EQ's and MQ's.
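Here is a minimal Python sketch of the learner, using the O(n) version of Minimize in which each 1 is tested once. The representation (a term is a tuple of 0-based variable indices, an assignment is a tuple of bits) and the brute-force make_teacher oracle are our own assumptions for illustration; the oracle enumerates all 2^n assignments, so it is only usable for tiny n.

```python
from itertools import product

def evaluate(terms, x):
    """A monotone DNF is a list of terms; a term is a tuple of variable indices."""
    return any(all(x[i] == 1 for i in term) for term in terms)

def minimize(x, mq):
    """Walk down from a positive point x to a minimal positive point."""
    x = list(x)
    for i in range(len(x)):
        if x[i] == 1:
            x[i] = 0
            if not mq(tuple(x)):
                x[i] = 1          # this 1 is needed; by monotonicity, never retest it
    return tuple(x)

def learn_monotone_dnf(mq, eq):
    """eq(terms) returns None for "yes", otherwise a counterexample assignment."""
    terms = []                    # the empty disjunction is 0 everywhere
    while True:
        x = eq(terms)
        if x is None:
            return terms
        point = minimize(x, mq)   # x is positive because h never exceeds the target
        terms.append(tuple(i for i, bit in enumerate(point) if bit == 1))

def make_teacher(target_terms, n):
    # Hypothetical oracles for a known target over n variables.
    mq = lambda x: evaluate(target_terms, x)
    def eq(terms):
        for x in product([0, 1], repeat=n):
            if evaluate(terms, x) != evaluate(target_terms, x):
                return x
        return None
    return mq, eq
```

For the target x1x2 + x3, encoded as [(0, 1), (2,)], this teacher first returns 001 and then 110, so the learner adds the terms x3 and x1x2 in that order (a different order of counterexamples than in the lecture example, with the same final formula).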

If we have access only to MQ's (no EQ's), then an adversary argument shows that at least (2^n - 1) queries may be necessary in the worst case for a formula over 2n variables. Consider the target class of formulas of the form x1y1 + x2y2 + ... + xnyn + T, where T is a term consisting of a conjunction of n variables, where the i-th variable is one of xi or yi. Suppose an algorithm exactly learns every target concept in this class using just MQ's. Consider the following adversary: (1) if a query sets both xi and yi equal to 1 for some i, then answer 1, (2) if a query sets at most one of xi and yi to 1 for each i, but does not set at least n variables to 1, then answer 0, (3) if a query sets exactly one of xi or yi to 1 for all i, then answer 0 unless this would eliminate the very last concept in the target class. This strategy means that each MQ eliminates at most one target concept, which means that until at least (2^n - 1) MQ's have been made, there are at least two target concepts consistent with all the answers -- thus, until the algorithm has made this many queries, it cannot guarantee exact learning of the target concept.

If we have access only to EQ's (no MQ's), then it is also possible to prove that no polynomial number of EQ's with polynomial-sized monotone DNF formulas can exactly identify all the monotone DNF formulas. The argument for this is considerably more involved (Angluin, "Negative results for equivalence queries") and consists of demonstrating that each hypothesis formula has an "approximate fingerprint", that is, an assignment with relatively few 1's that satisfies the formula, or relatively few 0's that falsifies the formula. By making the "approximate fingerprint" the counterexample, the adversary guarantees that the fraction of possible target concepts that is eliminated is smaller than any 1/p(n). These two results show that neither EQ's nor MQ's can be dispensed with in the above polynomial time algorithm for monotone DNF formulas.

3/6/09 Lecture 23. Support Vector Machines (concluded); the Perceptron Algorithm.

(Reading: the excerpt from Alpaydin's text on soft margin SVMs.) One bound on the generalization error for SVMs is (expected # support vectors)/(total number of samples), assuming that the samples are drawn IID from a fixed unknown distribution. Draw m samples S and find the SVM hypothesis hyperplane h. Now suppose instead we had drawn the first (m-1) samples and constructed the hyperplane h' and then used h' to predict the label of the last example, x_m. If x_m is not a support vector for h, then h' and h will be the same hyperplane (because they are determined by their support vectors) and h' will correctly predict the label of x_m (because h correctly classifies x_m.) If instead, x_m is a support vector of h, then h' might be different from h and h' might not correctly predict the label of x_m. Thus, if we draw (m-1) samples and use the resulting SVM hyperplane h' to predict the label of the next example x_m, then the probability of an error of prediction is at most the probability that x_m is a support vector in the set of the m samples. Because under the IID assumption, every permutation of the first m samples is equally likely, the error of prediction is at most the expected number of support vectors in m samples, divided by m. This bound suggests that if the support vectors are a small fraction of the examples in the data, generalization will be good.

What if the data (even with a good kernel) turn out to be not linearly separable? How can we use SVMs in this case? We use a standard approach in optimization: we introduce "slack variables" (denoted s_i in these notes), which are assigned 0 if the data point x_i is on the correct side of the hyperplane and outside of the margin region, and otherwise are assigned the distance x_i would have to be moved to put it outside the margin region on the correct side of the hyperplane. The "soft error" is the sum of s_i over all i. The constraints become y_i(<x_i,beta>+beta_0) >= (1 - s_i). Note that if (0 < s_i <= 1), then x_i is still correctly classified, but in the margin area. Thus, if x_i is incorrectly classified, (s_i > 1). Hence the soft error is an upper bound on the number of misclassified points. In order to discourage the optimization from creating a large soft error, it is added (with some constant multiplier K) to the original objective function to be minimized. The constant K is a penalty factor, which (roughly) trades off the number of support vectors (as measured by ||beta||) against the number of non-separated data points (as measured by the soft error.) Once the new problem is transformed to the dual problem, the result is almost the same: to maximize the objective L_D = T_1 - T_2, where T_1 = \sum_{i=1}^N alpha_i and T_2 = (1/2)\sum_{i=1}^N\sum_{k=1}^N alpha_i alpha_k y_i y_k <x_i,x_k>, subject to the constraints (\sum_{i=1}^N alpha_i y_i = 0) and (K >= alpha_i >= 0). Note that the only difference from the separable case is that the constant K is an upper bound in the constraints on the alpha_i variables.
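For reference, the soft margin primal problem described in words above can be written as follows. We assume here that the "original objective function" is the usual (1/2)||beta||^2 of the separable case and that slack variables are constrained to be nonnegative; neither is spelled out in these notes.

```latex
\min_{\beta,\,\beta_0,\,s}\;\; \frac{1}{2}\,\|\beta\|^2 + K \sum_{i=1}^{N} s_i
\qquad \text{subject to} \qquad
y_i\bigl(\langle x_i, \beta\rangle + \beta_0\bigr) \ge 1 - s_i,
\quad s_i \ge 0,
\quad i = 1, \dots, N.
```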

(These notes are largely drawn from 2005 lecture notes.) The Perceptron training algorithm solves the following problem. We are given a training set {(x_1, y_1), ..., (x_m,y_m)}, where each x_i is a point (equivalently, vector) in R^d with length equal to 1, and each y_i is a label from {+1, -1} and there exists a vector w* of length 1 such that y_i = sign(w*x_i) for i=1,...,m. That is, we assume that the points labeled +1 are linearly separable from the points labeled -1. (If this assumption is not satisfied, the learning algorithm does not converge.) The goal is to find a vector w that linearly separates the +1 instances from the -1 instances. Recall that we discussed this problem before, and noted that it could be solved using linear programming, by treating the entries of w as the unknowns, and deriving the linear inequality constraints from the labeled examples, one constraint per example. Support vector machines give another approach to this problem. The Perceptron training algorithm is a simple iterative updating method to solve this problem.

We phrase the Perceptron training algorithm in the on-line mistake bound model, assuming that we repeatedly cycle through the examples in the training set until the algorithm stops making mistakes. Initially the hypothesis vector w is 0. On input x_i, the algorithm predicts the sign of wx_i, (that is, calculates the dot product of w and x_i, and predicts +1 if it is positive, and -1 otherwise.) If the prediction is incorrect (that is, not equal to y_i), then the hypothesis w is updated by adding y_ix_i (that is, adding x_i if y_i = +1 and subtracting x_i if y_i = -1.) We did a 2-dimensional example in lecture, and noted that the vector w seemed to grow longer and longer, while at the same time the angle between w and w* seemed to get closer and closer to 0.

Why should this work? If we look at w*w, the dot product of the "correct" separator w* and the current hypothesis w, we see that it is equal to |w|(cos A), where |w| is the length of w and A is the angle between w and w*. How much can |w| increase with each mistake? Look at |w|^2, which is the dot product of w and w. Let w be the current hypothesis and w' the result of updating w after a mistake. After a mistake, we have w' = w + y_ix_i, so |w'|^2 = (w + y_ix_i)(w + y_ix_i) = |w|^2 + 2y_iwx_i + |x_i|^2. (This follows from the linearity of the dot product.) Now we can use the facts (1) the length of each x_i is 1, that is, |x_i|^2 = 1, (2) we know that wx_i has a different sign from y_i, because a mistake was made on this example, so the term 2y_iwx_i is nonpositive. Hence, |w'|^2 is at most |w|^2 + 1. This limits the growth rate of |w| -- after M mistakes, it can be at most sqrt(M).

To show that we make progress on reducing the angle A between w and w*, we make an assumption: that the margin of the training set is s > 0. The margin of an example (x_i,y_i) is the distance of x_i from the hyperplane defined by w*, which is the length of the projection of x_i onto w*, that is, |w*x_i|, that is, the absolute value of the dot product of w* and x_i. (Review dot product and projections, recalling that w* has length 1.) Taking s to be the minimum of the margins of the examples (x_i,y_i), we assume s > 0. Now we claim that w*w (the dot product of w* and w) increases by *at least* s with every mistake. Letting w' = w + y_ix_i be the updated value after a mistake, we have w*w' = w*(w + y_ix_i) = w*w +y_iw*x_i = w*w + |w*x_i|. (Note that y_i and w*x_i must have the same sign since w* classifies this example correctly, so y_iw*x_i must be the absolute value of w*x_i.) By assumption, |w*x_i| is at least s > 0, so w*w' is at least w*w + s. Hence, after M mistakes, w*w is at least Ms.

Putting these two bounds together, after M mistakes, |w| is at most sqrt(M), and w*w is at least Ms and at most |w|, so Ms <= sqrt(M), which implies that M <= 1/s^2. The total number of mistakes before the Perceptron training algorithm finds a hypothesis w consistent with the sample is bounded by the inverse of the square of the margin.
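The training loop and the mistake count M can be sketched directly. The function name perceptron and the max_passes safeguard are our own additions (the safeguard only matters if the data are not separable); vectors are plain tuples of floats of length 1.

```python
def perceptron(examples, max_passes=1000):
    """examples: list of (x, y), x a unit-length tuple, y in {+1, -1}.

    Returns the final hypothesis w and the total number of mistakes M.
    """
    d = len(examples[0][0])
    w = [0.0] * d                              # initial hypothesis is the zero vector
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, y in examples:
            dot = sum(wi * xi for wi, xi in zip(w, x))
            prediction = 1 if dot > 0 else -1
            if prediction != y:
                # Additive update: add x if y = +1, subtract x if y = -1.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:                         # a full cycle with no mistakes: stop
            break
    return w, mistakes
```

On a separable training set with margin s, the returned mistake count should never exceed 1/s^2, whatever the order of the examples.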

Some remarks. Note that the updating is reminiscent of Winnow, except that it is additive instead of multiplicative. Other modifications of the algorithm incorporate a *learning rate* between 0 and 1, to reduce the effect of any single training example on the hypothesis. The assumption about unit length vectors can be removed by normalizing each example (x_i, y_i) by dividing x_i by its length -- this does not affect the direction of the vector, only its length, and preserves separability. If the separator is of the form w*x >= b, then we can add a dimension to the examples, extending them by a constant coordinate = 1, where the learning algorithm will also learn the threshold (the coefficient of the new dimension.)

3/4/09 Lecture 22. Support Vector Machines.

Relevant reading is the excerpt from Russell and Norvig (Kernel Machines) distributed in class, and the tutorial on support vector machines "A Tutorial on Support Vector Machines for Pattern Recognition" by Christopher J. C. Burges. There are good treatments in Ethem Alpaydin's textbook "Introduction to Machine Learning" (on reserve in the EAS Library) and Hastie, Tibshirani and Friedman's text "The Elements of Statistical Learning" (which is the text for Stat 365.)

Support vector machines bring together two interesting ideas in a particular way, via "kernels." One is the idea of finding the "optimal separating hyperplane" (Vapnik, 1996) to separate two linearly separable classes of points in d-dimensional real space. The other is the idea of adding new (computed) features to our examples to simplify the form of the hypothesis in terms of the old and new features, in this case, to make the classes linearly separable (or more nearly so) in terms of the combined set of features. The role of the "kernel" is to limit the computational impact of the additional features.

If we have a linearly separable set of points classified as + and -, then we can set up a linear programming problem to find some separating hyperplane, but such a hyperplane is likely not to be at all unique. For example, in two dimensions we can consider the labeled points ((1,2),-), ((2,1),-), ((1,4),+), ((4,3),+) and ((5,5),+), which are clearly linearly separable, for example by the line x_2 = 2.5. However, this line doesn't seem to be a "good" separator -- a better choice might be a line that maximized the minimum distance from any data point to the line. This is the "optimal separating hyperplane", that is, the separating hyperplane that maximizes the minimum distance from any data point to the hyperplane. The points at the minimum distance are the "support vectors" -- note that if we perturb them (move them a little), the hyperplane in general may change, but if we perturb data points that are not support vectors, the optimal separating hyperplane does not change. The "margin" of the hyperplane is the minimum distance to any sample point.
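As a quick check of the example above, the following Python sketch computes the margin of a candidate line on the five labeled points. The function name margin and the second candidate line x_1 + 3x_2 = 10 are assumptions chosen by hand for illustration, not something taken from the reading.

```python
import math

# The five labeled points from the example above (+1 for "+", -1 for "-").
points = [((1, 2), -1), ((2, 1), -1), ((1, 4), +1), ((4, 3), +1), ((5, 5), +1)]

def margin(beta, beta_0, sample):
    """Minimum over the sample of y * (<x, beta> + beta_0) / ||beta||.
    A positive value means the line separates the sample."""
    norm = math.hypot(*beta)
    return min(y * (beta[0] * x[0] + beta[1] * x[1] + beta_0) / norm
               for x, y in sample)

print(margin((0, 1), -2.5, points))    # the line x_2 = 2.5
print(margin((1, 3), -10.0, points))   # hand-picked alternative: x_1 + 3*x_2 = 10
```

The line x_2 = 2.5 has margin 0.5, while the hand-picked alternative has margin 3/sqrt(10), about 0.95, with (1,2), (1,4) and (4,3) as the points at minimum distance.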

Suppose our labeled sample is (x_1, y_1), ..., (x_N, y_N), where the x_i's are vectors of d real numbers and each y_i is either +1 (for a positive example) or -1 (for a negative example.) Then a hyperplane is determined by a normal vector beta (a vector of d reals) and a displacement beta_0 (a single real number), and consists of all vectors x in R^d that satisfy <x,beta>+beta_0 = 0, where we are using <v,w> to denote the inner (dot) product of the vectors v and w. The optimization that will determine the parameters beta and beta_0 for the optimal separating hyperplane is to maximize C with respect to beta and beta_0, subject to ||beta|| = 1 and the linear constraints y_i(<x_i,beta>+beta_0) >= C for i = 1, 2, ..., N. Restricting the length of beta to be 1 means that we can interpret (<x_i,beta>+beta_0) as the signed distance of the data point x_i to the hyperplane determined by beta and beta_0 (positive on one side of the hyperplane and negative on the other), so that y_i(<x_i,beta>+beta_0) is the ordinary distance for a correctly classified point. The constraints require each data point to be a distance of at least C from the hyperplane we find, and we are attempting to maximize C. Alternatively, we can drop the unit-length requirement and rescale so that ||beta|| = 1/C; maximizing C then becomes minimizing ||beta||, which gives the equivalent problem of minimizing (1/2)||beta||^2 with respect to beta and beta_0, subject to the constraints y_i(<x_i,beta>+beta_0) >= 1 for i = 1, 2, ..., N.

This problem has a quadratic objective and linear inequality constraints, and can be solved by quadratic optimization. It is convenient to consider its dual, with dual variables alpha_i. The problem is to maximize the objective L_D = T_1 - T_2, where T_1 = \sum_{i=1}^N alpha_i and T_2 = (1/2)\sum_{i=1}^N\sum_{k=1}^N alpha_i alpha_k y_i y_k <x_i,x_k>, subject to the constraints (sum_i alpha_i y_i = 0) and alpha_i >= 0. This formulation has the key property that the sample points x_i and their labels ONLY enter via the pairwise inner products of x_i and x_k times the product of their labels -- once those values are known, the optimization can proceed without any further reference to the sample points or their labels. This formulation using the inner products is what allows this algorithm to use kernel functions to extend the feature space.

Once the alpha_i values are found, the normal vector beta can be computed as \sum_{i=1}^N alpha_i y_i x_i. In this solution, the non-zero alpha_i values correspond exactly to the support vectors -- the sample points at minimum distance to the optimal hyperplane. Thus, the normal vector beta is a linear combination of the (typically sparse) set of the support vectors. This (typically smaller) number of parameters is thought to be important to keeping support vector machines from overfitting the data, despite nominally using hypotheses with a very large number of features.

The example given in the Russell and Norvig handout considers sample points in two dimensions, where the underlying concept is the set of points (x_1,x_2) inside the circle of radius 1 centered at the origin. The positive and negative examples are not linearly separable, but if we introduce new features f_1(x_1,x_2) = x_1^2, f_2(x_1,x_2) = x_2^2, and f_3(x_1,x_2) = \sqrt{2}x_1x_2, then we can map each original sample point (x_1,x_2) to the point F(x_1,x_2) = (f_1(x_1,x_2),f_2(x_1,x_2),f_3(x_1,x_2)) in 3 dimensional Euclidean space. Now the positively and negatively labeled sample points are linearly separable (see picture in handout.) (In fact we don't actually need f_3 for this, but it works out well when we get to kernels.) The potential drawbacks of introducing a lot of new features are overfitting and computational expense. That is, when we solve the quadratic optimization problem to find alpha_i's, it seems that we must compute all the inner products of the potentially MUCH longer vectors F(x_i) and F(x_k). The computational problem is overcome by choosing our features in such a way that the inner product of F(x_i) and F(x_k) can be easily computed by a "kernel function" K(x_i,x_k). Conveniently, for the choice of features in the circle example, we have that the inner product of F(x_1,x_2) and F(x_1',x_2') is just the square of the inner product of (x_1,x_2) and (x_1',x_2'). The savings is not dramatic in going from 2 dimensions to 3, but in general we'd like to introduce a large number of new features.
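The identity between the inner product in feature space and the squared inner product in the original space can be checked numerically. The sketch below is only an illustration; the two test points are arbitrary, and the names F, K and dot are our own.

```python
import math

def F(x):
    """Feature map from the circle example: (x_1^2, x_2^2, sqrt(2)*x_1*x_2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def K(x, z):
    """Kernel for this feature map: the square of the original inner product."""
    return dot(x, z) ** 2

x, z = (0.3, -1.2), (2.0, 1.5)    # arbitrary illustrative points
print(dot(F(x), F(z)), K(x, z))   # the two values agree up to rounding
```

Computing K needs only the 2-dimensional inner product, so the 3-dimensional vectors F(x) never have to be built.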

How do we choose features, or, equivalently, a kernel function? There are some "standard" kernel functions, for example, the polynomial kernel of degree p: (1+<x_i,x_k>)^p, the radial basis kernel: exp(-||x_i - x_k||^2/c) and the neural network kernel: tanh(kappa_1<x_i,x_k> + kappa_2). There is also considerable effort devoted to developing new kernels appropriate to other types of objects, for example, strings and trees. Choosing a kernel seems generally to be a search among possible options guided by estimates of the generalization error.
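The three standard kernels are short enough to write out directly. In this sketch the default parameter values (p = 2, c = 1, kappa_1 = 1, kappa_2 = 0) and the helper gram_matrix are assumptions made for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, z, p=2):
    return (1 + dot(x, z)) ** p

def rbf_kernel(x, z, c=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / c)

def tanh_kernel(x, z, kappa_1=1.0, kappa_2=0.0):
    return math.tanh(kappa_1 * dot(x, z) + kappa_2)

def gram_matrix(kernel, xs):
    """All pairwise kernel values for the sample points xs."""
    return [[kernel(a, b) for b in xs] for a in xs]

xs = [(1, 2), (2, 1), (1, 4)]
for row in gram_matrix(polynomial_kernel, xs):
    print(row)
```

Together with the labels, the matrix of pairwise kernel values is all that the dual optimization needs from the sample, so swapping kernels amounts to swapping this one function.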

Next lecture we'll briefly look at how this approach is modified to deal with samples that are not linearly separable.

3/2/09 Lecture 21. Decision trees.

These notes are mostly from the 2005 lecture notes, and contain a few topics not in the lecture of (3/2/09). Relevant reading is Quinlan's paper "Induction of decision trees" in [Papers].

Problem: given a training set {(x_1, y_1), ..., (x_m, y_m)}, where each instance x_i is specified by its values on a collection of attributes, and each label y_i is a class label for the instance x_i, find a decision tree that tests attributes of, and assigns class labels to, arbitrary instances. An example decision tree for the weather data (handout on paper): root vertex tests attribute "outlook?" and branches to vertices 1 (value = sunny), 2 (value = overcast), and 3 (value = rainy). At vertex 1 the attribute "humidity" is tested and branches to vertices 4 (value = high) and 5 (value = normal). At vertex 3 the attribute "windy?" is tested, and branches to vertices 6 (value = true) and 7 (value = false). Vertices 4,5,2,6,7 are leaf vertices, which are assigned class labels of no, yes, yes, no, and yes, respectively. This decision tree correctly classifies all 14 of the elements of the training sample.

Note that we can extract rules from the tree, one rule for each path from the root to a leaf. For example, the path root->1->4 gives the rule "if (outlook=sunny) and (humidity=high) then (play=no)" and the path root->3->7 gives the rule "if (outlook=rainy) and (windy?=false) then (play=yes)." The number of rules obtained in this way is just the number of leaves in the decision tree. Viewing this slightly differently, there is a DNF formula for each of the classes. For example, for (play=no) we have the DNF formula ((outlook=sunny) and (humidity=high)) or ((outlook=rainy) and (windy?=true)), and for (play=yes) we have the DNF formula ((outlook=sunny) and (humidity=normal)) or (outlook=overcast) or ((outlook=rainy) and (windy?=false)).
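The example tree and its rules can be written down concretely. The nested-tuple representation and the function names classify and rules below are one possible encoding, chosen for this sketch rather than taken from the handout.

```python
# Each internal vertex is (attribute, {value: subtree}); each leaf is a class label.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy?", {True: "no", False: "yes"}),
})

def classify(node, instance):
    """Follow the attribute tests from the root until a leaf label is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

def rules(node, conditions=()):
    """Yield one (conditions, label) rule per root-to-leaf path."""
    if isinstance(node, str):
        yield conditions, node
        return
    attribute, branches = node
    for value, child in branches.items():
        yield from rules(child, conditions + ((attribute, value),))

for conditions, label in rules(tree):
    tests = " and ".join(f"({a}={v})" for a, v in conditions)
    print(f"if {tests} then (play={label})")
```

Collecting the rules whose label is "no" and joining them with "or" gives the DNF formula for (play=no), and likewise for (play=yes).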

Informally it is clear that we would like to build a "small" decision tree from the training data. Formally, this is related to the desire to keep the number of possible hypotheses (or the VC-dimension of the class of possible hypotheses) "small", which will allow the number of training examples to be "large enough" to guarantee good generalization performance. The optimization problem: find a "smallest" (in various senses) decision tree consistent with a given training sample, is NP-hard (Rivest and Hyafil showed this.)

However, decision trees are efficiently learnable in a stronger learning model (proved by Nader Bshouty), namely, improperly PAC-learnable with membership queries (or, more precisely, exactly learnable with membership queries and equivalence queries using depth 3 formulas as hypotheses.) One property that might make decision trees more tractable to learn than general DNF formulas is the following. In the 2-class case, both the concept (yes) and its complement (no) have "small" DNF representations (small in the sense of polynomial in the size of the decision tree itself.) For general DNF, a concept represented by a DNF formula may require an exponentially larger DNF formula to represent its complement.

Optimally small decision trees seem to be hard to come by, so we consider a widely used and studied heuristic that greedily attempts to construct small (shallow) decision trees, due to Ross Quinlan. The idea is to choose an attribute that (greedily) makes the "best split" for the root, and then proceed recursively with the subsets of the examples corresponding to the different values of that attribute. There are different proposals for a greedy "best split." We could consider the training error; Breiman advocated the Gini index for his CART (classification and regression trees) algorithm; Quinlan suggested an information theoretic criterion. We'll look at Quinlan's suggestion in some detail.

Excursion on information theory: suppose I am trying to transmit to you the result of n flips of a very biased coin (probability of heads = 0.9, probability of tails = 0.1.) I could do so by sending you n bits, one bit representing the result of each coin flip. Is this the best I can do? Suppose n is even and I decide to represent the 4 possibilities for two consecutive flips, which, with their probabilities of occurrence are: HH (0.81), HT (0.09), TH (0.09), TT (0.01). I could choose a prefix free code for these possibilities as follows: (0->HH), (10->HT), (110->TH), and (111->TT). Being prefix free, this code can be uniquely decoded, so that 11001010111 is parsed as 110,0,10,10,111, which represents the coin flip sequence: TH,HH,HT,HT,TT. With this representation, what is the expected number of bits I will have to send you to transmit n coin flips? It will be (n/2) times the expected code length for a single pair of flips, which is 1*(0.81) + 2*(0.09) + 3*(0.10) = 1.29. That is, each pair of flips can be transmitted using an expected 1.29 bits, so the n flips can be transmitted using an expected 0.645*n bits, instead of the "naive" n bits. Is this the best? Well, no. The best is given asymptotically by the entropy (or information) of the distribution (0.9, 0.1), which yields about 0.469 bits per flip. Define H(p) = (p log_2(1/p)) + ((1-p) log_2(1/(1-p))). This is the entropy of the distribution (p, 1-p). It tells us how many bits per flip we'd need to transmit a sequence of coin flips biased as (p, 1-p). It is (by convention) 0 when p = 0 or p = 1. It reaches a unique maximum of 1 at p = 1/2; if the coin is fair, we may as well use the naive method of one bit per flip. It is symmetric around p = 1/2. More generally, we can define the entropy of a distribution (p_1, p_2, ..., p_k) as sum(i=1,k)p_i*log_2(1/p_i). 
This is the expected length (in bits) of the optimal coding of a single message in a sequence generated by picking (independently) a message m_i to transmit with probability p_i. (Note that since log(1/p) = -log(p), this can be written in an equivalent form with minus signs and no division, but the minus signs are really confusing, since everything is nonnegative!) (See information theory (Shannon) and arithmetic coding to follow up on these topics.)
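The numbers in this excursion are easy to reproduce. The sketch below decodes the example bit string with the prefix-free code, and compares the expected code length per flip with the entropy; the function names are our own.

```python
import math

def entropy(probs):
    """Entropy in bits: sum of p * log2(1/p), with the p = 0 terms taken as 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Prefix-free code for pairs of flips, with the probability of each pair.
code = {"HH": "0", "HT": "10", "TH": "110", "TT": "111"}
prob = {"HH": 0.81, "HT": 0.09, "TH": 0.09, "TT": 0.01}

def decode(bits):
    """Greedy left-to-right decoding, which works because the code is prefix free."""
    inverse = {word: pair for pair, word in code.items()}
    pairs, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            pairs.append(inverse[current])
            current = ""
    return pairs

bits_per_pair = sum(prob[pair] * len(code[pair]) for pair in code)
print(decode("11001010111"))     # ['TH', 'HH', 'HT', 'HT', 'TT']
print(bits_per_pair / 2)         # about 0.645 bits per flip
print(entropy([0.9, 0.1]))       # about 0.469 bits per flip
```

Coding longer blocks of flips closes more of the gap between 0.645 and 0.469 bits per flip.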

Now we consider the application of information theory in Quinlan's ID3 algorithm for constructing decision trees. In the example of the weather data, the initial division of yes and no labels is 9 Y and 5 N. If we treat this as a distribution in which the probability of Y is 9/14 and the probability of N is 5/14, we can apply the reasoning above to say that the information required to specify the label of an instance randomly drawn from this distribution is H(9/14,5/14) bits, which is approximately 0.940 bits. If we consider the effect of placing the single attribute "outlook" at the root of our decision tree, then the result divides our original collection of examples into three parts: (2 Y, 3 N), (4 Y, 0 N), and (3 Y, 2 N). Again, looking at these as distributions, we can investigate the information required to specify an example's label in each of the three subpopulations: H(2/5,3/5), H(4/4,0/4), H(3/5,2/5), respectively, or approximately, 0.971, 0.0, and 0.971 bits. Assuming that instances are drawn from these populations with a probability equal to the empirical frequency, that is, 5/14 for the first node, 4/14 for the second node, and 5/14 for the third node, we can calculate the expected number of bits to specify an example's label as (5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971), or approximately 0.694 bits. Thus, by testing the "outlook" attribute, we have gone from needing 0.940 bits to needing 0.694 bits to specify the label of an instance, a *gain* of 0.246 bits. If we do a similar calculation for the "humidity" attribute, we divide the original population into two parts: (3 Y, 4 N), and (6 Y, 1 N), with entropies of 0.985 and 0.592 bits respectively, which when combined yield an expected number of bits of 0.788, for a gain of 0.152 bits, smaller than the gain for the "outlook" attribute. We do a similar calculation for each attribute and pick the one with the largest information gain for the root. 
Then we treat separately and recursively the training examples that reach each of the children of the root.
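The gain calculation for the root can be sketched in a few lines of Python. The function names are our own, and the sketch works directly from the (Y, N) counts quoted above rather than from the 14 individual examples.

```python
import math

def entropy(counts):
    """Entropy in bits of the empirical distribution given by class counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def information_gain(parent, children):
    """parent: class counts at the node; children: class counts in each child."""
    total = sum(parent)
    expected = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - expected

parent = (9, 5)                                   # (Y, N) counts at the root
splits = {
    "outlook": [(2, 3), (4, 0), (3, 2)],
    "humidity": [(3, 4), (6, 1)],
}
for attribute, children in splits.items():
    print(attribute, round(information_gain(parent, children), 3))
```

Carried out at full precision the gain for "outlook" comes to about 0.247; the figure 0.246 above results from subtracting the rounded values 0.940 and 0.694. The gain for "humidity" is about 0.152 either way.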

When does the recursion stop? Clearly it may stop when all the class labels of training examples associated with a node are equal, and the class label is the common value. It may also stop when all the instances in a node agree on all the attributes that may be tested (although the class labels may still disagree, in the case of noise and/or insufficiency of the attributes to determine the class label.) In the latter case, the recursion stops with a class label equal to the majority (or plurality) value of the instances in the node.

A variety of questions arise in applying decision trees. (1) What about overfitting? The recommended course is to build a tree that possibly overfits, and then to prune the tree using estimations of the effect on the generalization performance. This seems to work better than proactively trying to avoid building the parts that are later to be pruned. (2) What about missing values? It depends partly on whether the domain supports treating "missing" as another potentially informative value for the corresponding attribute, or whether it is more reasonably thought of as a kind of random noise. In the second case, one can introduce fractional instances corresponding to the overall fraction of values of the given attribute, and send appropriate fractions of instances along the corresponding value edges. (3) What about attributes that have numeric values? One approach is to introduce binary splits such as "is age > 46?" based on information-theoretic considerations.

One improvement introduced with Quinlan's C4.5 was to use *gain ratio* instead of information gain as the criterion by which to choose a root attribute. The problem is that the information gain overly favors attributes that have many values. As an extreme example, if the examples of the weather data have an "id" attribute that is unique to the example, then when we test the "id" attribute, we get a 14-way split into perfectly "pure" singleton nodes. Thus, the information gain in this case is the whole 0.940 bits that we started with (since 0 bits are required after testing the attribute.) The heuristic fix for this is to divide the information gain by the entropy associated with the distribution of examples into the children associated with the attribute's values. For example, considering the "outlook" attribute at the root, testing it produces the distribution (5/14,4/14,5/14), which has an entropy of about 1.58 bits. The gain ratio of "outlook" is its gain divided by this entropy, or about 0.156. For the "humidity" attribute at the root, the corresponding distribution is (7/14,7/14), which has an entropy of exactly 1 bit. Thus, the gain ratio of "humidity" is 0.152, making it a much closer second to "outlook" with respect to this measure.
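The gain ratio is a one-line extension of the gain computation. As before, this is a sketch working from class counts, with function names of our own choosing; the 14-way "id" split is the hypothetical extreme case described above.

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def gain_ratio(parent, children):
    """Information gain divided by the entropy of the split sizes."""
    total = sum(parent)
    expected = sum(sum(ch) / total * entropy(ch) for ch in children)
    gain = entropy(parent) - expected
    split_entropy = entropy([sum(ch) for ch in children])
    return gain / split_entropy

parent = (9, 5)                                           # (Y, N) counts at the root
print(gain_ratio(parent, [(2, 3), (4, 0), (3, 2)]))       # outlook, about 0.156
print(gain_ratio(parent, [(3, 4), (6, 1)]))               # humidity, about 0.152
print(gain_ratio(parent, [(1, 0)] * 9 + [(0, 1)] * 5))    # unique "id", about 0.247
```

Note that on this tiny data set the "id" attribute still scores about 0.940/log_2(14), or 0.247, which is higher than "outlook": dividing by the split entropy reduces the bias toward many-valued attributes but does not remove it in this extreme case.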

2/27/09 Lecture 20. Weighted majority; Blum's paper on Winnow and Weighted Majority for a calendar scheduling domain.

These notes are mostly drawn from the 2005 notes, and contain a few topics not covered in the lecture of (2/27/09).

In the Weighted Majority (WM) algorithm, we assume that there is a pool of experts, say A1, ..., An, each of which functions as an on-line prediction algorithm for instances x in X. The WM algorithm attempts to combine the predictions of all the experts in such a way that it makes not "too many" more mistakes on a given sequence of instances x1, x2, ... than the *best* (in hindsight) expert in the pool on this sequence. The WM algorithm maintains a weight wi for each expert Ai, all initially = 1. WM requests an instance x to predict, and gives the instance x to each expert Ai and receives its prediction. WM compares the total weight q1 of all experts predicting 1 for x to the total weight q0 of all experts predicting 0 for x. WM predicts 1 for x if q1 >= q0, and predicts 0 for x otherwise. WM then receives the correct label for x, which it passes along to the experts Ai. Whether or not WM made a mistake of prediction, it sets wi = (1/2)*wi for all those experts Ai that predicted incorrectly on x.

As an example, suppose the pool contains 3 experts, A1, A2, and A3, with initial weights (1,1,1). Suppose the first instance to predict is x1, and the predictions of A1, A2, and A3 are (1,1,0). Then WM predicts 1 (because the total weight of experts predicting 1 is 2 and the total weight of experts predicting 0 is 1.) Suppose the correct label for x1 is 0. Then WM passes along the label 0 for x1 to A1, A2, and A3, and also updates their weights to be (1/2,1/2,1). Suppose the second instance to predict is x2, and the predictions are (0,1,0). Then WM predicts 0 (total weight 3/2) and not 1 (total weight 1/2). If the correct label of x2 is 1, then WM passes along this label of x2 to A1, A2, and A3, and updates their weights to (1/4,1/2,1/2). Suppose the third instance to predict is x3, and the predictions are (0,1,1). Then WM predicts 1 (weight 1) and not 0 (weight 1/4). If the correct label is 1, then WM passes along the label for x3 and updates the weights to (1/8,1/2,1/2). Note that A2 and A3 have each made 1 mistake and have weight 1/2, while A1 has made 3 mistakes and has weight 1/8.
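The three rounds of this example can be replayed with a short Python sketch of WM. The function name and the representation of a round as (expert predictions, correct label) are our own; exact fractions are used so that the weights print as in the text.

```python
from fractions import Fraction

def weighted_majority(rounds, n_experts):
    """rounds: list of (expert_predictions, correct_label) pairs with 0/1 values.
    Returns WM's predictions and the final expert weights."""
    weights = [Fraction(1)] * n_experts
    wm_predictions = []
    for predictions, label in rounds:
        q1 = sum(w for w, p in zip(weights, predictions) if p == 1)
        q0 = sum(w for w, p in zip(weights, predictions) if p == 0)
        wm_predictions.append(1 if q1 >= q0 else 0)
        # Halve the weight of every expert that was wrong, whether or not WM was.
        weights = [w / 2 if p != label else w
                   for w, p in zip(weights, predictions)]
    return wm_predictions, weights

rounds = [((1, 1, 0), 0), ((0, 1, 0), 1), ((0, 1, 1), 1)]
predictions, weights = weighted_majority(rounds, 3)
print(predictions)                   # [1, 0, 1]
print([str(w) for w in weights])     # ['1/8', '1/2', '1/2']
```

WM itself is wrong on x1 and x2 and right on x3, so it makes 2 mistakes on this sequence.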

If we look at the number M of mistakes made by WM after a particular sequence of labelled instances, and the number mi of mistakes made by the i-th expert Ai on the same sequence, we see that wi, the weight of Ai, is (1/2)^(mi). Each mistake made by WM decreases the sum of the wi's by at least a quarter, that is, to at most 3/4 of its previous value (because at least half the weight predicted wrong, and half of that will be discarded). Therefore, after M mistakes, the sum of the wi's is at most n*(3/4)^M, because n is the initial sum of the weights. The sum of the weights is also at least the weight of any single expert Ai, namely (1/2)^(mi). Combining this lower bound with the upper bound, we must have (1/2)^(mi) <= n*(3/4)^M. Taking logarithms, M <= c*(log_2(n) + mi), where c = 1/(log_2(4/3)), which is bounded above by 2.41. Interpreting this, the total number, M, of mistakes made by WM on an arbitrary sequence of labelled examples is "nearly" as good as the number, mi, of mistakes made by the best expert (using hindsight to judge which would have been best.) "Nearly" means a constant factor, plus a term proportional to the log of the number of experts in the pool. (Note that the constant factor can be improved to be near 1, at the cost of increasing the constant multiplying the log term, by a strategy of randomized prediction (predict 1 with probability q1/(q0+q1) and 0 with probability q0/(q0+q1)) combined with a less drastic penalty than 1/2.)
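To get a feel for the size of this bound, here is a small Python helper that evaluates c*(log_2(n) + mi). The function name and the second set of numbers (66 experts, 10 mistakes by the best expert) are hypothetical values picked for illustration.

```python
import math

def wm_mistake_bound(n_experts, best_expert_mistakes):
    """Upper bound c * (log2(n) + m_i) on WM's mistakes, with c = 1 / log2(4/3)."""
    c = 1 / math.log2(4 / 3)
    return c * (math.log2(n_experts) + best_expert_mistakes)

print(1 / math.log2(4 / 3))        # the constant c, just under 2.41
print(wm_mistake_bound(3, 1))      # 3 experts, best expert makes 1 mistake
print(wm_mistake_bound(66, 10))    # hypothetical: 66 experts, best makes 10 mistakes
```

With 3 experts and a best expert that makes 1 mistake, the bound is a little over 6 mistakes for WM.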

In "Empirical support for Winnow and Weighted Majority algorithms: results on a calendar scheduling domain" by Avrim Blum, we get to see Weighted Majority and Winnow in action. Tom Mitchell's calendar scheduling apprentice (CAP) provided both the data and a prediction algorithm for comparison. The data consists of attributes describing a sequence of meetings for each of two users (Tom Mitchell and User2.) The attributes include such things as event-type, position-attendees, lunch-time?, location, and so on. CAP used selected attributes to try to predict 4 specific attributes: location, duration, start-time, and day-of-week. CAP used the previous 180 days' worth of data to build a decision tree, prune it, extract rules, and order the rules according to their empirical accuracy. It was run once a day, overnight, to develop rules to predict the next day's instances. Observing that CAP tended to produce rules with few attributes, Blum decided to test the performance of Weighted Majority and Winnow on the same problem.

For WM, Blum created an expert for each pair of attributes, for example, (event-type, position-attendees). Thus, for predicting location, which was based on 12 selected attributes, there would be (12 choose 2), or 66 experts. Each expert kept a history: for each pair of values that had occurred in the data for its two attributes, it kept the last five values of the attribute to be predicted. When the expert was asked to predict a new instance x, it took the actual values of its two attributes in x and looked at its history-of-five for that pair of values, and predicted the most frequent value for the attribute to predict among those five. If it had no history, it simply predicted a global default (the most frequently occurring value for the predicted attribute.) Using this pool of experts, WM operated as described above. The prediction is not binary, so WM used the prediction with the largest sum of weights of experts making that prediction.

For Winnow, although it does not seem useful to model the task as predicting a monotone disjunction, Blum describes an adaptation of the ideas to this task. The individual predictions for Winnow are made by "specialists" which may abstain from prediction. There is one specialist for every pair of attribute-value pairs that has occurred in the data. For example, for predicting location, one specialist would be the pair (event-type=meeting, position-attendees=faculty). This results in a very much larger number of individual prediction algorithms (59731 for the first task, using a larger feature set.) However, for each prediction, only (n choose 2) specialists "wake up" to make predictions, where n is the number of attributes being used for the prediction. Each specialist keeps a history of the last five times its pair of attribute-value pairs has occurred in instances, and the resulting label for the attribute it is to predict, and predicts the most frequently occurring label among the stored history. When a specialist is first created (the first time its pair of attribute-value pairs occurs in the data), it is given a weight of 1 and abstains from prediction. Winnow collects the predictions of the non-abstaining specialists, and makes the prediction whose corresponding weight is largest. If Winnow makes a mistake, it updates the weights of the specialists, multiplying weights by 1/2 for specialists who predicted incorrectly, and by 3/2 for specialists who predicted correctly.
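The specialist scheme described above can be sketched as follows. This is an illustration based only on the description in these notes, not Blum's actual implementation: the class name, the method names, the default prediction returned when every specialist abstains, the tie-breaking, and the attribute values in the usage example are all assumptions.

```python
from collections import Counter, defaultdict, deque
from itertools import combinations

class SpecialistVoter:
    """Weighted vote over specialists, one per pair of attribute-value pairs."""

    def __init__(self, history_size=5):
        self.weights = {}                                   # specialist -> weight
        self.history = defaultdict(lambda: deque(maxlen=history_size))

    def awake(self, instance):
        """The (n choose 2) specialists whose attribute-value pairs occur in instance."""
        return list(combinations(sorted(instance.items()), 2))

    def predict(self, instance, default=None):
        votes, totals = {}, defaultdict(float)
        for key in self.awake(instance):
            if self.history[key]:                           # new specialists abstain
                guess = Counter(self.history[key]).most_common(1)[0][0]
                votes[key] = guess
                totals[guess] += self.weights[key]
        prediction = max(totals, key=totals.get) if totals else default
        return prediction, votes

    def update(self, instance, votes, prediction, label):
        if prediction != label:                             # update only on a mistake
            for key, guess in votes.items():
                self.weights[key] *= 1.5 if guess == label else 0.5
        for key in self.awake(instance):
            self.weights.setdefault(key, 1.0)               # created with weight 1
            self.history[key].append(label)

voter = SpecialistVoter()
meeting = {"event-type": "meeting", "position-attendees": "faculty", "lunch-time?": "no"}
for label in ["room-A", "room-A", "room-B"]:                # hypothetical locations
    prediction, votes = voter.predict(meeting, default="unknown")
    print(prediction, "->", label)
    voter.update(meeting, votes, prediction, label)
```

In the usage example the first prediction falls back to the default because all three specialists are new, and the mistake in the third round halves the weight of each of them.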

How closely does the actual task reflect the theoretical models? For WM, the analysis does not assume any target concept, just an arbitrarily labelled sequence of instances. We can think of this data as an arbitrarily labelled sequence of instances, but the performance guarantee we have seen for WM compares WM with the performance of the best expert (in hindsight) for the sequence of data. This analysis does not really capture the fact that the data is made up of semester-long chunks for which different experts might be best. Thus, a more sophisticated analysis of how well WM tracks temporally varying (or "drifting" in the literature) concepts might be more appropriate here. For Winnow, the departure from the theoretical model is more striking: there doesn't seem to be any monotone disjunction in sight, and we are using a vote of the specialists instead of a threshold.

What do the empirical results say? The overall average accuracy in predicting all 4 of the attributes to be predicted is 53% for Mitchell's program CAP, 57% for WM, and 63% for Winnow on the larger of the two data sets. Blum also considered a version of Winnow modified to update its weights only at day boundaries (to make it more comparable to CAP's operation), which reduces Winnow's overall prediction accuracy to 59% on the larger data set. He also considered the effects of disabling weight update in Winnow, which reduces its overall prediction accuracy on the larger data set to 52%. He also considered the combination of both modifications, as well as the effects of using a larger set of attributes than the hand-selected set used by CAP. He also considered a version of Winnow that would only predict if the highest weighted outcome was at least some specified fraction of the total weight, allowing for a tradeoff of accuracy of prediction versus coverage (fraction of instances on which Winnow predicted rather than abstained.) He also studied the effects of various strategies for pruning low-weight experts in WM. In addition to these empirical comparisons, Blum gave a theoretical analysis of two pruning strategies for WM and a bound on mistakes for his modified version of Winnow.

2/25/09 Lecture 19. Winnow: agnostic and drifting.

Variants of Winnow can be shown to cope with errors and with "drifting" concepts. The term "agnostic learning" is used to describe a learning situation in which the target concept is not necessarily guaranteed to be in the concept class C, but (ideally) we'd like to find the concept h in C that has the smallest possible error rate with respect to the target concept. Thus, if the target concept happens to be a concept from C with some errors in the data, the best agnostic approximation will have an error rate bounded above by the data error rate. In general the problem of finding the best agnostic approximation from C for given data is computationally even harder than deciding whether any concept in C is consistent with the data, so we must generally settle for a good approximation rather than the best.

Winnow and agnostic learning of disjunctions of variables. Let x_1, x_2, ..., x_n be Boolean variables, and consider any sequence of examples consisting of Boolean n-vectors and labels from {0,1}. If c is any disjunction of variables, let m_c denote the number of mistakes made by c on the sequence of examples (that is, the number of times the prediction of c for an example disagrees with its actual label in the sequence.) Also, let A_c denote the number of "attribute errors" with respect to c, defined as follows: add 1 every time the example label is 1 but no variable of c is assigned 1, and add k every time the example label is 0 but exactly k variables of c are assigned 1. Clearly, if c is a disjunction of r variables, we have m_c <= A_c <= r m_c, because each mistake adds at least 1 and at most r to A_c. Theorem: For any sequence of labeled examples and any disjunction c of variables, the number of mistakes made by Winnow is O(A_c + r log n), where r is the number of variables in c. Because c is an arbitrary disjunction of variables and A_c is at most r m_c, Winnow is O(r) competitive with the best disjunction of variables for an arbitrary sequence of labeled examples. (We did not cover the proof of this theorem, which uses a potential function argument.)
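The definitions of m_c and A_c can be made concrete with a small Python sketch. The function name, the 0-based variable indices, and the four examples are hypothetical and serve only to illustrate the counting.

```python
def mistakes_and_attribute_errors(disjunction, examples):
    """disjunction: 0-based indices of the variables in c.
    examples: list of (bits, label) pairs. Returns (m_c, A_c)."""
    m_c = a_c = 0
    for bits, label in examples:
        k = sum(bits[i] for i in disjunction)    # variables of c assigned 1
        if label == 1 and k == 0:                # c predicts 0, label is 1
            m_c += 1
            a_c += 1
        elif label == 0 and k > 0:               # c predicts 1, label is 0
            m_c += 1
            a_c += k
    return m_c, a_c

c = [0, 2]                                       # the disjunction of two variables
examples = [
    ((1, 0, 1, 0), 0),    # both variables of c are 1: adds 2 to A_c
    ((0, 1, 0, 0), 1),    # no variable of c is 1: adds 1
    ((1, 0, 0, 1), 1),    # c is correct here
    ((0, 0, 1, 0), 0),    # one variable of c is 1: adds 1
]
print(mistakes_and_attribute_errors(c, examples))    # (3, 4)
```

Here r = 2, m_c = 3 and A_c = 4, which sits between m_c and r times m_c = 6 as expected.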

The next result concerns the ability of Winnow to track a "drifting concept." In our analyses so far, we have assumed that the target concept remains the same throughout the learning or online prediction process. However, we could also consider a situation in which the target concept "drifts", that is, changes slowly with time. To model such a drifting concept, we assume that the target concept is chosen and changed by an adversary, but that the adversary must pay a cost for each change to the concept. Our goal is then to bound the number of mistakes of prediction by some reasonable function of the adversary's costs to any point.

In particular, suppose the target concept is initially a disjunction of no variables (i.e., the constant false concept.) In each round, the adversary may change the concept by adding or deleting a variable from the concept (for a cost of 1 per variable added or deleted.) Then an example is given to the learner, who predicts its label, and receives the correct label according to the adversary's current concept. Then the process goes to the next round.

We consider a variant of Winnow, which we denote Winnow*, that only halves weights that are at least 1, but is otherwise the same as Winnow. Thus, every weight is 1/2 or more throughout the algorithm. Theorem: At any point in the sequence, Winnow* has made at most O(C_a log n) mistakes, where C_a is the adversary's cost to that point. The proof of this theorem uses a slightly more complex potential function argument. At a given point in the sequence, let M_1 denote the number of mistakes made by Winnow* on examples whose correct label is 1, and let M_0 denote the number of mistakes made by Winnow* on examples whose correct label is 0. First consider the sum of the weights w_i. Initially it is n. If a mistake increments M_1, the Winnow* prediction is 0, and as before at most n is added to the sum of the weights. If a mistake increments M_0, the Winnow* prediction is 1 because the sum of the weights of variables with x_i = 1 is at least n. At most n/2 of that weight can come from variables whose weights are already 1/2, so at least n/2 of the weight will be cut in half, reducing the sum of the weights by at least n/4. Analyzing the initial balance (n), deposits (of at most n) and withdrawals (of at least n/4) as before, we see that M_0 <= 4(1 + M_1) and the total number of mistakes is M_0 + M_1 <= 4 + 5M_1.

To complete the proof, we need to bound M_1. Let R denote the set of variables currently in the adversary's concept, and let r denote the number of them. We define a potential function Phi = r log(2n) - sum_{i in R} (log w_i), and look at how it changes when the adversary adds or deletes a variable, and when Winnow* makes a mistake. Note that Phi is initially 0 (because R is empty and r = 0 initially) and is always nonnegative (because no w_i can exceed 2n). When the adversary removes a variable from R, Phi decreases by log(2n) and increases by log(w_i), and therefore does NOT increase. When the adversary adds a variable to R, Phi increases by at most (1 + log(2n)), because log(2n) is added to Phi, and if w_i = 1/2, then subtracting log(w_i) from Phi adds 1 to it. If Winnow* makes a mistake that increments M_0, then the true label of the example is 0 and every variable currently in R is 0 in the example, so none of their weights change and Phi is unchanged. Finally, if Winnow* makes a mistake that increments M_1, then at least one element of R has its weight doubled, and Phi decreases by at least 1. Thus, M_1 is at most (1 + log(2n)) times the number of times the adversary has added a variable to the target, which implies M_1 <= (1 + log(2n))C_a. Combining this with the bound above gives a bound on the total number of mistakes of (M_0 + M_1) <= 4 + 5(1 + log(2n))C_a, which is O(C_a log n), as claimed.

2/23/09 Lecture 18. On-line prediction, Winnow.

(Reading: Nick Littlestone, Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm, Machine Learning, Volume 2, Issue 4 (April 1988), 285-318.)

In the on-line prediction setting, there is a class X of instances, a class C of concepts, and a target concept c. The learner repeatedly requests an instance x, predicts the label of x, and then receives the correct label, c(x). The learner makes a "mistake" when its prediction of the label of x is not equal to the correct label. We'd like to bound the total number of mistakes made by the learner, for any concept and *any* sequence of instances. Note that for this analysis, we do not make any probabilistic or other assumption on how the sequence of examples was generated.

Nick Littlestone introduced the Winnow algorithm, which makes few mistakes when predicting a concept that is a disjunction ("or") of variables, and can cope gracefully with both errors and "drifting" concepts. We consider a simple form of the algorithm to illustrate the ideas. Let x_1, x_2, ..., x_n be Boolean variables. The algorithm maintains a weight w_i for each variable x_i. The weights are all initialized to 1. Given an input vector (b_1, b_2, ..., b_n) from {0,1}^n, the algorithm predicts 1 if sum_i w_ib_i >= n, and predicts 0 otherwise. The algorithm then receives the correct answer for the input, and, if it made a mistake, it updates the weights as follows. If the algorithm predicted 0 on an example with correct label 1, then w_i is doubled for every i for which b_i = 1. If the algorithm predicted 1 on an example with correct label 0, then w_i is halved for every i for which b_i = 1. The algorithm then reads in the next example to be predicted.

In illustration, we consider the evolution of weights and predictions for the following sequence of examples.

Examples           w_1  w_2  w_3  w_4  w_5        actual labels

                     1    1    1    1    1
(0 1 1 1 1)        sum_i w_ib_i = 4:  predict 0          1

                     1    2    2    2    2
(0 0 1 1 0)        sum_i w_ib_i = 4:  predict 0          1

                     1    2    4    4    2
(1 0 0 0 1)        sum_i w_ib_i = 3:  predict 0          1

                     2    2    4    4    4
(0 1 0 1 0)        sum_i w_ib_i = 6:  predict 1          0

                     2    1    4    2    4
(1 0 0 1 0)        sum_i w_ib_i = 4:  predict 0          1

                     4    1    4    4    4
(0 1 0 1 0)        sum_i w_ib_i = 5:  predict 1          0

                     4    1/2  4    2    4
  . . .              
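
The trace above can be reproduced with a short program. The following is a minimal Python sketch of the simple Winnow variant described in this lecture (weights start at 1, threshold n, doubling and halving on mistakes). The function name `winnow` and the format of the returned trace are illustrative choices, not part of Littlestone's presentation.

```python
def winnow(examples, n):
    """Run the simple Winnow variant from these notes.

    examples: list of (bits, label) pairs, where bits is a tuple of n 0/1 values.
    Returns (final weights, list of (weighted sum, prediction, label)).
    """
    w = [1.0] * n                       # all weights start at 1
    trace = []
    for bits, label in examples:
        total = sum(wi * bi for wi, bi in zip(w, bits))
        pred = 1 if total >= n else 0   # the threshold is n
        trace.append((total, pred, label))
        if pred != label:
            # double on a missed positive, halve on a false positive
            factor = 2.0 if label == 1 else 0.5
            w = [wi * factor if bi == 1 else wi for wi, bi in zip(w, bits)]
    return w, trace


# The six examples from the table above, with their actual labels.
examples = [
    ((0, 1, 1, 1, 1), 1),
    ((0, 0, 1, 1, 0), 1),
    ((1, 0, 0, 0, 1), 1),
    ((0, 1, 0, 1, 0), 0),
    ((1, 0, 0, 1, 0), 1),
    ((0, 1, 0, 1, 0), 0),
]
weights, trace = winnow(examples, n=5)
for total, pred, label in trace:
    print(total, pred, label)
print(weights)
```

Every one of the six predictions is a mistake, so the weights change after each example, exactly as in the table.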

We can prove the following bound on the number of mistakes made by this algorithm on any sequence of examples (assuming that the labels are consistent with some concept that is a disjunction of variables.) Theorem: This variant of Winnow learns the class of disjunctions of variables in the Mistake Bound model, making at most (2 + 3r(1 + log n)) mistakes when the target concept is a disjunction of r variables.

This kind of bound is said to be "attribute efficient" because it depends only logarithmically on n, the total number of attributes. This is good when the concept actually depends on a small number of attributes (that is, r is small compared to n). By contrast, we could use a simple algorithm (dual to the one for conjunctions of variables) that starts with a disjunction of all the variables and, when it makes a mistake predicting 1 when the true label is 0, removes from the disjunction all variables that are 1 in the example. In the worst case, this algorithm might make (n-r) mistakes to learn a disjunction of r variables, encountering examples that removed only one variable per mistake. When r is small compared to n, O(r log n) is much preferable to (n-r) as a bound on the total number of mistakes.

How is the bound in the Theorem proved? First, we bound M_1, the number of mistakes on examples whose true label is 1. If x_i appears in the disjunction for the target concept, the weight w_i is never halved, because x_i is never 1 when the true label is 0. If w_i is doubled more than (log_2 n) times, then w_i is at least n, so it cannot be doubled again (because it would only be doubled if x_i = 1 and the prediction was 0, which is impossible if w_i is at least n.) Whenever a mistake is made predicting 0 on an example whose true label is 1, some x_i that actually appears in the target must be 1 and therefore has its weight doubled. A mistake on an example with true label 1 must double the weight of some x_i in the concept, and this can happen at most r(1 + log_2 n) times, because there are r variables in the target concept. Thus, M_1 is bounded above by r(1 + log_2 n).

Next we bound M_0, the number of mistakes on examples whose true label is 0, in terms of M_1. This argument uses the "potential function" technique of analyzing algorithms, with a potential function equal to the sum of the weights w_i. The sum of the weights is initially n. A mistake on an example with true label 1 adds at most n to the sum of the weights (because the sum of the weights that will be doubled was less than n, leading to the 0 prediction). A mistake on an example with true label 0 subtracts at least n/2 from the sum of the weights (because the sum of the weights that will be halved is at least n, leading to the prediction of 1.) Thinking of this as a bank account, there is an initial balance of n, each M_1 mistake deposits at most n into the account, and each M_0 mistake withdraws at least n/2 from the account, and the "account" (the sum of the weights) is always positive. Thus, the total number of withdrawals is bounded by (2 + 2M_1), the initial 2 for the initial balance, and the 2M_1 for the subsequent deposits. That is, M_0 is bounded above by (2 + 2M_1), and thus the total number of mistakes, (M_1 + M_0), is bounded above by (2 + 3M_1). Using our previous bound on M_1, this bounds the total number of mistakes by (2 + 3r(1 + log_2 n)), as claimed.

Choosing constants other than 2 and 1/2 and other improvements of Winnow are possible. Next lecture we see some guarantees for Winnow with errors or drifting concepts.

2/20/09 Lecture 17. AdaBoost, continued. (Lev Reyzin)

These notes are (mostly) from the 2005 CPSC 463a/563a lecture log.

To get a better understanding of AdaBoost, we can run it on a very simple example. Suppose the training set consists of the examples z_1 = (000, 1), z_2 = (110, 1), z_3 = (100, -1), and z_4 = (011, 1). Suppose the possible hypotheses considered by the weak learner are "decision stumps" consisting of one attribute. Thus, a1(x) = 1 if the first coordinate of x is 1, and -1 if the first coordinate of x is 0, and a1'(x) = (-1)a1(x), that is, 1 if the first coordinate of x is 0 and -1 if the first coordinate of x is 1. Similarly, a2, a2' are determined by the second coordinate of the input, and a3, a3' by the third coordinate of the input. Considering the hypothesis a1, it makes errors on z_1 (because a1(000) = -1), z_3 (because a1(100) = 1), and z_4 (because a1(011) = -1), and so has an error rate of 75% on the initial distribution on the training set. Its complement, a1', makes errors only on z_2 (because a1'(110) = -1), and so has an error rate of 25% on the initial distribution on the training set. Similarly, the error rate for a2 is 25%, for a2' is 75%, for a3 is 50%, and for a3' is 50% on the initial distribution. Assuming that the weak learner achieves an error rate less than 50% on this distribution, it must return either a1' or a2; suppose it returns a1'.

Then h_1 is a1', with epsilon_1 = .25, and AdaBoost calculates the weight alpha_1 as (to two decimal places), .55. It then updates the distribution over the examples to be: D_2(z_1) = .17, D_2(z_2) = .50, D_2(z_3) = .17, D_2(z_4) = .17. Note that the example z_2, on which a1' makes its only error, receives a large probability in this second stage. The error rates of the six possible hypotheses with respect to D_2, are 50% for a1 and a1', 17% for a2, 83% for a2', 67% for a3, and 33% for a3'. Suppose the weak learner now returns the hypothesis a2 (although a3' would also be a possibility, since it has an error rate less than 50%.) AdaBoost has h_2 = a2, with epsilon_2 = .17. This leads to a weight of alpha_2 = .79 and a new distribution D_3 on the examples, of D_3(z_1) = .50, D_3(z_2) = .30, D_3(z_3) = .10, and D_3(z_4) = .10. The error rates of the six possible hypotheses with respect to D_3 are: 70% for a1, 30% for a1', 50% for a2, 50% for a2', 80% for a3, 20% for a3'. Suppose that the weak learner now returns the hypothesis a3'. Then AdaBoost has h_3 = a3' with epsilon_3 = .20. This leads to a weight of alpha_3 = .69.

If AdaBoost only runs for T = 3 stages, it now forms the weighted sum of h_1, h_2, and h_3, namely, f(x) = .55*a1'(x) + .79*a2(x) + .69*a3'(x), and outputs the hypothesis H(x) = sign(f(x)). If we check this on the examples in the training set, we find that f(000) = .45, f(110) = .93, f(100) = -.65, and f(011) = .65. Thus, H(x) is completely consistent with the training sample, that is, it has training error equal to 0.
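
The arithmetic in this example can be checked mechanically. The sketch below, in Python, replays the three rounds with the weak learner's choices (a1', a2, a3') fixed as in the text; the helper names `stump`, `f` and `H` are illustrative. It uses exact values rather than two-decimal roundings, so its alpha_2 is about 0.80 (the .79 in the text comes from rounding epsilon_2 to .17).

```python
import math

# Training set from the example: (instance bits, label in {+1, -1}).
train = [((0, 0, 0), 1), ((1, 1, 0), 1), ((1, 0, 0), -1), ((0, 1, 1), 1)]


def stump(coord, flip=False):
    """Decision stump: +1 if bit `coord` is 1, else -1; negated when flip is set."""
    def h(x):
        value = 1 if x[coord] == 1 else -1
        return -value if flip else value
    return h


# The hypotheses returned by the weak learner in the example: a1', a2, a3'.
chosen = [stump(0, flip=True), stump(1), stump(2, flip=True)]

m = len(train)
D = [1.0 / m] * m                 # D_1 is uniform
alphas, dists = [], []
for h in chosen:
    # weighted error of h under the current distribution
    eps = sum(d for d, (x, y) in zip(D, train) if h(x) != y)
    alpha = 0.5 * math.log((1 - eps) / eps)
    # y*h(x) is +1 on a correct prediction and -1 on an error
    D = [d * math.exp(-alpha * y * h(x)) for d, (x, y) in zip(D, train)]
    Z = sum(D)                    # normalization factor Z_t
    D = [d / Z for d in D]
    alphas.append(alpha)
    dists.append(D)               # dists[0] is D_2, dists[1] is D_3


def f(x):
    """Weighted sum of the three chosen hypotheses."""
    return sum(a * h(x) for a, h in zip(alphas, chosen))


def H(x):
    """Final hypothesis sign(f(x)), with sign(0) taken as +1."""
    return 1 if f(x) >= 0 else -1


print([round(a, 2) for a in alphas])
print([[round(d, 2) for d in dist] for dist in dists[:2]])
print([(x, round(f(x), 2), H(x)) for x, _ in train])
```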

But what does error on the *training set* say about performance on other data? Could we not just be horribly overfitting the training data? For this question, we go back to the PAC model and assume that the training data was chosen (independently) from a distribution D on (instance, label) pairs (x,y), and ask for bounds on the error of AdaBoost's final hypothesis h(x), that is, bounds on the probability that h(x) is not equal to y if (x,y) is drawn from D. Freund and Schapire proved one bound on this error: it is bounded above by the error of h on the training set PLUS soft-Oh(sqrt(Td/m)), where m is the number of elements in the training set, T is the number of rounds of boosting, and d is the VC-dimension of the space of all possible base classifiers that could be returned by the weak learner. (Soft-Oh, frequently denoted by a tilde over a big-Oh, is like big-Oh, but suppresses factors that are polylogarithmic in the explicit variables, as well as constant factors. Thus, (n log n)^2 is Soft-Oh(n^2), for example.) This bound suggests a tradeoff between driving the training set error down and using lots of rounds to do so (increasing T). But the empirical evidence suggests that this bound is too pessimistic in certain circumstances; empirically, many rounds may sometimes NOT blow up the generalization error.

Where does the bound come from? Underlying it is a bound of 2(d+1)(T+1)log(e(T+1)) on the VC-dimension of thresholds of linear combinations of T classifiers from a base class of dimension d, proved by Baum and Haussler. Using soft-Oh, we see that this is soft-Oh(dT). This is used in a theorem of Vapnik to get the generalization bound above. Thinking about this in the context of, say, a class of base classifiers consisting of decision stumps, there is a fixed number of possible base classifiers (for each attribute, each possible way of assigning +1/-1 to its possible values.) Thus, the number of distinct classifiers appearing in the final linear combination is bounded by this quantity, and doesn't increase arbitrarily with T. Thus, in this specific case, the growth of T is irrelevant after a certain threshold.

There is another bound on the generalization error that refers to the margin of the final classifier on the training set, and does not involve the number (T) of rounds of training at all. That bound has a parameter theta > 0 and says that the generalization error of the final hypothesis h(x) returned by AdaBoost is bounded above (whp) by the SUM of the fraction of the training set with margin less than theta, and a term that is soft-Oh(sqrt(d/(m{theta}^2))). Again, d is the VC-dimension of the class of base classifiers, and m is the number of examples in the training set. Letting f(x) denote the weighted sum of the base classifiers before thresholding, so that f(x) = sum(t=1,T)(alpha_t)(h_t(x)), we have that the final hypothesis is h(x) = sign(f(x)). Then the *margin* of a pair (x,y) is y*f(x)/sum(t=1,T)|alpha_t|. The margin takes on a value between -1 and +1. It is negative if the prediction h(x) disagrees with the value y in (x,y) and positive if it agrees. It is close to +1 if the vote is overwhelmingly in agreement with the label y, a positive value close to 0 if the vote is slightly in agreement with the label y, a negative value close to 0 if the vote slightly disagrees with the label y, and close to -1 if the vote overwhelmingly disagrees with the label y. We may view AdaBoost as attempting to find a classifier h(x) with large (close to 1) margins on lots (a fraction close to 1) of the examples in the training set, which would mean a large fraction of the training set is correctly classified by a "confident" vote. (AdaBoost does not directly optimize this quantity.) Then in the bound above, for a specific theta > 0, we'd like the fraction of examples from the training set with margins below theta to be "small" (the first term of the bound) and yet theta itself to be not too small (since the second term is proportional to 1/theta.)
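
To make the definition of the margin concrete, here is a small Python sketch that computes it for the four training examples of the worked example earlier in this lecture, using the rounded weights .55, .79 and .69. The function names `margin` and `margin_error` are illustrative, and the thresholds theta used at the end are arbitrary values chosen for the demonstration.

```python
def margin(alphas, votes, y):
    """Normalized margin y*f(x) / sum_t |alpha_t| of one labeled example.

    alphas: the weights alpha_t; votes: the values h_t(x) in {+1, -1}; y: the label.
    """
    f = sum(a * v for a, v in zip(alphas, votes))
    return y * f / sum(abs(a) for a in alphas)


def margin_error(margins, theta):
    """Fraction of training examples whose margin is below theta."""
    return sum(1 for mg in margins if mg < theta) / len(margins)


alphas = [0.55, 0.79, 0.69]          # weights of a1', a2, a3' (rounded)
# (votes of a1', a2, a3' on the example, label) for 000, 110, 100, 011
voted = [((1, -1, 1), 1), ((-1, 1, 1), 1), ((-1, -1, 1), -1), ((1, 1, -1), 1)]
margins = [margin(alphas, votes, y) for votes, y in voted]

print([round(mg, 2) for mg in margins])
print(margin_error(margins, 0.25), margin_error(margins, 0.2))
```

All four margins are positive (training error 0), but none is close to 1, so the first term of the bound is 0 only for fairly small theta.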

There are a number of other variations on the boosting framework that make different choices of the weights alpha_t to assign to the weak hypotheses h_t that are incorporated into the final hypothesis h. Lev talked about his theoretical and empirical research with Prof. Schapire on a variant of boosting based on maximizing the minimum margin of any of the sample points.

2/18/09 Lecture 16. Weak learning, boosting and AdaBoost. (Lev Reyzin)

These notes are from the 2005 CPSC 463a/563a lecture log. See also the references to boosting in [Papers].

Boosting is somewhat analogous to the unfortunate tendency of an oral exam to spend the most time on material where the student's performance is weakest. Schapire, in his thesis in 1989, gave the first polynomial time method of converting a weak learning algorithm (able to get some error rate less than (1/2 - 1/p(n)) on ANY input distribution) into a (strong) PAC learning algorithm (able to get any error rate less than epsilon (at a cost polynomial in 1/epsilon.)) Freund in 1990 gave a more practical algorithm, and Freund and Schapire developed the idea into the AdaBoost algorithm. They received the 2003 Godel Prize for their 1995 AdaBoost paper.

We look at the AdaBoost algorithm as presented in ``A brief introduction to boosting'' by Schapire. We assume a domain X of examples. Labels will be +1 and -1 instead of 1 and 0; this makes some expressions more concise. In particular, if h is a hypothesis and (x,y) is a labeled example, then the expression y*h(x) will be 1 if h(x) = y and -1 otherwise. The input to AdaBoost is a fixed set of labelled examples, denoted (x_1, y_1), ..., (x_m, y_m), and called the training set. (We do NOT assume access to an EXAMPLES oracle.) We assume that there is a "weak" or "base" learner that takes as input a probability distribution over the training set and returns a hypothesis h mapping X to {+1, -1}.

AdaBoost operates in stages: t = 1,2,...,T. At each stage, it computes a probability distribution on the labelled examples in the training set. Let D_t(i) denote the probability assigned to the i-th labelled example, (x_i, y_i), in stage t. Initially, the probabilities are uniform, that is, D_1(i) = 1/m, for i = 1,2,...,m; all m labelled examples have equal probability. In stage 1, AdaBoost calls the weak learner on the training set with this initial distribution, and the weak learner returns its first hypothesis, h_1. If the weak learner cannot deal directly with weighted examples, then AdaBoost can simulate randomly and independently drawing examples from the training set according to the current probability distribution, to supply examples to the weak learner. After it receives the hypothesis of the weak learner, AdaBoost updates the distribution and repeats, until t = T.

To update the distribution D_t to get the new distribution D_(t+1), AdaBoost calculates the error rate epsilon_t of the weak learner's hypothesis h_t on the training sample. That is, epsilon_t is the sum of D_t(i) for all examples (x_i,y_i) from the training set such that h_t(x_i) is not equal to y_i. This is used to calculate a weight, alpha_t, to assign to the hypothesis h_t. In particular, alpha_t is chosen to be (1/2) ln((1 - epsilon_t)/epsilon_t). Note that the weight alpha_t will be positive if epsilon_t is less than 1/2, and increases as epsilon_t decreases. Thus more weight in the final hypothesis will be given to those h_t with smaller error rates. To update the probability assigned to (x_i,y_i), AdaBoost multiplies D_t(i) by exp(alpha_t) if h_t(x_i) is not equal to y_i, and by 1/exp(alpha_t) if h_t(x_i) = y_i, and divides by a normalization factor (Z_t), equal to the sum of all the updated values, so that D_(t+1) will be a probability distribution. Thus, the probability of each example on which h_t makes an error is INCREASED (by a factor proportional to sqrt((1 - epsilon_t)/epsilon_t)) and the probability of each example on which h_t does not make an error is DECREASED (by a factor proportional to the inverse of the preceding factor.) Once the probability distribution D_(t+1) is computed, AdaBoost moves to stage t+1 and calls the weak learner again.

Once stage T is reached, the weak learner produces a hypothesis h_T and AdaBoost calculates its weight alpha_T. AdaBoost then combines the hypotheses h_1, h_2, ..., h_T as a weighted sum f(x) = sum(t=1,T) alpha_t*h_t(x). This is a real valued function (values not restricted to +1 and -1), so the final hypothesis output by AdaBoost is H(x) = sign(f(x)), where sign(x) is -1 if x is negative, and +1 otherwise. Thus, the final hypothesis of AdaBoost is in effect a weighted vote of the hypotheses h_1, h_2, ..., h_T, produced by the weak learner in response to the various modified distributions computed by AdaBoost, where alpha_t is the weight of the vote of h_t.

Note also that the final hypothesis is a weighted sum of hypotheses from the base class, so AdaBoost is typically NOT a "proper" learning algorithm with respect to the base class. (E.g., if the base class is half spaces, the final hypothesis is typically not a half space.)

We finished by interpreting the choice of weights in AdaBoost as a greedy minimization of the normalization factor Z_t at each round. Assuming that the error epsilon_t is at most 1/2 - gamma at each round, for some gamma > 0, we have that the error of AdaBoost's hypothesis on the *training set* is at most exp(-2T{gamma}^2) after T rounds, which decreases exponentially fast as a function of T.
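
The greedy choice can be checked with a short calculation. The following is a sketch under two assumptions that are standard in the analysis of AdaBoost but are not spelled out in these notes: that the training error is bounded above by the product of the Z_t, and that we write epsilon_t = 1/2 - gamma_t with gamma_t >= gamma (the symbol gamma_t is introduced here only for this derivation).

```latex
\begin{align*}
Z_t(\alpha) &= \sum_{i=1}^{m} D_t(i)\, e^{-\alpha y_i h_t(x_i)}
             = (1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}, \\
\frac{dZ_t}{d\alpha} = 0
  &\iff \alpha = \alpha_t = \tfrac{1}{2} \ln\frac{1-\epsilon_t}{\epsilon_t}, \\
Z_t(\alpha_t) &= 2\sqrt{\epsilon_t (1-\epsilon_t)}
               = \sqrt{1 - 4\gamma_t^2} \;\le\; e^{-2\gamma_t^2}, \\
\frac{1}{m}\,\bigl|\{\, i : H(x_i) \ne y_i \,\}\bigr|
  &\le \prod_{t=1}^{T} Z_t \;\le\; e^{-2T\gamma^2}.
\end{align*}
```

The third line uses 1 - x <= exp(-x) with x = 4{gamma_t}^2, and the last step uses gamma_t >= gamma in every round.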

2/16/09 Lecture 15. Chernoff bounds; noise and errors.

Last lecture we were left with the following problem: given a hypothesis concept h and positive numbers epsilon and delta, and access to draws of labeled examples, output "yes" with probability at least (1 - delta) if the error rate of h is at most epsilon/2 and output "no" with probability at least (1 - delta) if the error rate of h is at least epsilon. (If the error rate of h is between epsilon/2 and epsilon, we may output either "yes" or "no".) If we draw a sequence of n labeled examples, we can define the random variable X_i to be 1 if h disagrees with the label of the i-th example, and 0 if h agrees with the label of the i-th example. Thus S = X_1 + X_2 + ... + X_n is a random variable representing the number of errors h makes on these n labeled examples. Each X_i can be thought of as an independent biased coin flip (or Bernoulli trial) with probability p of being heads (1), where p is the true error rate of the hypothesis h. The variable S is therefore binomially distributed; that is, Pr(S = k) is (n choose k)p^k(1-p)^{n-k}. The expected value of S is E(S) = pn. We can estimate p by taking S/n, that is, the fraction of errors in the n examples. Suppose we decide that we'll compare S/n with (3/4)epsilon, and output "yes" if it is smaller and "no" if it is larger. We'd like to figure out how large a sample we should take to ensure that if p <= epsilon/2 then we'll say "yes" with probability at least (1 - delta) and if p >= epsilon then we'll say "no" with probability at least (1 - delta). To do this, the tools that we use are the "Chernoff bounds", which give an upper bound on the probability in the tails of the binomial distribution.

The Chernoff bounds. In particular, for 0 <= gamma <= 1, we have for the upper tail:

    Pr(S > (1 + gamma)np) <= exp(-(gamma^2)np/3)
and for the lower tail
    Pr(S < (1 - gamma)np) <= exp(-(gamma^2)np/2)
These inequalities show that if we fix p and gamma, as we increase n the probability of an outcome greater than (1+gamma) times the mean (recall E(S) = np) or smaller than (1-gamma) times the mean decreases exponentially. We won't prove these bounds, but we will see how they apply to the problem of finding a sufficient sample size for our hypothesis testing algorithm.
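
As a sanity check, the two bounds can be compared with exact binomial tail probabilities for one choice of parameters. The Python sketch below does this for n = 200, p = 0.1 and gamma = 1/2, values chosen only for illustration so that the mean np = 20 and the two thresholds 30 and 10 are integers; the function names are likewise illustrative.

```python
import math


def pmf(n, p, k):
    """Pr(S = k) for S binomially distributed with parameters n and p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)


def upper_tail(n, p, k):
    """Exact Pr(S > k)."""
    return sum(pmf(n, p, j) for j in range(k + 1, n + 1))


def lower_tail(n, p, k):
    """Exact Pr(S < k)."""
    return sum(pmf(n, p, j) for j in range(0, k))


n, p, gamma = 200, 0.1, 0.5
mean = 20                                        # np
upper_exact = upper_tail(n, p, 30)               # Pr(S > (1 + gamma) np)
upper_bound = math.exp(-gamma**2 * mean / 3)
lower_exact = lower_tail(n, p, 10)               # Pr(S < (1 - gamma) np)
lower_bound = math.exp(-gamma**2 * mean / 2)

print(upper_exact, upper_bound)
print(lower_exact, lower_bound)
```

In this instance the exact tails are well below the bounds; the bounds are convenient rather than tight.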

Suppose p >= epsilon. What is the probability that our estimate of it, S/n, will be less than (3/4)epsilon? This would mean that S is less than (3/4)n*epsilon, which is at most (3/4)np = (1 - 1/4)np. Taking gamma = (1/4) and using the bound for the lower tail, the probability that this happens is less than or equal to exp(-np/32), which is less than or equal to exp(-n*epsilon/32). To make this probability less than or equal to delta, it suffices to take n >= (32/epsilon) ln(1/delta), which is O((1/epsilon) ln (1/delta)). When you see Chernoff bounds used in the literature, often the author will just say something like "O((1/epsilon) ln (1/delta)) samples suffice, by the Chernoff bounds."

Suppose instead that p <= epsilon/2. What is the probability that our estimate of it, S/n, will be greater than (3/4)epsilon? This would mean that S is greater than (3/4)n*epsilon = (1 + 1/2)(1/2)n*epsilon. At this point, we use the monotonicity of the bounds: if p <= epsilon/2, the probability that S comes out to be greater than some bound B is at most the probability that S' comes out to be greater than the same bound B, where S' is the binomial random variable in which all the coin flips have probability of success of epsilon/2 rather than p. Thus, the worst case is when p = epsilon/2, and we apply the upper tail bound with gamma = 1/2 to say that the probability that S/n is greater than (3/4)epsilon is at most exp(-n*epsilon/24). Thus, if n >= (24/epsilon) ln (1/delta), the probability that S/n is greater than (3/4)epsilon in this case is at most delta. Once again, O((1/epsilon) ln (1/delta)) samples suffice.

Finally, putting these together, we see that if we take n >= (32/epsilon) ln (1/delta) samples to test our hypothesis h, saying "yes" if S/n is at most (3/4)epsilon and saying "no" otherwise, then if h is (epsilon/2)-good, we say "yes" with probability at least (1 - delta), and if h is epsilon-bad, we say "no" with probability at least (1 - delta). If the error rate of h is between epsilon/2 and epsilon, we may say "yes" or "no".
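
The resulting test is short enough to write down. The following Python sketch computes the sample size from the bound above and applies the (3/4)epsilon threshold to a list of observed error indicators; the function names `sample_size_for_test` and `accept_hypothesis` are illustrative, and the two small lists at the end are made-up inputs that only exercise the threshold.

```python
import math


def sample_size_for_test(epsilon, delta):
    """Smallest integer n with n >= (32/epsilon) * ln(1/delta)."""
    return math.ceil((32.0 / epsilon) * math.log(1.0 / delta))


def accept_hypothesis(errs, epsilon):
    """Say "yes" (True) iff the observed error fraction S/n is at most (3/4)*epsilon.

    errs: list of 0/1 indicators X_i, where 1 means h disagrees with the label.
    """
    return sum(errs) / len(errs) <= 0.75 * epsilon


n_needed = sample_size_for_test(epsilon=0.1, delta=0.05)
print(n_needed)

# 7 errors in 100 draws is below the threshold 0.075; 8 errors is above it.
print(accept_hypothesis([1] * 7 + [0] * 93, epsilon=0.1))
print(accept_hypothesis([1] * 8 + [0] * 92, epsilon=0.1))
```

With these toy lists the sample is far smaller than `n_needed`, so no confidence guarantee applies to them; they only show the decision rule.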

How can we use this test of a single hypothesis to help us get an algorithm for the class of unions of axis aligned rectangles when we do not have a bound on s, the number of rectangles in the union? As before, we guess values of s, say s = 1, 2, 4, 8, ..., and use our greedy covering algorithm to find a hypothesis h_s, with a sample size based on the value of s, error bound epsilon/2 and confidence bound (1/2)delta/2^s. We then test h_s as above, with a sample size based on epsilon and (1/2)delta/2^s. If h_s passes the test (i.e., the answer is "yes"), we output h_s and halt. Otherwise, we move on to the next value of s. It is possible for this procedure to run forever, just as it is possible for a sequence of independent unbiased coin flips to come up heads forever, but we can show that its expected running time and sample size will be polynomial in s, (1/epsilon) and (1/delta), and that when it halts, its probability of giving a hypothesis h_s with error at most epsilon is at least (1 - delta), that is, PAC learning is achieved for the class of unions of axis parallel rectangles.

Noise and errors in examples. The PAC model has been extended to situations with noise and errors in examples. One of the most benign models of noise that have been considered is the model of classification noise. In this model, when the algorithm draws a labeled example, the process is this: a point x is drawn independently according to the distribution D, and labeled according to the target concept c. Then, before the example is given to the algorithm, a biased coin is flipped (with probability of heads eta, where 0 < eta < 1/2) and if it comes up heads, the label on the example is reversed (+ to -, or - to +) before the example is given to the learning algorithm. In this situation, there may no longer be any concept c in C that is completely consistent with the noisy labeled examples, so we have to revise our strategy of finding a consistent concept. Also, many of our algorithms so far will fail quite spectacularly if there is noise in the examples -- think of the algorithm for learning conjunctions of literals that eliminates literals if they are inconsistent with even a single positive example. To improve such algorithms, we have to take a more statistical view of the evidence for or against a literal.
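
The classification noise process is easy to simulate. Below is a minimal Python sketch of a noisy example oracle; the name `noisy_example_oracle`, the uniform distribution on [0, 1) standing in for D, and the threshold target concept are all assumptions made for the demonstration.

```python
import random


def noisy_example_oracle(draw_point, target, eta, rng):
    """Return a function that draws one example with classification noise.

    draw_point(rng) samples a point x from D; target(x) is the true label
    in {+1, -1}; with probability eta the label is reversed before the
    example is handed to the learner.
    """
    def draw():
        x = draw_point(rng)
        y = target(x)
        if rng.random() < eta:      # biased coin comes up heads
            y = -y
        return x, y
    return draw


rng = random.Random(0)
target = lambda x: 1 if x >= 0.5 else -1     # hypothetical target concept
draw = noisy_example_oracle(lambda r: r.random(), target, eta=0.2, rng=rng)

sample = [draw() for _ in range(20000)]
flipped = sum(1 for x, y in sample if y != target(x)) / len(sample)
print(flipped)    # observed fraction of reversed labels, close to eta
```

Note that the learner sees only the pairs in `sample`; it cannot tell which labels were reversed.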

If we consider eta = 1/2, then all information about the label of an example is obliterated -- the label becomes just an unbiased coin flip. As eta approaches 1/2, the noise in the label threatens to overwhelm the information in the label. Hence, we relax our notion of PAC learning in this setting to allow sample size and running time that are polynomial in the usual parameters and also the new parameter 1/(1 - 2*eta). Note that this quantity approaches infinity as eta approaches 1/2. We also assume that we do not know the exact error rate eta, but we are given an upper bound eta_b < 1/2 on it, and the sample size and running time bounds are permitted to grow polynomially in 1/(1 - 2*eta_b). The goal still remains PAC learning: to output (with probability at least (1 - delta)) a hypothesis h whose error with respect to c and D is at most epsilon. That is, our goal is still to learn the "unnoisy" version of c.

Given that we can no longer require strict consistency with the sample, we can consider a strategy that finds a hypothesis h in C (or H) that minimizes disagreements of h with the sample, that is, minimizes the number of labeled examples for which the label given by h disagrees with the label in the example. Does this strategy make sense in terms of sample size? Yes, for example in the case of a finite concept class C, we can give a sample size bound that is polynomial in (ln |C|), (1/epsilon), (1/delta) and (1/(1 - 2*eta)) that guarantees that an algorithm that finds a c in C that minimizes disagreements with the sample still achieves PAC learning.

However, minimizing disagreements is computationally "even harder" than finding a consistent hypothesis. In particular, Feldman, Gopalan, Khot and Ponnuswami have shown that for any constant epsilon > 0, finding a conjunction of literals that agrees with an unknown function on a fraction (1/2) + epsilon of examples is NP-hard even when there exists a conjunction of literals that agrees with the function on a fraction (1 - epsilon) of examples. To be more concrete, even if we are assured that there is some conjunction of literals agreeing with 99% of the examples, finding one that agrees with 51% of the examples is NP-hard in general. (It is easy to get at least 50% -- just guess the constant function that agrees with the majority of the labels.) In this case, though there is a straightforward algorithm for finding a consistent conjunction of literals (when one exists), minimizing disagreements with a sample is computationally hard (at least if P is not equal to NP.)

Thus, instead of trying to minimize disagreements, we may try other strategies: in fact, conjunctions of literals are PAC learnable even with classification noise. There is a model, learning from statistical queries, that has been helpful in finding PAC learning algorithms that can cope with random classification noise.

The model of malicious errors was proposed by Valiant. In this model, a point is drawn according to D and labeled according to the target concept c, and then a biased coin with probability eta > 0 of heads is flipped. If the coin comes up tails, the example is given to the learner as is. If the coin comes up heads, the example and its label are replaced by an adversarially chosen ("malicious") point and label, where the point and label need have nothing at all to do with the original point, the distribution D or the target concept c. Thus, a fraction of up to eta of the examples are arbitrarily corrupted in this model. As you might imagine, it is much harder to PAC learn in this setting. There are also intermediate settings, in which the values of the attributes of a point may be randomly modified as well as its label.

In the model of agnostic learning, the target concept c is not necessarily drawn from a known class of concepts, but there is a specification C of the class of concepts that the learner should use. In this case, the learner's goal is to find a hypothesis h in C that is nearly as good as the best approximation in C to the target c. For example, if C is the class of half spaces, then even if c is not a half space, there is some half space h_0 that minimizes the error error(h_0) = Pr_D(h_0 XOR c). The goal of learning is to find a half space h whose error, Pr_D(h XOR c), is "almost" as small as the minimum possible error for a half space, that is, error(h_0).

Weak and strong learning. We may relax the requirements of PAC learning to get the following notion of "weak" learning. A class C of concepts is weakly learnable if there exists some epsilon_0 < 1/2 and some delta_0 < 1 such that there is an algorithm A such that on any concept c from C and any probability distribution D on X, with probability at least (1 - delta_0), A outputs a hypothesis h such that the error of h is at most epsilon_0. In contrast to this, the original notion of PAC learning can be termed "strong learning." The equivalence of weak and strong learning was originally proved by Rob Schapire in his Ph. D. thesis, and further developments of it in collaboration with Yoav Freund led to the method of Boosting, which is the topic of the next lecture, which will be given by Lev Reyzin.

2/13/09 Lecture 14. Learning a union of axis parallel rectangles, continued.

Last lecture we saw that if the target concept is a union of at most s rectangles (concept class C) then a greedy covering method would let us efficiently find a union of at most (s ln m)+1 rectangles consistent with a sample of size m (concept class H). We'd like to use the VC dimension sample size bound for H to tell us how many examples are sufficient for PAC learning by this algorithm.

Recall that the VC dimension bound is

m >= k(1/epsilon)(d(H) ln (1/epsilon) + ln(1/delta))
where d(H) is the VC dimension of the hypothesis class. Using the bound on the VC dimension of unions of concepts from a base class, the VC dimension of unions of at most (s ln m)+1 axis parallel rectangles is at most
   8(s ln m + 1)log(3(s ln m + 1)).
Substituting this for d(H) in the sample size bound above, we get
m >= k(1/epsilon)(8(s ln m + 1)log(3(s ln m + 1)) ln (1/epsilon) + ln(1/delta))
which has m on both sides of the inequality. However, on the left we have m and on the right we have terms that are O(log m) and O(log log m), so it is clear that we can satisfy the inequality by choosing m large enough. But will m still be polynomial in (1/epsilon), (1/delta) and s?

To get some insight into this, consider the following simpler inequality:

    m >= x log m
where m and x are positive quantities. It is not quite sufficient to take m = x log x (try it). However, if we take m = x^2, we must have x^2 >= 2x log x, which will be satisfied if x >= 4 and the log is base 2. Thus, even without solving for m in the above inequality, we can be sure that a polynomial function of (1/epsilon), (1/delta) and s will be sufficient. Thus, modulo the problem of not knowing s, the bound on the correct number of rectangles, our polynomial time greedy covering algorithm succeeds in PAC learning unions of axis parallel rectangles. (A generalization of this idea shows that compressing the sample by a sufficient amount implies PAC learnability.)
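The "try it" above can be done numerically. The following is a minimal sketch, assuming base 2 logarithms as in the text; the helper name satisfies is made up for this illustration.

```python
import math

def satisfies(m, x):
    """Return True when m >= x * log2(m)."""
    return m >= x * math.log2(m)

for x in [4, 16, 256, 1024]:
    m_small = x * math.log2(x)   # the candidate m = x log x
    m_square = x * x             # the candidate m = x^2
    print(x, satisfies(m_small, x), satisfies(m_square, x))
```

For each of these values of x the first candidate fails and the second succeeds.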

So, how do we deal with the problem of not knowing s, the bound on the number of rectangles in the union? Now we will assume that our algorithm does not need to specify in advance the number of examples it will draw; it can continue drawing labeled examples until it decides to output a hypothesis and halt. However, the expected total number of examples drawn should still be polynomial in (1/epsilon), (1/delta) and (the unknown value) s. A reasonable approach is "guess and check": that is, assume that s = 1, draw samples, find a hypothesis h_0 and test to see if hypothesis h_0 has error at most epsilon. If so, output h_0 and halt. Otherwise, assume s = 2, draw samples, find a hypothesis h_1 and test to see if h_1 has error at most epsilon. If so, output h_1 and halt. Otherwise, assume s = 4, draw samples, find a hypothesis h_2 and test to see if h_2 has error at most epsilon. If so, output h_2 and halt, etc. Eventually the guess of s will be sufficiently large, and at that point our covering algorithm should succeed in PAC learning.
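The doubling loop can be sketched as follows. This is only an outline of the control flow: learn_with_bound and passes_check are hypothetical callables standing in for "draw samples and run the covering algorithm assuming at most s rectangles" and "test whether the hypothesis has small enough error"; neither is specified at this point in the notes.

```python
def guess_and_check(learn_with_bound, passes_check):
    """Try s = 1, 2, 4, ... until a hypothesis passes the check."""
    s = 1
    while True:
        h = learn_with_bound(s)   # hypothetical: learn assuming at most s rectangles
        if passes_check(h):       # hypothetical: accept when the error looks small
            return s, h
        s *= 2

# Toy stand-ins, only to show the sequence of guesses.
print(guess_and_check(lambda s: s, lambda h: h >= 5))   # (8, 8)
```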

To implement this idea, we need a method of checking whether a hypothesis h has error at most epsilon, and we need to allocate our confidence budget of delta so that the probability of failure from all sources is at most delta.

So we consider the problem: given a hypothesis h and the ability to draw labeled examples, can we decide whether the error of h is at most epsilon? No, that is too much to ask: we cannot get a sharp cutoff at epsilon, saying "yes" with high probability if the error rate is at most epsilon and "no" with high probability if the error rate is greater than epsilon. To see why, assume you decide that a million labeled examples are enough. Then I will give you the task of deciding whether the error rate is epsilon or epsilon+(1/trillion). What we can do is a somewhat more relaxed task: say "no" with high probability if the error rate is greater than epsilon, say "yes" with high probability if the error rate is less than epsilon/2, and say either "yes" or "no" if the error rate is between epsilon/2 and epsilon. To do this, we shall use a technical tool that goes by the name "Chernoff bounds" in computer science (see next lecture).

2/11/09 Lecture 13. Learning a union of axis parallel rectangles.

First, we consider the issue of improper learning again. There is a target class C of concepts, and another class H of hypothesis concepts such that every concept representable in C is also representable in H. What can we say about sample size bounds if a learning algorithm learns concepts c in C using hypotheses h in H? Sample size lower bounds based on C are still valid, while sample size upper bounds based on H are sufficient for learning concepts from C.

Suppose now we consider points (x,y) in the plane, and concepts representable as unions of axis parallel rectangles. The VC dimension of the concept class is infinite: for any finite set S of points, we can produce an arbitrary labeling of S by putting a tiny rectangle around each point that is to be labeled +. However, this doesn't seem to be a reasonable choice for a hypothesis given a labeled sample. More reasonable is to consider using as few rectangles as possible to cover the + points while avoiding all the - points. Can we say anything about the sample size that would be sufficient for this strategy to achieve PAC learning?

Suppose we introduce another parameter, s, and consider the class of unions of at most s rectangles. What is an upper bound on the number of labelings of m points in the plane by unions of at most s rectangles? We can restrict our attention to rectangles that have been shrunk as much as possible without excluding any + points -- each one can be specified by choosing (at most) 4 + points. Thus, m^{4s} choices of points are enough to specify up to s axis parallel rectangles, and this is an upper bound on the number of labelings of m points by unions of at most s axis parallel rectangles. In fact, in the BEHW paper there is an upper bound of 2dslog(3s) on the VC dimension of the class of unions of at most s concepts from a base class of VC dimension d, which means that unions of s rectangles in the plane have VC dimension bounded by 8slog(3s) (substituting 4 for d). This can be substituted into the sample size bound based on VC dimension to show that a sample of size

m >= k(1/epsilon)(8slog(3s)log(1/epsilon) + log(1/delta))
is sufficient to ensure PAC learning by an algorithm that finds the union of the fewest axis parallel rectangles to cover the + points and exclude the - points. (The fewest will always be at most s, since the target concept has at most s rectangles.)

However, there are two issues with this approach: (1) we don't know s a priori and (2) finding the fewest axis parallel rectangles to cover all the + points and none of the - points in a given sample is an NP-hard problem. We'll address point (1) later, but for point (2) we turn to a general and useful technique: be greedy. That is, given a sample of + and - points, we find an axis parallel rectangle R_1 that covers as many of the + points as possible without covering any - points. We put R_1 into our hypothesis and remove the covered + points from the sample. Then we find an axis parallel rectangle R_2 that covers as many of the remaining + points as possible without covering any - points. We put R_2 into our hypothesis and remove the covered + points from the sample. We continue this way, greedily choosing rectangles until all the + points are covered, and output the set of chosen rectangles as our hypothesis h. Note that in looking for rectangles, we can use sets of up to 4 + points to determine the x and y coordinates of possible rectangles, so we only have to consider O(m^4) possible rectangles. Thus this greedy algorithm runs in polynomial time, but it is not guaranteed to find an optimal solution.
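A minimal sketch of the greedy covering method, assuming the sample is given as lists of (x, y) tuples and that no point appears with both labels. The function names are made up for this illustration, and no attempt is made at efficiency beyond the O(m^4) candidate enumeration described above.

```python
from itertools import product

def covers(rect, p):
    """rect is (x_lo, x_hi, y_lo, y_hi); p is a point (x, y)."""
    x_lo, x_hi, y_lo, y_hi = rect
    return x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi

def greedy_rectangles(positives, negatives):
    """Greedily pick rectangles until every + point is covered."""
    xs = sorted({p[0] for p in positives})
    ys = sorted({p[1] for p in positives})
    # Candidate sides come from coordinates of + points: O(m^4) rectangles.
    candidates = [
        (x_lo, x_hi, y_lo, y_hi)
        for x_lo, x_hi, y_lo, y_hi in product(xs, xs, ys, ys)
        if x_lo <= x_hi and y_lo <= y_hi
    ]
    # Keep only candidates that contain no - point.
    legal = [r for r in candidates if not any(covers(r, q) for q in negatives)]
    remaining = set(positives)
    chosen = []
    while remaining:
        best = max(legal, default=None,
                   key=lambda r: sum(covers(r, p) for p in remaining))
        if best is None or not any(covers(best, p) for p in remaining):
            raise ValueError("some + point coincides with a - point")
        chosen.append(best)
        remaining = {p for p in remaining if not covers(best, p)}
    return chosen

hypothesis = greedy_rectangles([(0, 0), (1, 1), (5, 5), (6, 6)], [(3, 3)])
print(len(hypothesis))   # 2
```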

What can we say about the number of rectangles used by the greedy method? In fact, it will use at most (s ln m + 1) rectangles. Consider the first rectangle chosen, R_1. Because all of the + points are covered by at most s rectangles in the target concept, a fraction of at least 1/s of the points must be covered by some rectangle in the concept. Because we are choosing R_1 to cover the maximum number of + points (without covering - points), R_1 must cover a fraction of at least 1/s of the (at most) m positive points, leaving at most m(1 - 1/s) points to be covered. When we choose R_2 to cover as many of the remaining + points as possible, we know that there are s rectangles (in the target concept) that cover all of them, so there is a rectangle that covers at least a fraction 1/s of the remaining points, and by our choice of R_2, it must cover at least a fraction 1/s of the remaining points, leaving at most m(1 - 1/s)^2 positive points not covered by R_1 or R_2. Continuing in this way, after we've chosen t rectangles, there are at most m(1 - 1/s)^t positive points left to be covered. When t > (s ln m), we have m(1 - 1/s)^t < 1, that is, NO positive points are left to be covered. Thus our hypothesis has at most (s ln m)+1 rectangles in it. Not optimal (which would be s or less), but good enough, as we shall see.
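The counting step can be checked numerically. This sketch computes the smallest t for which m(1 - 1/s)^t drops below 1 and compares it with (s ln m) + 1; the function name and the particular values of m and s are chosen only for illustration.

```python
import math

def rectangles_needed(m, s):
    """Smallest t with m * (1 - 1/s)**t < 1."""
    t = 0
    remaining = float(m)
    while remaining >= 1:
        remaining *= 1 - 1 / s
        t += 1
    return t

for m, s in [(100, 2), (1000, 5), (10**6, 10)]:
    print(m, s, rectangles_needed(m, s), s * math.log(m) + 1)
```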

So we are in a situation of improper learning, where C is the class of unions of at most s rectangles, and H is the class of unions of at most (s ln m)+1 rectangles. Oops, the concept class H depends on the sample size m -- is this going to be a problem? (See next lecture.)

2/9/09 Lecture 12. A hardness result for proper learning of 3-term DNF formulas, continued.

For proper learning, we have a target concept class C (like 3-term DNF formulas) and the hypotheses output by the learning algorithm also have to be from C. For improper learning, we have a target concept class C (like 3-term DNF formulas) and a different hypothesis class H (like CNF formulas with at most 3 literals per clause); the learning algorithm is allowed to output hypotheses from H rather than C. (We assume that H is such that every concept representable in C is also representable in H, so that there is at least one h in H equal to the target concept c in C.) Here we consider the proper learning of 3-term DNF formulas.

The hardness of proper learning is conditional upon a complexity theoretic assumption, that NP (nondeterministic polynomial time) and RP are not equal. The complexity class RP is contained in the class NP and includes problems that can be solved with polynomial time randomized algorithms that have one-sided error. The most famous such algorithm is the Miller-Rabin test for primality: given an input number n, if n is prime, then the algorithm always outputs "prime", and if n is composite, then the algorithm outputs "composite" with probability at least 1/2. The separate runs of the algorithm are statistically independent (it involves generating a random possible "witness" to compositeness), so to increase the chances that we correctly assess the primality of n, we can re-run the algorithm several times on the same input n. If the answer is ever "composite", then we know for sure that n is composite (the algorithm has found a witness to the compositeness of n); if the answer is "prime" in k repeated runs, then either n is really prime, or we were quite unlucky (an event of probability at most (1/2)^k has happened.) (This is the test used in practice to generate cryptographic keys for the RSA cryptosystem, even though we now know a deterministic algorithm for primality.) Abstracting, a problem is in RP if there exists a polynomial time randomized algorithm A such that on input x, if the answer is 0, then A always outputs 0, and, if the answer is 1, then A outputs 1 with probability at least 1/2. Thus, if we receive an answer of 1, we can believe it, but an answer of 0 may be wrong. The Miller-Rabin algorithm shows that testing a number n for compositeness is in RP.
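To make the one-sided error concrete, here is a minimal sketch of one round of the Miller-Rabin test for integers n >= 2, following the standard textbook formulation; the function names are made up for this illustration. A round answers "composite" only when it has found a witness, so a prime input is never called composite.

```python
import random

def miller_rabin_round(n, rng=random):
    """One randomized round for n >= 2; "composite" means a witness was found."""
    if n in (2, 3):
        return "prime"
    if n % 2 == 0:
        return "composite"
    # Write n - 1 = 2^r * d with d odd.
    r, d = 0, n - 1
    while d % 2 == 0:
        r += 1
        d //= 2
    a = rng.randrange(2, n - 1)      # random candidate witness
    x = pow(a, d, n)
    if x in (1, n - 1):
        return "prime"
    for _ in range(r - 1):
        x = pow(x, 2, n)
        if x == n - 1:
            return "prime"
    return "composite"

def probably_prime(n, k=20):
    """Answer True only if all k independent rounds say "prime"."""
    return all(miller_rabin_round(n) == "prime" for _ in range(k))
```

Repeating the round k times is the amplification described above: a composite n survives all k rounds only with small probability.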

The hardness result shows that if we have a polynomial time proper PAC learning algorithm for 3-term DNF, we can derive from it a polynomial time randomized algorithm to solve an NP-complete problem, graph 3 coloring, which would show that NP = RP. The graph 3-coloring problem is the following: given an undirected graph G = (V,E) with vertices V and edges E, determine whether we can assign 3 colors (say, green, red and blue) to the vertices in such a way that no edge e from E has vertices of the same color at its two endpoints. This problem is known to be NP-complete, and if we could find an RP algorithm to solve it, then NP = RP.

Given a graph G = (V,E), suppose V = {v_1, v_2, ..., v_n}. We transform G into a learning problem for a Boolean concept with variables x_1, x_2, ..., x_n as follows. For each vertex v_i, construct a Boolean vector p(i) of length n that has 1's in all positions except position i, where it has a 0. For each edge (v_i,v_j), construct a Boolean vector n(i,j) of length n that has 1's in all positions except positions i and j, where it has 0's. We now consider the sample S in which each of the vectors p(i) is a positive (+) example, and each of the vectors n(i,j) such that (v_i,v_j) is an edge is a negative (-) example. The key claim about this sample is the following: there is a 3-term DNF formula consistent with S if and only if there is a 3 coloring of the original graph G.

The proof of the key claim is in two parts, for the "if" and the "only if". (If): Suppose G is 3 colorable; let c: V -> {red, blue, green} be a legal 3 coloring of the graph. We construct a 3-term DNF formula as follows. Let T_r consist of all the variables x_i such that v_i is NOT colored red, let T_b consist of all the variables x_i such that v_i is NOT colored blue, and let T_g consist of all the variables x_i such that v_i is NOT colored green. Consider the 3-term DNF formula f = T_r v T_b v T_g. For each i, vertex v_i has some color, say red, and the corresponding positive example p(i) has 1's in all positions besides i and therefore makes the term T_r and the formula f true. Thus f is consistent with all the positive examples in S. Consider any edge (v_i,v_j) of G. The two endpoints v_i and v_j must have different colors, say blue and green, respectively. Then the negative example n(i,j) has a 0 in position i, and x_i occurs in terms T_r and T_g, so n(i,j) makes both of these terms false. But n(i,j) also has a 0 in position j, and x_j occurs in terms T_r and T_b, so n(i,j) makes both of these terms false. Thus, n(i,j) makes all three terms T_r, T_b, and T_g false, and makes f false. Hence f is consistent with all of the negative examples in S as well. Thus f is a 3-term DNF formula consistent with the sample S. (Only if): Suppose there is a 3-term DNF formula f = T_1 v T_2 v T_3 consistent with the sample S. We use f to construct a 3 coloring of the graph G (using the colors 1, 2, and 3) as follows. For each i, let v_i be colored with color 1 if p(i) makes term T_1 true, color 2 if p(i) makes term T_2 true, or color 3 if p(i) makes term T_3 true (if there is more than one choice of colors, choose one arbitrarily). Because for each i, p(i) makes the formula f true, it must make at least one of the terms T_1, T_2 or T_3 true, so every vertex v_i receives one of the three colors. 
Consider any edge (v_i,v_j) of G, and suppose v_i and v_j are both colored the same color, say 1. Thus, both p(i) and p(j) make term T_1 true, so neither x_i nor x_j nor their complements can occur in term T_1 (because x_i is 0 in p(i) and 1 in p(j) while x_j is 0 in p(j) and 1 in p(i)). But n(i,j) differs from p(i) only in position j, and T_1 does not mention x_j, so because p(i) makes T_1 true, n(i,j) must also make T_1, and therefore f, true, contradicting the assumption that f is consistent with the sample S. Thus, for every edge (v_i,v_j) of G, the two endpoints receive different colors, and we conclude that G is 3 colorable. This concludes the proof that G is 3 colorable if and only if there is a 3 term DNF formula consistent with the sample S we derived from G.
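The construction of S and the "if" direction can be checked on a small graph. This is a minimal sketch, assuming vertices are numbered 1..n, edges are pairs (i, j), and a coloring is a dictionary from vertex number to color name; all function names are made up for this illustration. A term is represented as the list of (unnegated) variable indices it contains, which is all the "if" construction needs.

```python
def build_sample(n, edges):
    """Positive examples p(i) and negative examples n(i,j), vertices 1..n."""
    def vector(zero_positions):
        return tuple(0 if k in zero_positions else 1 for k in range(1, n + 1))
    positives = [vector({i}) for i in range(1, n + 1)]
    negatives = [vector({i, j}) for (i, j) in edges]
    return positives, negatives

def formula_from_coloring(n, coloring):
    """One term per color: the variables x_i whose vertex v_i lacks that color."""
    colors = ("red", "blue", "green")
    return [[i for i in range(1, n + 1) if coloring[i] != c] for c in colors]

def evaluate(formula, example):
    """A term is true when all of its variables are 1; f is the "or" of its terms."""
    return any(all(example[i - 1] == 1 for i in term) for term in formula)

def consistent(formula, positives, negatives):
    return (all(evaluate(formula, p) for p in positives)
            and not any(evaluate(formula, q) for q in negatives))

# A triangle with a legal 3 coloring.
triangle = [(1, 2), (2, 3), (1, 3)]
pos, neg = build_sample(3, triangle)
f = formula_from_coloring(3, {1: "red", 2: "blue", 3: "green"})
print(consistent(f, pos, neg))   # True
```

If the coloring gives both endpoints of some edge the same color, the same construction yields a formula that is made true by the corresponding negative example.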

Given this reduction, how can we use an algorithm for proper PAC learning of 3 term DNF formulas to get a randomized algorithm for 3 coloring a given graph? Suppose A is a polynomial time, proper PAC learning algorithm for 3 term DNF formulas. Let a graph G = (V,E) be given: our task is to decide whether G is 3 colorable. We use G to construct the sample S as described above -- the sample can be constructed from the graph in polynomial time. Let N denote the total number of examples (positive and negative) in S. We simulate running the algorithm A with delta = 1/2, epsilon = 1/(2N), and a distribution D that assigns equal probability 1/N to each element of S, and 0 to all other Boolean n vectors. If A halts and outputs a formula f, we check whether f is a 3 term DNF formula, and, if so, whether it is consistent with the sample S; if so, we answer "yes, G is 3 colorable" and otherwise we answer "no, G is not 3 colorable." (In case A does not halt by its time bound (polynomial in n and N), we also give the "no" answer.) We claim that the algorithm we just described is an RP algorithm for graph 3 colorability. It clearly runs in polynomial time. If G is 3 colorable, then the sample S we construct is consistent with some 3 term DNF formula, and with probability at least (1 - delta) = 1/2, A must output a 3 term DNF formula that has error (wrt D) of at most epsilon = 1/(2N). But because each sample point has probability 1/N, this means that the formula cannot be wrong on any point in S, that is, with probability at least 1/2, A must output a 3 term DNF formula f consistent with S -- our algorithm will check that f is a 3 term formula consistent with S and correctly output "yes, G is 3 colorable." If G is not 3 colorable, then there is NO 3 term DNF formula f consistent with S, so our algorithm will always output "no, G is not 3 colorable" in this case. 
In other words, when the answer should be "yes" we answer "yes" with probability at least 1/2, and when the answer should be "no" we always answer "no". This concludes the proof that if 3 term DNF formulas are properly PAC learnable in polynomial time, then NP = RP. (A different reduction can be used to show that proper PAC learning of 2 term DNF formulas is similarly hard.)

2/6/09 Lecture 11. An application of Markov's inequality; 3-term DNF is not properly PAC learnable in polynomial time if NP is not equal to RP.

We first revisited the following question. If X is a random variable taking values between 0 and 1, and E(X) >= 1/4, can we put a lower bound on the probability that X is at least 1/8? Markov's inequality does not apply immediately, but we can consider the random variable X' = (1 - X), which also takes values between 0 and 1 and has expected value E(X') <= 3/4. Also, Pr(X' >= 7/8) = Pr(X <= 1/8). We can use Markov's inequality and the fact that 7/8 = (7/6)*(3/4) to say

    Pr(X' >= 7/8) <= (6/7)
Thus, Pr(X <= 1/8) <= 6/7 and Pr(X > 1/8) >= 1/7, which is a bound of the kind we wanted.
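The bound of 1/7 cannot be improved in general. As a small check with exact arithmetic, consider a hypothetical random variable (chosen here only for illustration) that equals 1/8 with probability 6/7 and 1 with probability 1/7.

```python
from fractions import Fraction as F

# Hypothetical distribution: X = 1/8 with probability 6/7, X = 1 with probability 1/7.
dist = {F(1, 8): F(6, 7), F(1): F(1, 7)}

expectation = sum(value * prob for value, prob in dist.items())
markov_bound = (1 - expectation) / F(7, 8)     # bound on Pr(X' >= 7/8), X' = 1 - X
prob_above = sum(prob for value, prob in dist.items() if value > F(1, 8))

print(expectation)    # 1/4
print(markov_bound)   # 6/7
print(prob_above)     # 1/7
```

Here E(X) is exactly 1/4 and Pr(X > 1/8) is exactly 1/7, so the inequality holds with equality.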

Formulas in Disjunctive Normal Form (DNF) are an "or" of terms, each of which is an "and" of literals. A 3-term DNF formula has at most 3 terms, for example,

    x_1x_3'x_5x_7   v   x_1'x_2x_4x_6'x_8'   v   x_3x_5'
(Note that the key element is that there are at most 3 terms -- each term may contain any number of literals.) We consider the problem of learning the class C of concepts represented by 3-term DNF formulas over the variables x_1, x_2, ..., x_n.

Proper versus improper learning. When the target class is C and a learning algorithm is guaranteed to output a concept in the class C, then learning is "proper." If the learning algorithm is permitted to output a concept in a larger class H containing C, then the learning is "improper." It is not difficult to "improperly" learn the class of 3-term DNF formulas. In particular, we can use distributivity of "or" over "and" in propositional logic to construct, for any 3-term DNF formula, an equivalent CNF formula with at most 3 literals per clause. As an example of how this works for a 2-term DNF formula:

 abc v def = (a v d)(a v e)(a v f)(b v d)(b v e)(b v f)(c v d)(c v e)(c v f)
One of the first classes we saw was PAC-learnable was the class of CNF formulas with at most k literals per clause, for constant k. Thus, if we "relax" the output requirement to be CNF with at most 3 literals per clause, the class of 3-term DNF formulas is learnable.
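The distributive expansion takes one literal from each term in every possible way. A minimal sketch, assuming each term is given as a sequence of literal names (a string of single-letter literals, or a list of strings); the function name is made up for this illustration.

```python
from itertools import product

def dnf_to_cnf(terms):
    """Distribute "or" over "and": one clause per choice of a literal from each term."""
    return [tuple(choice) for choice in product(*terms)]

clauses = dnf_to_cnf(["abc", "def"])
print("".join("(" + " v ".join(c) + ")" for c in clauses))
# (a v d)(a v e)(a v f)(b v d)(b v e)(b v f)(c v d)(c v e)(c v f)
```

With 3 terms every clause has exactly one literal from each term, hence at most 3 literals per clause.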

What if we insist on being "proper", that is, outputting a hypothesis that is a 3-term DNF formula? Unless NP = RP, there is no polynomial time proper PAC learning algorithm for the class of 3-term DNF formulas: the proof is in the next lecture. The problem is not the required sample size -- we can bound the number of 3-term DNF formulas by (3^n)^3, so that the cardinality based sample size bound gives O((1/epsilon)(n + ln(1/delta))). The difficulty is in the computational problem of actually finding a 3-term DNF formula consistent with a given labeled sample.

2/4/09 Lecture 10. PAC learning half spaces in d dimensions; a VC dimension lower bound on sample size for PAC learning.

PAC learning half spaces in R^d. We first observed that we can use the VC dimension upper bound to get an algorithm to learn half spaces in R^d. The VC dimension of half spaces in R^d is (d+1), which means that if we draw at least

    m >= k(((d+1)/epsilon) ln (1/epsilon) + (1/epsilon) ln (1/delta))
labeled samples and find a half space consistent with them, then with probability at least (1 - delta), the error of our hypothesis half space will be at most epsilon. How do we go about finding a half space consistent with given labeled samples? We are looking for a set of real numbers a_1, a_2, ..., a_d, and b, that correctly label each of our sample points (x_1, x_2, ..., x_d) in the sense that a_1x_1 + ... + a_dx_d >= b if and only if the label of (x_1, ..., x_d) is positive. Thus we can treat the a_i's and b as unknown and set up a system of inequalities, one for each labeled sample point, and ask for a solution of the inequalities. One way to solve such a system is to apply an algorithm for linear programming, a problem for which we know polynomial time (and practical) algorithms.

To get a lower bound on the sample size required for PAC learning, we first consider the following situation. Let X consist of the integers {1, 2, ..., d} and let C consist of all subsets of X. Then clearly C shatters X and the VC dimension of C is d. Let D be the uniform distribution over X and let epsilon and delta be positive numbers. It seems intuitively clear that we have to draw (1-epsilon)|X| of the points in X in order to know enough labels to output a concept that has error at most epsilon, because knowing the labels of some of the points does not help us guess labels on the rest. How do we prove this?

We'll prove something a bit weaker, for specific values of epsilon and delta. Let epsilon = 1/8 and delta = 1/8. We'll show that no algorithm that draws at most d/2 samples can do PAC learning with these parameters. Suppose A is a learning algorithm that draws at most d/2 samples and outputs a concept h from C. We'd like to find a concept c on which A does badly, that is, with probability greater than 1/8, A outputs a concept h whose error with respect to c is more than 1/8. Our method is to show that A does badly on average, and therefore there is at least one concept c on which A does badly.

That is, we consider choosing a concept c from C uniformly at random, and running A on it. Because the choice of c and the samples drawn by A are independent, we can consider that they happen in the other order -- that is, each time A draws a previously unlabeled point, we flip a coin to decide on a label for it. Thus, if we consider the hypothesis h output by A after seeing at most d/2 points, the expected error of h is at least 1/4. (For each of the at least d/2 points not seen by A, the value of h is correct with probability 1/2, so the expected value of Pr_D(h XOR c) is at least 1/4.) Thus, there is at least one concept c in C for which the expected error is at least 1/4.

So what does the "expected error" mean for epsilon and delta? If an algorithm A does PAC learning, then if we run A and then use its output hypothesis to predict one more point randomly drawn according to D, the probability of error is at most

    delta + (1 - delta)epsilon
(To see this, let p be the probability that A outputs an epsilon-bad hypothesis, so (1 - p) is the probability that A outputs an epsilon-good hypothesis. Then an upper bound on the probability of error in the next prediction is (p + (1 - p)epsilon). We know that p <= delta, and the result follows.) Thus, if A actually succeeded in PAC learning with epsilon = 1/8 and delta = 1/8, we would have the probability of error in predicting the next point
   <= 1/8 + (7/8)1/8 = 15/64 < 1/4,
contradicting the fact that the probability of error predicting the next point when the concept is c is at least 1/4. Thus, PAC learning in this situation requires that the algorithm draw at least d/2 points.

But surely this is a very specialized situation? Not really -- if C is a concept class of VC dimension d, then we may choose a set S of d points that is shattered by C and consider the distribution that is uniform on S and 0 elsewhere. Then the argument above shows that when delta = 1/8 and epsilon = 1/8, any PAC learning algorithm must draw at least d/2 points. This shows that the sample size for PAC learning must grow at least linearly with the VC dimension of the concept class.

To extend this bound to Omega(d/epsilon), we may consider a shattered set S = {p_1, p_2, ..., p_d} and a probability distribution D that puts most of the weight on p_1, that is, D(p_1) = (1 - 8*epsilon), divides the rest of the weight evenly between the other points of S, that is, D(p_i) = 8*epsilon/(d-1) for i = 2, ..., d, and assigns 0 probability to other points. In this circumstance, a PAC learning algorithm must collect a linear fraction of the points {p_2, ..., p_d}, but the probability of getting one of them on a particular draw is only 8*epsilon. The expected number of draws to get one of them is 1/(8*epsilon), and a linear fraction of them must be collected, which intuitively shows that the lower bound on the sample size in this case is Omega(d/epsilon). (See BEHW for details.)

2/2/09 Lecture 9. Discussion of Assignment #2.

Discussion of the solutions of problems 1 and 2 of Assignment #2. Students may resubmit solutions of problem 3 with Assignment #3.

1/30/09 Lecture 8. VC dimension and sufficient sample size.

In the Blumer, Ehrenfeucht, Haussler and Warmuth paper we have the following theorem. There exists a constant k such that for any domain X and concept class C over X of VC dimension d, for any probability distribution D over X and any target concept c from C, and any positive numbers epsilon and delta, if we draw

m >= k((d/epsilon) ln (1/epsilon) + (1/epsilon) ln (1/delta))
examples according to D, classified according to c and find a hypothesis h in C that is consistent with these m labeled examples, then with probability at least (1-delta), we have Pr_D(h XOR c) <= epsilon. (That is, an algorithm that outputs such an h will succeed in PAC learning.)

The VC dimension of a finite concept class C is at most log_2(|C|). This holds because to label d points in all possible ways, C must contain at least 2^d different concepts, so 2^d <= |C|. (Note that this gives an upper bound of (log_2(3))n on the VC dimension of conjunctions of literals over x_1, x_2, ..., x_n, because we earlier argued that there are no more than 3^n such conjunctions.) This gives us a way to compare the VC dimension sample size bound with the sample size bound based on cardinality:

((ln |C|)/epsilon + (1/epsilon) ln (1/delta)).
In this comparison, log(|C|) is analogous to the VC dimension d; the differences are the factor of log(1/epsilon) and the constant factor k. While we won't prove the VC dimension sample bound, we will look at some of the reasoning behind it.

The first part of the proof gives a relationship between the VC dimension of a concept class C and the number of possible labelings of m points by concepts from C. Consider the domain of real numbers and the concept class C_1 of closed intervals [a,b] of real numbers; recall that this concept class has VC dimension 2. Suppose we have m real points p_1 < p_2 < ... < p_m. How many different labelings of these points are there by concepts from C_1? Certainly all the points can be labeled - by an interval that contains none of them. And any interval of one or more points: p_i, p_{i+1}, ..., p_j can be labeled + by the interval [p_i,p_j], which labels the rest of the points -. Thus, there are 1 + m + (m choose 2) possible labelings, where (m choose 2) is the number of unordered pairs chosen from m distinct things. Because (m choose 2) is m(m-1)/2, we see that the number of labelings is O(m^2). In general, the result is that if a concept class has VC dimension d, then there are O(m^d) possible labelings of m points by concepts from the class. More precisely, the number of possible labelings of m points by concepts from a class of VC dimension d is at most

    1 + m + (m choose 2) + .... + (m choose d)  =  O(m^d).
(See the paper of BEHW or the book of Kearns and Vazirani for a proof.)
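For the class of closed intervals the count can be verified by brute force. A minimal sketch, with function names made up for this illustration, that enumerates the labelings induced by intervals on m points and compares the count with the bound for d = 2:

```python
from math import comb

def interval_labelings(points):
    """All labelings of the points induced by closed intervals [a, b]."""
    pts = sorted(points)
    labelings = {tuple(False for _ in pts)}          # interval containing no point
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            a, b = pts[i], pts[j]
            labelings.add(tuple(a <= p <= b for p in pts))
    return labelings

def labeling_bound(m, d):
    """1 + m + (m choose 2) + ... + (m choose d)."""
    return sum(comb(m, i) for i in range(d + 1))

for m in range(1, 8):
    print(m, len(interval_labelings(range(m))), labeling_bound(m, 2))
```

For intervals the two numbers agree for every m, so the general bound is met exactly by this class.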

The second part of the proof uses epsilon nets, a concept borrowed from computational geometry. Let X be a domain, C a concept class over X, D a probability distribution over X and epsilon > 0 a positive number. In this situation, an epsilon-net for C is a set S of points of X such that for every concept c in C with Pr_D(c) >= epsilon, S contains at least one point of c. That is, S contains at least one point in every "sufficiently important" concept from C. As an example, suppose X is the unit interval [0,1], D is the uniform distribution over X, and C is the concept class of closed subintervals [a,b] of [0,1]. Then if we take S to contain all points of the form j(epsilon)/2, for integers j with 0 <= j <= 2/epsilon, any concept [a,b] of probability weight w >= epsilon has length w and must contain at least one point of S. Thus in this situation, S is an epsilon-net for this concept class. Rather than carefully constructing an epsilon-net, we can draw a set S of "enough" points from D, and the result will be an epsilon net for C with high probability.
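The grid example can be checked with exact arithmetic. This sketch, assuming epsilon = 1/10 and helper names made up for this illustration, tests intervals of length exactly epsilon (the hardest case, since any longer interval contains one of these) with left endpoints on a fine grid.

```python
from fractions import Fraction as F
from math import floor

def grid_net(eps):
    """Points j*eps/2 for integers 0 <= j <= 2/eps."""
    return [j * eps / 2 for j in range(floor(2 / eps) + 1)]

def hits(net, a, b):
    """True when the closed interval [a, b] contains a point of the net."""
    return any(a <= p <= b for p in net)

eps = F(1, 10)
net = grid_net(eps)
steps = 1000
left_endpoints = [F(k, steps) * (1 - eps) for k in range(steps + 1)]
print(all(hits(net, a, a + eps) for a in left_endpoints))   # True
```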

To connect epsilon-nets and errors in hypotheses, we let c be the target concept and consider the class Delta(c) of concepts of the form (c XOR h) where h is in C. Note that (c XOR h) is the symmetric difference of c and h -- the points in c but not h together with the points in h but not c. If c is the target concept, then (c XOR h) is the error region of the hypothesis h -- the set of points where h makes the wrong prediction. For any target concept c, the class Delta(c) has the same VC dimension as C. We looked at the concepts in Delta(c) for the class of axis parallel rectangles and a fixed axis parallel rectangle c. Given a distribution D over X, we can also consider Delta_{epsilon}(c), which is all the sets (c XOR h) in Delta(c) such that Pr_D(c XOR h) is at least epsilon. (To ensure that the sets in Delta(c) are measurable, we have to make the mild assumption that C is "well-behaved" -- see the Appendix to BEHW for details.) This is the set of error regions of probability weight at least epsilon for hypotheses h from C. If our sample S of c is an epsilon-net for Delta_{epsilon}(c), then every possible hypothesis h from C with an error of at least epsilon will be inconsistent with the sample S, because S contains some point of (c XOR h), and h disagrees with c on this point. Thus, a bound on the VC dimension gives a bound on the number of labelings of m points and this in turn is used to show that the given bound on sample size is sufficient to ensure an epsilon-net for the error regions of concepts in C with respect to the target c.

1/28/09 Lecture 7. Learning 1-decision lists.

A 1-decision list over the variables x_1, x_2, ..., x_n sequentially tests literals to determine the value of a Boolean function. One way to think about them is an if-then-else structure, for example the following. (The negation of literal x_i will be written x_i'.)

    if x_1 then 0
    elseif x_3' then 1
    elseif x_2 then 0
    elseif x_4' then 0
    else 1
Here, the assignment x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 0 will receive the output 0 because x_1 is tested and found to be false, x_3' is tested and found to be false (since x_3 = 1), and x_2 is tested and found to be true, causing the output 0 to be chosen. Another way to represent a 1-decision list is as a graph; the same 1-decision list could be pictured as follows.
      x_1  --  x_3'  --  x_2  --  x_4' --  1
       |        |         |        |
       0        1         0        0
Here the downward transitions correspond to the literal being true and the rightward transitions correspond to the literal being false. Clearly we can write down a DNF formula representing the same Boolean function by expressing an "or" of the conditions necessary to reach each output node labeled with 1. In this case, we get (x_1'x_3' v x_1'x_3x_2'x_4). (Dually, we could write down a CNF formula representing this function.)
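
The equivalence between the example list and its DNF formula can be checked by brute force. This is a small sketch under an assumed encoding that is not from the lecture: a literal is a pair (variable index, value), so x_3' is (3, 0), and an assignment is a dict from variable index to 0 or 1.

```python
def eval_decision_list(rules, default, x):
    """Evaluate a 1-decision list on assignment x.

    rules is a list of ((var, val), out) pairs; the literal (var, val)
    is true on x when x[var] == val. The first true literal decides the
    output; if none is true, the default (the "else" value) is returned.
    """
    for (var, val), out in rules:
        if x[var] == val:
            return out
    return default

# The example list: if x_1 then 0, elseif x_3' then 1,
# elseif x_2 then 0, elseif x_4' then 0, else 1.
EXAMPLE_RULES = [((1, 1), 0), ((3, 0), 1), ((2, 1), 0), ((4, 0), 0)]
EXAMPLE_DEFAULT = 1

def example_dnf(x):
    """The DNF formula x_1'x_3' v x_1'x_3x_2'x_4 for the same function."""
    term1 = x[1] == 0 and x[3] == 0
    term2 = x[1] == 0 and x[3] == 1 and x[2] == 0 and x[4] == 1
    return int(term1 or term2)
```

Looping over all 16 assignments and comparing `eval_decision_list(EXAMPLE_RULES, EXAMPLE_DEFAULT, x)` with `example_dnf(x)` confirms that the two representations agree.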

Can every concept representable as a conjunction of literals also be represented as a 1-decision list? Yes, for example for the conjunction x_1x_3'x_4, we construct the following 1-decision list:

    x_1' --  x_3  --  x_4'  --  1
     |        |        |
     0        0        0
That is, we test the complements of the literals in the conjunction in some order, and if one of them is true, then the output is 0, while if all of them are false, the output is 1. Dually, 1-decision lists can represent all disjunctions of literals, for example (x_1 v x_2' v x_3) can be represented by the following 1-decision list:
    x_1  --  x_2'  --  x_3  --  0
     |        |         |
     1        1         1
Here we test the literals in the disjunction one by one, and if any of them is true, the output is 1, while if all of them are false, the output is 0.

1-decision lists are PAC learnable. We use the general strategy: draw enough labeled examples (where our sample size bounds tell us how many are enough) and find a 1-decision list consistent with the examples. Suppose we have the following labeled examples.

    x_1  x_2  x_3  x_4  label
A.   0    0    1    0     -
B.   0    0    1    1     +
C.   0    1    1    0     -
D.   0    0    0    0     +
We look for a literal (over the variables x_1, x_2, x_3, x_4) that "makes progress" in the following sense: the literal is true in at least one of the remaining examples, and for all of the remaining examples in which the literal is true, their labels are either all + or all -. For example, literal x_4 is true on example (B) and on no other example, so we can start with x_4:
    x_4
     |
     1
Now we can eliminate all the examples (in this case, just (B)) that it accounts for:
    x_1  x_2  x_3  x_4  label
A.   0    0    1    0     -
C.   0    1    1    0     -
D.   0    0    0    0     +
Now we look for another literal that makes progress with respect to these examples. We can choose x_3 because it is true in examples (A) and (C), which are both labeled -. Thus we now have the following partial 1-decision list:
    x_4  --  x_3
     |        |
     1        0
Eliminating the two examples (A) and (C), we have just one example left (D), which is labeled +, so we can complete our 1-decision list with an else clause of 1:
    x_4  --  x_3  --  1
     |        |
     1        0

If we can show that there is always a literal to choose that "makes progress", then we can conclude that this algorithm will succeed in finding a 1-decision list consistent with any labeled sample of a target 1-decision list. Let c be the target 1-decision list, and suppose we have partially constructed a 1-decision list and eliminated some examples as described above, leaving a set of remaining examples R. Consider which output nodes of c the examples in R reach, and consider the leftmost literal in c that has some examples from R that take its "true" branch. Clearly, some examples in R make this literal true, and they must all be labeled the same way, because their label is assigned by c. Thus, at least this literal "makes progress." Hence the algorithm will find a 1-decision list consistent with the sample.
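
The greedy procedure can be sketched as follows. This is one possible rendering, not the lecture's own code: examples are tuples of 0/1 values with 0-based variable indices, labels are 0 or 1, and the names `find_literal`, `learn_decision_list` and `predict` are illustrative. The literal chosen at each step depends on the search order, so the list found may differ from the one traced above while still being consistent with the sample.

```python
def find_literal(remaining, n):
    """Return ((index, value), output) for a literal that makes progress:
    it is true on at least one remaining example and all the examples it
    is true on share one label. Return None if there is no such literal."""
    for i in range(n):
        for v in (1, 0):
            labels = {label for bits, label in remaining if bits[i] == v}
            if len(labels) == 1:
                return (i, v), labels.pop()
    return None

def learn_decision_list(examples, n):
    """Greedily build a 1-decision list consistent with the examples.

    Returns (rules, default). Raises ValueError when no literal makes
    progress, which by the argument above cannot happen if the labels
    come from some 1-decision list.
    """
    remaining = list(examples)
    rules = []
    # Stop once the remaining examples all carry the same label.
    while len({label for _, label in remaining}) > 1:
        found = find_literal(remaining, n)
        if found is None:
            raise ValueError("no 1-decision list is consistent with the sample")
        (i, v), out = found
        rules.append(((i, v), out))
        remaining = [(bits, label) for bits, label in remaining if bits[i] != v]
    default = remaining[0][1] if remaining else 0
    return rules, default

def predict(rules, default, bits):
    """Output of the learned list on one example."""
    for (i, v), out in rules:
        if bits[i] == v:
            return out
    return default
```

Each pass through the loop removes at least one example, so the loop runs at most as many times as there are examples, and each pass scans 2n literals.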

It is clear that this algorithm runs in polynomial time in n and the sample size, so what remains is to show that the sample size is polynomial in n, 1/epsilon and 1/delta. We use the bound for a finite concept class C, that is, m >= (1/epsilon)(ln |C| + ln(1/delta)). To apply this, we need to bound the number of different 1-decision lists on n variables. A sufficient upper bound is 4^n(n+1)!, which gives us a bound of O(n log n) on ln |C|, which gives us our polynomial upper bound on the sample size.
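
To see how the bound behaves, the arithmetic can be sketched in a few lines. The function names are illustrative; the only assumption beyond the text is the use of `math.lgamma` to compute ln((n+1)!) without forming the factorial.

```python
import math

def ln_decision_list_count_bound(n):
    """ln of the upper bound 4^n (n+1)! on the number of 1-decision lists.

    math.lgamma(n + 2) equals ln((n+1)!), so the result grows like n log n.
    """
    return n * math.log(4) + math.lgamma(n + 2)

def decision_list_sample_size(n, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon)(ln |C| + ln(1/delta)),
    using the upper bound above in place of ln |C|."""
    return math.ceil((ln_decision_list_count_bound(n) + math.log(1 / delta)) / epsilon)
```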

1/26/09 Lecture 6. PAC Learning: VC dimension Examples.

Recall from last time the definition of the VC dimension of a concept class C over domain X. It is the maximum d such that there exists a set of d points from X that can be "shattered", that is, labeled in all possible ways, by concepts from C. If there is no such maximum (that is, arbitrarily large sets of points from X can be shattered) then we say the VC dimension is infinite.

Examples: closed intervals and axis parallel rectangles. Let X_1 be the set of real numbers and C_1 be the set of closed intervals [a,b] of real numbers. As we saw in the preceding lecture, the VC dimension of C_1 is 2. Let X_2 be the set of pairs (x,y) of real numbers and C_2 be the set of axis parallel rectangles, that is, sets of points [a,b]x[c,d]. The VC dimension of C_2 is 4. To see this, we have to prove two things: that there is SOME set of 4 points that is shattered by C_2, and that there is NO set of 5 points that is shattered by C_2. Taking 4 points in a diamond shape (for example: (0,0), (-1,1), (1,1), (0,2)), we can see that each of the 16 possible labelings of these 4 points can be achieved by some concept from C_2. On the other hand, let S be any 5 points from X_2, and let x_m be the minimum of their x coordinates, x_M be the maximum of their x coordinates, y_m the minimum of their y coordinates and y_M be the maximum of their y coordinates. Then we may choose at most 4 points from S as follows: let p_1 be a point with x coordinate x_m, p_2 a point with x coordinate x_M, p_3 be a point with y coordinate y_m and p_4 a point with y coordinate y_M. If we label all these points + and the remaining (at least one) point(s) in S as -, then no concept from C_2 can achieve this labeling. If p is a point labeled - by this process, its x coordinate lies between x_m and x_M and its y coordinate lies between y_m and y_M, so no axis parallel rectangle can label p_1, p_2, p_3 and p_4 as + and p as -.
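
Both halves of the argument can be checked by brute force on small point sets. The sketch below rests on one observation not spelled out above: a labeling is achievable by an axis parallel rectangle exactly when the bounding box of the + points contains no - point. The function names are illustrative.

```python
from itertools import product

def rectangle_realizes(points, labels):
    """True if some axis parallel rectangle contains exactly the points
    labeled True. It suffices to test the bounding box of the + points,
    since every rectangle containing them contains that box."""
    pos = [p for p, lab in zip(points, labels) if lab]
    if not pos:
        return True  # a rectangle placed away from all the points
    x_lo, x_hi = min(p[0] for p in pos), max(p[0] for p in pos)
    y_lo, y_hi = min(p[1] for p in pos), max(p[1] for p in pos)
    return not any(
        x_lo <= x <= x_hi and y_lo <= y <= y_hi
        for (x, y), lab in zip(points, labels)
        if not lab
    )

def shattered_by_rectangles(points):
    """True if all 2^k labelings of the k points are achievable."""
    return all(
        rectangle_realizes(points, labels)
        for labels in product((False, True), repeat=len(points))
    )
```

Applied to the diamond (0,0), (-1,1), (1,1), (0,2), `shattered_by_rectangles` reports that the set is shattered; adding any fifth point makes it report the opposite.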

Another example: convex polygons in the plane. Let X_3 be the set of points (x,y) such that x and y are real numbers and C_3 be the class of all sets of all points within or on the boundary of a convex polygon in the plane. The VC dimension of C_3 is infinite, which we can see as follows. Given n, let S consist of n points evenly spaced around a unit circle. Clearly any labeling in which at most 2 of the points are labeled + is achievable by a concept from C_3. Otherwise, consider a labeling in which at least 3 points of S are labeled +. The polygon with vertices consisting of the + points is convex, and includes all the + points and none of the - points. Thus, for any positive integer n, n points can be shattered by C_3, and its VC dimension is infinite.

Another example: half spaces in the plane, in space, and in R^d. Let X_4 be the set of points (x,y) such that x and y are real numbers and C_4 be the class of sets of points that are on one side of a line, for example, x+2y >= 2. In 3 dimensions, the corresponding class is sets of points that are on one side of a plane, and in d dimensions, the corresponding class is sets of points on one side of a hyperplane. In each case, the algebraic characterization of the set of points can be written as a_1x_1 + a_2x_2 + ... + a_dx_d >= b, where the a_i's and b are real numbers, and the variables are x_1, ..., x_d. In the plane, we can see that if we choose 3 points not on a line, we can label them in all possible ways with half spaces in the plane. We'll come back to the general result for R^d later.

Another example: conjunctions of literals over the variables x_1, x_2, ..., x_n.

1/23/09 Lecture 5. PAC Learning: finite concept classes and VC dimension.

The PAC learning algorithm we saw last time for conjunctions of literals can be used to establish the PAC-learnability of k-CNF formulas by a very general reduction. Consider the case of 2-CNF over 3 variables: x1, x2, and x3. There are 18 possible nonempty clauses (using y' to denote the negation of y.)

x1  x2  x3  x1'  x2'  x3'  
(x1 v x2)  (x1 v x3)  (x2 v x3)  (x1' v x2)  (x1' v x3)  (x2' v x3)  
(x1 v x2')  (x1 v x3')  (x2 v x3')  (x1' v x2')  (x1' v x3')  (x2' v x3')
If we assign new variables, say z1, z2, ..., z18, to represent these clauses in the order listed, then we can translate an assignment to x1, x2, x3, say, x1 = 1, x2 = 1, and x3 = 0, into an assignment to the new variables, namely, z1 = 1, z2 = 1, z3 = 0, z4 = x1' = 0, z5 = x2' = 0, z6 = x3' = 1, z7 = (x1 v x2) = 1, z8 = (x1 v x3) = 1, and so on, up to z18 = (x2' v x3') = 1. Given a sample of labeled assignments to x1, x2 and x3, we can translate them to assignments to z1, z2, ..., z18 (with the same labels) and use the algorithm to learn a conjunction of literals over the zj's. We can translate a hypothesis like (z6 AND z10) back into a 2-CNF formula over x1, x2, and x3, namely (x3' AND (x1' v x2)). Thus, by adding new synthesized attributes (that are functions of the original attributes) we have been able to ensure a simpler form for the target concept (a conjunction of variables instead of a conjunction of clauses.) This transformation depends on the fact that PAC-learning works with respect to *any* distribution, because the distribution D over the assignments to x1, x2 and x3 is transformed into some other distribution D' over assignments to the zj's.
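
The translation from x-assignments to z-assignments is mechanical. A minimal sketch follows, with an assumed encoding (a literal is a pair (index, value) with 0-based indices, so x3' is (2, 0)) and illustrative function names. The six single-literal clauses come out in the order used above; the order of the two-literal clauses is whatever `itertools.combinations` produces and need not match the numbering z7, ..., z18.

```python
from itertools import combinations

def literals(n):
    """The 2n literals: x_1..x_n as (i, 1), then x_1'..x_n' as (i, 0)."""
    return [(i, 1) for i in range(n)] + [(i, 0) for i in range(n)]

def clauses_up_to_2(n):
    """All nonempty clauses with at most 2 literals, skipping pairs that
    use the same variable twice (such as x1 v x1')."""
    lits = literals(n)
    singles = [(lit,) for lit in lits]
    pairs = [(a, b) for a, b in combinations(lits, 2) if a[0] != b[0]]
    return singles + pairs

def expand(x, clauses):
    """Translate an assignment x (tuple of 0/1) into the values of the
    new variables, one per clause: 1 if some literal of the clause is true."""
    return tuple(int(any(x[i] == v for i, v in clause)) for clause in clauses)
```

For n = 3 this yields 18 clauses, and `expand((1, 1, 0), clauses_up_to_2(3))` begins with the six values 1, 1, 0, 0, 0, 1 computed above for z1 through z6.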

We can abstract the PAC learning problem as follows. There is a set of example points X, a class of possible target concepts C (a set of subsets of X), an unknown distribution D over X and a particular target concept c chosen from C. We draw m points p1, p2, ..., pm independently according to D and label pj as + (if pj is in c) or - (if pj is not in c). Then the goal is to output a concept c' such that with probability at least (1 - delta), Pr_D(c XOR c') is at most epsilon, where (c XOR c') is the set of points in the symmetric difference of the sets c and c'. (That is, points in c but not c' or in c' but not c.)

In the example of axis-parallel rectangles, X is the set of all points p in the unit square, and C is the set of all possible axis-parallel rectangles in the unit square. In the example of conjunctions of literals over the variables x1, x2, ..., xn, X is the set of all assignments of 0's and 1's to the xj's, and C is the set of all conjunctions of literals (identified with the set of assignments that make the conjunction true.)

In the case of a finite concept class C, we have the following. If we choose ANY hypothesis h from C that is consistent with at least m >= (1/epsilon) ln (|C|/delta) samples, then with probability at least (1 - delta) we have Pr_D(h XOR c) is at most epsilon. To prove this define a concept c' in C to be "bad" if Pr_D(c' XOR c) > epsilon. Consider any bad concept c' in C. What is the probability that c' will be consistent with m samples drawn according to D and labeled according to c? It will only be consistent if all m samples miss the points that c' and c classify differently, that is, the points in (c' XOR c). But because c' is bad, this region has probability at least epsilon of being hit with every sample drawn, so the probability that it is missed in m independent draws is bounded above by (1 - epsilon)^m. For our choice of m, this is at most delta/|C|. The probability that any particular bad concept is consistent with m samples is bounded above by delta/|C|, so by the union bound and the fact that there are at most |C| bad concepts, the probability that there exists any bad concept consistent with m samples is at most delta.

Is this bound, (1/epsilon) ln (|C|/delta), good or bad? In the case of conjunctions of literals over the variables x1, x2, ..., xn, we saw that |C| is at most 3^n, so this bound is (n/epsilon) ln 3 + (1/epsilon) ln (1/delta), which is in fact better than the bound we got previously for this problem. The quantity (ln |C|) is a constant times the number of bits to write down the name of a concept in C, so it seems good. The stumbling block here is the *computational problem* of finding a hypothesis consistent with the examples we have drawn -- the number of examples is reasonable.
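
The comparison between the two bounds for conjunctions can be made concrete with a short calculation. This is a sketch with illustrative function names; the second function uses the bound (2n/epsilon) ln (2n/delta) from the direct analysis of conjunctions in Lecture 4.

```python
import math

def finite_class_bound(n, epsilon, delta):
    """(n/epsilon) ln 3 + (1/epsilon) ln(1/delta), rounded up,
    using |C| <= 3^n for conjunctions of literals."""
    return math.ceil((n * math.log(3) + math.log(1 / delta)) / epsilon)

def direct_bound(n, epsilon, delta):
    """(2n/epsilon) ln(2n/delta), rounded up."""
    return math.ceil((2 * n / epsilon) * math.log(2 * n / delta))
```

For example, with n = 10, epsilon = 0.1 and delta = 0.05 the first function gives 140 and the second gives 1199.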

Another issue with this bound is that it does not apply if C is infinite (for example, for the class of axis-parallel rectangles.) In this case, we may look at another property of the hypothesis class, the Vapnik-Chervonenkis dimension (the VC-dimension). Given a concept class C, the VC-dimension of C is the largest d such that there EXISTS a set S of d points from X that can be labeled in all 2^d possible ways by concepts from C. For example, if X is the set of real numbers and C is the class of all closed intervals [a,b] in X, we can label 2 points, say 1 and 2, in all possible ways (because [-2,-1] excludes both, [0,1.3] includes 1 but not 2, [1.7, 2.1] includes 2 but not 1, and [0,3] includes both.) However, if we consider ANY three points, x, y, and z, if x < y < z, there is no single closed interval that labels x and z with + but labels y with -. Hence we conclude that the VC-dimension of this concept class is 2.

If we take X to be points in the plane and C to be axis-parallel rectangles, then if we take S to be a set of 4 points arranged in a diamond shape, for any subset of the 4 points we can choose an axis-parallel rectangle to include the subset and exclude the others, so VC-dimension of this C is at least 4. We will also see that no matter how we arrange 5 points in the plane, there will be some labeling of them that cannot be achieved with an axis-parallel rectangle, from which we conclude that the VC-dimension of this C is 4.

1/21/09 Lecture 4. PAC Learning: conjunctions of literals.

Zoo story: suppose we are learning to identify elephants by walking around a Zoo with someone who points to an animal and says "elephant!" or "not elephant!". We may have identified some set of attributes that we think are relevant to deciding elephanthood: has-fur, has-tail, has-two-legs, has-four-legs, has-no-legs, has-six-legs, has-feathers, has-scales, is-over-four-feet-tall, has-trunk, is-spotted, etc. (Notice that all these are phrased as Boolean ("yes" or "no") attributes, which suffice to express a finite number of discrete attribute values.) We assume that our set of attributes is sufficiently rich that a conjunction (AND) of literals gives a correct rule for distinguishing elephants from non-elephants in this population of animals, for example, (has-four-legs) AND NOT(has-fur) AND NOT(has-feathers) AND (has-trunk). (A "literal" is a variable or its negation.) How can we use the examples supplied by our informant to find a conjunction of attributes that (probably) (approximately) distinguishes elephants from non-elephants in this setting?

To abstract a bit, variables x1, x2, x3, x4, x5 might denote the attributes we are considering, and we may have been supplied four examples so far:

   x1    x2    x3    x4    x5   label
    1     0     1     1     0     -
    1     1     0     1     0     +
    0     1     1     0     1     -
    1     1     0     0     0     +
The domain X of possible examples is all 32 assignments of 0's and 1's to the attributes. Looking at the two positive examples so far, they agree on the values of x1 (= 1), x2 (= 1), x3 (= 0), and x5 (= 0), and disagree on the value of x4. The most specific conjunction of literals that agrees with these examples is therefore (x1 AND x2 AND (NOT x3) AND (NOT x5)). The set of examples classified as positive by this hypothesis h is a subset of the set of examples classified as positive by the target concept c. (Once again we are considering a hypothesis that has no false positives.) We can prove that this strategy achieves PAC-learning of the concept class of all conjunctions of literals over the variables we are considering, as follows.
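
The strategy of finding the most specific conjunction can be sketched as follows. The encoding is an assumption made for illustration: a literal is a pair (index, value) with 0-based indices, so (NOT x3) is (2, 0), and labels are True for + and False for -. The function names are illustrative.

```python
def most_specific_conjunction(examples, n):
    """Start from all 2n literals and delete every literal that is
    false on some positive example. Negative examples are ignored.
    With no positive examples the result contains both x_i and
    (NOT x_i) for every i, so it classifies everything as negative."""
    literals = {(i, v) for i in range(n) for v in (0, 1)}
    for bits, label in examples:
        if label:
            literals = {(i, v) for (i, v) in literals if bits[i] == v}
    return literals

def satisfies(literals, bits):
    """True if the assignment makes every literal in the set true."""
    return all(bits[i] == v for i, v in literals)
```

On the four examples in the table, the surviving literals are (0, 1), (1, 1), (2, 0) and (4, 0), which is the conjunction (x1 AND x2 AND (NOT x3) AND (NOT x5)) given above.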

Assume that the target concept c is a conjunction of literals over the variables x1, x2, ..., xn. Let D be a fixed unknown distribution over assignments of 0's and 1's to the variables. Let m >= (2n/epsilon) ln (2n/delta), where n is the number of variables. If we draw m examples and find the most specific conjunction c' of literals consistent with the positively labeled examples, then with probability at least (1 - delta), we have Pr_D(c - c') <= epsilon. (Note that this bound is polynomial in 1/delta, 1/epsilon and n.)

The proof is as follows. Let epsilon > 0, the target concept c and the distribution D be given. Define a literal yj (which is either xj or (NOT xj)) to be "bad" if Pr_D(yj is 0 and c is 1) > epsilon/(2n). That is, if we draw an assignment according to D, the probability that it makes the target concept c true and the literal yj false is "large", that is, greater than epsilon/(2n). We claim that if there is no bad literal left in c', then Pr_D(c - c') <= epsilon. The reason for this is that an assignment makes c' false if and only if the assignment makes some literal in c' false. Thus, by the union bound, Pr_D(c - c') is bounded above by the sum over all the literals yj in c' of Pr_D(c = 1 and yj = 0). There are at most 2n literals in c', and if none of them is bad, then the sum is at most epsilon. Thus, it is sufficient to show that with probability at least (1 - delta), there are no bad literals left in c'. Look at one bad literal, yj. The probability that in m samples we do not see a positive example in which yj is false is bounded above by (1 - epsilon/(2n))^m, which for our choice of m is bounded above by delta/(2n). Thus, by another application of the union bound, the probability that there exists a bad literal that survives all m samples is bounded above by delta.

This argument can be used to establish the PAC-learnability of k-CNF formulas (as in Valiant's paper) by a reduction in which we add new attributes to represent each possible clause containing at most k literals. (More on this in the next lecture.)

1/16/09 Lecture 3. PAC Learning: axis-parallel rectangles.

We show that the strategy of finding the smallest axis-parallel rectangle achieves PAC-learning, as follows. Assume an unknown probability distribution D on the unit square and an unknown target concept, an axis-parallel rectangle R. If m >= (4/epsilon) ln (4/delta) points are drawn independently from D and labeled according to R, then with probability at least (1 - delta), the smallest axis-parallel rectangle R' that contains all the positively labeled points in the sample satisfies Pr_D(R - R') <= epsilon.

The proof considers the four strips, R1, R2, R3, R4, along the inner side of each edge of the target rectangle R chosen so that Pr_D(Rj) = epsilon/4. (Formally, a separate argument is required in case there is no strip with exactly the right probability mass, but we omit that case here.) *If* our sample contains a point in each of R1, R2, R3, R4, then the smallest axis-parallel rectangle R' containing all the positive points in the sample excludes a subset of the union of R1, R2, R3 and R4 from R, so Pr_D(R - R') <= epsilon. What is the probability that a sample of m points misses R1 altogether? Each time a point is drawn, the probability of missing R1 is at most (1 - epsilon/4), and because the draws from D are independent, the probability that all m of them miss R1 is bounded above by (1 - epsilon/4)^m. We use the convenient inequality (1 - x) <= exp(-x) for all real numbers x, to conclude that the probability of missing R1 altogether is at most delta/4 if m >= (4/epsilon) ln (4/delta). The same argument holds of R2, R3, and R4. By the "union bound" (Pr(A or B) <= Pr(A) + Pr(B), with no independence assumptions necessary for A and B), the probability that there exists j such that the sample misses Rj altogether is at most 4(delta/4), that is, at most delta. Thus, with probability at least (1 - delta) there will be a sample point in each of R1, R2, R3 and R4, and therefore the smallest rectangle R' containing all the positive points will have Pr_D(R - R') <= epsilon.
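
The learner and the sample size from this argument can be sketched together. This is an illustrative rendering only: rectangles are represented as tuples (x_lo, x_hi, y_lo, y_hi), a sample is a list of (point, label) pairs with label True for +, and all function names are assumptions.

```python
import math

def rectangle_sample_size(epsilon, delta):
    """Smallest integer m with m >= (4/epsilon) ln(4/delta)."""
    return math.ceil((4 / epsilon) * math.log(4 / delta))

def contains(rect, p):
    """True if point p = (x, y) lies in the closed rectangle rect."""
    x_lo, x_hi, y_lo, y_hi = rect
    return x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi

def tightest_rectangle(sample):
    """Smallest axis-parallel rectangle containing all positive points,
    or None if the sample has no positive point (predict - everywhere)."""
    pos = [p for p, label in sample if label]
    if not pos:
        return None
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, p):
    """Label for a new point under the learned rectangle."""
    return rect is not None and contains(rect, p)
```

With epsilon = 0.1 and delta = 0.05, `rectangle_sample_size` returns 176. Because the learned rectangle is built only from positive points, it always lies inside the target R, matching the observation that this hypothesis has no false positives.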

1/14/09 Lecture 2. PAC Learning.

We considered the 3rd vignette from Lecture 1 in more detail. We assume an unknown but fixed probability distribution D on points in the unit square (corners (0,0), (0,1), (1,0), (1,1)) in the plane. We also assume that the "target concept" is an axis-parallel rectangle R within the unit square. We are given some points p1, p2, ..., pm, drawn independently according to the distribution D, and labeled using R. Point pj is labeled "+" if it is in R, and "-" if it is not in R. We are given one more point, p, drawn independently according to D, and we are asked to guess the correct label (+ or -) for p. What shall we do? There was considerable discussion of various strategies.

One strategy is to find the *smallest* axis-parallel rectangle, R', that contains all the positively labeled points in the sample, and to label p as + if it is in R', and - if it is not in R'. What can we guarantee about this strategy? One observation is that R' is a subset of the true target R, and thus every point labeled as + by R' is correctly labeled. However, points in (R - R') will be incorrectly labeled.

The guarantee that we will prove is that this strategy achieves probably approximately correct (PAC) learning, defined as follows. Given positive real parameters delta and epsilon, there is some number m that is polynomial in 1/delta and 1/epsilon such that if the sample used by the algorithm has at least m points, then with probability at least (1 - delta) ("probably"), the probability assigned by D to the points in (R - R') is at most epsilon ("approximately correct").

1/12/09 Lecture 1.

Syllabus and related matters; please see [Syllabus]. Please consider whether Prof. Lisha Chen's course, Stat 365b (Statistical datamining and machine learning) might be more appropriate to your needs. Three vignettes were given to illustrate some problems and reasoning typical of CPSC 463b/563b.

Vignette 1: I claim to have a program P that correctly predicts the next term of any sequence. Possible or impossible? Impossible: feed the program the terms 1,2. If P says the next term is 3, then say "no, it is 17." If P says the next term is 17, then say "no, it is 3." I revise my claim: P eventually predicts every term correctly, possibly after making a finite number of errors of prediction. Possible or impossible? Impossible: feed P the initial term 1 and get its prediction, x_2. Then make (x_2 + 1) the second term of the sequence. Next feed P the terms 1, (x_2 + 1) and get its prediction, x_3. Then make (x_3 + 1) the third term of the sequence, and so on. Thus, P makes an error on *every* (certainly infinitely many) prediction. I revise my claim again: if the sequence is computable, then P eventually predicts every term correctly, after making a finite number of errors of prediction. (A sequence x_1, x_2, ... is computable if there exists a program Q that takes n as input and outputs x_n.) Possible or impossible? Still impossible, since we may use the preceding idea and P itself to construct the program Q that outputs (x_n + 1), where x_n is the prediction of P when fed the values of Q(1),Q(2), ..., Q(n-1). Backed into a corner, I revise my claim once more: if the sequence is generated by a polynomial expression (for example, x_n = n(n+1)/2 for the sequence of triangular numbers, 1, 3, 6, 10, ...), then P eventually predicts every term correctly, after making a finite number of errors of prediction. Possible or impossible? This is possible. We think of this process as a game against an adversary: the adversary commits to some particular polynomial and generates the terms of the sequence from it, NOT changing the polynomial midway through the process. The polynomial chosen by the adversary has some specific degree d. Once we have seen the first (d+1) terms of the sequence, then a "polynomial interpolation" method will fit the unique correct polynomial to the points, and all subsequent predictions will be correct. (To think about: how can this positive result be generalized? Is polynomial interpolation essential or a red herring?)
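
One way to realize such a predictor without knowing d in advance is to always fit the lowest-degree polynomial through all the terms seen so far. The sketch below does this with a finite-difference table, which is a choice made here for illustration and not necessarily the method intended in lecture; the function name is hypothetical, and the terms are assumed to be x_1, x_2, ... at consecutive integer positions.

```python
def predict_next(terms):
    """Next value of the lowest-degree polynomial through the terms.

    Builds the table of successive differences and sums the last entry
    of every row, which extends the interpolating polynomial by one step.
    With integer terms the arithmetic is exact.
    """
    row = list(terms)
    total = 0
    while row:
        total += row[-1]
        row = [b - a for a, b in zip(row, row[1:])]
    return total
```

For the triangular numbers the predictor is wrong on the first three terms (it has seen too few terms to pin down a degree 2 polynomial) and `predict_next([1, 3, 6])` returns 10, after which every prediction is correct.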

Vignette 2: Imagine that I have n coins, numbered 1 through n. One of the coins weighs 3 ounces, the remaining (n-1) coins weigh 1 ounce. I have a scale to perform weighings: I put a collection of coins on the scale, and the readout is their weight in ounces. How do I find the heavy coin, using as few weighings as possible? Answer: binary search. Divide the coins into two equal sized groups, and weigh one of the groups; this will tell us whether the group weighed contains the heavy coin or not. We continue recursively with the group that contains the heavy coin, until the group has one element, which is the heavy coin. This succeeds with (log_2 n) weighings (rounded up to the next integer) in the worst case. Can we do better? In the best case, yes; if we weigh the coins one by one, it may happen that we get lucky and the heavy coin is the first one we weigh. But, can we do better in the worst case, that is, can we give an algorithm to find the heavy coin that *always* uses fewer weighings? No, we cannot do better; the outline of an argument based on information theory is as follows. Each weighing returns one bit of information (whether the heavy coin is in the group being weighed) and we need to determine the "label" of the heavy coin, which consists of (log n) bits. Hence we need at least (log n) weighings in the worst case. (Question: how do we make this more rigorous?)
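
The binary search can be sketched with the scale modeled as a function. Everything named here is illustrative: coins are numbered 0 through n-1 in the code, `make_scale` builds a hypothetical scale for testing, and `find_heavy_coin` also counts how many weighings it uses.

```python
def make_scale(n, heavy):
    """A scale for n coins where coin `heavy` weighs 3 ounces and the
    others weigh 1 ounce; it returns the total weight of a group."""
    def weigh(coins):
        return sum(3 if c == heavy else 1 for c in coins)
    return weigh

def find_heavy_coin(n, weigh):
    """Locate the heavy coin by repeatedly weighing half of the
    remaining candidates. Returns (coin, number_of_weighings)."""
    candidates = list(range(n))
    weighings = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        weighings += 1
        if weigh(half) > len(half):
            candidates = half  # the group is heavier than all-light coins
        else:
            candidates = candidates[len(half):]
    return candidates[0], weighings
```

When n is odd the two groups differ in size by one, and the count of weighings stays within log_2 n rounded up to the next integer.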

Now suppose that there are 2 heavy coins, each weighing 3 ounces. How do we proceed? We can proceed as before, dividing the group into two equal groups, and weighing one group. The outcomes are: (0) no heavy coins are in this group (1) one heavy coin is in this group, or (2) two heavy coins are in this group. In cases (0) and (2) we continue recursively with the group that has the two heavy coins, and in case (1) we have two instances of the problem of finding one heavy coin. The worst case(?) is that after one weighing we have two groups of size (n/2) each of which has a heavy coin to find (by the method of the previous paragraph.) This analysis gives us (1 + 2(log_2(n/2))), or 2(log_2 n) - 1, without accounting for ceilings. Is this optimal? By analogy to our previous information-theoretic lower bound, we observe that each weighing gives one of three outcomes (0, 1, or 2 heavy coins in the group weighed), which is somewhat more than one bit of information, in particular, (log_2 3) bits of information. The total number of bits we need to find is log_2 of the number of possible outcomes, which is (n choose 2) = n(n-1)/2 possible choices of subsets of size 2 (the heavy coins) out of n things. Then (log_2 (n(n-1)/2)) is about 2(log_2 n), and we divide by (log_2 3) to give a lower bound of somewhat less than 2(log_2 n). Then perhaps there is a better algorithm for finding 2 heavy coins? (Side note: a generalization of this problem asks us to find the edges of a graph given queries that allow us to specify a subset S of the vertices and be told how many edges have both endpoints in S. A paper from STOC 2008 dealt with this problem: Choi and Kim, Optimal query complexity bounds for finding graphs, Proceedings of the 40th annual ACM symposium on Theory of Computing. STOC and FOCS are two of the most important annual conferences for research in theoretical computer science.)
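
The gap between the two estimates can be computed exactly for a given n. This is a small sketch with illustrative function names: the lower bound is the smallest k with 3^k >= (n choose 2), since k weighings with three outcomes each can distinguish at most 3^k cases, and the upper estimate is 2(log_2 n) - 1 under the assumption that n is a power of 2.

```python
import math

def ternary_lower_bound(n):
    """Smallest k such that 3^k >= (n choose 2), computed with integers."""
    outcomes = math.comb(n, 2)
    k = 0
    while 3 ** k < outcomes:
        k += 1
    return k

def split_upper_estimate(n):
    """2*log2(n) - 1 for n a power of 2 (ceilings ignored, as in the text)."""
    return 2 * (n.bit_length() - 1) - 1
```

For n = 1024 the lower bound is 12 weighings and the upper estimate is 19, so the counting argument leaves room for a better algorithm.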

Vignette 3: Suppose there is an unknown probability distribution D on points in the unit square in the plane, and an unknown axis-parallel rectangle R contained within the unit square. Suppose we are given a set of points p_1, p_2, ..., p_n drawn according to D and labeled as + (if the point is within R) and - (if the point is not within R). Suppose also we are given another point p drawn according to D, and we want to guess its label, that is, whether it is within R or outside of R. How should we proceed? One observation is that there is a smallest axis-parallel rectangle that contains all the points that are labeled +, and (several?) largest axis-parallel rectangles that exclude all the points that are labeled -. This gives us some clues about how to label the new point p. Certainly, if it is within the smallest rectangle containing the positive points, it should be labeled +. (Question: What is a good algorithm here, and what can we say about it?)


Last modified: April 12, 2009