Computer Science 463a/563a Lecture Log, Fall 2005


9/1/05 Lecture 1.

Discussion of the nature of critical reading. One approach: read each sentence, argue with it, imagine the author's counterarguments, then move on to the next sentence. Skipping and skimming, e.g., of a proof that you don't need, a description of a method that you are familiar with, background that you already know, or rhetorical passages that don't convey information. Another approach: read a theorem or other claim, put the paper down and try to find your own justification for the claim before reading the author's justification. Especially be on the lookout for unstated assumptions, neglected possibilities, and so on.

A joint critical reading exercise performed on Turing-selection.txt. This is drawn from a position paper arguing for the possibility (and desirability) of producing a machine to pass the "Turing test" by the year 2000, which was 50 years in the future when the paper was published. "We may hope that machines will eventually compete with men in all purely intellectual fields." Was his emphasis on purely intellectual fields based on what he thought was feasible or on a specific restriction of the hope to just those fields? His two approaches: chess machine and "the best sense organs that money can buy" are both vigorously pursued today. "As I have explained, the problem is mainly one of programming." How to compare the productivity of Turing (1000 bits a day) with people writing in higher-level languages today? What about the image of 60 people programming for 50 years straight?

This leads to the hope of speeding up the process by introducing a learning element. The tabula rasa (or notebook from the stationer's) view of the child's mind is both attractive (radically less to simulate) and quite at odds with modern developmental psychology. He gives an analogy between an experimenter sorting through possible "child machines" and evolution by natural selection. The field of genetic algorithms is based on the premise that simulated evolution can be acceptably efficient for some tasks, contrary to Turing's view.

Finally, various attempts to define learning. Acquiring, processing, and recalling information to improve performance at a particular task. Distinctions: learning a fact (the capital of Florida), learning a method (how to differentiate polynomials), and learning to learn (maybe our critical reading exercise). Question of how novelty fits in the picture, and whether it is related to a performance criterion. Can we learn just by thinking, without interacting with an external object? A comparison of two artificial agents, one arriving at all the facts about a domain by pure deduction, another by pure induction. Is the second one learning, and the first one not? Is the method of learning more important than the content?

Some test cases: (1) We add more entries to a database of state capitals -- is this learning? (Class opinion was pretty evenly divided.) (2) We memorize the positions of all the cards in a random shuffle of the deck -- is this learning? (3) We learn to ride a bicycle -- this is presumably learning. (4) I learned a grossly inefficient tennis serve -- can we learn "wrong" things? (5) A child learns his or her native language between ages 0 and 5 -- this is generally accepted as learning.

9/6/05 Lecture 2.

Discussion of the goals of machine learning and artificial intelligence and what their social consequences might be. This is a vast topic; we illuminated some of the issues.

Reading Gold's paper. What is a journal? A periodical containing a collection of articles describing current research in a given area. The roles of author, publisher, editor, area editors, referees, libraries, delays, monetary compensation and lack thereof. Title? "Learning" seems more informative than "identification." It wasn't just "language," though that was a major focus. "In the limit" seemed appropriate in the title, as the model being advocated. Affiliation: his changed between doing the research and the paper appearing. Granting agency: funding for the research is acknowledged in the resulting papers. Abstract: a paper that introduces a new model must somehow convey that model in the abstract, a difficult task. Gold states only representative results, not all of them, in the abstract.

In reading, what was skipped? Appendix I (Proofs of the results) and Appendix II (Intuitive explanation of some of the terminology), though Appendix II should probably have been incorporated into the paper or more prominently referred to. Also, "overly" technical parts, and the ends of paragraphs that seemed to go on too long. Section 5 (a summary of a previous paper) seemed to be largely unconnected to the rest of the paper.

What was re-read more than once? Definitions of generator and tester, section on identification by enumeration, abstract, table of results. What resources were consulted? Wikipedia, which has an article on "Language identification in the limit", summarizing Gold's paper and referring to other related work. The paper seemed to have a significant amount of undefined jargon, for example, the term "primitive recursive function" (though this appears in Appendix II.) However, Gold worked to express most of the technical ideas in English prose instead of a highly formalized symbolic notation.

The general form of the paper: introducing a new mathematical model of some phenomenon. Gold says "I wish to construct a precise model for the intuitive notion 'able to speak a language' in order to investigate theoretically how it can be achieved artificially."

Motivation. He starts with motivation and selects a few aspects of the phenomenon of interest. He says an AI learning a natural language will have to do so from implicit information, that is, examples of the use of the language and/or an informant who can judge whether or not a given usage is correct.

Model. He uses the (then available) theory of formal languages, recursive function theory, and Chomsky's hierarchy of regular, context-free, context-sensitive, recursive and recursively enumerable languages as the tools to define his models. He explicitly says "a very naive model of language is assumed." As in formal language theory, languages are just sets of strings. Gold was writing before the advent of analysis of algorithms, polynomial time and NP-completeness as recognized research topics, so for him algorithms are just functions computable by Turing machines. He refers in passing to efficiency, but does not make much of it.

Results. After the definition of the model, he gives some theorems that are the main results (summarized in a useful table.) He explains the results and proofs informally, deferring the more formal proofs to Appendix I.

Interpretation. He goes back to the motivation, and gives interpretations of the theorems in terms of the motivating questions. One of the results he views as most significant is that learning from text, that is, positive examples only, is very weak in terms of the Chomsky hierarchy, since any class of languages consisting of all finite languages and at least one infinite language cannot be identified in the limit from recursive text. This seems to contradict our understanding of child language learning, which appears not to involve significant quantities of negative examples. In his interpretation, he offers three possible explanations to reconcile these: "that the class of possible natural languages is much smaller than one would expect from our present models of syntax," that the child receives negative instances in ways we do not recognize, or there is some restriction on the texts (for example, on the order of presentation of examples) from which negative instances can be derived.

Each alternative is an example of a possible shortcoming of the model, which might be remedied by changing and/or elaborating the model. There was a discussion of whether it is useful to treat a natural language like English as a finite set of strings (acknowledging that our competence as speakers is bounded by our lifetimes), and therefore learnable from text in this model (by the simple method of remembering the set of all strings that have been seen so far.) Mathematical notation for numbers seems to provide an easy example of the desirability of a model of language that allows an infinite set of well-formed utterances. Statistical approaches will also be considered. The question of whether it helps to assume that the child prefers simpler explanations was deferred.

9/8/05 Lecture 3.

Gold concluded. What are the main ideas of Gold's paper? The definition of identification in the limit, the models of information presentation (text and informant and their variants), the choice of formal languages (sets of strings) as the objects to be learned, and generators (grammars) or testers (decision procedures) as representations for them. Also, the algorithms: memorization (to learn the class of finite languages from text) and identification by enumeration (to learn, for example, the class of context-free languages over some alphabet from an informant.) (Identification by enumeration may be very time-consuming.) Finally, techniques for showing negative results, for example, that no class of languages containing every finite language and at least one infinite language L can be identified in the limit from arbitrary text. The learning problem can be cast as a game between the learner and nature; to show such a class is not identifiable in the limit from text, we adversarially construct a text for L that causes any learner that successfully identifies every finite language to change its guess infinitely many times, and therefore, fail to identify L in the limit.
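Identification by enumeration can be sketched in a few lines of Python. This is an illustration under assumptions, not Gold's own formulation: the hypothesis list is finite, the names identify_by_enumeration, hypotheses, and informant are made up for the example, and the target is assumed to appear somewhere in the list (otherwise the index runs off the end). The toy class (strings of a's whose length is divisible by k) is likewise hypothetical.

```python
def identify_by_enumeration(hypotheses, informant):
    """Yield the learner's current guess (an index) after each labelled example.

    hypotheses: list of testers (functions from strings to bool), in a fixed order.
    informant:  iterable of (string, label) pairs, label True for members of the language.
    """
    seen = []
    index = 0
    for example in informant:
        seen.append(example)
        # Move to the first hypothesis consistent with everything seen so far.
        while not all(hypotheses[index](s) == label for s, label in seen):
            index += 1
        yield index


# Hypothetical toy class: L_k = strings of a's whose length is divisible by k.
hypotheses = [lambda s, k=k: len(s) % k == 0 for k in range(1, 6)]

# An informant for the target L_3 (index 2 in the list).
informant = [("a", False), ("aa", False), ("aaa", True), ("aaaaaa", True)]
guesses = list(identify_by_enumeration(hypotheses, informant))
```

Once the guess reaches the first hypothesis that matches the target, no later example can dislodge it, which is the sense in which the learner identifies the target in the limit.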

The renowned economist John Maynard Keynes said, "In the long run we are all dead."

Valiant begun. Valiant's paper is in the same genre as Gold's, the introduction of a new model of learning, but has a number of differences, including a concern with polynomial time operation of the learner, the use of probabilities, more explicitly described algorithms, the choice of Boolean functions as the objects to be learned, and Boolean expressions (or circuits) as the representation. Note that in this setting each object to be learned has just a finite number of examples, and so is "trivial" in Gold's setting (learnable by memorization.)

The basic definition of learnability. There is a target Boolean function f to be learned. The learner may gather information about f by calling EXAMPLES, which returns a positive example of f (that is, an input x such that f(x) = 1), drawn according to some fixed but unknown probability distribution D on the positive examples of f. The learner may also call an ORACLE, which in the simplest case takes an input x and returns the value of f on x. There is a tunable parameter h > 1 that is used to bound the probability of failure. The learner must run in time polynomial in h, t (the number of propositional variables) and the "size" of the target function f (e.g., the length of its smallest representation as an expression.) The learner outputs a guess g of the target function and then halts. Then we require "probably" (with probability at least (1 - 1/h)), that the output g satisfies the following conditions: (1) whenever g(x) = 1 then f(x) = 1, and (2) the probability (according to D) of examples x such that g(x) = 0 and f(x) = 1 is bounded by 1/h (g is "approximately" correct.)

The above definition ignores the *'s, or undetermined values, in the example vectors. Subsequent research has tended to separate the issues related to "relevant variables" from the basic model.

9/13/05 Lecture 4. Valiant continued.

The definition of PAC-learnable according to Mitchell and Kearns & Vazirani. Sample space X with each example x of length n, a target concept c from a known class C, where c maps X to {0,1} and C is equipped with a size measure size(c), an arbitrary stationary probability distribution D over X, a class H of hypothesis concepts (assumed to include all the concepts in C), the function error(h) = Pr_D[h(x) \neq c(x)] measuring the true prediction error of h with respect to c and D, an EXAMPLES oracle that (independently) draws x from X according to D and returns (x, c(x)). A learning algorithm L PAC-learns the class C if for every target c in C and every distribution D on X, with inputs epsilon, delta, n, and size(c), and access to the EXAMPLES oracle, L halts with an output h from H such that with probability at least (1 - delta) we have error(h) < epsilon, AND L runs in time polynomial in (1/epsilon), (1/delta), n, and size(c).

Example with X = Boolean vectors of length n and H = C = monomials over n variables. An algorithm: start with S = the set of all 2n possible literals. For every positive example x, remove from S all literals that are 0 for x. Repeat this for L examples, and output the conjunction of the literals remaining in S. Can we choose L so that this algorithm PAC-learns C?
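The elimination algorithm for monomials can be sketched in Python as follows. The encoding is an assumption made for the example: a literal is a pair (i, b), where (i, 1) stands for the variable x_i and (i, 0) for its negation, and the function names are made up.

```python
def learn_monomial(n, examples):
    """examples: iterable of (x, y), with x a tuple of n bits and y in {0, 1}."""
    # Start with all 2n literals.
    literals = {(i, b) for i in range(n) for b in (0, 1)}
    for x, y in examples:
        if y == 1:
            # Literal (i, b) is 0 on x exactly when x[i] != b, so drop (i, 1 - x[i]).
            literals -= {(i, 1 - x[i]) for i in range(n)}
    return literals


def evaluate_monomial(literals, x):
    """Value of the conjunction of the given literals on the point x."""
    return int(all(x[i] == b for i, b in literals))


# Target: x_0 AND (NOT x_2) over n = 3 variables; negative examples are ignored.
sample = [((1, 0, 0), 1), ((0, 1, 0), 0), ((1, 1, 0), 1)]
hypothesis = learn_monomial(3, sample)
```

The output is the most specific monomial consistent with the positive examples, so it can only err by labelling a positive point as negative.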

More general algorithm for finite H: draw L examples (x1,y1), ..., (xL,yL) and halt and output ANY h in H that is consistent with each of these examples, that is, h(xi) = yi for each i. We proved that if L > (1/epsilon)(ln(|H|) + ln(1/delta)), then with probability at least (1 - delta) this algorithm outputs an h with error(h) bounded above by epsilon. Since the monomial algorithm in the preceding paragraph outputs an h in H consistent with the sample, this theorem applies to it. How big is |H| in this case? It is bounded above by 2^(2n), certainly, so the number of samples we'll need to draw is (1/epsilon)((2ln2)n + ln(1/delta)), which is polynomial in 1/epsilon, ln(1/delta) (which is smaller than 1/delta) and n. The processing time for the monomial algorithm in the preceding paragraph is polynomial in the number of examples, so the algorithm PAC-learns monomials.
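To get a feel for the numbers, the bound can be evaluated directly. The sketch below assumes the bound exactly as stated, L > (1/epsilon)(ln|H| + ln(1/delta)), with |H| bounded by 2^(2n) for monomials; the function names are made up.

```python
import math


def finite_class_sample_size(ln_h, epsilon, delta):
    """Smallest integer strictly greater than (1/epsilon)(ln|H| + ln(1/delta))."""
    bound = (ln_h + math.log(1.0 / delta)) / epsilon
    return math.floor(bound) + 1


def monomial_sample_size(n, epsilon, delta):
    # ln|H| <= ln(2^(2n)) = (2 ln 2) n for monomials over n variables.
    return finite_class_sample_size(2 * n * math.log(2), epsilon, delta)


print(monomial_sample_size(10, 0.1, 0.05))  # 169
```

Halving epsilon roughly doubles the sample size, while halving delta adds only about (ln 2)/epsilon further examples.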

Remarks: Consider the class of concepts H = C in which each c consists of the points within an axis parallel rectangle in the plane. Then H is not only not finite, but is uncountably infinite, so the theorem above does not help us. Finding ANY h in H consistent with a sequence of examples may not be achievable in polynomial time in general. The above monomial algorithm completely ignores information in the negative examples.

9/15/05 Lecture 5. BEHW and VC-dimension begun.

Last time: the definition of PAC-learnable, the example of C = monomials with algorithm of most specific consistent hypothesis, proof that for a finite H, it suffices to find ANY h from H consistent with at least (1/epsilon)(ln |H| + ln(1/delta)) examples drawn from EXAMPLES, and the application to monomials over n variables, using the upper bound of |H| < 2^{2n}, yielding the PAC-learnability of monomials.

How can we get Valiant's result on the PAC-learnability of k-CNF formulas? One approach is to reduce the problem to the learning of (in fact monotone) monomials, by introducing new "features" -- one for each possible clause with at most k literals in it. We use an upper bound of (2n)^{k+1} on the total number of features in the resulting problem, which is still polynomial in n, but exponential in k. The problem of whether general CNF (or DNF) is PAC-learnable was posed by Valiant in his original paper, and remains open. The strategy of introducing new features to make the resulting hypothesis have a simpler form is quite fundamental and widespread. Here it is crucial that PAC-learnability implies learnability with respect to an arbitrary distribution.
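The feature-expansion idea can be sketched in Python: every clause with at most k literals becomes one new monotone feature, and the monomial elimination algorithm runs over those features. The encoding (a clause as a tuple of (index, sign) literals) and all function names are assumptions for this example.

```python
from itertools import combinations


def clauses_up_to_k(n, k):
    """All clauses with between 1 and k distinct literals over n variables."""
    literals = [(i, b) for i in range(n) for b in (0, 1)]
    result = []
    for size in range(1, k + 1):
        result.extend(combinations(literals, size))
    return result


def clause_value(clause, x):
    """A clause is a disjunction: true if some literal (i, b) has x[i] == b."""
    return int(any(x[i] == b for i, b in clause))


def learn_k_cnf(n, k, examples):
    # Begin with the conjunction of every clause (each clause is one feature).
    kept = set(clauses_up_to_k(n, k))
    for x, y in examples:
        if y == 1:
            # Drop the features that are 0 on this positive example.
            kept = {c for c in kept if clause_value(c, x) == 1}
    return kept


def evaluate_cnf(clauses, x):
    return int(all(clause_value(c, x) for c in clauses))
```

For n = 3 and k = 2 this produces 21 features, comfortably below the (2n)^{k+1} = 216 upper bound; clauses that contain a variable together with its negation are always true and are simply never removed.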

We return to Marek's point from last time. Suppose the hypothesis space H is simply ALL Boolean functions over n inputs. Then we can represent any possible target Boolean function. However, the number of such hypotheses is 2^{2^n}, which means that the sample size we'll need in general will grow as 2^n. To see this, imagine generating the target function by flipping an unbiased coin every time we need to know the value of the function on a new point. This is the issue that is termed "bias" in AI: we focus attention on a very small segment of the possible hypotheses (e.g., monomials, or k-CNF formulas, or CNF formulas of polynomial length) in order to have any hope of successful generalization.

We consider the class C of concepts c, each of which consists of the points within or on the boundary of an axis-parallel rectangle in the plane. As there are uncountably many possible concepts (or a countable infinity if you insist that they be defined by rational numbers), the theorem on finite H is of no use. We proved directly that an algorithm that returns the smallest axis-parallel rectangle containing the positive points in a sample of m points will succeed in PAC-learning C if m > (4/epsilon) ln (4/delta).
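A minimal Python sketch of the tightest-fit rectangle learner, together with the sample size from the bound m > (4/epsilon) ln(4/delta). The representation of a rectangle as (xmin, xmax, ymin, ymax), the use of None for the empty hypothesis, and the function names are assumptions for the example.

```python
import math


def rectangle_sample_size(epsilon, delta):
    """Smallest integer strictly greater than (4/epsilon) ln(4/delta)."""
    return math.floor((4.0 / epsilon) * math.log(4.0 / delta)) + 1


def tightest_rectangle(sample):
    """sample: list of ((x, y), label). Returns the bounding box of the positives."""
    positives = [p for p, label in sample if label == 1]
    if not positives:
        return None  # no positive points seen: classify everything as negative
    xs = [p[0] for p in positives]
    ys = [p[1] for p in positives]
    return (min(xs), max(xs), min(ys), max(ys))


def in_rectangle(rect, point):
    if rect is None:
        return 0
    xmin, xmax, ymin, ymax = rect
    return int(xmin <= point[0] <= xmax and ymin <= point[1] <= ymax)


sample = [((1, 1), 1), ((3, 2), 1), ((5, 5), 0), ((2, 4), 1)]
rect = tightest_rectangle(sample)          # (1, 3, 1, 4)
print(rectangle_sample_size(0.1, 0.05))    # 176
```

Because the returned rectangle always lies inside the target rectangle, its errors are confined to the four strips between the two, which is where the 4's in the bound come from.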

Instead of treating such geometric classes case by case, there is a general theorem (analogous to the finite H theorem) giving a sufficient number of examples for PAC-learning in terms of a property of the hypothesis class H termed the VC-dimension. The theorem states that there is a constant c0 such that if the VC-dimension of H is finite and equal to d, then an algorithm that outputs ANY h from H consistent with at least (c0)(1/epsilon)(d ln(1/epsilon) + ln(1/delta)) examples drawn from EXAMPLES succeeds in PAC-identification. Note that d is analogous to ln(|H|) in the finite H theorem. (Except for the pesky ln(1/epsilon) factor.)

The VC-dimension of a class of concepts C is the maximum cardinality of any set S of points from X that can be shattered by C. A set S of points is shattered by C if every possible labelling of the points in S agrees with some concept c in C. We looked at a set of 3 points in the plane that cannot be shattered by the class of axis-parallel rectangles, and another set of 3 points that can be shattered. 4 anyone?
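Shattering by axis-parallel rectangles can be checked by brute force on small point sets. The sketch below relies on one observation: a labelling is realizable exactly when the bounding box of the points labelled 1 contains no point labelled 0. The function names and the particular point sets are assumptions chosen for illustration.

```python
from itertools import product


def rectangle_realizes(points, labels):
    """True if some axis-parallel rectangle contains exactly the points labelled 1."""
    positives = [p for p, b in zip(points, labels) if b == 1]
    if not positives:
        return True  # a rectangle placed away from every point
    xmin = min(p[0] for p in positives)
    xmax = max(p[0] for p in positives)
    ymin = min(p[1] for p in positives)
    ymax = max(p[1] for p in positives)
    # The bounding box is the smallest candidate; it must exclude every negative.
    return not any(
        b == 0 and xmin <= p[0] <= xmax and ymin <= p[1] <= ymax
        for p, b in zip(points, labels)
    )


def shattered_by_rectangles(points):
    """Try all 2^|points| labellings."""
    return all(
        rectangle_realizes(points, labels)
        for labels in product((0, 1), repeat=len(points))
    )


collinear = [(0, 0), (1, 1), (2, 2)]   # the middle point cannot be excluded
triangle = [(0, 0), (1, 2), (2, 1)]
diamond = [(0, 1), (1, 0), (2, 1), (1, 2)]
```

Here shattered_by_rectangles returns False for the collinear triple and True for the other triple; the diamond is one candidate to try for the 4-point question.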

9/20/05 Lecture 6. BEHW and VC-dimension continued.

Examples of finding the VC-dimension of various classes of concepts, including axis-parallel rectangles in the plane (4), closed intervals of the real line (2), unions of k closed intervals of the real line (2k), finite unions of closed intervals of the real line (infinity), half-spaces in the plane (3), half-spaces in n dimensions (n+1), "nice" convex polygons in the plane (infinity). For classes with infinite VC-dimension, we may consider another parameter (the number of intervals in the union, or the number of vertices of the polygon) and stratify the whole class into subclasses each with finite VC-dimension. The VC-dimension of any finite concept class H is at most log(|H|), where the log is base 2.

Why does bounding the VC-dimension give us any leverage over prediction? Intuitive reason: if C has VC-dimension d, then for any set of m points, the number of labellings of the m points consistent with some concept from C is bounded by O(m^d), that is, a polynomial number of the 2^m unrestricted possible labellings. "Bias" again! From the paper "A general lower bound on the number of examples needed for learning" by Ehrenfeucht, Haussler, Kearns, and Valiant, we have a lower bound of (d - 1)/(32 epsilon) on the number of examples needed to PAC-learn concepts from a class C of VC-dimension d.
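The polynomial count of labellings is easy to see for closed intervals of the real line, which have VC-dimension 2. The sketch below enumerates the labellings of m distinct points (taken in sorted order) that some closed interval realizes; the function name is made up for the example.

```python
def interval_labellings(m):
    """Number of labellings of m distinct points on a line realizable by a closed interval."""
    labellings = {tuple([0] * m)}  # an interval that misses every point
    for lo in range(m):
        for hi in range(lo, m):
            # The interval covering exactly the points lo, lo+1, ..., hi.
            labellings.add(tuple(1 if lo <= i <= hi else 0 for i in range(m)))
    return len(labellings)


print(interval_labellings(10), 2 ** 10)  # 56 1024
```

The count is m(m+1)/2 + 1, which grows as m^2 while the number of unrestricted labellings grows as 2^m.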

To see the intuition for the lower bound, we consider a set S of d points shattered by C, say S = {x1,...,xd}. We consider any sufficiently small epsilon and delta, and the specific distribution D that assigns probability (1 - 8epsilon) to x1 and equally divides the remaining probability among the (d-1) other points, assigning 8epsilon/(d-1) to each. Then if the number of examples drawn is m < (d-1)/(32 epsilon), the expected number of times we get an example other than x1 is 8epsilon*m, which is less than (d-1)/4. With "high probability" the number of such draws will be at most (d-1)/2, leaving the values on at least (d-1)/2 of these points unseen, with total probability weight at least 4epsilon. But the concepts in C may label the unseen points in any manner (because S is shattered), so the error of our output will exceed epsilon with probability greater than delta.
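The arithmetic behind this distribution can be checked with exact fractions. The values d = 9 and epsilon = 1/100 are arbitrary choices for illustration, and the function names are made up.

```python
from fractions import Fraction


def hard_distribution(d, epsilon):
    """Probabilities for x1, ..., xd: 1 - 8*epsilon on x1, the rest shared equally."""
    light = 8 * epsilon / (d - 1)
    return [1 - 8 * epsilon] + [light] * (d - 1)


def expected_light_draws(m, epsilon):
    # Each draw lands on a point other than x1 with probability 8*epsilon.
    return m * 8 * epsilon


d, epsilon = 9, Fraction(1, 100)
weights = hard_distribution(d, epsilon)
m_bound = (d - 1) / (32 * epsilon)                  # 25 examples
expected = expected_light_draws(m_bound, epsilon)   # (d - 1)/4 = 2
unseen_weight = Fraction(d - 1, 2) * weights[1]     # 4 * epsilon
```

With these values, 25 examples are expected to hit the eight light points only twice, so at least half of them (weight 4/100 in total) typically remain unseen.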

9/22/05 Lecture 7. The computational problem of finding a consistent hypothesis.

Results bounding the VC-dimension of neural networks are quite nontrivial to prove, but give bounds of the form O(W log W) or O(W^2), where W is the number of weights of a feedforward network, depending on the form of the "thresholding" function for each node.

We have seen theorems specifying sufficient sample sizes (in terms of ln(|H|) or the VC-dimension of H, as well as epsilon and delta) to guarantee that any h in H consistent with a sufficiently large set of labelled examples has error at most epsilon with probability at least (1 - delta). This presents us with the following computational problem: given a set S of labelled examples, find a hypothesis h in H that is consistent with the labelled examples in S.

We've already seen some algorithms to solve this problem: for monomials (in Valiant's paper) and by adding features, for k-CNF and k-DNF formulas (also Valiant's paper), for intervals and axis-parallel boxes in n dimensions (easy). What about half-spaces? An example in 2 dimensions shows that finding a plane to separate some positively and negatively labelled points can be solved by finding a feasible solution to a set of linear constraints (a linear program.) This easily generalizes to n dimensions. There are effective algorithms to solve linear programs, including the simplex algorithm and interior point methods. Later we'll look at the perceptron algorithm, which also solves this problem.

Sometimes, the problem of finding a consistent hypothesis is computationally hard. For example, consider the concept class C of Boolean concepts represented by 2-term DNF formulas, that is, disjunctions of two monomials. It is NP-complete to decide whether there is a 2-term DNF formula consistent with a given set of labelled examples. (We saw a proof of this, using a reduction from the problem of Set Splitting, aka 2-coloring a hypergraph so that no edge is monochromatic.) A corollary of this is that 2-term DNF is not *properly* learnable unless NP = RP. Proper learning refers to the restriction that the learner's output be from the target class. A similar kind of result was proved by Blum and Rivest showing that training a 2-layer, 3-node neural network with n inputs is NP-hard; they also showed that finding an intersection of 2 halfspaces consistent with a given labelled set of points in n-dimensional space is NP-hard.

However, every 2-term DNF formula is equivalent to a 2-CNF formula (a conjunction of clauses, each of which has at most 2 literals) by the fact that "or" distributes over "and" in Boolean logic. The equivalent formula isn't "too much" bigger, since it will have at most n^2 clauses. But 2-CNF formulas are PAC-learnable! What is going on here? By allowing the learner to use "nonproper" hypotheses (2-CNF instead of 2-term DNF), the learning problem becomes easy again. This kind of negative result is "representation dependent." Next time we'll see a non-representation dependent negative result.
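The distribution step can be made concrete. In the sketch below a term is a set of literals (index, sign), with (i, 1) for x_i and (i, 0) for its negation, and the 2-CNF formula has one clause for each pair consisting of a literal from the first term and a literal from the second. The encoding and function names are assumptions for the example.

```python
def two_term_dnf_to_2cnf(term1, term2):
    """Distribute "or" over "and": one clause (a or b) per a in term1, b in term2."""
    return {frozenset((a, b)) for a in term1 for b in term2}


def eval_term(term, x):
    return all(x[i] == s for i, s in term)


def eval_two_term_dnf(term1, term2, x):
    return int(eval_term(term1, x) or eval_term(term2, x))


def eval_cnf(clauses, x):
    return int(all(any(x[i] == s for i, s in clause) for clause in clauses))


# (x_0 AND x_1) OR (x_2 AND NOT x_0)
term1 = {(0, 1), (1, 1)}
term2 = {(2, 1), (0, 0)}
cnf = two_term_dnf_to_2cnf(term1, term2)
```

The number of clauses is at most the product of the two term lengths, which is where the n^2 bound comes from when each term has at most n literals.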

We briefly discussed issues with the basic PAC model: (1) simple concept classes might not be very appropriate/useful, (2) real examples will come with noise, omissions, perhaps malicious errors, (3) is PAC learning related to how people learn? (You might re-read Valiant's discussion of the consequences of the PAC model for philosophical "thought experiments.") (4) there will have to be human intervention, to determine the target concept class, its VC-dimension, which features to consider, and so on, (5) are learned things retained forever?, or more generally, (6) there is no framework for learning collections of related concepts, which paradoxically might be easier than learning single, stand-alone concepts (think rhinoceros from elephant), which casts the "atomism" of the approach into some doubt, (7) it seems that it would be useful to evaluate whether the concept is "close enough" as the algorithm draws more examples, perhaps permitting earlier termination.

9/26/05 Lecture 8. Cryptographic limitations on learning Boolean formulas; Queries.

Last time we considered the computational problem of finding a hypothesis h from H consistent with a given labelled set of examples. If S is a labelled set of examples, define Q(S) to be 1 if there is a 2-term DNF formula consistent with S, and 0 if no 2-term DNF formula is consistent with S. Last time we showed that Q(S) is an NP-complete problem. To see why this implies that 2-term DNF is not properly PAC-learnable unless NP = RP, we show that a proper PAC-learning algorithm for 2-term DNF would allow us to decide Q(S) in the sense that if Q(S) = 1, we output 1 with probability at least 1/2, and if Q(S) = 0, we always output 0, thus putting Q(S) in RP, with the consequence that NP = RP.

The reduction works as follows: assume A is an algorithm that properly PAC-learns 2-term DNF formulas, and S is a given set of labelled examples. We run A with epsilon = 1/(2|S|) and delta = 1/2, giving it an element of S drawn uniformly at random whenever it requests an example from the EXAMPLES oracle. If A halts and outputs a 2-term DNF formula f, we test whether f is consistent with S and output 1 if so, 0 otherwise. If A runs until its polynomial time bound without halting, we just output 0. If Q(S) = 1, then with probability at least 1/2, A must output a 2-term DNF formula (since its learning is proper) with error at most epsilon. But under the uniform distribution on S, an error on even one point of S gives an error of at least 1/|S|, which is greater than epsilon, so in this case the 2-term DNF formula must be completely consistent with S. If Q(S) = 0, A cannot return a 2-term DNF formula consistent with S, so our output will be 0.

Recall that 2-term DNF formulas are "improperly" PAC-learnable by expanding H to be the 2-CNF formulas, so this negative result is "representation dependent." For "representation independent" hardness results, we turn to cryptography. The following result is due to Kearns and Valiant. Consider the RSA encryption function E(x,N,e) = x^e (mod N), where N = pq is the product of two primes. The inverse function D(y,N,e) = x such that E(x,N,e) = y is thought to be hard to compute without knowing the factors of N (at least by users of PGP and other RSA-based cryptosystems). Research in cryptography has shown that computing D is equivalent (up to probabilistic polynomial time reductions) to computing LSBD(y,N,e) = the least significant bit of x = D(y,N,e) correctly with probability at least 1/2+1/p(n), for any polynomial p(n). By enriching the inputs with y^2 (mod N), y^4 (mod N), ..., y^(2^n) (mod N), where n is the number of bits in N, we get a new predicate LSBD'(y,N,e,y^2 (mod N), ..., y^(2^n) (mod N)), which is not cryptographically weaker than LSBD, but can be computed with a Boolean circuit with constant fan-in and depth O(log n). From this we get a result of the form: an assumption about the cryptographic security of RSA implies that O(log n)-depth, constant fan-in Boolean circuits are not PAC-learnable, even in the weak sense of guessing the correct classification with probability 1/2+1/p(n). Since a circuit of depth O(log n) is equivalent to a Boolean formula of size O(n^k), the same negative result holds for polynomial-sized Boolean formulas. It is natural to ask whether learning in this "weak" (1/2+1/p(n)) sense is equivalent to learning in the stronger sense of the original definition. Schapire proved them equivalent in his thesis, a theoretical result that grew into the area of Boosting (which we'll see later in the course.)
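A toy Python sketch may help show why the enriched inputs make the predicate easy to compute. It assumes the standard account of the construction, which goes beyond what the paragraph above states: the secret decryption exponent d is fixed as part of the target concept, and x = y^d (mod N) is then just the product of those supplied powers y^(2^i) for which bit i of d is 1. The tiny parameters (p = 61, q = 53, e = 17, d = 2753) are a textbook example with no security, and the function names are made up.

```python
def enriched_powers(y, N, n_bits):
    """Return [y, y^2, y^4, ..., y^(2^n_bits)], all reduced mod N, by repeated squaring."""
    powers = [y % N]
    for _ in range(n_bits):
        powers.append(powers[-1] * powers[-1] % N)
    return powers


def decrypt_from_powers(powers, d, N):
    """Multiply together the supplied powers selected by the binary digits of d."""
    x = 1
    for i, p in enumerate(powers):
        if (d >> i) & 1:
            x = x * p % N
    return x


N, e, d = 61 * 53, 17, 2753      # toy RSA key: e * d = 1 mod lcm(60, 52)
x = 65
y = pow(x, e, N)                 # the ciphertext E(x, N, e)
powers = enriched_powers(y, N, N.bit_length())
recovered = decrypt_from_powers(powers, d, N)
lsb = recovered & 1              # the bit that LSBD' is asked for
```

The learner sees the powers but not d, so the examples remain as hard to classify as before, while the target itself only has to multiply a fixed subset of its inputs.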

Queries. Some of the background for queries: Gold's informant (for him it didn't matter which type of informant was considered, for, as Kevin G. pointed out: "In the limit, all questions are answered"), Valiant's oracles, Ehud Shapiro's "Algorithmic Program Debugging" system. Given a set X of possible examples, a class C of concepts, a class H of hypothesis concepts, and a target concept c from C, we define two types of queries as follows. A membership query has input x and returns the label of x according to c, that is, MQ(x) = c(x). An equivalence query has input h from H and returns "yes" if h and c are equivalent, or "no" if they are not equivalent, together with an (adversarially chosen) counterexample, that is, an element x of X such that c(x) is not equal to h(x).

To get a sense of these queries, we looked at an algorithm to learn monotone DNF formulas using EQ's and MQ's, using the example (x1x2 + x3). Initially hypothesize the empty formula (everywhere false), and receive a counterexample, say 111 (which represents the assignment x1 = 1, x2 = 1, and x3 = 1). Using MQ's, search for a minimum positive point: MQ(011) = 1, MQ(001) = 1, MQ(000) = 0. Since 001 is a minimum positive point, we add the term x3 to the hypothesis and query EQ(x3). We receive the counterexample 110, and query MQ(010) = 0 and MQ(100) = 0 and determine that 110 is a minimum positive point. We add the term x1x2 to the hypothesis and query EQ(x3 + x1x2), which is answered "yes."

9/28/05 Lecture 9. Queries, continued.

Recall from last time: Given a domain X of possible examples, a concept class C over X, and a target concept c from C, we define a membership query as MQ(x) = c(x), and an equivalence query as EQ(h) = "yes" if h = c, or "no" and a counterexample x from X such that h(x) is not equal to c(x). We formalized the algorithm from last time to learn monotone DNF formulas, as follows.

Initially h = the empty disjunction, which is 0 on every x. Query EQ(h) -- if the answer is "yes", output h and halt. Otherwise let t be the term corresponding to the point Minimize(x), where x is the counterexample returned for h, add the term t to h, and repeat. One implementation of Minimize(x): for each 1 in x, let x' be the result of setting it to 0, and query MQ(x'). If the answer is 1, then return Minimize(x'), otherwise, go on to the next 1 in x. If all the 1's in x are tested without finding a positive point, then return x. Assuming the example x is positive, this procedure is guaranteed to find a minimum positive point of the target concept c below x.

This procedure makes at most as many EQ's as there are terms in the canonical form of the target concept, and for each EQ, it makes O(n^2) MQ's. Thus time and queries are clearly polynomial in n and the number of terms of the target concept. (We can reduce the number of MQ's per EQ to O(n) by observing that once we have tested a 1 and not found a positive point below it, we need not test it again during this call to Minimize.) Thus: monotone DNF formulas are exactly learnable in polynomial time with EQ's and MQ's.
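The learner can be sketched in Python with simulated oracles. Several things here are assumptions for the example: the oracles are implemented by brute force over all 2^n points (a real teacher would not be), the simulated EQ returns the first disagreement in lexicographic order rather than an adversarial one, EQ signals "yes" by returning None, and Minimize is written in its single-pass O(n) form. Variables are indexed from 0, so the lecture's x1x2 + x3 becomes x[0]x[1] + x[2].

```python
from itertools import product


def minimize(x, mq):
    """Walk down from the positive point x to a minimal positive point below it."""
    x = list(x)
    for i in range(len(x)):
        if x[i] == 1:
            x[i] = 0
            if mq(tuple(x)) == 0:
                x[i] = 1  # this 1 is needed; restore it and do not test it again
    return tuple(x)


def evaluate_dnf(terms, x):
    """Each term is a set of variable indices; the empty disjunction is 0."""
    return int(any(all(x[i] == 1 for i in t) for t in terms))


def learn_monotone_dnf(n, mq, eq):
    terms = []
    while True:
        snapshot = tuple(terms)
        counterexample = eq(lambda x, ts=snapshot: evaluate_dnf(ts, x))
        if counterexample is None:
            return terms
        point = minimize(counterexample, mq)
        terms.append(frozenset(i for i in range(n) if point[i] == 1))


def brute_force_oracles(n, target):
    """Simulated teacher for small n only."""
    def mq(x):
        return target(x)

    def eq(h):
        for x in product((0, 1), repeat=n):
            if h(x) != target(x):
                return x
        return None

    return mq, eq


target = lambda x: int((x[0] and x[1]) or x[2])
mq, eq = brute_force_oracles(3, target)
learned = learn_monotone_dnf(3, mq, eq)
```

Every term added is a minimal positive point of the target, so the hypothesis never exceeds the target and every counterexample is positive, which is why Minimize can assume a positive starting point.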

If we have access only to MQ's (no EQ's), then an adversary argument shows that at least (2^n - 1) queries may be necessary in the worst case for a formula over 2n variables. Consider the target class of formulas of the form x1y1 + x2y2 + ... + xnyn + T, where T is a term consisting of a conjunction of n variables, where the i-th variable is one of xi or yi. Suppose an algorithm exactly learns every target concept in this class using just MQ's. Consider the following adversary: (1) if a query sets both xi and yi equal to 1 for some i, then answer 1, (2) if a query sets at most one of xi and yi to 1 for each i, but does not set at least n variables to 1, then answer 0, (3) if a query sets exactly one of xi or yi to 1 for all i, then answer 0 unless this would eliminate the very last concept in the target class. This strategy means that each MQ eliminates at most one target concept, which means that until at least (2^n - 1) MQ's have been made, there are at least two target concepts consistent with all the answers -- thus, until the algorithm has made this many queries, it cannot guarantee exact learning of the target concept.

If we have access only to EQ's (no MQ's), then it is also possible to prove that no polynomial number of EQ's with polynomial-sized monotone DNF formulas can exactly identify all the monotone DNF formulas. The argument for this is considerably more involved (Angluin, "Negative results for equivalence queries") and consists of demonstrating that each hypothesis formula has an "approximate fingerprint", that is, an assignment with relatively few 1's that satisfies the formula, or relatively few 0's that falsifies the formula. By making the "approximate fingerprint" the counterexample, the adversary guarantees that the fraction of possible target concepts that is eliminated is smaller than any 1/p(n). These two results show that neither EQ's nor MQ's can be dispensed with in the above polynomial time algorithm for monotone DNF formulas. Similar negative results for EQ's hold for concepts represented as deterministic or nondeterministic finite automata and context-free grammars.

How outrageous are EQ's? Imagine a scenario in which a domain expert classifies X-rays as containing or not containing tumors, and the goal is to learn the concept class of "X-rays containing tumors." In this context, MQ's seem reasonable, since the expert just has to classify particular examples. However, an EQ would amount to typing out the C (or Haskell?) program representing the learner's current concept and asking for a counterexample (that is, a misclassified X-ray.) This does not seem at all reasonable.

However, consider the PAC model augmented with MQ's. In this model, we have X, the domain of possible examples, C, the class of possible target concepts, a target concept c from C, and a probability distribution D over X. The learner has as input the usual parameters epsilon and delta, and access to two oracles, the usual EXAMPLES oracle, that draws an example x according to D and returns the pair (x, b), where b = c(x), and a membership query oracle MQ(x) that returns the value of c(x) for an example x of the learner's choosing. The learner is expected to output with probability at least (1 - delta) a concept h such that error(h) is less than epsilon, and to do so in time polynomial in 1/delta, 1/epsilon, n (the length of an example), and the size of the target concept. Considering our X-ray example, we can imagine that the EXAMPLES oracle is supplied by a large collection of labelled examples, while MQ's might be answered by a human domain expert, answering queries selected (as particularly relevant) from an even larger collection of unlabelled examples by the learning algorithm.

The nice thing about a polynomial-time exact learning algorithm using MQ's and EQ's is that it can be transformed into a not-much-less efficient algorithm in this PAC model with MQ's. The idea is that when the exact learning algorithm makes an equivalence query, say EQ(h), we instead draw a large sample from the EXAMPLES oracle, and test to see whether h labels all the examples as in the sample. If not, then we have a counterexample x such that h(x) is not equal to c(x) to return as the answer to the EQ. If so, we output h and halt. The key is to choose the size of the sample used to test h in such a way that with probability at least (1 - delta), the error of h will be at most epsilon. We did some analysis to show that if we use a sample of size O((1/epsilon)(ln i + ln(1/delta))) for the i-th EQ, then the overall probability of returning a hypothesis h with error greater than epsilon is bounded by delta. This means that the EQ and MQ framework can be used to develop efficient algorithms for the PAC and MQ framework.
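The sample-size rule above can be made concrete. The following is a minimal sketch, not the analysis done in lecture: it assumes we give the i-th EQ a failure budget of delta/(i(i+1)) (these budgets sum to at most delta), which yields a sample size of (1/epsilon) ln(i(i+1)/delta), consistent with the O((1/epsilon)(ln i + ln(1/delta))) bound. The names eq_sample_size, simulate_eq, and draw_example are hypothetical.

```python
import math

def eq_sample_size(i, epsilon, delta):
    """Sample size used to simulate the i-th equivalence query (i = 1, 2, ...).

    Assumption: the i-th test gets failure budget delta / (i * (i + 1)).
    A hypothesis with error > epsilon survives m draws with probability
    at most (1 - epsilon)^m <= exp(-epsilon * m), so this m suffices.
    """
    delta_i = delta / (i * (i + 1))
    return math.ceil(math.log(1.0 / delta_i) / epsilon)

def simulate_eq(h, draw_example, i, epsilon, delta):
    """Replace EQ(h) by a test on random labelled examples.

    draw_example() plays the role of the EXAMPLES oracle and returns (x, c(x)).
    Returns a counterexample x, or None if h agrees with the whole sample.
    """
    for _ in range(eq_sample_size(i, epsilon, delta)):
        x, label = draw_example()
        if h(x) != label:
            return x
    return None
```

With epsilon = 0.1 and delta = 0.05, the first simulated EQ uses 37 examples, and the sample size grows only logarithmically with i.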

Lower bounds on EQ's are representation-dependent. If C is finite and we place no restriction at all on the concept h queried in EQ(h) (except that it be some subset of X), then (log |C|) EQ's suffice, via the "Halving Algorithm." Given a finite set C' of concepts, we define the "majority vote" concept for C' as follows: h(x) = 1 if at least half the concepts c' in C' have c'(x) = 1, otherwise, h(x) = 0. Given a labelled sample S, we define VS(S), the version space of S, to be all those concepts c from C that are consistent with S. (See Mitchell's book for an expanded treatment of version spaces.) The Halving Algorithm works as follows: initially S is empty and VS(S) = C. Let h be the majority vote concept for VS(S) and query EQ(h). If the answer is "yes", then output h and halt. Otherwise, add the example x to S with the opposite of its label h(x), and repeat. (Draw a table of concepts and examples here.) It is not difficult to see that because each hypothesis h agrees with a majority of the remaining concepts on every point x, each counterexample x must reduce the size of the version space by a factor of at least 1/2. Thus the correct target concept must be tested after at most (log |C|) counterexamples. This is quite efficient in terms of the number of EQ's, but is not in general optimal. What is wrong with it in a practical sense? The majority vote concept may be "too big" (no polynomial-size representation) or "too hard to find" (computationally.)
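As an illustration of the Halving Algorithm, here is a small sketch for a finite class whose concepts are represented as sets of examples. The function names and the choice of which counterexample the teacher returns (the smallest element of the symmetric difference) are assumptions made for the example; any counterexample would do.

```python
def majority_vote(version_space, domain):
    """h(x) = 1 iff at least half of the concepts in the version space contain x."""
    return frozenset(
        x for x in domain
        if 2 * sum(x in c for c in version_space) >= len(version_space)
    )

def halving(concepts, domain, target):
    """Run the Halving Algorithm against a teacher for `target`.

    Returns the final hypothesis and the number of EQ's made,
    counting the last (successful) one.
    """
    version_space = list(concepts)
    queries = 0
    while True:
        h = majority_vote(version_space, domain)  # h need not belong to the class
        queries += 1
        disagreements = sorted(h ^ target)        # symmetric difference of h and target
        if not disagreements:
            return h, queries                     # the EQ is answered "yes"
        x = disagreements[0]                      # counterexample chosen by the teacher
        label = x in target                       # the opposite of h's label on x
        version_space = [c for c in version_space if (x in c) == label]
```

Note that the majority vote hypothesis is computed by scanning the whole version space, which is exactly the practical difficulty raised above when C is large.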

10/4/05 Lecture 10. Queries, concluded.

Introduction to Littlestone's model of on-line prediction with worst-case mistake bounds. In this setting, we have a set X of possible examples, a concept class C, and a target concept c from C. The learner repeatedly receives an example x from X, makes a prediction (0 or 1) of the label of x, and receives the correct classification of x (that is, c(x)). The quantity of interest is the total number of "mistakes" made by the learner, where each time the learner predicts a value different from c(x) is a mistake. This is a worst case bound, over all possible sequences of examples x, and all possible concepts c from C.

Consider a concrete example over X = {x1, x2, x3, x4} with C = {c1, c2, c3, c4, c5} where c1 = {x1, x4}, c2 = {x2, x3, x4}, c3 = {x4}, c4 = {x2, x3}, c5 = {x1, x3, x4}. (Picture of table.) Consider the complete 2-mistake tree for this concept class with root labelled by (x2, {c1, c2, c3, c4, c5}), 0-child(root) labelled by (x3, {c1, c3, c5}), 1-child(root) labelled by (x4, {c2, c4}), where the 0 and 1 children of 0-child(root) are labelled {c1, c3} and {c5},respectively, and the 0 and 1 children of 1-child(root) are labelled {c4} and {c2}, respectively. (Picture of tree.) We can use this tree as an adversary strategy to cause any learner to make at least 2 mistakes, as follows. We present the learner with the example at the root: x2. If the learner predicts 0, then we claim that the correct prediction was 1, and continue our strategy with the subtree rooted at 1-child(root). If the learner predicts 1, then we claim that the correct prediction was 0, and continue our strategy with the subtree rooted at 0-child(root). Because after any two claimed mistakes, there is still a concept from C consistent with the examples, the learner cannot avoid making at least 2 mistakes against this adversary.

The generalization of this is to define K(C) to be the maximum k such that there is a complete k-mistake tree for C. Then Littlestone shows that the optimal worst case number of mistakes for C is precisely K(C), and, moreover, the Standard Optimal Algorithm (SOA) achieves this. The SOA is like the Halving Algorithm (last lecture), except that instead of voting with the majority of the current version space to predict an example x, it calculates the value of K on the two version spaces obtained by assuming (x,0) or (x,1) in addition to the current labelled examples, and then predicts according to the one with the larger value of K. Littlestone gives an example to show that SOA may beat Halving, which is therefore not optimal (though pretty darn good.)
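For a small finite class, K can be computed by brute force directly from the definition: a version space admits a complete (k+1)-mistake tree exactly when some example splits it into two nonempty parts that each admit a complete k-mistake tree. The sketch below does this for the five-concept class of the example above; the encoding of x1..x4 as the integers 1..4, the convention K(empty set) = -1, and the tie-breaking rule in soa_predict are assumptions of the sketch.

```python
from functools import lru_cache

DOMAIN = (1, 2, 3, 4)  # stands for x1, x2, x3, x4
C = {
    "c1": frozenset({1, 4}),
    "c2": frozenset({2, 3, 4}),
    "c3": frozenset({4}),
    "c4": frozenset({2, 3}),
    "c5": frozenset({1, 3, 4}),
}

@lru_cache(maxsize=None)
def K(version_space):
    """Depth of the deepest complete mistake tree for a frozenset of concepts."""
    if not version_space:
        return -1  # convention for the empty version space
    best = 0
    for x in DOMAIN:
        ones = frozenset(c for c in version_space if x in c)
        zeros = version_space - ones
        if ones and zeros:  # x can label the root of a tree only if it splits the set
            best = max(best, 1 + min(K(ones), K(zeros)))
    return best

def soa_predict(version_space, x):
    """Predict the label whose version space has the larger K (ties go to 1)."""
    ones = frozenset(c for c in version_space if x in c)
    zeros = version_space - ones
    return 1 if K(ones) >= K(zeros) else 0
```

Here K(frozenset(C.values())) evaluates to 2, matching the 2-mistake tree drawn in lecture; a 3-mistake tree is impossible because it would need 8 nonempty disjoint leaves and there are only 5 concepts.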

Learning regular sets with MQ's and EQ's in polynomial time. A deterministic finite acceptor (DFA) M over a finite alphabet A of symbols consists of a finite set Q of states, a start state q0 from Q, a set F of accepting states, and a transition function delta with domain QxA and codomain Q. (Picture with circles and arrows.) We may extend the transition function as follows: f(q, empty string) = q, and f(q, wa) = delta(f(q,w),a). Thus, f(q,w) is the state q' obtained by starting in state q and following the transitions corresponding to the symbols in w in left to right order. The language accepted by M, denoted L(M), is the set of all strings w from the alphabet A such that f(q0,w) is in F, that is, starting in the start state and following the transitions corresponding to the symbols of w, we arrive at an accepting state. A set of strings is "regular" if it is equal to L(M) for some DFA M. The regular languages over a given alphabet A is the class of concepts we consider.

Example machine M1 has alphabet A = {a,b}, states Q = {q0, q1, q2}, start state q0, accepting states F = {q1, q2} and transition function delta given by (q0,a)->q1, (q0,b)->q0, (q1,a)->q2, (q1,b)->q1, (q2,a)->q0, (q2,b)->q2. L(M1) is the set of all strings of a's and b's such that the number of a's is not divisible by 3. We consider the task of learning a regular set over a given alphabet A using MQ's and EQ's. Each MQ specifies a string w over A and is answered 1 or 0 depending on whether the target language includes w or not. For an EQ, we assume that the hypothesis is represented as a DFA M', and the query is answered "yes" if L(M') is equal to the target language, and "no" otherwise, with a counterexample string w which is in the symmetric difference of L(M') and the target language. The time used by the learning algorithm should be polynomial in two parameters: the number of states in the smallest DFA to accept the target language, and the sum of the lengths of the counterexamples received so far. (Note that counterexamples may be chosen arbitrarily, which means that they might be arbitrarily long compared to the number of states in the target machine. Thus, to allow time to process "long" counterexamples, we allow the learner time polynomial in the counterexample lengths.)
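The example machine M1 and the extended transition function can be written out directly. This is a small sketch with names of my own choosing (DELTA, run, accepts); it also answers membership queries for L(M1), and checks the description of L(M1) against all short strings.

```python
from itertools import product

# Transition function of M1, as given in the text.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q2", ("q1", "b"): "q1",
    ("q2", "a"): "q0", ("q2", "b"): "q2",
}
START = "q0"
ACCEPTING = {"q1", "q2"}

def run(state, w):
    """Extended transition function: follow the symbols of w from left to right."""
    for symbol in w:
        state = DELTA[(state, symbol)]
    return state

def accepts(w):
    """Membership query for L(M1)."""
    return run(START, w) in ACCEPTING

def agrees_up_to(max_len):
    """Check that w is accepted iff its number of a's is not divisible by 3."""
    return all(
        accepts("".join(w)) == (w.count("a") % 3 != 0)
        for n in range(max_len + 1)
        for w in product("ab", repeat=n)
    )
```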

Remark: we cannot do this task (polynomially) with just MQ's. For each n, consider the class C of all concepts {w} such that w is a string of a's and b's of length n. There are 2^n concepts in the class C. Each one can be represented by a DFA of n+2 states. (Picture of DFA accepting just the string aabba.) An adversary for MQ's answers every query with 0 until there is just one concept left in the class, causing any MQ learner to make at least 2^n - 1 queries. Nor can we do this task with a polynomial number of EQ's with DFA's of polynomial size ("Negative results for equivalence queries" paper again.) Thus, for polynomial time, we need both MQ's and EQ's.

Aside on conciseness of representations. Nondeterministic finite automata (NFAs) are like DFAs, except that instead of a transition function, we have a transition relation, that specifies for any triple (qi,a,qj) whether a transition is permitted from qi to qj on symbol a. The language L(M) of an NFA M is the set of all strings w such that there is a permitted sequence of transitions on the symbols of w leading from the start state to an accepting state. (There may also be permitted sequences of transitions that lead to nonaccepting states; as long as at least one permitted sequence leads to an accepting state, the string w is in L(M).) It is an elementary result in automata theory that for every NFA M there is a DFA M' such that L(M) = L(M'), so both models accept exactly the regular sets. However, the conversion from NFA to DFA may blow up the number of states by an exponential factor, and necessarily so. Consider the language En of strings of a's and b's that have an "a" in the position n symbols from the right end of the string. It is not difficult to show that the smallest DFA to accept this language has at least 2^{n-1} states, while there is an NFA with n+2 states to accept the language. Thus, if we permitted NFA representations of regular languages, the number-of-states parameter for En would be at most n+2, whereas if we require DFA representations, the parameter would be at least 2^{n-1} -- giving us much more time! Thus, NFAs may be much harder to learn than DFAs, even though they represent the same class of concepts.
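To see the blow-up for En concretely, we can run the subset construction on a small NFA for En and count the subsets it reaches. The particular NFA used here is an assumption of the sketch: states 0..n, state 0 loops on both symbols and guesses on an "a" that it is n symbols from the end, states 1..n-1 advance on any symbol, and state n accepts. It has n+1 states, within the n+2 bound quoted above.

```python
def nfa_step(states, symbol, n):
    """One step of the NFA for En on a set of current states."""
    nxt = set()
    for q in states:
        if q == 0:
            nxt.add(0)            # keep waiting
            if symbol == "a":
                nxt.add(1)        # guess that this "a" is n symbols from the end
        elif q < n:
            nxt.add(q + 1)        # count off the remaining symbols
        # state n has no outgoing transitions
    return frozenset(nxt)

def reachable_subsets(n):
    """States of the DFA produced by the subset construction from {0}."""
    start = frozenset({0})
    seen = {start}
    frontier = [start]
    while frontier:
        s = frontier.pop()
        for symbol in "ab":
            t = nfa_step(s, symbol, n)
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen
```

For this NFA the construction reaches 2^n subsets, since the reachable subset records which of the last n symbols were a's. The count by itself does not prove that no smaller DFA exists; that is the separate lower-bound argument mentioned above.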

We followed through the algorithm from "Learning regular sets with queries and counterexamples" for the example machine. Considering the PAC+MQ model, we get a polynomial time algorithm to learn DFAs, but one consequence of "Cryptographic limitations on learning Boolean formulae and finite automata" is that (subject to assumptions on the hardness of certain cryptographic primitives) without the MQ's, DFAs cannot even be weakly predicted by a polynomial time algorithm. Thus, in this case at least, MQ's seem to matter.

10/6/05 Lecture 11. On-line prediction, Littlestone's Winnow.

In the on-line prediction setting, there is a class X of instances, a class C of concepts, and a target concept. The learner repeatedly requests an instance x, predicts the label of x, and then receives the correct label, c(x). The learner makes a "mistake" when its prediction of the label of x is not equal to the correct label. We'd like to bound the total number of mistakes made by the learner, for any concept and any sequence of instances. In particular, opt(C) is defined to be the minimum over all learning algorithms A, of the maximum over all concepts c in C and sequences of instances x1, x2, .., of the number of mistakes made by A when the target concept is c and the sequence of instances is x1, x2, ... Littlestone shows that opt(C) is exactly equal to K(C) (see previous lecture for definition.) He also shows that VC-dimension(C) is a lower bound for opt(C).

Littlestone also shows that the minimum worst-case number of unrestricted equivalence queries to learn a class C is one more than opt(C). Suppose we have an on-line prediction algorithm A for the class C. We can use A to construct an unrestricted EQ algorithm A' for C as follows. Let h1(x) be the prediction of A when x is the first instance it receives; this determines a concept h1 over X. (In particular, we initialize A, use it to predict the label for x, reinitialize A, use it to predict the label for x', eventually collecting its prediction for every element of X.) A' then makes a query EQ(h1). If the answer is "yes", it outputs h1 and halts. Otherwise, the answer is a counterexample, say x1. A' then constructs another concept h2, where h2(x) is the prediction of A when given instance x1, then label (1-h1(x1)), and finally instance x, for each x in X. A' then makes another query, EQ(h2). If the answer is "yes", it outputs h2 and halts. Otherwise, the answer is a counterexample x2, and A' constructs another concept h3, where h3(x) is the prediction of A when given instance x1, label (1-h1(x1)), x2, label (1-h2(x2)), and finally instance x, for each x in X, and so on. Thus, A' makes at most one more EQ than A makes mistakes of prediction. (The final, correct, EQ of A' does not correspond to a mistake of prediction.)

For the other direction, suppose we have an (unrestricted) EQ algorithm A to learn the class C. We construct an on-line prediction algorithm A' for C as follows. A' simulates A until it makes its first EQ, say EQ(h1). A' then suspends A and then repeatedly requests an instance to predict, say x, predicts its label is h1(x), and receives the correct label c(x), until (if ever) the values of h1(x) and c(x) are different. At this point, it resumes the simulation of A with the x for which h1(x) and c(x) are different as the counterexample, thus finally answering the EQ of A. When A makes another query, say EQ(h2), A' again suspends the simulation of A, and uses h2 to predict instances until (if ever) its prediction is wrong, and so on. Thus, the number of errors of prediction by A' will be strictly less than the number of EQ's made by A.

Littlestone's Winnow is an interesting algorithm in the on-line prediction model, for the simple problem of monotone disjunctions of variables. In this setting, X consists of truth assignments to the n variables p1, p2, ..., pn, and each concept in C is an "OR" of some variables, for example, (p1 + p3 + p6). We already know one algorithm to learn this concept class (it is dual to the case of a monotone conjunction -- on each negative instance, eliminate any variables set to 1.) We'll consider Winnow specialized to have a threshold of n/2 and a promotion factor of 2. In this case, the total number of mistakes of prediction made by Winnow is bounded by (2klog_2(n) + 2), where k is the number of variables appearing in the target concept. The (n-k) variables that don't appear in the target concept are "irrelevant" in the sense that the value of the target concept does not depend upon their values. The advantage of the bound given is that it depends linearly on k, the number of relevant attributes, but only logarithmically on n, a bound on the number of irrelevant attributes. This is in contrast to our previous algorithm, which might make as many as (n-k) mistakes, a quantity that depends linearly on n.

Winnow maintains a threshold (specialized to n/2 in our treatment), a promotion factor alpha (specialized to 2 in our treatment), and a weight wi for each variable pi, initially set to 1. Given an instance x to predict, it calculates the sum of wi times xi for i from 1 to n, that is, the sum of the weights corresponding to 1's in the instance. If the sum is at least the threshold, Winnow predicts 1; otherwise, it predicts 0. When it receives the correct label, it does not change the weights if its prediction matches the correct label. Otherwise, if its prediction was 1 and the correct label is 0, Winnow does an "elimination step", setting wi = 0 for each i such that xi = 1 in the example. If its prediction was 0 and the correct label is 1, Winnow does a "promotion step", setting wi = 2wi for each i such that xi = 1 in the example. After it has updated the weights, Winnow requests another instance to predict.

(Insert example illustrating how Winnow performs on the instance sequence 10010010, 00101100, 01001010, 10000001 when n = 8 and the target concept is (p1 + p3 + p6).) To see the bound on the number of mistakes made by Winnow, we note that every promotion must double the weight of at least one variable in the target concept, and none of their weights can ever exceed n, so there can be at most (klog_2(n)) promotion steps overall. Moreover, if we consider the sum of the weights, each elimination step must subtract at least n/2 from that sum (because the sum of the weights corresponding to 1's in the example must have been at least n/2 for Winnow to predict 1, and all of those weights will be set to 0.) Each promotion step adds less than n/2 to the sum of the weights (because we double the weights corresponding to 1's in the example, but they must have summed to less than n/2 in order for Winnow to predict 0.) Since the sum of the weights cannot fall below 0, the number of elimination steps is bounded by the number of promotion steps plus 2 (for the initial sum of weights n, which can support 2 additional elimination steps.) This gives the bound of (2klog_2(n) + 2), as claimed. (We'll look briefly at Winnow2, lower bounds, and applications next time.)
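The run of Winnow on the four-instance sequence mentioned above can be reproduced with a short program. This is a sketch of the specialized algorithm as described in these notes (threshold n/2, promotion factor 2); the function name and the representation of the target as a set of 0-based variable indices are my own.

```python
def winnow1(instances, target, n):
    """Run Winnow (threshold n/2, promotion factor 2) on bit-string instances.

    `target` is the set of 0-based indices of the variables in the target
    disjunction. Returns (number of mistakes, final weights).
    """
    threshold = n / 2
    weights = [1] * n
    mistakes = 0
    for x in instances:
        bits = [int(ch) for ch in x]
        total = sum(w for w, b in zip(weights, bits) if b)
        prediction = 1 if total >= threshold else 0
        label = 1 if any(bits[i] for i in target) else 0
        if prediction != label:
            mistakes += 1
            # promotion doubles the active weights; elimination zeroes them
            factor = 2 if label == 1 else 0
            weights = [w * factor if b else w for w, b in zip(weights, bits)]
    return mistakes, weights

# Target (p1 + p3 + p6) with n = 8.
result = winnow1(["10010010", "00101100", "01001010", "10000001"], {0, 2, 5}, 8)
```

On this sequence Winnow errs on all four instances: two promotions (weights become 2,1,2,2,2,2,2,1), one elimination of p2, p5 and p7, and a final promotion, leaving weights (4,0,2,2,0,2,0,2). Four mistakes is comfortably within the bound 2*3*log_2(8) + 2 = 20.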

10/11/05 Lecture 12. Winnow Lower Bounds, Winnow2.

Last time we looked at Winnow1 for learning monotone disjunctions of variables. We maintain a weight wi for each variable xi, where each wi = 1 initially. Repeat: request an instance to predict, (x1,...,xn). If sum(i=1,n) wi*xi is at least n/2 (the threshold), predict 1, else predict 0. Receive the correct label 0 or 1, and if there was a mistake (prediction not equal to correct label), then update the weights as follows. If the prediction was 1, then set wi = 0 for all those i such that xi = 1 (elimination step); if the prediction was 0, then set wi = 2*wi for all those i such that xi = 1 (promotion step.) We saw that the total number of mistakes of prediction when the target concept is a disjunction of k variables is bounded by (2k log_2(n) + 2). How good is this? Recall that Littlestone showed that VC-dim(C) is a lower bound for opt(C), that is, the VC dimension of a concept class is a lower bound on the optimal number of mistakes in the on-line prediction setting. Thus, we can get a lower bound on the number of mistakes made by *any* algorithm for this task by determining the VC dimension of the class of concepts represented by disjunctions of k variables.

A lower bound for the VC dimension of disjunctions of k out of n variables is (k log_2(floor(n/k))). An example will help us understand Littlestone's general construction. Suppose k = 1 and n = 8, so that the lower bound is 3 in this case. The following 3 assignments are shattered by the class of concepts represented by single variables: let a = (11110000), b = (11001100), and c = (10101010). If we consider these assignments as the rows of a table, then the columns give every possible vector of 3 0's and 1's. To see that this set is shattered, consider for example the labelling (a,1), (b,1), (c,0). If the variables are x1 through x8, then x2 is 1 on a and b but 0 on c, so it achieves this labelling. Each of the 8 concepts represented by a single variable achieves a different one of the 8 possible labellings of the set S = {a, b, c}, so this set is shattered, and the class has VC dimension 3, as claimed. To generalize this for k > 1, we divide the variables into k groups of size (n/k) each, and construct k groups of log_2(n/k) assignments, corresponding to the groups of variables. The i-th group of assignments will have 0's except in the positions corresponding to the i-th group of variables, and in these n/k columns they will have all possible vectors of log_2(n/k) bits. To see that the resulting set of (k log_2(n/k)) assignments is shattered by the class of unions of k variables, we observe that for the i-th group of assignments we can pick out one variable from the i-th group of variables that correctly labels the assignments in that group. Since each assignment is 0 for variables outside its group, we can union together the k variables selected (one from each group) to achieve a correct labelling of the whole set of assignments. This shows that for k < n/2, the high order term of Winnow1's mistake bound is tight up to a constant factor. (Note that by a more careful choice of threshold and promotion factor, the constant factor can be improved below 2.)
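The construction can be checked by brute force for small cases. The sketch below builds the k groups of assignments and then tests shattering by enumerating every disjunction of exactly k variables. It assumes that n/k is a power of 2, and the function names are hypothetical.

```python
from itertools import combinations

def littlestone_assignments(n, k):
    """Build k groups of log_2(n/k) assignments (assumes n/k is a power of 2)."""
    group = n // k
    m = group.bit_length() - 1  # log_2(group) when group is a power of 2
    rows = []
    for g in range(k):
        for r in range(m):
            row = [0] * n
            for j in range(group):
                # Column j of the group spells out the bits of j, complemented,
                # so that the first row of the group starts with a block of 1's.
                row[g * group + j] = 1 - ((j >> (m - 1 - r)) & 1)
            rows.append(tuple(row))
    return rows

def is_shattered(rows, n, k):
    """True iff disjunctions of k of the n variables realize every labelling of rows."""
    achieved = set()
    for variables in combinations(range(n), k):
        achieved.add(tuple(int(any(row[i] for i in variables)) for row in rows))
    return len(achieved) == 2 ** len(rows)
```

For n = 8 and k = 1 this produces exactly the assignments a, b, c above; for n = 8 and k = 2 it produces 4 assignments, all 16 labellings of which are realized by a disjunction of two variables.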

Returning to Winnow1, if there are ERRORS in the labels supplied to the algorithm, the result can be quite catastrophic. In particular, in an elimination step we may incorrectly set some wi = 0, and it will remain 0 for the rest of the run of the algorithm. Thus, we consider Winnow2, which is like Winnow1, except that in place of the elimination step, we have a demotion step. That is, when the prediction is 1 and the label supplied is 0, Winnow2 sets wi = (1/2)*wi for all those i such that xi = 1. Note that promotion and demotion steps are inverses, so that we can recover from ERRORS in the labels supplied as feedback to the algorithm.

However, first we analyze the total number of mistakes Winnow2 may make assuming that all the labels supplied as feedback to the algorithm are correct. Note that our previous analysis of promotions still holds: each of the k relevant variables can have its weight doubled at most log_2(n) times, so there can be a total of at most (k log_2(n)) promotion steps. Once again we look at the effect of promotion and demotion steps on the sum of all the weights. A promotion step still adds at most n/2 to the sum of the weights (because the sum of the weights corresponding to xi = 1 must be less than n/2 for a prediction of 0.) A demotion step subtracts at least n/4 from the sum of the weights (because the sum of the weights corresponding to xi = 1 must be at least n/2 for a prediction of 1.) Thus, the initial sum of weights (n) can "pay for" at most 4 demotions, and each promotion can "pay for" at most 2 demotions, so there can be at most (2k log_2(n) + 4) demotions, which implies a grand total of at most (3k log_2(n) + 4) total mistakes. Thus, if there are no label ERRORS, Winnow2 has performance not much worse than Winnow1. If there is an ERROR in a label, what can the consequences be? One erroneous demotion could cut in half the weights of all k relevant variables, which could then be corrected by k separate correct promotions, each of which can "pay for" 2 more demotions. Thus an upper bound for the number of extra mistakes is 3k times the number of ERRORS in labels.
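A tiny experiment illustrates the difference between elimination and demotion when a label is wrong. The sketch below runs the same update loop with a demotion factor of 0 (Winnow1) or 1/2 (Winnow2) on a sequence whose labels are supplied explicitly, so that one of them can be made erroneous. The sequence (n = 4, target p1, first label wrong) is invented for illustration, and mistakes are counted relative to the labels supplied as feedback.

```python
def run_winnow(labelled, n, demotion):
    """Winnow with threshold n/2 and promotion factor 2.

    `labelled` is a list of (bit string, supplied label) pairs; `demotion` is
    the factor applied to active weights when the prediction was 1 and the
    supplied label is 0 (0.0 gives Winnow1, 0.5 gives Winnow2).
    """
    threshold = n / 2
    weights = [1.0] * n
    mistakes = 0
    for x, label in labelled:
        bits = [int(ch) for ch in x]
        total = sum(w for w, b in zip(weights, bits) if b)
        prediction = 1 if total >= threshold else 0
        if prediction != label:
            mistakes += 1
            factor = 2.0 if label == 1 else demotion
            weights = [w * factor if b else w for w, b in zip(weights, bits)]
    return mistakes, weights

# Target is p1. The first label is an ERROR (the true label of 1100 is 1).
sequence = [("1100", 0)] + [("1000", 1)] * 5
winnow1_result = run_winnow(sequence, 4, 0.0)
winnow2_result = run_winnow(sequence, 4, 0.5)
```

With elimination, the weight of p1 is stuck at 0 and every later instance is mispredicted (6 mistakes in all); with demotion, two promotions restore the weight of p1 and the remaining instances are predicted correctly (3 mistakes in all).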

A different variant of Winnow2 can be shown to learn certain linearly separable Boolean functions. A Boolean function f is linearly separable if there exist weights wi and a threshold t such that for all Boolean vectors (x1,...,xn), we have f(x1,...,xn) = 1 if and only if sum(i=1,n) wi*xi >= t. In two dimensions, examples of linearly separable Boolean functions are (x1 AND x2) and (x1 OR x2), while a Boolean function that is not linearly separable is (x1 XOR x2), where XOR stands for "exclusive or" (1 if exactly one of its two inputs is 1 and the other is 0.) We can represent AND by (x1 + x2) > 1 and OR by (x1 + x2) > 0. A linearly separable Boolean function f is delta-separable, for some delta > 0, if there exist weights wi such that f(x1,...,xn) = 1 iff sum(i=1,n)wi*xi >= 1, and f(x1,...,xn) = 0 iff sum(i=1,n)wi*xi <= (1 - delta). That is, there is a separation of at least delta between positive and negative examples. In n dimensions, we can take delta = 1 for OR, because (x1 + ... + xn) is 0 if all the xi's are 0 and is at least 1 if not all the xi's are 0. However, for AND, we need to take delta = 1/n, because if we consider the expression (1/n)(x1 + ... +xn) when all the xi = 1, we get 1, but when not all the xi are = 1, we may get a value as large as (1 - 1/n). When the target concept is a delta-separable Boolean function f and we run Winnow2 with a promotion factor of (1 + delta/2) and a demotion factor equal to the inverse of this and a threshold of theta, then Littlestone's Theorem 9 gives an upper bound on the total number of mistakes of (8/delta^2)(n/theta) + ((5/delta) + ((14 ln(theta))/delta^2))sum(i=1,n)wi, where wi are the weights in the delta-separable representation of f. Note that when the target is a disjunction of k variables, we have delta = 1 and the sum of the weights = k, and we can take theta = n/2, and get a bound in the same form as the Winnow1 bound, with larger constants.

As an example of a linearly separable Boolean function with a very small separation, Littlestone gives the function (x1 OR (x2 AND (x3 OR (x4 AND ...)))). If we write this out as a decision list, we get: if (x1 = 1) then 1 else (if (x2 = 0) then 0 else (if (x3 = 1) then 1 else (if (x4 = 0) then 0 else ..))). This is linearly separable; consider the specific expression that terminates with x4. A linear representation of it is 8*x1 + 4*x2 + 2*x3 + 1*x4 >= 5. When x1 = 1, the inequality holds and f(x) = 1. When x1 = 0 and x2 = 0, the inequality is false and f(x) = 0. When x1 = 0, x2 = 1, and x3 = 1, the inequality holds and f(x) = 1. When x1 = 0, x2 = 1, and x3 = 0, the inequality is true when x4 = 1 and false when x4 = 0, agreeing with f(x) in both cases. However, in general the gap will be 1/2^n, that is, exponentially small.
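The case analysis for the four-variable expression can be confirmed by checking all 16 assignments. A short sketch (function names are my own):

```python
from itertools import product

def f(x1, x2, x3, x4):
    """The nested function x1 OR (x2 AND (x3 OR x4))."""
    return int(x1 or (x2 and (x3 or x4)))

def linear(x1, x2, x3, x4):
    """Left-hand side of the proposed linear representation."""
    return 8 * x1 + 4 * x2 + 2 * x3 + 1 * x4

def representation_is_correct():
    """True iff linear(x) >= 5 exactly when f(x) = 1, over all 16 assignments."""
    return all((linear(*x) >= 5) == (f(*x) == 1) for x in product((0, 1), repeat=4))

def margin():
    """Smallest value on a positive example and largest value on a negative one."""
    values = [(linear(*x), f(*x)) for x in product((0, 1), repeat=4)]
    return (min(v for v, y in values if y == 1), max(v for v, y in values if y == 0))
```

The smallest positive value is 5 and the largest negative value is 4, so with these weights positive and negative examples differ by only 1 while the largest weight is 8, and the largest weight doubles with each added variable.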

Finally, we briefly saw the Weighted Majority Algorithm. Avrim Blum's paper, assigned as reading, gives empirical results for implementations of Weighted Majority and Winnow for a specific prediction task. Coverage of the Weighted Majority Algorithm and Blum's paper may be found in Lecture 13.

10/13/05 Lecture 13. Weighted Majority, Blum's paper, Empirical research.

In the Weighted Majority (WM) algorithm, we assume that there is a pool of experts, say A1, ..., An, each of which functions as an on-line prediction algorithm for instances x in X. The WM algorithm attempts to combine the predictions of all the experts in such a way that it makes not "too many" more mistakes on a given sequence of instances x1, x2, ... than the best (in hindsight) expert in the pool on this sequence. The WM algorithm maintains a weight wi for each expert Ai, all initially = 1. WM requests an instance x to predict, and gives the instance x to each expert Ai and receives its prediction. WM compares the total weight q1 of all experts predicting 1 for x to the total weight q0 of all experts predicting 0 for x. WM predicts 1 for x if q1 >= q0, and predicts 0 for x otherwise. WM then receives the correct label for x, which it passes along to the experts Ai. WM also sets wi = (1/2)*wi for all those experts Ai that predicted incorrectly on x.

As an example, suppose the pool contains 3 experts, A1, A2, and A3, with initial weights (1,1,1). Suppose the first instance to predict is x1, and the predictions of A1, A2, and A3 are (1,1,0). Then WM predicts 1 (because the total weight of experts predicting 1 is 2 and the total weight of experts predicting 0 is 1.) Suppose the correct label for x1 is 0. Then WM passes along the label 0 for x1 to A1, A2, and A3, and also updates their weights to be (1/2,1/2,1). Suppose the second instance to predict is x2, and the predictions are (0,1,0). Then WM predicts 0 (total weight 3/2) and not 1 (total weight 1/2). If the correct label of x2 is 1, then WM passes along this label of x2 to A1, A2, and A3, and updates their weights to (1/4,1/2,1/2). Suppose the third instance to predict is x3, and the predictions are (0,1,1). Then WM predicts 1 (weight 1) and not 0 (weight 1/4). If the correct label is 1, then WM passes along the label for x3 and updates the weights to (1/8,1/2,1/2). Note that A2 and A3 have each made 1 mistake and have weight 1/2, while A1 has made 3 mistakes and has weight 1/8.

If we look at the number M of mistakes made by WM after a particular sequence of labelled instances, and the number mi of mistakes made by the i-th expert Ai on the same sequence, we see that wi, the weight of Ai, is (1/2)^(mi). Each mistake made by WM decreases the sum of the wi's by at least a factor of 1/4, that is, to at most 3/4 of its previous value (because at least half the weight predicted wrong, and half of that will be discarded). Therefore, after M mistakes, the sum of the wi's is at most n*(3/4)^M, because n is the initial sum of the weights. Since each wi is one term of the sum, wi is a lower bound on the sum of the weights, and n*(3/4)^M is an upper bound, so we must have (1/2)^(mi) <= n*(3/4)^M. Thus, M <= c*(log_2(n) + mi), where c = 1/(log_2(4/3)), which is about 2.4. Interpreting this, the total number, M, of mistakes made by WM on an arbitrary sequence of labelled examples is "nearly" as good as the number, mi, of mistakes made by the best expert (using hindsight to judge which would have been best.) "Nearly" means a constant factor, plus a term proportional to the log of the number of experts in the pool. (Note that the constant factor can be improved to be near 1, at the cost of increasing the constant multiplying the log term, by a strategy of randomized prediction (predict 1 with probability q1/(q0 + q1) and 0 with probability q0/(q0 + q1)) combined with a less drastic penalty than 1/2.)
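The three-expert example from this lecture, together with the mistake bound, can be checked mechanically. The sketch below takes the experts' predictions as given data rather than simulating the experts themselves, and uses exact fractions for the weights; the function names are my own.

```python
from fractions import Fraction
import math

def weighted_majority(rounds, n):
    """Run WM on `rounds`, a list of (expert predictions, correct label) pairs.

    Returns (WM mistakes, per-expert mistakes, final weights).
    """
    weights = [Fraction(1)] * n
    wm_mistakes = 0
    expert_mistakes = [0] * n
    for predictions, label in rounds:
        q1 = sum(w for w, p in zip(weights, predictions) if p == 1)
        q0 = sum(w for w, p in zip(weights, predictions) if p == 0)
        guess = 1 if q1 >= q0 else 0
        if guess != label:
            wm_mistakes += 1
        for i, p in enumerate(predictions):
            if p != label:  # halve the weight of every expert that was wrong
                expert_mistakes[i] += 1
                weights[i] = weights[i] / 2
    return wm_mistakes, expert_mistakes, weights

def wm_bound(n, best_expert_mistakes):
    """The bound c * (log_2(n) + m) with c = 1 / log_2(4/3)."""
    return (math.log2(n) + best_expert_mistakes) / math.log2(4 / 3)

# The example: predictions of (A1, A2, A3) and the correct label for x1, x2, x3.
rounds = [((1, 1, 0), 0), ((0, 1, 0), 1), ((0, 1, 1), 1)]
M, m, w = weighted_majority(rounds, 3)
```

Here WM makes M = 2 mistakes, the best experts make 1 mistake each, and the bound evaluates to roughly 6.2, so the guarantee is far from tight on such a short sequence.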

Turning to "Empirical support for Winnow and Weighted Majority algorithms: results on a calendar scheduling domain" of Avrim Blum, we got to see Weighted Majority and Winnow in action. Tom Mitchell's calendar scheduling apprentice (CAP) provided both the data and a prediction algorithm for comparison. The data consists of attributes describing a sequence of meetings for each of two users. The attributes include such things as event-type, position-attendees, lunch-time?, location, and so on. CAP used selected attributes to try to predict 4 specific attributes: location, duration, start-time, and day-of-week. CAP used the previous 180 days' worth of data to build a decision tree, prune it, extract rules, and order the rules according to their empirical accuracy. It was run once a day, overnight, to develop rules to predict the next day's instances. Observing that CAP tended to produce rules with few attributes, Blum decided to test the performance of Weighted Majority and Winnow on the same problem.

For WM, Blum created an expert for each pair of attributes, for example, (event-type, position-attendees). Thus, for predicting location, which was based on 12 selected attributes, there would be (12 choose 2), or 66 experts. Each expert kept a history: for each pair of values that had occurred in the data for its two attributes, it kept the last five values of the attribute to be predicted. When the expert was asked to predict a new instance x, it took the actual values of its two attributes in x and looked at its history-of-five for that pair of values, and predicted the most frequent value for the attribute to predict among those five. If it had no history, it simply predicted a global default (the most frequently occurring value for the predicted attribute.) Using this pool of experts, WM operated as described above. The prediction is not binary, so WM used the prediction with the largest sum of weights of experts making that prediction.

For Winnow, although it does not seem useful to model the task as predicting a monotone disjunction, Blum describes an adaptation of the ideas to this task. The individual predictions for Winnow are made by "specialists" which may abstain from prediction. There is one specialist for every pair of attribute-value pairs that has occurred in the data. For example, for predicting location, one specialist would be the pair (event-type=meeting, position-attendees=faculty). This results in a very much larger number of individual prediction algorithms (59731 for the first task, using a larger feature set.) However, for each prediction, only (n choose 2) specialists "wake up" to make predictions, where n is the number of attributes being used for the prediction. Each specialist keeps a history of the last five times its pair of attribute-value pairs has occurred in instances, and the resulting label for the attribute it is to predict, and predicts the most frequently occurring label among the stored history. When a specialist is first created (the first time its pair of attribute-value pairs occurs in the data), it is given a weight of 1 and abstains from prediction. Winnow collects the predictions of the non-abstaining specialists, and makes the prediction whose corresponding weight is largest. If Winnow makes a mistake, it updates the weights of the specialists, multiplying weights by 1/2 for specialists who predicted incorrectly, and by 3/2 for specialists who predicted correctly.
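
The specialist scheme above can be sketched as follows. This is a simplified reading of the description in these notes, not Blum's code: the names Specialist and winnow_specialist_step are hypothetical, instances are assumed to be dictionaries of attribute values, and ties in the vote are broken arbitrarily.

```python
from collections import Counter, defaultdict, deque
from itertools import combinations

class Specialist:
    def __init__(self):
        self.weight = 1.0                  # initial weight of a new specialist
        self.history = deque(maxlen=5)     # last five labels seen when awake

    def predict(self):
        return Counter(self.history).most_common(1)[0][0]

def winnow_specialist_step(specialists, instance, label):
    """Process one labelled instance; return the prediction (None if nobody voted)."""
    pairs = sorted(instance.items())       # (attribute, value) pairs
    awake = list(combinations(pairs, 2))   # (n choose 2) specialists wake up
    votes = {}
    for key in awake:
        if key in specialists:
            votes[key] = specialists[key].predict()
        else:
            specialists[key] = Specialist()    # created now, abstains this time
    totals = defaultdict(float)
    for key, value in votes.items():
        totals[value] += specialists[key].weight
    guess = max(totals, key=totals.get) if totals else None
    if guess != label:                     # weights change only on a mistake
        for key, value in votes.items():
            specialists[key].weight *= 1.5 if value == label else 0.5
    for key in awake:
        specialists[key].history.append(label)
    return guess

# Hypothetical usage on one repeated meeting description.
specialists = {}
x = {"event-type": "meeting", "position-attendees": "faculty", "lunch-time?": "no"}
first = winnow_specialist_step(specialists, x, "room-A")    # all specialists abstain
second = winnow_specialist_step(specialists, x, "room-A")   # correct, no update
third = winnow_specialist_step(specialists, x, "room-B")    # mistake, weights halved
print(first, second, third, [s.weight for s in specialists.values()])
```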

How closely does the actual task reflect the theoretical models? For WM, the analysis does not assume any target concept, just an arbitrarily labelled sequence of instances. We can think of this data as an arbitrarily labelled sequence of instances, but the performance guarantee we have seen for WM compares WM with the performance of the best expert (in hindsight) for the sequence of data. This analysis does not really capture the fact that the data is made up of semester-long chunks for which different experts might be best. Thus, a more sophisticated analysis of how well WM tracks temporally varying (or "drifting" in the literature) concepts might be more appropriate here. For Winnow, the departure from the theoretical model is more striking: there doesn't seem to be any monotone disjunction in sight, and we are using a vote of the specialists instead of a threshold.

What do the empirical results say? The overall average accuracy in predicting all 4 of the attributes to be predicted is 53% for Mitchell's program CAP, 57% for WM, and 63% for Winnow on the larger of the two data sets. Blum also considers a version of Winnow modified to update its weights only at day boundaries (to make it more comparable to CAP's operation), which reduces Winnow's overall prediction accuracy to 59% on the larger data set. Disabling weight update in Winnow reduces its overall prediction accuracy on the larger data set to 52%. He considers the combination of both modifications, as well as the effects of using a larger set of attributes than the hand-selected set used by CAP. He also considers a version of Winnow that predicts only if the highest weighted outcome is at least some specified fraction of the total weight, allowing for a tradeoff of accuracy of prediction versus coverage (the fraction of instances on which Winnow predicts rather than abstains), and he studies the effects of various strategies for pruning low-weight experts in WM. In addition to these empirical comparisons, Blum gave a theoretical analysis of two pruning strategies for WM and a bound on mistakes for his modified version of Winnow.

How does this empirical work fare in light of the recommendations of "Machine learning as an experimental science" (Aha and Kibler, 1988) and "Fundamental experimental research in machine learning" (Thomas Dietterich, 1997)? Aha and Kibler suggest that research in machine learning is based on three assumptions: (1) learning is a regular process, (2) a few learning mechanisms are sufficient to support intelligent behavior, and (3) these learning mechanisms can be expressed in computational terms. In the discussion, there was strong skepticism expressed about (1) and (2), though not (3). It is not at all clear how to do science in the realm of machine learning if (1) and (2) are strongly violated. Aha and Kibler assert that many learning algorithms are too complex for formal analysis, and that empirical studies are crucial to the understanding of their behaviors. They express the hope that empirical study will lead to empirical laws, which would provide a basis for theory formation. They emphasize the importance of independent and dependent variables, and identify "performance" as a key dependent variable. Comparisons can be between different methods on the same tasks, including comparisons to human behavior, or to "straw man" algorithms. Studies of the effect of different parameter settings in algorithms, "lesion studies" (which disable some component(s) of the algorithm to understand their effects), varying the data (noise and irrelevant attributes), and artificial as well as natural data are all recommended. They emphasize that experiments should illuminate the factors relevant to the success or failure of learning algorithms. Certainly, Blum's experiments use the performance measure of prediction accuracy, compare three main methods on the same data, study the importance of continuous versus once-a-day update, the effect of extra attributes, the effect of disabling weight update, and the benefits of pruning low-weight experts.

Dietterich makes the point that machine learning is inevitably empirical, because performance of algorithms (in the real world) depends on how well the assumptions of the models match the facts on the ground. Thus, experimental studies are not optional in the field. He exemplifies this for supervised learning from examples: the success of a model and algorithm in the world depends on whether the hypothesis space is sufficiently small that predictions are "guided enough" to be useful, and sufficiently large that it contains good approximations of the target concept. He asserts that the interplay of theoretical models and empirical studies will produce stronger results than either separately, and gives as an example the development of the practically important Boosting algorithm out of a theoretical question (is weak PAC learning equivalent to strong PAC learning?) and the subsequent theoretical efforts to understand and improve its performance. He strongly makes the point that empirical study is different from applied work. In empirical study, we strive for a greater understanding of the strengths and weaknesses of a method or methods -- we embrace problems as a challenge to understanding. In applied work, the focus may be more on evading problems on the way to a shipped application.

One giant possible pitfall of empirical work: if we start with a pile of data and repeatedly refine our algorithm in response to it, how can we be sure we haven't just tuned for performance on this *one* pile of data, rather than achieving something more general? Stay tuned.

10/18/05 Lecture 14. Blum's paper concluded, Project discussion (not written up.)

At the end of his paper, Blum provides a theoretical analysis of the variant of Winnow that he devised for the calendar scheduling task, called Winnow Specialist. The theoretical model assumes a very large number of specialists, who may either vote for some prediction of the attribute value or abstain. On any particular instance to predict, at most n of the specialists "wake up" and actually vote. In addition, there are r "relevant" specialists with the following properties: on every instance we will encounter, at least one relevant specialist will vote, and whenever a relevant specialist votes, it votes for the correct prediction. In other words, the votes of relevant specialists are infallible. (This assumption is relaxed at the end, by considering the effect of errors in the labels supplied for the instances. That is, instead of thinking of the label as correct and the relevant specialist's prediction as wrong, we might treat this as an error in the label.)

Recall how the weights work: when a specialist "wakes up" for the first time, it is initialized with a weight of 1 and abstains. (Is abstention really necessary?) Then, when the Winnow Specialist makes a mistake (predicts a wrong value), it updates the weights of the at most n specialists who participated in the vote as follows: those who voted for incorrect predictions have their weights halved, those who voted for the correct prediction have their weights multiplied by 3/2. The analysis is similar to that for Winnow, though somewhat more complex. We have no particular way of bounding the total weight of the specialists, since we may initialize a large number to 1. Instead, we distinguish between "high weight" and "low weight" specialists, depending on whether their weight (currently) is greater than 1 or less than or equal to 1. Then the quantity whose evolution we analyze is the sum of the weights of the high weight "irrelevant" (that is, not one of the r relevant) specialists. Denote this by W_{hi}.

Consider what happens when Winnow Specialist makes a mistake. We know that the weight of the incorrect predictions is greater than or equal to the weight of the correct prediction in the vote. None of the relevant specialists voted for an incorrect prediction (they are infallible, recall.) Let W_a denote the weight of the high weight specialists who voted for incorrect predictions. The weighted vote for incorrect predictions is W_a plus the sum of the weights of the low weight specialists who voted for incorrect predictions, which can be bounded above by n (since at most n of them voted, and their weights are all less than or equal to 1.) That is, the vote for incorrect predictions is bounded above by (W_a + n). If we let W_r denote the sum of the weights of the relevant specialists who voted on this instance (all of which voted correctly, by infallibility) and W_b denote the sum of the weights of irrelevant specialists who voted correctly on this instance, then the vote for the correct prediction is (W_r + W_b). Because Winnow Specialist made a mistake, we know that (*) (W_r + W_b) is bounded above by (W_a + n).

Now we examine what happens to the weights of the various categories of specialists in response to the mistake made by Winnow Specialist. The weights (W_r) of the voting relevant specialists are all multiplied by 3/2. The weights (W_b) of the correctly voting irrelevant specialists are also all multiplied by 3/2. The weights of the incorrectly voting specialists are all multiplied by 1/2. How does this affect W_{hi}? It could be increased by at most (1/2)W_b by the promotion of irrelevant specialists, and must be decreased by at least (1/2)W_a by the demotion of high weight irrelevant specialists. Thus, the net increase to W_{hi} is at most ((1/2)(W_b - W_a)). By using the inequality (*) that we obtained above, this last quantity is bounded above by (1/2)(n - W_r).

We divide mistakes by Winnow Specialist into two categories: (1) mistakes made when W_r < 2n, and (2) mistakes made when W_r >= 2n. Every time a mistake is made, the weight of some relevant specialist must be multiplied by 3/2 (because at least one of them votes, and votes correctly by infallibility.) Nothing (in the error free case) decreases the weights of the relevant specialists, so these increases can only happen so many times before W_r is at least 2n. If the weight of every relevant specialist is at least 2n, then no more mistakes of type (1) can occur. Thus, each relevant specialist can be promoted at most ceiling of (log_{3/2}(2n)) times by mistakes of type (1). (Note: ceiling(log_{3/2}(2n)) <= log_{3/2}(2n) + 1 = log_{3/2}(3n), which is where the 3n comes from in the bound.) Thus, the total number of mistakes of type (1) is bounded above by (r*log_{3/2}(3n)).

Considering mistakes of type (2), we know that the vote for the correct prediction is at least 2n, so the vote for incorrect predictions is at least 2n, which means that the vote by high weight irrelevant specialists for incorrect predictions is at least n (since the low weight ones sum to at most n), which means that W_{hi} must be decreased by at least n/2 for each mistake of type (2). W_{hi} starts at 0; what increases it? Recall that the increase to W_{hi} is bounded above by (1/2)(n - W_r), which is at most n/2. Moreover, increases can only take place when W_r < n, that is, only for mistakes of type (1). Thus, the sum of all increases to W_{hi} is bounded above by (n/2) times the number of type (1) mistakes. Since every type (2) mistake must decrease W_{hi} by at least n/2, this means that the number of type (2) mistakes is bounded by the number of type (1) mistakes, that is, the total number of mistakes is bounded by (2r*log_{3/2}(3n)). This bound is in the same form as the bound obtained on Winnow1, growing linearly with the number of relevant specialists (variables) and logarithmically with the number of specialists that "wake up" for any example (or, the total number of variables, in the case of Winnow1.) There is also a brief analysis bounding the effect of ERRORS in the labels of the instances.

How relevant is the analysis of Winnow Specialist to the empirical results in the paper? The model is clearly very idealized: in the calendar setting, we are assuming r rules such that at least one of them makes a prediction for every example we will see, and such that every prediction made by any of them will be correct, together with treating deviations from this model as errors in the "true" labels of the instances. It seems unlikely that the calendar data satisfies these assumptions, not least because it is so strongly time-varying. However, the analysis does give us some confidence that in adapting Winnow in this way, Blum has not destroyed the essential theoretical properties of the algorithm. The relationship between the theory and the empirical studies is quite indirect: the theoretical algorithms provide ideas and hints for the empirical application. How does the paper fare as an empirical study? It does seem to provide evidence for the possible usefulness of Weighted Majority and Winnow as a source of practical algorithms and implementations. It does not really answer the question that is posed by the empirical results: Why on this data is the performance of Winnow better than that of Weighted Majority, which in turn is better than that of CAP? Some remarks at the end of the paper about the properties of the experts or specialists (keeping the last five applicable cases) in preserving "rare but useful events" and adapting quickly to changes, as well as general remarks about the importance of the weights and the ability of WM and Winnow to deal with irrelevant attributes, address this question, but a general conclusion about which properties of this domain are important to this result, and the strengths and weaknesses of these three algorithms as applied to other domains seems beyond the intended goals of the paper.

10/20/05 Lecture 15. Boosting: AdaBoost, discussion of "self-plagiarism" (no notes for this.)

Boosting is somewhat analogous to the unfortunate tendency of an oral exam to spend the most time on material where the student's performance is weakest. Schapire, in his thesis in 1989, gave the first polynomial time method of converting a weak learning algorithm (able to get some error rate less than (1/2 - 1/p(n)) on ANY input distribution) into a (strong) PAC learning algorithm (able to get any error rate less than epsilon, at a cost polynomial in 1/epsilon). Freund in 1990 gave a more practical algorithm, and Freund and Schapire developed the idea into the AdaBoost algorithm. They received the 2003 Gödel Prize for their 1995 AdaBoost paper.

We look at the AdaBoost algorithm as presented in ``A brief introduction to boosting'' by Schapire. We assume a domain X of examples. Labels will be +1 and -1 instead of 1 and 0; this makes some expressions more concise. In particular, if h is a hypothesis and (x,y) is a labeled example, then the expression y*h(x) will be 1 if h(x) = y and -1 otherwise. The input to AdaBoost is a fixed set of labelled examples, denoted (x_1, y_1), ..., (x_m, y_m), and called the training set. (We do NOT assume access to an EXAMPLES oracle.) We assume that there is a "weak" or "base" learner that takes as input a probability distribution over the training set and returns a hypothesis h mapping X to {+1, -1}.

AdaBoost operates in stages: t = 1,2,...,T. At each stage, it computes a probability distribution on the labelled examples in the training set. Let D_t(i) denote the probability assigned to the i-th labelled example, (x_i, y_i), in stage t. Initially, the probabilities are uniform, that is, D_1(i) = 1/m, for i = 1,2,...,m; all m labelled examples have equal probability. In stage 1, AdaBoost calls the weak learner on the training set with this initial distribution, and the weak learner returns its first hypothesis, h_1. If the weak learner cannot deal directly with weighted examples, then AdaBoost can simulate randomly and independently drawing examples from the training set according to the current probability distribution, to supply examples to the weak learner. After it receives the hypothesis of the weak learner, AdaBoost updates the distribution and repeats, until t = T.

To update the distribution D_t to get the new distribution D_(t+1), AdaBoost calculates the error rate epsilon_t of the weak learner's hypothesis h_t on the training sample. That is, epsilon_t is the sum of D_t(i) for all examples (x_i,y_i) from the training set such that h_t(x_i) is not equal to y_i. This is used to calculate a weight, alpha_t, to assign to the hypothesis h_t. In particular, alpha_t is chosen to be (1/2)*ln((1 - epsilon_t)/epsilon_t). Note that the weight alpha_t will be positive if epsilon_t is less than 1/2, and increases as epsilon_t decreases. Thus more weight in the final hypothesis will be given to those h_t with smaller error rates. To update the probability assigned to (x_i,y_i), AdaBoost multiplies D_t(i) by exp(alpha_t) if h_t(x_i) is not equal to y_i, and by 1/exp(alpha_t) if h_t(x_i) = y_i, and divides by a normalization factor (Z_t), equal to the sum of all the updated values, so that D_(t+1) will be a probability distribution. Thus, the probability of each example on which h_t makes an error is INCREASED (by a factor proportional to sqrt((1 - epsilon_t)/epsilon_t)) and the probability of each example on which h_t does not make an error is DECREASED (by a factor proportional to the inverse of the preceding factor.) Once the probability distribution D_(t+1) is computed, AdaBoost moves to stage t+1 and calls the weak learner again.
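
The update just described can be written as a single function. This is a minimal sketch: the name adaboost_round and the list-based representation are assumptions, and the code assumes 0 < epsilon_t < 1 so that the logarithm is defined. It uses the fact that y*h(x) is +1 on correct predictions and -1 on errors.

```python
import math

def adaboost_round(dist, predictions, labels):
    """One AdaBoost round: returns (epsilon_t, alpha_t, D_(t+1)).

    dist[i] is D_t(i); predictions[i] is h_t(x_i); labels[i] is y_i (+1 or -1).
    """
    eps = sum(d for d, p, y in zip(dist, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # exp(-alpha * y * h(x)) is exp(alpha) on errors and 1/exp(alpha) otherwise.
    raw = [d * math.exp(-alpha * y * p)
           for d, p, y in zip(dist, predictions, labels)]
    z = sum(raw)                           # normalization factor Z_t
    return eps, alpha, [r / z for r in raw]

# Four equally weighted examples, with the hypothesis wrong on the second one.
eps, alpha, next_dist = adaboost_round(
    [0.25, 0.25, 0.25, 0.25], [1, -1, -1, 1], [1, 1, -1, 1])
print(eps, round(alpha, 3), [round(d, 3) for d in next_dist])
```

After the update the single misclassified example carries half of the total probability, which is a general property of this choice of alpha_t.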

Once stage T is reached, the weak learner produces a hypothesis h_T and AdaBoost calculates its weight alpha_T. AdaBoost then combines the hypotheses h_1, h_2, ..., h_T as a weighted sum f(x) = sum(t=1,T) alpha_t*h_t(x). This is a real valued function (values not restricted to +1 and -1), so the final hypothesis output by AdaBoost is H(x) = sign(f(x)), where sign(x) is -1 if x is negative, and +1 otherwise. Thus, the final hypothesis of AdaBoost is in effect a weighted vote of the hypotheses h_1, h_2, ..., h_T, produced by the weak learner in response to the various modified distributions computed by AdaBoost, where alpha_t is the weight of the vote of h_t.

To get a better understanding of the algorithm, we can run it on a very simple example. Suppose the training set consists of the examples z_1 = (000, 1), z_2 = (110, 1), z_3 = (100, -1), and z_4 = (011, 1). Suppose the possible hypotheses considered by the weak learner are "decision stumps" consisting of one attribute. Thus, a1(x) = 1 if the first coordinate of x is 1, and -1 if the first coordinate of x is 0, and a1'(x) = (-1)a1(x), that is, 1 if the first coordinate of x is 0 and -1 if the first coordinate of x is 1. Similarly, a2, a2' are determined by the second coordinate of the input, and a3, a3' by the third coordinate of the input. Considering the hypothesis a1, it makes errors on z_1 (because a1(000) = -1), z_3 (because a1(100) = 1), and z_4 (because a1(011) = -1), and so has an error rate of 75% on the initial distribution on the training set. Its complement, a1', makes errors only on z_2 (because a1'(110) = -1), and so has an error rate of 25% on the initial distribution on the training set. Similarly, the error rate for a2 is 25%, for a2' is 75%, for a3 is 50%, and for a3' is 50% on the initial distribution. Assuming that the weak learner achieves an error rate less than 50% on this distribution, it must return either a1' or a2; suppose it returns a1'.

Then h_1 is a1', with epsilon_1 = .25, and AdaBoost calculates the weight alpha_1 as (to two decimal places), .55. It then updates the distribution over the examples to be: D_2(z_1) = .17, D_2(z_2) = .50, D_2(z_3) = .17, D_2(z_4) = .17. Note that the example z_2, on which a1' makes its only error, receives a large probability in this second stage. The error rates of the six possible hypotheses with respect to D_2, are 50% for a1 and a1', 17% for a2, 83% for a2', 67% for a3, and 33% for a3'. Suppose the weak learner now returns the hypothesis a2 (although a3' would also be a possibility, since it has an error rate less than 50%.) AdaBoost has h_2 = a2, with epsilon_2 = .17. This leads to a weight of alpha_2 = .79 (computed from the rounded value epsilon_2 = .17; the exact value epsilon_2 = 1/6 gives about .80) and a new distribution D_3 on the examples, of D_3(z_1) = .50, D_3(z_2) = .30, D_3(z_3) = .10, and D_3(z_4) = .10. The error rates of the six possible hypotheses with respect to D_3 are: 70% for a1, 30% for a1', 50% for a2, 50% for a2', 80% for a3, 20% for a3'. Suppose that the weak learner now returns the hypothesis a3'. Then AdaBoost has h_3 = a3' with epsilon_3 = .20. This leads to a weight of alpha_3 = .69.

If AdaBoost only runs for T = 3 stages, it now forms the weighted sum of h_1, h_2, and h_3, namely, f(x) = .55*a1'(x) + .79*a2(x) + .69*a3'(x), and outputs the hypothesis H(x) = sign(f(x)). If we check this on the examples in the training set, we find that f(000) = .45, f(110) = .93 (that is, -.55 + .79 + .69), f(100) = -.65, and f(011) = .65. Thus, H(x) is completely consistent with the training sample, that is, it has training error equal to 0. We'll see what this has to do with generalization next time.
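
The three-stage run above can be recomputed with a short script. This is a sketch only: the weak learner is replaced by a fixed list of choices (a1', a2, a3') matching the ones supposed in the example, and the names TRAIN, STUMPS, and run are made up for illustration. Because the script uses unrounded error rates, alpha_2 comes out near 0.80 rather than .79, and the values of f shift by about .01 or .02 accordingly.

```python
import math

TRAIN = [("000", 1), ("110", 1), ("100", -1), ("011", 1)]

def stump(coord, sign):
    # sign = +1 gives a_k (predict +1 when the bit is 1); sign = -1 gives a_k'.
    return lambda x: sign * (1 if x[coord] == "1" else -1)

STUMPS = {"a1": stump(0, 1), "a1'": stump(0, -1),
          "a2": stump(1, 1), "a2'": stump(1, -1),
          "a3": stump(2, 1), "a3'": stump(2, -1)}

def error(h, dist):
    return sum(d for d, (x, y) in zip(dist, TRAIN) if h(x) != y)

def run(choices):
    """Run AdaBoost with the weak learner's answers fixed in advance."""
    dist = [1.0 / len(TRAIN)] * len(TRAIN)
    alphas, dists = [], []
    for name in choices:
        h = STUMPS[name]
        dists.append(dist)                 # record D_t before the update
        eps = error(h, dist)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        alphas.append(alpha)
        raw = [d * math.exp(-alpha * y * h(x)) for d, (x, y) in zip(dist, TRAIN)]
        z = sum(raw)
        dist = [r / z for r in raw]
    return alphas, dists

CHOICES = ["a1'", "a2", "a3'"]
alphas, dists = run(CHOICES)

def f(x):
    return sum(a * STUMPS[name](x) for a, name in zip(alphas, CHOICES))

print([round(a, 2) for a in alphas])
for t, d in enumerate(dists, start=1):
    print("D_%d" % t, [round(p, 2) for p in d])
for x, y in TRAIN:
    print(x, y, round(f(x), 2))
```

Changing CHOICES (for example to a2 in the first stage) shows how the distributions and the final vote depend on which hypotheses the weak learner happens to return.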

10/25/05 Lecture 16. Boosting: AdaBoost, continued.

Recall the setting and description of AdaBoost (previous lecture.) Lev Reyzin gave a talk on results from his senior project supervised by Prof. Schapire at Princeton: "Analyzing Margins in Boosting." This compared the performance of AdaBoost and another boosting algorithm with respect to the distribution of margins and generalization performance.

To understand where the choice of weights comes from in AdaBoost, we expanded the argument from Schapire's paper "The Boosting Approach to Machine Learning."

10/27/05 Lecture 17. AdaBoost concluded, Decision Trees.

We finished understanding the choice of weights in AdaBoost as a greedy minimization of the normalization factor Z_t at each round. Assuming that the error epsilon_t is at most 1/2 - gamma at each round, for some gamma > 0, we have that the error of AdaBoost's hypothesis on the *training set* is at most exp(-2T{gamma}^2) after T rounds, which is decreasing exponentially fast as a function of T.
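
A small numeric illustration, relying on two standard facts from Schapire's analysis that are not spelled out in these notes: the training error is at most the product of the Z_t, and with the AdaBoost choice of alpha_t each Z_t equals 2*sqrt(epsilon_t*(1 - epsilon_t)). The function names are hypothetical, and the error rates are the ones from the three-stage example of Lecture 15.

```python
import math

def z_factor(eps):
    # Normalization factor Z_t when alpha_t = (1/2) ln((1 - eps)/eps).
    return 2.0 * math.sqrt(eps * (1.0 - eps))

def training_error_bound(epsilons):
    # Product of the Z_t, an upper bound on the training error.
    bound = 1.0
    for eps in epsilons:
        bound *= z_factor(eps)
    return bound

def exponential_bound(gamma, rounds):
    return math.exp(-2.0 * rounds * gamma ** 2)

epsilons = [0.25, 1.0 / 6.0, 0.20]
gamma = 0.5 - max(epsilons)               # every epsilon_t is at most 1/2 - gamma
product = training_error_bound(epsilons)
print(round(product, 3), round(exponential_bound(gamma, len(epsilons)), 3))
```

Both quantities are upper bounds only; in the Lecture 15 example the actual training error after three rounds is already 0.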

But what does error on the *training set* say about performance on other data? Could we not just be horribly overfitting the training data? For this question, we go back to the PAC model and assume that the training data was chosen (independently) from a distribution D on (instance, label) pairs (x,y), and ask for bounds on the error of AdaBoost's final hypothesis h(x), that is, bounds on the probability that h(x) is not equal to y if (x,y) is drawn from D. Freund and Schapire proved one bound on this error: it is bounded above by the error of h on the training set PLUS soft-Oh(sqrt(Td/m)), where m is the number of elements in the training set, T is the number of rounds of boosting, and d is the VC-dimension of the space of all possible base classifiers that could be returned by the weak learner. (Soft-Oh, frequently denoted by a tilde over a big-Oh, is like big-Oh, but suppresses factors that are polylogarithmic in the explicit variables, as well as constant factors. Thus, (n log n)^2 is Soft-Oh(n^2), for example.) This bound suggests a tradeoff between driving the training set error down and using lots of rounds to do so (increasing T). But the empirical evidence suggests that this bound is too pessimistic in certain circumstances; empirically many rounds may sometimes not blow up the generalization error.

Where does the bound come from? Underlying it is a bound of 2(d+1)(T+1)log(e(T+1)) on the VC-dimension of thresholds of linear combinations of T classifiers from a base class of dimension d, proved by Baum and Haussler. Using soft-Oh, we see that this is soft-Oh(dT). This is used in a theorem of Vapnik to get the generalization bound above. Thinking about this in the context of, say, a class of base classifiers consisting of decision stumps, there is a fixed number of possible base classifiers (for each attribute, each possible way of assigning +1/-1 to its possible values.) Thus, the number of classifiers appearing in the final linear combination is bounded by this quantity, and doesn't increase arbitrarily with T. Thus, in this specific case, the growth of T is irrelevant after a certain threshold.

As Lev covered in his talk, there is another bound on the generalization error that refers to the margin of the final classifier on the training set, and does not involve the number (T) of rounds of training at all. That bound has a parameter theta > 0 and says that the generalization error of the final hypothesis h(x) returned by AdaBoost is bounded above (whp) by the SUM of the fraction of the training set with margin less than theta, and a term that is soft-Oh(sqrt(d/(m{theta}^2))). Again, d is the VC-dimension of the class of base classifiers, and m is the number of examples in the training set. Letting f(x) denote the weighted sum of the base classifiers before thresholding, so that f(x) = sum(t=1,T)(alpha_t)(h_t(x)), we have that the final hypothesis is h(x) = sign(f(x)). Then the *margin* of a pair (x,y) is y*f(x)/sum(t=1,T)|alpha_t|. The margin takes on a value between -1 and +1. It is negative if the prediction of h(x) disagrees with the value y in (x,y) and positive if it agrees. It is close to +1 if the vote is overwhelmingly in agreement with the label y, a positive value close to 0 if the vote is slightly in agreement with the label y, a negative value close to 0 if the vote slightly disagrees with the label y, and close to -1 if the vote overwhelmingly disagrees with the label y. We may view AdaBoost as attempting to find a classifier h(x) with large (close to 1) margins on lots (a fraction close to 1) of the examples in the training set, which would mean a large fraction of the training set is correctly classified by a "confident" vote. (AdaBoost does not directly optimize this quantity.) Then in the bound above, for a specific theta > 0, we'd like the fraction of examples from the training set with margins below theta to be "small" (the first term of the bound) and yet theta itself to be not too small (since the second term is proportional to 1/theta.)
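
To make the margin definition concrete, here is a sketch that computes margins for the four training examples of the Lecture 15 run, using the rounded weights .55, .79, .69 and the votes of a1', a2, a3' on each example. The function names and the threshold theta = 0.3 are arbitrary choices for illustration.

```python
def margin(alphas, votes, label):
    """Normalized margin y*f(x)/sum|alpha_t|; votes[t] is h_t(x) in {+1, -1}."""
    f = sum(a * v for a, v in zip(alphas, votes))
    return label * f / sum(abs(a) for a in alphas)

def fraction_below(margins, theta):
    # First term of the margin bound: fraction of examples with margin < theta.
    return sum(m < theta for m in margins) / len(margins)

ALPHAS = [0.55, 0.79, 0.69]
# (votes of a1', a2, a3', label) for the examples 000, 110, 100, 011.
EXAMPLES = [((1, -1, 1), 1), ((-1, 1, 1), 1), ((-1, -1, 1), -1), ((1, 1, -1), 1)]
margins = [margin(ALPHAS, votes, y) for votes, y in EXAMPLES]
print([round(m, 2) for m in margins], fraction_below(margins, 0.3))
```

All four margins are positive (the training error is 0) but none is close to 1, so the three-stump vote is correct without being especially confident.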

Decision trees (Quinlan: ID3, C4.5, C5.0) (Breiman: CART). We constructed a decision tree from the renowned "weather data" (available on paper.) This illustrated the concept of a decision tree and, informally, considerations relevant to constructing one. Our final tree correctly classifies all the training instances: did we overfit the data?

11/1/05 Lecture 18. Decision Trees.

Problem: given a training set {(x_1, y_1), ..., (x_m, y_m)}, where each instance x_i is specified by its values on a collection of attributes, and each label y_i is a class label for the instance x_i, find a decision tree that tests attributes of, and assigns class labels to, arbitrary instances. The example decision tree that we built last time for the weather data (on paper) could be described as: root vertex tests attribute "outlook?" and branches to vertices 1 (value = sunny), 2 (value = overcast), and 3 (value = rainy). At vertex 1 the attribute "humidity" is tested and branches to vertices 4 (value = high) and 5 (value = normal). At vertex 3 the attribute "windy?" is tested, and branches to vertices 6 (value = true) and 7 (value = false). Vertices 4,5,2,6,7 are leaf vertices, which are assigned class labels of no, yes, yes, no, and yes, respectively. This decision tree correctly classifies all 14 of the elements of the training sample.

Note that we can extract rules from the tree, one rule for each path from the root to a leaf. For example, the path root->1->4 gives the rule "if (outlook=sunny) and (humidity=high) then (play=no)" and the path root->3->7 gives the rule "if (outlook=rainy) and (windy?=false) then (play=yes)." The number of rules obtained in this way is just the number of leaves in the decision tree. Viewing this slightly differently, there is a DNF formula for each of the classes. For example, for (play=no) we have the DNF formula ((outlook=sunny) and (humidity=high)) or ((outlook=rainy) and (windy?=true)), and for (play=yes) we have the DNF formula ((outlook=sunny) and (humidity=normal)) or (outlook=overcast) or ((outlook=rainy) and (windy?=false)).
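
The tree and its rules can be sketched with nested tuples and dictionaries. The representation (an internal vertex is a pair of attribute name and branches, a leaf is a class label string) and the function names are assumptions for illustration.

```python
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy?", {"true": "no", "false": "yes"}),
})

def rules(tree, path=()):
    """Yield one (conditions, label) rule per root-to-leaf path."""
    if isinstance(tree, str):              # a leaf holds a class label
        yield path, tree
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from rules(subtree, path + ((attribute, value),))

def classify(tree, instance):
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

all_rules = list(rules(TREE))
for conditions, label in all_rules:
    body = " and ".join("(%s=%s)" % pair for pair in conditions)
    print("if %s then (play=%s)" % (body, label))

# The DNF formula for a class is the "or" of the rule bodies with that label.
no_terms = [conditions for conditions, label in all_rules if label == "no"]
```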

Informally it is clear that we would like to build a "small" decision tree from the training data. Formally, this is related to the desire to keep the number of possible hypotheses (or the VC-dimension of the class of possible hypotheses) "small", which will allow the number of training examples to be "large enough" to guarantee good generalization performance. The optimization problem: find a "smallest" (in various senses) decision tree consistent with a given training sample, is NP-hard (Rivest and Hyafil showed this.) However, decision trees are efficiently learnable in a stronger learning model (proved by Nader Bshouty), namely, PAC-learnable with membership queries (or, more precisely, exactly learnable with membership queries and equivalence queries using depth 3 formulas as hypotheses.) One property that might make decision trees more tractable to learn than general DNF formulas is the following. In the 2-class case, both the concept (yes) and its complement (no) have "small" DNF representations (small in the sense of polynomial in the size of the decision tree itself.) For general DNF, a concept represented by a DNF formula may require an exponentially larger DNF formula to represent its complement.

Back to the main line: optimally small decision trees seem to be hard to come by, so we consider a widely used and studied heuristic that greedily attempts to construct small (shallow) decision trees, due to Ross Quinlan. A brief aside on information theory: suppose I am trying to transmit to you the result of n flips of a very biased coin (probability of heads = 0.9, probability of tails = 0.1.) I could do so by sending you n bits, one bit representing the result of each coin flip. Is this the best I can do? Suppose n is even and I decide to represent the 4 possibilities for two consecutive flips, which, with their probabilities of occurrence are: HH (0.81), HT (0.09), TH (0.09), TT (0.01). I could choose a prefix free code for these possibilities as follows: (0->HH), (10->HT), (110->TH), and (111->TT). Being prefix free, this code can be uniquely decoded, so that 11001010111 is parsed as 110,0,10,10,111, which represents the coin flip sequence: TH,HH,HT,HT,TT. With this representation, what is the expected number of bits I will have to send you to transmit n coin flips? It will be (n/2) times the expected code length for a single pair of flips, which is 1*(0.81) + 2*(0.09) + 3*(0.10) = 1.29. That is, each pair of flips can be transmitted using an expected 1.29 bits, so the n flips can be transmitted using an expected 0.645*n bits, instead of the "naive" n bits. Is this the best? Well, no. The best is given asymptotically by the entropy (or information) of the distribution (0.9, 0.1), which yields about 0.469 bits per flip. Define H(p) = (p log_2(1/p)) + ((1-p) log_2(1/(1-p))). This is the entropy of the distribution (p, 1-p). It tells us how many bits per flip we'd need to transmit a sequence of coin flips biased as (p, 1-p). It is (by convention) 0 when p = 0 or p = 1. It reaches a unique maximum of 1 at p = 1/2; if the coin is fair, we may as well use the naive method of one bit per flip. It is symmetric around p = 1/2. 
More generally, we can define the entropy of a distribution (p_1, p_2, ..., p_k) as sum(i=1,k)p_i*log_2(1/p_i). This is the expected length (in bits) of the optimal coding of a single message in a sequence generated by picking (independently) a message m_i to transmit with probability p_i. (Note that since log(1/p) = -log(p), this can be written in an equivalent form with minus signs and no division, but the minus signs are really confusing, since everything is nonnegative!) (See information theory (Shannon) and arithmetic coding to follow up on these topics.)
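
The arithmetic in this aside can be checked with a few lines of Python. This is only a sketch: the names entropy, code, pair_prob, and decode are chosen here for illustration and do not come from the lecture.

```python
from math import log2

def entropy(probs):
    """Entropy in bits of a discrete distribution; terms with p = 0 contribute 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

# The prefix-free code for pairs of flips used in the notes.
code = {"HH": "0", "HT": "10", "TH": "110", "TT": "111"}
flip_prob = {"H": 0.9, "T": 0.1}

def pair_prob(pair):
    """Probability of a pair of independent flips, e.g. 'HT'."""
    return flip_prob[pair[0]] * flip_prob[pair[1]]

# Expected code length for one pair of flips.
expected_bits_per_pair = sum(pair_prob(s) * len(c) for s, c in code.items())

def decode(bits):
    """Decode a bit string by matching codewords left to right."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:   # prefix-freeness makes the first match the only one
            out.append(inverse[current])
            current = ""
    return out

print(round(expected_bits_per_pair, 2))      # bits per pair of flips
print(round(expected_bits_per_pair / 2, 3))  # bits per flip with this code
print(round(entropy([0.9, 0.1]), 3))         # entropy bound, bits per flip
print(decode("11001010111"))
```

The three printed numbers should agree with the 1.29, 0.645, and 0.469 quoted above, and the decoded list with the sequence TH,HH,HT,HT,TT.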

Now we can look at the application of information theory in Quinlan's ID3 algorithm for constructing decision trees. In the example we did, the initial division of yes and no labels is 9 Y and 5 N. If we treat this as a distribution in which the probability of Y is 9/14 and the probability of N is 5/14, we can apply the reasoning above to say that the information required to specify the label of an instance randomly drawn from this distribution is H(9/14,5/14) bits, which is approximately 0.940 bits. If we consider the effect of placing the single attribute "outlook" at the root of our decision tree, then the result divides our original collection of examples into three parts: (2 Y, 3 N), (4 Y), and (3 Y, 2 N). Again, looking at these as distributions, we can investigate the information required to specify an example's label in each of the three subpopulations: H(2/5,3/5), H(4/4,0/4), H(3/5,2/5), respectively, or approximately, 0.971, 0.0, and 0.971 bits. Assuming that instances are drawn from these populations with a probability equal to the empirical frequency, that is, 5/14 for the first node, 4/14 for the second node, and 5/14 for the third node, we can calculate the expected number of bits to specify an example's label as (5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971), or approximately 0.694 bits. Thus, by testing the "outlook" attribute, we have gone from needing 0.940 bits to needing 0.694 bits to specify the label of an instance, a *gain* of 0.246 bits. If we do a similar calculation for the "humidity" attribute, we divide the original population into two parts: (3 Y, 4 N), and (6 Y, 1 N), with entropies of 0.985 and 0.592 bits respectively, which when combined yield an expected number of bits of 0.788, for a gain of 0.152 bits, smaller than the gain for the "outlook" attribute. We do a similar calculation for each attribute and pick the one with the largest information gain for the root.
Then we treat separately and recursively the training examples that reach each of the children of the root.
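
The gain calculation can be reproduced from the (Y, N) counts alone. The following is a minimal sketch; the function names entropy_of_counts and information_gain are hypothetical, and only the counts are taken from the example.

```python
from math import log2

def entropy_of_counts(counts):
    """Entropy in bits of the empirical distribution given by a tuple of counts."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Entropy at the parent minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    remainder = sum(sum(ch) / total * entropy_of_counts(ch) for ch in child_counts)
    return entropy_of_counts(parent_counts) - remainder

root = (9, 5)                          # (Y, N) counts at the root
outlook = [(2, 3), (4, 0), (3, 2)]     # children after testing "outlook"
humidity = [(3, 4), (6, 1)]            # children after testing "humidity"

print(round(entropy_of_counts(root), 3))
print(round(information_gain(root, outlook), 3))
print(round(information_gain(root, humidity), 3))
```

Carried out without intermediate rounding, the gain for "outlook" comes to about 0.247 rather than 0.246; the difference arises only from subtracting the already rounded values 0.940 and 0.694.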

When does the recursion stop? Clearly it may stop when all the class labels of training examples associated with a node are equal, and the class label is the common value. It may also stop when all the instances in a node agree on all the attributes that may be tested (although the class labels may still disagree, in the case of noise and/or insufficiency of the attributes to determine the class label.) In this case, the recursion stops with a class label equal to the majority (or plurality) value of the instances in the node. What about overfitting? The recommended course is to build a tree that possibly overfits, and then to prune the tree using estimations of the effect on the generalization performance. This seems to work better than proactively trying to avoid building the parts that are later to be pruned. What about missing values? It depends partly on whether the domain supports treating "missing" as another potentially informative value for the corresponding attribute, or whether it is more reasonably thought of as a kind of random noise. In the second case, one can introduce fractional instances corresponding to the overall fraction of values of the given attribute, and send appropriate fractions of instances along the corresponding value edges. What about numeric values? One approach is to introduce binary splits such as "is age > 46?" based on information-theoretic considerations.
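
The recursion and its two stopping rules can be sketched as follows. This is an ID3-style illustration only, with no pruning and no handling of missing or numeric values; the six-instance data set and the attribute names "windy" and "humid" are made up for the sketch and are not the weather data from the lecture.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain from splitting the instances on attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    # Stopping rule 1: all class labels at this node are equal.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping rule 2: the instances agree on every attribute still available,
    # so no test can separate them; return the majority (plurality) label.
    usable = [a for a in attrs if len({row[a] for row in rows}) > 1]
    if not usable:
        return Counter(labels).most_common(1)[0][0]
    best = max(usable, key=lambda a: gain(rows, labels, a))
    children = {}
    for value in sorted({row[best] for row in rows}):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep],
                              [a for a in attrs if a != best])
    return (best, children)

# Hypothetical data; the last instance is a noisy copy of the two before it.
rows = [
    {"windy": "no", "humid": "low"},
    {"windy": "yes", "humid": "low"},
    {"windy": "no", "humid": "high"},
    {"windy": "yes", "humid": "high"},
    {"windy": "yes", "humid": "high"},
    {"windy": "yes", "humid": "high"},
]
labels = ["Y", "Y", "N", "N", "N", "Y"]

tree = id3(rows, labels, ["windy", "humid"])
print(tree)
```

A leaf is represented by a bare label and an internal node by a pair (attribute, {value: subtree}). On this data the node reached by humid = high, windy = yes holds identical instances with labels N, N, Y, so the second stopping rule applies and the leaf is labeled N.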

One improvement introduced with C4.5 was to use *gain ratio* instead of information gain as the criterion by which to choose a root attribute. The problem is that the information gain overly favors attributes that have many values. As an extreme example, if we test the "id" attribute of the weather data, we get a 14-way split into perfectly "pure" singleton nodes. Thus, the information gain in this case is the whole 0.940 bits that we started with (since 0 bits are required after testing the attribute.) The heuristic fix for this is to divide the information gain by the entropy associated with the distribution of examples into the children associated with the attribute's values. For example, considering the "outlook" attribute at the root, testing it produces the distribution (5/14,4/14,5/14), which has an entropy of about 1.58 bits. The gain ratio of "outlook" is its gain divided by this entropy, or about 0.156. For the "humidity" attribute at the root, the corresponding distribution is (7/14,7/14), which has an entropy of exactly 1 bit. Thus, the gain ratio of "humidity" is 0.152, making it a much closer second to "outlook" with respect to this measure.
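
The gain ratio figures can be checked the same way. Again this is a sketch with hypothetical function names (entropy_of_counts, information_gain, split_entropy, gain_ratio), working only from the (Y, N) counts in the example.

```python
from math import log2

def entropy_of_counts(counts):
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    total = sum(parent_counts)
    remainder = sum(sum(ch) / total * entropy_of_counts(ch) for ch in child_counts)
    return entropy_of_counts(parent_counts) - remainder

def split_entropy(child_counts):
    """Entropy of the distribution of examples over the children (labels ignored)."""
    return entropy_of_counts([sum(ch) for ch in child_counts])

def gain_ratio(parent_counts, child_counts):
    return information_gain(parent_counts, child_counts) / split_entropy(child_counts)

root = (9, 5)
outlook = [(2, 3), (4, 0), (3, 2)]
humidity = [(3, 4), (6, 1)]
ident = [(1, 0)] * 9 + [(0, 1)] * 5    # the 14-way "id" split into singletons

for name, split in [("outlook", outlook), ("humidity", humidity), ("id", ident)]:
    print(name, round(split_entropy(split), 3), round(gain_ratio(root, split), 3))
```

One thing the calculation brings out: for the "id" split the dividing entropy is log_2(14), about 3.81 bits, so its gain ratio is about 0.247. That is far below its raw gain of 0.940 but still above the ratio for "outlook", so on these counts the heuristic reduces the bias toward many-valued attributes without removing it.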

We also started looking at the Perceptron training algorithm of Rosenblatt (1958), which will be described in the next lecture.

11/3/05 Lecture 19. Perceptron Training, Support Vector Machines.

The Perceptron training algorithm solves the following problem. We are given a training set {(x_1, y_1), ..., (x_m,y_m)}, where each x_i is a point (equivalently, vector) in R^d with length equal to 1, each y_i is a label from {+1, -1}, and there exists a vector w* of length 1 such that y_i = sign(w*x_i) for i=1,...,m. That is, the points labeled +1 are linearly separable from the points labeled -1. The goal is to find a vector w that linearly separates the +1 instances from the -1 instances. Recall that we discussed this problem before, and noted that it could be solved using linear programming, by treating the entries of w as the unknowns, and deriving the linear inequality constraints from the labeled examples, one constraint per example. The Perceptron training algorithm, by contrast, is an iterative updating method to solve this problem.

We phrase the Perceptron training algorithm in the on-line mistake bound model, assuming that we repeatedly cycle through the examples in the training set until the algorithm stops making mistakes. Initially the hypothesis vector w is 0. On input x_i, the algorithm predicts the sign of wx_i (that is, it calculates the dot product of w and x_i, and predicts +1 if it is positive, and -1 otherwise.) If the prediction is incorrect (that is, not equal to y_i), then the hypothesis w is updated by adding y_ix_i (that is, adding x_i if y_i = +1 and subtracting x_i if y_i = -1.) We did a 2-dimensional example, and noted that the vector w seemed to grow longer and longer, while at the same time the angle between w and w* seemed to get closer and closer to 0.
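
The procedure just described can be sketched in a few lines. The four unit-length training points below are made up for illustration (they are not the 2-dimensional example done in class), and the max_passes cutoff is an added safeguard for data that is not separable, not part of the algorithm as described.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(examples, max_passes=1000):
    """examples: list of (x, y) with x a tuple of floats and y in {+1, -1}.

    Cycles through the examples until a full pass makes no mistakes.
    Returns the final hypothesis w and the total number of mistakes.
    """
    w = [0.0] * len(examples[0][0])      # initial hypothesis is the zero vector
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, y in examples:
            prediction = 1 if dot(w, x) > 0 else -1
            if prediction != y:
                # Update: add x if y = +1, subtract x if y = -1.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            return w, mistakes
    raise RuntimeError("no consistent w found within max_passes")

# Hypothetical unit-length points, separable by the first coordinate.
examples = [((0.6, 0.8), 1), ((0.8, -0.6), 1),
            ((-0.6, 0.8), -1), ((-0.8, -0.6), -1)]

w, mistakes = perceptron(examples)
print(w, mistakes)
```

Because a dot product of exactly 0 is predicted as -1, the very first positive example always causes a mistake when w starts at 0.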

Why should this work? If we look at w*w, the dot product of the "correct" separator w* and the current hypothesis w, we see that it is equal to |w|(cos A), where |w| is the length of w and A is the angle between w and w*. How much can |w| increase with each mistake? Look at |w|^2, which is the dot product of w and w. Let w be the current hypothesis and w' the result of updating w after a mistake. After a mistake, we have w' = w + y_ix_i, so |w'|^2 = (w + y_ix_i)(w + y_ix_i) = |w|^2 + 2y_iwx_i + |x_i|^2. (This follows from the linearity of the dot product, together with y_i^2 = 1.) Now we can use two facts: (1) the length of each x_i is 1, that is, |x_i|^2 = 1, and (2) wx_i is either 0 or has a different sign from y_i, because a mistake was made on this example, so the term 2y_iwx_i is nonpositive. Hence, |w'|^2 is at most |w|^2 + 1. This limits the growth rate of |w| -- after M mistakes, it can be at most sqrt(M).

To show that we make progress on reducing the angle A between w and w*, we make an assumption: that the *geometric margin* of the training set is s > 0. The geometric margin of an example (x_i,y_i) is the distance of x_i from the separator w*, which is the length of the projection of x_i onto w*, that is, |w*x_i|, the absolute value of the dot product of w* and x_i. (Review dot product and projections, recalling that w* has length 1.) Taking s to be the minimum of the geometric margins of the examples (x_i,y_i), we assume s > 0. Now we claim that w*w (the dot product of w* and w) increases by at least s with every mistake. Letting w' = w + y_ix_i be the updated value after a mistake, we have w*w' = w*(w + y_ix_i) = w*w + y_iw*x_i = w*w + |w*x_i|. (Note that y_i and w*x_i must have the same sign since w* classifies this example correctly, so y_iw*x_i must be the absolute value of w*x_i.) By assumption, |w*x_i| is at least s > 0, so w*w' is at least w*w + s. Hence, after M mistakes, w*w is at least Ms.

Putting these two bounds together, after M mistakes, |w| is at most sqrt(M), and w*w is at least Ms and at most |w| (because w*w = |w|(cos A) and cos A is at most 1), so Ms <= sqrt(M), that is, sqrt(M) <= 1/s, which implies that M <= 1/s^2. Thus the total number of mistakes made before the Perceptron training algorithm constructs a w consistent with the sample is bounded by the inverse of the square of the geometric margin. Note that the updating is reminiscent of Winnow, except that it is additive instead of multiplicative. Other modifications of the algorithm incorporate a *learning rate* between 0 and 1, to reduce the effect of any single training example on the hypothesis. The assumption about unit length vectors can be removed by normalizing each example (x_i, y_i) by dividing x_i by its length -- this does not affect the direction of the vector, only its length, and preserves separability. If the separator is of the form w*x >= b, then we can add a dimension to the examples, extending them by a constant coordinate = 1, so that the learning algorithm also learns the threshold (as the coefficient of the new dimension.)
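
The two invariants and the resulting bound can be checked numerically on a small case. The sketch below assumes a hypothetical separator w* = (1, 0) and four made-up unit-length points, and asserts after every mistake that |w|^2 <= M and w*w >= Ms; the small tolerances only absorb floating-point rounding.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def run_and_check(examples, w_star):
    """Run Perceptron training, checking both invariants after each mistake.

    Returns (w, M, s): final hypothesis, number of mistakes, geometric margin.
    Assumes the examples have unit length and are separated by w_star.
    """
    s = min(abs(dot(w_star, x)) for x, _ in examples)   # geometric margin
    w = [0.0] * len(w_star)
    M = 0
    changed = True
    while changed:
        changed = False
        for x, y in examples:
            prediction = 1 if dot(w, x) > 0 else -1
            if prediction != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                M += 1
                changed = True
                assert dot(w, w) <= M + 1e-9           # |w|^2 grows by at most 1
                assert dot(w_star, w) >= M * s - 1e-9  # w*w grows by at least s
    return w, M, s

# Hypothetical data: unit-length points separated by w* = (1, 0), margin 0.6.
w_star = (1.0, 0.0)
examples = [((0.6, 0.8), 1), ((0.8, -0.6), 1),
            ((-0.6, 0.8), -1), ((-0.8, -0.6), -1)]

w, M, s = run_and_check(examples, w_star)
print(M, 1 / s**2)   # number of mistakes and the bound 1/s^2
```

With margin 0.6 the bound 1/s^2 is about 2.78, so at most 2 mistakes are possible on this data.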

We began looking at a "user's view" of Support Vector Machines, which will be covered in the next lecture.

11/8/05 Lecture 20. Support Vector Machines. (See the survey by Cristianini and Scholkopf.)
11/10/05 Lecture 21. Fast Online Active Learning, a guest lecture by Claire Monteleoni from MIT. (See her COLT 05 paper.)
11/15/05 Lecture 22. Bayesian Approaches to Learning. (See Chapter 6 of Mitchell's text.)
11/17/05 Lecture 23. Bayesian Approaches to Learning, concluded. Expectation Maximization.
11/29/05 Lecture 24. Markov chains, hidden Markov models, the forward/backward algorithm, and using Expectation Maximization to estimate missing parameters in a HMM.
12/1/05 Lecture 25. Reinforcement learning: see Mitchell, Chapter 13.

Last modified: 7 December 2005