Exam #2 Practice
Problem:
Consider the multi-armed bandit problem with three arms whose payoffs are
determined by the outcomes of rolling a fair 6-, 8-, or 10-sided die. Show
that the greedy strategy does not achieve zero average regret for this setup.
Suppose the first roll of the 10-sided die is a 1 and at least one of the
other dice rolls something greater than 1. Then the greedy strategy never
chooses the 10-sided die again: every roll pays at least 1, so the best
other arm's observed average stays strictly above 1, while the 10-sided
die's average remains exactly 1. Since the 10-sided die has the highest
expected payoff (5.5, versus 4.5 and 3.5 for the others), each subsequent
roll incurs expected regret of at least 1, and so the limit of the average
regret is non-zero. That happens with probability
$\frac{1}{10} \cdot (1 - \frac{1}{8}\cdot\frac{1}{6}) > 0$,
so the probability that the average regret goes to zero in the limit is
less than 1.
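One way to see this numerically is to simulate the greedy strategy (play
each arm once, then always play the arm with the best observed average).
The sketch below is a minimal simulation; the tie-breaking rule and the
roll and run counts are arbitrary choices, not part of the problem.

```python
import random

def greedy_average_regret(num_rolls=100_000):
    """Simulate greedy play on 6-, 8-, and 10-sided dice and return the
    average regret per roll relative to the best arm (the d10, mean 5.5)."""
    sides = [6, 8, 10]
    totals = [0.0] * 3
    counts = [0] * 3
    best_mean = max((s + 1) / 2 for s in sides)  # 5.5 for the d10
    payoff_sum = 0.0
    for t in range(num_rolls):
        if t < 3:
            arm = t  # try each arm once
        else:
            arm = max(range(3), key=lambda i: totals[i] / counts[i])
        roll = random.randint(1, sides[arm])
        totals[arm] += roll
        counts[arm] += 1
        payoff_sum += roll
    return best_mean - payoff_sum / num_rolls

# Averaged over many runs, the regret stays bounded away from zero
# because some runs abandon the 10-sided die after its first roll.
print(sum(greedy_average_regret() for _ in range(200)) / 200)
```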
Problem:
Describe the modifications that would be necessary to standard MCTS for
two-player Yahtzee.
The tree policy would have to account for the stochastic element. At
random-event positions (the dice rolls), the tree policy could sample the
next position according to the probability distribution of that event
rather than maximizing a UCB score, as in the sketch below.
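A minimal sketch of that modification, assuming a node type with an
`is_chance` flag, an `outcomes()` method returning (child, probability)
pairs, and the usual visit and reward fields; none of these names come
from a particular MCTS implementation.

```python
import math
import random

def ucb_score(child, c=1.4):
    """Standard UCB1 value for a child of a non-chance node."""
    if child.visits == 0:
        return float("inf")
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore

def descend(node):
    """One step of the tree policy with chance nodes mixed in."""
    if node.is_chance:
        # A dice roll: sample the next position according to the
        # event's probability distribution instead of using UCB.
        children, probs = zip(*node.outcomes())
        return random.choices(children, weights=probs)[0]
    return max(node.children, key=ucb_score)
```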
Problem:
Pick two enhancements to standard MCTS and describe how they might be
applied to NFL Strategy, Kalah, or Yahtzee.
Predicate Average Sampling Technique (PAST): for NFL Strategy, use the
predicates "trailing by more than 21 points" and "leading by more than 21
points" and keep track of the success of each play under those predicates
during random playouts.
Then, in positions matching a predicate, bias playout move selection
toward the plays that have done better under it (sketched below).
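A sketch of how those playout statistics might be kept and used; the
predicate names and the `score_margin` attribute are invented here for
illustration.

```python
from collections import defaultdict
import random

# (predicate, play) -> [successes, trials]; seeded with one trial to
# avoid division by zero for plays not yet seen under a predicate.
past_stats = defaultdict(lambda: [0, 1])

def active_predicates(position):
    preds = []
    if position.score_margin < -21:   # trailing by more than 21 points
        preds.append("trailing_21")
    elif position.score_margin > 21:  # leading by more than 21 points
        preds.append("leading_21")
    return preds

def playout_play(position, plays):
    """Pick a playout move, biased toward plays that have succeeded
    under the predicates matching the current position (PAST-style)."""
    preds = active_predicates(position)
    if not preds:
        return random.choice(plays)
    def success_rate(play):
        wins = sum(past_stats[(p, play)][0] for p in preds)
        tries = sum(past_stats[(p, play)][1] for p in preds)
        return wins / tries
    # The small constant keeps every play possible; otherwise a play
    # that never succeeded would never be sampled again.
    weights = [0.1 + success_rate(play) for play in plays]
    return random.choices(plays, weights=weights)[0]
```

After each playout, the success counters for every (predicate, play) pair
encountered along the way would be updated with the playout's result.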
Search Seeding: use a heuristic function
for Kalah (perhaps our simple score-margin heuristic) to
initialize the visit count and reward statistics of newly expanded
nodes, in the hope of reducing the time spent discovering
that bad moves are in fact bad.
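One way that seeding might look, with a self-contained `Node` and an
invented stand-in for the score-margin heuristic; the state API
(`legal_moves()`, `apply()`, seed counts) is assumed, not taken from any
particular Kalah implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    move: object = None
    parent: "Node | None" = None
    visits: int = 0
    total_reward: float = 0.0
    children: list = field(default_factory=list)

def score_margin_heuristic(state):
    """Stand-in: seed-count margin normalized to [-1, 1] (48 seeds total)."""
    return max(-1.0, min(1.0, (state.my_seeds - state.opp_seeds) / 48))

def expand(node, state, prior_visits=10):
    """Create children of `node`, seeding each as if the heuristic value
    were the average reward of `prior_visits` playouts. Bad moves then
    start with low averages and receive fewer real playouts."""
    for move in state.legal_moves():
        child = Node(move=move, parent=node)
        h = score_margin_heuristic(state.apply(move))
        child.visits = prior_visits
        child.total_reward = prior_visits * (h + 1) / 2  # rewards in [0, 1]
        node.children.append(child)
```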
Problem:
Consider using coevolution to optimize a set of numeric parameters in a
heuristic for chess. Describe some ways to get useful information from
playing two members of the population against each other multiple times
when, ordinarily, the resulting players are deterministic and each game
would follow the same sequence of moves.
As described in Genetic Algorithms for Evolving Computer Chess Programs
(O. David et al.; see the reading list), we can play multiple games with
randomly selected opening move sequences, so the deterministic players
face a different position in each game.
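A sketch of the tournament loop, written against an assumed game
interface (`new_game()`, `legal_moves()`, `play()`, `copy()`) and a
`play_out` helper, all invented for illustration; playing each random
opening once from each side cancels any bias in the opening itself.

```python
import random

def compare(player_a, player_b, new_game, play_out,
            num_games=20, opening_plies=6):
    """Return player_a's average score in [0, 1] over paired games,
    each starting from a fresh random opening sequence.
    `play_out(game, white, black)` is assumed to finish the game and
    return white's score (1 win, 0.5 draw, 0 loss)."""
    total = 0.0
    for _ in range(num_games):
        game = new_game()
        for _ in range(opening_plies):  # random opening sequence
            game.play(random.choice(game.legal_moves()))
        # Play the same opening from both colors: the deterministic
        # players see a different position each game, but neither gets
        # a systematically better start.
        total += play_out(game.copy(), player_a, player_b)
        total += 1.0 - play_out(game.copy(), player_b, player_a)
    return total / (2 * num_games)
```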
Problem:
Describe how you would set up the inputs to a neural network that chooses plays for
NFL Strategy. Describe how you would use the outputs to choose a play.
Inputs:
- distance to goal line, normalized to $[0.0, 1.0)$
- time left in quarter, normalized to $[0.0, 1.0)$
- current quarter (1.0 = 1st, 0.66 = 2nd, 0.33 = 3rd, 0.0 = 4th)
- current down (1.0 = 1st, 0.66 = 2nd, 0.33 = 3rd, 0.0 = 4th)
- yards to first down divided by 10 (this one is interesting: we need some normalization, but dividing by 100 would compress the most frequently seen values into a small range; normalizing to standard deviations from the mean observed during actual play may work better)
- score margin divided by 20 (again, the best normalization may be determined by experimentation)
- (and field position as -1 for left, 0 for center, 1 for right, if you wish to include that feature of the commercial game)
Outputs:
- one output for each play, interpreted as the probability of selecting
that play (so the output layer should use a softmax activation function)
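A sketch of the encoding and the play-selection step; the normalization
constants are the illustrative ones from the lists above, not tuned
values, and the function names are invented here.

```python
import numpy as np

def encode(dist_to_goal, secs_left, quarter, down, yards_to_first, margin):
    """Pack the game state into the input vector described above."""
    return np.array([
        dist_to_goal / 100.0,      # yards to the goal line, in [0, 1)
        secs_left / 900.0,         # time left in a 15-minute quarter
        (4 - quarter) / 3.0,       # 1st=1.0, 2nd=0.66..., 4th=0.0
        (4 - down) / 3.0,          # same scheme for the down
        yards_to_first / 10.0,
        margin / 20.0,
    ], dtype=np.float32)

def choose_play(logits, rng=np.random.default_rng()):
    """Softmax over the network's output layer, then sample a play
    index in proportion to the resulting probabilities."""
    z = logits - logits.max()              # subtract max for stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))
```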
Problem:
How did DeepMind address the issue of overfitting when training the value network they
used in AlphaGo?
The value network was trained on positions from games of self-play by the
policy network obtained after reinforcement learning.
Instead of training on every position reached during every game, they
trained on a single position sampled from each of 30 million distinct
games, since successive positions within one game are strongly correlated.
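A sketch of that sampling step, assuming the self-play data is an
iterable of (positions, outcome) pairs; the names are invented here.

```python
import random

def sample_training_pairs(games):
    """Take one (position, outcome) pair per game. Successive positions
    within a single game are strongly correlated, so training on all of
    them would let the network memorize whole games and overfit."""
    return [(random.choice(positions), outcome)
            for positions, outcome in games]
```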