Problem : Compute the minimax value of each node in the tree below. Squares represent max nodes and circles represent min nodes.

[Figure: minimax tree]
Top-to-bottom, left-to-right (levels separated by semicolons), the minimax values are 7; 6, 7, 4; 9, 6, 9, 7, 6, 7, 4.
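For reference, a minimal minimax sketch in Python, assuming the tree is given as nested lists (a leaf is a number, an internal node is a list of its children) and that the root is a max node; the tree in the figure is not reproduced here, and the example call uses a made-up tree.

```python
def minimax(node, maximizing=True):
    """Minimax value of a nested-list game tree (leaves are payoffs)."""
    if not isinstance(node, list):
        return node                      # leaf: its payoff is its value
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Hypothetical example: max(min(3, 5), min(2, 9)) = 3
print(minimax([[3, 5], [2, 9]]))
```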

Problem : Show the operation of alpha-beta pruning on the tree shown below. Show the (alpha, beta) windows passed to each node as the tree is traversed, the values returned from each node, and which branches are pruned.

[Figure: alpha-beta tree]
Labelling the nodes A through K top-to-bottom, left-to-right, the $(\alpha, \beta)$ windows passed to each node and the values returned are
Node  $(\alpha, \beta)$  Value
A $(-\infty, \infty)$ 7
B $(-\infty, \infty)$ 4
C $(4, \infty)$ 7
D $(7, \infty)$ 6
E $(-\infty, \infty)$ 4
F $(-\infty, 4)$ [prunes 2nd child] 6
G $(-\infty, 4)$ [prunes 2nd child] 7
H $(4, \infty)$ 7
I $(4, 7)$ [prunes 2nd, 3rd children] 9
J $(7, \infty)$ 9
K $(7, 9)$ 6
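This trace can be generated by fail-soft alpha-beta; below is a minimal sketch, assuming the same nested-list tree representation as in the minimax sketch above. The $(\alpha, \beta)$ arguments of each recursive call are the windows recorded in the table, and the pruned branches correspond to the `break`.

```python
def alpha_beta(node, alpha, beta, maximizing):
    """Fail-soft alpha-beta on a nested-list tree (a leaf is a number,
    an internal node is a list of children)."""
    if not isinstance(node, list):
        return node                                # leaf: exact payoff
    best = float("-inf") if maximizing else float("inf")
    for child in node:
        v = alpha_beta(child, alpha, beta, not maximizing)
        if maximizing:
            best = max(best, v)
            alpha = max(alpha, best)
        else:
            best = min(best, v)
            beta = min(beta, best)
        if alpha >= beta:
            break                                  # remaining children are pruned
    return best
```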

Problem : Show the operation of Scout on the tree from the previous problem. Show the (alpha, beta) windows passed to each node as the tree is traversed, the values returned from each node, and which branches are pruned.

Labelling the nodes A through K top-to-bottom, left-to-right, the $(\alpha, \beta)$ windows passed to each node and the values returned are listed below; a node that is searched more than once appears once for each window passed to it.
Node  $(\alpha, \beta)$  Value
A $(-\infty, \infty)$ 7
B $(-\infty, \infty)$ 4
C $(4, 5)$ 7
C $(7, \infty)$ [prunes 2nd child] 7
D $(7, 8)$ 6
E $(-\infty, \infty)$ 4
F $(3, 4)$ [prunes 2nd child] 6
G $(3, 4)$ [prunes 2nd child] 7
H $(4, 5)$ [prunes 2nd child] 7
H $(7, \infty)$ 7
I $(4, 5)$ [prunes 2nd, 3rd children] 9
J $(7, 8)$ [prunes 2nd, 3rd children] 9
K $(7, 8)$ 6
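A sketch of this Scout variant, assuming integer leaf values (so a null window one unit wide is enough for the tests) and reusing the `alpha_beta` function sketched in the previous problem for the null-window tests; a child is re-searched with a wider window only when its test succeeds.

```python
def scout(node, alpha, beta, maximizing):
    """Scout on a nested-list tree: search the first child with the full
    window, test each remaining child with a width-1 null window, and
    re-search only the children whose tests show they can change the value."""
    if not isinstance(node, list):
        return node                                        # leaf
    first, *rest = node
    best = scout(first, alpha, beta, not maximizing)
    for child in rest:
        if maximizing:
            if best >= beta:
                break                                      # remaining children pruned
            v = alpha_beta(child, best, best + 1, not maximizing)  # test: child > best?
            if v > best:                                   # test succeeded: re-search
                best = v if v >= beta else scout(child, v, beta, not maximizing)
        else:
            if best <= alpha:
                break
            v = alpha_beta(child, best - 1, best, not maximizing)  # test: child < best?
            if v < best:
                best = v if v <= alpha else scout(child, alpha, v, not maximizing)
    return best
```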

Problem : Show what would be stored in the transposition table after using alpha-beta pruning on the tree from above. Assume that the transposition table is initially empty and there are no collisions.

Labelling the nodes A through K top-to-bottom, left-to-right, the depths (the height of the subtree searched below each node) and bounds stored in the transposition table are
Node  Depth  Bound
A 3 =7
B 2 =4
C 2 =7
D 2 =6
E 1 =4
F 1 ≥6
G 1 ≥7
H 1 =7
I 1 ≥9
J 1 =9
K 1 =6
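One way these entries could be produced is a fail-soft alpha-beta that propagates a bound type along with each value; the `label`, `depth`, and `table` arguments below are bookkeeping assumed for this sketch. A node stores a lower bound when its remaining children are pruned at a max node, an upper bound in the mirror case at a min node, and otherwise inherits the bound of the child that supplied its value.

```python
def alpha_beta_tt(node, alpha, beta, maximizing, depth, table, label="A"):
    """Fail-soft alpha-beta that stores (depth, bound, value) for every
    internal node.  bound is ">=" when the node's remaining children are
    pruned at a max node, "<=" in the mirror case at a min node, and
    otherwise the bound inherited from the child that supplied the value."""
    if not isinstance(node, list):
        return node, "="                                   # leaf values are exact
    best, bound = None, "="
    for i, child in enumerate(node):
        if best is not None and (best >= beta if maximizing else best <= alpha):
            bound = ">=" if maximizing else "<="           # remaining children pruned
            break
        v, child_bound = alpha_beta_tt(child, alpha, beta, not maximizing,
                                       depth - 1, table, f"{label}.{i}")
        if best is None or (v > best if maximizing else v < best):
            best, bound = v, child_bound
        if maximizing:
            alpha = max(alpha, best)
        else:
            beta = min(beta, best)
    table[label] = (depth, bound, best)
    return best, bound
```

Run on the tree above with an empty dictionary and depth 3, this should reproduce the eleven entries in the table.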

Problem : Consider the multi-armed bandit problem with 3 arms whose payoffs are determined by the outcomes of rolling a fair 6-, 8-, or 10-sided die. Show that the greedy strategy does not achieve zero average regret for this setup.

If we roll a 1 on the 10-sided die and at least one of the other dice rolls something other than a 1, then that other arm's observed average stays above 1 forever, so the greedy strategy never chooses the 10-sided die (the arm with the highest expected payoff, 5.5) again. The regret on each subsequent pull is then at least $5.5 - 4.5 = 1$, so the limit of the average regret is non-zero. That happens with probability $\frac{1}{10} \cdot (1 - \frac{1}{8}\cdot\frac{1}{6}) > 0$, so the probability that the average regret goes to zero in the limit is less than 1.
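A quick simulation of the argument (a sketch: it assumes the greedy strategy pulls each arm once before committing, breaks ties arbitrarily, and measures pseudo-regret against the 10-sided die's expected payoff of 5.5):

```python
import random

def greedy_average_regret(num_pulls=100000, seed=0):
    """Greedy play on three arms whose payoffs are rolls of fair 6-, 8-,
    and 10-sided dice: pull each arm once, then always pull the arm with
    the best observed mean.  Returns the average pseudo-regret per pull."""
    rng = random.Random(seed)
    sides = [6, 8, 10]
    means = [(s + 1) / 2 for s in sides]             # 3.5, 4.5, 5.5
    totals = [rng.randint(1, s) for s in sides]      # one initial pull of each arm
    counts = [1, 1, 1]
    regret = sum(max(means) - m for m in means)      # regret of the initial pulls
    for _ in range(num_pulls - 3):
        arm = max(range(3), key=lambda a: totals[a] / counts[a])   # greedy choice
        totals[arm] += rng.randint(1, sides[arm])
        counts[arm] += 1
        regret += max(means) - means[arm]
    return regret / num_pulls

# With positive probability the d10's first roll is a 1 while another arm
# rolls higher; greedy then never tries the d10 again and the average
# regret converges to a positive constant instead of zero.
print(greedy_average_regret())
```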

Problem : Describe modifications that would be necessary to standard MCTS for 2-player Yahtzee.

The tree policy would have to account for the stochastic element. At random event positions, the tree policy could randomly choose the next position according to the probability distribution of that event.
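A sketch of such a selection step; the `Node` fields used here (`is_chance`, per-child `probabilities`, visit and reward counts) are assumptions of this illustration, not part of any particular Yahtzee implementation.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    """Illustrative MCTS node: chance nodes carry the probability of each
    child; decision nodes carry the usual visit/reward statistics."""
    is_chance: bool = False
    children: list = field(default_factory=list)
    probabilities: list = field(default_factory=list)   # chance nodes only
    visits: int = 0
    total_reward: float = 0.0

def select_child(node, c=math.sqrt(2)):
    """One step of the tree policy.  At a chance node (a dice roll) the
    child is sampled from the event's distribution; at a decision node the
    standard UCB1 rule is applied, with unvisited children tried first."""
    if node.is_chance:
        return random.choices(node.children, weights=node.probabilities)[0]
    log_n = math.log(max(1, node.visits))
    return max(node.children,
               key=lambda ch: float("inf") if ch.visits == 0
               else ch.total_reward / ch.visits + c * math.sqrt(log_n / ch.visits))
```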

Problem : Pick two enhancements to standard MCTS and describe how they might be applied to NFL Strategy, Kalah, or Yahtzee.

Predicate Average Sampling Technique: for NFL Strategy, use predicates such as "trailing by more than 21 points" and "leading by more than 21 points" and keep track of the success of each play in positions satisfying those predicates during random playouts. Then, when a position matches one of the predicates, bias play selection towards the plays that have been more successful under it.
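A sketch of the bookkeeping this would need; the predicate names, the use of the score margin as the only state feature, and the proportional bias are all illustrative choices rather than part of the NFL Strategy code.

```python
from collections import defaultdict
import random

# Illustrative predicates keyed on the score margin (our score minus theirs);
# the real NFL Strategy state and play identifiers would come from the game.
PREDICATES = {
    "trailing_by_more_than_21": lambda margin: margin < -21,
    "leading_by_more_than_21":  lambda margin: margin > 21,
}

past_stats = defaultdict(lambda: [0, 0])      # (predicate, play) -> [wins, tries]

def record_playout_play(margin, play, won):
    """After a playout, credit the play under every predicate that held
    when it was chosen."""
    for name, pred in PREDICATES.items():
        if pred(margin):
            past_stats[(name, play)][0] += int(won)
            past_stats[(name, play)][1] += 1

def biased_playout_play(margin, legal_plays, rng=random):
    """During a playout, weight plays by their success rate under the
    predicates matching the current position (uniform when no data yet)."""
    def weight(play):
        w = 1.0
        for name, pred in PREDICATES.items():
            if pred(margin):
                wins, tries = past_stats[(name, play)]
                if tries:
                    w += wins / tries
        return w
    return rng.choices(legal_plays, weights=[weight(p) for p in legal_plays])[0]
```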

Search Seeding: use a heuristic function for Kalah (perhaps our simple score margin heuristic) to initialize the visit count and reward statistics for newly expanded nodes, in the hope of reducing the time spent determining that bad moves are in fact bad.
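A sketch of the seeding step, assuming playout rewards in [0, 1] and the standard 48-seed Kalah board; the statistics fields, the prior of 10 visits, and the exact score-margin heuristic are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ChildStats:
    """Minimal per-child MCTS statistics (illustrative)."""
    visits: int = 0
    total_reward: float = 0.0

def kalah_margin_heuristic(my_store, opp_store, total_seeds=48):
    """Hypothetical score-margin heuristic squashed into [0, 1]."""
    return 0.5 + (my_store - opp_store) / (2 * total_seeds)

def seed_child(heuristic_value, prior_visits=10):
    """Search seeding: treat a newly expanded child as if it had already
    been visited prior_visits times, each time receiving the heuristic's
    estimate as its reward, so clearly bad moves start with low averages."""
    return ChildStats(visits=prior_visits,
                      total_reward=prior_visits * heuristic_value)

# e.g. a child whose position has us ahead 12 seeds to 8 in the stores:
child = seed_child(kalah_margin_heuristic(12, 8))
```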

Problem : Describe how you would set up the inputs to a neural network that chooses plays for NFL Strategy. Describe how you would use the outputs to choose a play.

Inputs:

Outputs:

Problem : How did DeepMind address the issue of overfitting when training the value network they used in AlphaGo?

The value network was trained on positions from games of self-play by the policy network obtained after reinforcement learning. Instead of training on every position reached during every game, they trained on a single position sampled from each of 30 million distinct games; successive positions within a single game are strongly correlated, so training on all of them would amount to fitting nearly the same position many times and lead to overfitting.