Problem : Compute the minimax value of each node in the tree below. Squares represent max nodes and circles represent min nodes.

[Figure: minimax tree]
Top-to-bottom, left-to-right (levels separated by semicolons), the minimax values are 7; 6, 7, 4; 9, 6, 9, 7, 6, 7, 4.
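For reference, a minimal minimax sketch in Python, assuming the tree is given as nested lists (a leaf is a number, an internal node is a list of its children) and that the root is a max node; the tree in the figure is not reproduced here, and the example call uses a made-up tree.

```python
def minimax(node, maximizing=True):
    """Minimax value of a nested-list game tree (leaves are payoffs)."""
    if not isinstance(node, list):
        return node                      # leaf: its payoff is its value
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Hypothetical example: max(min(3, 5), min(2, 9)) = 3
print(minimax([[3, 5], [2, 9]]))
```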

Problem : Show the operation of alpha-beta pruning on the tree shown below. Show the (alpha, beta) windows passed to each node as the tree is traversed, the values returned from each node, and which branches are pruned.

[Figure: alpha-beta tree]
Labelling the nodes A through K top-to-bottom, left-to-right, the $(\alpha, \beta)$ windows passed to each node and the values returned are
Node  $(\alpha, \beta)$  Value
A $(-\infty, \infty)$ 7
B $(-\infty, \infty)$ 4
C $(4, \infty)$ 7
D $(7, \infty)$ 6
E $(-\infty, \infty)$ 4
F $(-\infty, 4)$ [prunes 2nd child] 6
G $(-\infty, 4)$ [prunes 2nd child] 7
H $(4, \infty)$ 7
I $(4, 7)$ [prunes 2nd, 3rd children] 9
J $(7, \infty)$ 9
K $(7, 9)$ 6
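This trace can be generated by fail-soft alpha-beta; below is a minimal sketch, assuming the same nested-list tree representation as in the minimax sketch above. The $(\alpha, \beta)$ arguments of each recursive call are the windows recorded in the table, and the pruned branches correspond to the `break`.

```python
def alpha_beta(node, alpha, beta, maximizing):
    """Fail-soft alpha-beta on a nested-list tree (a leaf is a number,
    an internal node is a list of children)."""
    if not isinstance(node, list):
        return node                                # leaf: exact payoff
    best = float("-inf") if maximizing else float("inf")
    for child in node:
        v = alpha_beta(child, alpha, beta, not maximizing)
        if maximizing:
            best = max(best, v)
            alpha = max(alpha, best)
        else:
            best = min(best, v)
            beta = min(beta, best)
        if alpha >= beta:
            break                                  # remaining children are pruned
    return best
```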

Problem : Show the operation of Scout on the tree from the previous problem. Show the (alpha, beta) windows passed to each node as the tree is traversed, the values returned from each node, and which branches are pruned.

Labelling the nodes A through K top-to-bottom, left-to-right, the $(\alpha, \beta)$ windows passed to each node and the values returned are listed below; a node that is searched more than once appears once for each window passed to it.
Node  $(\alpha, \beta)$  Value
A $(-\infty, \infty)$ 7
B $(-\infty, \infty)$ 4
C $(4, 5)$ 7
C $(7, \infty)$ [prunes 2nd child] 7
D $(7, 8)$ 6
E $(-\infty, \infty)$ 4
F $(3, 4)$ [prunes 2nd child] 6
G $(3, 4)$ [prunes 2nd child] 7
H $(4, 5)$ [prunes 2nd child] 7
H $(7, \infty)$ 7
I $(4, 5)$ [prunes 2nd, 3rd children] 9
J $(7, 8)$ [prunes 2nd, 3rd children] 9
K $(7, 8)$ 6
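A sketch of this Scout variant, assuming integer leaf values (so a null window one unit wide is enough for the tests) and reusing the `alpha_beta` function sketched in the previous problem for the null-window tests; a child is re-searched with a wider window only when its test succeeds.

```python
def scout(node, alpha, beta, maximizing):
    """Scout on a nested-list tree: search the first child with the full
    window, test each remaining child with a width-1 null window, and
    re-search only the children whose tests show they can change the value."""
    if not isinstance(node, list):
        return node                                        # leaf
    first, *rest = node
    best = scout(first, alpha, beta, not maximizing)
    for child in rest:
        if maximizing:
            if best >= beta:
                break                                      # remaining children pruned
            v = alpha_beta(child, best, best + 1, not maximizing)  # test: child > best?
            if v > best:                                   # test succeeded: re-search
                best = v if v >= beta else scout(child, v, beta, not maximizing)
        else:
            if best <= alpha:
                break
            v = alpha_beta(child, best - 1, best, not maximizing)  # test: child < best?
            if v < best:
                best = v if v <= alpha else scout(child, alpha, v, not maximizing)
    return best
```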

Problem : Show what would be stored in the transposition table after using alpha-beta pruning on the tree from above. Assume that the transposition table is initially empty and there are no collisions.

Labelling the nodes A through K top-to-bottom, left-to-right, the depths (the height of the subtree searched below each node) and bounds stored in the transposition table are
Node  Depth  Bound
A 3 =7
B 2 =4
C 2 =7
D 2 =6
E 1 =4
F 1 ≥6
G 1 ≥7
H 1 =7
I 1 ≥9
J 1 =9
K 1 =6
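One way these entries could be produced is a fail-soft alpha-beta that propagates a bound type along with each value; the `label`, `depth`, and `table` arguments below are bookkeeping assumed for this sketch. A node stores a lower bound when its remaining children are pruned at a max node, an upper bound in the mirror case at a min node, and otherwise inherits the bound of the child that supplied its value.

```python
def alpha_beta_tt(node, alpha, beta, maximizing, depth, table, label="A"):
    """Fail-soft alpha-beta that stores (depth, bound, value) for every
    internal node.  bound is ">=" when the node's remaining children are
    pruned at a max node, "<=" in the mirror case at a min node, and
    otherwise the bound inherited from the child that supplied the value."""
    if not isinstance(node, list):
        return node, "="                                   # leaf values are exact
    best, bound = None, "="
    for i, child in enumerate(node):
        if best is not None and (best >= beta if maximizing else best <= alpha):
            bound = ">=" if maximizing else "<="           # remaining children pruned
            break
        v, child_bound = alpha_beta_tt(child, alpha, beta, not maximizing,
                                       depth - 1, table, f"{label}.{i}")
        if best is None or (v > best if maximizing else v < best):
            best, bound = v, child_bound
        if maximizing:
            alpha = max(alpha, best)
        else:
            beta = min(beta, best)
    table[label] = (depth, bound, best)
    return best, bound
```

Run on the tree above with an empty dictionary and depth 3, this should reproduce the eleven entries in the table.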

Problem : Consider the multi-armed bandit problem with 3 arms whose payoffs are determined by the outcomes of rolling a fair 6-, 8-, or 10-sided die. Show that the greedy strategy does not achieve zero average regret for this setup.

If we roll a 1 on the 10-sided die and at least one of the other dice rolls something other than a 1, then that other arm's observed average stays above 1 forever, so the greedy strategy never chooses the 10-sided die (the arm with the highest expected payoff, 5.5) again. The regret on each subsequent pull is then at least $5.5 - 4.5 = 1$, so the limit of the average regret is non-zero. That happens with probability $\frac{1}{10} \cdot (1 - \frac{1}{8}\cdot\frac{1}{6}) > 0$, so the probability that the average regret goes to zero in the limit is less than 1.
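A quick simulation of the argument (a sketch: it assumes the greedy strategy pulls each arm once before committing, breaks ties arbitrarily, and measures pseudo-regret against the 10-sided die's expected payoff of 5.5):

```python
import random

def greedy_average_regret(num_pulls=100000, seed=0):
    """Greedy play on three arms whose payoffs are rolls of fair 6-, 8-,
    and 10-sided dice: pull each arm once, then always pull the arm with
    the best observed mean.  Returns the average pseudo-regret per pull."""
    rng = random.Random(seed)
    sides = [6, 8, 10]
    means = [(s + 1) / 2 for s in sides]             # 3.5, 4.5, 5.5
    totals = [rng.randint(1, s) for s in sides]      # one initial pull of each arm
    counts = [1, 1, 1]
    regret = sum(max(means) - m for m in means)      # regret of the initial pulls
    for _ in range(num_pulls - 3):
        arm = max(range(3), key=lambda a: totals[a] / counts[a])   # greedy choice
        totals[arm] += rng.randint(1, sides[arm])
        counts[arm] += 1
        regret += max(means) - means[arm]
    return regret / num_pulls

# With positive probability the d10's first roll is a 1 while another arm
# rolls higher; greedy then never tries the d10 again and the average
# regret converges to a positive constant instead of zero.
print(greedy_average_regret())
```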

Problem : Describe modifications that would be necessary to standard MCTS for 2-player Yahtzee.

The tree policy would have to account for the stochastic element. At random event positions, the tree policy could randomly choose the next position according to the probability distribution of that event.
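A sketch of such a selection step; the `Node` fields used here (`is_chance`, per-child `probabilities`, visit and reward counts) are assumptions of this illustration, not part of any particular Yahtzee implementation.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    """Illustrative MCTS node: chance nodes carry the probability of each
    child; decision nodes carry the usual visit/reward statistics."""
    is_chance: bool = False
    children: list = field(default_factory=list)
    probabilities: list = field(default_factory=list)   # chance nodes only
    visits: int = 0
    total_reward: float = 0.0

def select_child(node, c=math.sqrt(2)):
    """One step of the tree policy.  At a chance node (a dice roll) the
    child is sampled from the event's distribution; at a decision node the
    standard UCB1 rule is applied, with unvisited children tried first."""
    if node.is_chance:
        return random.choices(node.children, weights=node.probabilities)[0]
    log_n = math.log(max(1, node.visits))
    return max(node.children,
               key=lambda ch: float("inf") if ch.visits == 0
               else ch.total_reward / ch.visits + c * math.sqrt(log_n / ch.visits))
```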

Problem : Pick two enhancements to standard MCTS and describe how they might be applied to NFL Strategy, Kalah, or Yahtzee.

Predicate Average Sampling Technique: for NFL Strategy, use predicates such as "trailing by more than 21 points" and "leading by more than 21 points" and keep track of the success of each play in positions satisfying those predicates during random playouts. Then, when a position matches one of the predicates, bias play selection towards the plays that have been more successful under it.
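A sketch of the bookkeeping this would need; the predicate names, the use of the score margin as the only state feature, and the proportional bias are all illustrative choices rather than part of the NFL Strategy code.

```python
from collections import defaultdict
import random

# Illustrative predicates keyed on the score margin (our score minus theirs);
# the real NFL Strategy state and play identifiers would come from the game.
PREDICATES = {
    "trailing_by_more_than_21": lambda margin: margin < -21,
    "leading_by_more_than_21":  lambda margin: margin > 21,
}

past_stats = defaultdict(lambda: [0, 0])      # (predicate, play) -> [wins, tries]

def record_playout_play(margin, play, won):
    """After a playout, credit the play under every predicate that held
    when it was chosen."""
    for name, pred in PREDICATES.items():
        if pred(margin):
            past_stats[(name, play)][0] += int(won)
            past_stats[(name, play)][1] += 1

def biased_playout_play(margin, legal_plays, rng=random):
    """During a playout, weight plays by their success rate under the
    predicates matching the current position (uniform when no data yet)."""
    def weight(play):
        w = 1.0
        for name, pred in PREDICATES.items():
            if pred(margin):
                wins, tries = past_stats[(name, play)]
                if tries:
                    w += wins / tries
        return w
    return rng.choices(legal_plays, weights=[weight(p) for p in legal_plays])[0]
```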

Search Seeding: use a heuristic function for Kalah (perhaps our simple score margin heuristic) to initialize the visit count and reward statistics for newly expanded nodes, in the hope of reducing the time spent determining that bad moves are in fact bad.
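A sketch of the seeding step, assuming playout rewards in [0, 1] and the standard 48-seed Kalah board; the statistics fields, the prior of 10 visits, and the exact score-margin heuristic are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ChildStats:
    """Minimal per-child MCTS statistics (illustrative)."""
    visits: int = 0
    total_reward: float = 0.0

def kalah_margin_heuristic(my_store, opp_store, total_seeds=48):
    """Hypothetical score-margin heuristic squashed into [0, 1]."""
    return 0.5 + (my_store - opp_store) / (2 * total_seeds)

def seed_child(heuristic_value, prior_visits=10):
    """Search seeding: treat a newly expanded child as if it had already
    been visited prior_visits times, each time receiving the heuristic's
    estimate as its reward, so clearly bad moves start with low averages."""
    return ChildStats(visits=prior_visits,
                      total_reward=prior_visits * heuristic_value)

# e.g. a child whose position has us ahead 12 seeds to 8 in the stores:
child = seed_child(kalah_margin_heuristic(12, 8))
```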

Problem : Describe how you would set up the inputs to a neural network that chooses plays for NFL Strategy. Describe how you would use the outputs to choose a play.

Inputs:

Outputs:

Problem : How did DeepMind address the issue of overfitting when training the value network they used in AlphaGo?

The value network was trained on positions from games of self-play by the policy network obtained after reinforcement learning. Instead of training on every position reached during every game, they trained on a single position sampled from each of 30 million distinct games; successive positions within a single game are strongly correlated, so training on all of them would amount to fitting nearly the same position many times and lead to overfitting.