Exam #2 Practice
Problem:
Consider the multi-armed bandit problem with three arms whose payoffs are
determined by the outcomes of rolling a fair 6-, 8-, or 10-sided die. Show
that the greedy strategy does not achieve zero average regret for this setup.
Suppose the first roll of the 10-sided die is a 1 and at least one of the
other dice rolls something greater than 1. Then the greedy strategy never
chooses the 10-sided die again: every roll pays at least 1, so the best
other arm's observed average stays strictly above 1, while the 10-sided
die's average remains exactly 1. Since the 10-sided die has the highest
expected payoff (5.5, versus 4.5 and 3.5 for the others), each subsequent
roll incurs expected regret of at least 1, and so the limit of the average
regret is non-zero. That happens with probability
$\frac{1}{10} \cdot (1 - \frac{1}{8}\cdot\frac{1}{6}) > 0$,
so the probability that the average regret goes to zero in the limit is
less than 1.
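One way to see this numerically is to simulate the greedy strategy (play
each arm once, then always play the arm with the best observed average).
The sketch below is a minimal simulation; the tie-breaking rule and the
roll and run counts are arbitrary choices, not part of the problem.

```python
import random

def greedy_average_regret(num_rolls=100_000):
    """Simulate greedy play on 6-, 8-, and 10-sided dice and return the
    average regret per roll relative to the best arm (the d10, mean 5.5)."""
    sides = [6, 8, 10]
    totals = [0.0] * 3
    counts = [0] * 3
    best_mean = max((s + 1) / 2 for s in sides)  # 5.5 for the d10
    payoff_sum = 0.0
    for t in range(num_rolls):
        if t < 3:
            arm = t  # try each arm once
        else:
            arm = max(range(3), key=lambda i: totals[i] / counts[i])
        roll = random.randint(1, sides[arm])
        totals[arm] += roll
        counts[arm] += 1
        payoff_sum += roll
    return best_mean - payoff_sum / num_rolls

# Averaged over many runs, the regret stays bounded away from zero
# because some runs abandon the 10-sided die after its first roll.
print(sum(greedy_average_regret() for _ in range(200)) / 200)
```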
Problem:
Describe the modifications that would be necessary to standard MCTS for
two-player Yahtzee.
The tree policy would have to account for the stochastic element. At
random-event positions (the dice rolls), the tree policy could sample the
next position according to the probability distribution of that event
rather than maximizing a UCB score, as in the sketch below.
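A minimal sketch of that modification, assuming a node type with an
`is_chance` flag, an `outcomes()` method returning (child, probability)
pairs, and the usual visit and reward fields; none of these names come
from a particular MCTS implementation.

```python
import math
import random

def ucb_score(child, c=1.4):
    """Standard UCB1 value for a child of a non-chance node."""
    if child.visits == 0:
        return float("inf")
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore

def descend(node):
    """One step of the tree policy with chance nodes mixed in."""
    if node.is_chance:
        # A dice roll: sample the next position according to the
        # event's probability distribution instead of using UCB.
        children, probs = zip(*node.outcomes())
        return random.choices(children, weights=probs)[0]
    return max(node.children, key=ucb_score)
```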
Problem:
Pick two enhancements to standard MCTS and describe how they might be
applied to NFL Strategy, Kalah, or Yahtzee.
Predicate Average Sampling Technique (PAST): for NFL Strategy, use the
predicates "trailing by more than 21 points" and "leading by more than 21
points" and keep track of the success of each play under those predicates
during random playouts.
Then, in positions matching a predicate, bias playout move selection
toward the plays that have done better under it (sketched below).
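A sketch of how those playout statistics might be kept and used; the
predicate names and the `score_margin` attribute are invented here for
illustration.

```python
from collections import defaultdict
import random

# (predicate, play) -> [successes, trials]; seeded with one trial to
# avoid division by zero for plays not yet seen under a predicate.
past_stats = defaultdict(lambda: [0, 1])

def active_predicates(position):
    preds = []
    if position.score_margin < -21:   # trailing by more than 21 points
        preds.append("trailing_21")
    elif position.score_margin > 21:  # leading by more than 21 points
        preds.append("leading_21")
    return preds

def playout_play(position, plays):
    """Pick a playout move, biased toward plays that have succeeded
    under the predicates matching the current position (PAST-style)."""
    preds = active_predicates(position)
    if not preds:
        return random.choice(plays)
    def success_rate(play):
        wins = sum(past_stats[(p, play)][0] for p in preds)
        tries = sum(past_stats[(p, play)][1] for p in preds)
        return wins / tries
    # The small constant keeps every play possible; otherwise a play
    # that never succeeded would never be sampled again.
    weights = [0.1 + success_rate(play) for play in plays]
    return random.choices(plays, weights=weights)[0]
```

After each playout, the success counters for every (predicate, play) pair
encountered along the way would be updated with the playout's result.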
Search Seeding: use a heuristic function
for Kalah (perhaps our simple score-margin heuristic) to
initialize the visit count and reward statistics of newly expanded
nodes, in the hope of reducing the time spent discovering
that bad moves are in fact bad.
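One way that seeding might look, with a self-contained `Node` and an
invented stand-in for the score-margin heuristic; the state API
(`legal_moves()`, `apply()`, seed counts) is assumed, not taken from any
particular Kalah implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    move: object = None
    parent: "Node | None" = None
    visits: int = 0
    total_reward: float = 0.0
    children: list = field(default_factory=list)

def score_margin_heuristic(state):
    """Stand-in: seed-count margin normalized to [-1, 1] (48 seeds total)."""
    return max(-1.0, min(1.0, (state.my_seeds - state.opp_seeds) / 48))

def expand(node, state, prior_visits=10):
    """Create children of `node`, seeding each as if the heuristic value
    were the average reward of `prior_visits` playouts. Bad moves then
    start with low averages and receive fewer real playouts."""
    for move in state.legal_moves():
        child = Node(move=move, parent=node)
        h = score_margin_heuristic(state.apply(move))
        child.visits = prior_visits
        child.total_reward = prior_visits * (h + 1) / 2  # rewards in [0, 1]
        node.children.append(child)
```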
Problem:
Consider using coevolution to optimize a set of numeric parameters in a
heuristic for chess. Describe some ways to get useful information from
playing two members of the population against each other multiple times
when, ordinarily, the resulting players are deterministic and each game
would follow the same sequence of moves.
As described in Genetic Algorithms for Evolving Computer Chess Programs
(O. David et al.; see the reading list), we can play multiple games with
randomly selected opening move sequences, so the deterministic players
face a different position in each game.
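A sketch of the tournament loop, written against an assumed game
interface (`new_game()`, `legal_moves()`, `play()`, `copy()`) and a
`play_out` helper, all invented for illustration; playing each random
opening once from each side cancels any bias in the opening itself.

```python
import random

def compare(player_a, player_b, new_game, play_out,
            num_games=20, opening_plies=6):
    """Return player_a's average score in [0, 1] over paired games,
    each starting from a fresh random opening sequence.
    `play_out(game, white, black)` is assumed to finish the game and
    return white's score (1 win, 0.5 draw, 0 loss)."""
    total = 0.0
    for _ in range(num_games):
        game = new_game()
        for _ in range(opening_plies):  # random opening sequence
            game.play(random.choice(game.legal_moves()))
        # Play the same opening from both colors: the deterministic
        # players see a different position each game, but neither gets
        # a systematically better start.
        total += play_out(game.copy(), player_a, player_b)
        total += 1.0 - play_out(game.copy(), player_b, player_a)
    return total / (2 * num_games)
```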
Problem:
Describe how you would set up the inputs to a neural network that chooses plays for
NFL Strategy. Describe how you would use the outputs to choose a play.
Inputs:
- distance to goal line, normalized to $[0.0, 1.0)$
- time left in quarter, normalized to $[0.0, 1.0)$
- current quarter (1.0 = 1st, 0.66 = 2nd, 0.33 = 3rd, 0.0 = 4th)
- current down (1.0 = 1st, 0.66 = 2nd, 0.33 = 3rd, 0.0 = 4th)
- yards to first down divided by 10 (this one is interesting: we need some normalization, but dividing by 100 would compress the most frequently seen values into a small range; normalizing to standard deviations from the mean observed during actual play may work better)
- score margin divided by 20 (again, the best normalization may be determined by experimentation)
- (and field position as -1 for left, 0 for center, 1 for right, if you wish to include that feature of the commercial game)
Outputs:
- one output for each play, interpreted as the probability of selecting
that play (so the output layer should use a softmax activation function)
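A sketch of the encoding and the play-selection step; the normalization
constants are the illustrative ones from the lists above, not tuned
values, and the function names are invented here.

```python
import numpy as np

def encode(dist_to_goal, secs_left, quarter, down, yards_to_first, margin):
    """Pack the game state into the input vector described above."""
    return np.array([
        dist_to_goal / 100.0,      # yards to the goal line, in [0, 1)
        secs_left / 900.0,         # time left in a 15-minute quarter
        (4 - quarter) / 3.0,       # 1st=1.0, 2nd=0.66..., 4th=0.0
        (4 - down) / 3.0,          # same scheme for the down
        yards_to_first / 10.0,
        margin / 20.0,
    ], dtype=np.float32)

def choose_play(logits, rng=np.random.default_rng()):
    """Softmax over the network's output layer, then sample a play
    index in proportion to the resulting probabilities."""
    z = logits - logits.max()              # subtract max for stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))
```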
Problem:
How did DeepMind address the issue of overfitting when training the value network they
used in AlphaGo?
The value network was trained on positions from games of self-play by the
policy network obtained after reinforcement learning.
Instead of training on every position reached during every game, they
trained on a single position sampled from each of 30 million distinct
games, since successive positions within one game are strongly correlated.
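A sketch of that sampling step, assuming the self-play data is an
iterable of (positions, outcome) pairs; the names are invented here.

```python
import random

def sample_training_pairs(games):
    """Take one (position, outcome) pair per game. Successive positions
    within a single game are strongly correlated, so training on all of
    them would let the network memorize whole games and overfit."""
    return [(random.choice(positions), outcome)
            for positions, outcome in games]
```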