The problem of exploiting information about the environment while still being robust to inaccurate or incomplete information arises in many domains. Games where the goal is to maximally exploit an unknown opponent's weaknesses are an example of this problem. Agents for these games must balance two objectives. First, they should aim to exploit data from past interactions with the opponent, seeking a best-response counter-strategy. Second, they should aim to minimize losses, since the limited data may be misleading or the opponent's strategy may have changed, suggesting an opponent-agnostic Nash equilibrium strategy. In this paper, we show how to partially satisfy both of these objectives at the same time, producing strategies with favourable tradeoffs between the ability to exploit an opponent and the capacity to be exploited. Like a recently published technique, our approach involves solving a modified game; however, the result is more generally applicable and even performs well in situations with very limited data. We evaluate our technique in the game of two-player Limit Texas Hold'em.

Maximizing utility in the presence of other agents is a fundamental problem in game theory. In a zero-sum game, utility comes from the exploitation of opponent weaknesses, but it is important not to allow one's own strategy to be exploited in turn. Two approaches to such problems are well known: best response strategies and Nash equilibrium strategies. A best response strategy maximizes utility for an agent, assuming perfect knowledge of its static opponent. However, such strategies are brittle: against a worst-case opponent, they have a high exploitability. In a two-player zero-sum game, a Nash equilibrium strategy maximizes its utility against a worst-case opponent. As a result, we say that such strategies are robust. If a perfect model of the opponent is available, then the opponent can be exploited by a best response; if a model is not available, then playing a Nash equilibrium strategy is a sensible choice. However, if a model exists but is somewhat unreliable (e.g., if it is formed from a limited number of observations of the opponent's actions, or if the opponent is known to be changing strategies), then a better option may be to compromise: accepting a slightly lower worst-case utility in return for a higher utility if the model is approximately correct.

One simple approach for creating such a compromise strategy is to create both a best response strategy and a Nash equilibrium strategy, and then play a mixture of the two. Before each game, we flip a biased coin: with probability p we use the best response, and with probability (1 − p) we use the Nash equilibrium. By varying p, we can create a range of strategies that linearly trade off exploitation of the opponent against our own exploitability by a worst-case opponent. While this approach is a useful baseline, we would like to make more favourable tradeoffs.

McCracken and Bowling [McCracken and Bowling, 2004] proposed ε-safe strategies as another approach. The set of ε-safe strategies contains all strategies that are exploitable by no more than ε. From this set, the strategies that maximize utility against the opponent are the set of ε-safe best responses. Thus, for a chosen ε, the ε-safe best responses achieve the best possible tradeoffs between exploitation and exploitability. However, their approach is computationally infeasible for large domains, and has only been applied to Ro-Sham-Bo (Rock-Paper-Scissors).
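To make the baseline concrete, the sketch below (our own illustration, not taken from the paper) computes the exploitation/exploitability tradeoff of the coin-flip mixture in a small zero-sum matrix game. The payoff matrix, the assumed opponent model, and the variable names are hypothetical choices made for the example.

```python
import numpy as np

# Rock-Paper-Scissors payoff matrix for player 1 (rows) vs player 2 (columns).
# Rows/columns are ordered (rock, paper, scissors); the game value is 0.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

nash = np.ones(3) / 3.0                    # uniform random is the Nash equilibrium
model = np.array([0.5, 0.3, 0.2])          # hypothetical opponent model (rock-heavy)

# Best response to the model: the pure strategy with the highest expected payoff.
best_response = np.eye(3)[np.argmax(A @ model)]

for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    sigma = p * best_response + (1 - p) * nash   # coin-flip mixture
    exploitation = sigma @ A @ model             # utility vs. the model, minus the value (0)
    exploitability = -min(sigma @ A)             # loss vs. a best-responding worst-case opponent
    print(f"p={p:.2f}  exploitation={exploitation:+.3f}  exploitability={exploitability:.3f}")
```

Both quantities vary linearly in p, which is exactly the straight-line tradeoff this baseline provides.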
In previous work we proposed the restricted Nash response (RNR) technique [Johanson et al., 2008] as a practical approach for generating a range of strategies that provide good tradeoffs between exploitation and exploitability. In this approach, a modified game is formed in which the opponent is forced to act according to an opponent model with some probability p, and is free to play the game as normal with probability (1 − p). When p is 0 the result is a Nash equilibrium, and when p is 1 the result is a best response. When 0 < p < 1 the technique produces a counter-strategy that provides a different tradeoff between exploitation and exploitability. In fact, the counter-strategies generated are in the set of ε-safe best responses for the counter-strategy's value of ε, making them the best possible counter-strategies, assuming the model is correct. In a practical setting, however, the model is likely formed through a limited number of observations of the opponent's actions, and it may be incomplete (it cannot predict the opponent's strategy in some states) or inaccurate. As we will show in this paper, the restricted Nash response technique can perform poorly under such circumstances.

In this paper, we present a new technique for generating a range of counter-strategies that form a compromise between the exploitation of a model and its exploitability. These counter-strategies, called data biased responses (DBR), are more resilient to incomplete or inaccurate models than restricted Nash response (RNR) counter-strategies. DBR is similar to RNR in that the technique involves computing a Nash equilibrium strategy in a modified game where the opponent is forced with some probability to play according to a model. Unlike RNR, the opponent's strategy is constrained on a per-information-set basis, and the constraint depends on our confidence in the accuracy of the model. For comparison to the RNR technique, we demonstrate the effectiveness of the technique in the challenging domain of two-player Limit Texas Hold'em Poker.
A perfect information extensive game consists of a tree of game states and terminal nodes. At each game state, an action is taken by one player (or by "chance"), causing a transition to a child state; this is repeated until a terminal state is reached. The terminal state defines the payoffs to the players. In imperfect information extensive games such as poker, the players cannot observe some piece of information (such as their opponent's cards) and so they cannot exactly determine which game state they are in. Each set of indistinguishable game states is called an information set, and we denote such a set by I ∈ I. A strategy for player i, σi, is a mapping from information sets to a probability distribution over actions, so σi(I, a) is the probability that player i takes action a in information set I. The space of all possible strategies for player i will be denoted Σi. In this paper, we will focus on two player games.

Given strategies for both players, we define ui(σ1, σ2) to be the expected utility for player i if player 1 uses the strategy σ1 ∈ Σ1 and player 2 uses the strategy σ2 ∈ Σ2. A best response to an opponent's strategy σ2 is a strategy for player 1 that achieves the maximum expected utility of all strategies when used against the opponent's strategy. There can be many strategies that achieve the same expected utility; we refer to the set of best responses as BR(σ2) ⊆ Σ1: for example, the set of best responses for player 1 to use against player 2's strategy σ2 is BR(σ2), and BR(σ1) ⊆ Σ2 is defined analogously for player 2.

A strategy profile σ consists of a strategy for each player. If σ1 ∈ BR(σ2) and σ2 ∈ BR(σ1), we refer to σ as a Nash equilibrium. A zero-sum extensive game is an extensive game where u1 = −u2 (one player's gains are equal to the other player's losses). In such games, all Nash equilibrium strategies have the same utility for the players, and we refer to this as the value of the game. We define the term exploitability to refer to the difference between a strategy's utility when playing against its best response and the value of the game for that player. We define exploitation to refer to the difference between a strategy's utility against a specific opponent strategy and the value of the game. A strategy that can be exploited for no more than ε is called ε-safe, and is a member of the set of ε-safe strategies.
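As a concrete, purely illustrative rendering of these definitions, the snippet below computes the game value, exploitation, and exploitability for a small zero-sum matrix game; the payoff matrix and the fixed opponent strategy are hypothetical choices made for the example.

```python
import numpy as np
from scipy.optimize import linprog

# A small zero-sum matrix game: A[i, j] is player 1's payoff when player 1
# plays row i and player 2 plays column j (hypothetical numbers).
A = np.array([[ 2.0, -1.0],
              [-1.0,  1.0]])

def game_value_and_maximin(A):
    """Solve max_x min_j (x^T A)_j by linear programming; returns (value, x)."""
    n_rows, n_cols = A.shape
    c = np.concatenate([np.zeros(n_rows), [-1.0]])          # variables: x and v; minimize -v
    A_ub = np.hstack([-A.T, np.ones((n_cols, 1))])           # for every column j: v - (x^T A)_j <= 0
    b_ub = np.zeros(n_cols)
    A_eq = np.concatenate([np.ones(n_rows), [0.0]]).reshape(1, -1)   # x sums to 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n_rows + [(None, None)])
    return -res.fun, res.x[:n_rows]

value, _ = game_value_and_maximin(A)

sigma1 = np.array([0.5, 0.5])          # a strategy for player 1 that we want to evaluate
sigma2 = np.array([0.8, 0.2])          # a specific (non-worst-case) opponent strategy

u_vs_opponent = sigma1 @ A @ sigma2            # u1(sigma1, sigma2)
u_vs_best_response = min(sigma1 @ A)           # the opponent best-responds to sigma1

exploitation = u_vs_opponent - value           # gain over the value vs. this opponent
exploitability = value - u_vs_best_response    # loss to a worst-case opponent
print(value, exploitation, exploitability)
```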
Heads-Up Limit Texas Hold'em poker is a two-player card game. While it is widely played in casinos (both online and in real life), it is also the main event of the AAAI Computer Poker Competition [Zinkevich and Littman, 2006], an initiative to foster research into AI for imperfect information games. Texas Hold'em is a very large zero-sum extensive form game with imperfect information (the opponent's cards are hidden) and stochastic elements (cards are dealt at random). Each individual game is short, and players typically play a session of many games.

We will briefly summarize the rules of the game. A session starts with each player having some number of chips, which usually represent money. A single game of Heads-Up Limit Texas Hold'em consists of each player being forced to place a small number of chips (called a blind) into the pot before being dealt two private cards. The players will combine these private cards with five public cards that are revealed as the game progresses. The game has four phases: the preflop (when the two private cards are dealt), the flop (when three public cards are dealt), the turn (when one public card is dealt) and the river (when one final public card is dealt). If both players reach the end of the game (called a showdown), then both players reveal their private cards and the player with the best 5-card poker hand wins all of the chips in the pot. If only one player remains in the game, then that player wins the pot without revealing their cards. After the cards are dealt in each phase, the players engage in a round of betting, where they bet by placing additional chips in the pot that their opponent must match or exceed in order to remain in the game. To do this, the players alternate turns and take one of three actions. They may fold to exit the game and let the opponent win, call to match the opponent's chips in the pot, or raise to match and then add a fixed number of additional chips (the "bet" amount). When both players have called, the round of betting is over; no more than four bets are allowed in a single round.

The goal is to win as much money as possible from the opponent by the end of the session. This distinguishes poker from games such as Chess or Checkers where the goal is simply to win and the magnitude of the win is not measured. The performance of an agent is measured by the number of bet amounts (or just bets) they win per game across a session. Between strong computer agents, this number can be small, so we present the performance in millibets per game (mb/g), where a millibet is one thousandth of a bet. A player that always folds will lose 750 millibets per game to their opponent, and a strong player can hope to win 50 millibets per game from their opponent. Due to a standard deviation of approximately 6000 millibets per game, it can take more than one million games to distinguish with 95% confidence a difference of 10 millibets per game.

Since the goal of the game is to maximize the exploitation of one's opponent, the game emphasizes the role of exploitive strategies as opposed to equilibrium strategies. In the two most recent years of the AAAI Computer Poker Competition, the "Bankroll" event, which rewards exploitive play, has been won by agents that lost to some opponents, but won enough money from the weakest agents to have the highest total winnings. However, many of the top agents have been designed to take advantage of a suspected a priori weakness common to many opponents. A more promising approach is to observe an opponent playing for some fixed number of games, and use these observations to create a counter-strategy that exploits the opponent for more money than a baseline Nash equilibrium strategy or a strategy that exploits some expected weaknesses.

The full game of Texas Hold'em has 9.17 × 10^17 game states; computing best responses and Nash equilibria in a game of this size is intractable. Therefore, it is common practice to instead reduce the real game to a much smaller abstract game that maintains as many of the strategic properties as possible. The strategies of interest to us will be computed in this abstract game. To use the abstract game strategy to play the real game, we map the current real game information set to an abstract game information set, and choose the action specified by the abstract strategy.

The game is abstracted by merging information sets that result from similar chance outcomes. On the preflop, one such abstraction might reduce the number of chance outcomes from 52 choose 2 down to 5; on each later round, each chance outcome is similarly reduced to one of 5 outcomes, giving 625 possible combinations and resulting in a game that has 6.45 × 10^9 game states. In this abstract game, best response counter-strategies can be computed in time linear in the size of the game tree; on modern hardware, this takes roughly 10 minutes. Using recent advances for solving extensive form games [Zinkevich et al., 2008], a Nash equilibrium for this abstract game can be approximated to within 3 millibets per game.
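The sketch below illustrates the flavour of such a chance abstraction: it maps each chance outcome to one of 5 buckets per round and counts the resulting abstract chance sequences. The bucketing rule (a uniform split of an arbitrary hand-strength score) and all names are hypothetical; the paper does not specify the actual bucketing function.

```python
from itertools import product

NUM_BUCKETS = 5   # abstract outcomes per betting round (as in the 5-bucket example)
NUM_ROUNDS = 4    # preflop, flop, turn, river

def bucket(hand_strength: float) -> int:
    """Map a hand-strength score in [0, 1) to one of NUM_BUCKETS buckets.

    A real abstraction would use a domain-specific metric (e.g. expected hand
    strength); a uniform split is used here purely for illustration.
    """
    return min(int(hand_strength * NUM_BUCKETS), NUM_BUCKETS - 1)

# Every abstract chance sequence is a 4-tuple of buckets, one per round.
abstract_sequences = list(product(range(NUM_BUCKETS), repeat=NUM_ROUNDS))
print(len(abstract_sequences))       # 5^4 = 625 possible combinations

# Two different real hands that land in the same buckets are merged into the
# same abstract information sets, which is what shrinks the game tree.
print(bucket(0.91), bucket(0.95))    # both map to bucket 4
```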
While these techniques are important steps towards creating strong agents for Texas Hold'em, our interest is in the ability to exploit opponent weaknesses, so we will examine results where the opponent is not playing an equilibrium strategy. Toward this end, we created an agent similar to "Orange", which was designed to be overly aggressive but still near equilibrium, and competed in the First Man-Machine Poker Championship [Johanson, 2007, p. 82]. "Orange" is a strategy for an abstract non-zero-sum poker game where the winner gets 7% more than usual, while the loser pays the normal price. When this strategy is used to play the normal (still abstract) zero-sum game of poker, it is exploitable for 28 millibets per game. This value is the upper bound on the performance obtainable by any counter-strategy that plays in the same abstraction.

In this paper, we will also refer to an agent called "Probe" [Johanson et al., 2008]. Probe is a trivial agent that never folds, and calls and raises whenever legal with equal probability. The Probe agent is useful for collecting observations about an opponent's strategy, since it forces them into all of the parts of the game tree that the opponent is willing to reach.
Any model of an opponent's strategy can simply be encoded as a strategy itself. Even a posterior belief derived from a complicated prior and many observations can still be summarized as a single function mapping an information set to a distribution over actions, the expected posterior strategy¹. In this work, we will mainly take a frequentist approach to observations of the opponent's actions (although we discuss a Bayesian interpretation of our approach in Section 7). Each observation is one full information game of poker: both players' cards are revealed. The model of our opponent will consider all of the information sets in which we have observed the opponent acting. The probability of the opponent model taking an action a in such an information set I is then set to the ratio of the number of observations of the opponent playing a in I to the number of observations of I. There will likely be information sets in which we have never observed the opponent acting. For such information sets, we establish a default policy to always choose the call action².

¹ If f is the posterior density function over strategies, then the expected posterior strategy chooses action a at information set I with probability equal to the expected value of σ2(I, a) under f.

² Alternative default policies were tried in this previous work.
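A minimal sketch of this frequentist model construction follows; the data format (a list of observed information-set/action pairs) and the function names are our own illustrative choices.

```python
from collections import Counter, defaultdict

def build_opponent_model(observations):
    """Build a frequency-based opponent model from full-information observations.

    `observations` is an iterable of (information_set, action) pairs in which the
    opponent was seen to act.  Unobserved information sets fall back to the
    default policy of always calling.
    """
    counts = defaultdict(Counter)
    for infoset, action in observations:
        counts[infoset][action] += 1

    def model(infoset, action):
        n_I = sum(counts[infoset].values())
        if n_I == 0:
            return 1.0 if action == "call" else 0.0   # default policy: always call
        return counts[infoset][action] / n_I          # observed frequency

    return model, counts

# Hypothetical usage with made-up information-set labels:
obs = [("preflop:K9", "raise"), ("preflop:K9", "call"), ("flop:pair", "raise")]
model, counts = build_opponent_model(obs)
print(model("preflop:K9", "raise"))   # 0.5, from 1 raise in 2 observations
print(model("turn:unseen", "call"))   # 1.0, default policy
```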
As discussed in the introduction, restricted Nash response counter-strategies form an envelope of possible counter-strategies to use against the opponent, assuming the opponent model is correct [Johanson et al., 2008]. The restricted Nash response technique was designed to solve the brittleness of best response strategies. As was presented in Table 1 of that work, best response strategies perform well against their intended opponent, but they can perform very badly against other opponents, and are highly exploitable by a worst-case opponent. Restricted Nash response strategies are robust, and any new technique for producing counter-strategies should also be able to produce robust strategies. However, restricted Nash response strategies have three limitations. We will show that our new counter-strategy technique addresses these issues.

We begin by describing the exploitability-versus-exploitation graph that is used throughout the paper. For each counter-strategy, we can measure the exploitability (worst-case performance) and exploitation (performance against a specific opponent). So we can plot any counter-strategy as a point on a graph with these axes. Restricted Nash responses involve a family of counter-strategies attained by varying p. Hence, we plot a curve passing through a set of representative p-values to demonstrate the shape of the envelope of strategies. Since the exploitability is determined by the choice of p, we are (indirectly) controlling the exploitability of the resulting counter-strategy, and so it appears on the x-axis; the counter-strategy's exploitation of the specific opponent is the result, and is shown on the y-axis. In each of the following graphs, the values of p used were 0, 0.5, 0.7, 0.8, 0.9, 0.93, 0.97, 0.99, and 1. Each value of p corresponds to one datapoint on each curve. Unless otherwise stated, each set of counter-strategies was produced with 1 million observed games of Orange playing against Probe.

Restricted Nash response counter-strategies can overfit the model. Since our opponent model is itself a strategy, it can be used to play against the counter-strategies that are designed to defeat it. We expect the counter-strategies to perform very well in such cases, and this is demonstrated in our previous work on restricted Nash responses [Johanson et al., 2008]. However, since the model is constructed only from (possibly a small number of) observations of the opponent's strategy, it is more interesting to examine how the counter-strategies perform against the actual opponent. The restricted Nash response counter-strategies each present a different tradeoff of exploitation and exploitability when compared against their opponent model. As p increases, the counter-strategies exploit the opponent model to a higher degree, and are themselves more exploitable. However, as Figure 1a shows, this trend does not hold when we compare their performance against the actual opponent instead of the opponent model. As p increases, the counter-strategies begin to do markedly worse against the actual Orange strategy. The computed counter-strategy has overfit to the opponent model. As the number of observations approaches the limit, the opponent model will perfectly match the actual opponent in the reachable part of the game tree, and this effect will lessen. In a practical setting, however, p must be chosen with care so that the resulting counter-strategies do not overfit the model.

Restricted Nash response counter-strategies require a large number of observations. Intuitively, we expect that as a technique is given more observations of an opponent, the counter-strategies produced will grow in strength. This is true of the restricted Nash response technique. However, if there is not a sufficient quantity of observations, increasing p can make the resulting counter-strategies worse than the equilibrium strategy. This is another aspect of the restricted Nash response technique's capacity to overfit the model: if there is an insufficient number of observations, then the default policy plays a larger part of the model's strategy and the resulting counter-strategy is less applicable to the actual opponent. Figure 1b shows this effect. With fewer than 100 thousand observed games, increasing p causes the counter-strategies to be both more exploitable and less exploitive.

Restricted Nash response counter-strategies are sensitive to the choice of training opponent. Ideally, a technique for creating counter-strategies based on observations should be able to accept any reasonably diverse set of observations as input. However, the restricted Nash response technique requires a very particular set of observations in order to perform well. Figure 1c shows the performance of two sets of restricted Nash response counter-strategies. The set labelled Probe uses an opponent model that observed one million games of Orange playing against Probe; the set labelled Self-Play uses an opponent model that observed one million games of Orange playing against itself. One might think that a model constructed from self-play observations would be ideal, because it would be accurate in the parts of the game tree that the opponent is likely to reach. Instead, we find that self-play data is of no use when constructing a restricted Nash response counter-strategy. If an agent will not play to reach some part of the game tree, then the opponent model has no observations of the opponent in that part of the tree, and is forced to fall back on the default policy, which may be very dissimilar from the actual opponent's strategy. The Probe agent, however, forces the opponent to play into all of the parts of the tree reachable under the opponent's strategy, and thus the default policy is used less often.
Figure 1: Exploitation versus exploitability curves that illustrate three problems in the restricted Nash response technique. In 1a, we note the difference in performance when counter-strategies play against the opponent model and against the actual opponent. In 1b, we see how a scarcity of observations results in poor counter-strategies. In 1c, we see that the technique performs poorly when self-play data is used. Note that the red, solid curve is the same in each graph.

The guiding idea behind the restricted Nash response technique is that the opponent model may not be perfect. The parameter p can be thought of as a measure of confidence in the model's accuracy. Since the opponent model is based on observations of the opponent's actions, there can be two types of flaws in the opponent model. First, there may be information sets in which we never observed the opponent, and so the opponent model must provide a default policy to be taken at this information set. Second, in information sets for which there were a small number of observations, the observed frequency of actions may not match the true action probabilities of the opponent's strategy.

We claim that the restricted Nash response technique's selection of one parameter, p, is not an accurate representation of the problem, because the accuracy of the opponent model is not uniform across all of the reachable information sets. Consider the two cases described above. First, in unobserved information sets, the opponent model uses the default policy and is unlikely to accurately reflect the opponent's strategy. If we could select a value of p for just this information set, then p would be 0. Second, the number of observations of a particular information set will vary wildly across the game tree. In information sets close to the root, we are likely to have many observations, and so we expect the model to be accurate. In information sets that are far from the root, we will tend to have fewer observations, and so we expect the model to be less accurate. If we were selecting a value of p for one information set, it should depend on how accurate we expect the model to be; one measure of this is the number of times we have observed the opponent acting in this information set.

This is the essential difference between the restricted Nash response technique and the data biased response technique. Instead of choosing one probability p that reflects the accuracy of the entire opponent model, we will assign one probability to each information set I and call this mapping Pconf. We will then create a modified game in the following way. Whenever the restricted player reaches I, they will be forced to play according to the model with probability Pconf(I), and can choose their actions freely with probability (1 − Pconf(I)). The other player has no restrictions on their actions. When we solve this modified game, the unrestricted player's strategy becomes a robust counter-strategy to the opponent model.

One setting for Pconf is noteworthy. If Pconf(I) is set to 0 for some information sets, then the opponent model is not used at all in those information sets and the restricted player is free to use any strategy there. However, since we are solving the game, this means that we assume a worst-case opponent and essentially compute a Nash equilibrium in these subgames.
Given an opponent model σfix and Pconf, the restricted player chooses a strategy σ2. The resulting probability of the restricted player taking action a at information set I, σ′2(I, a), is given as:

σ′2(I, a) = Pconf(I) × σfix(I, a) + (1 − Pconf(I)) × σ2(I, a)    (1)

Define Σ2^{Pconf, σfix} to be the set of strategies σ′2 available to the restricted player, given the possible settings of σ2. Among this set of strategies, we can define the subset of best responses to an opponent strategy σ1, BR^{Pconf, σfix}(σ1) ⊆ Σ2^{Pconf, σfix}. Solving a game with the opponent restricted accordingly yields a pair of strategies (σ∗1, σ∗2). In this pair, the strategy σ∗1 is a Pconf-restricted Nash response to the opponent model σfix, which we call a data biased response counter-strategy.

We will now present four ways in which Pconf can be chosen, all of which have two aspects in common. First, each approach sets Pconf(I) for an information set I as a function of the number of observations we have of the opponent acting in information set I, nI. As the number of observations of our opponent acting in I increases, we become more confident in the model's accuracy. If nI = 0, then we set Pconf(I) to zero, indicating that we have no confidence in the model's prediction. Note that this choice in setting Pconf removes the need for a default policy. As mentioned in Section 5, this means the restricted player will become a worst-case opponent in any information sets for which we have no observations. Second, each approach accepts an additional parameter Pmax ∈ [0, 1], which acts in a similar fashion to p in the restricted Nash response technique. It is used to set a maximum confidence for Pconf. Varying Pmax in the range [0, 1] allows us to set a tradeoff between exploitation and exploitability, while nI indicates places where our opponent model should not be trusted.

The first is a simple choice of Pconf, which we call the 1-Step function. In information sets where we have never observed the opponent, Pconf returns 0; otherwise, it returns Pmax. This choice of Pconf allows us to isolate the modelling error caused by the default policy from the error caused by the opponent model's action probabilities not matching the action probabilities of the actual opponent.

The second is another simple choice of Pconf, which we call the 10-Step function. In information sets where we have observed the opponent fewer than 10 times, Pconf returns 0; otherwise, it returns Pmax. Thus, it is simply a step function that requires ten observations before expressing any confidence in the opponent model.

The third choice, the 0-10 Linear function, provides a middle ground between our two step functions. The 0-10 Linear function returns Pmax if nI > 10 and (nI × Pmax)/10 otherwise. Thus, as we obtain more observations, the function expresses more confidence in the accuracy of the opponent model.

The last choice is a setting of Pconf with a Bayesian interpretation. The s-Curve function returns Pmax × (nI/(s + nI)) for any constant s; in this experiment, we used s = 1. Thus, as we obtain more observations, the function approaches Pmax. The foundation for this choice of Pconf is explained further in Section 7.
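The following sketch (our own illustration) implements Equation 1 together with the four confidence functions; the observation counts and strategies are toy stand-ins, and in the real technique σ2 would be chosen by solving the modified game rather than supplied by hand.

```python
def p_conf_1_step(n_I, p_max):
    return 0.0 if n_I == 0 else p_max

def p_conf_10_step(n_I, p_max):
    return 0.0 if n_I < 10 else p_max

def p_conf_0_10_linear(n_I, p_max):
    return p_max if n_I > 10 else n_I * p_max / 10.0

def p_conf_s_curve(n_I, p_max, s=1.0):
    return p_max * (n_I / (s + n_I))

def restricted_strategy(p_conf_I, sigma_fix_I, sigma_2_I):
    """Equation 1: mix the model and the freely chosen strategy at one information set.

    All strategy arguments are dicts mapping actions to probabilities.
    """
    return {a: p_conf_I * sigma_fix_I[a] + (1.0 - p_conf_I) * sigma_2_I[a]
            for a in sigma_fix_I}

# Hypothetical data for one information set observed n_I = 4 times:
n_I, p_max = 4, 0.9
sigma_fix = {"fold": 0.25, "call": 0.5, "raise": 0.25}   # frequency model at I
sigma_2 = {"fold": 0.0, "call": 0.0, "raise": 1.0}       # the restricted player's free choice at I

for name, fn in [("1-Step", p_conf_1_step), ("10-Step", p_conf_10_step),
                 ("0-10 Linear", p_conf_0_10_linear), ("s-Curve", p_conf_s_curve)]:
    p = fn(n_I, p_max)
    print(name, p, restricted_strategy(p, sigma_fix, sigma_2))
```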
In Section 3, we presented three problems with restricted Nash response strategies. In this section, we will revisit these three problems and show that data biased response counter-strategies overcome these weaknesses. In each experiment, the sets of restricted Nash response and data biased response counter-strategies were created with p and Pmax (respectively) parameters of 0, 0.5, 0.7, 0.8, 0.9, 0.93, 0.97, 0.99, and 1. Unless otherwise stated, each set of counter-strategies was produced with 1 million observed games of Orange playing against Probe.

Data biased response counter-strategies resist overfitting to the model. Figure 2a shows the results of sets of restricted Nash response and 1-Step, 10-Step and 0-10 Linear data biased response counter-strategies playing against Orange and the opponent model of Orange. Two of the results are noteworthy. First, we observe that the set of 1-Step data biased response counter-strategies overfit the model. Since the 1-Step data biased response counter-strategies did not use the default policy, this shows us that the error caused by the opponent model's action probabilities not agreeing with the actual opponent's action probabilities is a nontrivial problem, and that the default policy is not the only weakness. Second, we notice that the 0-10 Linear, 10-Step and 1-Curve data biased response counter-strategies do not overfit the opponent model, even at the last datapoint where Pmax is set to 1.

Data biased response counter-strategies address the problem of the quantity of observations necessary to produce useful counter-strategies. In Figure 1b, we showed that with insufficient quantities of observations, restricted Nash counter-strategies not only did not exploit the opponent but in fact performed worse than a Nash equilibrium strategy (which makes no attempt to exploit the opponent). In Figure 2b, we show that the 0-10 Linear data biased response counter-strategies perform well, regardless of the quantity of observations provided. While the improvement in exploitation from having 100 or 1000 observations is very small, for Pmax < 1 the counter-strategies become only marginally more exploitable. This is a marked difference from the restricted Nash response results in Figure 1b.

Data biased response counter-strategies address the problem of the source of the observations used to create the model. In Figure 1c, we showed that the restricted Nash response technique required observations of the opponent playing against an opponent such as Probe in order to create useful counter-strategies. In Figure 2c, we show that while the data biased response counter-strategies produced are more effective when the opponent model observes games against Probe, the technique does still produce useful counter-strategies when provided with self-play data.

Figure 2: Exploitation versus exploitability curves for data biased response counter-strategies. 2a shows that restricted Nash and 1-Step counter-strategies overfit the model, while 10-Step, 0-10 Linear, and 1-Curve counter-strategies do not. 2b shows that the 0-10 Linear counter-strategies are effective with any quantity of training data. 2c shows that the 0-10 Linear counter-strategies can accept any type of training data. Note that the red, solid curve is the same in each graph.

We motivated data biased responses by noting that the confidence in our model is not uniform over all information sets, and suggesting p should be some increasing function of the number of observations at a particular information set. We can give an alternative motivation for this approach by considering the framework of Bayesian decision making. In the Bayesian framework we choose a prior density f over the opponent's strategy. Given observations of the opponent's decisions Z we can talk about the posterior probability Pr(σ2|Z, f). If only one more hand is to be played, decision theory instructs us to maximize our expected utility given our beliefs:

max_{σ1 ∈ Σ1} ∫ u1(σ1, σ2) Pr(σ2 | Z, f) dσ2.
Since utility is linear in the sequence form representation of strategy, we can move the integral inside the utility function, allowing us to solve the optimization as the best response to the expected posterior strategy (see Footnote 1).

However, instead of choosing a single prior density, suppose we choose a set of priors F, and we want to play a strategy that would have large utility for anything in this set. A traditional Bayesian approach might require us to specify our uncertainty over priors from this set, and then maximize expected utility given such a hierarchical prior. Suppose, though, that we have no basis for specifying such a distribution over distributions. An alternative then is to maximize our utility against the worst-case prior from the set:

max_{σ1 ∈ Σ1} min_{f ∈ F} ∫ u1(σ1, σ2) Pr(σ2 | Z, f) dσ2.

In other words, employ a strategy that is robust to the choice of prior. Notice that if F contains a singleton prior, this optimization is equivalent to the original decision theoretic approach, i.e., a best response strategy. If F contains all possible prior distributions, then the optimization is identical to the game theoretic approach, i.e., a Nash equilibrium strategy. Other choices of the set F admit optimizations that trade off exploiting data with avoiding exploitation.
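The connection to the Pconf setting used by data biased responses can be checked numerically. The sketch below (an illustration under assumed toy numbers, not code from the paper) computes the expected posterior strategy at one information set under a Dirichlet prior of strength s and compares it to the Equation 1 mixture with Pconf(I) = nI/(s + nI).

```python
def expected_posterior(prior_mean, counts, s):
    """Posterior mean of a Dirichlet(s * prior_mean) prior after observing `counts`."""
    n_I = sum(counts.values())
    return {a: (s * prior_mean[a] + counts[a]) / (s + n_I) for a in prior_mean}

def equation_1(p_conf, sigma_fix, sigma_2):
    """sigma'_2(I, a) = Pconf(I) * sigma_fix(I, a) + (1 - Pconf(I)) * sigma_2(I, a)."""
    return {a: p_conf * sigma_fix[a] + (1 - p_conf) * sigma_2[a] for a in sigma_fix}

s = 1.0
counts = {"fold": 1, "call": 2, "raise": 1}               # hypothetical observations at I (n_I = 4)
prior_mean = {"fold": 0.1, "call": 0.3, "raise": 0.6}     # the prior's expected strategy

n_I = sum(counts.values())
sigma_fix = {a: counts[a] / n_I for a in counts}          # frequency model built from the counts
p_conf = n_I / (s + n_I)                                  # the s-Curve confidence with Pmax = 1

print(expected_posterior(prior_mean, counts, s))
print(equation_1(p_conf, sigma_fix, prior_mean))          # identical distributions
```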
Theorem 1 Consider F to be the set of priors composed of independent Dirichlet distributions for each information set, where the strength (sum of the Dirichlet parameters) is at most s. The strategy computed by data biased response with Pconf(I) = nI/(s + nI) is the solution to the optimization above.

Proof (Sketch). Let Σs2 be the set of resulting expected posterior strategies for all choices of priors f ∈ F. It suffices to show that Σs2 = Σ2^{Pconf, σfix}. Consider a prior f with Dirichlet parameters αI,a for each action a at information set I. Let σ2(I, a) = αI,a / Σb αI,b, in other words the strategy where the opponent plays the expected prior strategy when given the opportunity. The resulting expected posterior strategy is the same as σ′2 from Equation 1 and so is in the set Σ2^{Pconf, σfix}. Similarly, given the σ2 associated with a strategy σ′2 in Σ2^{Pconf, σfix}, let αI,a = s σ2(I, a). The resulting expected posterior strategy is the same as σ′2. The available strategies to player 2 are equivalent, and so the resulting min-max optimizations are equivalent.

In summary, we can choose Pconf in data biased response so that it is equivalent to finding strategies that are robust to a set of independent Dirichlet priors.

The problem of exploiting information about a suspected tendency in an environment while minimizing worst-case performance occurs in several domains, and becomes more difficult when the information may be limited or inaccurate. We reviewed restricted Nash response counter-strategies, a recent work on the opponent modelling interpretation of this problem in the Poker domain, and highlighted three shortcomings in that approach. We proposed a new technique, data biased responses, for generating robust counter-strategies that provide good compromises between exploiting a tendency and limiting the worst case exploitability of the resulting counter-strategy. We demonstrated that the new technique avoids the three shortcomings of existing approaches, while providing better performance in the most favourable conditions for the existing approaches.

We would like to thank the members of the University of Alberta Computer Poker Research Group. This research was supported by NSERC and iCORE.

[Gilpin and Sandholm, 2006] A. Gilpin and T. Sandholm. Finding equilibria in large sequential games of imperfect information. In ACM Conference on Electronic Commerce, 2006.

[Johanson et al., 2008] M. Johanson, M. Zinkevich, and M. Bowling. Computing robust counter-strategies. In Neural Information Processing Systems 21, 2008.

[Johanson, 2007] M. Johanson. Robust strategies and counter-strategies: Building a champion level computer poker player. Master's thesis, University of Alberta, 2007.

[McCracken and Bowling, 2004] P. McCracken and M. Bowling. Safe strategies for agent modelling in games. In AAAI Fall Symposium on Artificial Multi-agent Learning, October 2004.

[Zinkevich and Littman, 2006] M. Zinkevich and M. Littman. The AAAI computer poker competition. Journal of the International Computer Games Association, 29, 2006. News item.

[Zinkevich et al., 2008] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Neural Information Processing Systems 21, 2008.