Hi all

I've started toying with reinforcement learning (using the Sutton book). What I fail to fully understand is the apparent paradox between having to reduce the Markov state space on the one hand, while on the other hand not making assumptions about what is important and what is not.

Background

E.g. in the checkers example, Sutton says that one should not assign rewards to particular actions in the game, such as capturing an opponent's piece. He argues this may optimize the AI for taking pieces rather than for winning the game. Thus, rewards should only be given for the result you actually want to achieve (e.g. winning the game).
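For illustration, a minimal sketch of such an outcome-only reward function, using a hypothetical GameState type (not from the book):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GameState:
    game_over: bool
    winner: Optional[int]  # id of the winning player, or None while playing

def reward(state: GameState, agent_id: int) -> float:
    """Outcome-only reward: intermediate moves (e.g. captures) earn nothing."""
    if not state.game_over:
        return 0.0
    return 1.0 if state.winner == agent_id else -1.0

# Capturing a piece mid-game yields no reward; only the final result counts.
assert reward(GameState(game_over=False, winner=None), agent_id=0) == 0.0
assert reward(GameState(game_over=True, winner=0), agent_id=0) == 1.0
```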

Question 1

Assume a (Texas hold'em) poker AI with a Markov state consisting only of the player's hand and the cards on the table. This has 52·51·50·49·48·47·46 / (1·2·3·4·5·6·7) = C(52,7) ≈ 134 million states. Now assume we want the AI to take the players' money pools + their bets into consideration. This makes the Markov state space approach an "infinite number of combinations" if we assume 8 players each having between $1 and $200,000.
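A quick sanity check of that count in Python (the 7 cards being the 2 hole cards plus the 5 community cards):

```python
import math

# 2 hole cards + 5 community cards = 7 cards from a 52-card deck:
# 52*51*50*49*48*47*46 / 7!
print(math.comb(52, 7))  # 133784560, i.e. roughly 1.3e8 card-only states
```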

Question 2

One state-reducing strategy could be to divide each player's cash into either poor, medium, or rich. This seriously reduces our state space. However, how do I know that a) 3 groups are sufficient, and b) what the discriminating limits for each group are?
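To make the question concrete, here is a minimal Python sketch of such a bucketing; the cut-offs POOR_LIMIT and RICH_LIMIT are arbitrary placeholders, which is exactly the choice being asked about:

```python
# The two cut-offs below are purely hypothetical placeholders; both the number
# of buckets and their limits would have to be validated experimentally.
POOR_LIMIT = 1_000
RICH_LIMIT = 50_000

def cash_bucket(cash: int) -> str:
    """Map a player's cash (in dollars) to a coarse state feature."""
    if cash < POOR_LIMIT:
        return "poor"
    if cash < RICH_LIMIT:
        return "medium"
    return "rich"

print(cash_bucket(500), cash_bucket(10_000), cash_bucket(150_000))  # poor medium rich
```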

cheers,

asked Feb 15 '11 at 17:38


Kasper Graversen


2 Answers:

Here are some approaches:

  • You can factor the representation, which can give an exponential decrease in the size of the state space. There are special-purpose algorithms that work with factored representations.
  • You can cluster states based on similar value; this idea can be extended via MDP homomorphisms (a toy sketch follows below).
  • You can learn a policy directly, without a model. See the recent work on natural gradient policy search.

That's what comes immediately to mind. It's been a while since I looked at this area of MDPs, so there are probably more techniques.
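To make the second bullet a bit more concrete, here is a toy Python sketch of state aggregation: a hand-written abstraction function phi maps detailed states onto coarse ones, and learning happens over the coarse states. All the features and thresholds in phi are made up for the example:

```python
from collections import defaultdict

def phi(ground_state):
    """Collapse a detailed poker state into a coarse abstract state."""
    hand_strength, my_cash, pot = ground_state
    return (round(hand_strength, 1),                   # hand strength in 0.1-wide buckets
            "short" if my_cash < 5_000 else "deep",    # stack depth
            "small" if pot < 1_000 else "big")         # pot size

V = defaultdict(float)      # one value estimate per *abstract* state
alpha, gamma = 0.1, 1.0

def td0_update(ground_state, reward, next_ground_state):
    """Tabular TD(0) over the aggregated state space."""
    s, s_next = phi(ground_state), phi(next_ground_state)
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])

# Two made-up transitions: nothing mid-hand, +1 when the hand is won.
td0_update((0.83, 12_000, 400), 0.0, (0.83, 12_000, 1_600))
td0_update((0.83, 12_000, 1_600), 1.0, (0.0, 13_500, 0))
print(dict(V))
```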

answered Feb 17 '11 at 10:03


Noel Welsh

Question 1: Yes, the state space easily gets exponentially large when you enumerate all possible states, which is why people adopt either implicit or explicit state-pruning techniques.

Question 2: As far as I know, you'd have to determine the answers to (a) and (b) experimentally, or you can use one of the many techniques based on function approximation and/or temporal difference learning. Then you can think in terms of possibly relevant features and learn a policy that depends on them.
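For instance, a minimal sketch of linear function approximation with semi-gradient TD(0), where the state is summarized by a handful of hand-picked features (all of them hypothetical placeholders):

```python
import numpy as np

def features(hand_strength, my_cash, pot, n_opponents):
    return np.array([1.0,                   # bias term
                     hand_strength,         # e.g. estimated probability of winning
                     my_cash / 200_000.0,   # normalized stack size
                     pot / 200_000.0,       # normalized pot size
                     n_opponents / 8.0])    # normalized number of opponents

w = np.zeros(5)             # 5 weights instead of hundreds of millions of states
alpha, gamma = 0.01, 1.0

def td0_update(x, reward, x_next):
    """Semi-gradient TD(0) for a linear value estimate v(s) = w . x(s)."""
    global w
    td_error = reward + gamma * (w @ x_next) - (w @ x)
    w += alpha * td_error * x

# Two made-up transitions: no reward mid-hand, +1 when the hand is finally won.
mid = features(0.6, 20_000, 500, 5)
end = features(0.9, 21_000, 2_000, 2)
td0_update(mid, 0.0, end)
td0_update(end, 1.0, np.zeros(5))   # terminal: value of the terminal state is 0
print(w)
```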

answered Feb 15 '11 at 18:02


Alexandre Passos ♦

Could you please enumerate or explain some of the best-known state-pruning techniques?

(Feb 16 '11 at 16:32) Kasper Graversen

I think most people switch to function approximation methods instead of explicit state pruning, but this is not my usual area of work and I'm not very familiar with the existing methods. Aren't they covered in Sutton's book?

(Feb 16 '11 at 16:34) Alexandre Passos ♦