What is the difference between a "stochastic world" and a "partially observable world", and how does it change the problem that an RL agent has to solve?

I have read "Reinforcement Learning: An Introduction" by Sutton and Barto. In the textbooks question Noel Welsh comments about it:

[The book] covers the fundamentals of bandit algorithms and reinforcement learning in fully observable worlds (MDPs). Note it says very little about generalisation and practically nothing about acting in partially observable worlds (POMDPs).

I have noticed that many examples use deterministic environments, but I understood that almost all methods also work for stochastic environments. Does it really make a difference to the learning strategy (or the function approximator) of an agent if the environment is not fully observable? I have recently started to read about the concept of causality; is it maybe connected to that?

asked Apr 15 '12 at 07:45

maxy



Try to get your hands on "probabilistic robotics". It talks about POMDPs a lot more.

The short answer to your question is: yes. If there are unobserved states, it is much harder to learn a good policy.

In an MDP, the reward you obtain from getting from one state to another with a given action is fixed. In a POMDP it is not.

Also, in a stochastic environment, you can learn which action will lead to which state with which probability. This is not so easy in a POMDP, where the state you end up in will depend on unobserved variables.

(Apr 15 '12 at 08:16) Andreas Mueller

A partially observed world is one in which you don't observe everything there is to observe. A stochastic world, however, is a world that obeys a certain probability distribution, and is not, for example, behaving adversarially (that is, always choosing the worst possible reaction to your actions) or deterministically (that is, always doing the same thing when you do something).

(Apr 16 '12 at 18:07) Alexandre Passos ♦

Thanks for the comments. My current conclusion is: in a partially observable world (which can also be stochastic), you can use past observations to make a better guess about the current state of the world, and thus better decisions. In a stochastic (but fully observable) world, the past observations are completely useless (except for the learning process, of course).

I didn't consider adversarial worlds so far (I guess those would include games like chess).

(Apr 20 '12 at 02:25) maxy

As far as I know, in stochastic environments, there are no "observations". You know in which state you are.

(Apr 20 '12 at 03:57) Andreas Mueller

One Answer:

In a stochastic world there is a transition probability distribution that determines which state the agent moves to next, given the current state and the action it takes. For example, in GridWorld, the robot takes an action that is supposed to move it to a neighbouring square, but with some probability it fails and lands in a random square. In general, MDPs have stochastic transitions but are completely observable: the agent always knows exactly which state it is in.
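To make the GridWorld example concrete, here is a minimal Python sketch of such a transition distribution. The grid size, the `slip_prob` value, the clamping at the walls, and the function names are all assumptions made for illustration. Note that the agent is always handed its true state, which is what makes this an MDP and not a POMDP.

```python
import random

# Hypothetical action set for a small grid; names are illustrative only.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def transition_distribution(state, action, size=3, slip_prob=0.2):
    """Return {next_state: probability} for a size x size grid.

    Assumption: with probability slip_prob the robot lands on a
    uniformly random square, otherwise it reaches the intended one.
    """
    row, col = state
    d_row, d_col = MOVES[action]
    # Intended target, clamped so the robot stays on the grid.
    target = (min(max(row + d_row, 0), size - 1),
              min(max(col + d_col, 0), size - 1))
    squares = [(r, c) for r in range(size) for c in range(size)]
    # Spread the slip probability evenly over every square.
    dist = {s: slip_prob / len(squares) for s in squares}
    # The remaining probability mass goes to the intended target.
    dist[target] += 1.0 - slip_prob
    return dist


def step(state, action, rng, size=3, slip_prob=0.2):
    """Sample the next state; the agent observes it exactly."""
    dist = transition_distribution(state, action, size, slip_prob)
    states = list(dist)
    weights = [dist[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


rng = random.Random(0)
next_state = step((1, 1), "right", rng)
```

Because the next state is sampled from a distribution that depends only on the current state and action, the current state is all the agent needs in order to act well.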

In a partially observable setting, the current state is unknown and the best we can do is indirectly obtain information about it through an observation model (POMDPs) or a test (PSRs). This adds an extra level of complexity to the planning problem because we have to choose the best action according to our belief state (a distribution over all possible states). We must also take into account our uncertainty about future states when planning (e.g. we might not be able to execute plans that depend on knowing which state we will be in in the future).
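The belief state can be maintained with a standard Bayes filter: predict with the transition model, then reweight by how likely the observation is in each state. The sketch below is only an illustration under stated assumptions. The two-state world, the `stay_transition` and `noisy_sensor` helpers, and the 0.8 sensor accuracy are all made up, and the observation model is assumed to depend only on the state reached.

```python
def update_belief(belief, action, observation, transition, observe):
    """One Bayes-filter step over a discrete belief state.

    belief:      {state: probability}
    transition:  function (state, action) -> {next_state: probability}
    observe:     function (next_state) -> {observation: probability}
    """
    # Prediction: push the belief through the transition model.
    predicted = {}
    for state, p in belief.items():
        for next_state, t in transition(state, action).items():
            predicted[next_state] = predicted.get(next_state, 0.0) + p * t
    # Correction: weight each state by the observation likelihood.
    unnormalised = {s: p * observe(s).get(observation, 0.0)
                    for s, p in predicted.items()}
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalised.items()}


def stay_transition(state, action):
    """Hypothetical toy dynamics: the hidden state never changes."""
    return {state: 1.0}


def noisy_sensor(state):
    """Hypothetical sensor that reports the true state 80% of the time."""
    other = "B" if state == "A" else "A"
    return {state: 0.8, other: 0.2}


belief = {"A": 0.5, "B": 0.5}
belief = update_belief(belief, "stay", "A", stay_transition, noisy_sensor)
belief_after_two = update_belief(belief, "stay", "A", stay_transition, noisy_sensor)
```

Starting from a uniform belief, one reading of "A" moves the probability of state A to 0.8, and a second reading moves it to about 0.94. This is the sense in which past observations carry information in a partially observable world: each one sharpens the belief the agent acts on.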

answered Apr 26 '12 at 22:05

Deepak Ramachandran


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.