|
I have a real-time domain where I need to assign an action to N actors involving moving one of O objects to one of L locations. At each time step, I'm given a reward R, indicating the overall success of all actors. I have 10 actors, 50 unique objects, and 1000 locations, so for each actor I have to select from 500000 possible actions. Additionally, there are 50 environmental factors I may take into account, such as how close each object is to a wall, or how close it is to an actor. This results in 25000000 potential actions per actor. Nearly all reinforcement learning algorithms don't seem to be suitable for this domain. First, they nearly all involve evaluating the expected utility of each action in a given state. My state space is huge, so it would take forever to converge a policy using something as primitive as Q-learning, even if I used function approximation. Even if I could, it would take too long to find the best action out of a million actions in each time step. Secondly, most algorithms assume a single reward per actor, whereas the reward I'm given might be polluted by the mistakes of one or more actors. How should I approach this problem? I've found no code for domains like this, and the few academic papers I've found on multi-actor reinforcement learning algorithms don't provide nearly enough detail to reproduce the proposed algorithm. |