Two Paths
This environment is part of the Grid World environments. Please read that page first for general information.
Possible Agents |
(‘0’, ‘1’) |
Action Spaces |
{‘0’: Discrete(4), ‘1’: Discrete(4)} |
Observation Spaces |
{‘0’: Tuple(Discrete(3), Discrete(3), Discrete(3), Discrete(3)), ‘1’: Tuple(Discrete(3), Discrete(3), Discrete(3), Discrete(3))} |
Symmetric |
False |
Import |
|
The Two-Paths Grid World Environment.
An adversarial 2D grid world problem involving two agents, a runner and a chaser. The runner’s goal is to reach one of two goal location, with each goal located at the end of a separate path. The lengths of the two paths have different lengths. The goal of the chaser is to intercept the runner before it reaches a goal. The runner is considered caught if it is observed by the chaser, or occupies the same location. However, the chaser is only able to effectively cover one of the two goal locations.
This environment requires each agent to reason about the which path the other agent will choose. It offers an ideal testbed for planning under finite-nested reasoning assumptions since it is possible to map reasoning level to the expected path choice.
The two agents start at opposite ends of the maps.
Possible Agents
Runner = ‘0’
Chaser = ‘1’
State Space
Each state contains the (x, y) (x=column, y=row, with origin at the
top-left square of the grid) of the runner and chaser agent. Specifically,
a state is ((x_runner, y_runner), (x_chaser, y_chaser))
Action Space
Each agent has 4 actions corresponding to moving in the 4 cardinal
directions: NORTH=0, EAST=1, SOUTH=2, WEST=3.
Observation Space
Each agent observes the adjacent cells in the four cardinal directions and
whether they are one of three things: OPPONENT=0, WALL=1, EMPTY=2.
Each observation is represented as a tuple:
(cell_north, cell_south, cell_east, cell_west)
Rewards
Both agents receive a penalty of -0.01 for each step. If the runner reaches the
goal then the runner receives a reward of 1.0, while the chaser receives a penalty
of -1.0. If the runner is observed by the chaser, then the runner receives a
penalty of -1.0, while the chaser receives a reward of 1.0.
The rewards make the environment adversarial, but not strictly zero-sum, due to the small penalty each step.
Dynamics
By default actions are deterministic and will lead to the agent moving one cell in the action’s direction, given the cell adjacent cell in that direction is empty and not out-of-bounds.
The environment can also be run in stochastic mode by changing the action_probs
parameter at initialization. This controls the probability the agent will move in
the desired direction each step, otherwise moving randomly in one of the other 3
other possible directions.
Starting State
Both runner and chaser agents start at the same location on opposite ends of the grid each episode, and the goal locations and grid layout are also the same. The specific starting state and configuration depends on the grid layout used.
Episodes End
Episode ends when either the runner is caught, or reaches a goal. By default a
max_episode_steps limit of 20 is also set.
Arguments
grid_size- the grid size to use. This can either3,4, or7, each sizencreate a TwoPaths Env with an-by-ngrid layout (default =7).action_probs- the action success probability for each agent. This can be a single float (same value for both runner and chaser agents) or a tuple with separate values for the runner and chaser (default =1.0).
Reference
Schwartz, Jonathon, Ruijia Zhou, and Hanna Kurniawati. “Online Planning for Interactive-POMDPs using Nested Monte Carlo Tree Search.” In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8770-8777. IEEE, 2022.