Two Paths

../../_images/two_paths.gif

This environment is part of the Grid World environments. Please read that page first for general information.

Possible Agents

(‘0’, ‘1’)

Action Spaces

{‘0’: Discrete(4), ‘1’: Discrete(4)}

Observation Spaces

{‘0’: Tuple(Discrete(3), Discrete(3), Discrete(3), Discrete(3)), ‘1’: Tuple(Discrete(3), Discrete(3), Discrete(3), Discrete(3))}

Symmetric

False

Import

posggym.make("TwoPaths-v0")

The Two-Paths Grid World Environment.

An adversarial 2D grid world problem involving two agents, a runner and a chaser. The runner’s goal is to reach one of two goal location, with each goal located at the end of a separate path. The lengths of the two paths have different lengths. The goal of the chaser is to intercept the runner before it reaches a goal. The runner is considered caught if it is observed by the chaser, or occupies the same location. However, the chaser is only able to effectively cover one of the two goal locations.

This environment requires each agent to reason about the which path the other agent will choose. It offers an ideal testbed for planning under finite-nested reasoning assumptions since it is possible to map reasoning level to the expected path choice.

The two agents start at opposite ends of the maps.

Possible Agents

  • Runner = ‘0’

  • Chaser = ‘1’

State Space

Each state contains the (x, y) (x=column, y=row, with origin at the top-left square of the grid) of the runner and chaser agent. Specifically, a state is ((x_runner, y_runner), (x_chaser, y_chaser))

Action Space

Each agent has 4 actions corresponding to moving in the 4 cardinal directions: NORTH=0, EAST=1, SOUTH=2, WEST=3.

Observation Space

Each agent observes the adjacent cells in the four cardinal directions and whether they are one of three things: OPPONENT=0, WALL=1, EMPTY=2.

Each observation is represented as a tuple: (cell_north, cell_south, cell_east, cell_west)

Rewards

Both agents receive a penalty of -0.01 for each step. If the runner reaches the goal then the runner receives a reward of 1.0, while the chaser receives a penalty of -1.0. If the runner is observed by the chaser, then the runner receives a penalty of -1.0, while the chaser receives a reward of 1.0.

The rewards make the environment adversarial, but not strictly zero-sum, due to the small penalty each step.

Dynamics

By default actions are deterministic and will lead to the agent moving one cell in the action’s direction, given the cell adjacent cell in that direction is empty and not out-of-bounds.

The environment can also be run in stochastic mode by changing the action_probs parameter at initialization. This controls the probability the agent will move in the desired direction each step, otherwise moving randomly in one of the other 3 other possible directions.

Starting State

Both runner and chaser agents start at the same location on opposite ends of the grid each episode, and the goal locations and grid layout are also the same. The specific starting state and configuration depends on the grid layout used.

Episodes End

Episode ends when either the runner is caught, or reaches a goal. By default a max_episode_steps limit of 20 is also set.

Arguments

  • grid_size - the grid size to use. This can either 3, 4, or 7, each size n create a TwoPaths Env with a n-by-n grid layout (default = 7).

  • action_probs - the action success probability for each agent. This can be a single float (same value for both runner and chaser agents) or a tuple with separate values for the runner and chaser (default = 1.0).

Reference

Schwartz, Jonathon, Ruijia Zhou, and Hanna Kurniawati. “Online Planning for Interactive-POMDPs using Nested Monte Carlo Tree Search.” In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8770-8777. IEEE, 2022.