Pursuit Evasion

This environment is part of the Grid World environments. Please read that page first for general information.


Possible Agents	(‘0’, ‘1’)
Action Spaces	{‘0’: Discrete(4), ‘1’: Discrete(4)}
Observation Spaces	{‘0’: Tuple(Tuple(Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2)), Tuple(Discrete(16), Discrete(16)), Tuple(Discrete(16), Discrete(16)), Tuple(Discrete(16), Discrete(16))), ‘1’: Tuple(Tuple(Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2)), Tuple(Discrete(16), Discrete(16)), Tuple(Discrete(16), Discrete(16)), Tuple(Discrete(16), Discrete(16)))}
Symmetric	False
Import	`posggym.make("PursuitEvasion-v1")`

The Pursuit-Evasion Grid World Environment.

An adversarial 2D grid world problem involving two agents: an evader and a pursuer. The evader’s goal is to reach a goal location, on the other side of the grid, while the goal of the pursuer is to spot the evader before it reaches it’s goal. The evader is considered caught if it is observed by the pursuer, or occupies the same location. The evader and pursuer have knowledge of each others starting locations. However, only the evader has knowledge of it’s goal location. The pursuer only knowns that the evader’s goal location is somewhere on the opposite side of the grid to the evaders start location.

This environment requires each agent to reason about the which path the other agent will take through the dense grid environment.

Possible Agents

Evader = "0"
Pursuer = "1"

State Space

Each state is made up of:

the (x, y) coordinate of the evader
the direction the evader is facing
the (x, y) coordinate of the pursuer
the direction the pursuer is facing
the (x, y) coordinate of the evader’s start location
the (x, y) coordinate of the pursuers’s start location
the (x, y) coordinate of the evaders’s goal location
the minimum distance to it’s goal along the shortest path achieved by the evader in the current episode (this is needed to correctly reward the agent for making progress.)

Action Space

Each agent has 4 actions corresponding to moving in the 4 available directions with respect to the direction the agent is currently facing: FORWARD=0, BACKWARDS=1, LEFT=2, RIGHT=3.

Observation Space

Each agent observes:

whether there is a wall (1) or not (0) in the adjacent cells in the four cardinal directions,
whether they see the other agent in a cone in front of them (1) or not (0). The cone projects forward up to ‘max_obs_distance’ (default=12) cells in front of the agent.
whether they hear the other agent (1) or not (0). The other agent is heard if they are within distance 2 from the agent in any direction.
the (x, y) coordinate of the evader’s start location,
the (x, y) coordinate of the pursuer’s start location,
Evader: the (x, y) coordinate of the evader’s goal location. Pursuer: blank coordinate (0, 0).

Note, the goal and start coordinate observations do not change during a single episode, but they do change between episodes.

Rewards

The environment is zero-sum with the pursuer receiving the negative of the evader reward. Rewards are normalized so that returns are bounded between -1 and 1.

The evader receives a reward of 1 for reaching it’s goal location and a reward of -1 if it gets captured. Additionally, the evader receives a small reward of 0.01 each time it’s minimum distance achieved to it’s goal along the shortest path decreases for the current episode. This is to make it so the environment is no longer sparesely rewarded and helps with exploration and learning (it can be disabled by the use_progress_reward parameter). If progress reward is enabled, reward normalization still applies so the returns are bounded between -1 and 1.

Dynamics

Actions are deterministic and will move the agent one cell in the target direction if the cell is empty.

Starting State

At the start of each episode the start location of the evader is selected at random from all possible start locations. The evader’s goal location is then chosen randomly from the set of available goal locations given the evaders start location (in the default maps the goal location is always on the opposite side of the map from a start location). The pursuers start location is similarly chosen from it’s set of possible start locations.

Episode End

An episode ends when either the evader is caught or the evader reaches it’s goal. By default a max_episode_steps limit of 100 steps is also set. This may need to be adjusted when using larger grids (this can be done by manually specifying a value for max_episode_steps when creating the environment with posggym.make).

Arguments

grid - the grid layout to use. This can either be a string specifying one of the supported grids, or a custom :class:PEGrid object (default = "16x16").
max_obs_distance - the maximum number of cells in front each agent’s field of vision extends (default = 12).
use_progress_reward - whether to reward the evader agent for making progress towards it’s goal. If False the evader will only be rewarded when it reaches it’s goal, making it a sparse reward problem (default = ‘True`).

Available variants

The PursuitEvasion environment comes with a number of pre-built grid layouts which can be passed as an argument to posggym.make, to create different grids.

Grid name	Grid size
`5x5`	8x8
`16x16`	16x16
`32x32`	32x32

For example to use the PursuitEvasion environment with the 32x32 grid layout, and episode step limit of 200, and the default values for the other parameters you would use:

import posggym
env = posggym.make(
    'PursuitEvasion-v0',
    max_episode_steps=200,
    grid="32x32",
)

Version History

v1: Minor update, mainly removing unused parameters:
- removed action_probs parameter (actions are now always deterministic)
- removed normalize_reward parameter (rewards are now always normalized)
v0: Initial version

References

[This Pursuit-Evasion implementation is directly inspired by the problem] Seaman, Iris Rubi, Jan-Willem van de Meent, and David Wingate. 2018. “Nested Reasoning About Autonomous Agents Using Probabilistic Programs.” ArXiv Preprint ArXiv:1812.01569.
Schwartz, Jonathon, Ruijia Zhou, and Hanna Kurniawati. “Online Planning for Interactive-POMDPs using Nested Monte Carlo Tree Search.” In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8770-8777. IEEE, 2022.