Pursuit Evasion
This environment is part of the Grid World environments. Please read that page first for general information.
Possible Agents |
(‘0’, ‘1’) |
Action Spaces |
{‘0’: Discrete(4), ‘1’: Discrete(4)} |
Observation Spaces |
{‘0’: Tuple(Tuple(Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2)), Tuple(Discrete(16), Discrete(16)), Tuple(Discrete(16), Discrete(16)), Tuple(Discrete(16), Discrete(16))), ‘1’: Tuple(Tuple(Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2)), Tuple(Discrete(16), Discrete(16)), Tuple(Discrete(16), Discrete(16)), Tuple(Discrete(16), Discrete(16)))} |
Symmetric |
False |
Import |
|
The Pursuit-Evasion Grid World Environment.
An adversarial 2D grid world problem involving two agents: an evader and a pursuer. The evader’s goal is to reach a goal location, on the other side of the grid, while the goal of the pursuer is to spot the evader before it reaches it’s goal. The evader is considered caught if it is observed by the pursuer, or occupies the same location. The evader and pursuer have knowledge of each others starting locations. However, only the evader has knowledge of it’s goal location. The pursuer only knowns that the evader’s goal location is somewhere on the opposite side of the grid to the evaders start location.
This environment requires each agent to reason about the which path the other agent will take through the dense grid environment.
Possible Agents
Evader =
"0"Pursuer =
"1"
State Space
Each state is made up of:
the
(x, y)coordinate of the evaderthe direction the evader is facing
the
(x, y)coordinate of the pursuerthe direction the pursuer is facing
the
(x, y)coordinate of the evader’s start locationthe
(x, y)coordinate of the pursuers’s start locationthe
(x, y)coordinate of the evaders’s goal locationthe minimum distance to it’s goal along the shortest path achieved by the evader in the current episode (this is needed to correctly reward the agent for making progress.)
Action Space
Each agent has 4 actions corresponding to moving in the 4 available directions with
respect to the direction the agent is currently facing: FORWARD=0, BACKWARDS=1,
LEFT=2, RIGHT=3.
Observation Space
Each agent observes:
whether there is a wall (
1) or not (0) in the adjacent cells in the four cardinal directions,whether they see the other agent in a cone in front of them (
1) or not (0). The cone projects forward up to ‘max_obs_distance’ (default=12) cells in front of the agent.whether they hear the other agent (
1) or not (0). The other agent is heard if they are within distance 2 from the agent in any direction.the
(x, y)coordinate of the evader’s start location,the
(x, y)coordinate of the pursuer’s start location,Evader: the
(x, y)coordinate of the evader’s goal location. Pursuer: blank coordinate(0, 0).
Note, the goal and start coordinate observations do not change during a single episode, but they do change between episodes.
Rewards
The environment is zero-sum with the pursuer receiving the negative of the evader
reward. Rewards are normalized so that returns are bounded between -1 and 1.
The evader receives a reward of 1 for reaching it’s goal location and a
reward of -1 if it gets captured. Additionally, the evader receives a small
reward of 0.01 each time it’s minimum distance achieved to it’s goal along the
shortest path decreases for the current episode. This is to make it so the
environment is no longer sparesely rewarded and helps with exploration and learning
(it can be disabled by the use_progress_reward parameter). If progress reward
is enabled, reward normalization still applies so the returns are bounded between
-1 and 1.
Dynamics
Actions are deterministic and will move the agent one cell in the target direction if the cell is empty.
Starting State
At the start of each episode the start location of the evader is selected at random from all possible start locations. The evader’s goal location is then chosen randomly from the set of available goal locations given the evaders start location (in the default maps the goal location is always on the opposite side of the map from a start location). The pursuers start location is similarly chosen from it’s set of possible start locations.
Episode End
An episode ends when either the evader is caught or the evader reaches it’s goal.
By default a max_episode_steps limit of 100 steps is also set. This may need to
be adjusted when using larger grids (this can be done by manually specifying a value
for max_episode_steps when creating the environment with posggym.make).
Arguments
grid- the grid layout to use. This can either be a string specifying one of the supported grids, or a custom :class:PEGridobject (default ="16x16").max_obs_distance- the maximum number of cells in front each agent’s field of vision extends (default =12).use_progress_reward- whether to reward the evader agent for making progress towards it’s goal. If False the evader will only be rewarded when it reaches it’s goal, making it a sparse reward problem (default = ‘True`).
Available variants
The PursuitEvasion environment comes with a number of pre-built grid layouts which
can be passed as an argument to posggym.make, to create different grids.
Grid name |
Grid size |
|---|---|
|
8x8 |
|
16x16 |
|
32x32 |
For example to use the PursuitEvasion environment with the 32x32 grid layout, and
episode step limit of 200, and the default values for the other parameters you would
use:
import posggym
env = posggym.make(
'PursuitEvasion-v0',
max_episode_steps=200,
grid="32x32",
)
Version History
v1: Minor update, mainly removing unused parameters:removed
action_probsparameter (actions are now always deterministic)removed
normalize_rewardparameter (rewards are now always normalized)
v0: Initial version
References
[This Pursuit-Evasion implementation is directly inspired by the problem] Seaman, Iris Rubi, Jan-Willem van de Meent, and David Wingate. 2018. “Nested Reasoning About Autonomous Agents Using Probabilistic Programs.” ArXiv Preprint ArXiv:1812.01569.
Schwartz, Jonathon, Ruijia Zhou, and Hanna Kurniawati. “Online Planning for Interactive-POMDPs using Nested Monte Carlo Tree Search.” In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8770-8777. IEEE, 2022.