Are there any examples in the current Gym where the reward is non-Markovian, i.e., depends not just on the action at time t but on all previous actions t_0 -> t? Does the current Environment interface generalize well to this problem (e.g., there's no issue with storing past actions in the Environment, using them to calculate the reward in env.step, and obviously resetting them on env.reset)?
In some environments the reward and next state depend only on the current state and action (e.g., FrozenLake), but there are environments where they also depend on hidden state information (e.g., all the algorithmic environments).
The current Environment interface works well with both types. I am not sure if any environment stores past actions explicitly; usually an action just changes the internal state of the environment.
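To illustrate, here is a minimal sketch of an environment with a non-Markovian reward that explicitly stores past actions. The class and reward scheme (penalizing repeated actions) are hypothetical and invented for this example; it mirrors the step/reset shape of the Environment interface without subclassing gym.Env, but the same pattern works as a gym.Env subclass:

```python
# Hypothetical example: the reward at time t depends on the entire action
# history t_0 -> t, which is kept as internal environment state and
# cleared on reset, exactly as the question suggests.

class RepeatPenaltyEnv:
    """Gives +1 for a never-before-seen action, -1 for a repeat."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.past_actions = []  # hidden state: all actions since reset

    def reset(self):
        self.past_actions = []  # forget the history on reset
        return 0                # trivial observation

    def step(self, action):
        # Non-Markovian reward: depends on every past action, not just this one.
        reward = 1.0 if action not in self.past_actions else -1.0
        self.past_actions.append(action)
        done = len(self.past_actions) >= self.horizon
        return 0, reward, done, {}
```

Usage follows the familiar loop: `env.reset()`, then repeated `env.step(a)` calls; the first `env.step(0)` after a reset yields reward 1.0, and a second `env.step(0)` yields -1.0, because the stored history has changed even though the observation has not.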