I have been playing around with the code from here. I slightly modified it to see exactly what env.reset() returns. Please see the code below:
```python
import gym
import numpy as np

def run_episode(env, parameters):
    total_reward = 0
    for _ in range(100):
        observation_n = env.reset()
        action = 0 if np.matmul(parameters, observation_n) < 0 else 1
        _, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward

env = gym.make('CartPole-v0')
parameters = np.random.rand(4) * 2 - 1
learning_rate = 0.1
best_reward = 0
for _ in range(1000):
    new_parameters = parameters + learning_rate * (np.random.rand(4) * 2 - 1)
    episode_reward = run_episode(env, new_parameters)
    print("episode_reward %d best reward %d" % (episode_reward, best_reward))
    if best_reward < episode_reward:
        best_reward = episode_reward
        parameters = new_parameters
    if episode_reward == 100:
        break
```
So I take the observation directly from env.reset() rather than observing what the agent's action does via env.step(). From my understanding, these should be the same. But when I run this code, I always get a reward of 1.000.
Am I missing something? Is my understanding of env.reset() wrong?
Edit: I just realized that env.reset() creates a new environment. So every time I call env.reset(), the environment is created anew. And since the pole is in the upright position whenever it is created, my reward is always 1.000. Is my understanding correct?
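To make the difference between the two loops concrete, here is a minimal sketch using a hypothetical stand-in environment (ToyEnv is invented for illustration and is not CartPole; it only mimics gym's reset()/step() interface): reset() always returns the initial observation, while the observation that actually reflects an action comes from step().

```python
# ToyEnv is a hypothetical stand-in that mimics gym's reset()/step() API.
class ToyEnv:
    def reset(self):
        self.state = 0           # reset() re-initializes to the start state
        return self.state        # and returns the initial observation
    def step(self, action):
        self.state += 1          # state evolves only through step()
        done = self.state >= 5
        return self.state, 1.0, done, {}

env = ToyEnv()

# Calling reset() on every iteration always yields the initial observation:
resets = [env.reset() for _ in range(3)]
print(resets)                    # [0, 0, 0]

# The conventional episode loop: reset once, then take each new observation
# from step() so the policy sees the consequences of its actions.
observation = env.reset()
total_reward = 0
done = False
while not done:
    action = 0                   # a real policy would choose based on `observation`
    observation, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)              # 5.0
```

In the gym version this corresponds to moving env.reset() out of the per-step loop and feeding the observation returned by env.step() into the next action choice.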