Help reproducing benchmark scores


I trained the benchmark A2C algorithm on the gym Breakout environment. Training converts the game to the DeepMind format, where the reward returned at each step is zero or one. What is the proper way to load this trained model and play the game to see whether the scores match the benchmark?
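For context, here is a rough sketch of what I understand the reward clipping to do, assuming it takes the sign of each raw reward (as `ClipRewardEnv` in the baselines Atari wrappers does). The per-step rewards below are made up for illustration, and `clipped_return` is a hypothetical helper, not part of any library.

```python
import numpy as np

def clipped_return(raw_rewards):
    """Sum of sign-clipped rewards, i.e. each step contributes -1, 0, or +1."""
    return float(np.sign(raw_rewards).sum())

# Hypothetical raw per-step rewards from one Breakout episode
# (bricks in different rows are worth different point values).
raw_rewards = np.array([0, 1, 0, 4, 0, 7, 1], dtype=np.float32)

raw_score = float(raw_rewards.sum())         # score as the game itself counts it
clipped_score = clipped_return(raw_rewards)  # score as a clipped environment reports it

print("raw:", raw_score, "clipped:", clipped_score)
```

If the benchmark reports raw game scores, the total printed by a clipped environment would not be directly comparable to it.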

I tried the following, but the agent is not performing well. It seemed to do better during training, so I suspect the evaluation setup differs from the training setup, perhaps in the grayscale conversion, the number of frames being skipped, or something similar. Any suggestions or reference code would be appreciated.

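import gym
import numpy as np
from baselines.common.atari_wrappers import wrap_deepmind

def play_episode(env_name, model, seed):
    env = gym.make(env_name)
    env.seed(seed)  # seed() is a method call, not an attribute assignment
    env = wrap_deepmind(env, frame_stack=True, scale=True)
    obs, states, done = env.reset(), None, False
    episode_rew = 0
    # Play one episode until the wrapped environment reports done
    while not done:
        # Add a batch dimension: (84, 84, 4) -> (1, 84, 84, 4)
        obs = np.reshape(obs, (1, 84, 84, 4))
        # states is used for the LSTM model only
        action, value, states, _ = model.step(obs, states, done)
        obs, rew, done, _ = env.step(action)
        episode_rew += rew
    print("Episode reward", episode_rew)

    return episode_rew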
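
One check I could run to narrow down a preprocessing mismatch is to compare basic statistics of the observations fed to the model in training and in evaluation. Below is a minimal sketch using synthetic arrays in place of real frames; `describe_obs` is a hypothetical helper, not part of gym or baselines. If training fed the network uint8 frames and evaluation feeds floats in [0, 1] (which is what `scale=True` produces), the inputs would differ by a factor of 255.

```python
import numpy as np

def describe_obs(obs):
    """Summarize an observation so two preprocessing pipelines can be compared."""
    arr = np.asarray(obs)
    return {
        "shape": arr.shape,
        "dtype": str(arr.dtype),
        "min": float(arr.min()),
        "max": float(arr.max()),
    }

# Synthetic stand-ins for a stacked 84x84x4 frame (not real game frames).
rng = np.random.RandomState(0)
unscaled = rng.randint(0, 256, size=(84, 84, 4)).astype(np.uint8)
scaled = unscaled.astype(np.float32) / 255.0

print("unscaled:", describe_obs(unscaled))
print("scaled:  ", describe_obs(scaled))
```

Printing the same summary for an observation from the training environment and one from `play_episode` would show whether the shape, dtype, or value range disagree.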