How to save and load model parameters and evaluate the model in baselines


#1

I modified the baselines A2C code in order to save and load model parameters and evaluate the trained model.
However, I cannot reproduce the reward sums shown during training.
When I plot episode rewards during training, the value is around 300, but the mean reward over 100 episodes in my evaluation code is 3.29.
I’m not sure which part of the code has the problem.

For saving the model, I used the save method implemented in the a2c Model class.
I added a few lines inside the for loop of the learn function in a2c.py, as below:

reward_cur = safemean(rewards)
if update == 1:
    max_reward = reward_cur
if update % log_interval == 0 and reward_cur > max_reward:
    checkdir = osp.join(logger.get_dir(), 'checkpoints')
    os.makedirs(checkdir, exist_ok=True)
    savepath = osp.join(checkdir, '%.5i'%update)
    print('Saving to', savepath)
    model.save(savepath)
    max_reward = reward_cur
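
For context, my understanding of Model.save and Model.load is that they simply joblib-dump the session's trainable parameters and assign them back on load, roughly equivalent to the sketch below (save_params and load_params are just illustrative names, not baselines functions):

import joblib

def save_params(sess, params, save_path):
    # fetch the current values of the trainable variables and write them to disk
    ps = sess.run(params)
    joblib.dump(ps, save_path)

def load_params(sess, params, load_path):
    # read the saved values and assign them back into the graph's variables
    loaded = joblib.load(load_path)
    sess.run([p.assign(v) for p, v in zip(params, loaded)])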

Then, for evaluation, I wrote an act function and a Runner_Eval class by modifying the learn function and the Runner class.
In the act function, I added a few lines for parameter loading and removed the training-related lines, as follows:

def act(policy, env, seed, nsteps=5, ...):    
	tf.reset_default_graph()
	set_global_seeds(seed)
	
	nenvs = 1
	ob_space = env.observation_space
	ac_space = env.action_space
	model = Model(policy=policy, ob_space=ob_space, ...)
	# load parameters
	if ckpt is not None:
		#checkdir = osp.join(logdir, 'checkpoints')
		#loadpath = osp.join(checkdir, '00100')
		print('Load ckpt', ckpt)
		model.load(ckpt)

	runner = Runner_Eval(env, model, nepi=num_epi)

	tstart = time.time()
	rewards = runner.run()
	nseconds = time.time()-tstart
	
	reward_cur = safemean(rewards)
	logger.record_tabular("avg rewards(%d epi)"%(num_epi), float(reward_cur))
	logger.dump_tabular()
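
For completeness, I call act roughly like this; the env id and checkpoint path are placeholders, the exact keyword arguments depend on the parts of the signature I elided above, and the CnnPolicy import path may differ between baselines versions. The env construction mirrors run_atari.py so that the observation preprocessing matches training:

from baselines.common.cmd_util import make_atari_env
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.ppo2.policies import CnnPolicy

# single evaluation env with the same wrappers/frame stacking as training
env = VecFrameStack(make_atari_env('BreakoutNoFrameskip-v4', num_env=1, seed=0), nstack=4)
act(CnnPolicy, env, seed=0, ckpt='checkpoints/00100', num_epi=100)
env.close()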

The Runner_Eval class is modified to run the agent for the desired number of episodes, as below:

class Runner_Eval(object):
	def __init__(self, env, model, nepi=100):
		self.env = env
		self.model = model
		nh, nw, nc = env.observation_space.shape
		nenv = env.num_envs
		self.obs = np.zeros((nenv, nh, nw, nc), dtype=np.uint8)
		self.nc = nc
		obs = env.reset()
		self.nepi = nepi
		self.states = model.initial_state
		self.dones = [False for _ in range(nenv)]

	def run(self):
		mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
		mb_states = self.states
		ndones = 0
		reward_sum = 0.0
		while ndones<self.nepi:
			actions, values, states, _ = self.model.step(self.obs, self.states, self.dones)
			obs, rewards, dones, _ = self.env.step(actions)
			reward_sum += rewards
			self.states = states
			self.dones = dones
			self.obs = obs
			if dones:
				#print(ndones, reward_sum)
				mb_rewards.append(reward_sum)
				reward_sum = 0.0
				ndones += 1

		return mb_rewards

Is it right to use the save and load methods in the Model class to keep the trained model?
And is the evaluation code in Runner_Eval the proper way to do this?