OpenAI framework for POMDP and reward discounting problem


#1

Hi all,
I created a simple gym environment with a mock game (think Scrabble, but with betting like blackjack).
(I know, brilliant.)

Overview:
Essentially, the observable state is what the Scrabble board shows, what YOUR private tiles are, and how much you can win if you play the winning tile combo. The action space is Discrete(10) and Box(0, 100).
The game is essentially a partially observable / incomplete-information Markov process. You get no reward until the game is done, and each game has a different number of moves.

_reset: sets the board to blank, gives each player a tile, and restores the available cash
_step(action): if the action is valid, the tile is played and $x is wagered; returns nextState (tiles, money, board), reward (only when the game is done: +cash if you won, -cash if you lost), and isGameDone
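For concreteness, here is a stripped-down sketch of roughly what the environment does; the class name, board size, and the win condition below are placeholders, not my real code (it uses the old-style _reset/_step method names):

```python
import gym
from gym import spaces
import numpy as np

class MockScrabbleBettingEnv(gym.Env):
    """Toy sketch: each turn you play one of 10 tile slots and wager some cash."""

    def __init__(self, max_moves=5):
        self.max_moves = max_moves
        # action = (which tile/combo to play, how much to wager)
        self.action_space = spaces.Tuple((
            spaces.Discrete(10),
            spaces.Box(low=0.0, high=100.0, shape=(1,)),
        ))
        # observation = board contents + your private tile + remaining cash
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(12,))

    def _reset(self):
        self.board = np.zeros(10, dtype=np.float32)   # blank board
        self.tile = float(np.random.randint(1, 27))   # each player gets a tile
        self.cash = 100.0                             # restore available cash
        self.wagered = 0.0
        self.moves = 0
        return self._obs()

    def _step(self, action):
        slot, wager = action
        if self.board[slot] == 0:                     # valid move: slot still empty
            self.board[slot] = self.tile
            w = float(np.clip(np.asarray(wager).ravel()[0], 0.0, self.cash))
            self.cash -= w
            self.wagered += w
        self.moves += 1
        done = self.moves >= self.max_moves
        won = bool(np.random.rand() < 0.5)            # placeholder win condition
        # reward only when the game is done: +cash you wagered if you won, -cash if you lost
        reward = (self.wagered if won else -self.wagered) if done else 0.0
        return self._obs(), reward, done, {}

    def _obs(self):
        return np.append(self.board, [self.tile, self.cash]).astype(np.float32)
```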

My Problem
You can’t simply sample states (from which you then choose a random action) and see what happens.
The reason you can't randomly sample is that you have to play up to a state in order for it to be "realistic".
Does that make sense?

My Question

  1. In your expert opinion, what do you think would be the best approach to training an agent that operates in a realm of incomplete information, such as my game?
  2. Is there a clever way to discount rewards over moves that is not linear? (see the sketch after this list)
    • In my game your reward is 2x the money you wagered during the game, and each move has a different amount of money wagered (e.g. 5 moves with wagers of 0, 10, 0, 0, 20, for a total of 30 wagered).
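To make point 2 concrete, this is the sort of thing I have in mind: instead of a flat gamma^t discount, spread the terminal reward back over the moves in proportion to how much each move wagered (purely illustrative, the function name and numbers are made up):

```python
import numpy as np

def wager_weighted_returns(wagers, final_reward):
    """Give each move a share of the terminal reward proportional to its wager,
    instead of the usual uniform or exponential (gamma^t) discount."""
    wagers = np.asarray(wagers, dtype=np.float64)
    total = wagers.sum()
    if total == 0:
        return np.zeros_like(wagers)        # nothing was risked, nothing to assign
    return final_reward * wagers / total    # per-move return targets

# The example above: 5 moves, wagers 0,10,0,0,20 (30 total), game won for 2x = +60
print(wager_weighted_returns([0, 10, 0, 0, 20], 60.0))
# -> [ 0. 20.  0.  0. 40.]
```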

Incoherent thought (skip this)
I have massive amounts of compute power on my desktop (4x GTX 1080 and 32 threads). I started writing a CUDA kernel that uses a simple neural network, with the goal of doing Monte Carlo rollouts.

i.e., with a given policy, play x moves against an opponent using the same policy, permuting the tiles he was given (since they are hidden).
Then play the move that has the best score, and continue until the game is over.
=> Once x Monte Carlo sampled games have been played, retrain the policy on the new games as the desired optimal moves (roughly the loop sketched below).
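Roughly, the loop I have in mind looks like this; note that legal_moves() and randomize_hidden_tiles() are hypothetical helpers, and I'm assuming the env can be deep-copied for rollouts:

```python
import copy

def choose_move_by_rollouts(env, policy, n_rollouts=50):
    """Score each legal move with Monte Carlo rollouts: deal the opponent a random
    permutation of the unseen tiles, then let both sides follow the current policy
    to the end of the game. Return the move with the best average return."""
    best_move, best_score = None, float("-inf")
    for move in env.legal_moves():                  # hypothetical helper
        total = 0.0
        for _ in range(n_rollouts):
            sim = copy.deepcopy(env)                # private copy of the game
            sim.randomize_hidden_tiles()            # hypothetical: permute opponent's tiles
            obs, reward, done, _ = sim._step(move)
            while not done:                         # both players use the same policy
                obs, reward, done, _ = sim._step(policy(obs))
            total += reward
        if total / n_rollouts > best_score:
            best_move, best_score = move, total / n_rollouts
    return best_move

def monte_carlo_iteration(env, policy, n_games=100):
    """Play n_games with rollout-chosen moves and collect (state, move) pairs,
    treating the chosen moves as the new training targets for the policy."""
    dataset = []
    for _ in range(n_games):
        obs, done = env._reset(), False
        while not done:
            move = choose_move_by_rollouts(env, policy)
            dataset.append((obs, move))
            obs, _, done, _ = env._step(move)
    return dataset  # fit the policy network to these (state -> move) targets
```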

(After reading Dr. Sutton's book, in record time might I add, I realized I am not the smartest man in the world.) Hence I would like to work with the OpenAI framework if possible, but I am not exactly sure whether it is applicable to my problem above. CUDA is oh so very annoying, so please help a newb out.


#2

Addendum:
I figured out a solution to my problem.

Ultimately, the results of trying to solve the continuous action space directly were far less impressive than chopping the action space up into n available bins.

IMHO (not doing calculus, ain't nobody got time for that), as the number of bins goes towards infinity you should be able to approximate a solution to the continuous control problem using discrete steps of 1/n.
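Concretely, the mapping back from a bin index to an actual wager is as simple as this (the bin count is arbitrary):

```python
def bin_to_wager(bin_index, n_bins=10, low=0.0, high=100.0):
    """Collapse the continuous Box(0, 100) wager into n_bins evenly spaced amounts."""
    return low + (high - low) * bin_index / (n_bins - 1)

# With 10 bins the agent can only wager 0, 11.1, 22.2, ..., 100
print([round(bin_to_wager(i), 1) for i in range(10)])
```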

And it also seems possible to chop the action space up into a very small number of bins initially and then, as you discover the most promising (hot-zone) bin, divide it into more bins/buckets, which means you don't have to sample the action space uniformly. Will this get you stuck in local minima? Spoiler: oh yeah. However, the problem becomes small enough that you can reassure yourself in real time that you are not in a local minimum.
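For what it's worth, the refinement step can be as dumb as splitting whichever bin currently looks best (again, just a sketch with made-up numbers):

```python
import numpy as np

def refine_hot_bin(bin_edges, bin_values, n_new=4):
    """Split the most promising ('hot') bin into n_new finer bins, so sampling
    concentrates where the estimated value is highest instead of staying uniform."""
    hot = int(np.argmax(bin_values))                  # bin with the best estimated return
    lo, hi = bin_edges[hot], bin_edges[hot + 1]
    new_edges = np.linspace(lo, hi, n_new + 1)[1:-1]  # interior split points
    return np.sort(np.concatenate([bin_edges, new_edges]))

# Start coarse over the Box(0, 100) wager dimension ...
edges = np.linspace(0.0, 100.0, 5)       # 4 bins: [0, 25, 50, 75, 100]
values = [0.1, 0.7, 0.2, 0.0]            # pretend the second bin looks best
print(refine_hot_bin(edges, values))
# -> [  0.    25.    31.25  37.5   43.75  50.    75.   100.  ]
```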