OpenAI framework for POMDP and reward discounting problem

#1

Hi all,
I created a simple gym environment for a mock game: think Scrabble, but with betting like blackjack.
(I know, brilliant.)

Overview:
Essentially, the observable state is what the Scrabble board shows, what YOUR private tiles are, and how much you can win if you play the winning tile combo. The action space is Discrete(10) and Box(0,100).
The game is essentially a partially observable (incomplete-information) Markov decision process. You get no reward until the game is done, and each game has a different number of moves.

_reset: sets the board to blank, gives each player a tile, and restores available cash
_step(action): if the action is valid, the tile is played and x amount of \$ is wagered.
Returns nextState (tiles, money, board), reward (nonzero only when the game is done: you get the cash if you won, or -cash if you lost), and isGameDone.
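
To make the interface above concrete, here is a minimal plain-Python sketch of an environment with the same reset/step shape. It deliberately does not import gym, and everything beyond the description above is an assumption made up for illustration: the class name `MockTileBetEnv`, the fixed `max_moves` game length, the validity check, and the coin-flip win rule standing in for the real game logic.

```python
import random


class MockTileBetEnv:
    """Toy stand-in for the game described above; the rules are placeholders."""

    N_TILE_ACTIONS = 10   # mirrors Discrete(10)
    MAX_WAGER = 100.0     # mirrors Box(0, 100)

    def __init__(self, max_moves=5, start_cash=100.0, seed=None):
        self.rng = random.Random(seed)
        self.max_moves = max_moves      # assumption: fixed length per instance
        self.start_cash = start_cash
        self.reset()

    def reset(self):
        # Blank board, one private tile, cash restored.
        self.board = []
        self.tiles = [self.rng.randrange(self.N_TILE_ACTIONS)]
        self.cash = self.start_cash
        self.wagered = 0.0
        self.moves = 0
        return self._observe()

    def _observe(self):
        # Only what this player can see: own tiles, money, board.
        return (tuple(self.tiles), self.cash, tuple(self.board))

    def step(self, action):
        tile, wager = action
        valid = (0 <= tile < self.N_TILE_ACTIONS
                 and 0.0 <= wager <= min(self.MAX_WAGER, self.cash))
        if valid:
            self.board.append(tile)
            self.cash -= wager
            self.wagered += wager
        self.moves += 1
        done = self.moves >= self.max_moves
        reward = 0.0
        if done:
            # Placeholder outcome: a coin flip instead of real scoring.
            won = self.rng.random() < 0.5
            reward = self.wagered if won else -self.wagered
        return self._observe(), reward, done
```

Every intermediate step returns a reward of 0.0, so any learning signal has to be propagated back from the final step of a full playthrough.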

My Problem
You can’t simply sample arbitrary states, choose a random action from each, and see what happens.
The reason is that a state is only "realistic" if you actually played the game up to it.
Does that make sense?

My Question

1. In your expert opinion, what would be the best approach to training an agent that operates under incomplete information, as in my game?
2. Is there a clever way to discount rewards over moves that is not linear?
• In my game your reward is 2x the money you wagered during the game, and each move can have a different amount wagered (e.g. 5 moves with wagers of 0, 10, 0, 0, 20, for a total of 30 wagered).
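
For reference, here is a small sketch of two ways a terminal-only reward could be spread over the moves of one game, using the 5-move example above. The first is the standard geometric discount (already non-linear in the move index). The second, `wager_weighted_credit`, is only a hypothetical heuristic invented here to illustrate the question, not an established method. The win amount of 60 (2x the 30 wagered) and the value of `gamma` are assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every move t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def wager_weighted_credit(wagers, final_reward):
    """Hypothetical heuristic: split the final reward in proportion to each wager."""
    total = sum(wagers)
    if total == 0:
        # Nothing was wagered: fall back to an even split.
        return [final_reward / len(wagers)] * len(wagers)
    return [final_reward * w / total for w in wagers]


wagers = [0, 10, 0, 0, 20]            # example from the post, total 30
final_reward = 2 * sum(wagers)        # assumed win: 2x the money wagered
rewards = [0, 0, 0, 0, final_reward]  # reward arrives only on the last move

print(discounted_returns(rewards, gamma=0.5))
print(wager_weighted_credit(wagers, final_reward))
```

With geometric discounting, moves far from the end get exponentially less credit regardless of what was wagered; the wager-weighted split instead gives zero credit to moves with no bet, which may or may not be what you want.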

Incoherent thought (skip this)
I have massive amounts of compute power (4x 1080 and 32 threads) on my desktop. I started writing a CUDA kernel that uses a simple neural network, with the goal of doing Monte Carlo.

I.e., with a given policy, play x moves vs. an opponent using the same policy, permuting the tiles he was given.
Then play the move that has the best score, and continue until the game is over.
=> When x Monte Carlo sampled games have been played, retrain the policy on the new games, using their moves as the desired
optimal moves.
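
The move-selection step of that idea can be sketched in plain Python, without CUDA. This is only a rough illustration under stated assumptions: `monte_carlo_best_move`, `sample_hidden`, and `rollout` are hypothetical names, and the toy game at the bottom (higher tile wins) is a stand-in for the real rules and policy.

```python
import random


def monte_carlo_best_move(candidate_moves, sample_hidden, rollout, n_samples, rng):
    """Score each candidate by averaging rollouts over sampled hidden tiles."""
    best_move, best_value = None, float("-inf")
    for move in candidate_moves:
        total = 0.0
        for _ in range(n_samples):
            hidden = sample_hidden(rng)           # guess the opponent's tiles
            total += rollout(move, hidden, rng)   # play out the game, get a score
        value = total / n_samples
        if value > best_value:
            best_move, best_value = move, value
    return best_move, best_value


# Toy stand-ins (assumptions): the opponent holds one tile in 0..9,
# and a move wins +1 if it beats that tile, otherwise -1.
def toy_sample_hidden(rng):
    return rng.randrange(10)


def toy_rollout(move, hidden, rng):
    return 1.0 if move > hidden else -1.0


rng = random.Random(0)
move, value = monte_carlo_best_move([2, 5, 8], toy_sample_hidden, toy_rollout, 200, rng)
print(move, value)
```

The (state, chosen move) pairs collected this way over many games would then serve as the training targets for the policy.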

(After reading Dr Sutton’s book, in record time might I add, I realized I am not the smartest man in the world.) Hence I would like to work with the OpenAI package if possible, but I am not exactly sure it is applicable to my problem above. CUDA is oh so very annoying, so please help a newb out.

#2