An important detail in most value-based RL algorithms is that you need F() to be a prediction not of the expected immediate reward from a 4-frame sequence, but of the expected return, which is defined as the discounted sum of all future rewards.
As you do not have this value in the table, the solution is to “bootstrap” it from your predictor. If S, A are your current state and action, and S’, a’ are the next state and a possible next action, then in Q-learning your target for training F() - known as the TD target - would be
F(S, A) target = R + discount * max_a’[F(S’,a’)]
You have to re-calculate this target every time you train: as the agent gets better at predicting the true return, the training targets for F() in the mini-batch will change. The discount is often represented by the lowercase Greek letter gamma, and should be between 0 and 1. Typical values for continuing problems with small time steps are close to 1, e.g. 0.99 or 0.999.
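To make the bootstrapped target concrete, here is a rough sketch of computing it for a mini-batch. The linear `q_values` model is just a hypothetical stand-in for your F() so the example runs; in practice it would be your neural network's forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, NUM_ACTIONS = 8, 4

# Stand-in for the function approximator F(S, a): a random linear model,
# purely so the example runs end to end.
W = rng.normal(scale=0.1, size=(STATE_DIM, NUM_ACTIONS))

def q_values(states):
    """Estimated action values for a batch of states, shape (batch, num_actions)."""
    return states @ W

def td_targets(rewards, next_states, dones, discount=0.99):
    """R + discount * max_a'[F(S', a')], dropping the bootstrap term on
    terminal transitions (there is no future return once the episode ends)."""
    max_next_q = q_values(next_states).max(axis=1)
    return rewards + discount * max_next_q * (1.0 - dones)

# Example mini-batch of 3 transitions:
rewards = np.array([0.0, 1.0, -1.0])
next_states = rng.normal(size=(3, STATE_DIM))
dones = np.array([0.0, 0.0, 1.0])   # the last transition ends the episode
print(td_targets(rewards, next_states, dones))
```

Note that the targets are recomputed from the current predictor each time you build a mini-batch, which is exactly why they keep changing as the agent improves.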
Unfortunately this is not a stable algorithm on its own. I suggest you read the DQN paper to see what they did about the instability. One key ingredient is experience replay, which is not too different from your idea of saving up very many iterations and then training the agent at the end. Your algorithm might even work in this regard, but 10,000,000 iterations played randomly before training would take a very long time, and the data quality from completely random play might not be enough to train an expert player in one step. Ideally you want to sample data from slightly better and better players as training progresses, in order to make continual improvements to the agent.
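For reference, a minimal experience replay buffer can be as simple as a fixed-size deque that you push transitions into and sample random mini-batches from. This is only a sketch of the general idea (the field layout below is one possible choice, not the DQN paper's exact implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions.

    Old transitions are discarded as new ones arrive, so the buffer gradually
    reflects better and better policies rather than purely random play.
    """
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Draw a random mini-batch, breaking up correlations between
        consecutive transitions from the same episode."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

During training you would interleave acting and learning: push each new transition into the buffer, then every few steps sample a mini-batch, compute the TD targets as above, and take a gradient step on F().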