New algorithm solves most Atari envs


Hi, some weeks ago we released the code to beat almost any Atari environment in OpenAI. No learning, no DNN, just a Monte Carlo approach (a “fractal” Monte Carlo or FMC) beating A3C or DQN.

If you wish to try it out:


This is a big caveat:

when we have a perfect model of the environment

Given the need for a queryable model, I’m wondering if this has not simply reinvented MCTS planning algorithms, so the comparisons being made to state of the art learners are actually comparing between learning algorithms and look-ahead planning.

So far though I have only scanned the paper, so perhaps this is not the case?


You are right Neil, it is just an MCTS replacement, but it is so efficient that, surprisingly, you can compare it with learning-based algorithms (and win) but you are still comparing apples with bananas.

About “when we have a perfect model”, well, all MCTS approaches uses some kind of model simulation (perfect or approximated, deterministic or stochastic) and FMC is the same stuff: if you have an “informative” simulation of your model (a DNN for instance) it is ok, but if it is “perfect” then it is just better, and in this very particular case of Atari, we have a perfect one to play with, so we use it.

So it is an MCTS replacement that is some orders of magnitude better performer (4 or 5 orders of magnitude) and that can be used on continuous decision spaces too.

The readme in the GitHub ( is surely the best place to read about, the arXiv paper is more dense and theoretical.


So the novelty in your paper is essentially an upgrade to MCTS performance, meaning you can scale up the effective search space, making the approach more feasible in practice?

Really 4 or 5 orders of magnitude increase in performance compared to a baseline MCTS implementation? That’s huge! Are there any caveats to that, that could affect the use of this planning algorithm on real world, real-time problems (using a learned next-state predictor as the model) - e.g. are you switching the main environment with a good downstream iteration, instead of single step and re-plan?

Is the performance comparison theoretical (e.g. number of samples required to achieve similar statistical bounds) or experimental, such as running your iteration vs perhaps the one from ?


The real novelty is that this paper defines intelligence as an entropic process, and then builds a Monte Carlo that follows the theory searching the space in such a way that some entropies minimize. The FMC is then a proof of concept of the theory, and the theory says this is an optimal implementation of the MC approach, so we claim quite a lot, to be honest.

The paper (here live as I modify it: contains a comparison against MCTS, the offline UTC version is I recall it well, the one in the paper you mention. They use about 3M samples of the model per action taken, we use between 300, in some cases 15 (boxing) and in some others, 2000 (bowling), and they get low scores and we beat human records, but the comparison is only about samples per action, scores apart.

Drawbacks? None, just unplug MCTS and replace with FMC, it should increase performance magically, entropy things. You can attach a DNN as a prior to the MC algorithm and feed it with good examples to train (rollouts from winner games only), in fact, we are also working on that part and it is mentioned in both the paper and the readme and tech docs of the repo… just don’t have time to work it out at the same time.

But remember, it is a simple use-case of the theory, not the thing itself.




Thanks so much for your work and article!

I created what I’m calling a Mandel Number Generator (MNG) to seed inputs of an Arduino (Nano) robot controller (H-Bridge). I feed back into the system, information from a single ping sensor. The machine is only told to iterate it’s recursive formula to:
Move forward
Move backward
Sharp turn left
Sharp turn right
Slight turn left
slight turn right

It makes more sense to me that a fractal generated number would Naturally end up having better hit rates than a memoryless random function.

Here’s a link to the demonstration:

The MNG ranges from -1.5 to 1. I’ve mapped the range of output to the above movement choices. Each movement choice has it’s own speed of which this vector is derived from the absolute value of the MNG.

I feedback into the system data from the single ping sensor, the current distance and the last distance information.
There are no other if/then functions regarding collision avoidance. There is no actual mapping procedures etc. The machine isn’t even told the sensor is on the front, yet it happens to move in the direction of the sensor.

My next video will demonstrate the differences of a random process vs. the MNG. It might be valuable for people to know that the Arduino 101 has 128 ML Intel Curie Neurons integrated with an accelerometer and gyro on board. This robot doesn’t yet use the ML but future iterations will.

Arduino source code :
#include <MapFloat.h>
//#include “CurieIMU.h” Future version will use Arduino 101 which contains 128 Intel Curie ML neurons. F#^%$ awesome board.

//Left side motor control
int ctrl_IN1 = 7;
int ctrl_IN2 = 6;
int ENABLE1 = 9; //enable Left back wheel
//Left side motor control
int ctrl_IN3 = 5;
int ctrl_IN4 = 4;
int ENABLE2 = 3; //enable Left Front wheel

//Right Side Motor Control
int ENABLE3 = 10; //enable Left Front wheel
int ENABLE4 = 11; //enable Left Front wheel
const int trigPin=13;
const int echoPin=12;

long duration;
int distance;

int ax, ay, az; // Future use to add Shannon Entropy data from accelerometer values into the Fractal landscape
int gx, gy, gz; // gyrometer values

int N;
float Left;
float Right;
float LeftSpeed;
float RightSpeed;
//float FrontPing;
int Ping1;
float Sensor1;
float SensorOld1;
float LeftMin=-1;
float RightMin=-1;
float LeftMax=1;
float RightMax=1;
int LeftChoice;
int RightChoice;

void setup() {
// put your setup code here, to run once:
pinMode(trigPin, OUTPUT); // Sets the trigPin as an Output
pinMode(echoPin, INPUT); // Sets the echoPin as an Input
// CurieIMU.begin();
delay(5000); // Allow the user to set everything down
//CurieIMU.autoCalibrateAccelerometerOffset(X_AXIS, 0);
//CurieIMU.autoCalibrateAccelerometerOffset(Y_AXIS, 0);
//CurieIMU.autoCalibrateAccelerometerOffset(Z_AXIS, 0);
pinMode(ctrl_IN1, OUTPUT);
pinMode(ctrl_IN2, OUTPUT);
pinMode(ctrl_IN3, OUTPUT);
pinMode(ctrl_IN4, OUTPUT);

//Below are Right side wheels
pinMode(ENABLE3,OUTPUT); //Enable 10 PWM
pinMode(ENABLE4,OUTPUT); //Enable 11 PWM
pinMode(A2, OUTPUT);
pinMode(A3, OUTPUT);
pinMode(A4, OUTPUT);
pinMode(A5, OUTPUT);// Using analog as digital


void loop() {
for (float I = 0; I < 63; I++) {
for (float J = 0; J < 63; J++) {
float X=(I-52)/31;
float Y=(J-42)/31;
float XA=0; // Consider resetting XA current sensor1 number instead of zero.
float YA=0; // Consider resetting YA to prior Sensor1 number instead of zero.
int ITER=0;

if (XA!=0){
float Mand=XA;// Adding +0.25 here would create more equal number ranges for the MNG but wouldn't be Natural but possibly interesting to put it on a more of a point of criticallity.

// delay(1000);
if (N==1){Left=Mand;}
if (N==2){Right=Mand;}
if (N>2){
// print serial the last 2 fractal numbers
// make both motors run in their directions and speeds
// map their ranges for speed and directions appropriately

digitalWrite(trigPin, LOW);
// Sets the trigPin on HIGH state for 10 micro seconds
digitalWrite(trigPin, HIGH);
digitalWrite(trigPin, LOW);
// Reads the echoPin, returns the sound wave travel time in microseconds
duration = pulseIn(echoPin, HIGH);
// Calculating the distance
distance= duration*0.034/2;
if (distance>100){distance=100;}

SensorOld1=Sensor1; //Retains previous ping distance feedback.
float Sensor1=mapFloat(distance,1,100,0.5,0.01); // Maps distance information to something not so overwhelming for the fractal landscape.

// Next 4 lines create minimum and maximum ranges from ping distance + fractal landscape.
if (Left>LeftMax){LeftMax=Left;};
if (Left<LeftMin){LeftMin=Left;}
if (Right>RightMax){RightMax=Right;}
if (Right<RightMin){RightMin=Right;}

LeftSpeed=abs(Left150));// creates vector from MNG range -1.5 to 1

analogWrite(ENABLE1, LeftSpeed);
analogWrite(ENABLE2, LeftSpeed);
analogWrite(ENABLE3, RightSpeed);
analogWrite(ENABLE4, RightSpeed);

if (LeftChoice==1){
if (LeftChoice==2){
if (LeftChoice==3){

if (RightChoice==1){
if (RightChoice==2){
if (RightChoice==3){


Serial.println(“LR Choices Above”);
Serial.println(“LR Numbers Above”);
Serial.print("Distance: ");

}// XA<>0
//Mandelbrot set maths below (no I’m not English) :slight_smile: :
float XTEMP=XAXA-YAYA+(X-Sensor1);
if (XAXA+YAYA<=4 && ITER<5200){goto HereA;}

}//end for J
}// end of for I


//Below are the functions to control the outputs to the H-Bridge.
void startCar(){
digitalWrite(ctrl_IN1, HIGH); //7,6,5,4
digitalWrite(ctrl_IN2, LOW);
digitalWrite(ctrl_IN3, HIGH);
digitalWrite(ctrl_IN4, LOW);
void stopCarLeft(){
digitalWrite(ctrl_IN1, LOW);
digitalWrite(ctrl_IN2, LOW);
digitalWrite(ctrl_IN3, LOW);
digitalWrite(ctrl_IN4, LOW);

void stopCarRight(){
digitalWrite(A2, LOW);
digitalWrite(A3, LOW);
digitalWrite(A4, LOW);
digitalWrite(A5, LOW);
void LeftForward(){
digitalWrite(ctrl_IN1, LOW);
digitalWrite(ctrl_IN2, HIGH);
digitalWrite(ctrl_IN3, HIGH);
digitalWrite(ctrl_IN4, LOW);
void LeftBackward(){
digitalWrite(ctrl_IN1, HIGH);
digitalWrite(ctrl_IN2, LOW);
digitalWrite(ctrl_IN3, LOW);
digitalWrite(ctrl_IN4, HIGH);

void RightBackward(){
digitalWrite(A2, LOW);
digitalWrite(A3, HIGH);
digitalWrite(A4, HIGH);
digitalWrite(A5, LOW);
void RightForward(){
digitalWrite(A2, HIGH);
digitalWrite(A3, LOW);
digitalWrite(A4, LOW);
digitalWrite(A5, HIGH);

void SharpRightTurn(){

void SharpLeftTurn(){

void Forward(){

void Backward(){

void Stop(){

Here are my parts lists:
Robot Chassis:

L298N H-Bridges:

Arduino Nano:

But preferably use for future research the Arduino 101 with Curie Neurons. It just tripled in price:

I would be delighted to experiment with any other ideas you have?

Bill Scott


With one sensor, if it avoids obstacles it is ok, there is not much room for improvements! Your idea is too immature to think about adding anything on top of it, my algorithm uses some more high-level input that a distance sensor reading, and needs some sort of simulation of what will happen next, so not much to be done here.


On a second review, another crude analogy I might make is of a “smart BEAM search”. In normal BEAM search, the keep metric is overall likelihood. In your case, it is a more complex measure which also encourages diversity.

Something about the entropy metric you are using to decide which search paths to keep makes it highly efficient, and I wonder what problem domains this efficiency is targeted at. The amazing results on games which are essentially simplified physics suggests that the technique could work in real-world robotics, although experiments are required to see how reliant it is on accuracy in the environment model. But if it is robust enough with current state of the art predictors, it could make for a great real-time planning stage.

I would be interested to see comparisons on strategy board games. How would it fare substituted in to Go players based on MCTS for instance? I suspect we would see some limitations in dense strategic environments with many critical decisions, as the entropy measure used to prune the search tree down might not align with the game environment.

Another thing worth trying IMO is comparisons with BEAM search in machine translation. If you can balance likelihood versus diversity measures when pruning, then I think that part of the algorithm may be more likely to find good word re-ordering when looking for alternative phrasing, where regular BEAM search could cull decent alternatives before they demonstrate their true value during generation.