What Is the Best Hardware to Match the TPU?
The Best GPUs to Match the TPU Hardware
Honestly, TensorFlow is not optimized for training across multiple GPUs.
Keras now supports a primitive way of using multiple GPUs.
I think MXNet will sweep the market real quick if TensorFlow does not address this issue, and if it weren’t for its 190 MB library I would be jumping ship. TensorFlow on its own has state-of-the-art design ideas, but it is clunky, unoptimized and, quite frankly, not high-level enough. With Keras it becomes pleasant. There is definitely a void to be filled in the market. Currently the users are divided into researchers; software guys who are swearing and knocking their heads against the desk while working with the current frameworks; people who are able to leverage minimal amounts of DL for large benefits (think any science field); and people who like the idea but, every time they sit down, end up googling what a certain error means. Unfortunately, the largest camp is the last group (I would estimate about 85%).
P.S. Write your software first and run it on your graphics card. Do a cost-benefit analysis of building a system vs. buying compute time in the cloud once you know what your project entails. No sense in dropping $$$$ if you don’t know what $$$$ you get in return. And while the papers talk about running a certain sim for 2 weeks on a bazillion GPUs, the secret is that you can get meaningful results on 1 GPU in like 2 hours for almost any problem. I have 4 1080 Tis and I have never used more than 2 for any OpenAI/TensorFlow task, although I have used all 4 with Keras and some CUDA stuff I do. Just my 2 cents.
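A rough way to run that build-vs-cloud comparison is a break-even calculation. This is only a sketch: the function name and every number in the usage lines are hypothetical placeholders, and it ignores resale value, your own time and cloud storage fees.

```python
def break_even_hours(build_cost, cloud_rate_per_hour,
                     power_kw=0.0, electricity_per_kwh=0.0):
    """Hours of compute after which owning the box is cheaper than renting.

    All inputs are assumptions you supply yourself; nothing here is a quote
    from a real cloud provider or hardware vendor.
    """
    # Hourly running cost of the machine you own (electricity only).
    own_rate = power_kw * electricity_per_kwh
    saving_per_hour = cloud_rate_per_hour - own_rate
    if saving_per_hour <= 0:
        # Renting is never more expensive per hour, so the build never pays off.
        return float("inf")
    return build_cost / saving_per_hour


# Hypothetical numbers: a 6000 box drawing 1.2 kW at 0.10 per kWh,
# compared against a cloud instance billed at 3.00 per hour.
hours = break_even_hours(6000, 3.00, power_kw=1.2, electricity_per_kwh=0.10)
print(f"break-even after roughly {hours:.0f} hours of actual use")
```

If the number of hours you realistically expect to use is well below the break-even point, renting is the cheaper option under these assumptions.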
frenk1981, thanks for the info! I’m thinking of doing some stuff with a simple proof-of-concept neural network written in C that uses CUDA matrix calculations, but I wasn’t sure whether this would be such a great idea if TensorFlow already included support for a lot of this stuff. This helped me make up my mind. Keras looks neat too.
It is articles like this one, https://www.codeproject.com/Articles/16650/Neural-Network-for-Recognition-of-Handwritten-Digi
that you will find very helpful.
NNs were around long before TF and this latest craze.
P.S. IMHO, if you are building a PC now for AI/ML stuff, I would strongly suggest something like the 16-core Threadripper (32 threads, 60 PCIe lanes), which is currently only $700, plus 2x Titans/1080 Tis and some M.2 storage. That is all you really need to beat the TPU's performance by a hefty clip, and at like no cost. I strongly favor having 32 threads on the CPU, as that can provide an immense speedup for certain program/algo types in learning, much more impressive than what the GPUs give. The workflow essentially needs to have the rollout/MC-type calculations on the GPU and anything that requires inter-agent/algo interaction on the CPU. The reason GPU implementations of algos suck is that the price you pay is waiting for memory to be copied between CPU and GPU land. Therefore, until TF or some other software (look at OpenACC, hint hint) figures out how to take any program and recompile it so it runs 100% on the GPU with as few memory transfers as possible, it ain't worth investing in too many GPUs.
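To put the memory-copy point in numbers, here is a back-of-the-envelope model, not a benchmark. The function name, the 20x kernel speedup and the 12 GB/s effective bus bandwidth are all assumptions made up for illustration, and the model pretends transfers and compute do not overlap.

```python
def gpu_offload_speedup(cpu_seconds, gpu_kernel_speedup,
                        bytes_moved, bus_bytes_per_second):
    """Overall speedup from moving a step to the GPU, counting copy time.

    cpu_seconds: time the step takes on the CPU.
    gpu_kernel_speedup: how much faster the GPU runs the math itself (assumed).
    bytes_moved: data copied host->device plus device->host for the step.
    bus_bytes_per_second: effective transfer bandwidth (assumed).
    """
    transfer_seconds = bytes_moved / bus_bytes_per_second
    gpu_total = cpu_seconds / gpu_kernel_speedup + transfer_seconds
    return cpu_seconds / gpu_total


# Hypothetical step: 1 s on the CPU, kernel 20x faster on the GPU.
heavy_copy = gpu_offload_speedup(1.0, 20.0, bytes_moved=12e9,
                                 bus_bytes_per_second=12e9)
light_copy = gpu_offload_speedup(1.0, 20.0, bytes_moved=0.12e9,
                                 bus_bytes_per_second=12e9)
print(f"copying 12 GB per step:   {heavy_copy:.2f}x")
print(f"copying 0.12 GB per step: {light_copy:.2f}x")
```

With these made-up numbers the same 20x kernel ends up slower than the CPU when every step drags 12 GB across the bus, and keeps most of its advantage when the data stays on the card.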
I saw this on this site: https://www.extremetech.com/computing/247199-googles-dedicated-tensorflow-processor-tpu-makes-hash-intel-nvidia-inference-workloads . Maybe someday OpenAI can create their own hardware. I also found this: https://www.hpcwire.com/2017/11/27/v100-good-not-great-select-deep-learning-aps-says-xcelerit/ . This is probably interesting too: https://www.hpcwire.com/2017/11/22/japan-unveils-first-quantum-computer-prototype/
The Google “benchmarks” are pretty cooked, and Nvidia posted their own press release at some point.
Bottom line is that the P100 is pretty powerful, and as a small-time researcher/enthusiast/wise guy you can be very competitive with just 4x 1080 Tis and 32 threads, which you can build for around 5k-6k depending on how lucky you are.
I built something like this just 2 months ago, and the prices have gone down like 10% over Black Friday.
I put the AMD 1950X into an ASUS Zenith motherboard with 32 GB of RAM (you don’t need more than this, and 64 GB would not POST at the speeds I paid for), all M.2 NVMe storage and 4x 1080 Tis for 8K Canadian. That is $6k USD with all premium components and liquid cooling, and it runs pretty quiet on my desktop. FYI, this is the original DevBox idea that Nvidia was selling for 15k, now with 32 threads to boot @ 4 GHz / 13k CUDA cores, with peak 100% load around 1200 watts.
This is what you should get unless you are an institution/agency/have grant money. As an FYI, I had to remove the backplates (both the back of the card and the exhaust grills) so that all 4 cards can run at 100% on stock cooling; otherwise they overheat. Running at 100%, the 4 GTXs will pull about 1000 watts from the wall and you will be able to dry your hair behind the box. If this becomes an issue (which it won’t, as currently it is really hard to keep the GPUs at 100% load for more than 1-2 minutes), you can always water cool the GPUs or buy the ones with that stuff pre-installed. Counter-intuitively, though, air cooling is actually better than water cooling at bringing a system back down when it only heats up periodically.
Linus did a video on Linus Tech Tips revealing that they set up the latest Canadian supercomputer similarly to this, going with AMD EPYC (Threadripper x2) and Vega cards. This will be the way to go for the next 4-5 years until we get graph processors. Remember that GPU acceleration of general computing algos is an afterthought, and hence the primary problem will remain (on our PCs, which don’t have the fancy NVLink/high bus speeds/capacities). Google is doing good things overall, but I would not go drooling over the TPU tech, as the benchmark makes it look better than it is, and really the quiet benchmarks Amazon released for MXNet are IMHO more impressive (linear acceleration up to 256 GPUs is the punchline).
Anyhow, back to trying to figure out how to sample policies in my stupid self-play toy. (It woulda been nice if the OpenAI wizards who wrote https://arxiv.org/pdf/1710.03748.pdf could tell me whether they considered re-training previous policies and then sampling them randomly, as that would lead to a Monte Carlo minimization type of self-play policy.)
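For anyone wondering what "sampling previous policies randomly" looks like in practice, here is a minimal sketch of just the sampling part. It is not taken from the paper; the `PolicyPool` class, the `recent_fraction` parameter and the string stand-ins for policies are all hypothetical, and the re-training of old snapshots is left out entirely.

```python
import random


class PolicyPool:
    """Keeps snapshots of earlier policies and hands out random opponents."""

    def __init__(self, seed=None):
        self._snapshots = []              # oldest first, newest last
        self._rng = random.Random(seed)   # private RNG so runs are repeatable

    def add(self, policy):
        """Store a snapshot (any object: weights, a checkpoint path, ...)."""
        self._snapshots.append(policy)

    def sample(self, recent_fraction=1.0):
        """Pick an opponent uniformly from the newest `recent_fraction` of snapshots.

        recent_fraction=1.0 samples from the whole history; smaller values
        restrict the choice to the most recent snapshots only.
        """
        if not self._snapshots:
            raise ValueError("pool is empty")
        if not 0.0 < recent_fraction <= 1.0:
            raise ValueError("recent_fraction must be in (0, 1]")
        window = max(1, int(round(len(self._snapshots) * recent_fraction)))
        return self._rng.choice(self._snapshots[-window:])


# Stand-in policies; in a real setup these would be saved network weights.
pool = PolicyPool(seed=0)
for step in range(10):
    pool.add(f"policy_{step}")

opponent = pool.sample(recent_fraction=0.2)   # one of the 2 newest snapshots
print("training against", opponent)
```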