My time as a graduate student is slowly coming to an end (or at least it should be). If I had to reflect on what I learned along the way and what I wish I had known from the beginning, it would be this: how to scale up computation.
If you take the study of the human mind seriously, you will inevitably end up simulating complex cognitive models and evaluating large datasets. In recent months I ended up with two desktop PCs (Intel i7 machines) that crunched numbers day and night, all week long, while I sat in the office with nothing else to do but wait for the results (and write this blog, actually).
To scale up I had to accept a workflow different from how I did things before. Unfortunately, it is almost impossible to find any advice on this topic, mainly because large-scale computing is a novel addition to the cognitive scientist's toolbox. This post is my write-up of what I learned and what I would recommend. I start by discussing the solution that I discarded.
There is probably a computing cluster at your institution, or one that serves scientists from multiple universities. For instance, Heidelberg University was served by bwGrid, a cluster of computers at universities in Baden-Württemberg. The grid consists of a hundred IBM blade centers. A blade center is basically a refrigerator-sized cabinet that houses several computers mounted in a rack chassis. Specifically, each blade center houses 14 computers, and each computer has two Intel Xeon processors with four cores each, running at 2.8 GHz. If you wish to use the cluster, you are asked to supply your programs and data along with a script that determines how the computing jobs are allocated to the machines. In theory this is the place to go if you have advanced computing needs. However, I quickly found that the computing cluster doesn't satisfy my needs in several respects. Most importantly, your script may run for at most three days and is terminated if it has not finished by then. That means:
- You need to make your programs parallel. For many of my programs this is difficult or impossible because the algorithms are inherently sequential.
- You need to estimate how long your programs will run and parallelize them accordingly.
- You can't alter or refine your program once the script has started. For instance, I usually check the intermediate results and often find bugs, so that I need to restart the simulation. Working with the grid is simply not meant to be interactive.
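The second point can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration: the function name `min_jobs` and the safety factor are my own assumptions, and it pretends that a cluster core is exactly as fast as a desktop core, which it is not.

```python
import math


def min_jobs(total_core_days, limit_days=3.0, safety=0.5):
    """Smallest number of independent single-core jobs such that each
    one is expected to finish within `safety * limit_days`.

    total_core_days: estimated total work (cores x days on your own machine)
    limit_days:      wall-time limit of the cluster (three days on the grid)
    safety:          assumed fraction of the limit we dare to plan for
    """
    if total_core_days <= 0:
        return 0
    budget_per_job = limit_days * safety
    return math.ceil(total_core_days / budget_per_job)


# Illustration: a simulation that keeps 8 cores busy for 30 days.
work = 8 * 30
print(min_jobs(work))              # plan for half of the time limit
print(min_jobs(work, safety=1.0))  # no safety margin at all
```

The estimate only helps if the work really can be cut into that many independent pieces, which is exactly what the first point calls into question.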
Consider what this means in practice. A one-month computation on my two computers, which have faster CPUs (4.2 GHz and 3.8 GHz), would take 1-2 days on one blade center. That assumes I am able to split a job that runs on 8 cores into a job that runs on 100 cores. Furthermore, each of the 100 units has to finish within the three-day time window. I find these requirements rather tough. Finally, it simply doesn't make sense to run smaller or interactive jobs on the grid, so in the end you will still need some additional computing solution for those.
An alternative solution is to build your own computing cluster, where only you (and your colleagues) determine how the programs are scheduled and run. If you can afford the investment, I think this is the way to go.
The first decision you need to make is to select the hardware. There are basically three options, with their advantages and disadvantages:
Get a powerful Intel desktop CPU with 4 cores.
- Example: i7-4790 with an H97 motherboard
- Advantages: fast single-core performance, great compatibility (since these are used in popular desktop computers), inexpensive
- Disadvantages: small number of cores per machine, so the cores are dispersed across multiple machines
Get a high-end Intel CPU with 6 or 8 cores.
- Example: i7-5960X or i7-5820K with an X99 motherboard
- Advantages: fast single-core performance, high number of cores
- Disadvantages: expensive (especially since X99 boards and DDR4 memory are currently expensive)
Get a server CPU.
- Example: Xeon CPU with 8/12/16 cores on a multi-socket motherboard
- Advantages: high number of cores per machine
- Disadvantages: super-expensive, slow single-core performance
Somewhat surprisingly, I concluded that the first option is actually the best one, especially if we replace the i7 with the Xeon E3-1231, which uses the same popular LGA 1150 socket and offers similar performance (4 cores plus Hyper-Threading) for less money. The only disadvantage, that the cores are dispersed across multiple machines, can be taken care of by an appropriate choice of software.
The most naive way to operate your cluster is to set up a desktop OS on each PC, carry your programs and data to each machine on a USB stick, and run them there. Our first step is to make it unnecessary to physically approach the machines. Instead we will transfer the data and run the programs remotely over the LAN (which is most probably already available at your institution). This can be done with protocols such as FTP or SSH and their respective clients. Here we will focus on SSH. This means that we do not need peripheral devices (screen, mouse and keyboard) connected to each machine in our cluster. In turn, we can save computational resources by selecting an OS that does not set up a desktop environment at all. Do you recall MS-DOS? Conveniently, various Linux distributions offer so-called minimal installations. Such an install sets up only the tools necessary to run a Linux shell. That's actually all we need, since SSH just gives us a remote shell. You could run a remote desktop, but we don't need it (unless you wish to do some graphical rendering). To get a fancy file browser you can use a client like FireFTP (a Firefox add-on).
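Running a program remotely then boils down to calling the `ssh` client with a host and a command. Here is a minimal Python sketch of doing this for several machines at once. The host names, the `USERNAME` placeholder and the helper functions are made up for illustration, and the sketch assumes that key-based login is configured so that `ssh` does not stop to ask for a password.

```python
import shlex
import subprocess

# Hypothetical host names; replace them with the addresses of your machines.
HOSTS = ["node1", "node2"]


def ssh_command(host, remote_command, user=None):
    """Build the argument list for running `remote_command` on `host` via ssh."""
    target = f"{user}@{host}" if user else host
    return ["ssh", target, remote_command]


def run_everywhere(hosts, remote_command, user=None, timeout=30):
    """Run the same command on every host, one after another.

    Returns a dict mapping host -> (return code, captured stdout).
    """
    results = {}
    for host in hosts:
        proc = subprocess.run(
            ssh_command(host, remote_command, user),
            capture_output=True, text=True, timeout=timeout,
        )
        results[host] = (proc.returncode, proc.stdout.strip())
    return results


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch is safe to try.
    for host in HOSTS:
        print(shlex.join(ssh_command(host, "uptime", user="USERNAME")))
```

Once the machines are reachable, `run_everywhere(HOSTS, "uptime")` would execute the command for real and collect the output of each machine.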
This page describes how to set up minimal Ubuntu and this is what the minimal install of Ubuntu looks like.
It doesn't take long to realize how desolate the minimal OS is. SSH server installed? Nope. Fortran and C compilers installed? Nope. USB driver? Nope. Sorry, USB will not work. Fortunately, the LAN connection is working, so we can use apt-get to install additional packages. First, however, we need to physically visit the PC and set up a minimum of things. For this you will need a monitor and a video card. (Note: the video output on your motherboard will not work, because this Xeon processor has no integrated graphics.) First, we install an SSH server and check that it is running.
sudo apt-get install openssh-server
service ssh status
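Before walking away from the machine it is worth confirming from another computer that the SSH server actually accepts connections. A small sketch, assuming the server listens on the default port 22; the function name and the IP address in the example are placeholders of mine:

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unknown host names.
        return False


if __name__ == "__main__":
    # "192.168.0.10" is a made-up address; use the IP of your own machine.
    print(ssh_port_open("192.168.0.10"))
```

This only tells you that something is listening on the port. Whether you can log in is a separate question that the `ssh` client itself will answer.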
We can't perform the initial OS login remotely, so we set up the OS to log in automatically. To do this, run
sudo nano /etc/init/tty1.conf
and change the last line to
exec /bin/login -f USERNAME < /dev/tty1 > /dev/tty1 2>&1
While you are at the machine it is also worth checking the BIOS settings. For instance, I checked that Hyper-Threading is on, that CPU boost is on, and that the drive is properly detected as an SSD, and I set the fans to full speed. If your BIOS supports it, you may wish to enable Wake-on-LAN.
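With Wake-on-LAN enabled, a sleeping or powered-off machine can be started by sending it a so-called magic packet: six `0xFF` bytes followed by the machine's MAC address repeated 16 times. The sketch below builds and broadcasts such a packet. The function names, the MAC address and the broadcast address are placeholders, and depending on your network card the feature may also need to be switched on in the OS.

```python
import socket


def magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF followed by
    the 6-byte MAC address repeated 16 times (102 bytes in total)."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def wake(mac, broadcast="255.255.255.255", port=9):
    """Send the magic packet as a UDP broadcast (port 9 is a common choice)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


if __name__ == "__main__":
    # "00:11:22:33:44:55" is a placeholder; use the MAC of the machine to wake.
    print(len(magic_packet("00:11:22:33:44:55")))
```

Calling `wake("00:11:22:33:44:55")` from any computer on the same LAN would then send the packet.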
Now we can say goodbye to the machine's physical shell. From now on we will talk to it only remotely over the LAN.
What to install next depends on your choice of programming language. I guess you will want a programming environment with libraries for working with vectors, linear algebra, numerical optimization and statistics. Here is a tip: get an optimized linear algebra package such as OpenBLAS and compile it from source on the target machine. The build process detects the CPU's features, such as multithreading, AVX2 or SSE4, and compiles a version that uses them. Conveniently, the Xeon CPU that I recommended offers all of these features. For instance, I tested matrix multiplication: NumPy with OpenBLAS ran 35 times faster than vanilla NumPy with precompiled binaries. A custom BLAS implementation will also speed up R or Octave. (Matlab uses Intel's optimized MKL by default.)
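If you want to check what your own installation achieves, a matrix multiplication benchmark takes only a few lines. This is a rough sketch and not the test I ran: the matrix size, the number of repeats and the function name `time_matmul` are arbitrary choices of mine.

```python
import time

import numpy as np


def time_matmul(n=2000, repeats=3, seed=0):
    """Return the best wall-clock time (seconds) of an n x n matrix product."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b  # the product is discarded, we only care about the time
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    np.show_config()  # prints which BLAS/LAPACK libraries NumPy was built against
    n = 2000
    seconds = time_matmul(n)
    # A dense n x n product costs about 2 * n**3 floating point operations.
    print(f"{n}x{n} matmul: {seconds:.3f} s, {2 * n**3 / seconds / 1e9:.1f} GFLOP/s")
```

Run it once with the stock NumPy and once with the NumPy linked against your own BLAS build, on the same machine, and compare the two timings.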
I should add that, to deliver this kind of performance, an optimized BLAS implementation will stress your CPU more than typical workloads do. With Intel's boxed CPU cooler you will get dangerously high temperatures, so you should get a better aftermarket cooler.
We now have a computer cluster that can be reached remotely over SSH. IPython offers additional parallel computing functionality that makes it possible to avoid the shell environment altogether: we can address the remote machines directly from the client's IPython session. I will describe how this can be done in a separate post.