Code that doesn't do what it should, code that leaves a bad taste in your mouth after reading it, even code that works but is light-years away from being good code. And my own code.
A high-performance computing (HPC) system is built from many ordinary computers connected together with a network, usually controlled by special software in charge of assigning resources to users. This is a special case of a distributed-memory computing environment. The Cloud Haskell examples available in both references use a controlled local cluster of computers and cloud platforms (of the Infrastructure-as-a-Service type) to execute the programs.
In this post I show how to execute Cloud Haskell applications on an HPC supercomputer that uses a job system to manage its resources (the compute nodes).
How I do it
I work as a supercomputer system administrator of a node called Altamira [1] at the Instituto de Física de Cantabria (IFCA). This supercomputer is composed of almost 250 servers, each with 16 real cores, 64 GB of RAM and an InfiniBand connection. The whole system is managed with the Slurm resource manager, tuned with a few scripts for the RES (Red Española de Supercomputación).
I don't want to replicate all the tutorials again; I only want to take one of them and test whether I can execute it on a supercomputer like Altamira. So I decided to use just the WorkStealing example and implement it again in a local project.
The Slurm system requires that every program executed in the cluster describes the resources it wants. With this description (usually in a file called the job description file), the manager enqueues the job until it finds free space to execute it. This file is also a script that will be interpreted by a shell (the Bash shell in my case), with a series of directives that inform the batch system about the characteristics of the job. These directives appear as comments in the job script. For example, a simple job that requires 32 CPUs should have text like this:
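A minimal sketch of such a job description file; the job name, time limit and output pattern are illustrative assumptions:

#!/bin/bash
#SBATCH --job-name=workstealing
#SBATCH --ntasks=32
#SBATCH --time=00:10:00
#SBATCH --output=workstealing-%j.out

# the script body runs once the resources are granted
echo "Job running on $SLURM_JOB_NODELIST"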
Once we get the reserved CPUs, we should split them: one for the server and (n-1) for the slaves. We can do it easily using the variables that Slurm defines for our job. Then we use the Slurm program srun to spawn the (n-1) slaves in the background on the right servers, and after that we spawn the master program on the last CPU. We don't need to know the names of the assigned nodes, nor how many programs should be spawned per node; srun knows that and places the slaves:
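A sketch of this part of the job script; run-server.sh is my assumed name for the master-side counterpart of the run-client.sh script described below:

# one CPU for the master, the rest for the slaves
NSLAVES=$(( SLURM_NTASKS - 1 ))

# srun places the slave instances on the allocated nodes for us
srun --ntasks=$NSLAVES ./run-client.sh &

# the master runs on the remaining CPU
srun --ntasks=1 ./run-server.sh
wait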
The srun program calls two scripts that start the Cloud Haskell program instances. In these scripts, using the Slurm variables that srun sets, we can pass the right parameters to the application. The client/slave script is (file run-client.sh):
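A minimal sketch of what run-client.sh could contain; the binary name workstealing and the base port 8000 are assumptions:

#!/bin/bash
# srun sets SLURM_PROCID to the task id, unique within the job,
# so every slave instance gets its own port
HOST=$(hostname)
PORT=$(( 8000 + SLURM_PROCID ))
exec ./workstealing slave "$HOST" "$PORT"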
For a reason unknown to me, the application starts with awful block buffering. I changed it to line buffering in order to control the output during the execution:
import System.Environment (getArgs)
import System.IO (BufferMode (LineBuffering), hSetBuffering, stdout)

main = do
  hSetBuffering stdout LineBuffering
  args <- getArgs
With a job system, I found a problem when several jobs execute simultaneously: one master could steal the slaves of other jobs. To solve this, I pass the list of assigned nodes to the master task. In the job script, we only need to add:
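A sketch of the addition, assuming the node list is expanded with scontrol and handed to the master script as an extra argument:

# expand the compressed node list (e.g. node[01-04]) into plain names
ASSIGNED=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | paste -sd, -)

# the master receives the assigned nodes as an additional parameter
srun --ntasks=1 ./run-server.sh "$ASSIGNED"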
And in the Haskell program, we filter the slaves found, using the list of nodes assigned to that job:
["master", host, port, assigned, n] -> do
backend <- initializeBackend host port rtable
startMaster backend $ \slaves -> do
let used = filterNodes slaves assigned
result <- master (read n) used
liftIO $ print result
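The filterNodes function is not part of the snippet above; a minimal sketch of what it could look like, assuming assigned carries the node names in a single string and reusing the nidNode helper shown below:

import Data.List (isInfixOf)

-- keep only the slaves whose host name appears in the assigned node list
-- (a coarse substring check, in line with the rest of the code)
filterNodes :: [NodeId] -> String -> [NodeId]
filterNodes slaves assigned = filter ((`isInfixOf` assigned) . nidNode) slaves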
Unresolved problems
After it worked for the first runs, I started to ask for more slaves (ha ha ha). But the jobs began to output this error:
Maybe it is a problem of ports already bound, or a timeout during creation; I don't know, but better error feedback would be great. I have also missed a way to start processes without specifying the port, letting the package search for the first free port in a desired range.
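A workaround sketch for that missing feature, assuming that initializeBackend throws an exception when the requested port cannot be bound; this helper is not part of distributed-process-simplelocalnet:

import Control.Distributed.Process (RemoteTable)
import Control.Distributed.Process.Backend.SimpleLocalnet (Backend, initializeBackend)
import Control.Exception (SomeException, try)

-- try each port of the desired range until one of them can be used
initializeBackendAnyPort :: String -> [Int] -> RemoteTable -> IO Backend
initializeBackendAnyPort host range rtable = go range
  where
    go []     = error "initializeBackendAnyPort: no free port in range"
    go (p:ps) = do
      res <- try (initializeBackend host (show p) rtable)
               :: IO (Either SomeException Backend)
      either (const (go ps)) return res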
Also, several times I needed to obtain the server name from the NodeId type. I implemented a coarse function to get it; I think this functionality should be put in the distributed-process package:
import Text.Regex.Posix ((=~))

-- extract the host name from a NodeId, which shows as nid://host:port:n
nidNode :: NodeId -> String
nidNode nid = case regres of
    [_:nn:_] -> nn
    _        -> "*"
  where
    regres = show nid =~ "nid://(.+):(.+):(.+)" :: [[String]]
Last words
This is an example of an executed job using 64 cores: 63 slaves spawned across four servers, plus 1 master:
I found Cloud Haskell good enough (though I have doubts about the name: currently it is more "distributed" than "clouded"). And it adds a practical alternative to the kinds of distribution frameworks usually seen in HPC clusters. The performance should be tested with real problems but, again, for users accustomed to the benefits of Haskell it is a good start.
I had not written on the blog for a long time, but one of the projects I have been working on lately deserved all my attention. And this is the result:
A game by Mutant-Games, which already had an iPhone version; I was only in charge of turning it into an Android game, reusing all the existing art and with the help of the andEngine library:
In future posts I would like to describe a bit how the development process went; for now I will only say that I used darcs as the version control system, trac to manage the tasks and Dropbox to share files with the Mutant-Games people.