October 25, 2012

Using Cloud Haskell in an HPC Cluster

Cloud Haskell is a domain-specific language for developing programs for a distributed-memory computing environment, as explained in the Towards Haskell in the Cloud paper and demonstrated with working examples in the Communication Patterns in Cloud Haskell tutorials.

A high performance computer (HPC) is built from many ordinary computers connected by a network and usually controlled by special software in charge of assigning resources to users. This is a special case of a distributed-memory computing environment. The Cloud Haskell examples available in both references use a controlled local cluster of computers or cloud platforms (of the Infrastructure as a Service kind) to execute the programs.

In this blog post I show how to execute Cloud Haskell applications on an HPC supercomputer that uses job system software to manage its resources (compute nodes).

How I do it

I work as a supercomputer system administrator of a node called Altamira [1] at the Instituto de Física de Cantabria (IFCA). This supercomputer is composed of almost 250 servers, each with 16 real cores, 64 GB of RAM and an Infiniband connection. The whole system is managed with the Slurm resource manager, tuned with a few scripts for the RES (Red Española de Supercomputación).

I don't want to replicate all the tutorials here; I only want to run one of them to test whether I can execute it on a supercomputer like Altamira. So I decided to use only the WorkStealing example and implemented it again in a local project.

The Slurm system requires that every program executed in the cluster describe the resources it wants. With this description (usually in a file, called the job description file), the manager queues the jobs until it finds free space to execute them. This file is also a script that will be interpreted by a shell (Bash in my case), with a series of directives that inform the batch system about the characteristics of the job. These directives appear as comments in the job script. For example, a simple job that requires 32 CPUs should contain this text:

#@ job_name = cloud_haskell
#@ initialdir = .
#@ output = cloud_haskell_%j.out
#@ error = cloud_haskell_%j.err
#@ total_tasks = 32
#@ wall_clock_limit = 1:00:00 

echo "start running"

Once we get the reserved CPUs, we should split them: one for the server and (n-1) for the slaves. We can do this easily using the variables that Slurm defines for our job. Then we use the Slurm program srun to spawn the (n-1) slaves in the background on the right servers, and finally we spawn the master program on the last one. We don't know the names of the assigned nodes, nor how many programs per node should be spawned, but srun knows and places the slaves correctly:

srun --exclusive -c1 -n$((SLURM_NTASKS-1)) run-client.sh &
sleep 30
srun --exclusive -c1 -n1 run-server.sh
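Putting the directives and the srun calls together, the whole job description file might look like this (a sketch assembled from the snippets above; the sleep value is only a rough grace period for the slaves to come up):

#!/bin/bash
#@ job_name = cloud_haskell
#@ initialdir = .
#@ output = cloud_haskell_%j.out
#@ error = cloud_haskell_%j.err
#@ total_tasks = 32
#@ wall_clock_limit = 1:00:00

echo "start running"

# spawn the (n-1) slaves in the background, one CPU each
srun --exclusive -c1 -n$((SLURM_NTASKS-1)) run-client.sh &

# give the slaves some time to start before the master looks for them
sleep 30

# the master takes the one remaining task
srun --exclusive -c1 -n1 run-server.sh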

The srun program calls two scripts that start the Cloud Haskell program instances. In these scripts, using the Slurm variables that srun sets, we can pass the right parameters to the application. The client/slave script is (file run-client.sh):

~/.cabal/bin/demo-worksteal slave $HOST $((SLURM_LOCALID + 7766))

And for the master (file run-server.sh):

~/.cabal/bin/demo-worksteal master $HOST $((SLURM_LOCALID + 7866)) 1000

For a reason unknown to me, the application starts with awful block buffering. I changed it to line buffering in order to monitor the output during execution.

import System.Environment (getArgs)
import System.IO (BufferMode (LineBuffering), hSetBuffering, stdout)

main = do
  hSetBuffering stdout LineBuffering
  args <- getArgs

With the job system, I found a problem when several jobs execute simultaneously: one master could steal the slaves of other jobs. To solve this, I pass the list of assigned nodes to the master task. In the job script, we only need to add:

srun --exclusive -c1 -n1 run-server.sh $SLURM_NODELIST

And in the Haskell program, we filter the discovered slaves using the list of nodes assigned to that job:
["master", host, port, assigned, n] -> do
  backend <- initializeBackend host port rtable
  startMaster backend $ \slaves -> do
    let used = filterNodes slaves assigned
    result <- master (read n) used
    liftIO $ print result
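The filterNodes helper is not shown in the post; a minimal sketch of the idea, operating on plain hostnames instead of NodeIds for simplicity, and assuming the assigned list arrives as comma-separated names (the expanded form of SLURM_NODELIST), could look like this (filterHosts and splitOn are my own illustrative names):

```haskell
-- Hypothetical sketch of the filterNodes idea: keep only the slaves
-- whose hostname appears in the comma-separated assigned-node list.
filterHosts :: [String] -> String -> [String]
filterHosts slaves assigned = filter (`elem` splitOn ',' assigned) slaves
  where
    -- split a string on a separator character
    splitOn :: Char -> String -> [String]
    splitOn c s = case break (== c) s of
      (a, [])       -> [a]
      (a, _ : rest) -> a : splitOn c rest

main :: IO ()
main = print (filterHosts ["node57", "node99"] "node57,node58,node59")
```

In the real program the same membership test would run over each slave's NodeId, extracting the hostname first.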

Unresolved problems

After it worked for the first runs, I started to ask for more slaves (ha ha ha). But the jobs began to output this error:

 addMembership: failed (Unknown error 18446744073709551615)

Maybe it is a problem of ports already bound, or a timeout during creation; I don't know, but better error feedback would be great. I also missed a way to start processes without specifying a port, letting the package search for the first free port in a desired range. Also, several times I needed to obtain the server name from the NodeId type. I implemented a coarse function to get it. I think this functionality should go into the distributed-process package.
import Text.Regex.Posix ((=~))

nidNode :: NodeId -> String
nidNode nid = case regres of
  [_:nn:_] -> nn
  _        -> "*"
  where
    regres = show nid =~ "nid://(.+):(.+):(.+)" :: [[String]]
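As an aside, the same extraction can be done without the regex dependency. A small self-contained sketch (the name hostOfNid is mine, and it assumes the "nid://host:port:serial" rendering shown above):

```haskell
import Data.List (stripPrefix)

-- Hypothetical regex-free variant: extract the host part of a rendered
-- NodeId, assuming it has the "nid://host:port:serial" form.
hostOfNid :: String -> String
hostOfNid s = case stripPrefix "nid://" s of
  Just rest -> takeWhile (/= ':') rest
  Nothing   -> "*"

main :: IO ()
main = putStrLn (hostOfNid "nid://node60:7866:0")
```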

Last words

This is an example of an executed job using 64 cores: 63 slaves spawned on four servers, and 1 master:

START Master on: node60 7866
 + [node60:7866] 63 slave/s
 + [node60:7866] -> node57 = 16 slaves
 + [node60:7866] -> node58 = 16 slaves
 + [node60:7866] -> node59 = 16 slaves
 + [node60:7866] -> node60 = 15 slaves
END Master

I found Cloud Haskell good enough (though I have doubts about the name; currently it is more distributed than clouded). And it adds a practical alternative to the kind of distribution frameworks usually seen in HPC clusters. The performance should be tested with real problems but, again, for users accustomed to the benefits of Haskell it is a good start.

2 comments:

Edsko de Vries said...

First of all, nice blog post!

The addMembership error you are getting comes from the network-multicast package, and is related to UDP broadcast. UDP broadcast is used by the simplelocalnet package to discover slave nodes. In a setup such as yours this is probably not what you want.

If you are serious about taking this further, I would suggest developing a backend specifically for this kind of application. Backends are relatively simple to develop; have a look at the SimpleLocalnet backend for a starting point. You could then include functionality such as "search for a port" as part of the backend. The functionality of obtaining the server address from a NodeId belongs there too. After all, how servers are addressed depends on the specific backend (if you are using TCP, server addresses might look different than with another protocol).

Luis Cabellos said...

Thanks, Edsko. So it sounds easy to fix the problems I found in the tests. I don't know if I'll try it, but I think it's important to show examples of Haskell, because Haskell is a great tool and people should know about it. :)