
Progress with GPU Cluster

Posted: Thu Apr 01, 2010 3:11 pm
by HammadM
Information on the problems I encountered and any solutions I found will go here.

Basic Outline [details to be added later]
(Restart Computer As Required):
Install Windows Server 2008 R2 on the Head Node
Activate Windows, apply updates as needed

Install Active Directory
Install HPC Cluster Manager
Install Remote Desktop

Add Nodes

Configuring Cluster:
Once all compute nodes have been added to the cluster, the next step is to install software.
Begin by going into the cluster manager and creating a new group.
Call this group "Nodes" and add only the compute nodes (not the head node) to it.
This group is used to specify that a job should run only on the compute nodes.

Software is installed using the "clusrun" command from an HPC PowerShell prompt:

clusrun /nodegroup:Nodes /workdir:\\SBELHPCHEAD\TestFolder Example.exe /s

clusrun               -  runs a job on the cluster
/nodegroup:Nodes         -  the job will be run on the group "Nodes" that was just created
/workdir:\\SBELHPCHEAD\TestFolder   -  specifies the working directory the code will run from; note that SBELHPCHEAD is the name of the head node and the folder "TestFolder" is a shared directory
Example.exe            -  the program that the job will execute
/s               -  flag passed to the executable (Example.exe), not to clusrun
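To make the structure of the command easier to see, here is a small Python sketch that assembles the same command line from its parts. The helper name build_clusrun is made up for illustration and is not part of HPC Pack; it only uses the /nodegroup, /nodes and /workdir options that appear in this post (/nodes is used in the CUDA test further down).

```python
def build_clusrun(command, workdir, nodegroup=None, nodes=None):
    """Assemble a clusrun command line as a list of arguments.

    Hypothetical helper for illustration only.
    command   - the executable and its own flags, e.g. ["Example.exe", "/s"]
    workdir   - shared folder the command runs from
    nodegroup - name of a node group, or
    nodes     - name of a single node (give exactly one of the two)
    """
    if (nodegroup is None) == (nodes is None):
        raise ValueError("specify exactly one of nodegroup or nodes")
    if nodegroup is not None:
        target = f"/nodegroup:{nodegroup}"
    else:
        target = f"/nodes:{nodes}"
    # clusrun options come first; everything after them is handed to the nodes
    return ["clusrun", target, f"/workdir:{workdir}", *command]


args = build_clusrun(["Example.exe", "/s"],
                     r"\\SBELHPCHEAD\TestFolder",
                     nodegroup="Nodes")
print(" ".join(args))
# prints: clusrun /nodegroup:Nodes /workdir:\\SBELHPCHEAD\TestFolder Example.exe /s
```

Keeping the program and its flags in a separate list makes it clear which switches belong to clusrun and which belong to the executable.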

Installing CUDA:
Install the CUDA Toolkit and SDK on the head node.
Install the CUDA Tesla Server Driver on the compute nodes:
clusrun /nodegroup:Nodes /workdir:\\SBELHPCHEAD\Downloads 197.03_Tesla_winserv2008R2_64bit_international_whql.exe /s

where: 197.03_Tesla_winserv2008R2_64bit_international_whql.exe is the driver to be installed
Note that in this case the working directory is called "Downloads" and is a shared folder

PS C:\Program Files\Microsoft HPC Pack 2008 R2\Bin> clusrun /nodegroup:Nodes /workdir:\\SBELHPCHEAD\Downloads 197.03_Tesla_winserv2008R2_64bit_international_whql.exe /s

-------------------------- COMPUTENODE003 returns 0 --------------------------
-------------------------- COMPUTENODE005 returns 0 --------------------------
-------------------------- COMPUTENODE006 returns 0 --------------------------
-------------------------- COMPUTENODE004 returns 0 --------------------------
-------------------------- COMPUTENODE001 returns 0 --------------------------
-------------------------- COMPUTENODE002 returns 0 --------------------------

-------------------------- Summary --------------------------
6 Nodes succeeded
0 Nodes failed
PS C:\Program Files\Microsoft HPC Pack 2008 R2\Bin>
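With more nodes it gets tedious to read the "returns" lines by eye. The Python sketch below picks out the nodes that returned a non-zero code from saved clusrun output. It assumes the output has the same line format as shown above; the function name failed_nodes and the sample text (including the failing node) are made up for illustration.

```python
import re

# Matches per-node result lines such as
# "-------------------------- COMPUTENODE003 returns 0 --------------------------"
NODE_RESULT = re.compile(r"^-+\s*(\S+)\s+returns\s+(-?\d+)\s*-+\s*$")


def failed_nodes(clusrun_output):
    """Return {node_name: exit_code} for every node with a non-zero exit code."""
    failures = {}
    for line in clusrun_output.splitlines():
        match = NODE_RESULT.match(line.strip())
        if match:
            node, code = match.group(1), int(match.group(2))
            if code != 0:
                failures[node] = code
    return failures


# Hypothetical sample: COMPUTENODE005 is made to fail here for demonstration
sample = """\
-------------------------- COMPUTENODE003 returns 0 --------------------------
-------------------------- COMPUTENODE005 returns 1 --------------------------
-------------------------- Summary --------------------------
"""
print(failed_nodes(sample))
# prints: {'COMPUTENODE005': 1}
```

An empty result means every node that reported back returned 0, which matches the "6 Nodes succeeded" summary in the run above.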

Testing if CUDA works:
PS C:\Program Files\Microsoft HPC Pack 2008 R2\Bin> clusrun /nodes:Computenode001 /workdir:\\SBELHPCHEAD\Downloads\TestingCUDA deviceQuery

This job runs on only one of the compute nodes (Computenode001).

Note that the clusrun command can only be run by an admin.

When compiling programs for the cluster, make sure that the sm_13 GPU architecture is selected in the CUDA build rule. If it is not, unexpected things may happen (for example, programs crashing).

Re: Progress with GPU Cluster

Posted: Thu Apr 01, 2010 3:38 pm
by Dan Negrut
Hammad - please post the two emails that you sent out to Frank, and also the reply you got from him.
thank you for starting this up,