
Euler Documentation

Andrew Seidl (Administrator)
Posted: Mon Sep 09, 2013 10:23 am

Euler Documentation

As mentioned in class and in the account creation emails, the documentation for using Euler is available at: http://wacc.wisc.edu/docs/

If you come across any errors or have suggestions for new pages, please submit an issue at: https://github.com/wiscacc/docs.wiscacc.org/issues.

Andrew Seidl (Administrator)
Posted: Fri Sep 27, 2013 9:55 am

Re: Euler Documentation

To summarize where and when to run CUDA jobs on Euler:

If you need results from your code *now* and it only takes a few minutes to run (like when you're debugging), just ssh to euler01 / euler99.

If you're able to wait a few days to get your results ("production runs" of your code), submit to the rest of the cluster via qsub.

If your code takes more than a few days to run, make sure it checkpoints itself periodically (and has no unexpected bottlenecks), then submit it to the cluster via qsub (or to some of the big iron if you have access).
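
To illustrate what "checkpointing" means for a long-running job, here is a minimal sketch in Python. It is not Euler-specific: the file name `checkpoint.json`, the state layout, and the `run` loop are all hypothetical stand-ins, and the same pattern (write to a temporary file, then rename it into place) should carry over to the host side of a CUDA program.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name, pick your own


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write state to a temporary file, then atomically rename it into place.

    If the job is killed mid-write, the previous checkpoint stays intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def run(num_steps, checkpoint_every=100, path=CHECKPOINT_PATH):
    """Resume from the last checkpoint and continue up to num_steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], num_steps):
        state["total"] += step * 0.5  # stand-in for the real computation
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # always save the final state
    return state
```

If the job is killed and resubmitted, calling `run` again picks up at the last saved step instead of starting over. How often to checkpoint is a trade-off between the work you can afford to lose and the time spent writing to disk.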

Lastly, you can see the current cluster status at: http://euler.wacc.wisc.edu/cgi-bin/pbswebmon-nojobs.py. A username shown in green indicates that the job is using GPUs; each node has four GPUs in total (though a single job might request more than one GPU).

Andrew Seidl (Administrator)
Posted: Thu Oct 10, 2013 12:27 pm

Re: Euler Documentation

As a reminder, for the time being please either run your code directly on euler01 or submit a job via qsub. Do not ssh into other nodes to run code unless you have been explicitly given permission to do so. Other researchers are most likely running code on those nodes, so running your own code there might interfere with their simulations (or cause strange issues with your own).

If the cluster doesn't appear to be running the jobs you submitted via qsub, the most likely explanation is that all the GPUs are currently in use. Your jobs will run automatically once a GPU becomes available, which can currently take a day or more. For now, test your code on euler01 (or your own machine, if possible) and use that to gather preliminary timing results if required. Once everything is working, you can submit your code to the rest of the cluster via qsub to get more accurate timing results.

I am currently rewriting a portion of Torque (the job manager, which includes qsub) so that shorter GPU jobs will have a dedicated pool of GPUs available, thereby reducing the wait time from 1-2 days to at most 30-60 minutes. With luck that will be ready in the next week or so. At that point all your jobs (including debugging and test runs) will need to go through qsub.

Return to ME759 Fall 2013: High Performance Computing
