
HW11 multi-node problems

<<

f13-759-rmays

Newbie

Posts: 49

Joined: Mon Sep 09, 2013 9:12 am

Unread post Mon Dec 02, 2013 12:59 pm

HW11 multi-node problems

Hi all,

I'm working on the integral problem for HW11 (Problem 3). It works with one node and 4 cores, one node and 8 cores, and 2 nodes with 4 cores each, but 4 nodes with 2 cores each doesn't work. It sits there running for a while and then either segfaults or gets killed by PBS (I'm not sure whether it's actually segfaulting or whether the segfault is a result of PBS killing it).

Here's the error I'm getting:

  Code:


[euler09][[32777,1],2][btl_tcp_endpoint.c:656:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.8 failed: No route to host (113)
=>> PBS: job killed: walltime 605 exceeded limit 600
[euler08:24309] *** Process received signal ***
[euler08:24309] Signal: Segmentation fault (11)
[euler08:24309] Signal code: Address not mapped (1)
[euler08:24309] Failing at address: 0x2b5312fb1b78
[euler08:24309] [ 0] /lib64/libpthread.so.0() [0x3255c0f500]
[euler08:24309] [ 1] /lib64/ld-linux-x86-64.so.2() [0x3254c0eb83]
[euler08:24309] [ 2] /lib64/libc.so.6(exit+0xe2) [0x3255035db2]
[euler08:24309] [ 3] /usr/local/mpi/gcc-4.4/openmpi-1.7.3/lib/openmpi/mca_ess_hnp.so(+0x2e09) [0x2b531035fe09]
[euler08:24309] [ 4] /lib64/libc.so.6() [0x3255032920]
[euler08:24309] [ 5] /lib64/ld-linux-x86-64.so.2() [0x3254c16f17]
[euler08:24309] [ 6] /lib64/ld-linux-x86-64.so.2() [0x3254c1625d]
[euler08:24309] [ 7] /lib64/ld-linux-x86-64.so.2() [0x3254c13de2]
[euler08:24309] [ 8] /lib64/ld-linux-x86-64.so.2() [0x3254c1462e]
[euler08:24309] [ 9] /lib64/ld-linux-x86-64.so.2() [0x3254c0e196]
[euler08:24309] [10] /lib64/libdl.so.2() [0x325580129c]
[euler08:24309] [11] /lib64/libdl.so.2(dlclose+0x1f) [0x325580100f]
[euler08:24309] [12] /usr/local/mpi/gcc-4.4/openmpi-1.7.3/lib/libopen-pal.so.6(+0x3722c) [0x2b530fc5f22c]
[euler08:24309] [13] /usr/local/mpi/gcc-4.4/openmpi-1.7.3/lib/libopen-pal.so.6(+0x34c11) [0x2b530fc5cc11]
[euler08:24309] [14] /usr/local/mpi/gcc-4.4/openmpi-1.7.3/lib/libopen-pal.so.6(+0x421d1) [0x2b530fc6a1d1]
[euler08:24309] [15] /usr/local/mpi/gcc-4.4/openmpi-1.7.3/lib/libopen-pal.so.6(mca_base_component_repository_release+0xa1) [0x2b530fc6a7c1]
[euler08:24309] [16] /usr/local/mpi/gcc-4.4/openmpi-1.7.3/lib/libopen-pal.so.6(mca_base_components_close+0x41) [0x2b530fc6ac51]
[euler08:24309] [17] /usr/local/mpi/gcc-4.4/openmpi-1.7.3/lib/libopen-pal.so.6(mca_base_framework_close+0x5e) [0x2b530fc7398e]
[euler08:24309] [18] /usr/local/mpi/gcc-4.4/openmpi-1.7.3/lib/openmpi/mca_ess_hnp.so(+0x2b83) [0x2b531035fb83]
[euler08:24309] [19] /usr/local/mpi/gcc-4.4/openmpi-1.7.3/lib/libopen-rte.so.6(orte_finalize+0x49) [0x2b530f9d0339]
[euler08:24309] [20] mpiexec(orterun+0xe52) [0x404592]
[euler08:24309] [21] mpiexec(main+0x20) [0x403594]
[euler08:24309] [22] /lib64/libc.so.6(__libc_start_main+0xfd) [0x325501ecdd]
[euler08:24309] [23] mpiexec() [0x4034b9]
[euler08:24309] *** End of error message ***


Any idea what the issue is? The 2 nodes 4 cores job takes a minute to run, so this one taking 10 minutes is strange...

-Owen
<<

f13-759-wickre

Newbie

Posts: 17

Joined: Mon Sep 09, 2013 9:13 am

Unread post Mon Dec 02, 2013 10:31 pm

Re: HW11 multi-node problems

Hi Owen,

I was working on the same problem, and it seemed to work for me, after a fashion. The immediate option on qsub (as shown in Lec26, slide 14) would SOMETIMES error out before execution even started. Even when I requested only one node with one processor per node, it occasionally claimed there were insufficient resources available!

Using qsub with the file option always worked, though. Here's my .sh file for 4 nodes with 2 cores each:

#!/bin/bash

#PBS -l nodes=4:ppn=2,walltime=5:00
#PBS -d /home/wickre/Homework/HW11/

module load mpi/gcc/openmpi
mpiexec -np 8 ./IntegralMPI


Hopefully that provides something useful.
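
A note on the "-np 8" above: it matches 4 nodes x 2 cores per node. Below is a minimal sketch of deriving that count from the allocation instead of hard-coding it. It assumes Torque/PBS sets PBS_NODEFILE to a file with one line per allocated core; the helper name count_ranks is hypothetical, and the mpiexec line only runs inside a PBS job.

```shell
#!/bin/bash
# Sketch: derive the MPI rank count from the PBS allocation.
# Assumption: PBS_NODEFILE lists one hostname per allocated core.

count_ranks() {
    # Count the non-empty lines in the node file passed as $1.
    grep -c . "$1"
}

# Only launch when running inside a PBS job that provides a node file.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    # Load the MPI module if the 'module' command is available.
    if type module >/dev/null 2>&1; then
        module load mpi/gcc/openmpi
    fi
    NP=$(count_ranks "$PBS_NODEFILE")
    mpiexec -np "$NP" ./IntegralMPI
fi
```

With nodes=4:ppn=2 this should give NP=8, so the script keeps working if the resource request changes.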
<<

Dan Negrut

Global Moderator

Posts: 833

Joined: Wed Sep 03, 2008 12:24 pm

Unread post Mon Dec 02, 2013 11:24 pm

Re: HW11 multi-node problems

Paul - thank you for providing this solution, this is great...
Dan

<<

Andrew Seidl

Administrator

Posts: 193

Joined: Thu Oct 28, 2010 11:54 am

Unread post Tue Dec 03, 2013 12:03 am

Re: HW11 multi-node problems

f13-759-rmays wrote:
  Code:
[euler09][[32777,1],2][btl_tcp_endpoint.c:656:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.8 failed: No route to host (113)
=>> PBS: job killed: walltime 605 exceeded limit 600


Infiniband issues. Try now.
<<

f13-759-rmays

Newbie

Posts: 49

Joined: Mon Sep 09, 2013 9:12 am

Unread post Tue Dec 03, 2013 1:47 am

Re: HW11 multi-node problems

Thanks for the replies. I had been using qsub with the file option, but without the module load for openmpi. Is it necessary?

I'm trying again; hopefully the InfiniBand fix will do the trick.

-Owen
<<

Andrew Seidl

Administrator

Posts: 193

Joined: Thu Oct 28, 2010 11:54 am

Unread post Tue Dec 03, 2013 9:32 am

Re: HW11 multi-node problems

f13-759-rmays wrote:Thanks for the replies. I had been using qsub with the file option, but without the module load for openmpi. Is it necessary?

My best answer is 'probably'. If you leave it off, you'll end up with whatever the system default MPI is, which is ancient (1.4-something) and has changed in the past. Running 'module load mpi/gcc/openmpi/1.7.3' ensures you always get the same environment.

You could also 'module initadd' that module so it's loaded by default, though it's usually best to still load it explicitly in your job script in case you decide to switch your 'global' MPI version later (I frequently swap between multiple versions of OpenMPI and MVAPICH2).
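
As a rough illustration of pinning the version in a job script, here is a minimal sketch. The module name is the one mentioned above; the log_mpi_env helper is hypothetical and only records which mpiexec the job resolves, which can help when comparing runs.

```shell
#!/bin/bash
# Sketch: pin the MPI module version and log which mpiexec is in use.
# log_mpi_env is a hypothetical helper, not part of any cluster tooling.

log_mpi_env() {
    # Print the full path of mpiexec, or a note if it is not on PATH.
    if command -v mpiexec >/dev/null 2>&1; then
        echo "mpiexec: $(command -v mpiexec)"
    else
        echo "mpiexec: not found on PATH"
    fi
}

# Only attempt the load where the 'module' command exists (e.g. on the cluster).
if type module >/dev/null 2>&1; then
    module load mpi/gcc/openmpi/1.7.3
fi

log_mpi_env
```

The logged path ends up in the job's output file, so a change in the default MPI shows up there.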
<<

f13-759-rmays

Newbie

Posts: 49

Joined: Mon Sep 09, 2013 9:12 am

Unread post Tue Dec 03, 2013 11:56 am

Re: HW11 multi-node problems

Hi Andrew,

I tried it again and it timed out after half an hour. This is the PBS script I ran:

  Code:
#!/bin/sh
#PBS -N owen-integrate
#PBS -l nodes=4:ppn=2,walltime=00:30:00
#PBS -d /home/rmays/759GPUs/11Homework/

module load mpi/gcc/openmpi
mpiexec /home/rmays/759GPUs/11Homework/mpiIntegrate.out


I just re-submitted it with the full mpi/gcc/openmpi/1.7.3 specification and upped the walltime to 1 hour. Any other ideas?

-Owen
<<

Andrew Seidl

Administrator

Posts: 193

Joined: Thu Oct 28, 2010 11:54 am

Unread post Tue Dec 03, 2013 12:47 pm

Re: HW11 multi-node problems

Looks like I forgot to mention this: make sure you're running on the AMD nodes. The InfiniBand switch on the GPU nodes is the one causing problems right now.

  Code:
#!/bin/sh
#PBS -N owen-integrate
#PBS -l nodes=4:ppn=2:amd,walltime=00:30:00
#PBS -d /home/rmays/759GPUs/11Homework/

module load mpi/gcc/openmpi
mpiexec /home/rmays/759GPUs/11Homework/mpiIntegrate.out
<<

f13-759-rmays

Newbie

Posts: 49

Joined: Mon Sep 09, 2013 9:12 am

Unread post Tue Dec 03, 2013 5:34 pm

Re: HW11 multi-node problems

Thanks Andrew, restricting the job to the AMD nodes worked!

-Owen
<<

f13-759-sdeng9

Newbie

Posts: 8

Joined: Mon Sep 09, 2013 9:13 am

Unread post Thu Dec 05, 2013 1:27 am

Re: HW11 multi-node problems

Hi,

qsub works fine for me. But when I run "mpicxx integrate_mpi.cpp", I get "-bash: mpicxx: command not found". Any suggestions would be appreciated. Thanks.

Deng
<<

f13-759-nsubramania2

Newbie

Posts: 45

Joined: Mon Sep 09, 2013 9:12 am

Unread post Thu Dec 05, 2013 1:31 am

Re: HW11 multi-node problems

Deng,

Did you run 'module load mpi/gcc/openmpi' before mpicxx-ing?
- Naveen
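
For reference, a minimal sketch of that check in the login shell: load the module, then confirm mpicxx is on PATH before compiling. The have_cmd helper and the output file name in the comment are hypothetical.

```shell
#!/bin/bash
# Sketch: make sure the MPI compiler wrapper is available before compiling.
# have_cmd is a hypothetical helper.

have_cmd() {
    # Succeeds if the command named in $1 can be found on PATH.
    command -v "$1" >/dev/null 2>&1
}

# Only attempt the load where the 'module' command exists (e.g. on the cluster).
if type module >/dev/null 2>&1; then
    module load mpi/gcc/openmpi
fi

if have_cmd mpicxx; then
    echo "mpicxx found at $(command -v mpicxx)"
    # Compile here, e.g.: mpicxx integrate_mpi.cpp -o integrate_mpi
else
    echo "mpicxx not found; try 'module load mpi/gcc/openmpi' first"
fi
```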
<<

f13-759-sdeng9

Newbie

Posts: 8

Joined: Mon Sep 09, 2013 9:13 am

Unread post Thu Dec 05, 2013 1:05 pm

Re: HW11 multi-node problems

Hi Naveen,

Thanks for your reply. It works ;)

Deng
<<

xiudongwu

Newbie

Posts: 15

Joined: Fri Sep 04, 2015 12:51 pm

Unread post Sat Dec 12, 2015 11:15 am

Re: HW11 multi-node problems

Thanks, it also helped me a lot.

Return to ME759 Fall 2013: High Performance Computing
