Hi all,

I ran into an issue with OMPI 1.3's routability calculations. A Rocks cluster I use
has a head node ("hpcgeek") and seven compute nodes ("compute-0-0" through
"compute-0-6"). The compute nodes' interfaces are configured as follows:

bond0     inet addr:192.168.180.253  Bcast:192.168.180.255  Mask:255.255.255.0
          inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
          (bonds together eth0+1)
eth0      inet6 addr: fe80::215:17ff:fe35:337a/64 Scope:Link
eth1      (no addresses)
ib0       inet addr:192.168.182.253  Bcast:192.168.182.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c902:22:bb71/64 Scope:Link
lo        inet addr:127.0.0.1  Mask:255.0.0.0

The head node is configured as follows:

eth0      inet addr:192.168.180.1  Bcast:192.168.180.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe35:2c9e/64 Scope:Link
eth1      inet addr:128.148.64.180  Bcast:128.148.64.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe35:2c9f/64 Scope:Link
ib0       inet addr:192.168.182.128  Bcast:192.168.182.255  Mask:255.255.255.0
          inet6 addr: fe80::205:ad00:22:babd/64 Scope:Link
lo        inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host

*** (Sam, can you adjust these to the state where the hang occurs?)
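
As an aside, the routability question for any interface pair above boils down to a
subnet test. Here is a minimal sketch of that test (plain POSIX calls, not OMPI's
internal code), using the addresses listed above:

#include <stdio.h>
#include <arpa/inet.h>

/* Two IPv4 endpoints are directly routable only if they agree on
 * every bit covered by the netmask, i.e. they lie in the same subnet. */
static int same_subnet(const char *a, const char *b, const char *mask)
{
    struct in_addr ia, ib, im;
    inet_pton(AF_INET, a, &ia);
    inet_pton(AF_INET, b, &ib);
    inet_pton(AF_INET, mask, &im);
    return (ia.s_addr & im.s_addr) == (ib.s_addr & im.s_addr);
}

int main(void)
{
    /* compute bond0 <-> head eth0: same /24, routable */
    printf("%d\n", same_subnet("192.168.180.253", "192.168.180.1",
                               "255.255.255.0"));   /* prints 1 */
    /* compute bond0 <-> head eth1: different subnets, not routable */
    printf("%d\n", same_subnet("192.168.180.253", "128.148.64.180",
                               "255.255.255.0"));   /* prints 0 */
    return 0;
}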

Now, if I launch an OMPI program with an MCA config of "btl = ^openib,udapl",
i.e. forcing OMPI to ignore the IB interface, I get a hang: under the new
reachability code, the **** interface scores highest for the connection from
**** to *****, but it is not actually routable.

*** (Sam, can you fill in the asterisk-marked bits?)
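
To illustrate the kind of failure mode we suspect (an illustrative sketch only,
we do not claim this is literally OMPI's metric): a connection-quality score
that rewards common leading address bits, without also requiring a shared subnet
and a usable scope, will rank the fe80:: link-local pairs above highest, since
any two link-local addresses share at least 64 leading bits, even though
Scope:Link addresses are never routable between hosts:

#include <stdio.h>
#include <arpa/inet.h>

/* Illustrative only -- not OMPI's implementation.  Count the leading
 * bits two IPv6 addresses have in common. */
static int common_prefix_bits(const char *a, const char *b)
{
    struct in6_addr ia, ib;
    inet_pton(AF_INET6, a, &ia);
    inet_pton(AF_INET6, b, &ib);
    int bits = 0;
    for (int i = 0; i < 16; i++) {
        unsigned char diff = ia.s6_addr[i] ^ ib.s6_addr[i];
        if (diff == 0) { bits += 8; continue; }
        for (int j = 7; j >= 0 && !(diff & (1 << j)); j--)
            bits++;
        break;
    }
    return bits;
}

int main(void)
{
    /* Link-local pair (compute bond0 <-> head eth0): 75 common bits,
     * far more than the 24 bits the best routable IPv4 pair shares --
     * yet fe80:: addresses are unroutable between hosts. */
    printf("%d\n", common_prefix_bits("fe80::200:ff:fe00:0",
                                      "fe80::215:17ff:fe35:2c9e"));
    return 0;
}

For reference, the same exclusion can also be given on the command line as
"mpirun --mca btl ^openib,udapl ...".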

We believe that this behavior is in error and should be fixed.

Thanks,
Andreas
