Hi all,
I ran into an issue with OMPI 1.3's routability calculations. On a Rocks cluster I use,
there is a head node ("hpcgeek") and seven compute nodes ("compute-0-0..6"). The compute
nodes have interfaces configured as follows:
  bond0  inet addr:192.168.180.253  Bcast:192.168.180.255  Mask:255.255.255.0
         inet6 addr: fe80::200:ff:fe00:0/64  Scope:Link
         (bonds together eth0+1)
  eth0   inet6 addr: fe80::215:17ff:fe35:337a/64  Scope:Link
  eth1   (no addresses)
  ib0    inet addr:192.168.182.253  Bcast:192.168.182.255  Mask:255.255.255.0
         inet6 addr: fe80::202:c902:22:bb71/64  Scope:Link
  lo     inet addr:127.0.0.1  Mask:255.0.0.0
The head node is configured as follows:
  eth0   inet addr:192.168.180.1  Bcast:192.168.180.255  Mask:255.255.255.0
         inet6 addr: fe80::215:17ff:fe35:2c9e/64  Scope:Link
  eth1   inet addr:128.148.64.180  Bcast:128.148.64.255  Mask:255.255.255.0
         inet6 addr: fe80::215:17ff:fe35:2c9f/64  Scope:Link
  ib0    inet addr:192.168.182.128  Bcast:192.168.182.255  Mask:255.255.255.0
         inet6 addr: fe80::205:ad00:22:babd/64  Scope:Link
  lo     inet addr:127.0.0.1  Mask:255.0.0.0
         inet6 addr: ::1/128  Scope:Host
*** (Sam, can you adjust these to the state where the hang occurs?)
Now if I launch an OMPI program with an MCA config of "btl = ^openib,udapl",
i.e., forcing OMPI to ignore the IB interface, I get a hang: under the new
reachability code, the **** interface scores the highest for the connection
from **** to *****, but it is not actually routable.
*** (Sam, can you fill in the asterisk-marked bits?)
We believe that this behavior is in error and should be fixed.
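For reference, here is a small Python sketch that checks which of the IPv4
addresses listed above share a subnet between a compute node and the head
node. It is only a plain same-subnet test using the standard ipaddress
module; it is NOT the OMPI reachability code, and the names COMPUTE, HEAD
and same_subnet_pairs are made up for this illustration.

```python
import ipaddress

# IPv4 addresses copied from the interface listings above
# (Mask:255.255.255.0 written as /24). Loopback is left out.
COMPUTE = {
    "bond0": "192.168.180.253/24",
    "ib0": "192.168.182.253/24",
}
HEAD = {
    "eth0": "192.168.180.1/24",
    "eth1": "128.148.64.180/24",
    "ib0": "192.168.182.128/24",
}


def same_subnet_pairs(local, remote):
    """Return (local_if, remote_if) pairs whose addresses share a subnet.

    Hypothetical helper for illustration only; OMPI's actual scoring
    takes more into account than this.
    """
    pairs = []
    for lname, laddr in local.items():
        lif = ipaddress.ip_interface(laddr)
        for rname, raddr in remote.items():
            rif = ipaddress.ip_interface(raddr)
            if lif.network == rif.network:
                pairs.append((lname, rname))
    return pairs


if __name__ == "__main__":
    for local_if, remote_if in same_subnet_pairs(COMPUTE, HEAD):
        print(local_if, "<->", remote_if)
```

By this simple test, only bond0/eth0 (192.168.180.0/24) and ib0/ib0
(192.168.182.0/24) are on a common subnet; the head node's eth1 shares a
subnet with nothing on the compute side.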
Thanks,
Andreas