Discussion:
Network availability during shutdown
Andrew Deason
2011-03-09 23:36:08 UTC
Hi,

Recently I've become aware that Solaris does not seem to like it when
OpenAFS tries to access the network during reboot/halt/poweroff. That
is, when the fs is unmounted during uadmin() -> kadmin() ->
vfs_unmountall() codepath, AFS kernel code cannot access the net.

As of right now I don't think there are any releases of OpenAFS that
try to hit the net on shutdown on Solaris (except prereleases), but we
will try to do that soon. The main reason is that we want to notify
fileservers that we are going away, so they don't try to contact us.

When shutting down a Solaris box running a bleeding-edge development
version of OpenAFS with, say, 'reboot', I notice that it takes quite a
long time. The reason is that we are timing out on trying to contact
the fileservers. This could be a bug in OpenAFS's network handling
code, but I haven't seen anything like that yet.

So first of all, is it intentional that the network is not available at
this point in the shutdown process? I do not even see any errors given;
I can see that we are calling sosendmsg(), and it returns with no error
code, and no uio_resid. However, I have not been able to see any packets
on the wire that we're trying to send.

Assuming that's all intended and correct, is there any way for us to be
able to run something before the net is shut down? In the OpenSolaris
codebase, I see callbacks registered with the CB_CL_UADMIN_PRE_VFS class
are fired right before this, but I assume that is not helpful, since
that's right before the vfs_unmountall() call, so the net is probably
still not available then.

Assuming there is no way to do that, is there a good way to detect if
the network is available at this level? I can work around this by
preventing network access if the sys_shutdown global is nonzero, but I
don't know if that's the best way. I also assume that that is considered
not a public interface at all, since I can find no documentation on it.

Also, keep in mind this scenario I'm talking about is when someone turns
off or reboots the machine directly via reboot/halt/poweroff, and not
'shutdown'. So, it is not possible to prevent this via SMF/initscripts,
but it's also probably okay if whatever solution/workaround is
suboptimal compared to e.g. 'umount /afs'. I'd just like to avoid the
long delays, since if someone is running 'reboot', I'd expect they want
the machine to reboot quickly, even if it means some things shutting
down uncleanly.
--
Andrew Deason
***@sinenomine.net
Andrew Deason
2011-03-14 18:08:10 UTC
On Wed, 9 Mar 2011 17:36:08 -0600, Andrew Deason wrote:
> Recently I've become aware that Solaris does not seem to like it when
> OpenAFS tries to access the network during reboot/halt/poweroff. That
> is, when the fs is unmounted during uadmin() -> kadmin() ->
> vfs_unmountall() codepath, AFS kernel code cannot access the net.

To clear this up a bit, there's no kernel code that does anything like
this (e.g. explicitly taking the net down during a shutdown sequence).
The VM I was using has an interface handled by DHCP, and it gets taken
down when dhcpagent is killed with TERM. And the 'reboot' command tries
to kill ~everything with TERM before calling uadmin(). If there exists
an interface that is statically assigned an IP, this doesn't happen.

I still find it a little odd that sosendmsg doesn't give us an error
when there are no applicable routes available, but this should be much
easier to look at now that I know I don't need the kernel in the uadmin
syscall to reproduce what's going on.
--
Andrew Deason
***@sinenomine.net
Derrick Brashear
2011-03-14 18:16:33 UTC
Andrew Deason wrote:
> On Wed, 9 Mar 2011 17:36:08 -0600, Andrew Deason wrote:
>> Recently I've become aware that Solaris does not seem to like it when
>> OpenAFS tries to access the network during reboot/halt/poweroff. That
>> is, when the fs is unmounted during uadmin() -> kadmin() ->
>> vfs_unmountall() codepath, AFS kernel code cannot access the net.
>
> To clear this up a bit, there's no kernel code that does anything like
> this (e.g. explicitly taking the net down during a shutdown sequence).
> The VM I was using has an interface handled by DHCP, and it gets taken
> down when dhcpagent is killed with TERM. And the 'reboot' command tries
> to kill ~everything with TERM before calling uadmin(). If there exists
> an interface that is statically assigned an IP, this doesn't happen.
>
> I still find it a little odd that sosendmsg doesn't give us an error
> when there are no applicable routes available, but this should be much
> easier to look at now that I know I don't need the kernel in the uadmin
> syscall to reproduce what's going on.

i think previously we looked for such an error, so rx could get the
"instant timeout" it has on linux, macos, windows... and hadn't found
one.
it would be a simple matter to address if we could get one.
--
Derrick
Andrew Deason
2011-03-14 20:42:20 UTC
On Mon, 14 Mar 2011 14:16:33 -0400, Derrick Brashear wrote:
>> I still find it a little odd that sosendmsg doesn't give us an error
>> when there are no applicable routes available, but this should be
>> much easier to look at now that I know I don't need the kernel in
>> the uadmin syscall to reproduce what's going on.
>
> i think previously we looked for such an error, so rx could get the
> "instant timeout" it has on linux, macos, windows... and hadn't found
> one.
> it would be a simple matter to address if we could get one.

connect() does issue an immediate failure, though. If we call
soconnect() and then soconnect(NULL) to disconnect again before the
sosendmsg(), we get an ENETUNREACH right away. Or, if that's too much
overhead, we could do the check only every N sends, or only under
certain conditions (e.g. during AFS shutdown).
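
As a rough illustration of the connect-then-disconnect probe described
above, here is a minimal userspace sketch using POSIX sockets. It is not
the Solaris kernel code: soconnect()/sosendmsg() are replaced by
connect()/sendto(), the disconnect is done by connecting to an AF_UNSPEC
address rather than passing NULL, and the helper names probe_route() and
checked_sendto() are hypothetical.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the stack whether it can route to `to` by
 * connecting a datagram socket to it. Returns 0 if connect() succeeded,
 * otherwise the errno it reported (e.g. ENETUNREACH).
 */
static int
probe_route(int fd, const struct sockaddr *to, socklen_t tolen)
{
    struct sockaddr unspec;
    int err = 0;

    if (connect(fd, to, tolen) < 0)
        err = errno;

    /*
     * Dissolve the association again so the socket can still exchange
     * datagrams with any peer. The result is deliberately ignored:
     * some systems report an error here even though the disconnect
     * takes effect.
     */
    memset(&unspec, 0, sizeof(unspec));
    unspec.sa_family = AF_UNSPEC;
    (void)connect(fd, &unspec, sizeof(unspec));

    return err;
}

/*
 * Hypothetical wrapper: fail fast with the probe's errno instead of
 * handing the datagram to a stack that would silently drop it.
 */
static ssize_t
checked_sendto(int fd, const void *buf, size_t len,
               const struct sockaddr *to, socklen_t tolen)
{
    int err = probe_route(fd, to, tolen);

    if (err != 0) {
        errno = err;
        return -1;
    }
    return sendto(fd, buf, len, 0, to, tolen);
}
```

A caller that cares about the overhead could invoke probe_route() only
every N sends, or only once shutdown has begun, and use plain sendto()
otherwise.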

Or we could look at the interface list or route list a la NetIfPoller()
and try to determine ourselves how available the net is during shutdown,
but ew. Probably not worth the effort.
--
Andrew Deason
***@sinenomine.net