Post by Andrew Deason
On Thu, 23 Jan 2014 11:43:50 +0000
Post by Simon Wilkinson
The real question here is how widely we should be applying the abort
threshold - should it apply to all aborts sent to a particular client,
or should we be more selective? There are a lot of competing views
here, as it depends on what you believe the abort threshold is
actually protecting us against.
Agreed. Personally, I'm not in favor of abort thresholds, but they have
prevented a large number of file servers from drowning under the load
applied by very broken clients in the wild.
Back in the 1.2.x days there were clients that were broken in some
extreme ways. For example, one client produced a new connection for
every RPC, triggering a new TMAY and tying up a fileserver thread for
a while. Another problem involved users who obtained tokens in the
morning and left a file browser open on their home directory. When the
tokens expired, the file browser would read status for every file in
the directory tree it knew about, fail, and then repeat in rapid
succession.
Take an organization with a few hundred such desktops triggering the
same behavior ten hours after the work day starts, and the file servers
would fall over every night.
Lest you say that EEXIST and ENOTEMPTY are different: we have seen
cache corruption that makes the cache manager think a file does not
exist when it does. This triggered the application to retry creating
the file over and over in a loop.
The primary purpose of the abort threshold is to protect the file server
from an abusive client whether or not the client intends to be abusive.
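To be concrete about what "abort threshold" means here, the mechanism
is roughly the following shape (a sketch only, with invented names, not
the actual viced code): the server counts recent aborts per client and
starts delaying its replies once a client exceeds the limit.

    /* Sketch only; hypothetical names, not the actual fileserver code. */
    #include <time.h>

    #define ABORT_THRESHOLD 10   /* aborts allowed per window */
    #define ABORT_WINDOW    60   /* window length in seconds  */

    struct client_abort_state {
        time_t   window_start;   /* start of the current window  */
        unsigned abort_count;    /* aborts sent in this window   */
    };

    /* Called each time the server is about to send this client an
     * abort; returns the number of seconds to delay the reply. */
    unsigned abort_delay(struct client_abort_state *st)
    {
        time_t now = time(NULL);

        if (now - st->window_start >= ABORT_WINDOW) {
            st->window_start = now;          /* start a new window */
            st->abort_count = 0;
        }
        if (++st->abort_count <= ABORT_THRESHOLD)
            return 0;                        /* under the limit */

        /* Over the limit: back off progressively, so a looping
         * client cannot tie up server threads at full speed. */
        return st->abort_count - ABORT_THRESHOLD;
    }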
Post by Andrew Deason
Well, the issue also goes away if we do either of two other things:
- Don't issue RPCs we know will fail (mkdir EEXIST, rmdir ENOTEMPTY,
maybe others). Even without the abort threshold, this causes an
unnecessary delay waiting for the fileserver to respond. This is
really noticeable with git in AFS.
I am firmly in the camp that says the cache manager should avoid
sending any RPC that it can assume, with good justification, will fail
based upon known state and file server rules. The EEXIST and ENOTEMPTY
cases certainly fall into that category. So do operations like file
creation, directory creation, unlink, etc. when it is known that the
permissions are wrong.
The same goes for writing to a file when it is known that the quota has
been exceeded.
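As a sketch of what I mean (all type and helper names below are
invented for illustration; this is not the actual cache manager code),
the cache manager can answer locally whenever cached state already
determines the outcome:

    #include <errno.h>

    /* All names here are hypothetical, for illustration only. */
    struct dir_cache;                     /* cached directory state */
    struct afs_user;                      /* requesting identity    */

    extern int cm_name_cached(struct dir_cache *, const char *);
    extern int cm_user_has_insert(struct dir_cache *, struct afs_user *);
    extern int cm_rpc_makedir(struct dir_cache *, const char *,
                              struct afs_user *);

    int cm_try_mkdir(struct dir_cache *dir, const char *name,
                     struct afs_user *user)
    {
        /* A valid cache entry already records this name, so the
         * fileserver can only answer EEXIST: fail locally and skip
         * the round trip entirely. */
        if (cm_name_cached(dir, name))
            return EEXIST;

        /* Cached ACL state shows the user lacks insert rights, so
         * the RPC could only fail with EACCES. */
        if (!cm_user_has_insert(dir, user))
            return EACCES;

        /* Only now is the RPC actually worth sending. */
        return cm_rpc_makedir(dir, name, user);
    }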
The Windows cache manager even takes things a step further by
maintaining a negative cache of EACCES errors keyed on {FID, user}.
This has avoided hitting the abort threshold limits triggered by
Windows, which assumes that if it can list a directory it must be able
to read the status of every object within it.
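A rough sketch of such a negative cache (hypothetical layout, not the
actual Windows cache manager structures):

    #include <time.h>

    /* Hypothetical entry: one remembered EACCES per {FID, user}. */
    struct afs_fid { unsigned volume, vnode, unique; };

    struct eacces_entry {
        struct afs_fid fid;
        unsigned       user_id;    /* identity the error applied to */
        time_t         expires;    /* entries age out so that a real
                                    * ACL change is noticed          */
        struct eacces_entry *next; /* simple chained bucket          */
    };

    /* Before issuing FetchStatus, consult the cache: a hit means the
     * fileserver already answered EACCES for this {FID, user}, so the
     * error can be returned locally with nothing sent on the wire. */
    int neg_cache_hit(const struct eacces_entry *bucket,
                      const struct afs_fid *fid, unsigned user_id)
    {
        time_t now = time(NULL);
        for (; bucket != NULL; bucket = bucket->next)
            if (bucket->user_id == user_id &&
                bucket->fid.volume == fid->volume &&
                bucket->fid.vnode  == fid->vnode  &&
                bucket->fid.unique == fid->unique &&
                bucket->expires > now)
                return 1;
        return 0;
    }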
Post by Andrew Deason
- Don't raise aborts for every single kind of error (like
InlineBulkStatus). Aborts aren't secure, for one thing.
InlineBulkStatus can throw aborts; it just doesn't do so when an
individual FetchStatus on an included FID fails. The problem with this
approach is that OpenAFS cannot turn off support for the existing RPCs
and will therefore always need abort thresholds to protect the file
servers, at least until it becomes possible for a file server to scale
with the load so that a pool of misbehaving clients cannot deny service
to other clients.
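The difference shows up on the client side: with InlineBulkStatus, a
per-FID failure comes back as an error code inside that entry's status
rather than as a call-level abort, so the client walks the result array
and handles each entry on its own. A sketch with simplified stand-in
types (the real wire structures live in afsint.xg):

    /* Simplified stand-in for the real status structure. */
    struct fetch_status {
        unsigned error_code;   /* nonzero: this FID's FetchStatus
                                * failed; the call itself did not */
        /* ... remaining status fields elided ... */
    };

    /* After an inline bulk call returns, each entry carries its own
     * result; one bad FID no longer aborts the whole call. */
    void process_bulk(struct fetch_status *stats, int n)
    {
        for (int i = 0; i < n; i++) {
            if (stats[i].error_code != 0) {
                /* e.g. mark this one FID's cache entry stale */
                continue;
            }
            /* merge stats[i] into the cache as usual */
        }
    }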
Jeffrey Altman