SA Bugzilla – Bug 5981
Improved error detection and reporting
Last modified: 2009-11-26 11:07:13 UTC
This trouble ticket is about improved error detection and reporting in SpamAssassin. My guideline is: a missed opportunity to early problem detection & reporting leads to hard-to-troubleshoot secondary failures, which adds unnecessary work for maintainers and to user dissatisfaction. An extra bit of effort and discipline while coding brings tenfold benefits to program maintainability. First some background - you'll notice this has been my pet peeve since a long time: > Bug 4412: > Comment #24 From Mark Martinec 2005-08-18 > > [...] The reasoning is that if something unexpected happens, > it is better to be reported as soon as possible and with as much > information as possible (the $!), instead of being possibly masked > by further attempts. > [...] > Unfortunately it didn't report the failure, and wasted the whole round > of 64k attempts to not do anything useful but return undef in $sock. > > WHICH MADE ME FIX THE ERROR REPORTING FIRST. Ignoring status, or not > reporting the full error status is always bad, at least in the long run, > when one has to deal with hordes of users, and remote troubleshooting > is very hard to do. > > Comment #26 From Justin Mason 2005-08-18 > I initially sided with Sidney, but Mark's statements all make sense to me. > [...] So let's go with the improved error reporting, and only later see > if there's a problem left that we need to worry about... > > > Bug 5507: > Reporter: Mark Martinec 2007-06-10 > > I'm attaching a small patch to improve diagnostics when > site rules directory was present during install, but was > inaccessible to a process running SA. > It took us some running around in circles (on a mailing list) > when a user reported SA failing [...] even though the directory > was present and accessible to anyone. > It turned out that one directory above [...] was too strongly > protected, and a sloppy -e() test in Perl does not distinguish > between 'does not exist' and 'is inaccessible'. [...] > Please incorporate the patch to facilitate future troubleshooting. > > Comment #4 From Mark Martinec 2007-08-22 > Perhaps, although there are dozens of other cases which test > for -f, -d, -e and similar, without calling stat() or lstat() > first, and therefore can not distinguish between (for example) > not being a directory or not existing or an access violation, > and not being able to be specific whether a test should follow > a symbolic link or not. In the least this causes a misleading > diagnostics, or at worst can cause a wrong decision. > > If you think this can be a good idea, I'd like to undertake > another global sweep across modules and replace simple > tests like -f $file with a: stat($file); ... -f _, > along the lines of the patch above [...] > > The amount of change would make this suitable for 3.3.0, > and probably not worth fixing a single case for 3.2.* . > > Comment #8 From Tom Schulz 2007-10-18 > Should a new bug be opened for comment #4 ? > > Comment #9 From Mark Martinec 2007-10-18 > I won't forget about my intent anyway - have notes scribbled in > many places on the printed listing of SA code, which I reviewed > during my summer vacation. It took me a year (after my code audit in summer of 2007) to finally invest some time and go systematically through code, adding status checks and corresponding problem reporting throughout the code. With little exceptions I added a check to each end every system call: open, close, readline, sysread, syswrite, print, printf, binmode, opendir, closedir, stat, lstat, -f, -d, -e, (and the like), mkdir, rmdir, unlink, flush, seek, tell, waitpid, flock Also improved detection and reporting of spawned process termination failures, replacing manual ad-hoc testing of a $? by new routines in module Util: proc_status_ok and exit_status_str, which I borrowed from amavisd-new (it is my own code, I assume I can re-use it here regardless of a GPL license of amavisd-new). Where there already was a status check, I tried to keep it unchanged, except for possible unification of wording in related messages. To some extent I tried to keep a style of a local neighborhood code regarding use of TAB, indentation and the like, although the existing style is somewhat inconsistent and I didn't try much to unify or keep it. How a problem is reported varies from place to place. A choice between die / warn / dbg / info / print STDERR / ignore is often rather arbitrary in existing code. Where there is a good reason for the choice I adopted a choice from neighboring code. Otherwise my main choice was to use a die. Most code, especially the plugins, are wrapped in some outer level evals so a die does not imply a program termination, just a limited loss of functionality. When there is a serious problem in outer layers, it is better for a program to fail than for example to result in a partially chopped-off mail (e.g. in case of disk quota exceeded, disk full, I/O errors). It is all in a form of a single diff (single revision), so it is easy to back it off in case of serious objections or difficulties. The diff is quite hefty, consisting of 195 patch chunks and 2900 lines of a unified diff, although most of the changes follow the same idioms. There are some corners which I left mostly untouched: - spamd.raw and spamc (as I'm not using them, except for testing), - mass checks, tests, tools, rules - sa-update.raw (I had this finished, but then I lost changes so I must do it again) - file based Bayes and AWL stores (as I'm not using them, only SQL) I'm running this code for a week now on our production mailer, and it does pass self-tests. Still, as tests and our usage profile does not cover all setup scenarios, it is possible there might be some error still lurking somewhere, or that some reaction to an error is too strong. I'd like to deal with these on a case-by-case basis, while keeping the general direction. Barring strong objections, I'd like to roll-in the patch to HEAD and see how it goes, hoping to receive prompt feedback on problems that might show up, and deal with them as soon as possible. Perhaps tomorrow earlier in the day would be a good time to check it in, covering European afternoon and US daytime local times so that any problems could be promptly resolved. (or now, if you are impatient)
Created attachment 4369 [details] proposed patch (just to give an idea on what I have in mind)
wow, that's comprehensive! +1 for application to trunk. the relicensing of those 2 GPL functions is, as you say, fine -- since you're their author, you can reuse them as you like.
> wow, that's comprehensive! +1 for application to trunk. Thanks. Here it comes, let's see how it fares... Committed revision 697056.
I really like this. We should do our best to encourage people to test with this well before 3.3 is likely to be released. There's a lot of potential for old bugs to be revealed once people start running with this code.
> There's a lot of potential for old > bugs to be revealed once people start running with this code. I see we have the first old-time bug uncovered (see Hudson report). The ArchiveIterator is happily letting _run_file process directory files, which causes read(2) to fail with EISDIR on NFS. A question: should _run_file check by itself (stat) what type of a path it got as an argument, or should _scan_directory take care not to include directory names in a list of files, or should do a stat unconditionally?
< directory names in a list of files, or should do a stat unconditionally? > directory names in a list of files, or should _scan_file do a stat unconditionally?
(In reply to comment #6) > < directory names in a list of files, or should do a stat unconditionally? > > > directory names in a list of files, or should _scan_file do a stat unconditionally? _run_file is already statting the file/dir: if (! $self->{opt_all} && -s INPUT > BIG_BYTES) { so I say skip directories there. stat()s in the _scan_file method can be expensive if there are millions of files.
> _run_file is already statting the file/dir: > if (! $self->{opt_all} && -s INPUT > BIG_BYTES) { > so I say skip directories there. stat()s in the _scan_file method can be > expensive if there are millions of files. Ok, done (and some more). This should deal with failing bayesdbm.t test 38 on Hudson; don't know yet what happened at test 39. Committed revision 697147 : ArchiveIterator: prevent _scan_directory from passing directories to _scan_file (on NFS it would fail with EISDIR on read(2); let _run_file double-check what kind of a path it is dealing with, also allow it to deal with STDIN (which might become useful some day)
Closing, as the bulk of it is now done. Still have to keep the issues in mind and not no become sloppy in future...
SATest.pm: include errno of a failing copy command in error messages; SATest.pm and 20_saw_ampersand.t: add some missing status tests and report failures; turn off debugging in sa_compile.t to reduce noise Sending t/SATest.pm Sending t/sa_compile.t Sending xt/20_saw_ampersand.t Committed revision 884667.